Results

All (13) (0 to 10 of 13 results)

  • Articles and reports: 11-633-X2021007
    Description:

    Statistics Canada continues to use a variety of data sources to provide neighbourhood-level variables across an expanding set of domains, such as sociodemographic characteristics, income, services and amenities, crime, and the environment. Yet, despite these advances, information on the social aspects of neighbourhoods is still unavailable. In this paper, answers to the Canadian Community Health Survey on respondents’ sense of belonging to their local community were pooled over the four survey years from 2016 to 2019. Individual responses were aggregated up to the census tract (CT) level.

    Release date: 2021-11-16
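
    The pooling-and-aggregation step described above can be illustrated with a small, hypothetical sketch: respondent-level records from several survey years are combined and a weighted mean of the belonging score is computed per census tract. The column names and the use of survey weights are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch: pool respondent records across survey years and aggregate a
# sense-of-belonging score to the census tract (CT) level.
# Column names (year, ct_uid, belonging, weight) are hypothetical.
import pandas as pd

# Pretend these are respondent-level extracts for each survey year.
records = pd.DataFrame({
    "year":      [2016, 2016, 2017, 2018, 2019, 2019],
    "ct_uid":    ["5350001.01", "5350001.01", "5350001.02",
                  "5350001.01", "5350001.02", "5350001.02"],
    "belonging": [4, 3, 2, 4, 3, 5],              # e.g. 1 = very weak ... 5 = very strong
    "weight":    [1.2, 0.8, 1.0, 1.5, 0.9, 1.1],  # hypothetical survey weights
})

# Weighted mean of the belonging score within each census tract,
# pooled over all survey years.
pooled = records.assign(wx=records["belonging"] * records["weight"])
ct_level = pooled.groupby("ct_uid")[["wx", "weight"]].sum()
ct_level["belonging_mean"] = ct_level["wx"] / ct_level["weight"]
print(ct_level["belonging_mean"])
```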

  • Articles and reports: 11-522-X202100100011
    Description: The ways in which AI may affect the world of official statistics are manifold, and Statistics Netherlands (CBS) is actively exploring how it can use AI within its societal role. The paper describes a number of AI-related areas where CBS is currently active: using AI for its own statistics production and statistical R&D, developing a national AI monitor, supporting other government bodies with expertise on fair data and fair algorithms, sharing data under safe and secure conditions, and engaging in AI-related collaborations.

    Key Words: Artificial Intelligence; Official Statistics; Data Sharing; Fair Algorithms; AI monitoring; Collaboration.

    Release date: 2021-11-05

  • Articles and reports: 11-522-X202100100008
    Description:

    Non-probability samples are being increasingly explored by National Statistical Offices as a complement to probability samples. We consider the scenario where the variable of interest and auxiliary variables are observed in both a probability and non-probability sample. Our objective is to use data from the non-probability sample to improve the efficiency of survey-weighted estimates obtained from the probability sample. Recently, Sakshaug, Wisniowski, Ruiz and Blom (2019) and Wisniowski, Sakshaug, Ruiz and Blom (2020) proposed a Bayesian approach to integrating data from both samples for the estimation of model parameters. In their approach, non-probability sample data are used to determine the prior distribution of model parameters, and the posterior distribution is obtained under the assumption that the probability sampling design is ignorable (or not informative). We extend this Bayesian approach to the prediction of finite population parameters under non-ignorable (or informative) sampling by conditioning on appropriate survey-weighted statistics. We illustrate the properties of our predictor through a simulation study.

    Key Words: Bayesian prediction; Gibbs sampling; Non-ignorable sampling; Statistical data integration.

    Release date: 2021-10-29
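
    As a rough, hypothetical illustration of the general idea behind this kind of data integration (not the authors' actual model or their Gibbs sampler), the sketch below builds an informative prior for a population mean from a non-probability sample and updates it with a survey-weighted mean from a probability sample using a conjugate normal-normal step.

```python
# Hedged sketch: prior from a non-probability sample, posterior from the
# survey-weighted mean of a probability sample (normal-normal conjugacy).
# This is a toy stand-in for the Bayesian integration approach, not the
# authors' Gibbs-sampling procedure.
import numpy as np

rng = np.random.default_rng(42)

# Non-probability sample -> informative prior for the population mean.
np_sample = rng.normal(loc=50.0, scale=10.0, size=2000)
prior_mean = np_sample.mean()
prior_var = np_sample.var(ddof=1) / len(np_sample)

# Probability sample with design weights -> survey-weighted likelihood.
p_sample = rng.normal(loc=52.0, scale=10.0, size=300)
weights = rng.uniform(0.5, 2.0, size=300)           # hypothetical design weights
yw = np.average(p_sample, weights=weights)          # survey-weighted mean
neff = weights.sum() ** 2 / (weights ** 2).sum()    # effective sample size
lik_var = p_sample.var(ddof=1) / neff

# Conjugate normal-normal update of the population mean.
post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
post_mean = post_var * (prior_mean / prior_var + yw / lik_var)
print(f"posterior mean = {post_mean:.2f}, posterior sd = {post_var ** 0.5:.2f}")
```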

  • Articles and reports: 11-522-X202100100024
    Description: The Economic Directorate of the U.S. Census Bureau is developing coordinated design and sample selection procedures for the Annual Integrated Economic Survey. The unified sample will replace the directorate’s existing practice of independently developing sampling frames and sampling procedures for a suite of separate annual surveys, which optimizes sample design features at the cost of increased response burden. Size attributes of business populations, e.g., revenues and employment, are highly skewed. A high percentage of companies operate in more than one industry. Therefore, many companies are sampled into multiple surveys, compounding the response burden, especially for “medium-sized” companies.

    This component of response burden is reduced by selecting a single coordinated sample but will not be completely alleviated. Response burden is a function of several factors, including (1) questionnaire length and complexity, (2) accessibility of data, (3) expected number of repeated measures, and (4) frequency of collection. The sample design can have profound effects on the third and fourth factors. To help inform decisions about the integrated sample design, we use regression trees to identify covariates from the sampling frame that are related to response burden. Using historic frame and response data from four independently sampled surveys, we test a variety of algorithms, then grow regression trees that explain relationships between expected levels of response burden (as measured by response rate) and frame covariates common to more than one survey. We validate initial findings by cross-validation, examining results over time. Finally, we make recommendations on how to incorporate our robust findings into the coordinated sample design.

    Release date: 2021-10-29
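
    A minimal sketch of the regression-tree step on synthetic data is given below; the frame covariates and the response-rate target are hypothetical stand-ins for the frame variables the authors actually use.

```python
# Hedged sketch: fit a regression tree relating a response-rate proxy for
# burden to sampling-frame covariates, with cross-validation as a check.
# Covariates and data are synthetic; the real frame variables differ.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 5000
revenue = rng.lognormal(mean=10, sigma=2, size=n)          # skewed size measure
employment = rng.lognormal(mean=4, sigma=1.5, size=n)
n_industries = rng.integers(1, 6, size=n)                  # multi-industry firms
times_sampled = rng.poisson(lam=n_industries, size=n)      # repeated measures

# Synthetic response rate: larger, more frequently sampled firms respond less.
response_rate = np.clip(
    0.9 - 0.02 * times_sampled - 0.01 * np.log(revenue) + rng.normal(0, 0.05, n),
    0, 1,
)

X = np.column_stack([revenue, employment, n_industries, times_sampled])
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=100, random_state=0)
scores = cross_val_score(tree, X, response_rate, cv=5, scoring="r2")
print("cross-validated R^2:", scores.round(3))
```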

  • Articles and reports: 11-522-X202100100005
    Description: The Permanent Census of Population and Housing is the new census strategy adopted in Italy in 2018: it is based on statistical registers combined with data collected through surveys specifically designed to improve register quality and to support census outputs. The register at the core of the Permanent Census is the Population Base Register (PBR), whose main administrative sources are the Local Population Registers. Population counts are determined by correcting the PBR data with coefficients based on coverage errors estimated from survey data, but the need for additional administrative sources emerged clearly while processing the data collected in the first round of the Permanent Census. The suspension of surveys due to the global pandemic emergency, together with a serious reduction in the census budget for the coming years, makes it more urgent to change the estimation process so that administrative data become the main source. A thematic register has been set up to exploit all the additional administrative sources: knowledge discovery from this database is essential to extract relevant patterns and to build new dimensions, called signs of life, that are useful for population estimation. The data collected in the first two waves of the Census offer a unique and valuable set for statistical learning: associations between survey results and ‘signs of life’ can be used to build a classification model to predict coverage errors in the PBR. This paper presents the results of the process used to produce ‘signs of life’ that proved significant for population estimation.

    Key Words: Administrative data; Population Census; Statistical Registers; Knowledge discovery from databases.

    Release date: 2021-10-22
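
    As a hypothetical sketch of the final step described in this abstract, a simple classifier can be trained on binary ‘signs of life’ indicators to predict survey-detected coverage errors in the PBR; the indicators, data and model choice below are illustrative assumptions, not the paper's actual specification.

```python
# Hedged sketch: use binary "signs of life" indicators (e.g. employment,
# utility, tax or school records) to classify whether a register entry is a
# coverage error. Data and variables are synthetic illustrations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
n = 10000
signs = rng.integers(0, 2, size=(n, 4))        # 4 hypothetical signs of life
# Entries with few signs of life are more likely to be coverage errors.
p_error = 1 / (1 + np.exp(2.5 * signs.sum(axis=1) - 3))
error = rng.binomial(1, p_error)

X_tr, X_te, y_tr, y_te = train_test_split(signs, error, test_size=0.3, random_state=1)
clf = LogisticRegression().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```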

  • Articles and reports: 11-522-X202100100015
    Description: National statistical agencies such as Statistics Canada have a responsibility to convey the quality of statistical information to users. The methods traditionally used to do this are based on measures of sampling error. As a result, they are not adapted to the estimates produced using administrative data, for which the main sources of error are not due to sampling. A more suitable approach to reporting the quality of estimates presented in a multidimensional table is described in this paper. Quality indicators were derived for various post-acquisition processing steps, such as linkage, geocoding and imputation, by estimation domain. A clustering algorithm was then used to combine domains with similar quality levels for a given estimate. Ratings to inform users of the relative quality of estimates across domains were assigned to the groups created. This indicator, called the composite quality indicator (CQI), was developed and experimented with in the Canadian Housing Statistics Program (CHSP), which aims to produce official statistics on the residential housing sector in Canada using multiple administrative data sources.

    Keywords: Unsupervised machine learning, quality assurance, administrative data, data integration, clustering.

    Release date: 2021-10-22
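
    A hedged sketch of the clustering step follows: estimation domains with similar values of per-step quality indicators are grouped with k-means, and the resulting clusters are ranked to yield a relative quality rating per domain. The indicator names, the number of clusters and the ranking rule are assumptions, not the CHSP's actual CQI specification.

```python
# Hedged sketch: group estimation domains with similar quality indicators
# (here, hypothetical linkage / geocoding / imputation rates) and rank the
# resulting clusters to obtain a composite quality rating per domain.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
n_domains = 200
# Each row = one estimation domain; columns = per-step quality indicators in [0, 1].
quality = rng.beta(a=8, b=2, size=(n_domains, 3))   # linkage, geocoding, imputation

km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(quality)
# Rank clusters from best to worst by their mean indicator value.
order = np.argsort(-km.cluster_centers_.mean(axis=1))
rating = {cluster: rank + 1 for rank, cluster in enumerate(order)}   # 1 = best
domain_rating = np.array([rating[c] for c in km.labels_])
print("domains per rating:", np.bincount(domain_rating)[1:])
```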

  • Articles and reports: 11-522-X202100100016
    Description: To build data capacity and address the U.S. opioid public health emergency, the National Center for Health Statistics received funding for two projects. The projects involve development of algorithms that use all available structured and unstructured data submitted for the 2016 National Hospital Care Survey (NHCS) to enhance identification of opioid involvement and the presence of co-occurring disorders (coexistence of a substance use disorder and a mental health issue). A description of the algorithm development process is provided, and lessons learned from integrating data science methods like natural language processing to produce official statistics are presented. Efforts to make the algorithms and analytic data files accessible to researchers are also discussed.

    Key Words: Opioids; Co-Occurring Disorders; Data Science; Natural Language Processing; Hospital Care

    Release date: 2021-10-22
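
    The abstract does not detail the algorithms, but the kind of rule-assisted text mining involved can be sketched with a simple term lookup over free-text notes; the term list and the notes below are invented for illustration and are far simpler than the NHCS algorithms.

```python
# Hedged sketch: flag possible opioid involvement in free-text clinical notes
# with a simple term lookup. Terms and notes are illustrative only; the NHCS
# algorithms combine structured data with much richer NLP.
import re

OPIOID_TERMS = [
    "fentanyl", "heroin", "oxycodone", "hydrocodone", "morphine", "methadone",
]
pattern = re.compile(r"\b(" + "|".join(OPIOID_TERMS) + r")\b", re.IGNORECASE)

notes = [
    "Patient admitted after suspected Fentanyl overdose; anxiety disorder noted.",
    "Fracture of left wrist, no substance use recorded.",
]
for note in notes:
    hits = pattern.findall(note)
    print(bool(hits), sorted({h.lower() for h in hits}))
```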

  • Articles and reports: 11-522-X202100100017
    Description: The outbreak of the COVID-19 pandemic required the Government of Canada to provide relevant and timely information to support decision-making around a host of issues, including personal protective equipment (PPE) procurement and deployment. Our team built a compartmental epidemiological model from an existing code base to project PPE demand under a range of epidemiological scenarios. This model was further enhanced using data science techniques, which allowed for the rapid development and dissemination of model results to inform policy decisions.

    Key Words: COVID-19; SARS-CoV-2; Epidemiological model; Data science; Personal Protective Equipment (PPE); SEIR

    Release date: 2021-10-22
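
    A compartmental SEIR model of the kind mentioned can be sketched as a small system of ordinary differential equations; the parameter values and the simple PPE-per-case conversion below are placeholders, not those used by the team.

```python
# Hedged sketch of an SEIR compartmental model; parameters and the simple
# PPE-demand conversion are illustrative placeholders.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N            # new exposures
    dE = beta * S * I / N - sigma * E # exposed -> infectious
    dI = sigma * E - gamma * I        # infectious -> recovered
    dR = gamma * I
    return [dS, dE, dI, dR]

N = 1_000_000
y0 = [N - 10, 0, 10, 0]                      # almost everyone susceptible
t = np.linspace(0, 180, 181)                 # days
beta, sigma, gamma = 0.4, 1 / 5.2, 1 / 10    # hypothetical rates

S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T
ppe_per_case_day = 15                        # hypothetical PPE units per active case per day
print("peak active cases:", int(I.max()))
print("total PPE demand:", int((I * ppe_per_case_day).sum()))
```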

  • Articles and reports: 11-522-X202100100002
    Description:

    A framework for the responsible use of machine learning processes has been developed at Statistics Canada. The framework includes guidelines for the responsible use of machine learning and a checklist, which are organized into four themes: respect for people, respect for data, sound methods, and sound application. All four themes work together to ensure the ethical use of both the algorithms and results of machine learning. The framework is anchored in a vision that seeks to create a modern workplace and provide direction and support to those who use machine learning techniques. It applies to all statistical programs and projects conducted by Statistics Canada that use machine learning algorithms. This includes supervised and unsupervised learning algorithms. The framework and associated guidelines will be presented first. The process of reviewing projects that use machine learning, i.e., how the framework is applied to Statistics Canada projects, will then be explained. Finally, future work to improve the framework will be described.

    Keywords: Responsible machine learning, explainability, ethics

    Release date: 2021-10-15

  • Articles and reports: 11-522-X202100100003
    Description:

    The increasing size and richness of digital data allow for modeling more complex relationships and interactions, which is the strong point of machine learning. Here we applied gradient boosting to the Dutch system of social statistical datasets to estimate transition probabilities into and out of poverty. Individual estimates are reasonable, but the main advantages of the approach, in combination with SHAP and global surrogate models, are the simultaneous ranking of hundreds of features by their importance, detailed insight into their relationship with the transition probabilities, and the data-driven identification of subpopulations with relatively high and low transition probabilities. In addition, we decompose the difference in feature importance between the general population and a subpopulation into a frequency effect and a feature effect. We caution against misinterpretation and discuss future directions.

    Key Words: Classification; Explainability; Gradient boosting; Life event; Risk factors; SHAP decomposition.

    Release date: 2021-10-15
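
    A minimal sketch of the modelling approach on synthetic data is shown below; the features are invented, and the SHAP step uses the open-source shap package as a stand-in for the authors' exact setup.

```python
# Hedged sketch: gradient boosting for a poverty-transition probability plus a
# SHAP-based ranking of feature importance. Features and data are synthetic;
# the Dutch register data contain hundreds of real covariates.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 5000
income_change = rng.normal(0, 1, n)
hours_worked = rng.normal(30, 10, n)
household_size = rng.integers(1, 7, n)
X = np.column_stack([income_change, hours_worked, household_size])
# Synthetic transition-into-poverty indicator.
logit = -1.0 - 1.5 * income_change - 0.05 * hours_worked + 0.3 * household_size
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = GradientBoostingClassifier(random_state=3).fit(X, y)

# Global feature ranking via mean absolute SHAP value.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)
for name, imp in zip(["income_change", "hours_worked", "household_size"], importance):
    print(f"{name}: {imp:.3f}")
```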