Results
All (15) (0 to 10 of 15 results)
- Articles and reports: 11-633-X2021007
Description:
Statistics Canada continues to use a variety of data sources to provide neighbourhood-level variables across an expanding set of domains, such as sociodemographic characteristics, income, services and amenities, crime, and the environment. Yet, despite these advances, information on the social aspects of neighbourhoods is still unavailable. In this paper, answers to the Canadian Community Health Survey on respondents’ sense of belonging to their local community were pooled over the four survey years from 2016 to 2019. Individual responses were aggregated up to the census tract (CT) level.
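The pooling and aggregation step described above can be illustrated with a minimal sketch. The record layout, tract identifiers, and 1 to 4 scoring below are hypothetical, and a plain unweighted mean is assumed only for illustration; the paper's actual estimation method (for example, any use of survey weights) is not described in this abstract.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pooled records: (survey_year, census_tract_id, belonging_score).
# Tract IDs, the 1-4 scale, and all values are invented for illustration.
responses = [
    (2016, "CT001", 3),
    (2017, "CT001", 4),
    (2019, "CT001", 2),
    (2016, "CT002", 1),
    (2018, "CT002", 2),
]


def aggregate_by_tract(records, first_year=2016, last_year=2019):
    """Pool records across survey years and average the score per census tract."""
    by_tract = defaultdict(list)
    for year, tract, score in records:
        # Keep only responses from the pooled survey years.
        if first_year <= year <= last_year:
            by_tract[tract].append(score)
    # Unweighted mean per tract (an assumption; real estimates may be weighted).
    return {tract: mean(scores) for tract, scores in by_tract.items()}


tract_means = aggregate_by_tract(responses)
print(tract_means)
```

Pooling several survey years increases the number of respondents per tract, which is what makes a tract-level summary feasible at all.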
Release date: 2021-11-16
- 2. Construction of a Northern Market Basket Measure of poverty for Yukon and the Northwest Territories
Articles and reports: 75F0002M2021007
Description:
This discussion paper describes the proposed methodology for a Northern Market Basket Measure (MBM-N) for Yukon and the Northwest Territories, and identifies research that could be conducted in preparation for the 2023 review. The paper presents initial MBM-N thresholds and provides preliminary poverty estimates for reference years 2018 and 2019. A review period will follow the release of this paper, during which time Statistics Canada and Employment and Social Development Canada will welcome feedback from interested parties and work with experts, stakeholders, Indigenous organizations, and federal, provincial and territorial officials to validate the results.
Release date: 2021-11-12
- Articles and reports: 11-522-X202100100010
Description:
As part of processing for the 2021 Canadian Census, the write-in responses to 31 census questions must be coded. Up until, and including, 2016, this was a three-stage process, including an “interactive (human) coding” step as the second stage. This human coding step is both lengthy and expensive, spanning many months and requiring the hiring and training of a large number of temporary employees. With this in mind, for 2021, this stage was either augmented with or replaced entirely by machine learning models using the "fastText" algorithm. This presentation will discuss the implementation of this algorithm, the challenges encountered, and the decisions taken along the way.
Key Words: Natural Language Processing, Machine Learning, fastText, Coding
Release date: 2021-11-05
- 4. Statistics Netherlands and AI (Archived)
Articles and reports: 11-522-X202100100011
Description:
The ways in which AI may affect the world of official statistics are manifold and Statistics Netherlands (CBS) is actively exploring how it can use AI within its societal role. The paper describes a number of AI-related areas where CBS is currently active: use of AI for its own statistics production and statistical R&D, the development of a national AI monitor, the support of other government bodies with expertise on fair data and fair algorithms, data sharing under safe and secure conditions, and engaging in AI-related collaborations.
Key Words: Artificial Intelligence; Official Statistics; Data Sharing; Fair Algorithms; AI monitoring; Collaboration.
Release date: 2021-11-05
- Articles and reports: 11-522-X202100100012
Description:
The modernization of price statistics by National Statistical Offices (NSOs) such as Statistics Canada focuses on the adoption of alternative data sources that include the near-universe of all products sold in the country, a scale that requires machine learning classification of the data. The process of evaluating classifiers to select appropriate ones for production, as well as monitoring classifiers once in production, needs to be based on robust metrics to measure misclassification. As commonly used metrics, such as the Fβ-score, may not take into account key aspects applicable to price statistics in all cases, such as unequal importance of categories, a careful consideration of the metric space is necessary to select appropriate methods to evaluate classifiers. This working paper provides insight into the metric space applicable to price statistics and proposes an operational framework to evaluate and monitor classifiers, focusing specifically on the needs of the Canadian Consumer Price Index and demonstrating the discussed metrics using a publicly available dataset.
Key Words: Consumer price index; supervised classification; evaluation metrics; taxonomy
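To make the point about unequal category importance concrete, here is a minimal sketch of the standard Fβ-score, F_β = (1 + β²)·P·R / (β²·P + R), together with an importance-weighted average across categories. The category names, precision and recall values, and weights are hypothetical, and the weighted average is only one simple way to reflect unequal importance; it is not presented as the framework proposed in the paper.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # Convention: score is 0 when both components are 0.
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def weighted_f_beta(per_class, weights, beta=1.0):
    """Average per-category F-beta scores using importance weights summing to 1."""
    return sum(weights[c] * f_beta(p, r, beta) for c, (p, r) in per_class.items())


# Hypothetical per-category (precision, recall) and importance weights,
# e.g. weights loosely standing in for expenditure shares.
per_class = {"food": (0.9, 0.8), "clothing": (0.5, 0.5)}
weights = {"food": 0.75, "clothing": 0.25}

# Macro average treats every category as equally important.
macro = sum(f_beta(p, r) for p, r in per_class.values()) / len(per_class)
# Weighted average lets important categories dominate the summary.
weighted = weighted_f_beta(per_class, weights)
print(round(macro, 4), round(weighted, 4))
```

With these invented numbers the weighted score exceeds the macro score because the better-classified category carries more weight, which shows how the choice of averaging can change which classifier looks best.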
Release date: 2021-11-05
- Articles and reports: 11-522-X202100100013
Description:
Statistics Canada’s Labour Force Survey (LFS) plays a fundamental role in the mandate of Statistics Canada. The labour market information provided by the LFS is among the most timely and important measures of the Canadian economy’s overall performance. An integral part of the LFS monthly data processing is the coding of each respondent’s industry according to the North American Industry Classification System (NAICS), occupation according to the National Occupational Classification (NOC) and the Primary Class of Workers (PCOW). Each month, up to 20,000 records are coded manually. In 2020, Statistics Canada worked on developing machine learning models using fastText to code responses to the LFS questionnaire according to these three classifications. This article provides an overview of the methodology developed and the results obtained from a potential application of fastText to the LFS coding process.
Key Words: Machine Learning; Labour Force Survey; Text classification; fastText.
Release date: 2021-11-05
- Articles and reports: 11-522-X202100100028
Description:
Many Government of Canada groups are developing code to process and visualize various kinds of data, often duplicating each other’s efforts, with sub-optimal efficiency and limited code quality review. This paper informally presents a working-level approach to addressing this technical problem. The idea is to collaboratively build a common repository of code and a knowledgebase for use by anyone in the public sector to perform many common data science tasks and, in doing so, to help each other master both data science coding skills and industry-standard collaborative practices. The paper explains why R is used as the language of choice for collaborative data science code development. It summarizes R's advantages and addresses its limitations, establishes a taxonomy of the discussion topics of highest interest to GC data scientists working with R, provides an overview of the collaborative platforms used, and presents the results obtained to date. Even though the code knowledgebase is developed mainly in R, it is also meant to be valuable for data scientists coding in Python and other development environments.
Key Words: Collaboration; Data science; Data Engineering; R; Open Government; Open Data; Open Science
Release date: 2021-10-29
- Articles and reports: 11-522-X202100100001
Description:
We consider regression analysis in the context of data integration. To combine partial information from external sources, we employ the idea of model calibration, which introduces a “working” reduced model based on the observed covariates. The working reduced model is not necessarily correctly specified but can be a useful device to incorporate the partial information from the external data. The actual implementation is based on a novel application of the empirical likelihood method. The proposed method is particularly attractive for combining information from several sources with different missing patterns. The proposed method is applied to a real data example combining survey data from the Korean National Health and Nutrition Examination Survey and big data from the National Health Insurance Sharing Service in Korea.
Key Words: Big data; Empirical likelihood; Measurement error models; Missing covariates.
Release date: 2021-10-15
- Articles and reports: 11-522-X202100100002
Description:
A framework for the responsible use of machine learning processes has been developed at Statistics Canada. The framework includes guidelines for the responsible use of machine learning and a checklist, which are organized into four themes: respect for people, respect for data, sound methods, and sound application. All four themes work together to ensure the ethical use of both the algorithms and results of machine learning. The framework is anchored in a vision that seeks to create a modern workplace and provide direction and support to those who use machine learning techniques. It applies to all statistical programs and projects conducted by Statistics Canada that use machine learning algorithms. This includes supervised and unsupervised learning algorithms. The framework and associated guidelines will be presented first. The process of reviewing projects that use machine learning, i.e., how the framework is applied to Statistics Canada projects, will then be explained. Finally, future work to improve the framework will be described.
Keywords: Responsible machine learning, explainability, ethics
Release date: 2021-10-15
- Articles and reports: 11-522-X202100100003
Description:
The increasing size and richness of digital data allow for modeling more complex relationships and interactions, which is the strong point of machine learning. Here we applied gradient boosting to the Dutch system of social statistical datasets to estimate transition probabilities into and out of poverty. Individual estimates are reasonable, but the main advantages of the approach in combination with SHAP and global surrogate models are the simultaneous ranking of hundreds of features by their importance, detailed insight into their relationship with the transition probabilities, and the data-driven identification of subpopulations with relatively high and low transition probabilities. In addition, we decompose the difference in feature importance between the general population and a subpopulation into a frequency effect and a feature effect. We caution against misinterpretation and discuss future directions.
Key Words: Classification; Explainability; Gradient boosting; Life event; Risk factors; SHAP decomposition.
Release date: 2021-10-15
Reference (1) (1 result)
- Surveys and statistical programs – Documentation: 11-633-X2021005
Description:
The Analytical Studies and Modelling Branch (ASMB) is the research arm of Statistics Canada mandated to provide high-quality, relevant and timely information on economic, health and social issues that are important to Canadians. The branch strategically makes use of expert knowledge and a broad range of data sources and modelling techniques to address the information needs of a broad range of government, academic and public sector partners and stakeholders through analysis and research, modeling and predictive analytics, and data development. The branch strives to deliver relevant, high-quality, timely, comprehensive, horizontal and integrated research and to enable the use of its research through capacity building and strategic dissemination to meet the user needs of policy makers, academics and the general public.
This Multi-year Consolidated Plan for Research, Modelling and Data Development outlines the priorities for the branch over the next two years.
Release date: 2021-08-12