Disclosure control and data dissemination
Results
All (87) (0 to 10 of 87 results)
- Articles and reports: 11-522-X202500100001. Description: Synthetic data generation (SDG) is increasingly applied across sectors for privacy-preserving data sharing, de-biasing and augmentation. Each use case requires a distinct set of evaluation metrics that must account for the stochasticity of the SDG process: membership and attribute disclosure vulnerability are critical for privacy; fidelity and downstream task utility apply more broadly; and fairness and diversity are relevant for de-biasing and augmentation, respectively. Drawing on accumulated evidence and exemplar case studies, the paper shows that SDG can perform well across many of these use cases and shares key lessons from experience with synthetic health data. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100005. Description: The Physical Flow Account for Plastic Material (PFAPM) aims to enhance environmental-economic analysis by tracking plastic material flows within the Canadian economy. To help streamline this complex process, the project leveraged advanced natural language processing (NLP) techniques, such as large language models (LLMs), to automate sector classification and summarize the impact of COVID-19 from company reports. Integrating machine learning models and retrieval-augmented generation (RAG) methods significantly reduced the manual workload, improved data analysis efficiency and led to higher-quality insights. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100015. Description: Currently, Statistics Canada has no official guidance on confidentiality rules for releasing small area estimates. In recent years, there has been increasing demand from Research Data Centre (RDC) researchers for comprehensive confidentiality guidelines so that they can publish small area estimates in their research. This confidentiality analysis applies to area-level small area estimation. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100016. Description: The adoption of synthetic data generation as a confidentiality measure is increasing in statistical agencies worldwide, including at Statistics Canada. This approach provides an alternative to the traditional dissemination of anonymized public microdata files, offering both privacy protection and data utility. However, the creation of synthetic data presents challenges in assessing and mitigating disclosure risks. This paper reviews the different types of disclosure risk, namely attribute, membership and identity disclosure, and presents some of the associated methods for measuring risk. The paper presents prominent risk assessment metrics and discusses practical methods for disclosure control in data synthesis. Methods for assessing disclosure risks usually produce a metric that can be used to gauge the risk, but there is little consensus on threshold values for these metrics. It is also important to focus on the importance of balancing utility and confidentiality, which needs further discussion in the context of these methods. The paper concludes by offering insights and recommendations about managing disclosure risk while creating synthetic data, as well as some ideas on future directions for research and practical implications for managing disclosure risks in synthetic data. Release date: 2025-09-08
- 5. Exploration of Deep Learning Synthetic Data Generation for Sensitive Utility Data Sharing (Archived). Articles and reports: 11-522-X202500100017. Description: Utilities hold crucial information about energy usage and building characteristics which can be utilized by government agencies to improve their corresponding analytics. However, this data is associated with private customer records and thus the building data and energy usage may be too sensitive to share. Often, high-level aggregated versions of this data are shared through robust contracts, limiting the statistics that can be derived. With the advancement of generative machine learning techniques, Statistics Canada and Natural Resources Canada have explored the feasibility of using these models to produce synthetic versions of utility data which may be shared in full with requesting organizations. These synthetic datasets can be created by a utility company through a locally run program and the outputs can be approved before being sent. This work has identified that certain generative models can feasibly be used by utilities to generate new versions of a dataset and has identified the issues which must be addressed prior to implementing this in practice. Both tabular and time-series models have been tested for different data sharing scenarios, where the TimeGAN model successfully captured the general energy peaks and valleys over a given day with reasonable computational requirements. Although this process takes days for annual energy amounts over thousands of customer records, this can enable new data sharing initiatives between utilities and National Statistical Offices while managing privacy risks. As work progresses in future phases with real utility partners, trust can be built for these approaches, and they can begin being tested on real data by actual data holders. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100026. Description: In 2022, Canada became the first country to release statistical information about its transgender and non-binary populations based on census data. Moreover, following a 2018 government-wide policy direction, Statistics Canada's surveys have been collecting and disseminating information about gender by default rather than sex at birth. Due to the small size of the transgender and non-binary populations, disseminating safe statistical information about them at detailed geographical levels poses a challenge. Release date: 2025-09-08
- Articles and reports: 12-001-X202400200008. Description: When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey. Release date: 2024-12-20
- Articles and reports: 11-633-X2024002. Description: Data ethics is a branch of ethics that raises questions about the appropriate use of data across its life cycle and identifies permissible practices and actions. This discipline is operationalized by Statistics Canada’s Data Ethics Secretariat (DES) through ethical reviews. The ethical review process is a direct consequence of the adoption of the Necessity and Proportionality Framework. The aim of this paper is to describe the foundations and the purpose of such reviews. This can help Canadians understand the work of the DES and how Statistics Canada justifies its data acquisitions. Release date: 2024-12-20
- Surveys and statistical programs – Documentation: 11-633-X2024004. Description: The Longitudinal Immigration Database (IMDB) is a comprehensive source of data that plays a key role in the understanding of the economic behaviour of immigrants. It is the only annual Canadian dataset that allows users to study the characteristics of immigrants to Canada at the time of admission and their economic outcomes and regional (inter-provincial) mobility over a time span of more than 40 years. Release date: 2024-12-09
- Articles and reports: 11-522-X202200100007. Description: With the availability of larger and more diverse data sources, Statistical Institutes in Europe are inclined to publish statistics on smaller groups than they used to. Moreover, high-impact global events like the Covid crisis and the situation in Ukraine may also call for statistics on specific subgroups of the population. Publishing on small, targeted groups not only raises questions about the statistical quality of the figures, it also raises issues concerning statistical disclosure risk. The principle of statistical disclosure control does not depend on the size of the groups the statistics are based on. However, the risk of disclosure does depend on the group size: the smaller a group, the higher the risk. Traditional ways to deal with statistical disclosure control and small group sizes include suppressing information and coarsening categories. These methods essentially increase the (mean) group sizes. More recent approaches include perturbative methods that are intended to keep the group sizes small in order to preserve as much information as possible while reducing the disclosure risk sufficiently. In this paper we will mention some European examples of special focus group statistics and discuss the implications for statistical disclosure control. Additionally, we will discuss some issues that the use of perturbative methods brings along: its impact on disclosure risk and utility as well as the challenges in proper communication thereof. Release date: 2024-03-25
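The two traditional approaches named in the last entry above (11-522-X202200100007), suppressing small cells and coarsening categories, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the age-group records are invented, the threshold of 5 is hypothetical, and the helper names `suppress_small_cells` and `coarsen` do not come from the paper or from any Statistics Canada tool.

```python
from collections import Counter

# Hypothetical minimum publishable cell size (an assumption, not an official rule).
THRESHOLD = 5


def suppress_small_cells(counts, threshold):
    """Return the counts with every cell below the threshold replaced by None."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def coarsen(records, mapping):
    """Recode detailed categories into broader ones, then recount."""
    return Counter(mapping.get(value, value) for value in records)


# Invented example data: one age group per respondent.
records = ["15-19"] * 2 + ["20-24"] * 3 + ["25-29"] * 12 + ["30-34"] * 9

detailed = Counter(records)

# Option 1: suppress the small cells and publish the rest unchanged.
suppressed = suppress_small_cells(detailed, THRESHOLD)

# Option 2: merge neighbouring categories so that every group is larger.
mapping = {"15-19": "15-24", "20-24": "15-24", "25-29": "25-34", "30-34": "25-34"}
coarse = coarsen(records, mapping)

print(suppressed)
print(dict(coarse))
```

In this toy data, suppression withholds the two smallest age groups, while coarsening publishes every group at the cost of detail; both raise the minimum published group size, which is the effect the abstract describes.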
Data (1 result)
- 1. Historical Statistics of Canada (Archived). Table: 11-516-X. Description:
The second edition of Historical statistics of Canada was jointly produced by the Social Science Federation of Canada and Statistics Canada in 1983. This volume contains about 1,088 statistical tables on the social, economic and institutional conditions of Canada from the start of Confederation in 1867 to the mid-1970s. The tables are arranged in sections with an introduction explaining the content of each section, the principal sources of data for each table, and general explanatory notes regarding the statistics. In most cases, there is sufficient description of the individual series to enable the reader to use them without consulting the numerous basic sources referenced in the publication.
The electronic version of this historical publication is accessible on the Internet site of Statistics Canada as a free downloadable document: text as HTML pages and all tables as individual spreadsheets in a comma delimited format (CSV) (which allows online viewing or downloading).
Release date: 1999-07-29
Analysis (75) (0 to 10 of 75 results)
- 10. The continued impacts of the COVID-19 pandemic: Variations in the economic integration of new immigrants (Archived). Stats in brief: 11-001-X202402237898. Description: Release published in The Daily – Statistics Canada’s official release bulletin. Release date: 2024-01-22
Reference (12) (0 to 10 of 12 results)
- Surveys and statistical programs – Documentation: 11-633-X2024001. Description: The Longitudinal Immigration Database (IMDB) is a comprehensive source of data that plays a key role in the understanding of the economic behaviour of immigrants. It is the only annual Canadian dataset that allows users to study the characteristics of immigrants to Canada at the time of admission and their economic outcomes and regional (inter-provincial) mobility over a time span of more than 35 years. Release date: 2024-01-22
- Surveys and statistical programs – Documentation: 32-26-0006. Description: This report provides data quality information pertaining to the Agriculture–Population Linkage, such as sources of error, matching process, response rates, imputation rates, sampling, weighting, disclosure control methods and data quality indicators. Release date: 2023-08-25
- Surveys and statistical programs – Documentation: 11-633-X2022009. Description: The Longitudinal Immigration Database (IMDB) is a comprehensive source of data that plays a key role in the understanding of the economic behaviour of immigrants. It is the only annual Canadian dataset that allows users to study the characteristics of immigrants to Canada at the time of admission and their economic outcomes and regional (inter-provincial) mobility over a time span of more than 35 years.
This report will discuss the IMDB data sources, concepts and variables, record linkage, data processing, dissemination, data evaluation and quality indicators, comparability with other immigration datasets, and the analyses possible with the IMDB.
Release date: 2022-12-05
- Geographic files and documentation: 12-572-X. Description:
The Standard Geographical Classification (SGC) provides a systematic classification structure that categorizes all of the geographic area of Canada. The SGC is the official classification used in the Census of Population and other Statistics Canada surveys.
The classification is organized in two volumes: Volume I, The Classification and Volume II, Reference Maps.
Volume II contains reference maps showing boundaries, names, codes and locations of the geographic areas in the classification. The reference maps show census subdivisions, census divisions, census metropolitan areas, census agglomerations, census metropolitan influenced zones and economic regions. Definitions for these terms are found in Volume I, The Classification. Volume I describes the classification and related standard geographic areas and place names.
The maps in Volume II can be downloaded in PDF format from our website.
Release date: 2022-02-09
- Surveys and statistical programs – Documentation: 11-522-X201300014285. Description:
The 2011 National Household Survey (NHS) is a voluntary survey that replaced the traditional mandatory long-form questionnaire of the Canadian census of population. The NHS sampled about 30% of Canadian households and achieved a design-weighted response rate of 77%. In comparison, the last census long form was sent to 20% of households and achieved a response rate of 94%. Based on the long-form data, Statistics Canada traditionally produces two public use microdata files (PUMFs): the individual PUMF and the hierarchical PUMF. Both give information on individuals, but the hierarchical PUMF provides extra information on the household and family relationships between the individuals. To produce two PUMFs, based on the NHS data, that cover the whole country evenly and that do not overlap, we applied a special sub-sampling strategy. Difficulties in the confidentiality analyses have increased because of the numerous new variables, the more detailed geographic information and the voluntary nature of the NHS. This paper describes the 2011 PUMF methodology and how it balances the requirements for more information and for low risk of disclosure.
Release date: 2014-10-31
- 7. 2006 Census Dissemination Consultation Report (Archived). Notices and consultations: 92-132-X. Description:
This report describes the comments received as a result of the second round of the 2006 Census consultations. As with the previous 2006 Census consultation, this second round of consultations integrated discussions on the dissemination program, questionnaire content and census geography. However, the focus of this second round of consultations was placed on the 2001 Census of Population dissemination program and proposed directions for 2006 geography. Consultations were held from January to June 2004. Approximately 1,000 comments were captured through written submissions and the organization of over 40 meetings across Canada.
This report describes users' feedback on dissemination and geography issues received through this second round of consultations. In addition to users' comments, web metrics information serves as a valuable tool when evaluating the accessibility of public good data tables. Therefore, page view counts have been integrated in this report.
Some general planning assumptions that focus on the production and dissemination of 2006 Census products are also included in this report.
Release date: 2005-05-31
- 8. Consultation Guide - 2001 Census of Population Dissemination and Proposed Directions for 2006 Geography (Archived). Notices and consultations: 92-131-G. Description:
This guide has been developed to help users convey their ideas and suggestions to Statistics Canada regarding the 2001 Census products and services line. It contains a series of questions about specific dissemination issues and topics related to the 2001 Census dissemination strategy. The document covers many aspects of census dissemination. Readers are welcome to focus on sections of particular interest to them. In addition, users are welcome to provide comments on any other census-related issues during this consultation process.
Release date: 2004-04-08
- Surveys and statistical programs – Documentation: 75F0002M2003002. Description:
This series provides detailed documentation on income developments, including survey design issues, data quality evaluation and exploratory research for the Survey of Labour and Income Dynamics in 2000.
Release date: 2003-06-11
- Surveys and statistical programs – Documentation: 75F0002M199303A. Description:
This paper is intended as an initial proposal for a strategy for the Survey of Labour and Income Dynamics (SLID) longitudinal microdata files.
Release date: 1995-12-30