Statistical techniques
Filter results by
Survey or statistical program
- Labour Force Survey (7)
- Census of Population (5)
- Canadian Community Health Survey - Annual Component (4)
- Canadian Income Survey (4)
- Survey of Household Spending (3)
- Gross Domestic Product by Industry - National (Monthly) (2)
- Monthly Oil and Other Liquid Petroleum Products Pipeline Survey (2)
- Vital Statistics - Death Database (2)
- Uniform Crime Reporting Survey (2)
- Households and the Environment Survey (2)
- Annual Income Estimates for Census Families and Individuals (T1 Family File) (2)
- Biennial Drinking Water Plants Survey (2)
- Waste Management Industry Survey: Government Sector (1)
- National Balance Sheet Accounts (1)
- National Gross Domestic Product by Income and by Expenditure Accounts (1)
- Biennial Waste Management Survey (1)
- Monthly Electricity Supply and Disposition Survey (1)
- Annual Electricity Supply and Disposition Survey (1)
- Consumer Price Index (1)
- Monthly New Motor Vehicle Sales Survey (1)
- Survey of Employment, Payrolls and Hours (1)
- Survey of Financial Security (1)
- Monthly Passenger Bus and Urban Transit Survey (1)
- Stock and Consumption of Fixed Non-residential Capital (1)
- Tuition and Living Accommodation Costs (1)
- Canadian Cancer Registry (1)
- Vital Statistics - Birth Database (1)
- Census of Agriculture (1)
- Annual Demographic Estimates: Canada, Provinces and Territories (1)
- Longitudinal Administrative Databank (1)
- Annual Survey of Research and Development in Canadian Industry (1)
- Research and Development of Canadian Private Non-Profit Organizations (1)
- Youth in Transition Survey (1)
- Time Use Survey (1)
- General Social Survey - Social Identity (1)
- Canadian Health Measures Survey (1)
- Canadian System of Environmental-Economic Accounts - Physical Flow Accounts (1)
- Government Finance Statistics (1)
- Gross Domestic Expenditures on Research and Development (1)
- Canadian National Health Survey (1)
- Survey of Safety in Public and Private Spaces (1)
- Study on International Money Transfers (1)
- Canadian Housing Survey (1)
- Survey on Early Learning and Child Care Arrangements (SELCCA) (1)
- Canadian Perspectives Survey Series (CPSS) (1)
- Labour Market Indicators (1)
- Bank of Canada (1)
- Longitudinal Employment Analysis Program (1)
Results
All (207) (0 to 10 of 207 results)
- Journals and periodicals: 11-633-X. Description: Papers in this series provide background discussions of the methods used to develop data for economic, health, and social analytical studies at Statistics Canada. They are intended to provide readers with information on the statistical methods, standards and definitions used to develop databases for research purposes. All papers in this series have undergone peer and institutional review to ensure that they conform to Statistics Canada's mandate and adhere to generally accepted standards of good professional practice. Release date: 2026-04-24
- Articles and reports: 12-001-X202500200001. Description: Nested error regression models are commonly used to incorporate unit-specific auxiliary variables to improve small area estimates. When the mean structure of the model is misspecified, the design-based mean squared prediction error (MSPE) of Empirical Best Linear Unbiased Predictors (EBLUP) generally increases. The Observed Best Prediction (OBP) method has been proposed with the intent to improve on the design-based MSPE over EBLUP. In this paper, we conduct Monte Carlo simulation experiments to understand the effect of misspecification of mean structures on different small area estimators. Our findings suggest that the OBP using unit-level auxiliary variables does not outperform the EBLUP in terms of design-based MSPE, unless the number of small areas m is extremely large. Conversely, the performance of OBP significantly improves when area-level auxiliary variables are employed. This paper includes both analytical and numerical evidence to demonstrate these observations, providing practical insights for addressing model misspecification in small area estimation (SAE). Release date: 2025-12-23
- Articles and reports: 12-001-X202500200007. Description: Although probability samples have been regarded as the gold standard for collecting information for population-based studies, non-probability samples are used frequently in practice due to low cost, convenience, and the lack of a sampling frame for the survey. Naïve estimates based on non-probability samples without any adjustments may be misleading due to selection bias. Recently, a valid data integration approach that includes mass imputation, propensity score weighting, and calibration has been used to improve the representativeness of non-probability samples. The effectiveness of the mass imputation approach depends on the underlying model assumptions. In this paper, we propose using deep learning for mass imputation when combining probability and non-probability samples, and compare it with several modern machine learning-based mass imputation approaches, including generalized additive modeling, regression trees, random forests, and XG-boosting. In the simulation study, deep learning-based approaches are shown to be more robust and effective than other mass imputation approaches against the failure of underlying model assumptions under non-linearity scenarios. Release date: 2025-12-23
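As a rough illustration of the mass-imputation idea in the abstract above, the sketch below fits an outcome model on a non-probability sample where Y is observed and then imputes Y for every unit of a probability sample. A simple least-squares line stands in for the paper's deep-learning imputer, and all data, coefficients and design weights are simulated for the example.

```python
import random

random.seed(1)

# Non-probability sample: both X and Y observed; Y = 2X + 1 + noise.
np_sample = [(x, 2 * x + 1 + random.gauss(0, 0.1))
             for x in (random.uniform(0, 1) for _ in range(500))]

# Fit the outcome model y = a + b*x by least squares (closed form).
n = len(np_sample)
mx = sum(x for x, _ in np_sample) / n
my = sum(y for _, y in np_sample) / n
b = (sum((x - mx) * (y - my) for x, y in np_sample)
     / sum((x - mx) ** 2 for x, _ in np_sample))
a = my - b * mx

# Probability sample: only X and design weights observed.
# Mass imputation: create an imputed Y for every unit, then use
# the design weights as usual.
prob_sample = [(random.uniform(0, 1), 10.0) for _ in range(200)]  # (x, weight)
y_hat = (sum(w * (a + b * x) for x, w in prob_sample)
         / sum(w for _, w in prob_sample))  # estimated population mean of Y
```

With X uniform on (0, 1), the true population mean of Y is about 2, and `y_hat` lands close to it.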
- Articles and reports: 12-001-X202500200008. Description: Classical design-based survey estimation relies on a properly specified sampling design for valid inference. We consider the properties of regression estimation under a misspecified sample design, in which the nominal and true inclusion probabilities do not necessarily match. This general misspecified sample design setting encompasses many challenges in the modern survey environment. Under this setting, an asymptotic analysis of the regression estimator, an expression of the bias, and an expression of the variance are presented. Further, a consistent variance estimator is derived, and an expression that estimates the bias in part or in whole is discussed. This latter expression may be used by a practitioner as an indicator of the presence of bias due to misspecification. A simulation study is conducted to support the presented theory. Release date: 2025-12-23
- Articles and reports: 18-001-X2025001. Description: This paper brings the analysis of business clusters to a more granular geographic scale by developing a methodology for identifying business clusters at the neighbourhood level. The proposed method identifies clusters of businesses at the dissemination block (DB) level, which is one of the most granular spatial units of analysis defined by Statistics Canada. The method is developed with an application to four census metropolitan areas (CMAs) of different sizes and for different industry cluster specifications, including simple 2-digit North American Industry Classification System (NAICS) groups as well as industry clusters resulting from groupings of NAICS codes, as defined by Delgado et al. (2014). Release date: 2025-10-10
- Articles and reports: 11-522-X202500100019. Description: Accurate and efficient record linkage is crucial for maintaining a comprehensive and current Statistical Business Register (SBR) at Statistics Canada. Linking external business lists to the SBR by name presents computational and methodological challenges, especially as data volumes grow. This paper describes a scalable methodology that employs blocking techniques to constrain the computational search space and integrates multiple similarity measures, from edit distances and n-gram overlaps to embedding-based methods using Sentence-BERT (SBERT), to identify likely matches. By combining simple character-level comparisons with more advanced semantic embedding methods, the approach can adapt to various naming conventions and complexities. While it does not guarantee superior accuracy in all circumstances, it offers a pragmatic balance between computational feasibility and linkage quality. Release date: 2025-09-08
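The blocking-plus-similarity pipeline summarized above can be sketched roughly as follows. The blocking key (first name token) and character-bigram Jaccard score are illustrative stand-ins for the paper's mix of edit distances, n-gram overlaps and SBERT embeddings, and the business names are invented.

```python
from collections import defaultdict

def bigrams(name):
    # Character 2-grams of a lowercased name.
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def block_key(name):
    # Blocking: only names sharing a first token are ever compared,
    # shrinking the quadratic comparison space.
    return name.lower().split()[0]

def link(external, register, threshold=0.5):
    blocks = defaultdict(list)
    for r in register:
        blocks[block_key(r)].append(r)
    links = []
    for e in external:
        for r in blocks.get(block_key(e), []):
            score = jaccard(e, r)
            if score >= threshold:
                links.append((e, r, round(score, 2)))
    return links

pairs = link(["Acme Widgets Ltd", "Northern Bakery"],
             ["Acme Widget Limited", "Acme Plumbing", "Northern Bakery Inc"])
```

Here `pairs` keeps the two plausible matches and drops "Acme Plumbing", whose bigram overlap with "Acme Widgets Ltd" falls well below the threshold.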
- Articles and reports: 11-522-X202500100020. Description: At Statistics Canada, many data sets are linked with quasi-identifiers such as the first name, last name, or address. In such cases, linkage errors are a potential concern and must be measured. In that regard, previous studies have shown that the evaluation may be based on modeling the number of links from a given record while accounting for all the interactions among the linkage variables and dispensing with clerical reviews, so long as the decision to link two records does not involve other records. In this communication, the methodology is adapted for a class of practical strategies, which violate this constraint by linking the records in consecutive waves, where a given wave links a subset of the records that are not linked in previous waves. In particular, the linkage may be based on a deterministic wave followed by a probabilistic one. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100021. Description: Optimal threshold selection is a critical challenge in probabilistic linkage, with significant implications for the accuracy and reliability of linked datasets. This paper analyzes the performance of the neighbour model, a recently proposed error model that models linkage errors by the number of links from each record. Three threshold selection algorithms utilizing the neighbour model were assessed, highlighting the strengths and limitations of each. Their performance was assessed through simulation studies, which demonstrated that methods using the neighbour model achieved lower relative bias compared to two established methods for threshold selection. Additionally, the practical utility was validated through goodness-of-fit tests conducted on four agricultural datasets, showing the potential of the model for use in real-world applications. Release date: 2025-09-08
- Articles and reports: 11-522-X202500100022. Description: In Canada, T1 Tax forms are used to report personal income, whether earned as an employee or through self-employment. Income from self-employment, or "T1 Business Income", is reported by sole proprietorships or partnerships. A T1 partnership involves two or more legal entities jointly filing for a shared business. T1 business data is received as individual filings, meaning partnerships are received separately for each partner. Internal record linkage within the T1 business database is performed to identify partnerships and prevent overcoverage within the final population of T1 businesses. This new T1 partnership identification process takes advantage of newer algorithms, such as DBSCAN numerical clustering and fuzzy matching, to identify internal linkages. Graph theory is used to construct the list of partnerships from the row-pairs identified in the linkage process. Release date: 2025-09-08
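The final graph step described above, constructing partnerships from linked row-pairs, amounts to extracting connected components: each row-pair is an edge, and each component is one partnership. A minimal union-find sketch, with invented filer IDs, might look like this:

```python
def find(parent, x):
    # Follow parent pointers to the root, halving paths as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def partnerships(pairs):
    # Union-find over the row-pairs; each connected component of the
    # resulting graph is one partnership.
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

# Linked row-pairs: A-B and B-C share a business, so {A, B, C} is one
# partnership; {D, E} is another.
groups = partnerships([("A", "B"), ("B", "C"), ("D", "E")])
```

This yields two partnerships, merging the transitively linked filings even though A and C were never directly paired.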
- Articles and reports: 11-522-X202500100023. Description: The latest Canadian Census Health and Environment Cohort (CanCHEC) continues a series of population-based microdata linkages focused on population health research by demographic, social and economic characteristics. The 2021 CanCHEC consists of 95.5% of the 2021 Census long-form sample survey records. The records of survey respondents that could not be linked to the Derived Record Depository and those presumed to be duplicates account for the remaining 4.5%. Linkage-adjusted main and replicate weights allow researchers to estimate and evaluate the variance of summary measures about population health in the presence of missed linked pairs to better understand the experiences of diverse population groups. Release date: 2025-09-08
Data (1) (1 result)
- Table: 11-10-0074-01. Geography: Census tract. Frequency: Occasional. Description:
The divergence index (D-index) describes the degree to which families with different income levels mix within neighbourhoods. It compares neighbourhood (census tract, CT) discrete income distributions to a base distribution, which is the income quintiles of the neighbourhood's census metropolitan area (CMA).
Release date: 2020-06-22
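A hedged sketch of the D-index described above, treating it as the Kullback-Leibler divergence of a census tract's income-quintile shares from the CMA base distribution (uniform at 0.2 per quintile, since the base is quintiles by construction). The exact formula behind the published table may differ; the tract shares below are invented.

```python
from math import log

def d_index(ct_shares, base=(0.2,) * 5):
    # KL divergence of the tract's quintile shares from the CMA base;
    # zero-share quintiles contribute nothing, by convention.
    return sum(p * log(p / q) for p, q in zip(ct_shares, base) if p > 0)

mixed = d_index((0.2, 0.2, 0.2, 0.2, 0.2))     # fully mixed tract
skewed = d_index((0.05, 0.05, 0.1, 0.2, 0.6))  # income-concentrated tract
```

A perfectly mixed tract scores 0, and the score grows as the tract's income mix diverges from the CMA's.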
Analysis (199) (50 to 60 of 199 results)
- Articles and reports: 11-522-X202100100006. Description:
In the context of its "admin-first" paradigm, Statistics Canada is prioritizing the use of non-survey sources to produce official statistics. This paradigm critically relies on non-survey sources that may have nearly perfect coverage of some target populations, including administrative files or big data sources. Yet this coverage must be measured, e.g., by applying the capture-recapture method, where these sources are compared with other sources that have good coverage of the same populations, including a census. However, this is a challenging exercise in the presence of linkage errors, which arise inevitably when the linkage is based on quasi-identifiers, as is typically the case. To address the issue, a new methodology is described in which the capture-recapture method is enhanced with a new error model based on the number of links adjacent to a given record. It is applied in an experiment with public census data.
Key Words: dual system estimation, data matching, record linkage, quality, data integration, big data.
Release date: 2021-10-22
- Articles and reports: 11-522-X202100100017 (Archived). Applying the data science approach to COVID-19 epidemiological modelling to inform PPE demand and supply in Canada. Description: The outbreak of the COVID-19 pandemic required the Government of Canada to provide relevant and timely information to support decision-making around a host of issues, including personal protective equipment (PPE) procurement and deployment. Our team built a compartmental epidemiological model from an existing code base to project PPE demand under a range of epidemiological scenarios. This model was further enhanced using data science techniques, which allowed for the rapid development and dissemination of model results to inform policy decisions.
Key Words: COVID-19; SARS-CoV-2; Epidemiological model; Data science; Personal Protective Equipment (PPE); SEIR
Release date: 2021-10-22
- Articles and reports: 11-522-X202100100002. Description:
A framework for the responsible use of machine learning processes has been developed at Statistics Canada. The framework includes guidelines for the responsible use of machine learning and a checklist, which are organized into four themes: respect for people, respect for data, sound methods, and sound application. All four themes work together to ensure the ethical use of both the algorithms and results of machine learning. The framework is anchored in a vision that seeks to create a modern workplace and provide direction and support to those who use machine learning techniques. It applies to all statistical programs and projects conducted by Statistics Canada that use machine learning algorithms. This includes supervised and unsupervised learning algorithms. The framework and associated guidelines will be presented first. The process of reviewing projects that use machine learning, i.e., how the framework is applied to Statistics Canada projects, will then be explained. Finally, future work to improve the framework will be described.
Keywords: Responsible machine learning, explainability, ethics
Release date: 2021-10-15
- Articles and reports: 11-522-X202100100003. Description:
The increasing size and richness of digital data allow for modeling more complex relationships and interactions, which is the strong point of machine learning. Here we applied gradient boosting to the Dutch system of social statistical datasets to estimate transition probabilities into and out of poverty. Individual estimates are reasonable, but the main advantages of the approach, in combination with SHAP and global surrogate models, are the simultaneous ranking of hundreds of features by their importance, detailed insight into their relationship with the transition probabilities, and the data-driven identification of subpopulations with relatively high and low transition probabilities. In addition, we decompose the difference in feature importance between the general population and a subpopulation into a frequency effect and a feature effect. We caution against misinterpretation and discuss future directions.
Key Words: Classification; Explainability; Gradient boosting; Life event; Risk factors; SHAP decomposition.
Release date: 2021-10-15
- Articles and reports: 11-522-X202100100014 (Archived). Advances in the use of auxiliary information for estimation from nonprobability samples. Description: Recent developments in questionnaire administration modes and data extraction have favored the use of nonprobability samples, which are often affected by selection bias arising from the lack of a sample design or from self-selection of the participants. This bias can be addressed by several adjustments, whose applicability depends on the type of auxiliary information available. Calibration weighting can be used when only population totals of auxiliary variables are available. If a reference survey that followed a probability sampling design is available, several methods can be applied, such as Propensity Score Adjustment, Statistical Matching or Mass Imputation, and doubly robust estimators. In the case where a complete census of the target population is available for some auxiliary covariates, estimators based on superpopulation models (often used in probability sampling) can be adapted to the nonprobability sampling case. We studied the combination of some of these methods in order to produce less biased and more efficient estimates, as well as the use of modern prediction techniques (such as machine learning classification and regression algorithms) in the modelling steps of the adjustments described. We also studied the use of variable selection techniques prior to the modelling step in Propensity Score Adjustment. Results show that adjustments based on the combination of several methods can improve the efficiency of the estimates, and that machine learning and variable selection techniques can further reduce the bias and the variance of the estimators in several situations.
Key Words: nonprobability sampling; calibration; Propensity Score Adjustment; Matching.
Release date: 2021-10-15 - Articles and reports: 12-001-X202100100004Description: Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining data from a probability survey and big found data. We focus on the case when the study variable is observed in the big data only, but the other auxiliary variables are commonly observed in both data. Unlike the usual imputation for missing data analysis, we create imputed values for all units in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate the proposed estimators outperform existing competitors in terms of robustness and efficiency.Release date: 2021-06-24
- Stats in brief: 89-20-00062021001. Description:
As Canada's national statistical organization, Statistics Canada is committed to sharing our knowledge and expertise to help all Canadians develop their data literacy skills. The goal is to provide learners with information on the basic concepts and skills with regard to a range of data literacy topics.
The training is aimed at those who are new to data or those who have some experience with data but may need a refresher or want to expand their knowledge. We invite you to check out our Learning catalogue to learn more about our offerings including a great collection of short videos. Be sure to check back regularly as we will be continuing to release new training.
Release date: 2021-05-03
- Stats in brief: 89-20-00062021003 (Archived). Statistics 101: Proportions, ratios, and rates. Description:
In this video, viewers will learn the differences between three types of measure: proportions, ratios, and rates. By the end of the video, viewers will also be able to determine how each measure is calculated and when it is best to use one measure rather than another.
Release date: 2021-05-03
- Stats in brief: 89-20-00062021004 (Archived). Machine learning: An introduction. Description:
One important distinction this video makes is the difference between data science, artificial intelligence and machine learning. You'll learn what machine learning can be used for, how it works, and some different methods for doing it. You'll also learn how to build and use machine learning processes responsibly.
This video is recommended for those who already have some familiarity with the concepts and techniques associated with computer programming and using algorithms to analyze data.
Release date: 2021-05-03
- Stats in brief: 89-20-00062021005. Description:
By the end of this video, you should have a deeper understanding of the fundamentals of using data to tell a story. We will go over some of the principal components of storytelling, including the data, the narrative and the visualization, and discuss how they can be used to construct concise, informative and interesting messages your audience can trust. You will then learn the importance of a well-planned data story, which includes knowing who your audience will be, what they should know and how best to deliver that information.
Release date: 2021-05-03
Reference (7) (7 results)
- Surveys and statistical programs – Documentation: 84-538-X. Geography: Canada. Description: This electronic publication presents the methodology underlying the production of the life tables for Canada, provinces and territories. Release date: 2023-08-28
- Surveys and statistical programs – Documentation: 82-225-X200701010508. Description:
The Record Linkage Overview describes the process used in the annual internal record linkage of the Canadian Cancer Registry. The steps include: preparation; pre-processing; record linkage; post-processing; analysis and resolution; resolution entry; and resolution processing.
Release date: 2008-01-18
- Surveys and statistical programs – Documentation: 11-522-X20050019476. Description:
The paper shows how, using data published by Statistics Canada and available from member libraries of CREPUQ, a linkage approach based on postal codes makes it possible to link the data from the outcomes file to a set of contextual variables. These variables could then contribute to producing, on an exploratory basis, a better index for explaining the varied outcomes of students from schools. In terms of impact, the proposed index could more effectively show the limitations of ranking students and schools when this information is not given sufficient weight.
Release date: 2007-03-02
- Surveys and statistical programs – Documentation: 68-514-X. Description:
Statistics Canada's approach to gathering and disseminating economic data has developed over several decades into a highly integrated system for collection and estimation that feeds the framework of the Canadian System of National Accounts.
The key to this approach was creation of the Unified Enterprise Survey, the goal of which was to improve the consistency, coherence, breadth and depth of business survey data.
The UES did so by bringing many of Statistics Canada's individual annual business surveys under a common framework. This framework included a single survey frame, a sample design framework, conceptual harmonization of survey content, means of using relevant administrative data, common data collection, processing and analysis tools, and a common data warehouse.
Release date: 2006-11-20
- Surveys and statistical programs – Documentation: 89-612-X. Description:
This paper describes the structure and linkage of two databases: the Longitudinal Administrative Databank (LAD), and the Longitudinal Immigration Database (IMDB). The combined data associate landed immigrant taxfilers on the LAD with their key characteristics upon immigration. The paper highlights how the combined information, referred to here as the LAD_IMDB, enhances and complements the existing separate databases. The paper compares the full IMDB file with the sample of immigrants to assess the representativeness of the sample file.
Release date: 2004-01-05
- Surveys and statistical programs – Documentation: 81-595-M2003005. Geography: Canada. Description:
This paper develops technical procedures that may enable ministries of education to link provincial tests with national and international tests in order to compare standards and report results on a common scale.
Release date: 2003-05-29
- Surveys and statistical programs – Documentation: 85-602-X. Description:
The purpose of this report is to provide an overview of existing methods and techniques that make use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and/or transforming personal identifiers from individual data records in one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes, to help narrow the field of search against existing databases when some form of personal identification information exists.
Release date: 2000-12-05
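As one concrete example of the probabilistic matching the report surveys, a Fellegi-Sunter-style match weight can be computed for a candidate pair of records: each identifier contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the agreement probability among true matches and u among non-matches. The m/u values and records below are invented for the sketch.

```python
from math import log2

# Hypothetical per-field agreement probabilities:
# (m = P(agree | true match), u = P(agree | non-match)).
FIELDS = {"surname": (0.95, 0.01), "birth_year": (0.90, 0.05)}

def match_weight(rec_a, rec_b):
    # Sum of log-likelihood ratios over the comparison fields.
    w = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            w += log2(m / u)        # agreement pushes the weight up
        else:
            w += log2((1 - m) / (1 - u))  # disagreement pushes it down
    return w

strong = match_weight({"surname": "Tremblay", "birth_year": 1970},
                      {"surname": "Tremblay", "birth_year": 1970})
weak = match_weight({"surname": "Tremblay", "birth_year": 1970},
                    {"surname": "Smith", "birth_year": 1980})
```

Pairs scoring above an upper threshold are declared links, those below a lower threshold non-links, and the band in between is sent to clerical review, which is how "probabilistic matches of varying degrees of reliability" are graded in practice.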