Statistical techniques

Results

All (208) (0 to 10 of 208 results)

1. Quantitative impact analysis: A practical overview
Articles and reports: 36-28-0001202600500003
Description: This spotlight article outlines practical methods for assessing the economic impacts of public programs delivered by federal agencies and Crown corporations. It summarizes key steps in conducting quantitative impact analysis, including data linkage, cohort construction and implementation of quasi-causal estimators.
Release date: 2026-05-27
2. Analytical Studies: Methods and References
Journals and periodicals: 11-633-X
Description: Papers in this series provide background discussions of the methods used to develop data for economic, health, and social analytical studies at Statistics Canada. They are intended to provide readers with information on the statistical methods, standards and definitions used to develop databases for research purposes. All papers in this series have undergone peer and institutional review to ensure that they conform to Statistics Canada's mandate and adhere to generally accepted standards of good professional practice.
Release date: 2026-05-27
3. Effects of model misspecification on small area estimators
Articles and reports: 12-001-X202500200001
Description: Nested error regression models are commonly used to incorporate unit-specific auxiliary variables to improve small area estimates. When the mean structure of the model is misspecified, the design-based mean squared prediction error (MSPE) of Empirical Best Linear Unbiased Predictors (EBLUP) generally increases. The Observed Best Prediction (OBP) method has been proposed with the intent to improve on the design-based MSPE over EBLUP. In this paper, we conduct Monte Carlo simulation experiments to understand the effect of misspecification of mean structures on different small area estimators. Our findings suggest that the OBP using unit-level auxiliary variables does not outperform the EBLUP in terms of design-based MSPE, unless the number of small areas m is extremely large. Conversely, the performance of OBP significantly improves when area-level auxiliary variables are employed. This paper includes both analytical and numerical evidence to demonstrate these observations, providing practical insights for addressing model misspecification in small area estimation (SAE).
Release date: 2025-12-23
4. Integrating probability and non-probability samples through deep learning-based mass imputation
Articles and reports: 12-001-X202500200007
Description: Although probability samples have been regarded as the gold standard to collect information for population-based studies, non-probability samples have been used frequently in practice due to low cost, convenience, and the lack of a sampling frame for the survey. Naïve estimates based on non-probability samples without any adjustments may be misleading due to selection bias. Recently, a valid data integration approach that includes mass imputation, propensity score weighting, and calibration has been used to improve the representativeness of non-probability samples. The effectiveness of the mass imputation approach depends on the underlying model assumptions. In this paper, we propose using deep learning for mass imputation when combining probability and non-probability samples and compare it with several modern machine learning-based mass imputation approaches, including generalized additive modeling, regression tree, random forest, and XGBoost. In the simulation study, deep learning-based approaches have been shown to be more robust and effective than other mass imputation approaches against the failure of underlying model assumptions under non-linearity scenarios.
Release date: 2025-12-23
5. Generalized regression estimation under misspecified sample design
Articles and reports: 12-001-X202500200008
Description: Classical design-based survey estimation relies on a properly specified sampling design for valid inference. We consider the properties of regression estimation under a misspecified sample design, in which the nominal and true inclusion probabilities do not necessarily match. This general misspecified sample design setting encompasses many challenges in the modern survey environment. Under this setting, an asymptotic analysis of the regression estimator, an expression of the bias, and an expression of the variance are presented. Further, a consistent variance estimator is derived and an expression which estimates the bias in part or in whole is discussed. This latter expression may be used by a practitioner as an indicator of the presence of bias due to misspecification. A simulation study is conducted to support the presented theory.
Release date: 2025-12-23
6. Mapping location and co-location of industries at the neighborhood level: A spatial kernel density approach
Articles and reports: 18-001-X2025001
Description: This paper brings the analysis of business clusters to a more granular geographic scale by developing a methodology for identifying business clusters at the neighborhood level. The proposed method identifies clusters of businesses at the dissemination block (DB) level, which is one of the most granular spatial units of analysis defined by Statistics Canada. The method is developed with an application to four census metropolitan areas (CMAs) of different sizes and for different industry cluster specifications, including simple 2-digit North American Industry Classification System (NAICS) groups as well as industry clusters resulting from groupings of NAICS codes, as defined by Delgado et al. (2014).
Release date: 2025-10-10
7. Efficient Record Linkage for Large Datasets by Business Names (Archived)
Articles and reports: 11-522-X202500100019
Description: Accurate and efficient record linkage is crucial for maintaining a comprehensive and current Statistical Business Register (SBR) at Statistics Canada. Linking external business lists to the SBR by name presents computational and methodological challenges, especially as data volumes grow. This paper describes a scalable methodology that employs blocking techniques to constrain the computational search space and integrates multiple similarity measures—from edit distances and n-gram overlaps to embedding-based methods using Sentence-BERT (SBERT)—to identify likely matches. By combining simple character-level comparisons with more advanced semantic embedding methods, the approach can adapt to various naming conventions and complexities. While it does not guarantee superior accuracy in all circumstances, it offers a pragmatic balance between computational feasibility and linkage quality.
Release date: 2025-09-08
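The combination of blocking and name similarity described in the entry above can be illustrated with a minimal sketch. It is not the paper's implementation: the blocking key (first character of the cleaned name), the equal weighting of the two scores, the 0.6 threshold and all function names are assumptions made for illustration, and Python's `difflib.SequenceMatcher` ratio stands in for an edit-distance score. The embedding-based (SBERT) component is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (a deliberately simple cleaning step)."""
    return " ".join(name.lower().split())


def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a string."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """n-gram overlap between two names, between 0 and 1."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)


def block_key(cleaned_name: str) -> str:
    """Hypothetical blocking key: first character of the cleaned name."""
    return cleaned_name[:1]


def link(external: list[str], register: list[str], threshold: float = 0.6):
    """Compare each external name only with register names in the same block."""
    blocks = defaultdict(list)
    for reg in register:
        cleaned = normalize(reg)
        blocks[block_key(cleaned)].append((reg, cleaned))

    matches = []
    for ext in external:
        cleaned_ext = normalize(ext)
        for reg, cleaned_reg in blocks.get(block_key(cleaned_ext), []):
            # Assumed scoring rule: simple average of two similarity measures.
            sequence_score = SequenceMatcher(None, cleaned_ext, cleaned_reg).ratio()
            score = 0.5 * jaccard(cleaned_ext, cleaned_reg) + 0.5 * sequence_score
            if score >= threshold:
                matches.append((ext, reg, round(score, 3)))
    return matches


matches = link(
    ["Maple Leaf Bakery Inc"],
    ["Maple Leaf Bakery Inc.", "Northern Lights Ltd", "Mapleton Hardware"],
)
print(matches)
```

Blocking keeps the number of pairwise comparisons small, at the cost of missing matches whose names fall into different blocks; a real application would likely use several blocking passes.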
8. Evaluating the Accuracy when Linking Records in Waves (Archived)
Articles and reports: 11-522-X202500100020
Description: At Statistics Canada, many data sets are linked with quasi-identifiers such as the first name, last name, or address. In such cases, linkage errors are a potential concern and must be measured. In that regard, previous studies have shown that the evaluation may be based on modeling the number of links from a given record while accounting for all the interactions among the linkage variables and dispensing with clerical reviews, so long as the decision to link two records does not involve other records. In this communication, the methodology is adapted for a class of practical strategies, which violate this constraint by linking the records in consecutive waves, where a given wave links a subset of the records that are not linked in previous waves. In particular, the linkage may be based on a deterministic wave followed by a probabilistic one.
Release date: 2025-09-08
9. Model-Based Threshold Selection for Agricultural Linkages (Archived)
Articles and reports: 11-522-X202500100021
Description: Optimal threshold selection is a critical challenge in probabilistic linkage, with significant implications for the accuracy and reliability of linked datasets. This paper analyzes the performance of the neighbour model, a recently proposed error model which models linkage errors by the number of links from each record. Three threshold selection algorithms utilizing the neighbour model were assessed, highlighting the strengths and limitations of each. Their performance was assessed through simulation studies, which demonstrated that methods using the neighbour model achieved lower relative bias compared to two established methods for threshold selection. Additionally, the practical utility was validated through goodness-of-fit tests conducted on four agricultural datasets, showing the potential of the model for use in real-world applications.
Release date: 2025-09-08
10. T1 Redesign: T1 Partnership Identification Process (Archived)
Articles and reports: 11-522-X202500100022
Description: In Canada, T1 tax forms are used to report personal income, whether earned as an employee or through self-employment. Income from self-employment, or "T1 Business Income," is reported by sole proprietorships or partnerships. A T1 partnership involves two or more legal entities jointly filing for a shared business. T1 business data is received as individual filings, meaning partnerships are received separately for each partner. Internal record linkage within the T1 business database is performed to identify partnerships and prevent overcoverage within the final population of T1 businesses. This new T1 partnership identification process takes advantage of newer algorithms, such as DBSCAN numerical clustering and fuzzy matching, to identify internal linkages. Graph theory is used to construct the list of partnerships from the row-pairs identified in the linkage process.
Release date: 2025-09-08
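One common way to turn linked row-pairs into groups is to treat each filing as a graph node, each linked pair as an edge, and read off the connected components. The sketch below assumes that interpretation of the "graph theory" step in the entry above; the paper's actual procedure is not described in this listing, and the function name and record identifiers are hypothetical.

```python
def partnerships(pairs):
    """Group record identifiers into connected components using union-find."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Hypothetical row-pairs: filings 1, 2 and 3 are linked, as are 7 and 8.
groups = partnerships([(1, 2), (2, 3), (7, 8)])
print(groups)
```

Filings 1 and 3 end up in the same group even though they were never linked directly, which is the reason for using a graph rather than the raw pairs.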

Data (1) (1 result)

1. Income divergence index (D-index) by census tract
Table: 11-10-0074-01
Geography: Census tract
Frequency: Occasional
Description:
The divergence index (D-index) describes the degree that families with different income levels are mixing together in neighbourhoods. It compares neighbourhood (census tract, CT) discrete income distributions to a base distribution, which is the income quintiles of the neighbourhood’s census metropolitan area (CMA).

Release date: 2020-06-22
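As a rough illustration of comparing a tract's income distribution with a base distribution, the sketch below assumes the index takes the form of a Kullback-Leibler divergence, the sum over income groups of p × ln(p / q), where p is the tract's share and q is the base share. The table description above does not state the formula, so this form, the function name and the example shares are all assumptions.

```python
import math


def divergence_index(tract_shares, base_shares):
    """Assumed form: sum of p * ln(p / q) over income groups (KL divergence)."""
    if len(tract_shares) != len(base_shares):
        raise ValueError("both distributions need the same number of groups")
    total = 0.0
    for p, q in zip(tract_shares, base_shares):
        if p > 0:  # groups with a zero tract share contribute nothing
            total += p * math.log(p / q)
    return total


# By construction, each CMA income quintile holds one fifth of families.
cma_quintiles = [0.2, 0.2, 0.2, 0.2, 0.2]

mixed_tract = divergence_index([0.2, 0.2, 0.2, 0.2, 0.2], cma_quintiles)
segregated_tract = divergence_index([1.0, 0.0, 0.0, 0.0, 0.0], cma_quintiles)
print(mixed_tract, segregated_tract)
```

Under this assumed form, a tract that mirrors the CMA scores zero and the value grows as the tract's income mix departs from the CMA's.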

Analysis (200) (100 to 110 of 200 results)

101. A Long-run Version of the Bank of Canada Commodity Price Index, 1870 to 2015 (Archived)
Articles and reports: 11F0019M2017399
Description:
Canada is a trading nation that produces significant quantities of resource outputs. Consequently, the behaviour of resource prices that are important for Canada is germane to understanding the progress of real income growth and the prosperity of the country and the provinces. Demand and supply shocks or changes in monetary policy in international markets may exert significant influence on resource prices, and their fluctuations constitute an important avenue for the transmission of external shocks into the domestic economy. This paper develops historical estimates of the Bank of Canada commodity price index (BCPI) and links them to modern estimates. Using a collection of historical data sources, it estimates weights and prices sufficiently consistently to merit the construction of long-run estimates that may be linked to the modern Fisher BCPI.
Release date: 2017-10-11
102. Estimating Parental Leave in Canada Using Administrative Data (Archived)
Articles and reports: 11-633-X2017009
Description:
This document describes the procedures for using linked administrative data sources to estimate paid parental leave rates in Canada and the issues surrounding this use.
Release date: 2017-08-29
103. Zeno: A Tool for Calculating Confidence Intervals of Rates in Health (Archived)
Articles and reports: 11-633-X2017005
Description:
Hospitalization rates are among commonly reported statistics related to health-care service use. The variety of methods for calculating confidence intervals for these and other health-related rates suggests a need to classify, compare and evaluate these methods. Zeno is a tool developed to calculate confidence intervals of rates based on several formulas available in the literature. This report describes the contents of the main sheet of the Zeno Tool and indicates which formulas are appropriate, based on users’ assumptions and scope of analysis.
Release date: 2017-01-19
104. Reducing the response imbalance: Is the accuracy of the survey estimates improved? (Archived)
Articles and reports: 12-001-X201600214663
Description:
We present theoretical evidence that efforts during data collection to balance the survey response with respect to selected auxiliary variables will improve the chances for low nonresponse bias in the estimates that are ultimately produced by calibrated weighting. One of our results shows that the variance of the bias – measured here as the deviation of the calibration estimator from the (unrealized) full-sample unbiased estimator – decreases linearly as a function of the response imbalance that we assume measured and controlled continuously over the data collection period. An attractive prospect is thus a lower risk of bias if one can manage the data collection to get low imbalance. The theoretical results are validated in a simulation study with real data from an Estonian household survey.
Release date: 2016-12-20
105. Statistical inference based on judgment post-stratified samples in finite population (Archived)
Articles and reports: 12-001-X201600214664
Description:
This paper draws statistical inference for the finite population mean based on judgment post-stratified (JPS) samples. The JPS sample first selects a simple random sample and then stratifies the selected units into H judgment classes based on their relative positions (ranks) in a small set of size H. This leads to a sample with random sample sizes in judgment classes. The ranking process can be performed using either auxiliary variables or visual inspection to identify the ranks of the measured observations. The paper develops an unbiased estimator and constructs a confidence interval for the population mean. Since judgment ranks are random variables, by conditioning on the measured observations we construct Rao-Blackwellized estimators for the population mean. The paper shows that Rao-Blackwellized estimators perform better than usual JPS estimators. The proposed estimators are applied to 2012 United States Department of Agriculture Census Data.
Release date: 2016-12-20
106. A cautionary note on Clark Winsorization (Archived)
Articles and reports: 12-001-X201600214676
Description:
Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as the Gross Domestic Product (GDP).
Release date: 2016-12-20
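The basic idea of one-sided Winsorization, replacing values above a cutoff with the cutoff itself, can be sketched as follows. This is not the Clark (1995) algorithm discussed in the entry above, which derives its cutoff from the data; here the cutoff is simply supplied by the caller, and the function name and example values are hypothetical.

```python
def winsorize_upper(values, cutoff):
    """Cap values above a fixed cutoff and flag which ones were treated."""
    treated = [min(v, cutoff) for v in values]
    flagged = [v > cutoff for v in values]
    return treated, flagged


# Hypothetical monthly sales with one influential value.
sales = [10, 12, 11, 250]
treated, flagged = winsorize_upper(sales, cutoff=50)
print(treated, flagged)
```

The set of values above the cutoff plays the role of the detection region mentioned in the abstract: with a fixed cutoff it does not move, whereas a data-driven cutoff shifts with the number and size of extreme values in the sample.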
107. Adaptive rectangular sampling: An easy, incomplete, neighbourhood-free adaptive cluster sampling design (Archived)
Articles and reports: 12-001-X201600214684
Description:
This paper introduces an incomplete adaptive cluster sampling design that is easy to implement, controls the sample size well, and does not need to follow the neighbourhood. In this design, an initial sample is first selected, using one of the conventional designs. If a cell satisfies a prespecified condition, a specified radius around the cell is sampled completely. The population mean is estimated using the π-estimator. If all the inclusion probabilities are known, then an unbiased π-estimator is available; if, depending on the situation, the inclusion probabilities are not known for some of the final sample units, then they are estimated. To estimate the inclusion probabilities, a biased estimator is constructed. However, the simulations show that if the sample size is large enough, the error of the inclusion probabilities is negligible, and the relative π-estimator is almost unbiased. This design rivals adaptive cluster sampling because it controls the final sample size and is easy to manage. It rivals adaptive two-stage sequential sampling because it considers the cluster form of the population and reduces the cost of moving across the area. Using real data on a bird population and simulations, the paper compares the design with adaptive two-stage sequential sampling. The simulations show that the design has significant efficiency in comparison with its rival.
Release date: 2016-12-20
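For readers unfamiliar with the π-estimator (also known as the Horvitz-Thompson estimator), the sketch below shows the standard form for a population mean when all inclusion probabilities are known: each sampled value is divided by its inclusion probability, and the sum is divided by the population size N. It does not reproduce the paper's procedure for estimating unknown inclusion probabilities; the function name and example numbers are hypothetical.

```python
def pi_estimator_mean(sample_values, inclusion_probs, population_size):
    """Estimate a population mean as (1 / N) * sum(y_i / pi_i)."""
    if len(sample_values) != len(inclusion_probs):
        raise ValueError("one inclusion probability is needed per sampled unit")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weighted_total = sum(y / p for y, p in zip(sample_values, inclusion_probs))
    return weighted_total / population_size


# Hypothetical counts in two sampled cells from a population of 10 cells.
estimate = pi_estimator_mean([2, 4], [0.5, 0.25], population_size=10)
print(estimate)
```

Units with a small inclusion probability receive a large weight, which is why errors in estimated probabilities can matter for the final estimate.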
108. Study: The 2001 Canadian Census–Tax–Mortality Cohort: A 10-year follow-up (Archived)
Stats in brief: 11-001-X201630015325
Description: Release published in The Daily – Statistics Canada’s official release bulletin
Release date: 2016-10-26
109. The 2001 Canadian Census–Tax–Mortality Cohort: A 10-Year Follow-up (Archived)
Articles and reports: 11-633-X2016003
Description:
Large national mortality cohorts are used to estimate mortality rates for different socioeconomic and population groups, and to conduct research on environmental health. In 2008, Statistics Canada created a cohort linking the 1991 Census to mortality. The present study describes a linkage of the 2001 Census long-form questionnaire respondents aged 19 years and older to the T1 Personal Master File and the Amalgamated Mortality Database. The linkage tracks all deaths over a 10.6-year period (until the end of 2011, to date).
Release date: 2016-10-26
110. Linking the Canadian Immigrant Landing File to Hospital Data: A New Data Source for Immigrant Health Research (Archived)
Articles and reports: 11-633-X2016002
Description:
Immigrants comprise an ever-increasing percentage of the Canadian population—at more than 20%, which is the highest percentage among the G8 countries (Statistics Canada 2013a). This figure is expected to rise to 25% to 28% by 2031, when at least one in four people living in Canada will be foreign-born (Statistics Canada 2010).
This report summarizes the linkage of the Immigrant Landing File (ILF) for all provinces and territories, excluding Quebec, to hospital data from the Discharge Abstract Database (DAD), a national database containing information about hospital inpatient and day-surgery events. A deterministic exact-matching approach was used to link data from the 1980-to-2006 ILF and from the DAD (2006/2007, 2007/2008 and 2008/2009) with the 2006 Census, which served as a “bridge” file. This was a secondary linkage in that it used linkage keys created in two previous projects (primary linkages) that separately linked the ILF and the DAD to the 2006 Census. The ILF–DAD linked data were validated by means of a representative sample of 2006 Census records containing immigrant information previously linked to the DAD.
Release date: 2016-08-17

Reference (7) (7 results)

1. Methods for Constructing Life Tables for Canada, Provinces and Territories
Surveys and statistical programs – Documentation: 84-538-X
Geography: Canada
Description: This electronic publication presents the methodology underlying the production of the life tables for Canada, provinces and territories.
Release date: 2023-08-28
2. Canadian Cancer Registry System Guide, 2007 Edition (Archived)
Surveys and statistical programs – Documentation: 82-225-X200701010508
Description:
The Record Linkage Overview describes the process used in annual internal record linkage of the Canadian Cancer Registry. The steps include: preparation; pre-processing; record linkage; post-processing; analysis and resolution; resolution entry; and, resolution processing.
Release date: 2008-01-18
3. Using the postal code as merge key for independent data files: matching data from the Canadian Census and an administrative file of school test scores in Quebec (Archived)
Surveys and statistical programs – Documentation: 11-522-X20050019476
Description:
The paper will show how, using data published by Statistics Canada and available from member libraries of the CREPUQ, a linkage approach using postal codes makes it possible to link the data from the outcomes file to a set of contextual variables. These variables could then contribute to producing, on an exploratory basis, a better index to explain the varied outcomes of students from schools. In terms of the impact, the proposed index could show more effectively the limitations of ranking students and schools when this information is not given sufficient weight.
Release date: 2007-03-02
4. The Integrated Approach to Economic Surveys in Canada (Archived)
Surveys and statistical programs – Documentation: 68-514-X
Description:
Statistics Canada's approach to gathering and disseminating economic data has developed over several decades into a highly integrated system for collection and estimation that feeds the framework of the Canadian System of National Accounts.
The key to this approach was creation of the Unified Enterprise Survey, the goal of which was to improve the consistency, coherence, breadth and depth of business survey data.
The UES did so by bringing many of Statistics Canada's individual annual business surveys under a common framework. This framework included a single survey frame, a sample design framework, conceptual harmonization of survey content, means of using relevant administrative data, common data collection, processing and analysis tools, and a common data warehouse.
Release date: 2006-11-20
5. The Longitudinal Administrative Databank (LAD) and the Longitudinal Immigration Database (IMDB): Building the LAD_IMDB - A Technical Paper (Archived)
Surveys and statistical programs – Documentation: 89-612-X
Description:
This paper describes the structure and linkage of two databases: the Longitudinal Administrative Databank (LAD), and the Longitudinal Immigration Database (IMDB). The combined data associate landed immigrant taxfilers on the LAD with their key characteristics upon immigration. The paper highlights how the combined information, referred to here as the LAD_IMDB, enhances and complements the existing separate databases. The paper compares the full IMDB file with the sample of immigrants to assess the representativeness of the sample file.
Release date: 2004-01-05
6. Linking Provincial Student Assessments with National and International Assessments (Archived)
Surveys and statistical programs – Documentation: 81-595-M2003005
Geography: Canada
Description:
This paper develops technical procedures that may enable ministries of education to link provincial tests with national and international tests in order to compare standards and report results on a common scale.
Release date: 2003-05-29
7. An Overview of the Issues Related to the Use of Personal Identifiers (Archived)
Surveys and statistical programs – Documentation: 85-602-X
Description:
The purpose of this report is to provide an overview of existing methods and techniques making use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and / or transforming personal identifiers from individual data records from one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes to help narrow the field of search against existing databases when some form of personal identification information exists.
Release date: 2000-12-05

Date modified: 2026-06-17
