Statistical methods

Results

All (2,478) (50 to 60 of 2,478 results)

  • Articles and reports: 11-522-X202500100019
    Description: Accurate and efficient record linkage is crucial for maintaining a comprehensive and current Statistical Business Register (SBR) at Statistics Canada. Linking external business lists to the SBR by name presents computational and methodological challenges, especially as data volumes grow. This paper describes a scalable methodology that employs blocking techniques to constrain the computational search space and integrates multiple similarity measures—from edit distances and n-gram overlaps to embedding-based methods using Sentence-BERT (SBERT)—to identify likely matches. By combining simple character-level comparisons with more advanced semantic embedding methods, the approach can adapt to various naming conventions and complexities. While it does not guarantee superior accuracy in all circumstances, it offers a pragmatic balance between computational feasibility and linkage quality.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100020
    Description: At Statistics Canada, many data sets are linked with quasi-identifiers such as the first name, last name, or address. In such cases, linkage errors are a potential concern and must be measured. In that regard, previous studies have shown that the evaluation may be based on modeling the number of links from a given record while accounting for all the interactions among the linkage variables and dispensing with clerical reviews, so long as the decision to link two records does not involve other records. In this communication, the methodology is adapted for a class of practical strategies, which violate this constraint by linking the records in consecutive waves, where a given wave links a subset of the records that are not linked in previous waves. In particular, the linkage may be based on a deterministic wave followed by a probabilistic one.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100021
    Description: Optimal threshold selection is a critical challenge in probabilistic linkage, with significant implications for the accuracy and reliability of linked datasets. This paper analyzes the performance of the neighbour model, a recently proposed error model which models linkage errors by the number of links from each record. Three threshold selection algorithms utilizing the neighbour model were assessed, highlighting the strengths and limitations of each. Their performance was assessed through simulation studies, which demonstrated that methods using the neighbour model achieved lower relative bias compared to two established methods for threshold selection. Additionally, the practical utility was validated through goodness-of-fit tests conducted on four agricultural datasets, showing the potential of the model for use in real-world applications.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100022
    Description: In Canada, T1 tax forms are used to report personal income, whether earned as an employee or through self-employment. Income from self-employment, or "T1 Business Income", is reported by sole proprietorships or partnerships. A T1 partnership involves two or more legal entities jointly filing for a shared business. T1 business data are received as individual filings, meaning partnership data are received separately for each partner. Internal record linkage within the T1 business database is performed to identify partnerships and prevent overcoverage within the final population of T1 businesses. This new T1 partnership identification process takes advantage of newer algorithms, such as DBSCAN numerical clustering and fuzzy matching, to identify internal linkages. Graph theory is used to construct the list of partnerships from the row-pairs identified in the linkage process.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100023
    Description: The latest Canadian Census Health and Environment Cohort (CanCHEC) continues a series of population-based microdata linkages focused on population health research by demographic, social and economic characteristics. The 2021 CanCHEC consists of 95.5% of the 2021 Census long-form sample survey records. The records of survey respondents that could not be linked to the Derived Record Depository and those presumed to be duplicates account for the remaining 4.5%. Linkage-adjusted main and replicate weights allow researchers to estimate and evaluate the variance of summary measures about population health in the presence of missed linked pairs to better understand the experiences of diverse population groups.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100024
    Description: This paper explores a vision for the future of National Statistics Offices (NSOs). It analyses the history and role of NSOs, explores their current and future challenges and opportunities, and finally outlines a future where NSOs become more agile, open, and collaborative while maintaining their high level of trust in the community, thereby allowing them to fulfil their new role as data stewards in a rapidly evolving data landscape.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100025
    Description: National statistical offices have increasingly adopted machine learning (ML) for its potential to improve survey estimates. ML techniques offer significant advantages, notably the ability to manage high-dimensional data and to capture complex, nonlinear relationships, thereby enhancing the overall quality of survey statistics. In this article, following the approach of Chernozhukov et al. (2018), we describe a double debiased machine learning framework that enables valid statistical inference when imputed estimators are derived from ML procedures. Simulation results suggest that the proposed framework performs well in a wide range of scenarios.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100026
    Description: In 2022, Canada became the first country to release statistical information about its transgender and non-binary populations based on census data. Moreover, following a 2018 government-wide policy direction, Statistics Canada's surveys have been collecting and disseminating information about gender by default rather than sex at birth. Due to the small size of the transgender and non-binary populations, disseminating safe statistical information about them at detailed geographical levels poses a challenge.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100027
    Description: Several challenges encountered when constructing U.S. administrative record-based (AR-based) population estimates for 2020 are identified. They include locational accuracy, person coverage and its consistency over time, filtering out non-residents and people not alive on the reference date, uncovering missing links across person and address records, and predicting demographic characteristics. Several ways to address these issues are discussed. Regression results illustrate how the challenges and solutions affect the AR-based county population estimates.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100028
    Description: The United Nations Sustainable Development Goals require detailed, disaggregated data, typically obtained through household surveys. However, surveys alone cannot meet these needs for granular statistics. To address this, National Statistical Institutes adopt small area methods, but these face challenges as auxiliary variables, often derived from surveys, introduce measurement errors into the models. The aim is to apply measurement error correction in the classic Fay-Herriot area-level model. The results demonstrate the robustness of the standard approach of ignoring measurement error, but show that there are specific scenarios where correction for measurement errors is beneficial. The approach is applied to a case study utilizing Indonesian household survey data.
    Release date: 2025-09-08
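
The blocking-plus-similarity approach described in entry 11-522-X202500100019 above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's method: the blocking key (first three characters of the cleaned name), the equal weighting of the two scores, the 0.7 threshold, and the use of Python's `difflib.SequenceMatcher` ratio as a stand-in for a normalized edit-distance score. The embedding-based (SBERT) comparison is left out so that the example needs only the standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (a deliberately simple cleaner)."""
    return " ".join(name.lower().split())


def block_key(name: str) -> str:
    """Hypothetical blocking key: first three characters of the cleaned name."""
    return normalize(name)[:3]


def trigrams(name: str) -> set[str]:
    """Set of character 3-grams of the cleaned name."""
    s = normalize(name)
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a: str, b: str) -> float:
    """N-gram overlap: shared trigrams divided by all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def string_similarity(a: str, b: str) -> float:
    """Stand-in for an edit-distance score (SequenceMatcher ratio in [0, 1])."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_links(external, register, threshold=0.7):
    """Compare names only within the same block; keep pairs above threshold."""
    blocks = defaultdict(list)
    for reg_name in register:
        blocks[block_key(reg_name)].append(reg_name)

    links = []
    for ext_name in external:
        # Blocking: only register names sharing the key are compared.
        for reg_name in blocks.get(block_key(ext_name), []):
            score = 0.5 * jaccard(ext_name, reg_name) \
                + 0.5 * string_similarity(ext_name, reg_name)
            if score >= threshold:
                links.append((ext_name, reg_name, score))
    return links


# Made-up names, used only to exercise the functions.
register = ["Maple Leaf Bakery Inc", "Northern Lights Plumbing", "Maple Grove Dental"]
external = ["Maple Leaf Bakery Inc.", "Northen Lights Plumbing Ltd"]
links = candidate_links(external, register)
```

Blocking keeps the number of comparisons far below the full cross product, at the cost of missing true matches whose keys differ (for example, a typo in the first three characters); a production system would typically combine several blocking passes.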

Data (10) (10 results)

  • Public use microdata: 89F0002X
    Description: The Social Policy Simulation Database and Model (SPSD/M) is a static microsimulation model designed to analyse financial interactions between governments and individuals in Canada. It can compute taxes paid to and cash transfers received from government. It comprises a database, a series of tax/transfer algorithms and models, analytical software and user documentation.
    Release date: 2026-02-12

  • Profile of a community or region: 46-26-0002
    Description: The National Address Register (NAR) is a list of commercial and residential addresses in Canada that are extracted from Statistics Canada's Building Register and deemed non-confidential.
    Release date: 2025-12-19

  • Table: 89-26-0006
    Description: PASSAGES is an open-source dynamic microsimulation model aimed at supporting policy analysis and research relating to Canadian retirement income system outcomes at the individual and family level. The publicly available version includes a synthetic starting database, a model, and documentation. A confidential starting database is also available.
    Release date: 2025-03-12

  • Data Visualization: 71-607-X2020010
    Description: The Canadian Statistical Geospatial Explorer empowers users to discover geo-enabled data holdings of Statistics Canada at various levels of geography, including the neighbourhood level. Users are able to visualize, thematically map, spatially explore and analyze, export and consume data in various formats. Users can also view the data superimposed on satellite imagery, topographic and street layers.
    Release date: 2024-08-21

  • Table: 11-10-0074-01
    Geography: Census tract
    Frequency: Occasional
    Description:

    The divergence index (D-index) describes the degree to which families with different income levels mix together in neighbourhoods. It compares the discrete income distribution of each neighbourhood (census tract, CT) to a base distribution, which is the income quintiles of the neighbourhood’s census metropolitan area (CMA).

    Release date: 2020-06-22

  • Data Visualization: 71-607-X2019010
    Description: The Housing Data Viewer is a visualization tool that allows users to explore Statistics Canada data on a map. Users can use the tool to navigate, compare and export data.
    Release date: 2019-10-30

  • Table: 53-500-X
    Description:

    This report presents the results of a pilot survey conducted by Statistics Canada to measure the fuel consumption of on-road motor vehicles registered in Canada. This study was carried out in connection with the Canadian Vehicle Survey (CVS) which collects information on road activity such as distance traveled, number of passengers and trip purpose.

    Release date: 2004-10-21

  • Table: 13-220-X
    Description: In the 1997 edition, new and revised benchmarks were introduced for 1992 and 1988. The indicators are used to monitor supply, demand and employment for tourism in Canada on a timely basis. The annual tables are derived using the National Income and Expenditure Accounts (NIEA) and various industry and travel surveys. Tables providing actual data and percentage changes, for seasonally adjusted current and constant price estimates are included. In addition, an analytical section provides graphs, and time series of first differences, percentage changes, and seasonal factors for selected indicators. Data are published from 1987 and the publication will be available on the day of release. New data are included in the demand tables for non-tourism commodities produced by non-tourism industries and in the employment tables covering direct tourism employment generated by non-tourism industries. This product was commissioned by the Canadian Tourism Commission to provide annual updates for the Tourism Satellite Account.
    Release date: 2003-01-08

  • Table: 11-516-X
    Description:

    The second edition of Historical statistics of Canada was jointly produced by the Social Science Federation of Canada and Statistics Canada in 1983. This volume contains about 1,088 statistical tables on the social, economic and institutional conditions of Canada from the start of Confederation in 1867 to the mid-1970s. The tables are arranged in sections with an introduction explaining the content of each section, the principal sources of data for each table, and general explanatory notes regarding the statistics. In most cases, there is sufficient description of the individual series to enable the reader to use them without consulting the numerous basic sources referenced in the publication.

    The electronic version of this historical publication is accessible on the Internet site of Statistics Canada as a free downloadable document: text as HTML pages and all tables as individual spreadsheets in a comma delimited format (CSV) (which allows online viewing or downloading).

    Release date: 1999-07-29

  • Table: 82-567-X
    Description:

    The National Population Health Survey (NPHS) is designed to enhance the understanding of the processes affecting health. The survey collects cross-sectional as well as longitudinal data. In 1994/95 the survey interviewed a panel of 17,276 individuals, then returned to interview them a second time in 1996/97. The response rate for these individuals was 96% in 1996/97. Data collection from the panel will continue for up to two decades. For cross-sectional purposes, data were collected for a total of 81,000 household residents in all provinces (except people on Indian reserves or on Canadian Forces bases) in 1996/97.

    This overview illustrates the variety of information available by presenting data on perceived health, chronic conditions, injuries, repetitive strains, depression, smoking, alcohol consumption, physical activity, consultations with medical professionals, use of medications and use of alternative medicine.

    Release date: 1998-07-29
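
The divergence index entry above (Table 11-10-0074-01) compares a neighbourhood's income distribution to CMA quintiles but does not state a formula. The sketch below assumes one common formulation, the Kullback-Leibler divergence of the neighbourhood distribution from the base distribution; that choice, the function name and the sample counts are all assumptions made for illustration. Because the base categories are quintiles, each base share defaults to 1/5.

```python
import math


def divergence_index(counts, base=None):
    """Kullback-Leibler divergence of a neighbourhood's income distribution
    from a base distribution (assumed formulation, see lead-in).

    counts: number of families in each CMA income quintile, for one tract.
    base:   base shares; defaults to equal shares (0.2 each for quintiles).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    k = len(counts)
    if base is None:
        base = [1.0 / k] * k

    d = 0.0
    for c, q in zip(counts, base):
        p = c / total
        if p > 0:  # terms with p == 0 contribute nothing (0 * log 0 := 0)
            d += p * math.log(p / q)
    return d


# Made-up tracts: one mirrors the CMA quintiles, one holds a single quintile.
mixed = divergence_index([20, 20, 20, 20, 20])
concentrated = divergence_index([100, 0, 0, 0, 0])
```

Under this formulation the index is zero when the tract matches the CMA distribution and grows as the tract's families concentrate in fewer income groups.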

Analysis (2,036) (20 to 30 of 2,036 results)

  • Articles and reports: 75-005-M2025001
    Description: Since 2010, engaging Canadians to participate in the Labour Force Survey (LFS) has become more challenging due to a variety of social and technological changes. The decline in the LFS response rate accelerated in 2020, exacerbated by public health measures during the COVID-19 pandemic. This technical paper presents preliminary results of two collection initiatives implemented using an online-first strategy to improve LFS response rates by confirming respondent contact information and expanding the availability of online response. Through these and other planned initiatives, Statistics Canada is working to ensure that the LFS estimates continue to provide an accurate and representative portrait of the Canadian labour market.
    Release date: 2025-10-21

  • Articles and reports: 18-001-X2025001
    Description: This paper brings the analysis of business clusters to a more granular geographic scale by developing a methodology for identifying business clusters at the neighbourhood level. The proposed method identifies clusters of businesses at the dissemination block (DB) level, which is one of the most granular spatial units of analysis defined by Statistics Canada. The method is developed with an application to four census metropolitan areas (CMAs) of different sizes and for different industry cluster specifications, including simple 2-digit North American Industry Classification System (NAICS) groups as well as industry clusters resulting from groupings of NAICS codes, as defined by Delgado et al. (2014).
    Release date: 2025-10-10

  • Journals and periodicals: 12-206-X
    Description: This report summarizes the annual achievements of the Methodology Research and Development Program (MRDP) sponsored by the Modern Statistical Methods and Data Science Branch at Statistics Canada. This program covers research and development activities in statistical methods with potentially broad application in the agency’s statistical programs; these activities would otherwise be less likely to be carried out during the provision of regular methodology services to those programs. The MRDP also includes activities that provide support in the application of past successful developments in order to promote the use of the results of research and development work. Selected prospective research activities are also presented.
    Release date: 2025-10-10

  • Articles and reports: 11-522-X202500100001
    Description: Synthetic data generation (SDG) is increasingly applied across sectors for privacy-preserving data sharing, de-biasing and augmentation. Each use case requires a distinct set of evaluation metrics that must account for the stochasticity of the SDG process: membership and attribute disclosure vulnerability are critical for privacy; fidelity and downstream task utility apply more broadly; and fairness and diversity are relevant for de-biasing and augmentation, respectively. Through accumulated evidence and exemplar case studies, it is shown that SDG can perform well across many of these use cases, and key lessons from experience with synthetic health data are shared.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100002
    Description: Under the consumer-merchant bipartite network, we apply the indirect sampling approach to estimate merchant payment acceptance through a consumer payment diary. The records of in-person transactions in the consumer diary provide both the merchant sample, via consumer-merchant linkages, and the merchant acceptance, via consumers' responses on methods of payment used and accepted. Among merchants receiving multiple transactions during the period of the diary, we show that the payment acceptance derived from consumer reporting is of high quality, with very few conflicts between usage and perception, and within perceptions. Therefore, consumers are leveraged as both sampling and reporting units in our indirect sampling application to eliminate merchant response burden. Furthermore, the need for a weight adjustment to account for the non-recorded-merchant bias due to the relatively short duration of the diary (i.e., 3 days) is shown. Finally, these indirect sampling estimates are compared to those from a direct sampling survey, and the results are found to align well.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100003
    Description: In-person data collection is critical for the success of many large government-sponsored surveys. Despite response rate declines and increasing costs, the mode remains the gold standard for meeting the most rigorous survey requirements for federal survey programs, particularly as part of a multimode data collection strategy (Schober, 2018). However, over the last ten years, critical labor market and workforce changes, exacerbated by the pandemic, have made in-person data collection efforts prohibitive for all but the largest survey organizations. Shifting ideas about job flexibility and job satisfaction, alongside the increasingly technical role and demanding nature of the job, have impacted recruitment and retention for survey organizations across the U.S. and Europe (Charman et al., 2024). Trends in U.S. field data collector employment are summarized, and promising practices for recruiting and retaining high-quality field data collectors are outlined. Additionally, broader ways to structure the field data collector labor force for continued success are considered, including supplementing field data collection with multimode alternatives such as video interviewing and updating value propositions for respondents.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100004
    Description: The Survey of Household Spending (SHS) conducted by Statistics Canada collects paper diaries and shopping receipts as a source of household expenditure data. An auto-capturing algorithm was created for SHS 2023 to reduce statistical clerks' manual work of extracting important information from scanned receipts of common store brands. The algorithm used Tesseract optical character recognition (OCR) to extract text characters from images of receipts, and it identified store and product entities using regular expressions, also known as regex. The goal of this study was to enhance the current auto-capture algorithm by experimenting with more advanced OCR and machine learning methods. As a result, PaddleOCR, an open-source OCR toolkit, was selected as the new default OCR engine due to its overall performance in recognizing texts, especially digits, accurately across receipts of various qualities. Additionally, entity classifiers based on support vector machines were trained on historical SHS records and existing regex patterns. By using classifiers to categorize different elements present on receipts instead of relying solely on regex patterns, product and store recognition improved. It is expected that this new algorithm will be used for SHS 2025 to improve the auto-capture quality and reduce the manual burden associated with capturing receipt variables.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100005
    Description: The Physical Flow Account for Plastic Material (PFAPM) aims to enhance environmental-economic analysis by tracking plastic material flows within the Canadian economy. To help streamline this complex process, the project leveraged advanced natural language processing (NLP) techniques, such as large language models (LLMs), to automate sector classification and summarize the impact of COVID-19 from company reports. By integrating machine learning models and retrieval-augmented generation (RAG) methods, the manual workload was significantly reduced, improving data analysis efficiency and leading to higher quality insights.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100006
    Description: Small area estimation is frequently used to produce estimates at a disaggregated level where direct survey estimation does not have sufficient sample to produce precise estimates. Often this is done using the area-level Fay-Herriot model, by assuming the direct estimates are independent under the design and have a known variance, and applying a smoothing process to the variance estimates of the direct estimates to better meet that last assumption. It is not rare that small area estimates are benchmarked/raked to aggregated level direct estimates. This article shows that wrongly assuming independence can have a big impact on the MSE of the raked estimates. Values of the covariances between direct estimates are thus required for good point and MSE estimates. Getting good estimates of those covariances is difficult given the small sample sizes in some areas. An original way of deriving values for those covariances, by reverse-engineering a hypothetical raking process, is presented.
    Release date: 2025-09-08

  • Articles and reports: 11-522-X202500100007
    Description: This paper applies the Pseudo Maximum Likelihood (PML) estimator to non-probability two-phase sampling when relevant auxiliary information is available from both a probability survey sample and a non-probability survey sample. To accommodate various weight adjustments and to estimate variances for statistics beyond totals and means, such as medians and quantiles, a simplified pseudo-population bootstrap procedure is proposed to approximately estimate the second-phase variance. Specifically, the simplification ignores the second-phase sampling variability (i.e., it is treated as fixed, although it is in fact random) if the first-phase sampling fraction of the non-probability sample is negligible. Using the Bank of Canada 2020 Cash Alternative Survey Wave 2, the performance of the proposed method is compared to alternative methods, which either do not explicitly model the selection probability (i.e., raking) or ignore the valuable information from Phase 1 (i.e., Phase-2-Only). The results show that the PML-based approach performs better than raking and Phase-2-Only estimates in terms of reducing the selection bias for both phases' payment-related variables, especially for the low-response youth group. Estimated variances of the PML-based estimates are stable.

    Release date: 2025-09-08
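
The small area estimation entry above (11-522-X202500100006) refers to the area-level Fay-Herriot model followed by benchmarking (raking) to an aggregate direct estimate. The sketch below shows the standard shrinkage form of the Fay-Herriot predictor and a simple ratio benchmark. It is an illustration under stated assumptions: the model variance and the synthetic (regression) predictions are treated as already estimated, the area estimates are treated as totals, the function names and numbers are hypothetical, and the covariance-derivation method proposed in the article is not implemented.

```python
def fay_herriot_shrinkage(direct, synthetic, sampling_var, model_var):
    """Composite estimate per area: gamma * direct + (1 - gamma) * synthetic,
    with gamma = model_var / (model_var + sampling_var).

    direct:       direct survey estimates, one per area
    synthetic:    regression predictions x_i' beta (assumed already fitted)
    sampling_var: known (smoothed) sampling variances of the direct estimates
    model_var:    variance of the area random effect (assumed already estimated)
    """
    estimates = []
    for y, s, psi in zip(direct, synthetic, sampling_var):
        gamma = model_var / (model_var + psi)  # weight on the direct estimate
        estimates.append(gamma * y + (1.0 - gamma) * s)
    return estimates


def ratio_benchmark(area_totals, target_total):
    """Scale area totals by one common factor so they sum to target_total."""
    current = sum(area_totals)
    if current == 0:
        raise ValueError("cannot benchmark estimates that sum to zero")
    factor = target_total / current
    return [t * factor for t in area_totals]


# Made-up inputs: three areas with increasing sampling variance.
direct = [100.0, 220.0, 60.0]
synthetic = [110.0, 200.0, 80.0]
sampling_var = [4.0, 16.0, 64.0]
small_area = fay_herriot_shrinkage(direct, synthetic, sampling_var, model_var=16.0)
benchmarked = ratio_benchmark(small_area, target_total=sum(direct))
```

Areas with larger sampling variance are pulled more strongly toward the synthetic prediction. The benchmarking step ties all area estimates to the same aggregate, which is one reason covariances between direct estimates matter for the MSE of the raked values.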

Reference (380) (20 to 30 of 380 results)

  • Surveys and statistical programs – Documentation: 84-538-X
    Geography: Canada
    Description: This electronic publication presents the methodology underlying the production of the life tables for Canada, provinces and territories.
    Release date: 2023-08-28

  • Surveys and statistical programs – Documentation: 32-26-0006
    Description: This report provides data quality information pertaining to the Agriculture–Population Linkage, such as sources of error, matching process, response rates, imputation rates, sampling, weighting, disclosure control methods and data quality indicators.
    Release date: 2023-08-25

  • Surveys and statistical programs – Documentation: 98-20-00032021011
    Description: This video explains the key concepts of different levels of aggregation of income data such as household and family income; income concepts derived from key income variables such as adjusted income and equivalence scale; and statistics used for income data such as median and average income, quartiles, quintiles, deciles and percentiles.
    Release date: 2023-03-29

  • Surveys and statistical programs – Documentation: 98-20-00032021012
    Description: This video builds on concepts introduced in the other videos on income. It explains key low-income concepts, namely the Market Basket Measure (MBM), the low-income measure (LIM) and the low-income cut-offs (LICO), as well as the indicators associated with these concepts, such as the low-income gap and the low-income ratio. These concepts are used in analysis of the economic well-being of the population.
    Release date: 2023-03-29

  • Surveys and statistical programs – Documentation: 11-633-X2022009
    Description: The Longitudinal Immigration Database (IMDB) is a comprehensive source of data that plays a key role in the understanding of the economic behaviour of immigrants. It is the only annual Canadian dataset that allows users to study the characteristics of immigrants to Canada at the time of admission and their economic outcomes and regional (inter-provincial) mobility over a time span of more than 35 years.

    This report will discuss the IMDB data sources, concepts and variables, record linkage, data processing, dissemination, data evaluation and quality indicators, comparability with other immigration datasets, and the analyses possible with the IMDB.

    Release date: 2022-12-05

  • Surveys and statistical programs – Documentation: 32-26-0002
    Description: This reference guide may be useful to both new and experienced users who wish to familiarize themselves with and find specific information about the Census of Agriculture.

    It provides an overview of the Census of Agriculture communications, content determination, collection, processing, data quality evaluation and dissemination activities. It also summarizes the key changes to the census and other useful information.

    Release date: 2022-04-14

  • Geographic files and documentation: 12-572-X
    Description:

    The Standard Geographical Classification (SGC) provides a systematic classification structure that categorizes all of the geographic area of Canada. The SGC is the official classification used in the Census of Population and other Statistics Canada surveys.

    The classification is organized in two volumes: Volume I, The Classification and Volume II, Reference Maps.

    Volume II contains reference maps showing boundaries, names, codes and locations of the geographic areas in the classification. The reference maps show census subdivisions, census divisions, census metropolitan areas, census agglomerations, census metropolitan influenced zones and economic regions. Definitions for these terms are found in Volume I, The Classification. Volume I describes the classification and related standard geographic areas and place names.

    The maps in Volume II can be downloaded in PDF format from our website.

    Release date: 2022-02-09

  • Surveys and statistical programs – Documentation: 11-633-X2021008
    Description: The Longitudinal Immigration Database (IMDB) is a comprehensive source of data that plays a key role in the understanding of the economic behaviour of immigrants. It is the only annual Canadian dataset that allows users to study the characteristics of immigrants to Canada at the time of admission and their economic outcomes and regional (inter-provincial) mobility over a time span of more than 35 years. The IMDB includes Immigration, Refugees and Citizenship Canada (IRCC) administrative records which contain exhaustive information about immigrants who were admitted to Canada since 1952. It also includes data about non-permanent residents who have been issued temporary resident permits since 1980. This report will discuss the IMDB data sources, concepts and variables, record linkage, data processing, dissemination, data evaluation and quality indicators, comparability with other immigration datasets, and the analyses possible with the IMDB.
    Release date: 2021-12-06

  • Surveys and statistical programs – Documentation: 12-004-X
    Description:

    Statistics: Power from Data! is a web resource that was created in 2001 to assist secondary students and teachers of Mathematics and Information Studies in getting the most from statistics. Over the past 20 years, this product has become one of Statistics Canada's most popular references for students, teachers, and many other members of the general population. This product was last updated in 2021.

    Release date: 2021-09-02

  • Surveys and statistical programs – Documentation: 11-633-X2021005
    Description:

    The Analytical Studies and Modelling Branch (ASMB) is the research arm of Statistics Canada mandated to provide high-quality, relevant and timely information on economic, health and social issues that are important to Canadians. The branch strategically makes use of expert knowledge and a broad range of data sources and modelling techniques to address the information needs of a broad range of government, academic and public sector partners and stakeholders through analysis and research, modeling and predictive analytics, and data development. The branch strives to deliver relevant, high-quality, timely, comprehensive, horizontal and integrated research and to enable the use of its research through capacity building and strategic dissemination to meet the user needs of policy makers, academics and the general public.

    This Multi-year Consolidated Plan for Research, Modelling and Data Development outlines the priorities for the branch over the next two years.

    Release date: 2021-08-12