Results

All (342) (0 to 10 of 342 results)

  • Articles and reports: 11-522-X202200100001
    Description: Record linkage aims to identify record pairs that relate to the same unit and are observed in two different data sets, say A and B. Fellegi and Sunter (1969) suggest that each record pair be tested to decide whether it was generated from the set of matched or unmatched pairs. The decision function consists of the ratio between m(y) and u(y), the probabilities of observing a comparison y of a set of k > 3 key identifying variables in a record pair under the assumption that the pair is a match or a non-match, respectively. These parameters are usually estimated by means of the EM algorithm, using as data the comparisons on all the pairs of the Cartesian product Ω = A × B. These observations (on the comparisons and on the pairs' status as match or non-match) are assumed to be generated independently across pairs, an assumption that characterizes most of the record linkage literature and is implemented in software tools (e.g., RELAIS, Cibella et al. 2012). On the contrary, comparisons y and matching status in Ω are deterministically dependent. As a result, estimates of m(y) and u(y) based on the EM algorithm are usually poor. This fact jeopardizes the effective application of the Fellegi-Sunter method, as well as the automatic computation of quality measures and the possibility of applying efficient methods for model estimation on linked data (e.g., regression functions), as in Chambers et al. (2015). We propose to explore Ω by a set of samples, each one drawn so as to preserve the independence of comparisons among the selected record pairs. Simulation results are encouraging.
    Release date: 2024-03-25
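
As a companion to the abstract above, here is a minimal sketch of the Fellegi-Sunter match weight: the log-likelihood ratio of m(y) to u(y), summed over key variables under the conditional-independence assumption the abstract criticizes. The variable names and m-/u-probabilities are hypothetical.

```python
import math

# Hypothetical m- and u-probabilities for three key identifying variables:
# the probability that the field agrees among matches (m) and non-matches (u).
m = {"surname": 0.95, "birth_year": 0.90, "postcode": 0.85}
u = {"surname": 0.01, "birth_year": 0.05, "postcode": 0.10}

def match_weight(comparison):
    """Sum of log2 likelihood ratios over fields, assuming the fields'
    agreement indicators are conditionally independent given match status.
    `comparison` maps each field to True (agree) or False (disagree)."""
    w = 0.0
    for field, agree in comparison.items():
        if agree:
            w += math.log2(m[field] / u[field])
        else:
            w += math.log2((1 - m[field]) / (1 - u[field]))
    return w

# A pair agreeing on surname and birth year but not postcode:
print(match_weight({"surname": True, "birth_year": True, "postcode": False}))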

  • Articles and reports: 11-522-X202200100002
    Description: The authors used the Splink probabilistic linkage package, developed by the UK Ministry of Justice, to link census data from England and Wales to itself to find duplicate census responses. A large gold standard of confirmed census duplicates was available, meaning that the results of the Splink implementation could be quality assured. This paper describes the implementation and features of Splink, gives details of the settings and parameters that we used to tune Splink for our particular project, and presents the results that we obtained.
    Release date: 2024-03-25
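
For readers unfamiliar with Splink, below is a minimal deduplication sketch. It assumes Splink 3's DuckDB backend; the input file, columns, blocking rule and threshold are hypothetical stand-ins, not the paper's settings, and method names differ in other Splink versions.

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

df = pd.read_csv("census_responses.csv")  # hypothetical input file

settings = {
    "link_type": "dedupe_only",  # find duplicates within one dataset
    "blocking_rules_to_generate_predictions": ["l.postcode = r.postcode"],
    "comparisons": [
        cl.levenshtein_at_thresholds("surname", 2),
        cl.exact_match("date_of_birth"),
    ],
}

linker = DuckDBLinker(df, settings)
# Estimate u-probabilities from random pairs, then m-probabilities via EM.
linker.estimate_u_using_random_sampling(max_pairs=1e6)
linker.estimate_parameters_using_expectation_maximisation("l.surname = r.surname")
duplicates = linker.predict(threshold_match_probability=0.9)
```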

  • Articles and reports: 11-522-X202200100007
    Description: With the availability of larger and more diverse data sources, statistical institutes in Europe are inclined to publish statistics on smaller groups than they used to. Moreover, high-impact global events like the COVID-19 crisis and the situation in Ukraine may also call for statistics on specific subgroups of the population. Publishing on small, targeted groups not only raises questions about the statistical quality of the figures, but also raises issues concerning statistical disclosure risk. The principle of statistical disclosure control does not depend on the size of the groups the statistics are based on. However, the risk of disclosure does depend on group size: the smaller a group, the higher the risk. Traditional ways to deal with statistical disclosure control and small group sizes include suppressing information and coarsening categories. These methods essentially increase the (mean) group sizes. More recent approaches include perturbative methods, which intend to keep the group sizes small in order to preserve as much information as possible while reducing the disclosure risk sufficiently. In this paper we mention some European examples of statistics on special focus groups and discuss the implications for statistical disclosure control. Additionally, we discuss some issues raised by the use of perturbative methods: their impact on disclosure risk and utility, as well as the challenges of communicating that impact properly.
    Release date: 2024-03-25
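
The abstract does not prescribe a specific perturbative method; as one common example of the family it refers to, the sketch below applies unbiased random rounding to base 3 to frequency counts, so small cells are perturbed rather than suppressed.

```python
import random

rng = random.Random(12345)

def random_round(count, base=3):
    """Unbiased random rounding of a frequency count to a multiple of `base`:
    round up with probability remainder/base, so E[rounded] == count."""
    lower = (count // base) * base
    remainder = count - lower
    if remainder == 0:
        return count
    return lower + base if rng.random() < remainder / base else lower

print([random_round(c) for c in [1, 2, 7, 11, 30]])
```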

  • Articles and reports: 11-522-X202200100012
    Description: At Statistics Netherlands (SN), two partly independent intra-annual turnover index series are available for some economic sectors: a monthly series based on survey data and a quarterly series based on value added tax data for the smaller units and re-used survey data for the other units. SN aims to benchmark the monthly turnover index series to the quarterly census data on a quarterly basis. This cannot currently be done because the tax data have a different quarterly pattern: turnover is relatively large in the fourth quarter of the year and smaller in the first quarter. In the current study we aim to describe this deviating quarterly pattern at the micro level. In the past we developed a mixture model using absolute turnover levels that could explain part of the quarterly patterns. Because the absolute turnover levels differ between the two series, in the current study we use a model based on relative quarterly turnover levels within a year.
    Release date: 2024-03-25
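
As a simple illustration of what benchmarking a monthly series to quarterly totals means, here is a pro-rata sketch (not SN's actual procedure; all figures are invented): each quarter's three monthly values are scaled by one factor so that they sum to the quarterly benchmark.

```python
# Two quarters of hypothetical monthly turnover and quarterly benchmarks.
monthly = [100.0, 110.0, 120.0, 115.0, 105.0, 125.0]
quarterly = [360.0, 320.0]  # e.g. VAT-based quarterly totals

benchmarked = []
for q, bench in enumerate(quarterly):
    months = monthly[3 * q: 3 * q + 3]
    factor = bench / sum(months)  # one multiplicative factor per quarter
    benchmarked.extend(m * factor for m in months)

print(benchmarked)  # monthly values now sum to the quarterly benchmarks
```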

  • Articles and reports: 11-522-X202200100016
    Description: The sampling method called "network sampling with memory" was developed to overcome the traditional drawbacks of chain sampling methods. Its unique feature is to gradually recreate, in the field, a frame for the target population composed of individuals identified by respondents, and to randomly draw future respondents from this frame, thereby minimizing selection bias. The method was tested for the first time in France between September 2020 and June 2021, in a survey of Chinese immigrants in Île-de-France (ChIPRe). This presentation describes the difficulties encountered during collection, some contextual (due to the pandemic) but most inherent to the method.
    Release date: 2024-03-25
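
A toy sketch of the core mechanism described above, with invented names (not the ChIPRe implementation): nominations accumulate in a frame, and each new respondent is drawn at random from that frame rather than passed along a referral chain.

```python
import random

rng = random.Random(42)
frame = {"seed_1", "seed_2"}  # initial frame of known members
interviewed = set()

def nominations(person):
    # Hypothetical stand-in for contacts named by a respondent in the field.
    return {f"{person}_contact_{i}" for i in range(2)}

for _ in range(5):
    candidates = list(frame - interviewed)
    if not candidates:
        break
    respondent = rng.choice(candidates)  # random draw from the growing frame
    interviewed.add(respondent)
    frame |= nominations(respondent)     # frame is built up gradually

print(len(frame), "frame members,", len(interviewed), "interviewed")
```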

  • Articles and reports: 11-522-X202200100018
    Description: The Longitudinal Social Data Development Program (LSDDP) is a social data integration approach aimed at providing longitudinal analytical opportunities without imposing additional burden on respondents. The LSDDP uses a multitude of signals from different data sources for the same individual, which helps to better understand their interactions and track changes over time. This article looks at how the ethnicity status of people in Canada can be estimated at the most detailed disaggregated level possible using the results from a variety of business rules applied to linked data and to the LSDDP denominator. It will then show how improvements were obtained using machine learning methods, such as decision trees and random forest techniques.
    Release date: 2024-03-25
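
A generic sketch of the machine-learning step mentioned above, using scikit-learn's random forest on simulated stand-in features and labels; the LSDDP's actual features and business rules are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                 # stand-in features from linked sources
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))
```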

  • Articles and reports: 12-001-X202300200002
    Description: Being able to quantify the accuracy (bias, variance) of published output is crucial in official statistics. Output in official statistics is nearly always divided into subpopulations according to some classification variable, such as mean income by categories of educational level. Such output is also referred to as domain statistics. In the current paper, we limit ourselves to binary classification variables. In practice, misclassifications occur and these contribute to the bias and variance of domain statistics. Existing analytical and numerical methods to estimate this effect have two disadvantages. The first disadvantage is that they require that the misclassification probabilities are known beforehand and the second is that the bias and variance estimates are biased themselves. In the current paper we present a new method, a Gaussian mixture model estimated by an Expectation-Maximisation (EM) algorithm combined with a bootstrap, referred to as the EM bootstrap method. This new method does not require that the misclassification probabilities are known beforehand, although it is more efficient when a small audit sample is used that yields a starting value for the misclassification probabilities in the EM algorithm. We compared the performance of the new method with currently available numerical methods: the bootstrap method and the SIMEX method. Previous research has shown that for non-linear parameters the bootstrap outperforms the analytical expressions. For nearly all conditions tested, the bias and variance estimates that are obtained by the EM bootstrap method are closer to their true values than those obtained by the bootstrap and SIMEX methods. We end this paper by discussing the results and possible future extensions of the method.
    Release date: 2024-01-03
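
As background, here is a bare-bones EM fit of a two-component Gaussian mixture, the model family named in the abstract; the paper's EM bootstrap wraps such a fit in a bootstrap loop, which this sketch omits. Data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 700), rng.normal(3, 1, 300)])

p, mu, sd = 0.5, np.array([x.min(), x.max()]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: posterior probability that each point belongs to component 2
    # (the 1/sqrt(2*pi) constant cancels in the ratio).
    d1 = np.exp(-0.5 * ((x - mu[0]) / sd[0]) ** 2) / sd[0]
    d2 = np.exp(-0.5 * ((x - mu[1]) / sd[1]) ** 2) / sd[1]
    r = p * d2 / ((1 - p) * d1 + p * d2)
    # M-step: update mixing proportion, means and standard deviations.
    p = r.mean()
    mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
    sd = np.sqrt([np.average((x - mu[0]) ** 2, weights=1 - r),
                  np.average((x - mu[1]) ** 2, weights=r)])

print(p, mu, sd)
```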

  • Articles and reports: 12-001-X202300200010
    Description: Sample coordination methods aim to increase (in positive coordination) or decrease (in negative coordination) the size of the overlap between samples. The samples considered can be from different occasions of a repeated survey and/or from different surveys covering a common population. Negative coordination is used to control the response burden in a given period, because some units do not respond to survey questionnaires if they are selected in many samples. Usually, methods for sample coordination do not take into account any measure of the response burden that a unit has already expended in responding to previous surveys. We introduce such a measure into a new method by adapting a spatially balanced sampling scheme, based on a generalization of Poisson sampling, together with a negative coordination method. The goal is to create a double control of the burden for these units: once by using a measure of burden during the sampling process and once by using a negative coordination method. We evaluate the approach using Monte Carlo simulation and investigate its use for controlling selection "hot spots" in business surveys at Statistics Netherlands.
    Release date: 2024-01-03
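
A minimal sketch of negative coordination with permanent random numbers (PRNs), a standard device underlying methods like the one above; the paper's burden measure and spatially balanced design are not reproduced. Each unit keeps one PRN for life, and different surveys use shifted selection intervals so their samples overlap as little as possible.

```python
import random

rng = random.Random(7)
units = {f"u{i}": rng.random() for i in range(10)}  # PRN fixed per unit

def poisson_sample(prn_start, incl_prob):
    """Select units whose PRN falls in [start, start + incl_prob) modulo 1."""
    return [u for u, prn in units.items()
            if (prn - prn_start) % 1.0 < incl_prob]

print(poisson_sample(0.0, 0.3))  # survey 1
print(poisson_sample(0.3, 0.3))  # survey 2: shifted start, disjoint intervals
```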

  • Articles and reports: 12-001-X202300200012
    Description: In recent decades, many different uses of auxiliary information have enriched survey sampling theory and practice. Jean-Claude Deville contributed significantly to this progress. My comments trace some of the steps on the way to one important theory for the use of auxiliary information: estimation by calibration.
    Release date: 2024-01-03
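
A small numeric illustration of the calibration idea discussed above: design weights are adjusted so that weighted auxiliary totals match known population totals. Under the chi-square distance this yields the familiar linear (GREG-type) solution; all numbers below are made up.

```python
import numpy as np

d = np.array([10.0, 10.0, 10.0, 10.0])                   # design weights
X = np.array([[1, 2.0], [1, 3.0], [1, 5.0], [1, 6.0]])   # auxiliaries (intercept, x)
t_x = np.array([41.0, 170.0])                            # known population totals

# Linear calibration: w = d * (1 + x'lambda), where lambda solves
# (sum d x x') lambda = t_x - sum d x.
lam = np.linalg.solve(X.T @ (d[:, None] * X), t_x - X.T @ d)
w = d * (1 + X @ lam)

print(w, X.T @ w)  # calibrated weights; weighted totals now equal t_x
```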

  • Stats in brief: 89-20-00062023001
    Description: This course is intended for Government of Canada employees who would like to learn about evaluating the quality of data for a particular use. Whether you are a new employee interested in learning the basics, or an experienced subject matter expert looking to refresh your skills, this course is here to help.
    Release date: 2023-07-17
Stats in brief (2) (2 results)

  • Stats in brief: 89-20-00062023001
    Description: This course is intended for Government of Canada employees who would like to learn about evaluating the quality of data for a particular use. Whether you are a new employee interested in learning the basics, or an experienced subject matter expert looking to refresh your skills, this course is here to help.
    Release date: 2023-07-17

  • Stats in brief: 89-20-00062022003
    Description: By the end of this video you will understand what confidence intervals are, why we use them, and what factors have an impact on them.
    Release date: 2022-05-24
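
As a quick companion to the video's topic, the sketch below computes a 95% confidence interval for a mean from simulated data; its width shrinks with sample size and grows with the data's spread and the chosen confidence level.

```python
import math, random

random.seed(3)
sample = [random.gauss(50, 10) for _ in range(100)]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
half_width = 1.96 * sd / math.sqrt(n)  # normal approximation, 95% level

print(f"95% CI: {mean - half_width:.1f} to {mean + half_width:.1f}")
```
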
Articles and reports (337) (50 to 60 of 337 results)

  • Articles and reports: 12-001-X201700254871
    Description: This paper addresses the question of how alternative data sources, such as administrative and social media data, can be used in the production of official statistics. Since most surveys at national statistical institutes are conducted repeatedly over time, a multivariate structural time series modelling approach is proposed to model the series observed in a repeated survey together with related series obtained from alternative data sources. Generally, this improves the precision of the direct survey estimates by using sample information observed in preceding periods and information from related auxiliary series. The model also makes it possible to exploit the higher frequency of the social media data to produce more precise estimates for the sample survey in real time, at the moment the social media statistics become available but the sample data are not yet available. The concept of cointegration is applied to address the extent to which the alternative series represent the same phenomena as the series observed with the repeated survey. The methodology is applied to the Dutch Consumer Confidence Survey and a sentiment index derived from social media.
    Release date: 2017-12-21
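
Below is a hedged, univariate stand-in for the multivariate model described above, using statsmodels' UnobservedComponents with the auxiliary series as a regressor; the paper's full multivariate and cointegration machinery is not reproduced, and the data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
level = np.cumsum(rng.normal(0, 0.3, 120))   # latent trend
survey = level + rng.normal(0, 1.0, 120)     # noisy survey series
sentiment = level + rng.normal(0, 0.5, 120)  # related auxiliary series

# Local-level structural model for the survey series, with the auxiliary
# series entering as an explanatory variable.
mod = sm.tsa.UnobservedComponents(survey, level="local level", exog=sentiment)
res = mod.fit(disp=False)
print(res.summary())
```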

  • Articles and reports: 12-001-X201700254897
    Description: This note by Chris Skinner presents a discussion of the paper "Sample survey theory and methods: Past, present, and future directions," in which J.N.K. Rao and Wayne A. Fuller share their views on the developments in sample survey theory and methods over the past 100 years.
    Release date: 2017-12-21

  • Articles and reports: 13-605-X201700114840
    Description: Statistics Canada is presently preparing the statistical system to be able to gauge the impact of the transition from illegal to legal non-medical cannabis use and to shed light on the social and economic activities related to the use of cannabis thereafter. While the system of social statistics captures some information on the use of cannabis, updates will be required to more accurately measure health effects and the impact on the judicial system. Current statistical infrastructure used to more comprehensively measure the use and impacts of substances such as tobacco and alcohol could be adapted to do the same for cannabis. However, available economic statistics are largely silent on the role illegal drugs play in the economy. Both social and economic statistics will need to be updated to reflect the legalization of cannabis, and the challenge is especially great for economic statistics. This paper provides a summary of the work now under way toward these ends.
    Release date: 2017-09-15

  • Articles and reports: 11-633-X2017008
    Description: The DYSEM microsimulation modelling platform provides a demographic and socioeconomic core that can be readily built upon to develop custom dynamic microsimulation models or applications. This paper describes DYSEM and provides an overview of its intended uses, as well as the methods and data used in its development.
    Release date: 2017-07-28

  • Articles and reports: 12-001-X201600214663
    Description: We present theoretical evidence that efforts during data collection to balance the survey response with respect to selected auxiliary variables will improve the chances of low nonresponse bias in the estimates ultimately produced by calibrated weighting. One of our results shows that the variance of the bias – measured here as the deviation of the calibration estimator from the (unrealized) full-sample unbiased estimator – decreases linearly as a function of the response imbalance, which we assume is measured and controlled continuously over the data collection period. An attractive prospect is thus a lower risk of bias if one can manage the data collection to achieve low imbalance. The theoretical results are validated in a simulation study with real data from an Estonian household survey.
    Release date: 2016-12-20
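
One simple way to quantify response imbalance of the kind monitored above (an assumption on our part: the paper's exact measure may differ) is the distance between respondent means and full-sample means of the auxiliary variables, as sketched below on simulated data.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(1000, 3))      # auxiliary variables, full sample
responded = rng.random(1000) < 0.6  # current response indicator

# Distance between respondent and full-sample auxiliary means; data
# collection would aim to drive this toward zero.
imbalance = np.linalg.norm(x[responded].mean(axis=0) - x.mean(axis=0))
print(f"imbalance: {imbalance:.3f}")
```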

  • Articles and reports: 11-633-X2016003
    Description: Large national mortality cohorts are used to estimate mortality rates for different socioeconomic and population groups, and to conduct research on environmental health. In 2008, Statistics Canada created a cohort linking the 1991 Census to mortality data. The present study describes a linkage of 2001 Census long-form questionnaire respondents aged 19 years and older to the T1 Personal Master File and the Amalgamated Mortality Database. The linkage tracks all deaths over a 10.6-year period (until the end of 2011).
    Release date: 2016-10-26

  • Articles and reports: 11-633-X2016002
    Description: Immigrants make up an ever-increasing percentage of the Canadian population: more than 20%, the highest percentage among the G8 countries (Statistics Canada 2013a). This figure is expected to rise to 25% to 28% by 2031, when at least one in four people living in Canada will be foreign-born (Statistics Canada 2010).

    This report summarizes the linkage of the Immigrant Landing File (ILF) for all provinces and territories, excluding Quebec, to hospital data from the Discharge Abstract Database (DAD), a national database containing information about hospital inpatient and day-surgery events. A deterministic exact-matching approach was used to link data from the 1980-to-2006 ILF and from the DAD (2006/2007, 2007/2008 and 2008/2009) with the 2006 Census, which served as a "bridge" file. This was a secondary linkage in that it used linkage keys created in two previous projects (primary linkages) that separately linked the ILF and the DAD to the 2006 Census. The ILF–DAD linked data were validated by means of a representative sample of 2006 Census records containing immigrant information previously linked to the DAD.
    Release date: 2016-08-17
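
A minimal sketch of deterministic exact matching through a bridge file, as in the linkage described above; the keys, column names and toy records are hypothetical.

```python
import pandas as pd

ilf = pd.DataFrame({"ilf_key": [1, 2, 3], "landing_year": [1990, 1995, 2001]})
dad = pd.DataFrame({"dad_key": [10, 11], "admission": ["2006-07", "2007-08"]})
# Bridge file pairing keys from the two primary linkages to the census.
bridge = pd.DataFrame({"ilf_key": [1, 3], "dad_key": [10, 11]})

# Two exact-key joins through the bridge yield the linked file.
linked = ilf.merge(bridge, on="ilf_key").merge(dad, on="dad_key")
print(linked)
```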

  • Articles and reports: 11-633-X2016001
    Description: Every year, thousands of workers lose their jobs as firms reduce the size of their workforce in response to growing competition, technological changes, changing trade patterns and numerous other factors. Thousands of workers also start a job with a new employer as new firms enter a product market and existing firms expand or replace employees who recently left. This worker reallocation process across employers is generally seen as contributing to productivity growth and rising living standards. To measure this labour reallocation process, labour market indicators such as hiring rates and layoff rates are needed. In response to growing demand for subprovincial labour market information and taking advantage of unique administrative datasets, Statistics Canada is producing hiring rates and layoff rates by economic region of residence. This document describes the data sources, conceptual and methodological issues, and other matters pertaining to these two indicators.
    Release date: 2016-06-27

  • Articles and reports: 12-001-X201600114538
    Description: The aim of automatic editing is to use a computer to detect and amend erroneous values in a data set, without human intervention. Most automatic editing methods that are currently used in official statistics are based on the seminal work of Fellegi and Holt (1976). Applications of this methodology in practice have shown systematic differences between data that are edited manually and automatically, because human editors may perform complex edit operations. In this paper, a generalization of the Fellegi-Holt paradigm is proposed that can incorporate a large class of edit operations in a natural way. In addition, an algorithm is outlined that solves the resulting generalized error localization problem. It is hoped that this generalization may be used to increase the suitability of automatic editing in practice, and hence to improve the efficiency of data editing processes. Some first results on synthetic data are promising in this respect.
    Release date: 2016-06-22
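
For readers new to the Fellegi-Holt paradigm, the toy sketch below localizes errors by brute force: it searches for the smallest set of fields whose values can be changed so that all edit rules pass. The record, domains and edits are invented, and the paper's generalized algorithm is far more sophisticated.

```python
from itertools import combinations, product

record = {"age": 4, "marital": "married", "income": 50000}
domains = {"age": range(0, 120), "marital": ["single", "married"],
           "income": [0, 50000]}
# Edit rules: each must evaluate True for a consistent record.
edits = [lambda r: not (r["age"] < 15 and r["marital"] == "married"),
         lambda r: not (r["age"] < 15 and r["income"] > 0)]

def localize(record):
    """Return a smallest set of fields that can be amended to satisfy all edits."""
    fields = list(record)
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            for values in product(*(domains[f] for f in subset)):
                trial = dict(record, **dict(zip(subset, values)))
                if all(e(trial) for e in edits):
                    return subset
    return None

print(localize(record))  # ('age',): amending age alone fixes both edits
```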

  • Articles and reports: 12-001-X201600114545
    Description: The estimation of quantiles is an important topic not only in the regression framework, but also in sampling theory. Expectiles are a natural alternative or addition to quantiles. As a generalization of the mean, expectiles have become popular in recent years because they not only give a more detailed picture of the data than the ordinary mean, but can also serve as a basis for calculating quantiles through their close relationship. We show how to estimate expectiles under sampling with unequal probabilities and how expectiles can be used to estimate the distribution function. The resulting fitted distribution function estimator can be inverted, leading to quantile estimates. We run a simulation study to investigate and compare the efficiency of the expectile-based estimator.
    Release date: 2016-06-22
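
A compact illustration of the expectile idea: the tau-expectile is the fixed point of an asymmetrically weighted mean, computable by simple iteration. Survey design weights for unequal-probability sampling could be folded into the weights, a detail omitted from this sketch; the data are simulated.

```python
import numpy as np

def expectile(x, tau, iters=100):
    """tau-expectile via iteratively reweighted means: points above the
    current value get weight tau, points below get weight 1 - tau."""
    m = x.mean()
    for _ in range(iters):
        w = np.where(x > m, tau, 1 - tau)
        m = np.average(x, weights=w)
    return m

rng = np.random.default_rng(2)
x = rng.normal(size=10000)
print(expectile(x, 0.5), expectile(x, 0.9))  # the 0.5-expectile is the mean
```
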
Journals and periodicals (3) (3 results)

  • Journals and periodicals: 12-605-X
    Description: The Record Linkage Project Process Model (RLPPM) was developed by Statistics Canada to identify the processes and activities involved in record linkage. The RLPPM applies to linkage projects conducted at the individual and enterprise level using diverse data sources to create new data sources to meet analytical and operational needs.
    Release date: 2017-06-05

  • Journals and periodicals: 89-639-X
    Geography: Canada
    Description: Beginning in late 2006, the Social and Aboriginal Statistics Division of Statistics Canada embarked on a review of the questions used in the Census and in surveys to produce data about Aboriginal peoples (North American Indian, Métis and Inuit). This process is essential to ensure that Aboriginal identification questions are valid measures of contemporary Aboriginal identification, in all its complexity. The questions reviewed (from the Census 2B questionnaire) included the Ethnic origin / Aboriginal ancestry question; the Aboriginal identity question; the Treaty / Registered Indian question; and the Indian band / First Nation Membership question.

    Additional testing was conducted on Census questions with potential Aboriginal response options: the population group question (also known as visible minorities) and the Religion question. The review process to date has involved two major steps: regional discussions with data users and stakeholders, and qualitative testing. The regional discussions, with over 350 users of Aboriginal data across Canada, were held in early 2007 to examine the four questions used on the Census and other surveys of Statistics Canada. Data users included national Aboriginal organizations, Aboriginal provincial and territorial organizations, federal, provincial and local governments, researchers, and Aboriginal service organizations. User feedback showed that the main areas of concern were data quality, undercoverage, the wording of questions, and the importance of comparability over time.
    Release date: 2009-04-17

  • Journals and periodicals: 89-629-X
    Geography: Canada
    Description: Statistics Canada regularly reviews the questions used on the Census and other surveys to ensure that the resulting data are representative of the population. As a first step in the process to review the questions used to produce data about First Nations, Inuit and Métis populations, regional discussions were held with more than 350 users of Aboriginal data in over 40 locations across Canada during the winter, spring and early summer of 2007.

    This report summarizes the main issues raised in these meetings. Four questions used to identify Aboriginal people from the Census and surveys were considered in the discussions.
    Release date: 2008-05-27