Results

All (19) (0 to 10 of 19 results)

  • Articles and reports: 11-522-X202100100014
    Description: Recent developments in questionnaire administration modes and data extraction have favored the use of nonprobability samples, which are often affected by selection bias that arises from the lack of a sample design or self-selection of the participants. This bias can be addressed by several adjustments, whose applicability depends on the type of auxiliary information available. Calibration weighting can be used when only population totals of auxiliary variables are available. If a reference survey that followed a probability sampling design is available, several methods can be applied, such as Propensity Score Adjustment, Statistical Matching or Mass Imputation, and doubly robust estimators. In the case where a complete census of the target population is available for some auxiliary covariates, estimators based on superpopulation models (often used in probability sampling) can be adapted to the nonprobability sampling case. We studied the combination of some of these methods in order to produce less biased and more efficient estimates, as well as the use of modern prediction techniques (such as Machine Learning classification and regression algorithms) in the modelling steps of the adjustments described. We also studied the use of variable selection techniques prior to the modelling step in Propensity Score Adjustment. Results show that adjustments based on the combination of several methods might improve the efficiency of the estimates, and that the use of Machine Learning and variable selection techniques can contribute to reducing the bias and the variance of the estimators to a greater extent in several situations.

    Key Words: nonprobability sampling; calibration; Propensity Score Adjustment; Matching.

    Release date: 2021-10-15
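
    A minimal sketch of the propensity score adjustment idea described above, assuming a nonprobability (volunteer) sample and a reference probability sample that share a set of auxiliary variables; the column names and the inverse-odds weighting choice are illustrative assumptions, not the authors' implementation.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X_COLS = ["age", "education", "urban"]  # shared auxiliary variables (assumed)

    def psa_pseudo_weights(nonprob, reference, ref_weights):
        """Pseudo-weights for the nonprobability units via propensity score adjustment."""
        # Stack both samples; z = 1 flags membership in the nonprobability sample.
        X = pd.concat([nonprob[X_COLS], reference[X_COLS]], ignore_index=True)
        z = np.r_[np.ones(len(nonprob)), np.zeros(len(reference))]
        # Reference units carry their design weights; nonprobability units get weight 1.
        w = np.r_[np.ones(len(nonprob)), ref_weights]
        model = LogisticRegression(max_iter=1000).fit(X, z, sample_weight=w)
        p = model.predict_proba(nonprob[X_COLS])[:, 1]  # estimated propensities
        return (1.0 - p) / p                            # inverse-odds pseudo-weights

    # Usage (illustrative): a pseudo-weighted mean of y from the nonprobability sample.
    # w_np = psa_pseudo_weights(volunteer_df, reference_df, reference_df["d"].to_numpy())
    # y_hat = np.average(volunteer_df["y"], weights=w_np)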

  • Articles and reports: 12-001-X202100100004
    Description:

    Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining data from a probability survey and big found data. We focus on the case when the study variable is observed in the big data only, but the other auxiliary variables are commonly observed in both data sources. Unlike the usual imputation for missing data analysis, we create imputed values for all units in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate that the proposed estimators outperform existing competitors in terms of robustness and efficiency.

    Release date: 2021-06-24
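
    A minimal sketch of the mass imputation idea described above, under the setting that y is observed only in the big data source while the auxiliary variables are observed in both; the linear outcome model and column names are illustrative assumptions, not the authors' method.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X_COLS = ["x1", "x2"]  # auxiliary variables common to both sources (assumed)

    def mass_imputation_mean(big_df, survey_df):
        """Survey-weighted mean of y imputed for every unit of the probability sample."""
        # 1. Fit an outcome model on the big data, where y is observed.
        model = LinearRegression().fit(big_df[X_COLS], big_df["y"])
        # 2. Create imputed values of y for all units of the probability sample.
        y_imp = model.predict(survey_df[X_COLS])
        # 3. Apply the survey design weights to the imputed values.
        return float(np.average(y_imp, weights=survey_df["design_weight"]))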

  • Surveys and statistical programs – Documentation: 12-606-X
    Description:

    This is a toolkit intended to aid data producers and data users external to Statistics Canada.

    Release date: 2017-09-27

  • Articles and reports: 18-001-X2016001
    Description:

    Although the record linkage of business data is not a completely new topic, the fact remains that the public and many data users are unaware of the programs and practices commonly used by statistical agencies across the world.

    This report is a brief overview of the main practices, programs and challenges of record linkage at statistical agencies around the world that answered a short survey on this subject, supplemented by publicly available documentation produced by these agencies. The document shows that linkage practices are similar across these statistical agencies; however, the main differences lie in the procedures in place to access the data, along with the regulatory policies that govern record linkage permissions and the dissemination of data.

    Release date: 2016-10-27

  • Articles and reports: 12-001-X201600114539
    Description:

    Statistical matching is a technique for integrating two or more data sets when information available for matching records for individual participants across data sets is incomplete. Statistical matching can be viewed as a missing data problem where a researcher wants to perform a joint analysis of variables that are never jointly observed. A conditional independence assumption is often used to create imputed data for statistical matching. We consider a general approach to statistical matching using parametric fractional imputation of Kim (2011) to create imputed data under the assumption that the specified model is fully identified. The proposed method does not have a convergent EM sequence if the model is not identified. We also present variance estimators appropriate for the imputation procedure. We explain how the method applies directly to the analysis of data from split questionnaire designs and measurement error models.

    Release date: 2016-06-22
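
    A minimal sketch of statistical matching under the conditional independence assumption mentioned above: file A observes (X, Z), file B observes (Y, Z), and Y is imputed into A from a model fitted on B. This illustrates the baseline setup the abstract starts from, not the parametric fractional imputation method itself; all names are illustrative.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    Z_COLS = ["z1", "z2"]  # matching variables observed in both files (assumed)

    def statistical_match(file_a, file_b):
        """Return file A augmented with an imputed y column."""
        model = LinearRegression().fit(file_b[Z_COLS], file_b["y"])
        out = file_a.copy()
        out["y_imputed"] = model.predict(file_a[Z_COLS])
        # Joint analyses of (X, Y) on `out` are valid only if X and Y are
        # conditionally independent given Z -- the assumption discussed above.
        return out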

  • Articles and reports: 11-522-X200600110401
    Description:

    The Australian Bureau of Statistics (ABS) will begin the formation of a Statistical Longitudinal Census Data Set (SLCD) by choosing a 5% sample of people from the 2006 population census to be linked probabilistically with subsequent censuses. A long-term aim is to use the power of the rich longitudinal demographic data provided by the SLCD to shed light on a variety of issues which cannot be addressed using cross-sectional data. The SLCD may be further enhanced by probabilistically linking it with births, deaths, immigration settlements or disease registers. This paper gives a brief description of recent developments in data linking at the ABS, outlines the data linking methodology and quality measures we have considered and summarises preliminary results using Census Dress Rehearsal data.

    Release date: 2008-03-17

  • Articles and reports: 12-001-X20050018088
    Description:

    When administrative records are geographically linked to census block groups, local-area characteristics from the census can be used as contextual variables, which may be useful supplements to variables that are not directly observable from the administrative records. Often databases contain records that have insufficient address information to permit geographical links with census block groups; the contextual variables for these records are therefore unobserved. We propose a new method that uses information from "matched cases" and multivariate regression models to create multiple imputations for the unobserved variables. Our method outperformed alternative methods in simulation evaluations using census data, and was applied to the dataset for a study on treatment patterns for colorectal cancer patients.

    Release date: 2005-07-21
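
    A simplified sketch of the imputation idea described above: a regression model is fitted on the "matched cases" (records that could be geographically linked) and used to draw several imputed values of a contextual variable for the unmatched records. A proper multiple imputation would also draw the regression parameters from their posterior; the single-variable setup and names are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def impute_contextual(matched, unmatched, x_cols, target, m=5, seed=0):
        """Return m imputed vectors of `target` for the unmatched records."""
        rng = np.random.default_rng(seed)
        model = LinearRegression().fit(matched[x_cols], matched[target])
        resid_sd = np.std(matched[target] - model.predict(matched[x_cols]), ddof=1)
        mu = model.predict(unmatched[x_cols])
        # Add residual noise so that between-imputation variability is preserved.
        return [mu + rng.normal(0.0, resid_sd, size=len(unmatched)) for _ in range(m)]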

  • Articles and reports: 11-522-X20010016248
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    The Sawmill Survey is a voluntary census of sawmills in Great Britain. It is limited to fixed mills using domestically-grown timber. Three approaches to assess the coverage of this survey are described:

    (1) A sample survey of the sawmilling industry from the UK's business register, excluding businesses already sampled in the Sawmill Survey, is used to assess the undercoverage in the list of known sawmills;
    (2) A non-response follow-up, using the local knowledge of regional officers of the Forestry Commission, is used to estimate the sawmills that do not respond (mostly the smaller mills); and
    (3) A survey of small-scale sawmills and mobile sawmills (many of these businesses are micro-enterprises) is conducted to analyse their significance.

    These three approaches are synthesized to give an estimate of the coverage of the original survey compared with the total activity identified, and to estimate the importance of micro-enterprises to the sawmilling industry in Great Britain.

    Release date: 2002-09-12

  • Articles and reports: 11-522-X20010016249
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    The United States' Census 2000 operations were more innovative and complex than ever before. State population totals were required to be produced within nine months and, using the coverage measurement survey, adjusted counts were expected within one year. Therefore, all operations had to be implemented and completed quickly, with quality assurance (QA) that had an effective and prompt turnaround. The QA challenges included getting timely information to supervisors (such as enumerator re-interview information), performing prompt checks of "suspect" work (such as monitoring contractors to ensure accurate data capture), and providing reports to headquarters quickly. This paper presents these challenges and their solutions in detail, thus providing an overview of the Census 2000 QA program.

    Release date: 2002-09-12

  • Articles and reports: 11-522-X20010016270
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    Following the last three censuses in Britain, survey non-response on major government household surveys has been investigated by linking addresses sampled for surveys taking place around the time of the census to individual census records for the same addresses. This paper outlines the design of the 2001 British Census-linked Study of Survey Nonresponse. The study involves 10 surveys that vary significantly in design and response rates. The key feature of the study is the extensive use of auxiliary data and multilevel modelling to identify interviewer, household and area level effects.

    Release date: 2002-09-12
Data (0)

No content available at this time.

Analysis (13)

  • Articles and reports: 12-001-X19980024354
    Description:

    This article deals with an attempt to cross-tabulate two categorical variables that were separately collected from two large independent samples and jointly collected from one small sample. It was assumed that the large samples have a large set of common variables. The proposed estimation technique can be considered a mix between calibration techniques and statistical matching. Through calibration techniques, it is possible to incorporate the complex designs of the samples in the estimation procedure, to fulfill some consistency requirements between estimates from various sources, and to obtain fairly unbiased estimates for the two-way table. Through the statistical matching techniques, it is possible to incorporate a relatively large set of common variables in the calibration estimation, by means of which the precision of the estimated two-way table can be improved. The estimation technique enables us to gain insight into the bias generally obtained, in estimating the two-way table, by sole use of the large samples. It is shown how the estimation technique can be useful for imputing values from one large sample (donor source) into the other large sample (host source). Although the technique is principally developed for categorical variables Y and Z, with a minor modification it is also applicable to continuous variables Y and Z.

    Release date: 1999-01-14
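
    A minimal sketch of the calibration (raking) component referred to above: base weights are scaled iteratively so that weighted category totals match known control totals for each auxiliary variable. The control totals and column names are illustrative, and this does not reproduce the combined calibration-matching estimator of the article.

    import numpy as np

    def rake(df, base_weights, controls, n_iter=50):
        """controls maps a categorical column to {category: known population total}."""
        w = np.asarray(base_weights, dtype=float).copy()
        for _ in range(n_iter):
            for col, totals in controls.items():
                for cat, target in totals.items():
                    mask = (df[col] == cat).to_numpy()
                    current = w[mask].sum()
                    if current > 0:
                        w[mask] *= target / current  # scale weights to hit the margin
        return w

    # Usage (illustrative): calibrate survey weights to known sex and region totals.
    # w_cal = rake(sample_df, sample_df["d"],
    #              {"sex": {"F": 510_000, "M": 490_000},
    #               "region": {"north": 300_000, "south": 700_000}})
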
Reference (6)

  • Surveys and statistical programs – Documentation: 11-522-X19990015660
    Description:

    There are many different situations in which one or more files need to be linked. With one file, the purpose of the linkage is to locate duplicates within the file. When there are two files, the linkage is done to identify the units that are the same on both files and thus create matched pairs. Often the records that need to be linked do not have a unique identifier. Hierarchical record linkage, probabilistic record linkage and statistical matching are three methods that can be used when there is no unique identifier on the files that need to be linked. We describe the major differences between the methods. We consider how to choose variables to link, how to prepare files for linkage and how to identify the links. As well, we review tips and tricks used when linking files. Two examples are illustrated: the probabilistic record linkage used in the reverse record check, and the hierarchical record linkage of the Business Number (BN) master file to the Statistical Universe File (SUF) of unincorporated tax filers (T1).

    Release date: 2000-03-02
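
    A minimal sketch of the probabilistic record linkage idea mentioned above, scoring candidate pairs with Fellegi-Sunter agreement and disagreement weights; the fields and the m/u probabilities are illustrative assumptions, not values from the paper.

    import math

    # Assumed per-field probabilities of agreement among true matches (m) and
    # among non-matches (u); in practice these are estimated, e.g. by EM.
    M_U = {"surname": (0.95, 0.02), "birth_year": (0.90, 0.05), "postcode": (0.85, 0.10)}

    def pair_weight(rec_a, rec_b):
        """Total log2 likelihood-ratio weight for one candidate pair of records."""
        total = 0.0
        for field, (m, u) in M_U.items():
            if rec_a.get(field) == rec_b.get(field):
                total += math.log2(m / u)              # agreement weight
            else:
                total += math.log2((1 - m) / (1 - u))  # disagreement weight
        return total

    # Pairs scoring above an upper threshold are declared links, those below a
    # lower threshold non-links, and those in between go to clerical review.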

  • Surveys and statistical programs – Documentation: 11-522-X19990015666
    Description:

    The fusion sample obtained by a statistical matching process can be considered a sample from an artificial population. The distribution of this artificial population is derived. If the correlation between specific variables is the only focus, the strong demand for conditional independence can be weakened. In a simulation study, the effects of violating some of the assumptions leading to the distribution of the artificial population are examined. Finally, some ideas on establishing the claimed conditional independence by latent class analysis are presented.

    Release date: 2000-03-02

  • Surveys and statistical programs – Documentation: 11-522-X19990015670
    Description:

    To reach their target audience efficiently, advertisers and media planners need information on which media their customers use. For instance, they may need to know what percentage of Diet Coke drinkers watch Baywatch, or how many AT&T customers have seen an advertisement for Sprint during the last week. All the relevant data could theoretically be collected from each respondent. However, obtaining full, detailed and accurate information would be very expensive. It would also impose a heavy respondent burden under current data collection technology. This information is currently collected through separate surveys in New Zealand and in many other countries. Exposure to the major media is measured continuously, and product usage studies are common. Statistical matching techniques provide a way of combining these separate information sources. The New Zealand television ratings database was combined with a syndicated survey of print readership and product usage, using statistical matching. The resulting Panorama service meets the targeting information needs of advertisers and media planners. It has since been duplicated in Australia. This paper discusses the development of the statistical matching framework for combining these databases, and the heuristics and techniques used. These included an experiment conducted using a screening design to identify important matching variables. Studies evaluating and validating the combined results are also summarized. The following three major evaluation criteria were used: accuracy of the combined results, stability of the combined results, and preservation of currency results from the component databases. The paper then discusses how the prerequisites for combining the databases were met. The biggest hurdle at this stage was the differences between the analysis techniques used on the two component databases. Finally, suggestions for developing similar statistical matching systems elsewhere will be given.

    Release date: 2000-03-02

  • Surveys and statistical programs – Documentation: 11-522-X19990015672
    Description:

    Data fusion as discussed here means to create a set of data on not jointly observed variables from two different sources. Suppose, for instance, that observations are available for (X,Z) on a set of individuals and for (Y,Z) on a different set of individuals. Each of X, Y and Z may be a vector variable. The main purpose is to gain insight into the joint distribution of (X,Y) using Z as a so-called matching variable. At first, however, it is attempted to recover as much information as possible on the joint distribution of (X,Y,Z) from the distinct sets of data. Such fusions can only be done at the cost of implementing some distributional properties for the fused data. These are conditional independencies given the matching variables. Fused data are typically discussed from the point of view of how appropriate this underlying assumption is. Here we give a different perspective. We formulate the problem as follows: how can distributions be estimated in situations where only observations from certain marginal distributions are available? It can be solved by applying the maximum entropy criterion. We show in particular that data created by fusing different sources can be interpreted as a special case of this situation. Thus, we derive the needed assumption of conditional independence as a consequence of the type of data available.

    Release date: 2000-03-02
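
    For orientation, the standard maximum entropy result behind the abstract's claim can be stated compactly (this formulation is an editorial aid, not quoted from the paper): among all joint densities that reproduce the two observed margins, entropy is maximized by the conditionally independent one.

    \max_{f}\ -\int f(x,y,z)\,\log f(x,y,z)\,dx\,dy\,dz
    \quad\text{subject to}\quad
    \int f\,dy = f_{XZ}(x,z), \qquad \int f\,dx = f_{YZ}(y,z),

    \text{with maximizer}\quad
    f^{*}(x,y,z) = \frac{f_{XZ}(x,z)\,f_{YZ}(y,z)}{f_{Z}(z)},
    \quad\text{i.e. } X \perp Y \mid Z.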

  • Surveys and statistical programs – Documentation: 11-522-X19990015682
    Description:

    The application of dual system estimation (DSE) to matched Census / Post Enumeration Survey (PES) data in order to measure net undercount is well understood (Hogan, 1993). However, this approach has so far not been used to measure net undercount in the UK. The 2001 PES in the UK will use this methodology. This paper presents the general approach to design and estimation for this PES (the 2001 Census Coverage Survey). The estimation combines DSE with standard ratio and regression estimation. A simulation study using census data from the 1991 Census of England and Wales demonstrates that the ratio model is in general more robust than the regression model.

    Release date: 2000-03-02
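
    For orientation, the dual system (Lincoln-Petersen) estimator underlying this approach is, for a single estimation cell,

    \hat{N} = \frac{n_{C}\, n_{P}}{m},

    where n_C is the census count, n_P the PES count and m the number of matched persons. The following worked numbers are invented for illustration: n_C = 950, n_P = 400 and m = 380 give \hat{N} = 950 \times 400 / 380 = 1000, i.e. an estimated net undercount of 50 persons.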