Editing and imputation

Results

All (93) (0 to 10 of 93 results)

1. Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Articles and reports: 12-001-X202200200009
Description:
Multiple imputation (MI) is a popular approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is one of the most widely used MI algorithms for multivariate data, but it lacks theoretical foundation and is computationally intensive. Recently, missing data imputation methods based on deep learning models have been developed with encouraging results in small studies. However, there has been limited research on evaluating their performance in realistic settings compared to MICE, particularly in big surveys. We conduct extensive simulation studies based on a subsample of the American Community Survey to compare the repeated sampling properties of four machine learning based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation networks, and multiple imputation using denoising autoencoders. We find the deep learning imputation methods are superior to MICE in terms of computational time. However, with the default choice of hyperparameters in the common software packages, MICE with classification trees consistently outperforms, often by a large margin, the deep learning imputation methods in terms of bias, mean squared error, and coverage under a range of realistic settings.

Release date: 2022-12-15
2. Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Articles and reports: 12-001-X202200100008
Description:
The Multiple Imputation of Latent Classes (MILC) method combines multiple imputation and latent class analysis to correct for misclassification in combined datasets. Furthermore, MILC generates a multiply imputed dataset which can be used to estimate different statistics in a straightforward manner, ensuring that uncertainty due to misclassification is incorporated when estimating the total variance. In this paper, it is investigated how the MILC method can be adjusted to be applied for census purposes. More specifically, it is investigated how the MILC method deals with a finite and complete population register, how the MILC method can simultaneously correct misclassification in multiple latent variables and how multiple edit restrictions can be incorporated. A simulation study shows that the MILC method is in general able to reproduce cell frequencies in both low- and high-dimensional tables with low amounts of bias. In addition, variance can also be estimated appropriately, although variance is overestimated when cell frequencies are small.

Release date: 2022-06-21
3. Integration of data from probability surveys and big found data for finite population inference using mass imputation
Articles and reports: 12-001-X202100100004
Description:
Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining data from a probability survey and big found data. We focus on the case when the study variable is observed in the big data only, but the other auxiliary variables are commonly observed in both data. Unlike the usual imputation for missing data analysis, we create imputed values for all units in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate the proposed estimators outperform existing competitors in terms of robustness and efficiency.
Release date: 2021-06-24
4. A note on multiply robust predictive mean matching imputation with complex survey data
Articles and reports: 12-001-X202100100009
Description:
Predictive mean matching is a commonly used imputation procedure for addressing the problem of item nonresponse in surveys. The customary approach relies upon the specification of a single outcome regression model. In this note, we propose a novel predictive mean matching procedure that allows the user to specify multiple outcome regression models. The resulting estimator is multiply robust in the sense that it remains consistent if one of the specified outcome regression models is correctly specified. The results from a simulation study suggest that the proposed method performs well in terms of bias and efficiency.
Release date: 2021-06-24
5. Considerations for Displaying Data Using Graphs
19-22-0004
Description: One of the main objectives of statistics is to distill data into information which can be summarized and easily understood. Data visualizations, which include graphs and charts, are powerful ways of doing so. The purpose of this information session is to provide examples of common graphs and charts, highlight practical advice to help the audience choose the right display for their data, and identify what to avoid and why. An overall objective is to build capacity and increase understanding of fundamental techniques which foster accurate and effective dissemination of statistics and research findings.
https://www.statcan.gc.ca/en/wtc/information/19220004
Release date: 2020-10-30
6. A new double hot-deck imputation method for missing values under boundary conditions
Articles and reports: 12-001-X202000100006
Description:
In surveys, logical boundaries among variables or among waves of surveys make imputation of missing values complicated. We propose a new regression-based multiple imputation method to deal with survey nonresponses with two-sided logical boundaries. This imputation method automatically satisfies the boundary conditions without an additional acceptance/rejection procedure and utilizes the boundary information to derive an imputed value and to determine the suitability of the imputed value. Simulation results show that our new imputation method outperforms the existing imputation methods for both mean and quantile estimations regardless of missing rates, error distributions, and missing-mechanisms. We apply our method to impute the self-reported variable “years of smoking” in successive health screenings of Koreans.
Release date: 2020-06-30
7. Semiparametric quantile regression imputation for a complex survey with application to the Conservation Effects Assessment Project (Archived)
Articles and reports: 12-001-X201900200001
Description:
Development of imputation procedures appropriate for data with extreme values or nonlinear relationships to covariates is a significant challenge in large scale surveys. We develop an imputation procedure for complex surveys based on semiparametric quantile regression. We apply the method to the Conservation Effects Assessment Project (CEAP), a large-scale survey that collects data used in quantifying soil loss from crop fields. In the imputation procedure, we first generate imputed values from a semiparametric model for the quantiles of the conditional distribution of the response given a covariate. Then, we estimate the parameters of interest using the generalized method of moments (GMM). We derive the asymptotic distribution of the GMM estimators for a general class of complex survey designs. In simulations meant to represent the CEAP data, we evaluate variance estimators based on the asymptotic distribution and compare the semiparametric quantile regression imputation (QRI) method to fully parametric and nonparametric alternatives. The QRI procedure is more efficient than nonparametric and fully parametric alternatives, and empirical coverages of confidence intervals are within 1% of the nominal 95% level. An application to estimation of mean erosion indicates that QRI may be a viable option for CEAP.
Release date: 2019-06-27
8. Development of a small area estimation system at Statistics Canada (Archived)
Articles and reports: 12-001-X201900100009
Description:
The demand for small area estimates by users of Statistics Canada’s data has been steadily increasing over recent years. In this paper, we provide a summary of procedures that have been incorporated into a SAS based production system for producing official small area estimates at Statistics Canada. This system includes: procedures based on unit or area level models; the incorporation of the sampling design; the ability to smooth the design variance for each small area if an area level model is used; the ability to ensure that the small area estimates add up to reliable higher level estimates; and the development of diagnostic tools to test the adequacy of the model. The production system has been used to produce small area estimates on an experimental basis for several surveys at Statistics Canada that include: the estimation of health characteristics, the estimation of under-coverage in the census, the estimation of manufacturing sales and the estimation of unemployment rates and employment counts for the Labour Force Survey. Some of the diagnostics implemented in the system are illustrated using Labour Force Survey data along with administrative auxiliary data.
Release date: 2019-05-07
9. Variance estimation in multi-phase calibration (Archived)
Articles and reports: 12-001-X201700114823
Description:
The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence, the literature so far deals mainly with two phases, while three or more phases are rarely considered. The analysis in some cases is ad hoc for a specific design, and no comprehensive methodology has been developed for constructing calibrated estimators and, more challengingly, estimating their variances in three or more phases. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.
Release date: 2017-06-22
10. Imputing Postal Codes to Analyze Ecological Variables in Longitudinal Cohorts: Exposure to Particulate Matter in the Canadian Census Health and Environment Cohort Database (Archived)
Articles and reports: 11-633-X2017006
Description:
This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised where 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed.
Release date: 2017-03-13

Data (0) (0 results)

No content available at this time.

Analysis (85) (40 to 50 of 85 results)

41. Editing systematic unity measure errors through mixture modelling (Archived)
Articles and reports: 12-001-X20050018087
Description:
In Official Statistics, the data editing process plays an important role in terms of timeliness, data accuracy, and survey costs. Techniques introduced to identify and eliminate errors from data are essentially required to consider all of these aspects simultaneously. Among others, a frequent and pervasive systematic error appearing in surveys collecting numerical data is the unity measure error. It highly affects timeliness, data accuracy and costs of the editing and imputation phase. In this paper we propose a probabilistic formalisation of the problem based on finite mixture models. This setting allows us to deal with the problem in a multivariate context, and also provides a number of useful diagnostics for prioritising cases to be more deeply investigated through a clerical review. Prioritising units is important in order to increase data accuracy while avoiding time wasted on following up units that are not really critical.
Release date: 2005-07-21
42. Using matched substitutes to improve imputations for geographically linked databases (Archived)
Articles and reports: 12-001-X20050018088
Description:
When administrative records are geographically linked to census block groups, local-area characteristics from the census can be used as contextual variables, which may be useful supplements to variables that are not directly observable from the administrative records. Often databases contain records that have insufficient address information to permit geographical links with census block groups; the contextual variables for these records are therefore unobserved. We propose a new method that uses information from "matched cases" and multivariate regression models to create multiple imputations for the unobserved variables. Our method outperformed alternative methods in simulation evaluations using census data, and was applied to the dataset for a study on treatment patterns for colorectal cancer patients.
Release date: 2005-07-21
43. Use of tax data: An application of goods and services tax (GST) data (Archived)
Articles and reports: 11-522-X20030017708
Description:
This article provides an overview of the work to date using GST data at Statistics Canada as direct replacement in imputation or estimation or as a data certification tool.
Release date: 2005-01-26
44. Inference for totals in cluster sampling under mean imputation for missing data (Archived)
Articles and reports: 11-522-X20030017722
Description:
This paper shows how to adapt design-based and model-based frameworks to the case of two-stage sampling.
Release date: 2005-01-26
45. New approaches to editing and imputation: Selected results and experiences from the EUREDIT project (Archived)
Articles and reports: 11-522-X20030017724
Description:
This document presents results for two edit and imputation applications, the UK Annual Business Inquiry and the UK Census 1% household data file (the SARS), and for a missing data application based on the Danish Labour Force Survey.
Release date: 2005-01-26
46. Calibrated imputation in surveys under a quasi-model-assisted approach (Archived)
Articles and reports: 11-522-X20030017725
Description:
This paper examines techniques for imputing missing survey information.
Release date: 2005-01-26
47. Multiple imputation of missing income data at the individual and family levels using sequential regression imputation: Application to the National Health Interview Survey (Archived)
Articles and reports: 11-522-X20020016715
Description:
This paper will describe the multiple imputation of income in the National Health Interview Survey and discuss the methodological issues involved. In addition, the paper will present empirical summaries of the imputations as well as results of a Monte Carlo evaluation of inferences based on multiply imputed income items.
Analysts of health data are often interested in studying relationships between income and health. The National Health Interview Survey, conducted by the National Center for Health Statistics of the U.S. Centers for Disease Control and Prevention, provides a rich source of data for studying such relationships. However, the nonresponse rates on two key income items, an individual's earned income and a family's total income, are over 20%. Moreover, these nonresponse rates appear to be increasing over time. A project is currently underway to multiply impute individual earnings and family income along with some other covariates for the National Health Interview Survey in 1997 and subsequent years.
There are many challenges in developing appropriate multiple imputations for such large-scale surveys. First, there are many variables of different types, with different skip patterns and logical relationships. Second, it is not known what types of associations will be investigated by the analysts of multiply imputed data. Finally, some variables, such as family income, are collected at the family level and others, such as earned income, are collected at the individual level. To make the imputations for both the family- and individual-level variables conditional on as many predictors as possible, and to simplify modelling, we are using a modified version of the sequential regression imputation method described in Raghunathan et al. (Survey Methodology, 2001).
Besides issues related to the hierarchical nature of the imputations just described, there are other methodological issues of interest such as the use of transformations of the income variables, the imposition of restrictions on the values of variables, the general validity of sequential regression imputation and, even more generally, the validity of multiple-imputation inferences for surveys with complex sample designs.
Release date: 2004-09-13
48. Examples of multiple imputation in large-scale surveys (Archived)
Articles and reports: 11-522-X20020016716
Description:
Missing data are a constant problem in large-scale surveys. Such incompleteness is usually dealt with either by restricting the analysis to the cases with complete records or by imputing, for each missing item, an efficiently estimated value. The deficiencies of these approaches will be discussed in this paper, especially in the context of estimating a large number of quantities. The main part of the paper will describe two examples of analyses using multiple imputation.
In the first, the International Labour Organization (ILO) employment status is imputed in the British Labour Force Survey by a Bayesian bootstrap method. It is an adaptation of the hot-deck method, which seeks to fully exploit the auxiliary information. Important auxiliary information is given by the previous ILO status, when available, and the standard demographic variables.
Missing data can be interpreted more generally, as in the framework of the expectation maximization (EM) algorithm. The second example is from the Scottish House Condition Survey, and its focus is on the inconsistency of the surveyors. The surveyors assess the sampled dwelling units on a large number of elements or features of the dwelling, such as internal walls, roof and plumbing, that are scored and converted to a summarizing 'comprehensive repair cost.' The level of inconsistency is estimated from the discrepancies between the pairs of assessments of doubly surveyed dwellings. The principal research questions concern the amount of information that is lost as a result of the inconsistency and whether the naive estimators that ignore the inconsistency are unbiased. The problem is solved by multiple imputation, generating plausible scores for all the dwellings in the survey.
Release date: 2004-09-13
49. Variance estimation with Hot Deck imputation using a model (Archived)
Articles and reports: 12-001-X20040016994
Description:
When imputation is used to assign values for missing items in sample surveys, naïve methods of estimating the variances of survey estimates that treat the imputed values as if they were observed give biased variance estimates. This article addresses the problem of variance estimation for a linear estimator in which missing values are assigned by a single hot deck imputation (a form of imputation that is widely used in practice). We propose estimators of the variance of a linear hot deck imputed estimator using a decomposition of the total variance suggested by Särndal (1992). A conditional approach to variance estimation is developed that is applicable to both weighted and unweighted hot deck imputation. Estimation of the variance of a domain estimator is also examined.
Release date: 2004-07-14
50. Inference for partially synthetic, public use microdata sets (Archived)
Articles and reports: 12-001-X20030026785
Description:
To avoid disclosures, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. Although partially synthetic approaches are currently used to protect public use data, valid methods of inference have not been developed for them. This article presents such methods. They are based on the concepts of multiple imputation for missing data but use different rules for combining point and variance estimates. The combining rules also differ from those for fully synthetic data sets developed by Raghunathan, Reiter and Rubin (2003). The validity of these new rules is illustrated in simulation studies.
Release date: 2004-01-27

Reference (7) (7 results)

1. Improvements in 2005 to the Labour Force Survey (LFS) (Archived)
Surveys and statistical programs – Documentation: 71F0031X2005002
Description:
This paper introduces and explains modifications made to the Labour Force Survey estimates in January 2005. Some of these modifications include the adjustment of all LFS estimates to reflect population counts based on the 2001 Census, updates to industry and occupation classification systems and sample redesign changes.
Release date: 2005-01-26
2. Unpaid Work, 2001 Census Technical Report (Reference Products: 2001 Census) (Archived)
Surveys and statistical programs – Documentation: 92-397-X
Description:
This report covers concepts and definitions, the imputation method and data quality for this variable. The 2001 Census collected information on three types of unpaid work performed during the week preceding the Census: looking after children, housework and caring for seniors. The 2001 data on unpaid work are compared with the 1996 Census data and with the data from the General Social Survey (use of time in 1998). The report also includes historical tables.
Release date: 2005-01-11
3. Occupation, 2001 Census Technical Report (Reference Products: 2001 Census) (Archived)
Surveys and statistical programs – Documentation: 92-388-X
Description:
This report contains basic conceptual and data quality information to help users interpret and make use of census occupation data. It gives an overview of the collection, coding (to the 2001 National Occupational Classification), edit and imputation of the occupation data from the 2001 Census. The report describes procedural changes between the 2001 and earlier censuses, and provides an analysis of the quality level of the 2001 Census occupation data. Finally, it details the revision of the 1991 Standard Occupational Classification used in the 1991 and 1996 Censuses to the 2001 National Occupational Classification for Statistics used in 2001. The historical comparability of data coded to the two classifications is discussed. Appendices to the report include a table showing historical data for the 1991, 1996 and 2001 Censuses.
Release date: 2004-07-15
4. Class of Worker, 2001 Census Technical Report (Reference Products: 2001 Census) (Archived)
Surveys and statistical programs – Documentation: 92-398-X
Description:
This report contains basic conceptual and data quality information intended to facilitate the use and interpretation of census class of worker data. It provides an overview of the class of worker processing cycle including elements such as regional office processing, and edit and imputation. The report concludes with summary tables that indicate the level of data quality in the 2001 Census class of worker data.
Release date: 2004-04-22
5. An Overview of the Issues Related to the Use of Personal Identifiers (Archived)
Surveys and statistical programs – Documentation: 85-602-X
Description:
The purpose of this report is to provide an overview of existing methods and techniques making use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and / or transforming personal identifiers from individual data records from one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes to help narrow the field of search against existing databases when some form of personal identification information exists.
Release date: 2000-12-05
6. Impact of Edit and Imputation on Income Estimates: A Case Study (Archived)
Surveys and statistical programs – Documentation: 75F0002M1998012
Description:
This paper looks at the work of the task force responsible for reviewing Statistics Canada's household and family income statistics programs, and at one of the associated program changes, namely, the integration of two major sources of annual income data in Canada, the Survey of Consumer Finances (SCF) and the Survey of Labour and Income Dynamics (SLID).
Release date: 1998-12-30
7. Survey of Labour and Income Dynamics: Processing Strategy for Wave 1 Income Data (Archived)
Surveys and statistical programs – Documentation: 75F0002M1997006
Description:
This report documents the edit and imputation approach taken in processing Wave 1 income data from the Survey of Labour and Income Dynamics (SLID).
Release date: 1997-12-31

Date modified: 2024-09-19