Editing and imputation

Skip to main content
Skip to footer

Language selection

Français

Search and menus

Search and menus

Search

Skip to filters. View results.

Results

All (93)

All (93) (0 to 10 of 93 results)

1. Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Articles and reports: 12-001-X202200200009
Description:
Multiple imputation (MI) is a popular approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is one of the most widely used MI algorithms for multivariate data, but it lacks theoretical foundation and is computationally intensive. Recently, missing data imputation methods based on deep learning models have been developed with encouraging results in small studies. However, there has been limited research on evaluating their performance in realistic settings compared to MICE, particularly in big surveys. We conduct extensive simulation studies based on a subsample of the American Community Survey to compare the repeated sampling properties of four machine learning based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation networks, and multiple imputation using denoising autoencoders. We find the deep learning imputation methods are superior to MICE in terms of computational time. However, with the default choice of hyperparameters in the common software packages, MICE with classification trees consistently outperforms, often by a large margin, the deep learning imputation methods in terms of bias, mean squared error, and coverage under a range of realistic settings.

Release date: 2022-12-15
2. Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Articles and reports: 12-001-X202200100008
Description:
The Multiple Imputation of Latent Classes (MILC) method combines multiple imputation and latent class analysis to correct for misclassification in combined datasets. Furthermore, MILC generates a multiply imputed dataset which can be used to estimate different statistics in a straightforward manner, ensuring that uncertainty due to misclassification is incorporated when estimating the total variance. In this paper, it is investigated how the MILC method can be adjusted to be applied for census purposes. More specifically, it is investigated how the MILC method deals with a finite and complete population register, how the MILC method can simultaneously correct misclassification in multiple latent variables and how multiple edit restrictions can be incorporated. A simulation study shows that the MILC method is in general able to reproduce cell frequencies in both low- and high-dimensional tables with low amounts of bias. In addition, variance can also be estimated appropriately, although variance is overestimated when cell frequencies are small.

Release date: 2022-06-21
3. Integration of data from probability surveys and big found data for finite population inference using mass imputation
Articles and reports: 12-001-X202100100004
Description:
Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining data from a probability survey and big found data. We focus on the case when the study variable is observed in the big data only, but the other auxiliary variables are commonly observed in both data. Unlike the usual imputation for missing data analysis, we create imputed values for all units in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate the proposed estimators outperform existing competitors in terms of robustness and efficiency.
Release date: 2021-06-24
4. A note on multiply robust predictive mean matching imputation with complex survey data
Articles and reports: 12-001-X202100100009
Description:
Predictive mean matching is a commonly used imputation procedure for addressing the problem of item nonresponse in surveys. The customary approach relies upon the specification of a single outcome regression model. In this note, we propose a novel predictive mean matching procedure that allows the user to specify multiple outcome regression models. The resulting estimator is multiply robust in the sense that it remains consistent if one of the specified outcome regression models is correctly specified. The results from a simulation study suggest that the proposed method performs well in terms of bias and efficiency.
Release date: 2021-06-24
5. Considerations for Displaying Data Using Graphs
19-22-0004
Description: One of the main objectives of statistics is to distill data into information which can be summarized and easily understood. Data visualizations, which include graphs and charts, are powerful ways of doing so. The purpose of this information session is to provide examples of common graphs and charts, highlight practical advice to help the audience choose the right display for their data, and identify what to avoid and why. An overall objective is to build capacity and increase understanding of fundamental techniques which foster accurate and effective dissemination of statistics and research findings.
https://www.statcan.gc.ca/en/wtc/information/19220004
Release date: 2020-10-30
6. A new double hot-deck imputation method for missing values under boundary conditions
Articles and reports: 12-001-X202000100006
Description:
In surveys, logical boundaries among variables or among waves of surveys make imputation of missing values complicated. We propose a new regression-based multiple imputation method to deal with survey nonresponses with two-sided logical boundaries. This imputation method automatically satisfies the boundary conditions without an additional acceptance/rejection procedure and utilizes the boundary information to derive an imputed value and to determine the suitability of the imputed value. Simulation results show that our new imputation method outperforms the existing imputation methods for both mean and quantile estimations regardless of missing rates, error distributions, and missing-mechanisms. We apply our method to impute the self-reported variable “years of smoking” in successive health screenings of Koreans.
Release date: 2020-06-30
7. Semiparametric quantile regression imputation for a complex survey with application to the Conservation Effects Assessment Project
Articles and reports: 12-001-X201900200001
Description:
Development of imputation procedures appropriate for data with extreme values or nonlinear relationships to covariates is a significant challenge in large scale surveys. We develop an imputation procedure for complex surveys based on semiparametric quantile regression. We apply the method to the Conservation Effects Assessment Project (CEAP), a large-scale survey that collects data used in quantifying soil loss from crop fields. In the imputation procedure, we first generate imputed values from a semiparametric model for the quantiles of the conditional distribution of the response given a covariate. Then, we estimate the parameters of interest using the generalized method of moments (GMM). We derive the asymptotic distribution of the GMM estimators for a general class of complex survey designs. In simulations meant to represent the CEAP data, we evaluate variance estimators based on the asymptotic distribution and compare the semiparametric quantile regression imputation (QRI) method to fully parametric and nonparametric alternatives. The QRI procedure is more efficient than nonparametric and fully parametric alternatives, and empirical coverages of confidence intervals are within 1% of the nominal 95% level. An application to estimation of mean erosion indicates that QRI may be a viable option for CEAP.
Release date: 2019-06-27
8. Development of a small area estimation system at Statistics Canada Archived
Articles and reports: 12-001-X201900100009
Description:
The demand for small area estimates by users of Statistics Canada’s data has been steadily increasing over recent years. In this paper, we provide a summary of procedures that have been incorporated into a SAS based production system for producing official small area estimates at Statistics Canada. This system includes: procedures based on unit or area level models; the incorporation of the sampling design; the ability to smooth the design variance for each small area if an area level model is used; the ability to ensure that the small area estimates add up to reliable higher level estimates; and the development of diagnostic tools to test the adequacy of the model. The production system has been used to produce small area estimates on an experimental basis for several surveys at Statistics Canada that include: the estimation of health characteristics, the estimation of under-coverage in the census, the estimation of manufacturing sales and the estimation of unemployment rates and employment counts for the Labour Force Survey. Some of the diagnostics implemented in the system are illustrated using Labour Force Survey data along with administrative auxiliary data.
Release date: 2019-05-07
9. Variance estimation in multi-phase calibration Archived
Articles and reports: 12-001-X201700114823
Description:
The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence the literature so far deals mainly with two phases while three phases or more are rarely being considered. The analysis in some cases is ad-hoc for a specific design and no comprehensive methodology for constructing calibrated estimators, and more challengingly, estimating their variances in three or more phases was formed. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.
Release date: 2017-06-22
10. Imputing Postal Codes to Analyze Ecological Variables in Longitudinal Cohorts: Exposure to Particulate Matter in the Canadian Census Health and Environment Cohort Database Archived
Articles and reports: 11-633-X2017006
Description:
This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised where 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed.
Release date: 2017-03-13

Data (0)

Data (0) (0 results)

No content available at this time.

Analysis (85)

Analysis (85) (20 to 30 of 85 results)

21. Bayesian multiple imputation for large-scale categorical data with structural zeros Archived
Articles and reports: 12-001-X201400114002
Description:
We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, U.S.A.
Release date: 2014-06-27
22. Automatic editing with hard and soft edits Archived
Articles and reports: 12-001-X201300111825
Description:
A considerable limitation of current methods for automatic data editing is that they treat all edits as hard constraints. That is to say, an edit failure is always attributed to an error in the data. In manual editing, however, subject-matter specialists also make extensive use of soft edits, i.e., constraints that identify (combinations of) values that are suspicious but not necessarily incorrect. The inability of automatic editing methods to handle soft edits partly explains why in practice many differences are found between manually edited and automatically edited data. The object of this article is to present a new formulation of the error localisation problem which can distinguish between hard and soft edits. Moreover, it is shown how this problem may be solved by an extension of the error localisation algorithm of De Waal and Quere (2003).
Release date: 2013-06-28
23. Imputation for nonmonotone nonresponse in the survey of industrial research and development Archived
Articles and reports: 12-001-X201200211753
Description:
Nonresponse in longitudinal studies often occurs in a nonmonotone pattern. In the Survey of Industrial Research and Development (SIRD), it is reasonable to assume that the nonresponse mechanism is past-value-dependent in the sense that the response propensity of a study variable at time point t depends on response status and observed or missing values of the same variable at time points prior to t. Since this nonresponse is nonignorable, the parametric likelihood approach is sensitive to the specification of parametric models on both the joint distribution of variables at different time points and the nonresponse mechanism. The nonmonotone nonresponse also limits the application of inverse propensity weighting methods. By discarding all observed data from a subject after its first missing value, one can create a dataset with a monotone ignorable nonresponse and then apply established methods for ignorable nonresponse. However, discarding observed data is not desirable and it may result in inefficient estimators when many observed data are discarded. We propose to impute nonrespondents through regression under imputation models carefully created under the past-value-dependent nonresponse mechanism. This method does not require any parametric model on the joint distribution of the variables across time points or the nonresponse mechanism. Performance of the estimated means based on the proposed imputation method is investigated through some simulation studies and empirical analysis of the SIRD data.
Release date: 2012-12-19
24. Multiple imputation with census data Archived
Articles and reports: 12-001-X201200211759
Description:
A benefit of multiple imputation is that it allows users to make valid inferences using standard methods with simple combining rules. Existing combining rules for multivariate hypothesis tests fail when the sampling error is zero. This paper proposes modified tests for use with finite population analyses of multiply imputed census data for the applications of disclosure limitation and missing data and evaluates their frequentist properties through simulation.
Release date: 2012-12-19
25. Combining synthetic data with subsampling to create public use microdata files for large scale surveys Archived
Articles and reports: 12-001-X201200111687
Description:
To create public use files from large scale surveys, statistical agencies sometimes release random subsamples of the original records. Random subsampling reduces file sizes for secondary data analysts and reduces risks of unintended disclosures of survey participants' confidential information. However, subsampling does not eliminate risks, so that alteration of the data is needed before dissemination. We propose to create disclosure-protected subsamples from large scale surveys based on multiple imputation. The idea is to replace identifying or sensitive values in the original sample with draws from statistical models, and release subsamples of the disclosure-protected data. We present methods for making inferences with the multiple synthetic subsamples.
Release date: 2012-06-27
26. Variance estimation under composite imputation: The methodology behind SEVANI Archived
Articles and reports: 12-001-X201100211605
Description:
Composite imputation is often used in business surveys. The term "composite" means that more than a single imputation method is used to impute missing values for a variable of interest. The literature on variance estimation in the presence of composite imputation is rather limited. To deal with this problem, we consider an extension of the methodology developed by Särndal (1992). Our extension is quite general and easy to implement provided that linear imputation methods are used to fill in the missing values. This class of imputation methods contains linear regression imputation, donor imputation and auxiliary value imputation, sometimes called cold-deck or substitution imputation. It thus covers the most common methods used by national statistical agencies for the imputation of missing values. Our methodology has been implemented in the System for the Estimation of Variance due to Nonresponse and Imputation (SEVANI) developed at Statistics Canada. Its performance is evaluated in a simulation study.
Release date: 2011-12-21
27. Imputation for nonmonotone last-value-dependent nonrespondents in longitudinal surveys Archived
Articles and reports: 12-001-X200800210756
Description:
In longitudinal surveys nonresponse often occurs in a pattern that is not monotone. We consider estimation of time-dependent means under the assumption that the nonresponse mechanism is last-value-dependent. Since the last value itself may be missing when nonresponse is nonmonotone, the nonresponse mechanism under consideration is nonignorable. We propose an imputation method by first deriving some regression imputation models according to the nonresponse mechanism and then applying nonparametric regression imputation. We assume that the longitudinal data follow a Markov chain with finite second-order moments. No other assumption is imposed on the joint distribution of longitudinal data and their nonresponse indicators. A bootstrap method is applied for variance estimation. Some simulation results and an example concerning the Current Employment Survey are presented.
Release date: 2008-12-23
28. Combining information from two surveys to improve on analyses of self-reported data in estimating measures of health Archived
Articles and reports: 11-522-X200600110408
Description:
Despite advances that have improved the health of the United States population, disparities in health remain among various racial/ethnic and socio-economic groups. Common data sources for assessing the health of a population of interest include large-scale surveys that often pose questions requiring a self-report, such as, "Has a doctor or other health professional ever told you that you have health condition of interest?" Answers to such questions might not always reflect the true prevalences of health conditions (for example, if a respondent does not have access to a doctor or other health professional). Similarly, self-reported data on quantities such as height and weight might be subject to reporting errors. Such "measurement error" in health data could affect inferences about measures of health and health disparities. In this work, we fit measurement-error models to data from the National Health and Nutrition Examination Survey, which asks self-report questions during an interview component and also obtains physical measurements during an examination component. We then develop methods for using the fitted models to improve on analyses of self-reported data from another survey that does not include an examination component. The methods, which involve multiply imputing examination-based data values for the survey that has only self-reported data, are applied to the National Health Interview Survey in examples involving diabetes, hypertension, and obesity. Preliminary results suggest that the adjustments for measurement error can result in non-negligible changes in estimates of measures of health.
Release date: 2008-03-17
29. The effect of model specification on multiply imputed data: lessons learned from Project DC-HOPE Archived
Articles and reports: 11-522-X200600110442
Description:
The District of Columbia Healthy Outcomes of Pregnancy Education (DC-HOPE) project is a randomized trial funded by the National Institute of Child Health and Human Development to test the effectiveness of an integrated education and counseling intervention (INT) versus usual care (UC) to reduce four risk behaviors among pregnant women. Participants were interviewed at baseline and three additional time points. Multiple imputation (MI) was used to estimate data for missing interviews. MI was done twice: once with all data imputed simultaneously, and once with data for women in the INT and UC groups imputed separately. Analyses of both imputed data sets and the pre-imputation data are compared.
Release date: 2008-03-17
30. Variance estimation for a ratio in the presence of imputed data Archived
Articles and reports: 12-001-X200700210493
Description:
In this paper, we study the problem of variance estimation for a ratio of two totals when marginal random hot deck imputation has been used to fill in missing data. We consider two approaches to inference. In the first approach, the validity of an imputation model is required. In the second approach, the validity of an imputation model is not required but response probabilities need to be estimated, in which case the validity of a nonresponse model is required. We derive variance estimators under two distinct frameworks: the customary two-phase framework and the reverse framework.
Release date: 2008-01-03

Reference (7)

Reference (7) ((7 results))

1. Improvements in 2005 to the Labour Force Survey (LFS) Archived
Surveys and statistical programs – Documentation: 71F0031X2005002
Description:
This paper introduces and explains modifications made to the Labour Force Survey estimates in January 2005. Some of these modifications include the adjustment of all LFS estimates to reflect population counts based on the 2001 Census, updates to industry and occupation classification systems and sample redesign changes.
Release date: 2005-01-26
2. Unpaid Work, 2001 Census Technical Report (Reference Products: 2001 Census) Archived
Surveys and statistical programs – Documentation: 92-397-X
Description:
This report covers concepts and definitions, the imputation method and data quality for this variable. The 2001 Census collected information on three types of unpaid work performed during the week preceding the Census: looking after children, housework and caring for seniors. The 2001 data on unpaid work are compared with the 1996 Census data and with the data from the General Social Survey (use of time in 1998). The report also includes historical tables.
Release date: 2005-01-11
3. Occupation, 2001 Census Technical Report (Reference Products: 2001 Census) Archived
Surveys and statistical programs – Documentation: 92-388-X
Description:
This report contains basic conceptual and data quality information to help users interpret and make use of census occupation data. It gives an overview of the collection, coding (to the 2001 National Occupational Classification), edit and imputation of the occupation data from the 2001 Census. The report describes procedural changes between the 2001 and earlier censuses, and provides an analysis of the quality level of the 2001 Census occupation data. Finally, it details the revision of the 1991 Standard Occupational Classification used in the 1991 and 1996 Censuses to the 2001 National Occupational Classification for Statistics used in 2001. The historical comparability of data coded to the two classifications is discussed. Appendices to the report include a table showing historical data for the 1991, 1996 and 2001 Censuses.
Release date: 2004-07-15
4. Class of Worker, 2001 Census Technical Report (Reference Products: 2001 Census) Archived
Surveys and statistical programs – Documentation: 92-398-X
Description:
This report contains basic conceptual and data quality information intended to facilitate the use and interpretation of census class of worker data. It provides an overview of the class of worker processing cycle including elements such as regional office processing, and edit and imputation. The report concludes with summary tables that indicate the level of data quality in the 2001 Census class of worker data.
Release date: 2004-04-22
5. An Overview of the Issues Related to the Use of Personal Identifiers Archived
Surveys and statistical programs – Documentation: 85-602-X
Description:
The purpose of this report is to provide an overview of existing methods and techniques making use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and / or transforming personal identifiers from individual data records from one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes to help narrow the field of search against existing databases when some form of personal identification information exists.
Release date: 2000-12-05
6. Impact of Edit and Imputation on Income Estimates: A Case Study Archived
Surveys and statistical programs – Documentation: 75F0002M1998012
Description:
This paper looks at the work of the task force responsible for reviewing Statistics Canada's household and family income statistics programs, and at one of associated program changes, namely, the integration of two major sources of annual income data in Canada, the Survey of Consumer Finances (SCF) and the Survey of Labour and Income Dynamics (SLID).
Release date: 1998-12-30
7. Survey of Labour and Income Dynamics: Processing Strategy for Wave 1 Income Data Archived
Surveys and statistical programs – Documentation: 75F0002M1997006
Description:
This report documents the edit and imputation approach taken in processing Wave 1 income data from the Survey of Labour and Income Dynamics (SLID).
Release date: 1997-12-31

Report a problem or mistake on this page

Date modified:: 2024-06-08