Editing and imputation

Filter results by

Search Help
Currently selected filters that can be removed

Keyword(s)

Survey or statistical program

3 facets displayed. 0 facets selected.

Content

1 facets displayed. 0 facets selected.
Sort Help
entries

Results

All (85)

All (85) (0 to 10 of 85 results)

  • Articles and reports: 12-001-X202000100006
    Description:

    In surveys, logical boundaries among variables or among waves of surveys make imputation of missing values complicated. We propose a new regression-based multiple imputation method to deal with survey nonresponses with two-sided logical boundaries. This imputation method automatically satisfies the boundary conditions without an additional acceptance/rejection procedure and utilizes the boundary information to derive an imputed value and to determine the suitability of the imputed value. Simulation results show that our new imputation method outperforms the existing imputation methods for both mean and quantile estimations regardless of missing rates, error distributions, and missing-mechanisms. We apply our method to impute the self-reported variable “years of smoking” in successive health screenings of Koreans.

    Release date: 2020-06-30

  • Articles and reports: 12-001-X201900200001
    Description:

    Development of imputation procedures appropriate for data with extreme values or nonlinear relationships to covariates is a significant challenge in large scale surveys. We develop an imputation procedure for complex surveys based on semiparametric quantile regression. We apply the method to the Conservation Effects Assessment Project (CEAP), a large-scale survey that collects data used in quantifying soil loss from crop fields. In the imputation procedure, we first generate imputed values from a semiparametric model for the quantiles of the conditional distribution of the response given a covariate. Then, we estimate the parameters of interest using the generalized method of moments (GMM). We derive the asymptotic distribution of the GMM estimators for a general class of complex survey designs. In simulations meant to represent the CEAP data, we evaluate variance estimators based on the asymptotic distribution and compare the semiparametric quantile regression imputation (QRI) method to fully parametric and nonparametric alternatives. The QRI procedure is more efficient than nonparametric and fully parametric alternatives, and empirical coverages of confidence intervals are within 1% of the nominal 95% level. An application to estimation of mean erosion indicates that QRI may be a viable option for CEAP.

    Release date: 2019-06-27

  • Articles and reports: 12-001-X201900100009
    Description:

    The demand for small area estimates by users of Statistics Canada’s data has been steadily increasing over recent years. In this paper, we provide a summary of procedures that have been incorporated into a SAS based production system for producing official small area estimates at Statistics Canada. This system includes: procedures based on unit or area level models; the incorporation of the sampling design; the ability to smooth the design variance for each small area if an area level model is used; the ability to ensure that the small area estimates add up to reliable higher level estimates; and the development of diagnostic tools to test the adequacy of the model. The production system has been used to produce small area estimates on an experimental basis for several surveys at Statistics Canada that include: the estimation of health characteristics, the estimation of under-coverage in the census, the estimation of manufacturing sales and the estimation of unemployment rates and employment counts for the Labour Force Survey. Some of the diagnostics implemented in the system are illustrated using Labour Force Survey data along with administrative auxiliary data.

    Release date: 2019-05-07

  • Articles and reports: 12-001-X201700114823
    Description:

    The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence the literature so far deals mainly with two phases while three phases or more are rarely being considered. The analysis in some cases is ad-hoc for a specific design and no comprehensive methodology for constructing calibrated estimators, and more challengingly, estimating their variances in three or more phases was formed. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

    Release date: 2017-06-22

  • Articles and reports: 11-633-X2017006
    Description:

    This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised where 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed.

    Release date: 2017-03-13

  • Articles and reports: 12-001-X201600214661
    Description:

    An example presented by Jean-Claude Deville in 2005 is subjected to three estimation methods: the method of moments, the maximum likelihood method, and generalized calibration. The three methods yield exactly the same results for the two non-response models. A discussion follows on how to choose the most appropriate model.

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600214676
    Description:

    Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as the Gross Domestic Product (GDP).

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600114538
    Description:

    The aim of automatic editing is to use a computer to detect and amend erroneous values in a data set, without human intervention. Most automatic editing methods that are currently used in official statistics are based on the seminal work of Fellegi and Holt (1976). Applications of this methodology in practice have shown systematic differences between data that are edited manually and automatically, because human editors may perform complex edit operations. In this paper, a generalization of the Fellegi-Holt paradigm is proposed that can incorporate a large class of edit operations in a natural way. In addition, an algorithm is outlined that solves the resulting generalized error localization problem. It is hoped that this generalization may be used to increase the suitability of automatic editing in practice, and hence to improve the efficiency of data editing processes. Some first results on synthetic data are promising in this respect.

    Release date: 2016-06-22

  • Articles and reports: 11-522-X201700014715
    Description:

    In preparation for 2021 UK Census the ONS has committed to an extensive research programme exploring how linked administrative data can be used to support conventional statistical processes. Item-level edit and imputation (E&I) will play an important role in adjusting the 2021 Census database. However, uncertainty associated with the accuracy and quality of available administrative data renders the efficacy of an integrated census-administrative data approach to E&I unclear. Current constraints that dictate an anonymised ‘hash-key’ approach to record linkage to ensure confidentiality add to that uncertainty. Here, we provide preliminary results from a simulation study comparing the predictive and distributional accuracy of the conventional E&I strategy implemented in CANCEIS for the 2011 UK Census to that of an integrated approach using synthetic administrative data with systematically increasing error as auxiliary information. In this initial phase of research we focus on imputing single year of age. The aim of the study is to gain insight into whether auxiliary information from admin data can improve imputation estimates and where the different strategies fall on a continuum of accuracy.

    Release date: 2016-03-24

  • Articles and reports: 12-001-X201500114193
    Description:

    Imputed micro data often contain conflicting information. The situation may e.g., arise from partial imputation, where one part of the imputed record consists of the observed values of the original record and the other the imputed values. Edit-rules that involve variables from both parts of the record will often be violated. Or, inconsistency may be caused by adjustment for errors in the observed data, also referred to as imputation in Editing. Under the assumption that the remaining inconsistency is not due to systematic errors, we propose to make adjustments to the micro data such that all constraints are simultaneously satisfied and the adjustments are minimal according to a chosen distance metric. Different approaches to the distance metric are considered, as well as several extensions of the basic situation, including the treatment of categorical data, unit imputation and macro-level benchmarking. The properties and interpretations of the proposed methods are illustrated using business-economic data.

    Release date: 2015-06-29
Data (0)

Data (0) (0 results)

No content available at this time.

Analysis (78)

Analysis (78) (0 to 10 of 78 results)

  • Articles and reports: 12-001-X202000100006
    Description:

    In surveys, logical boundaries among variables or among waves of surveys make imputation of missing values complicated. We propose a new regression-based multiple imputation method to deal with survey nonresponses with two-sided logical boundaries. This imputation method automatically satisfies the boundary conditions without an additional acceptance/rejection procedure and utilizes the boundary information to derive an imputed value and to determine the suitability of the imputed value. Simulation results show that our new imputation method outperforms the existing imputation methods for both mean and quantile estimations regardless of missing rates, error distributions, and missing-mechanisms. We apply our method to impute the self-reported variable “years of smoking” in successive health screenings of Koreans.

    Release date: 2020-06-30

  • Articles and reports: 12-001-X201900200001
    Description:

    Development of imputation procedures appropriate for data with extreme values or nonlinear relationships to covariates is a significant challenge in large scale surveys. We develop an imputation procedure for complex surveys based on semiparametric quantile regression. We apply the method to the Conservation Effects Assessment Project (CEAP), a large-scale survey that collects data used in quantifying soil loss from crop fields. In the imputation procedure, we first generate imputed values from a semiparametric model for the quantiles of the conditional distribution of the response given a covariate. Then, we estimate the parameters of interest using the generalized method of moments (GMM). We derive the asymptotic distribution of the GMM estimators for a general class of complex survey designs. In simulations meant to represent the CEAP data, we evaluate variance estimators based on the asymptotic distribution and compare the semiparametric quantile regression imputation (QRI) method to fully parametric and nonparametric alternatives. The QRI procedure is more efficient than nonparametric and fully parametric alternatives, and empirical coverages of confidence intervals are within 1% of the nominal 95% level. An application to estimation of mean erosion indicates that QRI may be a viable option for CEAP.

    Release date: 2019-06-27

  • Articles and reports: 12-001-X201900100009
    Description:

    The demand for small area estimates by users of Statistics Canada’s data has been steadily increasing over recent years. In this paper, we provide a summary of procedures that have been incorporated into a SAS based production system for producing official small area estimates at Statistics Canada. This system includes: procedures based on unit or area level models; the incorporation of the sampling design; the ability to smooth the design variance for each small area if an area level model is used; the ability to ensure that the small area estimates add up to reliable higher level estimates; and the development of diagnostic tools to test the adequacy of the model. The production system has been used to produce small area estimates on an experimental basis for several surveys at Statistics Canada that include: the estimation of health characteristics, the estimation of under-coverage in the census, the estimation of manufacturing sales and the estimation of unemployment rates and employment counts for the Labour Force Survey. Some of the diagnostics implemented in the system are illustrated using Labour Force Survey data along with administrative auxiliary data.

    Release date: 2019-05-07

  • Articles and reports: 12-001-X201700114823
    Description:

    The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence the literature so far deals mainly with two phases while three phases or more are rarely being considered. The analysis in some cases is ad-hoc for a specific design and no comprehensive methodology for constructing calibrated estimators, and more challengingly, estimating their variances in three or more phases was formed. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

    Release date: 2017-06-22

  • Articles and reports: 11-633-X2017006
    Description:

    This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised where 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed.

    Release date: 2017-03-13

  • Articles and reports: 12-001-X201600214661
    Description:

    An example presented by Jean-Claude Deville in 2005 is subjected to three estimation methods: the method of moments, the maximum likelihood method, and generalized calibration. The three methods yield exactly the same results for the two non-response models. A discussion follows on how to choose the most appropriate model.

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600214676
    Description:

    Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as the Gross Domestic Product (GDP).

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600114538
    Description:

    The aim of automatic editing is to use a computer to detect and amend erroneous values in a data set, without human intervention. Most automatic editing methods that are currently used in official statistics are based on the seminal work of Fellegi and Holt (1976). Applications of this methodology in practice have shown systematic differences between data that are edited manually and automatically, because human editors may perform complex edit operations. In this paper, a generalization of the Fellegi-Holt paradigm is proposed that can incorporate a large class of edit operations in a natural way. In addition, an algorithm is outlined that solves the resulting generalized error localization problem. It is hoped that this generalization may be used to increase the suitability of automatic editing in practice, and hence to improve the efficiency of data editing processes. Some first results on synthetic data are promising in this respect.

    Release date: 2016-06-22

  • Articles and reports: 11-522-X201700014715
    Description:

    In preparation for 2021 UK Census the ONS has committed to an extensive research programme exploring how linked administrative data can be used to support conventional statistical processes. Item-level edit and imputation (E&I) will play an important role in adjusting the 2021 Census database. However, uncertainty associated with the accuracy and quality of available administrative data renders the efficacy of an integrated census-administrative data approach to E&I unclear. Current constraints that dictate an anonymised ‘hash-key’ approach to record linkage to ensure confidentiality add to that uncertainty. Here, we provide preliminary results from a simulation study comparing the predictive and distributional accuracy of the conventional E&I strategy implemented in CANCEIS for the 2011 UK Census to that of an integrated approach using synthetic administrative data with systematically increasing error as auxiliary information. In this initial phase of research we focus on imputing single year of age. The aim of the study is to gain insight into whether auxiliary information from admin data can improve imputation estimates and where the different strategies fall on a continuum of accuracy.

    Release date: 2016-03-24

  • Articles and reports: 12-001-X201500114193
    Description:

    Imputed micro data often contain conflicting information. The situation may e.g., arise from partial imputation, where one part of the imputed record consists of the observed values of the original record and the other the imputed values. Edit-rules that involve variables from both parts of the record will often be violated. Or, inconsistency may be caused by adjustment for errors in the observed data, also referred to as imputation in Editing. Under the assumption that the remaining inconsistency is not due to systematic errors, we propose to make adjustments to the micro data such that all constraints are simultaneously satisfied and the adjustments are minimal according to a chosen distance metric. Different approaches to the distance metric are considered, as well as several extensions of the basic situation, including the treatment of categorical data, unit imputation and macro-level benchmarking. The properties and interpretations of the proposed methods are illustrated using business-economic data.

    Release date: 2015-06-29
Reference (7)

Reference (7) ((7 results))

  • Surveys and statistical programs – Documentation: 71F0031X2005002
    Description:

    This paper introduces and explains modifications made to the Labour Force Survey estimates in January 2005. Some of these modifications include the adjustment of all LFS estimates to reflect population counts based on the 2001 Census, updates to industry and occupation classification systems and sample redesign changes.

    Release date: 2005-01-26

  • Surveys and statistical programs – Documentation: 92-397-X
    Description:

    This report covers concepts and definitions, the imputation method and data quality for this variable. The 2001 Census collected information on three types of unpaid work performed during the week preceding the Census: looking after children, housework and caring for seniors. The 2001 data on unpaid work are compared with the 1996 Census data and with the data from the General Social Survey (use of time in 1998). The report also includes historical tables.

    Release date: 2005-01-11

  • Surveys and statistical programs – Documentation: 92-388-X
    Description:

    This report contains basic conceptual and data quality information to help users interpret and make use of census occupation data. It gives an overview of the collection, coding (to the 2001 National Occupational Classification), edit and imputation of the occupation data from the 2001 Census. The report describes procedural changes between the 2001 and earlier censuses, and provides an analysis of the quality level of the 2001 Census occupation data. Finally, it details the revision of the 1991 Standard Occupational Classification used in the 1991 and 1996 Censuses to the 2001 National Occupational Classification for Statistics used in 2001. The historical comparability of data coded to the two classifications is discussed. Appendices to the report include a table showing historical data for the 1991, 1996 and 2001 Censuses.

    Release date: 2004-07-15

  • Surveys and statistical programs – Documentation: 92-398-X
    Description:

    This report contains basic conceptual and data quality information intended to facilitate the use and interpretation of census class of worker data. It provides an overview of the class of worker processing cycle including elements such as regional office processing, and edit and imputation. The report concludes with summary tables that indicate the level of data quality in the 2001 Census class of worker data.

    Release date: 2004-04-22

  • Surveys and statistical programs – Documentation: 85-602-X
    Description:

    The purpose of this report is to provide an overview of existing methods and techniques making use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and / or transforming personal identifiers from individual data records from one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes to help narrow the field of search against existing databases when some form of personal identification information exists.

    Release date: 2000-12-05

  • Surveys and statistical programs – Documentation: 75F0002M1998012
    Description:

    This paper looks at the work of the task force responsible for reviewing Statistics Canada's household and family income statistics programs, and at one of associated program changes, namely, the integration of two major sources of annual income data in Canada, the Survey of Consumer Finances (SCF) and the Survey of Labour and Income Dynamics (SLID).

    Release date: 1998-12-30

  • Surveys and statistical programs – Documentation: 75F0002M1997006
    Description:

    This report documents the edit and imputation approach taken in processing Wave 1 income data from the Survey of Labour and Income Dynamics (SLID).

    Release date: 1997-12-31
Date modified: