12-539 Data Quality Guidelines

Archived Content

Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived.

Imputation

Scope and purpose

Imputation is the process used to determine and assign replacement values for missing, invalid or inconsistent data that have failed edits. On the record being edited, some responses are changed and missing values are assigned, so that the resulting record is plausible and internally consistent and the estimates are of high quality. Many of these problems would have been solved earlier through follow-up with the respondent or through review and manual correction of the questionnaire. However, it is generally impossible to resolve all problems at these early stages because of concerns about response burden, cost and timeliness. Since it is usually desirable to produce a complete and consistent microdata file, imputation is used to handle the remaining edit failures.

Although imputation can improve the quality of the final data by correcting for missing, invalid or inconsistent responses, care must be exercised in choosing an appropriate imputation methodology. Some methods of imputation do not preserve the relationships between variables and can actually distort underlying distributions. Therefore, imputation must be taken into account when producing estimates and their associated variance estimates.
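
As an illustration of how a method can distort a distribution, the following minimal Python sketch applies mean imputation to a small made-up data set. The data values and the helper name mean_impute are hypothetical and are not taken from any Statistics Canada system. In this example the mean of the completed data equals the respondent mean, but the variance shrinks because every imputed value sits at the centre of the distribution.

```python
import statistics

def mean_impute(values):
    """Replace each None with the mean of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Made-up data: None marks a missing response.
reported = [10.0, 12.0, None, 20.0, None, 18.0]
observed = [v for v in reported if v is not None]
completed = mean_impute(reported)

# Sample variance of the respondents versus the completed data set.
var_observed = statistics.variance(observed)
var_completed = statistics.variance(completed)
print("variance of observed values: ", var_observed)
print("variance after mean imputation:", var_completed)
```

A variance estimator that treats the completed file as fully observed would therefore understate the variability in this example, which is one reason imputation has to be taken into account at the estimation stage.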

Principles

Imputation is best done by those with full access to the microdata and in possession of good auxiliary information. It may be automated, manual or a combination of both. Good imputation attempts to limit the bias caused by not having observed all of the desired values, has an audit trail for evaluation purposes and ensures that imputed records are internally consistent. Good imputation processes are automated, objective, reproducible and efficient. Under the Fellegi-Holt principles (1976), changes are made to the minimum number of fields to ensure that the completed record passes all of the edits.
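
The minimum-change idea can be sketched for a tiny categorical record as follows. This is only an illustrative brute-force search under stated assumptions: the variables, their domains, the two edits and the function names are all hypothetical, and production systems use far more efficient algorithms than enumerating every subset of fields.

```python
from itertools import combinations, product

# Hypothetical variables and their allowed values.
DOMAINS = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Hypothetical edits: each function returns True when the record PASSES.
EDITS = [
    lambda r: not (r["age_group"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]

def passes(record):
    """True if the record satisfies every edit."""
    return all(edit(record) for edit in EDITS)

def minimal_fields_to_impute(record):
    """Return all smallest sets of fields whose values can be changed
    so that the record passes every edit (exhaustive search)."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        solutions = []
        for subset in combinations(fields, size):
            # Try every combination of values for the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes(candidate):
                    solutions.append(subset)
                    break
        if solutions:
            return solutions
    return []

failed = {"age_group": "child", "marital": "married", "employed": "yes"}
print(minimal_fields_to_impute(failed))
```

Here the record fails both edits. Changing marital and employed together would repair it, but changing age_group alone also does, so the search identifies age_group as the single field to impute.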

Imputation methods can be classified as either deterministic or stochastic, depending upon whether or not there is some degree of randomness in the imputed data (Kalton and Kasprzyk, 1986; Kovar and Whitridge, 1995). Deterministic imputation methods include logical imputation, historical imputation, mean imputation, ratio imputation, regression imputation and single-donor nearest-neighbour imputation. These methods can be further divided into those that deduce the imputed value solely from data available for the nonrespondent and other auxiliary data (logical and historical imputation) and those that make use of the data observed for other responding units in the same survey. Observed data from responding units can be used directly, by transferring data from a chosen donor record, or indirectly, by means of models (ratio and regression imputation). Stochastic imputation methods include the hot deck, nearest-neighbour imputation in which the donor is selected at random from several of the “closest” neighbours, regression with random residuals, and any other deterministic method with random residuals added.
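
The distinction can be illustrated with a short Python sketch that contrasts one deterministic method (ratio imputation) with one stochastic method (a random hot deck). The data, the function names and the fixed random seed are assumptions made for the example only. In practice the hot deck would normally be applied within imputation classes.

```python
import random

def ratio_impute(x_missing, respondents):
    """Deterministic: impute y as (sum of y / sum of x) * x,
    where the sums are taken over the responding units."""
    ratio = sum(y for _, y in respondents) / sum(x for x, _ in respondents)
    return ratio * x_missing

def random_hot_deck(respondents, rng):
    """Stochastic: copy y from a donor chosen at random among the respondents."""
    _, donor_y = rng.choice(respondents)
    return donor_y

# Made-up respondents: (auxiliary value x, reported value y).
respondents = [(100.0, 30.0), (200.0, 45.0), (300.0, 75.0)]
x_nonrespondent = 160.0

# The ratio method always returns the same value for the same inputs.
imputed_ratio = ratio_impute(x_nonrespondent, respondents)

# The hot deck value depends on the random draw; the seed is fixed
# here only so that the run is reproducible.
rng = random.Random(12345)
imputed_hot_deck = random_hot_deck(respondents, rng)

print("ratio imputation:", imputed_ratio)
print("random hot deck: ", imputed_hot_deck)
```

The ratio method produces a value that no respondent necessarily reported, whereas the hot deck always returns an actually observed value, which tends to preserve the shape of the distribution better at the cost of added random variability.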

Guidelines

Evaluate the type of nonresponse. That is, try to determine which auxiliary variables can explain the nonresponse mechanism(s), and include those variables in the imputation method.
Carefully develop and test the imputation approach. Study the quality and appropriateness of available variables to determine which ones to use as auxiliary variables, as matching variables or to build imputation classes. For this purpose, consult subject matter experts and use modelling techniques.
Take into account the type of estimates to be produced, such as level vs. change, high-level aggregates vs. small domains, and cross-sectional vs. longitudinal.
Try to have the imputed record closely resemble the record that failed the edits. This is achieved by imputing as few variables as possible, thereby preserving as much respondent data as possible. The underlying assumption is that a respondent is more likely to make only one or two errors rather than several, although this is not always true in practice. Make imputed records internally consistent.
In some surveys, it is necessary to use several different types of imputation methods. This is usually achieved in an automated hierarchy of methods. Limit the number of such levels and carefully develop and test the methods used at each level of the hierarchy. Similarly, when collapsing imputation classes is required, carefully develop and test the imputation methods for the new classes.
When donor imputation is used, try to impute data for a record from as few donors as possible. Operationally, this may be interpreted as one donor per section of the questionnaire, since it is virtually impossible to treat all variables at once for a large questionnaire. Also, based on the available donors, give equally good imputation actions an appropriate chance of being selected, to avoid artificially inflating the size of certain groups in the population.
For large surveys, it may be necessary to process variables in two or more passes, rather than in a single pass, so as to reduce computational costs. As well, there may be extensive response errors on a record. Either of these conditions can make it difficult to follow the guidelines exactly: there may be cases where more than one donor is required, and more than the minimum number of variables are imputed.
During the development of the imputation methodology, note that a number of generalized systems exist that implement a variety of algorithms, for either continuous or categorical data. The systems are usually simple to use once the edits are specified, and they include algorithms to determine which fields to impute. They are well documented and retain audit trails to allow evaluation of the imputation process. Two systems currently available at Statistics Canada are the Generalized Edit and Imputation System (GEIS/BANFF) (Kovar et al., 1988; Statistics Canada, 2000a) for quantitative economic variables and the Canadian Census Edit and Imputation System (CANCEIS) (Bankier et al., 1999) for qualitative and quantitative variables.
Flag imputed values and clearly identify the methods and sources of imputation. Retain the unimputed and imputed values of the record’s fields for evaluation purposes. Evaluate the degree and effects of imputation. Consider the use of techniques to adequately measure the sampling variance under imputation and to measure the added variance introduced by imputation (Lee et al., 2002). This information is required to satisfy Statistics Canada’s Policy on Informing Users of Data Quality and Methodology (Statistics Canada, 2000d).
Consider the degree and impact of imputation when analyzing data. The imputation methods used may have a significant impact on distributions of data. For example, it is possible that not very much has changed at the aggregate level, but that values in one domain have moved systematically up, while values in another domain have moved down by an offsetting amount. As well, even when the degree of imputation is low, changes to individual records may have a significant impact, for example when changes are made to large units or when large changes are made to a few units. In general, the greater the degree and impact of imputation, the more judicious the analyst needs to be in using the data. In such cases, analyses may be misleading if the imputed values are treated as observed data.
Note that the Imputation Bulletin produced by the Methodology Branch presents Statistics Canada’s software and practices in imputation as well as recent developments in the field of imputation. Also, the Committee on Practices in Imputation (CoPI) meets regularly to discuss issues related to imputation and specific implementations of imputation. Valuable comments and suggestions can be obtained from the CoPI when designing an imputation strategy.
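
The guidelines on flagging imputed values, retaining the original values and evaluating the degree and impact of imputation can be sketched in Python as follows. The record layout, the field and method names, and the two summary measures are hypothetical choices made for illustration. They do not describe the audit trail of GEIS/BANFF or CANCEIS.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    original: Optional[float]     # value as reported (None if missing)
    final: float                  # value used for estimation
    method: Optional[str] = None  # imputation method; None means not imputed

# Made-up records; both the original and the final value are retained.
records = [
    {"domain": "A", "revenue": Field(100.0, 100.0)},
    {"domain": "A", "revenue": Field(None, 120.0, "ratio")},
    {"domain": "B", "revenue": Field(900.0, 90.0, "logical")},
    {"domain": "B", "revenue": Field(80.0, 80.0)},
]

def imputation_summary(records, variable):
    """Per domain: the share of records imputed ("rate") and the share of
    the domain total that comes from imputed values ("share_of_total").
    Assumes every domain total is non-zero."""
    stats = defaultdict(
        lambda: {"n": 0, "n_imputed": 0, "total": 0.0, "imputed_total": 0.0}
    )
    for rec in records:
        cell = rec[variable]
        s = stats[rec["domain"]]
        s["n"] += 1
        s["total"] += cell.final
        if cell.method is not None:
            s["n_imputed"] += 1
            s["imputed_total"] += cell.final
    return {
        domain: {
            "rate": s["n_imputed"] / s["n"],
            "share_of_total": s["imputed_total"] / s["total"],
        }
        for domain, s in stats.items()
    }

summary = imputation_summary(records, "revenue")
for domain, measures in sorted(summary.items()):
    print(domain, measures)
```

Reporting both measures by domain helps reveal cases where the imputation rate is modest but the imputed values account for a large share of the estimate, for example when a large unit has been imputed.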

References

Bankier, M., Lachance, M. and Poirier, P. (1999). A generic implementation of the New Imputation Methodology. Proceedings of the Survey Research Methods Section, American Statistical Association, 548-553.

Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Kalton, G. and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, 12, 1-16.

Kovar, J.G. and Whitridge, P. (1995). Imputation of business survey data. In Business Survey Methods, B.G. Cox et al. (eds.), Wiley, New York, 403-423.

Kovar, J.G., MacMillan, J. and Whitridge, P. (1988). Overview and strategy for the Generalized Edit and Imputation System. Statistics Canada, Methodology Branch Working Paper No. BSMD 88-007 E/F.

Lee, H., Rancourt, E. and Särndal, C.-E. (2002). Variance estimation from survey data under single imputation. In Survey Nonresponse, R.M. Groves et al. (eds.), Wiley, New York, 315-328.

Statistics Canada (2000a). Functional description of the Generalized Edit and Imputation System. Statistics Canada technical report.

Statistics Canada (2000d). Policy on Informing Users of Data Quality and Methodology. Policy Manual, 2.3.

Date Modified: 2014-04-10