Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 5. Discussion
In this paper, the performance of the MILC method was investigated in a situation where misclassification was induced in a finite population setting. Here, an existing population census table was used as a starting point, and for each of three categorical variables present in this census table, two indicator variables were generated with 5% misclassification each, where one indicator also contained approximately 90% missing values. As a finite population was assumed, the estimated variance only contained a between-imputation variance component, reflecting the differences between the imputations and thereby the uncertainty caused by the misclassification and missing values in the indicator variables.
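The indicator set-up described above can be illustrated with a minimal sketch. The function names, the category labels, the seed and the choice to draw a wrong category uniformly from the remaining categories are assumptions made for illustration only; they are not taken from the simulation code used in this paper. The sketch assumes that the register indicator is fully observed and that the survey indicator is the one with approximately 90% of its values missing completely at random.

```python
import random


def misclassify(true_value, categories, error_rate, rng):
    """Return the true value, or with probability error_rate a different
    category drawn uniformly from the remaining ones (an assumption)."""
    if rng.random() < error_rate:
        others = [c for c in categories if c != true_value]
        return rng.choice(others)
    return true_value


def make_indicators(true_values, categories, error_rate=0.05,
                    missing_rate=0.90, seed=1):
    """Generate a fully observed 'register' indicator and a 'survey'
    indicator that is missing (None) for about missing_rate of the units."""
    rng = random.Random(seed)
    register, survey = [], []
    for value in true_values:
        register.append(misclassify(value, categories, error_rate, rng))
        if rng.random() < missing_rate:
            survey.append(None)  # unit not observed in the survey
        else:
            survey.append(misclassify(value, categories, error_rate, rng))
    return register, survey


# Hypothetical finite population with one three-category variable.
categories = ["a", "b", "c"]
population_rng = random.Random(0)
true_values = [population_rng.choice(categories) for _ in range(20000)]

register, survey = make_indicators(true_values, categories)

# Realised error and missingness rates in this generated population.
register_error = sum(r != t for r, t in zip(register, true_values)) / len(true_values)
survey_missing = sum(s is None for s in survey) / len(survey)
```

In this sketch the misclassification is non-systematic: every unit has the same error probability, independent of covariates and of the other indicator.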
The simulation results show that the method, regardless of the number of imputations, produces results with low bias for marginal frequency distributions, for cross-tables between imputed latent variables and covariates, and even for the complete six-way cross-table. The amount of bias induced when the cross-tables are calculated from the indicator observed via the register, rather than with MILC, is striking. The results also show that using these indicators directly is likely to produce impossible combinations of scores, something that can easily be circumvented by specifying edit restrictions in the LC model. This simulation study once again shows that misclassification, even if it is non-systematic, can seriously bias results. In terms of variance, the estimates obtained with the MILC method are generally appropriate. However, if cell frequencies are relatively small, the variance is overestimated. This problem is more severe when the complete frequency table is evaluated, because this large table contains many cells with low frequencies.
The current set-up of this simulation study has two major limitations. The first is caused by the large number of cells in the cross-table. Because of this, a latent class model containing only main effects was used. It was not feasible to use a saturated model, as the number of parameters would be very large and it is likely that not every parameter would be estimable in every bootstrap sample. This would limit the use of starting values, thereby increasing the computation time for the simulation study to an infeasible level.
A second limitation is that in our simulation set-up we only considered relatively simple sampling designs for the survey data: simple random sampling (MCAR conditions) and, essentially, stratified simple random sampling (MAR conditions). A future study could examine to what extent the MILC method can also correct for misclassification error with appropriate variance estimates when survey data are obtained by a complex sampling design that involves, for instance, cluster sampling, multistage sampling or sampling with unequal probabilities proportional to size. In the context of missing data it has been found that, although a generally accepted theory is still lacking, in practice multiple imputation often works reasonably well for complex samples, provided that design variables and/or survey weights are included in the imputation model; see, e.g., Rässler (2004, page 14) and the references listed there. It would be interesting to investigate whether this result also applies to multiple imputation in the context of correcting for measurement errors. As an alternative, Zhou, Elliott and Raghunathan (2016) proposed a Bayesian approach to incorporate survey design features into a multiple-imputation analysis.
The starting point of this simulation study was an existing population census table. A convenient property here was that we could approach this as a finite and known population. Therefore, we did not have to include (within) sampling variance in our estimate of the total variance. It was insightful to evaluate cell frequencies of both univariate and multivariate cross-tables, as results generally appeared to be related to cell frequency.
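As an illustration of pooling without a within-imputation component, the sketch below applies Rubin's rules with the within variance set to zero, so that the total variance reduces to T = B + B/m, where B is the between-imputation variance and m the number of imputations. This form is an assumption in the spirit of Vink and van Buuren (2014); the function name and the example cell counts are hypothetical and are not results from this study.

```python
from statistics import mean, variance


def pool_finite_population(estimates):
    """Pool one cell-frequency estimate over m imputations, assuming the
    within-imputation (sampling) variance is zero for a finite, known
    population. Returns (pooled estimate, total variance)."""
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two imputations are needed")
    pooled = mean(estimates)
    between = variance(estimates)    # sample variance with divisor m - 1
    total = between + between / m    # T = B + B/m, within component omitted
    return pooled, total


# Hypothetical frequencies of a single table cell in m = 5 imputed data sets.
cell_counts = [1012, 998, 1005, 990, 1010]
pooled, total = pool_finite_population(cell_counts)
```

Because the pooled variance is driven entirely by the spread between imputations, it reflects only the uncertainty due to misclassification and missing values in the indicators.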
References
Bakker, B. (2010). Micro-integration, state of the art. Paper presented at the joint UNECE-Eurostat expert group meeting on registered based censuses in The Hague, May 11, 2010. Retrieved from https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.41/2010/wp.10.e.pdf (date visited 2017.04.24).
Bakker, B. (2011). Micro-integration. statistical methods 201108. Statistics Netherlands.
Bakker, B., Van Rooijen, J. and Van Toor, L. (2014). The system of social statistical datasets of Statistics Netherlands: An integral approach to the production of register-based social statistics. Statistical Journal of the IAOS, 30(4), 411-424.
Bikker, R., Daalmans, J. and Mushkudiani, N. (2013). Benchmarking large accounting frameworks: A generalized multivariate model. Economic Systems Research, 25(4), 390-408.
Boeschoten, L., de Waal, T., and Vermunt, J.K. (2019). Estimating the number of serious road injuries per vehicle type in the Netherlands by using multiple imputation of latent classes. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(4), 1463-1486. Retrieved from https://doi.org/10.1111/rssa.12471.
Boeschoten, L., Filipponi, D. and Varriale, R. (2021). Combining multiple imputation and hidden Markov modeling to obtain consistent estimates of employment status. Journal of Survey Statistics and Methodology, 9(3), 549-573. Retrieved from https://doi.org/10.1111/rssa.12471.
Boeschoten, L., Oberski, D. and de Waal, T. (2017). Estimating classification errors under edit restrictions in composite survey-register data using Multiple Imputation Latent Class modelling (MILC). Journal of Official Statistics, 33(4), 921-962. Retrieved from https://doi.org/10.1515/jos-2017-0044.
Census Hub (2017, July). European Statistical System. Online, July 2017, (last visited 11/07/2017).
Daalmans, J. (2018). Divide-and-Conquer solutions for estimating large consistent table sets. Statistical Journal of the IAOS, 34(2), 223-233.
Daalmans, J. (2019). Pushing the Boundaries for Automated Data Reconciliation in Official Statistics. Tilburg University.
de Waal, T., Pannekoek, J. and Scholtus, S. (2011). Handbook of Statistical Data Editing and Imputation, New York: John Wiley & Sons, Inc., 563. (ISBN 0470904836, 9780470904831).
de Waal, T., van Delden, A. and Scholtus, S. (2020). Multi-source statistics: Basic situations and methods. International Statistical Review, 88(1), 203-228.
Di Fonzo, T., and Martini, M. (2003). Benchmarking systems of seasonally adjusted time series according.
European Commission (2008). Regulation (EC) No 763/2008 of the European Parliament and of the Council of 9 July 2008 on population and housing censuses. Official Journal of the European Union, (L218), 14-20.
European Commission (2009). Commission Regulation (EC) No 1201/2009 of 30 November 2009 implementing Regulation (EC) No 763/2008 of the European Parliament and of the Council on population and housing censuses as regards the technical specifications of the topics and of their breakdowns. Official Journal of the European Union, (L329), 29-68.
European Commission (2010). Commission Regulation (EU) No 1151/2010 of 8 December 2010 implementing Regulation (EC) No 763/2008 of the European Parliament and of the Council on population and housing censuses, as regards the modalities and structure of the quality reports and the technical format for data transmission. Official Journal of the European Union, (L324), 1-12.
Geerdinck, M., Goedhuys-van der Linden, M., Hoogbruin, E., De Rijk, A., Sluiter, N. and Verkleij, C. (2014). Monitor Kwaliteit Stelsel Van Basisregistraties: Nulmeting Van de Kwaliteit Van Basisregistraties in Samenhang, 2014 (13114th Ed.). Henri Faasdreef 312, 2492 JP Den Haag, Centraal Bureau voor de Statistiek. Retrieved from https://www.cbs.nl/-/media/pdf/2016/50/monitor-kwaliteit-stelsel-van-basisregistraties.pdf (date visited 2017.04.25).
Magnus, J.R., van Tongeren, J.W. and de Vos, A.F. (2000). National accounts estimation using indicator ratios. Review of Income and Wealth, 46(3), 329-350.
Mashreghi, Z., Haziza, D. and Léger, C. (2016). A survey of bootstrap methods in finite population sampling. Statistics Surveys, 10, 1-52.
Pankowska, P., Pavlopoulos, D., Bakker, B. and Oberski, D.L. (2020). Reconciliation of inconsistent data sources using hidden Markov models. Statistical Journal of the IAOS, 36(4), 1261-1279.
Rässler, S. (2004). The impact of multiple imputation for DACSEIS. (DACSEIS Research Paper Series No. 5).
Rubin, D.B. (1976). Inference and missing data. Biometrika, 63(3), 581-592.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc., 81. (ISBN 9780471087052). doi: 10.1002/9780470316696.
Särndal, C.-E., Swensson, B. and Wretman, J. (2003). Model Assisted Survey Sampling. Springer Science & Business Media.
Schulte Nordholt, E., Van Zeijl, J. and Hoeksma, L. (2014). Dutch census 2011, analysis and methodology. Statistics Netherlands. Retrieved from https://www.cbs.nl/-/media/imported/documents/2014/44/2014-b57-pub.pdf (date visited 2017.04.25).
Sefton, J., and Weale, M. (1995). Reconciliation of National Income and Expenditure: Balanced Estimates of National Income for the United Kingdom, 1920-1990. Cambridge University Press, 7.
Stone, R., Champernowne, D.G. and Meade, J.E. (1942). The precision of national income estimates. The Review of Economic Studies, 9(2), 111-125.
The Economic and Social Council (2005). ECOSOC resolution 2005/13. 2010 World Population and Housing Census Programme. Retrieved from http://www.un.org/en/ecosoc/docs/2005/resolution%202005-13.pdf.
van Rooijen, J., Bloemendal, C. and Krol, N. (2016). The added value of micro-integration: Data on laid-off employees. Statistical Journal of the IAOS, 32(4), 685-692.
Vermunt, J.K., and Magidson, J. (2013a). Latent GOLD 5.0 Upgrade Manual [Computer software manual]. Belmont, MA, Retrieved from https://www.statisticalinnovations.com/wp-content/uploads/LG5manual.pdf (date visited 2017.04.25).
Vermunt, J.K., and Magidson, J. (2013b). Technical guide for Latent GOLD 5.0: Basic, Advanced, and Syntax. Statistical Innovations Inc., Belmont, MA. Retrieved from https://www.statisticalinnovations.com/wp-content/uploads/LGtecnical.pdf (date visited 2017.04.25).
Vink, G., and van Buuren, S. (2014). Pooling multiple imputations when the sample happens to be the population. arXiv preprint arXiv:1409.8542, Retrieved from https://arxiv.org/abs/1409.8542.
Zhou, H., Elliott, M.R. and Raghunathan, T.E. (2016). A two-step semiparametric method to accommodate sampling weights in multiple imputation. Biometrics, 72, 242-252. doi: 10.1111/biom.12413.