Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 1. Introduction
Official statistics are increasingly compiled from a combination of data sources, including surveys and administrative registers. The use of different sources poses multiple challenges. Sources can overlap, meaning that more than one observation is obtained for the same person and variable. Data sources are also often contaminated by errors and missing values, so that two sources may provide different values for the same unit and variable. Most of the data collected by statistical agencies therefore have to be corrected or otherwise processed to obtain consistent and publishable results. Several strategies are available to deal with multiple, overlapping data sources that are each contaminated by erroneous and missing values, see e.g. Pankowska, Pavlopoulos, Bakker and Oberski (2020). A first strategy, often chosen in practice, is to ignore inconsistencies between data sources. This happens for instance if the one data source that is believed to have the highest quality is chosen (de Waal, van Delden and Scholtus, 2020). With such a strategy, the information in the available sources is not fully exploited.
A second strategy is to apply weighting techniques (Särndal, Swensson and Wretman, 2003). When weighting is used, survey records are calibrated towards the totals from a register source. Differences between data sources are then attributed entirely to the selection effects of the sample. This approach ignores the fact that the register totals, as well as the sample surveys, might be subject to measurement error. An additional complication is that weighting does not always lead to fully consistent output, as it only achieves consistency with regard to the variables that are incorporated in the weighting model. The number of variables that can be included in a weighting model is, however, limited.
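As a minimal illustration of the weighting strategy, the sketch below post-stratifies a sample to register totals for a single variable. The function name `poststratification_weights`, the strata labels and the counts are hypothetical and chosen only for illustration; calibration models used in practice typically involve several variables and more general estimators.

```python
from collections import Counter


def poststratification_weights(sample_strata, register_totals):
    """Return one weight per sample record so that the weighted stratum
    counts equal the register totals (which are treated as error-free)."""
    sample_counts = Counter(sample_strata)
    empty = [s for s in register_totals if sample_counts[s] == 0]
    if empty:
        raise ValueError(f"no sample records in strata: {empty}")
    return [register_totals[s] / sample_counts[s] for s in sample_strata]


# Hypothetical toy data: age group of each sampled person and register counts.
sample_strata = ["young", "young", "old"]
register_totals = {"young": 100, "old": 50}
weights = poststratification_weights(sample_strata, register_totals)

# Weighted counts reproduce the register totals for the weighting variable.
weighted_totals = Counter()
for stratum, weight in zip(sample_strata, weights):
    weighted_totals[stratum] += weight
```

The weighted totals match the register only for the variable used to form the strata. A variable that is left out of the weighting model has no such guarantee, which is the consistency limitation described above.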
A third strategy to resolve inconsistencies between multiple sources is macro-integration, an approach that reconciles statistical output at aggregate level. This approach usually consists of two steps. First, differences with a known cause (i.e. bias) are resolved. The remaining, mostly smaller, discrepancies, which usually arise due to noise, are corrected in a second step. Several mathematical methods have been developed for this purpose, e.g. Bikker, Daalmans and Mushkudiani (2013), Daalmans (2019), Di Fonzo and Martini (2003), Magnus, van Tongeren and de Vos (2000), Sefton and Weale (1995) and Stone, Champernowne and Meade (1942). A first drawback of macro-integration is that the connection between the micro-data and the published results is lost: the macro-integrated results cannot be computed by aggregation of the micro-data. A second drawback is that the detailed micro-data might not be fully exploited, as the corrections are made at the macro level.
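The second step of macro-integration can be illustrated with a small sketch. It assumes a single additive constraint (components must sum to a known total) and a variance-weighted least-squares adjustment; this is a generic textbook formulation, not the specific method of any of the papers cited above. The function name `reconcile_to_total` and the numbers are hypothetical.

```python
def reconcile_to_total(estimates, variances, total):
    """Adjust component estimates so that they sum to `total`.

    Minimises sum((adjusted - estimate)**2 / variance) subject to
    sum(adjusted) == total. The closed-form solution spreads the
    discrepancy over the components in proportion to their variances.
    """
    if len(estimates) != len(variances):
        raise ValueError("estimates and variances must have equal length")
    total_variance = sum(variances)
    if total_variance <= 0:
        raise ValueError("variances must sum to a positive number")
    discrepancy = total - sum(estimates)
    return [
        x + v / total_variance * discrepancy
        for x, v in zip(estimates, variances)
    ]


# Hypothetical aggregates: three cell estimates that should sum to 100.
estimates = [40.0, 35.0, 30.0]
variances = [1.0, 1.0, 2.0]
adjusted = reconcile_to_total(estimates, variances, total=100.0)
```

The adjustment acts on the aggregates only, so the adjusted figures can no longer be reproduced by summing records in the micro-data.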
Many of the issues that arise with the previously discussed strategies can be circumvented by Multiple Imputation of Latent Classes (MILC), introduced by Boeschoten, Oberski and de Waal (2017). This method combines multiple measures from different sources (population register and sample survey) at micro level. The different observations are considered indicators of a Latent Class (LC) model. The MILC model corrects for misclassification while also taking edit restrictions into account. These are rules that identify logically impossible combinations of scores (e.g. pregnant men). After the LC model has been estimated, multiple imputed versions of the target variable are created, each corrected for the estimated misclassification. Differences between imputed values reflect the uncertainty due to missing and conflicting values. The total variance can be estimated based on these differences. The method can be considered a model-based imputation method that requires the Missing At Random (MAR) assumption. A simulation study showed that the performance of the method is strongly related to the entropy value of the LC model, a measure that indicates how well the LC model can predict class membership based on the observed variables, or how well the classes are separated.
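Two quantities mentioned above can be sketched in a few lines. The sketch makes two assumptions that this section does not confirm: the entropy measure is taken to be the relative entropy, normalised by N log K for N units and K classes, and the total variance is pooled with the standard multiple-imputation rules of Rubin. The function names and example numbers are hypothetical.

```python
import math
from statistics import mean, variance


def entropy_r2(posteriors):
    """Relative entropy of posterior class memberships.

    `posteriors` holds one row per unit with K class probabilities.
    Returns 1 for perfectly separated classes and 0 when every unit
    has a uniform posterior (assumed normalisation: N * log(K)).
    """
    n_units = len(posteriors)
    n_classes = len(posteriors[0])
    if n_classes < 2:
        raise ValueError("at least two classes are required")
    entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1 - entropy / (n_units * math.log(n_classes))


def pool_estimates(estimates, within_variances):
    """Pool m imputed-data estimates (assumed: Rubin's rules).

    Total variance = mean within-imputation variance
                     + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    return pooled, within + (1 + 1 / m) * between


# Hypothetical example: a class proportion estimated on five imputed datasets.
pooled, total_var = pool_estimates(
    [0.30, 0.32, 0.31, 0.29, 0.33], [0.0004] * 5
)
```

The between-imputation term is what carries the uncertainty due to missing and conflicting values: if all imputed datasets agreed, it would vanish and only the within-imputation variance would remain.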
Since MILC was introduced, several studies have extended the method to broaden its scope of applicability. Boeschoten, de Waal and Vermunt (2019) extended the method to impute values that are missing by design, for example because the units were not present in the sample, using a quasi-latent variable. More specifically, a quasi-latent variable is a latent variable that is restricted to have a perfect relationship with an observed variable that contains missing values. In that way, the relationship between the quasi-latent variable and all other variables specified in the model can be used to estimate the missing values. In addition, they investigated the performance of the method when two combined sources follow different missingness mechanisms. Furthermore, Boeschoten, Filipponi and Varriale (2021) investigated how the method can be extended to longitudinal situations and how unit missingness can be imputed in a situation of combined survey and register data.
Although these previous studies investigated a number of relevant issues, there are still cases for which it is unclear how the MILC method can be applied. The aim of this paper is to further extend the range of applications of MILC and, with that, to further increase the capabilities for producing multi-source statistics.
So far, the application of MILC has been limited to univariate problems. In practice, however, there is often a need to estimate multiple variables at once. The first important extension in this paper is to allow the simultaneous imputation of multiple latent variables. As population registers can contain misclassification, it is worthwhile to correct for it where possible. For multivariate problems, corrections should be performed simultaneously, which is more difficult than for a single variable.
Second, statistical agencies generally consider finite target populations (e.g. containing all registered inhabitants of a country). It is unclear whether the MILC method can be applied directly to a finite population, or whether adaptations to the method are needed.
The usefulness of the extensions in this paper is illustrated by an application to the Dutch virtual census, an application that would otherwise not be possible. For the census, a large number of tables have to be estimated from a population register and a sample survey. To the best of our knowledge, this is the first time that MILC has been applied to such a large estimation problem. Theoretically, it is already known that edit restrictions can be incorporated in an LC model to prevent the occurrence of logically impossible combinations of scores (Boeschoten et al., 2017). However, it is not obvious how the MILC method performs if edit restrictions are incorporated in such a way that they affect multiple cells in a population census table.
In Section 2, a description of the MILC method is given, tailored to handle the specific extensions discussed. In Section 3, a description of the simulation study is given. Simulation results are shown in Section 4 and Section 5 provides a discussion.