Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 3. Simulation study
In this section, we describe a simulation study that is
performed to evaluate the extensions of the MILC method in Section 2. The
topic of this study is the estimation of a table from the Dutch Population and
Housing Census.
3.1 The Dutch Census
Population and housing censuses provide a picture of the
socio-demographic and socio-economic situation of a country, and it is
generally accepted that a census should cover the entire population of people and
dwellings that are present in a country. Every ten years the United Nations
Economic and Social Council (ECOSOC) adopts a resolution, urging Member States
to carry out a population and housing census and to disseminate census results
as an essential source of information, see e.g. The Economic and Social Council
(2005). In the EU, explicit agreements have been made about which variables
should be listed in the census, and also which cross-tables should be produced
(European Commission, 2008, 2009 and 2010).
The vast majority of countries produce census data by
conducting a traditional census, which entails interviewing inhabitants in a
complete enumeration, reaching every single household. An increasing number of
countries however have adopted a different, innovative approach, in the form of
a so-called virtual census. With a virtual census, census tables are compiled
using data sources that are already available at the statistical agency. These
data sources were not primarily collected for the census, but for
other purposes. Statistics Netherlands can rely on population registers as the
main source for most census tables. These registers are of relatively good
quality, including a very broad coverage (Geerdinck,
Goedhuys-van der Linden, Hoogbruin, De Rijk, Sluiter and
Verkleij, 2014). All register variables are available from Statistics
Netherlands’ system of social statistical data-sets (Bakker, Van Rooijen
and Van Toor, 2014). The backbone is the Central Population Register which
combines the population registers from municipalities. The population registers
are supplemented with variables originating from sample surveys, because not
all variables that are necessary according to the EU regulations can be found
in the population registers.
For the 2001 and 2011 Dutch censuses, only two variables
could not be measured from registers: Occupation and Educational Attainment (Schulte Nordholt,
Van Zeijl and Hoeksma, 2014). These two variables were observed from
combined Labour Force Surveys (LFSs). To obtain the required cross-tables for
the 2011 Dutch census, a procedure was used where all data sources were matched
on the unit level. Then, a micro-integration process was carried out.
Micro-integration brings together records from different micro-datasets and
subsequently resolves data inconsistencies. The goal is to improve the quality,
compatibility and scope of the data sets. The techniques that are used in
micro-integration are: completing, harmonising and correcting for measurement errors.
Completing means that corrections are made for under- or overcoverage of the
target population. Harmonising refers to transformations that make data sets
fit the concept that is supposed to be measured. Correcting for measurement
errors means that inconsistencies between sources are resolved (Bakker, 2011; van Rooijen,
Bloemendal and Krol, 2016). This is done using formal rules that specify what
happens in case of an inconsistency, e.g. which source takes precedence
(Bakker, 2010; de Waal, Pannekoek and Scholtus, 2011).
After micro-integration, two combined data sources were
obtained: one based on a combination of registers and the other one based on a
combination of sample surveys. All census tables that do not contain occupation
and educational attainment were entirely compiled from the combined registers.
The values in the cells of these tables were obtained by counting the
occurrence of the categories in the matched registers. The other census tables,
those with educational attainment and/or occupation, were estimated from the
combined sample surveys. To establish consistent results, a procedure was
applied based on weighting followed by macro-integration (Daalmans, 2018;
Schulte Nordholt et al., 2014). In the first step, weights were
derived, such that the marginal totals of the weighted survey data comply with
the known totals from the registers. The different tables that are obtained in
this way are not necessarily consistent with each other, because different
weighting schemes apply to each table. To resolve this problem,
macro-integration is used. This step starts with initial estimates for each
census table, derived from the weighted survey data or from the integrally
counted register data. These initial estimates are adjusted, to arrive at fully
consistent census tables, that comply with the known register totals.
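The weighting step described above can be illustrated with a minimal sketch. It uses raking (iterative proportional fitting) on two margins, which is one common way to make weighted survey margins comply with known totals. It is not claimed to be the weighting scheme actually used for the Dutch census. The function name, the toy data and the register totals are all hypothetical.

```python
import numpy as np

def rake_weights(cats_a, cats_b, totals_a, totals_b, n_iter=50):
    """Rake unit weights so that weighted margins match known totals.

    cats_a, cats_b : category label of each survey unit on two variables
    totals_a, totals_b : dicts mapping category label -> known register total
    Assumes every category in the dicts occurs at least once in the survey.
    """
    cats_a = np.asarray(cats_a)
    cats_b = np.asarray(cats_b)
    w = np.ones(len(cats_a), dtype=float)
    for _ in range(n_iter):
        # Alternate between the two margins, rescaling weights per category.
        for cats, totals in ((cats_a, totals_a), (cats_b, totals_b)):
            for level, target in totals.items():
                mask = cats == level
                current = w[mask].sum()
                if current > 0:
                    w[mask] *= target / current
    return w

# Hypothetical survey of five persons and hypothetical register totals.
gender = np.array(["m", "f", "f", "m", "f"])
age = np.array(["young", "young", "old", "old", "old"])
weights = rake_weights(gender, age,
                       {"m": 40, "f": 60},
                       {"young": 50, "old": 50})
```

After raking, the weighted counts per gender and per age group both reproduce the assumed register totals. Tables estimated with different weighting schemes can still disagree with each other, which is why a macro-integration step follows in the procedure described above.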
MILC has three advantages over the current estimation
method. First, the current method often assumes that the population registers are
free of error. If a variable is measured both in the population register and in
a sample survey and the scores contradict each other, the
register score usually overrides the survey score because of this assumption.
In other words, sample survey data are ignored for the part that is also
observed in a register. Second, for the current procedure, it is not easy to
compute uncertainty measures that capture all steps of the estimation process,
including the uncertainty due to the missing and conflicting values in the
linked data-sets. For MILC, on the other hand, it is well established how
variances can be properly estimated. Third, the data processing procedure that
is currently used consists of a specific sequence of steps, where decisions made
at one step are influenced by decisions made at previous steps. For instance,
if there are two conflicting values for the same person, then one of these is
chosen in the micro-integration step. In the subsequent weighting and
macro-integration steps only that value is used. Thus, the availability of the
different values is ignored in the final estimation of the census tables.
In contrast to the current procedure, MILC exploits the information provided by
all observed values.
3.2 The census table under investigation
The starting point of this simulation study is an existing
census table, which can be downloaded from Census Hub (Census Hub, 2017). This
table comprises 2,691,477 persons who were living in the region “Noord-Holland”
in the Netherlands in 2011. This census table is a cross-table between the
following six variables:
- Age in 21 categories: under 5 years; 5 to 9 years; 10 to 14 years; 15 to 19
  years; 20 to 24 years; 25 to 29 years; 30 to 34 years; 35 to 39 years; 40 to
  44 years; 45 to 49 years; 50 to 54 years; 55 to 59 years; 60 to 64 years; 65
  to 69 years; 70 to 74 years; 75 to 79 years; 80 to 84 years; 85 to 89 years;
  90 to 94 years; 95 to 99 years; 100 years and over.
- Marital status in eight categories: never married; married; widowed;
  divorced; registered partnership; widow of registered partner; divorced from
  registered partner; not stated.
- Gender in two categories: male; female.
- Place of birth in five categories: the Netherlands; a country within the
  European Union; a country outside the European Union; other; not stated.
- Type of family nucleus in which a person lives in five categories: partners;
  lone parents; sons/daughters; not stated; not applicable.
- Country of citizenship in five categories: Dutch citizen; citizen of a
  country within the European Union; citizen of a country outside the European
  Union; stateless; not stated.
Thus, the census table consists of 42,000 cells.
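The cell count follows from multiplying the numbers of categories of the six variables, in the order listed above (Age, Marital status, Gender, Place of birth, Type of family nucleus, Country of citizenship):

```latex
\[
21 \times 8 \times 2 \times 5 \times 5 \times 5 = 42{,}000
\]
```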
3.3 Simulation setup
The goal of this simulation study is to replicate the
frequencies of the 42,000 cells in the cross-table using multiple indicators
contaminated with misclassification and missing values. Because the original
table is treated as the truth, this contamination has to be induced first.
We generate two indicator variables for each of three
latent variables, all containing 5% random misclassification, which can be
considered a very high amount, especially for Dutch population registers. The
indicator variables are generated for the variables “Gender”, “Type of family
nucleus” and “Country of citizenship”. Misclassification is generated in three
steps. First, 5% of the cases are randomly selected. Second, their original
score is identified. Third, a different score is assigned by sampling from
the observed frequency distribution of the other categories.
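The three steps can be sketched as follows. This is an illustrative reading of the description, not the authors' code. The function name, the seed and the toy citizenship data are hypothetical, and the sketch assumes the variable has at least two categories.

```python
import numpy as np

def add_misclassification(values, rate=0.05, rng=None):
    """Return a copy of `values` in which a fraction `rate` is misclassified."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    noisy = values.copy()
    categories, counts = np.unique(values, return_counts=True)

    # Step 1: randomly select the cases that will be misclassified.
    n_flip = int(round(rate * len(values)))
    flip_idx = rng.choice(len(values), size=n_flip, replace=False)

    for i in flip_idx:
        # Step 2: identify the original score and exclude it.
        others = categories != values[i]
        # Step 3: sample a different score, proportional to the observed
        # frequencies of the remaining categories.
        probs = counts[others] / counts[others].sum()
        noisy[i] = rng.choice(categories[others], p=probs)
    return noisy

# Hypothetical toy variable with imbalanced categories.
truth = np.repeat(["Dutch", "EU", "non-EU"], [700, 200, 100])
indicator = add_misclassification(truth, rate=0.05,
                                  rng=np.random.default_rng(1))
```

Because replacement scores are drawn from the frequencies of the other categories, small categories receive relatively many wrongly assigned cases.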
For the register indicators, misclassification is generated only once. Because
these indicator variables represent register variables for the complete and
finite population, there should not be any variability in misclassification
between replications in the simulation study. The survey indicators represent
survey variables for a sample of the population. For these indicators,
misclassification is newly generated for every replication in the simulation
study, followed by generating missing values using either a Missing Completely
At Random (MCAR) or a Missing At Random (MAR) missingness mechanism, with
approximately 90% missingness in both situations. With an MCAR mechanism, the
response probabilities for respondents and non-respondents are equal. With a
MAR mechanism, the response probabilities are related to other observed values
(Rubin, 1976).
Missingness is generated in such a way that it mimics a
situation in which 10% of the population is included in the survey. Under MCAR,
the probability of being missing
(i.e. not being included in the survey) is 90% and equal for every person in
the population. Under MAR, the probability of being missing depends on a
person’s age and decreases as a person gets older. More specifically, the
probability of being missing is lowest, at 80%, for persons in the age category
“100 years and over”. This percentage gradually increases to its highest value
of 94% for persons in the age category “under 5 years”. To
summarize, for each of the 500 iterations in the simulation study, a simple
random sample (under MCAR) or a stratified sample (under MAR) of the combined
data-set is obtained that contains
approximately 269,147 persons (10% of the population), on which the LC model is
estimated.
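A minimal sketch of the two missingness mechanisms is given below. The text only states the end points of the MAR probabilities (94% for the youngest and 80% for the oldest age category) and that the change is gradual. The linear interpolation over the 21 age categories is therefore an assumption made purely for illustration, as are the function names and the coding of the age categories as integers 0 to 20.

```python
import numpy as np

def missing_probability(age_cat, mechanism, n_age_cats=21):
    """Probability of NOT being included in the survey.

    age_cat : integer codes 0 ("under 5 years") .. 20 ("100 years and over")
    """
    age_cat = np.asarray(age_cat, dtype=float)
    if mechanism == "MCAR":
        # Same probability for every person in the population.
        return np.full(age_cat.shape, 0.90)
    if mechanism == "MAR":
        # Assumed linear decrease from 0.94 (youngest) to 0.80 (oldest).
        return 0.94 - (0.94 - 0.80) * age_cat / (n_age_cats - 1)
    raise ValueError("mechanism must be 'MCAR' or 'MAR'")

def draw_survey_sample(age_cat, mechanism, rng):
    """Boolean mask: True where the survey indicator is observed."""
    p_miss = missing_probability(age_cat, mechanism)
    return rng.random(len(p_miss)) >= p_miss

rng = np.random.default_rng(2011)
ages = rng.integers(0, 21, size=100_000)   # hypothetical age codes
observed_mcar = draw_survey_sample(ages, "MCAR", rng)
observed_mar = draw_survey_sample(ages, "MAR", rng)
```

Under MCAR roughly 10% of the records remain observed. Under the assumed linear MAR scheme the observed fraction also depends on the age distribution of the population.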
3.4 Applying the MILC method
As discussed in Section 2, bootstrap samples are generated from the
combined dataset, and in this study the LC model is estimated only on the
complete set of observations of each bootstrap sample. Results are obtained
using 5, 10 and 20 bootstrap samples (the suffixes in the condition labels
listed in Section 3.5).
Figure 3.1 gives a graphical overview of the latent
class model. The latent variables “Gender”, “Family nucleus” and “Citizen” are each measured by two indicators.
The restriction on the relationship between “Age” and “Family nucleus” is denoted by “a” in
Figure 3.1: persons in the age categories “under
5 years”, “5 to 9 years” or “10 to 14 years” cannot be assigned
to the latent classes “partners” or “lone parents” of the latent variable “Family
nucleus”.

Description of Figure 3.1
Figure illustrating the Latent Class model. Variables Q (Age, Marital status and Place of birth) are the covariates, variables X (Gender, Family nucleus and Citizen) are the latent variables and variables Y are the observed target variables, i.e. the indicator variables. Edit restrictions are applied between the variables “Type of family nucleus” and “Age” (denoted in the model by “a”). Here, we restricted that if someone is of age category “under 5 years”, “5 to 9 years” or “10 to 14 years”, it is impossible to be assigned to the latent classes “partners” or “lone parents” for the latent variable “Family nucleus”.
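The effect of the edit restriction on imputation can be illustrated with a small sketch. In an LC model the restriction is imposed during estimation by fixing the relevant conditional probabilities to zero. The code below is a simplification that applies the same restriction afterwards to a vector of class membership probabilities and renormalises it. The function name and the example probabilities are hypothetical, and the sketch assumes at least one allowed class has positive probability.

```python
import numpy as np

# Latent classes of "Family nucleus" as used in this study (four classes).
CLASSES = ["partners", "lone parents", "sons/daughters", "not stated"]
CHILD_AGES = {"under 5 years", "5 to 9 years", "10 to 14 years"}
FORBIDDEN_FOR_CHILDREN = {"partners", "lone parents"}

def apply_edit_restriction(class_probs, age_category):
    """Zero out impossible classes for children and renormalise."""
    probs = np.array(class_probs, dtype=float)
    if age_category in CHILD_AGES:
        for k, label in enumerate(CLASSES):
            if label in FORBIDDEN_FOR_CHILDREN:
                probs[k] = 0.0
    return probs / probs.sum()

child = apply_edit_restriction([0.1, 0.1, 0.7, 0.1], "5 to 9 years")
adult = apply_edit_restriction([0.1, 0.1, 0.7, 0.1], "30 to 34 years")
```

Sampling imputations from the restricted probabilities can never assign a child to “partners” or “lone parents”, so the imputed data contain no impossible combinations.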
The LC model for a given response pattern is specified by filling in equation
2.1 for the three latent variables and their indicators used here. Note that
the latent variable “Family nucleus” has only four latent classes here, while the
variable “Family Nucleus” in the population census table has five categories.
It would therefore have made sense for this latent variable to have five latent
classes as well. However, there were no observations for the category “not
applicable”, so no latent class was needed for this category. The same holds
for the category “stateless” of the latent variable “Citizen”.
Next, multiple imputations can be created and estimates of
interest can be pooled as described in Sections 2.3 and 2.4. As the cells
of the frequency tables of interest can become very small, a log-transformation
is used to ensure appropriate confidence intervals around these small cells.
Therefore, the variance is not estimated on the original scale as in
equation 2.7, but on the log scale, based on the number of units in each cell
in each imputation.
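Pooling on the log scale can be sketched as follows. The pooling itself follows the standard multiple imputation rules (total variance equals the average within-imputation variance plus (1 + 1/m) times the between-imputation variance). Everything else is an assumption for illustration: the function name, the use of a normal quantile instead of a t reference distribution, and the 1/n approximation for the within-imputation variance of a log count in the usage example. The exact expressions of Section 2 are not reproduced here.

```python
import math

def pool_log_counts(counts, within_vars, z=1.96):
    """Pool cell counts from m imputations on the log scale.

    counts      : cell count in each imputation (all must be > 0)
    within_vars : within-imputation variance of log(count), per imputation
    """
    m = len(counts)
    logs = [math.log(n) for n in counts]
    q_bar = sum(logs) / m                                  # pooled log estimate
    w = sum(within_vars) / m                               # within variance
    b = sum((q - q_bar) ** 2 for q in logs) / (m - 1)      # between variance
    total_var = w + (1 + 1 / m) * b
    half_width = z * math.sqrt(total_var)
    return {
        "estimate": math.exp(q_bar),
        "total_var": total_var,
        # Back-transformed interval: always positive, asymmetric around
        # the estimate.
        "ci": (math.exp(q_bar - half_width), math.exp(q_bar + half_width)),
    }

# Hypothetical small cell observed in m = 5 imputations.
cell_counts = [12, 9, 15, 11, 10]
pooled = pool_log_counts(cell_counts, [1 / n for n in cell_counts])
```

The back-transformed interval cannot include negative counts, which is the practical reason for working on the log scale when cells are small.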
3.5 Evaluation
To evaluate how well the MILC method reconstructs the census table that was
initially used to create the misclassified data,
it is useful to compare its results with those obtained when the variable observed
in the register is used directly to create cross-tables. We refer to these
as the results obtained using the register indicator. These results are
identical over the 500 simulation
iterations, and the bias here directly reflects the misclassification in this
indicator, which, due to the misclassification mechanism, becomes more severe
as the categories are more imbalanced in size. Furthermore, it would be difficult
to draw general conclusions by evaluating each
of the 42,000 cells of the complete census table individually. Therefore, we
investigate some specific characteristics of this table separately. First, we
investigate whether the method is able to reconstruct the univariate marginal
cell frequencies of the latent variables specified. Second, we investigate whether the
method is able to reconstruct the joint distribution of the three latent
variables. Third, we investigate whether the method correctly incorporates edit
restrictions. Finally, we investigate some features of the complete census
table.
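First, we evaluate the cell proportions of the previously
discussed cross-tables in terms of bias, by evaluating the average absolute
bias and the root mean squared error (RMSE) over the replications in the simulation study. More
specifically, the bias of a cell frequency is calculated as the average, over
the replications, of the difference between the estimated and the true value.
Furthermore, the RMSE is calculated as the square root of the average, over the
replications, of the squared difference between the estimated and the true value.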
Second, results are evaluated in terms of variance. Here, it
is of interest to evaluate whether MILC correctly reflects the uncertainty due to
missing and conflicting values in the between-imputation variance, for both
univariate and multivariate cross-tables. Therefore, we investigate whether the
average of the estimated standard errors is approximately equal to the standard
deviation over the 500 estimates obtained from the 500 simulation replications
by evaluating their ratio (se/sd), which is calculated by dividing the average
SE by the SD. Here, SE is the square root of the estimate of the total
variance obtained after applying pooling rules (Rubin, 1976), and SD is
calculated as the standard deviation of the 500 estimates.
To account for small cell frequencies, the estimated and the true cell values
are considered on a log scale in equations 3.2,
3.3 and 3.4. To summarize, we denote the specific conditions evaluated in this
simulation study as MILC-MCAR-5, MILC-MCAR-10, MILC-MCAR-20, MILC-MAR-5,
MILC-MAR-10 and MILC-MAR-20.
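The three evaluation criteria can be computed for a single cell as in the sketch below. It uses the standard simulation definitions of bias, RMSE and the se/sd ratio on the log scale. The function name and the toy numbers are hypothetical, and the standard errors are assumed to be on the log scale already.

```python
import numpy as np

def evaluate_cell(estimates, standard_errors, true_value):
    """Bias, RMSE and se/sd ratio of one cell over simulation replications.

    estimates       : pooled cell estimate per replication (must be > 0)
    standard_errors : pooled standard error per replication (log scale)
    true_value      : cell value in the original census table (must be > 0)
    """
    log_est = np.log(np.asarray(estimates, dtype=float))
    errors = log_est - np.log(true_value)
    bias = errors.mean()
    rmse = np.sqrt((errors ** 2).mean())
    # Standard deviation of the estimates over the replications.
    sd = log_est.std(ddof=1)
    se_sd = np.mean(standard_errors) / sd
    return bias, rmse, se_sd

# Hypothetical toy input: two replications around a true value of 100.
est = 100 * np.exp(np.array([0.1, -0.1]))
se = np.full(2, np.sqrt(0.02))
bias, rmse, se_sd = evaluate_cell(est, se, true_value=100)
```

An se/sd ratio close to 1 indicates that the pooled standard errors reflect the actual variability of the estimates over the replications. Averaging the absolute value of the bias over cells gives the average absolute bias per table.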