Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 3. Simulation study

In this section, we describe a simulation study that is performed to evaluate the extensions of the MILC method in Section 2. The topic of this study is the estimation of a table from the Dutch Population and Housing Census.

3.1 The Dutch Census

Population and housing censuses provide a picture of the socio-demographic and socio-economic situation of a country, and it is essential that a census covers the entire population of people and dwellings present in a country. Every ten years the United Nations Economic and Social Council (ECOSOC) adopts a resolution urging Member States to carry out a population and housing census and to disseminate census results as an essential source of information; see e.g. The Economic and Social Council (2005). In the EU, explicit agreements have been made about which variables should be included in the census, and also which cross-tables should be produced (European Commission, 2008, 2009 and 2010).

The vast majority of countries produce census data by conducting a traditional census, which entails interviewing inhabitants in a complete enumeration, reaching every single household. An increasing number of countries, however, have adopted a different, innovative approach in the form of a so-called virtual census. With a virtual census, census tables are compiled using data sources that are already available at the statistical agency. These data sources have not been collected primarily for the census, but for other purposes. Statistics Netherlands can rely on population registers as the main source for most census tables. These registers are of relatively good quality and have very broad coverage (Geerdinck, Goedhuys-van der Linden, Hoogbruin, De Rijk, Sluiter and Verkleij, 2014). All register variables are available from Statistics Netherlands’ system of social statistical data-sets (Bakker, Van Rooijen and Van Toor, 2014). The backbone is the Central Population Register, which combines the population registers from municipalities. The population registers are supplemented with variables originating from sample surveys, because not all variables that are necessary according to the EU regulations can be found in the population registers.

For the 2001 and 2011 Dutch censuses, only two variables could not be measured from registers: Occupation and Educational Attainment (Schulte Nordholt, Van Zeijl and Hoeksma, 2014). These two variables were observed from combined Labour Force Surveys (LFSs). To obtain the required cross-tables for the 2011 Dutch census, a procedure was used in which all data sources were matched at the unit level. Then, a micro-integration process was carried out. Micro-integration brings together records from different micro-datasets and subsequently resolves data inconsistencies. The goal is to improve the quality, compatibility and scope of the data sets. The techniques used in micro-integration are completing, harmonising and correcting for measurement errors. Completing means that corrections are made for under- or overcoverage of a target population. Harmonisation refers to transformations such that data sets fit the concept that is supposed to be measured. Measurement correction means that inconsistencies between sources are resolved (Bakker, 2011; van Rooijen, Bloemendal and Krol, 2016). These inconsistencies are removed by using formal rules that make clear what happens in case of inconsistencies, e.g. which source is used (Bakker, 2010; de Waal, Pannekoek and Scholtus, 2011).

After micro-integration, two combined data sources were obtained: one based on a combination of registers and the other one based on a combination of sample surveys. All census tables that do not contain occupation and educational attainment were entirely compiled from the combined registers. The values in the cells of these tables were obtained by counting the occurrence of the categories in the matched registers. The other census tables, those with educational attainment and/or occupation, were estimated from the combined sample surveys. To establish consistent results, a procedure was applied based on weighting followed by macro integration (Daalmans, 2018; Schulte Nordholt et al., 2014). In the first step, weights were derived, such that the marginal totals of the weighted survey data comply with the known totals from the registers. The different tables that are obtained in this way are not necessarily consistent with each other, because different weighting schemes apply to each table. To resolve this problem, macro-integration is used. This step starts with initial estimates for each census table, derived from the weighted survey data or from the integrally counted register data. These initial estimates are adjusted, to arrive at fully consistent census tables, that comply with the known register totals.
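The weighting step can be illustrated with a minimal sketch. The text does not specify the weighting scheme at this level of detail, so the example below assumes simple post-stratification on a single variable; the function name `poststratification_weights` and the toy totals are hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_categories, register_totals):
    """Weight each sampled person so weighted category totals equal the register totals."""
    sample_counts = Counter(sample_categories)
    return [register_totals[c] / sample_counts[c] for c in sample_categories]

# Toy data (hypothetical): gender of six sampled persons and known register totals.
sample_gender = ["male", "male", "female", "female", "female", "female"]
register_totals = {"male": 500, "female": 520}

weights = poststratification_weights(sample_gender, register_totals)

# Check that the weighted survey margins reproduce the register totals.
weighted_totals = Counter()
for category, weight in zip(sample_gender, weights):
    weighted_totals[category] += weight
print(dict(weighted_totals))
```

Because each table may be weighted to a different set of margins, estimates for cells shared between tables can differ, which is the inconsistency that macro-integration resolves afterwards.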

MILC has a couple of advantages over the current estimation method. First, the assumption is often made that the population registers are free of error. If a variable is measured both in the population register and in a sample survey and the scores on these variables contradict each other, the register score usually overrides the survey score because of this assumption. In other words, sample survey data are ignored for the part that is also observed in a register. Second, for the current procedure, it is not easy to compute uncertainty measures that capture all steps of the estimation process, including the uncertainty due to the missing and conflicting values in the linked data-sets. For MILC on the other hand it is well-established how variances can be properly estimated. Third, the data processing procedure that is currently used contains a specific sequence of steps, where decisions made at one step are influenced by decisions made at previous steps. For instance, if there are two conflicting values for the same person, then one of these is chosen in the “micro-integration” step. In the subsequent weighting and macro integration steps only one value is used. Thus, the availability of the different values is ignored in the final estimation of the census tables. Basically, MILC exploits information provided by all observed values in contrast to the current procedure.

3.2 The census table under investigation

The starting point of this simulation study is an existing census table, which can be downloaded from Census Hub (Census Hub, 2017). This table comprises 2,691,477 persons who were living in the region “Noord-Holland” in the Netherlands in 2011. This census table is a cross-table between the following six variables:

Age in 21 categories: under 5 years; 5 to 9 years; 10 to 14 years; 15 to 19 years; 20 to 24 years; 25 to 29 years; 30 to 34 years; 35 to 39 years; 40 to 44 years; 45 to 49 years; 50 to 54 years; 55 to 59 years; 60 to 64 years; 65 to 69 years; 70 to 74 years; 75 to 79 years; 80 to 84 years; 85 to 89 years; 90 to 94 years; 95 to 99 years; 100 years and over.
Marital status in eight categories: never married; married; widowed; divorced; registered partnership; widow of registered partner; divorced from registered partner; not stated.
Gender in two categories: male; female.
Place of birth in five categories: the Netherlands; a country within the European Union; a country outside the European Union; other; not stated.
Type of family nucleus in which a person lives in five categories: partners; lone parents; sons/daughters; not stated; not applicable.
Country of citizenship in five categories: Dutch citizen; citizen of a country within the European Union; citizen of a country outside the European Union; stateless; not stated.

Thus, the census table consists of 42,000 cells.
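The cell count follows directly from multiplying the numbers of categories of the six variables. A short check, using only the category counts listed above:

```python
from math import prod

# Number of categories per variable, as listed above.
categories = {
    "Age": 21,
    "Marital status": 8,
    "Gender": 2,
    "Place of birth": 5,
    "Type of family nucleus": 5,
    "Country of citizenship": 5,
}

n_cells = prod(categories.values())
print(n_cells)  # 42000
```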

3.3 Simulation setup

The goal of this simulation study is to replicate the frequencies of the 42,000 cells in the cross-table using multiple indicators contaminated with misclassification and missing values. Therefore, this misclassification should be induced first.

We generate two indicator variables for three different latent variables, all containing 5% random misclassification, which can be considered a very high amount, especially for Dutch population registers. The indicator variables are generated for the variables “Gender”, “Type of family nucleus” and “Country of citizenship”. Misclassification is generated in such a way that first, 5% of the cases are randomly selected. Second, their original score is identified and third, a different score is assigned by sampling from the observed frequency distribution of the other categories.
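The three steps of the misclassification mechanism can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name `misclassify`, the toy category labels and the toy frequencies are hypothetical.

```python
import random
from collections import Counter

def misclassify(values, rate=0.05, rng=None):
    """Return a copy of `values` in which a fraction `rate` of cases gets a different category."""
    if rng is None:
        rng = random.Random()
    counts = Counter(values)  # observed frequency distribution
    result = list(values)
    n_flip = round(rate * len(values))
    # Step 1: randomly select the cases to be misclassified.
    for idx in rng.sample(range(len(values)), n_flip):
        # Step 2: identify the original score.
        original = values[idx]
        # Step 3: draw a different score, proportional to the observed
        # frequencies of the other categories.
        others = [c for c in counts if c != original]
        weights = [counts[c] for c in others]
        result[idx] = rng.choices(others, weights=weights, k=1)[0]
    return result

rng = random.Random(2011)
# Toy "true" variable (hypothetical frequencies).
true_citizenship = ["Dutch"] * 8000 + ["EU"] * 1200 + ["non-EU"] * 800
indicator = misclassify(true_citizenship, rate=0.05, rng=rng)

n_changed = sum(t != y for t, y in zip(true_citizenship, indicator))
print(n_changed / len(true_citizenship))  # 0.05
```

Since replacement scores are drawn in proportion to the sizes of the other categories, small categories receive relatively many misclassified cases, which makes the resulting bias more severe for imbalanced variables.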

For the register indicators $Y_{l_{v}, 1},$ misclassification is generated only once. Because these indicator variables represent register variables for the complete and finite population, there should not be any variability in misclassification between replications in the simulation study for these variables. For the survey indicators $Y_{l_{v}, 2},$ misclassification is newly generated for every replication in the simulation study, followed by generating missing values using either a Missing Completely At Random (MCAR) or a Missing At Random (MAR) missingness mechanism, with approximately 90% missingness in both situations. With an MCAR mechanism, the response probabilities of respondents and non-respondents are equal. With a MAR mechanism, the response probabilities are related to other observed values (Rubin, 1976). These $Y_{l_{v}, 2}$ indicators represent survey variables for a sample of the population.

Missingness is generated in such a way that it mimics a situation in which 10% of the population is included in the survey. Missingness is generated under MCAR and MAR. Under MCAR, the probability of being missing (i.e. not being included in the survey) is 90% and equal for every person in the population. Under MAR, the probability of being missing depends on a person’s age and decreases as a person gets older. More specifically, the probability of being missing is lowest for persons in the age category “100 years and over”, where it is 80%. This percentage gradually increases with decreasing age, reaching its highest value of 94% for persons in the age category “under 5 years”. To summarize, for each of the 500 iterations in the simulation study, a simple random or stratified sample of the combined data-set is obtained that contains approximately 269,147 persons (10% of the population), on which the LC model is estimated.
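A minimal sketch of the two missingness mechanisms is given below. Only the end points of the MAR schedule (94% for the youngest and 80% for the oldest age category) are stated in the text; the linear decrease between them, the uniform toy age distribution and all function names are assumptions made for illustration.

```python
import random

N_AGE = 21  # age categories: index 0 = "under 5 years", index 20 = "100 years and over"

def p_missing_mcar(age_index):
    """MCAR: the same probability of being missing for everyone."""
    return 0.90

def p_missing_mar(age_index):
    """MAR: probability depends on age; a linear decrease from 0.94 to 0.80 is assumed."""
    return 0.94 - 0.14 * age_index / (N_AGE - 1)

def apply_missingness(indicator, ages, p_missing, rng):
    """Set a survey indicator to None with a probability that may depend on age."""
    return [None if rng.random() < p_missing(a) else y
            for y, a in zip(indicator, ages)]

rng = random.Random(1)
n = 20000
# Toy population (hypothetical): ages uniform over the categories.
ages = [rng.randrange(N_AGE) for _ in range(n)]
survey_indicator = ["male" if rng.random() < 0.5 else "female" for _ in range(n)]

observed_mcar = apply_missingness(survey_indicator, ages, p_missing_mcar, rng)
observed_mar = apply_missingness(survey_indicator, ages, p_missing_mar, rng)

share_missing_mcar = sum(v is None for v in observed_mcar) / n
share_missing_mar = sum(v is None for v in observed_mar) / n
print(round(share_missing_mcar, 2), round(share_missing_mar, 2))
```

With a realistic age distribution, in which the youngest categories are far larger than the oldest, the overall share of missing values under this MAR schedule would be closer to 90% than in the uniform toy population used here.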

3.4 Applying the MILC method

As discussed in Section 2, $M$ bootstrap samples are generated from the combined dataset, and in this study the LC model is estimated only on the complete set of observations of each bootstrap sample. Results are obtained using $M = 5,$ $M = 10$ and $M = 20.$
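As an illustration of this step, the sketch below draws $M$ nonparametric bootstrap samples and keeps only the records for which the survey indicator is observed. The record layout (a dictionary with keys `register` and `survey`) and the function name are hypothetical, and the LC model estimation itself is not shown.

```python
import random

def bootstrap_complete_cases(records, m, rng):
    """Draw m bootstrap samples (with replacement) and keep only complete records."""
    n = len(records)
    samples = []
    for _ in range(m):
        draw = [records[rng.randrange(n)] for _ in range(n)]
        # The LC model would be estimated on these complete cases only.
        samples.append([r for r in draw if r["survey"] is not None])
    return samples

rng = random.Random(3)
# Toy combined data-set (hypothetical): survey value observed for 10% of records.
records = [{"register": "male", "survey": "male" if i % 10 == 0 else None}
           for i in range(1000)]

samples = bootstrap_complete_cases(records, m=5, rng=rng)
print([len(s) for s in samples])
```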

In Figure 3.1, the graphical overview of the latent class model can be found. Here, it can be seen that the latent variables $X_{1}$ “Gender”, $X_{2}$ “Family nucleus” and $X_{3}$ “Citizen” are all measured by two indicators. The restriction on the relationship between $Q_{1}$ “Age” and $X_{2}$ “Family nucleus” is denoted by “a” in Figure 3.1. Here, we restricted that if someone is of age category “under 5 years”, “5 to 9 years” or “10 to 14 years”, it is impossible to be assigned to the latent classes “partners” or “lone parents” for the latent variable “Family nucleus”.

Description of Figure 3.1

Figure illustrating the Latent Class model. Variables Q (Age, Marital status and Place of birth) are the covariates, variables X (Gender, Family nucleus and Citizen) are the latent variables and variables Y are the observed target variables, i.e. the indicator variables. Edit restrictions are applied between the variables “Type of family nucleus” and “Age” (denoted in the model by “a”). Here, we restricted that if someone is of age category “under 5 years”, “5 to 9 years” or “10 to 14 years”, it is impossible to be assigned to the latent classes “partners” or “lone parents” for the latent variable “Family nucleus”.
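The effect of the edit restriction can be sketched as follows. In the LC model the restriction is imposed on the model itself; the renormalization below only illustrates that the impossible classes receive zero probability for the three youngest age categories. The function names and the toy probabilities are hypothetical.

```python
YOUNG_AGES = {"under 5 years", "5 to 9 years", "10 to 14 years"}
FORBIDDEN_FOR_YOUNG = {"partners", "lone parents"}

def violates_edit(age, family_nucleus):
    """True if the combination of age and family nucleus is impossible."""
    return age in YOUNG_AGES and family_nucleus in FORBIDDEN_FOR_YOUNG

def restrict_probabilities(age, class_probs):
    """Set the probability of impossible classes to zero and renormalize the rest."""
    restricted = {c: 0.0 if violates_edit(age, c) else p
                  for c, p in class_probs.items()}
    total = sum(restricted.values())
    return {c: p / total for c, p in restricted.items()}

# Toy class probabilities (hypothetical) for a person aged 5 to 9 years.
probs = {"partners": 0.1, "lone parents": 0.1, "sons/daughters": 0.7, "not stated": 0.1}
restricted = restrict_probabilities("5 to 9 years", probs)
print(restricted)
```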

To specify the LC model for response pattern $P (Y = y | Q = q)$ we substitute the following values into equation 2.1: $v = 3,$ $K_{1} = 2,$ $K_{2} = 4,$ $K_{3} = 4,$ $L_{1} = 2,$ $L_{2} = 2$ and $L_{3} = 2.$ Note that $X_{2}$ here has only four latent classes, while the variable “Family nucleus” in the population census table has five categories. It would therefore have made sense for $X_{2}$ to have five latent classes as well. However, there were no observations for the category “not applicable”, so we did not have to include a latent class for this category. The same holds for the category “stateless” of $X_{3} .$

Next, multiple imputations can be created and estimates of interest can be pooled as described in Sections 2.3 and 2.4. As the cell frequencies of the tables of interest can become very small, a log-transformation is used to ensure appropriate confidence intervals around these small cells. Therefore, ${VAR}_{{between}_{j}}$ is not estimated as the variance of ${\hat{θ}}_{i j},$ as in equation 2.7, but as the variance of $log ({\hat{θ}}_{i j}),$ where ${\hat{θ}}_{i j}$ refers to the number of units in cell $j$ in imputation $i .$
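A minimal sketch of the between-imputation variance on the log scale is given below. It assumes that equation 2.7 is the usual between-imputation variance with denominator $M - 1,$ and that all cell counts are positive so that the logarithm is defined; the function name and the toy counts are hypothetical.

```python
import math
from statistics import variance

def between_variance_log(cell_counts):
    """Between-imputation variance of log(count) for one cell over M imputations."""
    logs = [math.log(c) for c in cell_counts]
    return variance(logs)  # sample variance, denominator M - 1

# Toy counts (hypothetical) of cell j in M = 5 imputed data-sets.
counts_cell_j = [12, 15, 11, 14, 13]

var_between_raw = variance(counts_cell_j)
var_between_log = between_variance_log(counts_cell_j)
print(var_between_raw, round(var_between_log, 4))
```

A confidence interval built on the log scale and transformed back with the exponential function cannot include negative counts, which is the reason for using the transformation with small cells.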

3.5 Evaluation

To evaluate the performance of the MILC method when trying to construct the census table initially used to create the misclassified data, it is useful to make comparisons to results obtained when the variable observed in the register is used directly to create cross-tables. We refer to these results as obtained using $Y_{v, 1} .$ These results are equal over the 500 simulation iterations and the bias here directly reflects the misclassification in this indicator, which becomes more severe as the categories are more imbalanced in size due to the misclassification mechanism. Furthermore, it would be difficult to draw general conclusions from results obtained by only evaluating every single one of the 42,000 cells of the complete census table. Therefore, we investigate some specific characteristics of this table separately. First, we investigate whether the method is able to reconstruct the univariate marginal cell frequencies of the latent variables specified. Second, we investigate whether the method is able to reconstruct the joint distribution of the three latent variables. Third, we investigate whether the method correctly incorporates edit restrictions. Finally, we investigate some features of the complete census table.

First, we evaluate the cell-proportions of the previously discussed cross-tables in terms of bias, by evaluating the average absolute bias and the root mean squared error (RMSE) over the $N_{i t} = 500$ replications in the simulation study. More specifically, the bias of a cell frequency $θ_{j}$ is calculated as

${bias}_{θ_{j}} = \frac{\sum_{i t =1}^{N_{i t}} (θ_{j} - {\hat{θ}}_{j_{i t}})}{N_{i t}} . (3.1)$

Furthermore, the RMSE is calculated as

$RMSE = \sqrt{\frac{\sum_{i t =1}^{N_{i t}} {(θ_{j} - {\hat{θ}}_{j_{i t}})}^{2}}{N_{i t}}} . (3.2)$
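Both measures can be sketched in a few lines. The bias follows equation 3.1; the RMSE is computed in its usual form, as the square root of the mean squared deviation over the replications. The function names and the toy estimates are hypothetical, and the values are kept on the original scale for readability.

```python
import math

def bias(theta, estimates):
    """Average of (true value - estimate) over the replications, as in equation 3.1."""
    return sum(theta - est for est in estimates) / len(estimates)

def rmse(theta, estimates):
    """Square root of the mean squared deviation over the replications."""
    return math.sqrt(sum((theta - est) ** 2 for est in estimates) / len(estimates))

# Toy example (hypothetical): true cell frequency and four replicated estimates.
theta_j = 100
estimates_j = [97, 99, 101, 99]

print(bias(theta_j, estimates_j))            # 1.0
print(round(rmse(theta_j, estimates_j), 4))  # 1.7321
```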

Second, results are evaluated in terms of variance. Here, it is of interest to evaluate whether MILC correctly reflects the uncertainty due to missing and conflicting values in the between-imputation variance for both univariate and multivariate cross-tables. Therefore, we investigate whether the average of the estimated standard errors is approximately equal to the standard deviation over the 500 estimates obtained from the 500 simulation replications by evaluating their ratio, which is calculated by

$\frac{[\sum_{i t = 1}^{N_{i t}} SE ({\hat{θ}}_{j_{i t}}) / N_{i t}]}{SD ({\hat{θ}}_{j_{i t}})}, (3.3)$

where SE is the square root of the estimate of the total variance obtained after applying pooling rules (Rubin, 1976) and $SD ({\hat{θ}}_{j_{i t}})$ is calculated as

$SD ({\hat{θ}}_{j_{i t}}) = \sqrt{\frac{\sum_{i t = 1}^{N_{i t}} {({\hat{θ}}_{j_{i t}} - {\bar{θ}}_{j_{i t}})}^{2}}{N_{i t}}} . (3.4)$

To account for small cell frequencies, ${\hat{θ}}_{j_{i t}}$ and ${\bar{θ}}_{j_{i t}}$ are considered on a log scale in equations 3.2, 3.3 and 3.4. To summarize, we denote the specific conditions evaluated in this simulation study as $Y_{v, 1},$ MILC-MCAR-5, MILC-MCAR-10, MILC-MCAR-20, MILC-MAR-5, MILC-MAR-10 and MILC-MAR-20.
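The ratio of equation 3.3 can be sketched as below, with the standard deviation computed as in equation 3.4 (denominator $N_{i t}$). The function name is hypothetical, and the inputs are assumed to be already on the scale of interest, for example the log scale.

```python
import math
from statistics import mean

def se_sd_ratio(estimates, standard_errors):
    """Average estimated SE divided by the SD of the estimates over the replications."""
    theta_bar = mean(estimates)
    sd = math.sqrt(sum((e - theta_bar) ** 2 for e in estimates) / len(estimates))
    return mean(standard_errors) / sd

# Toy example (hypothetical): two replications with estimates 1 and 3, each with SE 1.
ratio = se_sd_ratio([1.0, 3.0], [1.0, 1.0])
print(ratio)  # 1.0
```

A ratio close to 1 indicates that the pooled standard errors reflect the actual variability of the estimates over the replications; values below 1 indicate that the uncertainty is underestimated, values above 1 that it is overestimated.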

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-06-21
