Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 3. Simulation study

In this section, we describe a simulation study performed to evaluate the extensions of the MILC method described in Section 2. The topic of this study is the estimation of a table from the Dutch Population and Housing Census.

3.1  The Dutch Census

Population and housing censuses provide a picture of the socio-demographic and socio-economic situation of a country, and it is generally accepted that a census should cover the entire population of people and dwellings present in a country. Every ten years the United Nations Economic and Social Council (ECOSOC) adopts a resolution urging Member States to carry out a population and housing census and to disseminate the census results as an essential source of information; see e.g. The Economic and Social Council (2005). In the EU, explicit agreements have been made about which variables should be listed in the census and which cross-tables should be produced (European Commission, 2008, 2009 and 2010).

The vast majority of countries produce census data by conducting a traditional census, which entails interviewing inhabitants in a complete enumeration, reaching every single household. An increasing number of countries, however, have adopted a different, innovative approach in the form of a so-called virtual census. With a virtual census, census tables are compiled using data sources that are already available at the statistical agency. These data sources were not primarily collected for the census, but for other purposes. Statistics Netherlands can rely on population registers as the main source for most census tables. These registers are of relatively good quality, including a very broad coverage (Geerdinck, Goedhuys-van der Linden, Hoogbruin, De Rijk, Sluiter and Verkleij, 2014). All register variables are available from Statistics Netherlands’ system of social statistical data-sets (Bakker, Van Rooijen and Van Toor, 2014). The backbone is the Central Population Register, which combines the population registers from municipalities. The population registers are supplemented with variables originating from sample surveys, because not all variables that are necessary according to the EU regulations can be found in the population registers.

For the 2001 and 2011 Dutch censuses, only two variables could not be measured from registers: Occupation and Educational Attainment (Schulte Nordholt, Van Zeijl and Hoeksma, 2014). These two variables were observed from combined Labour Force Surveys (LFSs). To obtain the required cross-tables for the 2011 Dutch census, a procedure was used in which all data sources were matched at the unit level. Then, a micro-integration process was carried out. Micro-integration brings together records from different micro-datasets and subsequently resolves data inconsistencies. The goal is to improve the quality, compatibility and scope of the data sets. The techniques used in micro-integration are completing, harmonising and correcting for measurement errors. Completing means that corrections are made for under- or overcoverage of a target population. Harmonisation refers to transformations that make the data sets fit the concept that is supposed to be measured. Measurement correction means that inconsistencies between sources are resolved (Bakker, 2011; van Rooijen, Bloemendal and Krol, 2016). These inconsistencies are removed by applying formal rules that specify what happens when sources disagree, e.g. which source is used (Bakker, 2010; de Waal, Pannekoek and Scholtus, 2011).

After micro-integration, two combined data sources were obtained: one based on a combination of registers and the other based on a combination of sample surveys. All census tables that do not contain occupation or educational attainment were compiled entirely from the combined registers. The values in the cells of these tables were obtained by counting the occurrence of the categories in the matched registers. The other census tables, those with educational attainment and/or occupation, were estimated from the combined sample surveys. To establish consistent results, a procedure was applied based on weighting followed by macro-integration (Daalmans, 2018; Schulte Nordholt et al., 2014). In the first step, weights were derived such that the marginal totals of the weighted survey data comply with the known totals from the registers. The different tables obtained in this way are not necessarily consistent with each other, because different weighting schemes apply to each table. To resolve this problem, macro-integration is used. This step starts with initial estimates for each census table, derived from the weighted survey data or from the integrally counted register data. These initial estimates are adjusted to arrive at fully consistent census tables that comply with the known register totals.
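The weighting step can be pictured with a minimal post-stratification sketch in Python. It handles a single margin only; the function name, the toy data and the totals are hypothetical, and the actual census procedure (several margins, followed by macro-integration) is more involved than what is shown here.

```python
from collections import Counter

def poststratification_weights(sample_categories, register_totals):
    """Return one weight per sampled person so that the weighted sample
    count of each category equals the known register total."""
    sample_counts = Counter(sample_categories)
    return [register_totals[c] / sample_counts[c] for c in sample_categories]

# Hypothetical toy data: three sampled persons, register totals per category.
sample = ["male", "male", "female"]
totals = {"male": 100, "female": 120}
weights = poststratification_weights(sample, totals)

# Weighted counts per category reproduce the register totals.
weighted = Counter()
for category, w in zip(sample, weights):
    weighted[category] += w
```

With more than one margin, the same idea is usually applied iteratively over the margins until all totals agree.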

MILC has several advantages over the current estimation method. First, the assumption is often made that the population registers are free of error. If a variable is measured both in the population register and in a sample survey and the scores on these variables contradict each other, the register score usually overrides the survey score because of this assumption. In other words, sample survey data are ignored for the part that is also observed in a register. Second, for the current procedure, it is not easy to compute uncertainty measures that capture all steps of the estimation process, including the uncertainty due to the missing and conflicting values in the linked data-sets. For MILC, on the other hand, it is well established how variances can be properly estimated. Third, the data processing procedure that is currently used consists of a specific sequence of steps, where decisions made at one step are influenced by decisions made at previous steps. For instance, if there are two conflicting values for the same person, then one of these is chosen in the “micro-integration” step. In the subsequent weighting and macro-integration steps only that one value is used. Thus, the availability of the different values is ignored in the final estimation of the census tables. In contrast to the current procedure, MILC exploits the information provided by all observed values.

3.2  The census table under investigation

The starting point of this simulation study is an existing census table, which can be downloaded from Census Hub (Census Hub, 2017). This table comprises 2,691,477 persons who were living in the region “Noord-Holland” in the Netherlands in 2011. This census table is a cross-table of the following six variables:

  1. Age in 21 categories: under 5 years; 5 to 9 years; 10 to 14 years; 15 to 19 years; 20 to 24 years; 25 to 29 years; 30 to 34 years; 35 to 39 years; 40 to 44 years; 45 to 49 years; 50 to 54 years; 55 to 59 years; 60 to 64 years; 65 to 69 years; 70 to 74 years; 75 to 79 years; 80 to 84 years; 85 to 89 years; 90 to 94 years; 95 to 99 years; 100 years and over.
  2. Marital status in eight categories: never married; married; widowed; divorced; registered partnership; widow of registered partner; divorced from registered partner; not stated.
  3. Gender in two categories: male; female.
  4. Place of birth in five categories: the Netherlands; a country within the European Union; a country outside the European Union; other; not stated.
  5. Type of family nucleus in which a person lives in five categories: partners; lone parents; sons/daughters; not stated; not applicable.
  6. Country of citizenship in five categories: Dutch citizen; citizen of a country within the European Union; citizen of a country outside the European Union; stateless; not stated.

Thus, the census table consists of 42,000 cells.
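The cell count follows from multiplying the numbers of categories of the six variables. A short Python check, with dictionary keys chosen here only for illustration:

```python
from math import prod

# Number of categories per variable, as listed above.
categories = {
    "age": 21,
    "marital_status": 8,
    "gender": 2,
    "place_of_birth": 5,
    "family_nucleus": 5,
    "citizenship": 5,
}

n_cells = prod(categories.values())
print(n_cells)  # 42000
```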

3.3  Simulation setup

The goal of this simulation study is to replicate the frequencies of the 42,000 cells in the cross-table using multiple indicators contaminated with misclassification and missing values. Therefore, this misclassification should be induced first.

We generate two indicator variables for each of three latent variables, all containing 5% random misclassification, which can be considered a very high amount, especially for Dutch population registers. The indicator variables are generated for the variables “Gender”, “Type of family nucleus” and “Country of citizenship”. Misclassification is generated in three steps. First, 5% of the cases are randomly selected. Second, their original score is identified. Third, a different score is assigned by sampling from the observed frequency distribution of the other categories.
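The three steps of the misclassification mechanism can be sketched as follows. This is an illustrative Python sketch, not the code used in the study; the function name `misclassify`, the seed and the toy data are assumptions.

```python
import random
from collections import Counter

def misclassify(values, rate=0.05, rng=None):
    """Return a copy of `values` in which a random share `rate` of the cases
    receives a different category, drawn from the observed frequency
    distribution of the remaining categories."""
    rng = rng or random.Random()
    freq = Counter(values)                  # observed category frequencies
    out = list(values)
    n_flip = round(rate * len(values))
    for i in rng.sample(range(len(values)), n_flip):    # step 1: select cases
        original = values[i]                            # step 2: original score
        others = [c for c in freq if c != original]
        if not others:                                  # only one category observed
            continue
        weights = [freq[c] for c in others]
        out[i] = rng.choices(others, weights=weights, k=1)[0]  # step 3: new score
    return out

# Hypothetical toy population of 1,000 persons.
rng = random.Random(2011)
truth = ["male"] * 490 + ["female"] * 510
register = misclassify(truth, rate=0.05, rng=rng)
n_changed = sum(a != b for a, b in zip(truth, register))
```

Because the new score is always drawn from the other categories, exactly 5% of the cases differ from their original score in this sketch.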

For the register indicators $Y_{l_v,1}$, misclassification is generated only once: because these indicator variables represent register variables for the complete and finite population, there should not be any variability in misclassification between replications in the simulation study for these variables. For the survey indicators $Y_{l_v,2}$, misclassification is newly generated for every replication in the simulation study, followed by generating missing values using either a Missing Completely At Random (MCAR) or a Missing At Random (MAR) missingness mechanism, with approximately 90% missingness in both situations. With an MCAR mechanism, the response probabilities for the respondents and non-respondents are equal. With a MAR mechanism, the response probabilities are related to other observed values (Rubin, 1976). These $Y_{l_v,2}$ indicators represent survey variables for a sample of the population.

Missingness is generated in such a way that it mimics a situation in which 10% of the population is included in the survey. Missingness is generated under MCAR and MAR. Under MCAR, the probability of being missing (i.e. not being included in the survey) is 90% and equal for every person in the population. Under MAR, the probability of being missing depends on a person’s age and decreases as a person gets older. More specifically, the probability of being missing is lowest, at 80%, for persons in the age category “100 years and over”. This percentage gradually increases for younger age categories, up to a maximum of 94% for persons in the age category “under 5 years”. To summarize, for each of the 500 iterations in the simulation study, a simple random or stratified sample of the combined data-set is obtained that contains approximately 269,147 persons (10% of the population), on which the LC model is estimated.
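The two missingness mechanisms can be illustrated with the Python sketch below. Only the endpoints of the MAR probabilities (94% and 80%) are given in the text; the linear interpolation over the 21 age categories and the use of independent draws per person are assumptions made for illustration, as are all names.

```python
import random

N_AGE = 21  # age categories; index 0 = "under 5 years", 20 = "100 years and over"

def p_missing_mar(age_idx, p_youngest=0.94, p_oldest=0.80):
    """Missingness probability by age category.
    ASSUMPTION: linear interpolation between the two endpoints given in the
    text; the intermediate values are not stated in the source."""
    return p_youngest + (p_oldest - p_youngest) * age_idx / (N_AGE - 1)

def apply_missingness(survey, ages, mechanism, rng):
    """Set survey scores to None according to an MCAR or MAR mechanism."""
    out = []
    for score, age_idx in zip(survey, ages):
        p = 0.90 if mechanism == "MCAR" else p_missing_mar(age_idx)
        out.append(None if rng.random() < p else score)
    return out

# Hypothetical toy population with uniformly distributed age categories.
rng = random.Random(1)
ages = [rng.randrange(N_AGE) for _ in range(20000)]
survey = ["observed"] * len(ages)
mcar = apply_missingness(survey, ages, "MCAR", rng)
mar = apply_missingness(survey, ages, "MAR", rng)
share_missing_mcar = mcar.count(None) / len(mcar)
```

Under MCAR the share of missing values is close to 90% regardless of age, whereas under MAR it differs between age categories.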

3.4  Applying the MILC method

As discussed in Section 2, $M$ bootstrap samples are generated from the combined dataset, and in this study the LC model is estimated only on the complete set of observations of each bootstrap sample. Results are obtained using $M = 5$, $M = 10$ and $M = 20$.
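As a rough illustration of this step, the Python sketch below draws bootstrap samples and keeps only the records without missing values. Reading “the complete set of observations” as complete cases is an interpretation, and the function name and toy records are hypothetical; estimation of the LC model itself is not shown.

```python
import random

def bootstrap_complete_cases(records, m, rng):
    """Draw m bootstrap samples (with replacement, same size as `records`)
    and keep only the records without missing values in each sample."""
    n = len(records)
    samples = []
    for _ in range(m):
        draw = [records[rng.randrange(n)] for _ in range(n)]
        samples.append([r for r in draw if None not in r])
    return samples

# Hypothetical records: (register indicator, survey indicator); None = missing.
records = [("male", "male"), ("female", None), ("female", "female"), ("male", None)]
samples = bootstrap_complete_cases(records, m=5, rng=random.Random(0))
```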

Figure 3.1 gives a graphical overview of the latent class model. Here, it can be seen that the latent variables $X_1$ “Gender”, $X_2$ “Family nucleus” and $X_3$ “Citizen” are all measured by two indicators. The restriction on the relationship between $Q_1$ “Age” and $X_2$ “Family nucleus” is denoted by “a” in Figure 3.1. The restriction specifies that a person in the age category “under 5 years”, “5 to 9 years” or “10 to 14 years” cannot be assigned to the latent classes “partners” or “lone parents” of the latent variable “Family nucleus”.
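The effect of this edit restriction on class membership probabilities can be pictured with the Python sketch below. It only illustrates the outcome (zero probability for the impossible classes, with the remaining probabilities renormalised); how the restriction is specified inside the LC estimation itself is not shown here, and the function name and example probabilities are hypothetical.

```python
RESTRICTED_AGES = {"under 5 years", "5 to 9 years", "10 to 14 years"}
IMPOSSIBLE_CLASSES = {"partners", "lone parents"}

def restrict(probs, age):
    """Zero out the impossible latent classes for young ages and renormalise."""
    if age not in RESTRICTED_AGES:
        return dict(probs)
    kept = {c: (0.0 if c in IMPOSSIBLE_CLASSES else p) for c, p in probs.items()}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no admissible class has positive probability")
    return {c: p / total for c, p in kept.items()}

# Hypothetical class membership probabilities for one person.
probs = {"partners": 0.1, "lone parents": 0.1, "sons/daughters": 0.7, "not stated": 0.1}
child = restrict(probs, "5 to 9 years")
adult = restrict(probs, "30 to 34 years")
```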

Description of Figure 3.1

Figure illustrating the Latent Class model. Variables Q (Age, Marital status and Place of birth) are the covariates, variables X (Gender, Family nucleus and Citizen) are the latent variables and variables Y are the observed target variables, i.e. the indicator variables. Edit restrictions are applied between the variables “Type of family nucleus” and “Age” (denoted in the model by “a”). Here, we restricted that if someone is of age category “under 5 years”, “5 to 9 years” or “10 to 14 years”, it is impossible to be assigned to the latent classes “partners” or “lone parents” for the latent variable “Family nucleus”.

To specify the LC model for response pattern $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Q} = \mathbf{q})$, we can fill in at equation 2.1 that $v = 3$, $K_1 = 2$, $K_2 = 4$, $K_3 = 4$, $L_1 = 2$, $L_2 = 2$ and $L_3 = 2$. Note that $X_2$ here has only four latent classes, while the variable “Family nucleus” in the population census table has five categories. It would therefore have made sense for $X_2$ to also have five latent classes. However, there were no observations for the category “not applicable”, so we did not need to include a latent class for this category. The same holds for the category “stateless” of $X_3$.

Next, multiple imputations can be created and estimates of interest can be pooled as described in Sections 2.3 and 2.4. As the cells of the frequency-tables of interest can become very small, a log-transformation is used to ensure appropriate confidence intervals around these small cells. Therefore, $\mathrm{VAR}_{\mathrm{between}_j}$ is not estimated as the variance of $\hat{\theta}_{ij}$, as in equation 2.7, but as the variance of $\log(\hat{\theta}_{ij})$, where $\hat{\theta}_{ij}$ refers to the number of units in cell $j$ in imputation $i$.
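The log-scale between-imputation variance can be sketched in Python as follows. The sample variance with divisor $M - 1$ and the combination rule for the total variance follow the usual statement of Rubin's rules; whether this matches equation 2.7 term by term is an assumption, and the within-imputation variance is taken as given because its formula is not reproduced in this section. All names and numbers are hypothetical.

```python
import math
from statistics import variance

def between_variance_log(counts):
    """Between-imputation variance of log(theta_hat_ij) over the M imputations
    for one cell j. Uses the sample variance (divisor M - 1); counts must be
    strictly positive for the logarithm to exist."""
    return variance([math.log(c) for c in counts])

def total_variance(within, between, m):
    """Rubin's rules: total = within + (1 + 1/M) * between."""
    return within + (1 + 1 / m) * between

counts = [10, 20]                      # hypothetical counts for one cell, M = 2
b = between_variance_log(counts)
t = total_variance(within=0.05, between=b, m=len(counts))
```

Cells with a count of zero in some imputation need separate handling, which is not covered by this sketch.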

3.5  Evaluation

To evaluate the performance of the MILC method when trying to construct the census table initially used to create the misclassified data, it is useful to make comparisons to results obtained when the variable observed in the register is used directly to create cross-tables. We refer to these results as obtained using $Y_{v,1}$. These results are equal over the 500 simulation iterations, and the bias here directly reflects the misclassification in this indicator, which becomes more severe as the categories are more imbalanced in size due to the misclassification mechanism. Furthermore, it would be difficult to draw general conclusions from results obtained by only evaluating every single one of the 42,000 cells of the complete census table. Therefore, we investigate some specific characteristics of this table separately. First, we investigate whether the method is able to reconstruct the univariate marginal cell frequencies of the latent variables specified. Second, we investigate whether the method is able to reconstruct the joint distribution of the three latent variables. Third, we investigate whether the method correctly incorporates edit restrictions. Finally, we investigate some features of the complete census table.

First, we evaluate the cell-proportions of the previously discussed cross-tables in terms of bias, by evaluating the average absolute bias and the root mean squared error (RMSE) over the $N_{it} = 500$ replications in the simulation study. More specifically, the bias of a cell frequency $\theta_j$ is calculated as

$$\mathrm{bias}_{\theta_j} = \frac{\sum_{it=1}^{N_{it}} \left( \theta_j - \hat{\theta}_{j_{it}} \right)}{N_{it}}. \tag{3.1}$$

Furthermore, the RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{it=1}^{N_{it}} \left( \theta_j - \hat{\theta}_{j_{it}} \right)^2}{N_{it}}}. \tag{3.2}$$
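Both measures are straightforward to compute for a single cell. The Python sketch below follows equation 3.1 for the bias and the usual root-mean-squared-error definition for the RMSE; the function names and toy values are illustrative only.

```python
import math

def bias(theta, estimates):
    """Equation 3.1: mean of (theta_j - theta_hat_j_it) over the replications."""
    return sum(theta - est for est in estimates) / len(estimates)

def rmse(theta, estimates):
    """Square root of the mean squared difference between the true value and
    the estimates over the replications."""
    return math.sqrt(sum((theta - est) ** 2 for est in estimates) / len(estimates))

# Hypothetical true cell frequency and two replicated estimates.
theta_j = 10
estimates = [8, 12]
b = bias(theta_j, estimates)
r = rmse(theta_j, estimates)
```

In this toy example the errors cancel in the bias but not in the RMSE, which is why both measures are reported.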

Second, results are evaluated in terms of variance. Here, it is of interest to evaluate whether MILC correctly reflects the uncertainty due to missing and conflicting values in the between-imputation variance, for both univariate and multivariate cross-tables. Therefore, we investigate whether the average of the estimated standard errors is approximately equal to the standard deviation of the 500 estimates obtained from the 500 simulation replications by evaluating their ratio, which is calculated as

$$\frac{\left[ \sum_{it=1}^{N_{it}} \mathrm{SE}\left( \hat{\theta}_{j_{it}} \right) \Big/ N_{it} \right]}{\mathrm{SD}\left( \hat{\theta}_{j_{it}} \right)}, \tag{3.3}$$

where SE is the square root of the estimate of the total variance obtained after applying pooling rules (Rubin, 1976) and $\mathrm{SD}\left( \hat{\theta}_{j_{it}} \right)$ is calculated as

$$\mathrm{SD}\left( \hat{\theta}_{j_{it}} \right) = \sqrt{\frac{\sum_{it=1}^{N_{it}} \left( \hat{\theta}_{j_{it}} - \bar{\theta}_{j_{it}} \right)^2}{N_{it}}}. \tag{3.4}$$

To account for small cell frequencies, $\hat{\theta}_{j_{it}}$ and $\bar{\theta}_{j_{it}}$ are considered on a log scale in equations 3.2, 3.3 and 3.4. To summarize, we denote the specific conditions evaluated in this simulation study as $Y_{v,1}$, MILC-MCAR-5, MILC-MCAR-10, MILC-MCAR-20, MILC-MAR-5, MILC-MAR-10 and MILC-MAR-20.
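The ratio in equation 3.3 can be computed for one cell as in the Python sketch below. It assumes that the centring value in equation 3.4 is the mean of the log estimates over the replications and that the standard errors supplied are already on the log scale; the function names and toy values are hypothetical.

```python
import math

def sd_over_replications(estimates):
    """Equation 3.4 with divisor N_it, applied to the log of the estimates.
    ASSUMPTION: the centring value is the mean of the log estimates."""
    logs = [math.log(e) for e in estimates]
    centre = sum(logs) / len(logs)
    return math.sqrt(sum((x - centre) ** 2 for x in logs) / len(logs))

def se_sd_ratio(standard_errors, estimates):
    """Equation 3.3: average pooled SE divided by the SD of the estimates."""
    mean_se = sum(standard_errors) / len(standard_errors)
    return mean_se / sd_over_replications(estimates)

# Hypothetical estimates whose logs are 1 and 3, so that the SD equals 1.
estimates = [math.e, math.e ** 3]
ratio = se_sd_ratio([1.0, 1.0], estimates)
```

A ratio close to 1 indicates that the pooled standard errors reflect the actual variability of the estimates over replications; values above 1 point to overestimated and values below 1 to underestimated standard errors.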

