Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 2. Methodology

When applying the MILC method, the starting point is a unit-linked combined dataset, which can consist of combinations of administrative population registries and survey samples. In order to account for uncertainty regarding the parameters of the LC model estimated at a later step in MILC, a non-parametric bootstrap procedure is first applied to this dataset (step 1). This involves creating $M$ bootstrap samples by drawing observations from the observed dataset with replacement. Subsequently, for each bootstrap sample, the LC model of interest is estimated (step 2) using Latent GOLD software (Vermunt and Magidson, 2013a). Here, model parameters are estimated by Maximum Likelihood using a combination of the Expectation-Maximization and Newton-Raphson algorithms. Constrained estimation is used at this step, by explicitly stating which cells should be restricted. Next, $M$ imputations are created using the $M$ sets of parameter values obtained from the $M$ latent class models (step 3). If the imputations were instead created from the maximum-likelihood estimates obtained directly from the original observed data, sampling uncertainty regarding the estimated parameters of the latent class model would be ignored.

In the following subsections, we explain each of the steps of MILC in more detail and present the extension for the estimation of multiple latent variables for a finite population from register and sample survey data.

2.1 Step 1: Creating bootstrap samples

We propose to use the “classical” bootstrap procedure here, which consists of repeatedly drawing samples with replacement from the original dataset, of the same size as the original dataset. A motivation for using this classical with-replacement bootstrap here, as opposed to an adapted bootstrap procedure for a finite population, is provided in Section 2.5 below.
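The classical bootstrap described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `bootstrap_samples`, the toy dataset and the seed are assumptions made for the example.

```python
import numpy as np

def bootstrap_samples(data, M, seed=0):
    """Draw M with-replacement bootstrap samples, each of the same size as `data`."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    samples = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        samples.append(data[idx])
    return samples

# Hypothetical toy dataset: 6 units, 2 categorical indicators.
data = np.array([[1, 1], [1, 2], [2, 2], [2, 1], [1, 1], [2, 2]])
boot = bootstrap_samples(data, M=5)
```

Each element of `boot` has the same number of rows as `data`, and some units will typically appear more than once while others are left out.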

The bootstrap should be applied to the dataset that is used to estimate the LC models. When register data and survey data are combined, the indicator variables from the survey will typically be missing for a large part (e.g., 90% or more) of the population. The LC models could then be estimated by two different approaches:

using only the subset of persons observed in both the survey and the register (complete cases);
using all available data, including cases with missing indicators.

Under the second approach, full information maximum likelihood can be used to handle missing values when estimating the LC models. This has the advantage of using all available information. Since this amounts to estimating the LC model on $M$ datasets with the size of the target population, a practical drawback of this approach is that it may be computationally demanding in terms of time and memory. Therefore, the first approach may be more attractive, in particular when the associations among the covariates and target variables are relatively weak. In that case, the cases with missing survey data will contain relatively little information about the parameters of the LC model. Note that under both approaches, the estimated LC models are used to impute predictions of the latent classes throughout the population. Depending on which approach is chosen to estimate the LC models, bootstrapping is applied either to the subset of complete cases or to the target population. In the simulation study in this paper, the complete-case approach will be used.
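As a small illustration of the complete-case approach, the sketch below selects the units observed in both sources and bootstraps only that subset. The variable names and the toy values are hypothetical; missing survey indicators are coded as `NaN` purely for the sake of the example.

```python
import numpy as np

# Hypothetical linked data: a register indicator for everyone,
# a survey indicator that is missing (NaN) for most units.
register = np.array([1, 2, 2, 1, 1, 2, 1, 2], dtype=float)
survey = np.array([1, np.nan, 2, np.nan, np.nan, 2, np.nan, np.nan])
data = np.column_stack([register, survey])

# Complete cases: units with no missing indicator.
complete = ~np.isnan(data).any(axis=1)
complete_cases = data[complete]

# The bootstrap is then applied to the complete cases only.
rng = np.random.default_rng(1)
idx = rng.integers(0, len(complete_cases), size=len(complete_cases))
boot_sample = complete_cases[idx]
```

The LC model would be estimated on `boot_sample`, while imputations are still generated for all rows of `data`.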

2.2 Step 2: Estimating the latent class model

The second step is the estimation of the LC model. It is explained below how this is done for multiple latent variables. As described in the previous section, the LC model is typically estimated $M$ times using the $M$ bootstrapped datasets. In the situation under evaluation in this paper, the LC model is estimated $M$ times on $M$ subsets of complete observations coming from the $M$ bootstrap samples. An extensive discussion of the model and the assumptions made when using the model to correct for measurement error can be found in Boeschoten et al. (2017). Multiple latent variables can be estimated simultaneously in one model, which yields the following model structure for the joint probability of the response variables given covariate values, denoted by $P (Y = y | Q = q)$. The number of latent variables is denoted by $v$, and $K_{h}$ is the number of classes of the (scalar) latent variable $X_{h}$, where $h = 1, \dots, v$. Furthermore, $Y$ are the observed target variables, i.e. the indicator variables, $L_{h}$ is the number of indicator variables for $X_{h}$, and $Q$ are the (also observed) covariate variables:

$\begin{array}{rl} P (Y = y | Q = q) = & \sum_{x_{1} = 1}^{K_{1}} \dots \sum_{x_{v} = 1}^{K_{v}} P (X_{1} = x_{1}, \dots, X_{v} = x_{v} | Q = q) \\ & \times \prod_{l_{1} = 1}^{L_{1}} P (Y_{l_{1}, 1} = y_{l_{1}, 1} | X_{1} = x_{1}) \\ & \times \dots \\ & \times \prod_{l_{v} = 1}^{L_{v}} P (Y_{l_{v}, v} = y_{l_{v}, v} | X_{v} = x_{v}) . \qquad (2.1) \end{array}$

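Here, local independence is assumed: given the latent variables, the indicators are mutually independent. In addition, the indicators are assumed to be independent of the covariates given the latent variables, so that $Q$ enters the model only through the distribution of the latent variables.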
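To make the structure of equation (2.1) concrete, the following sketch evaluates $P (Y = y | Q = q)$ for a small hypothetical model with $v = 2$ binary latent variables, one binary covariate and two binary indicators per latent variable. All parameter values and names (`p_x_given_q`, `cond`, `response_prob`) are invented for illustration and categories are coded from 0; in MILC the parameters would come from the estimated LC model.

```python
import itertools
import numpy as np

# Hypothetical P(X1 = x1, X2 = x2 | Q = q), indexed as [q, x1, x2].
p_x_given_q = np.array([
    [[0.50, 0.20], [0.20, 0.10]],
    [[0.10, 0.20], [0.30, 0.40]],
])
# Hypothetical cond[h][l][x, y] = P(Y_{l,h} = y | X_h = x).
cond = [
    [np.array([[0.90, 0.10], [0.20, 0.80]]), np.array([[0.85, 0.15], [0.10, 0.90]])],
    [np.array([[0.95, 0.05], [0.15, 0.85]]), np.array([[0.80, 0.20], [0.25, 0.75]])],
]

def response_prob(y, q):
    """P(Y = y | Q = q) as in equation (2.1); y[h][l] is the score on indicator l of X_h."""
    K1, K2 = p_x_given_q.shape[1:]
    total = 0.0
    for x1, x2 in itertools.product(range(K1), range(K2)):
        term = p_x_given_q[q, x1, x2]
        for h, x in enumerate((x1, x2)):
            for l, table in enumerate(cond[h]):
                term *= table[x, y[h][l]]  # local independence: one factor per indicator
        total += term
    return total

# Sanity check: the probabilities of all response patterns sum to one.
patterns = itertools.product(range(2), repeat=4)
total_prob = sum(response_prob(((a, b), (c, d)), q=0) for a, b, c, d in patterns)
```

Because the covariate appears only in `p_x_given_q`, the sketch also reflects the assumption that the indicators do not depend on $Q$ once the latent variables are given.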

Constrained parameter estimation is used when certain cells within $P (X_{1} = x_{1}, \dots, X_{v} = x_{v} | Q = q)$ are restricted. This can be used to specify that certain combinations of scores between covariates and latent variables are logically impossible, or when a “quasi-latent” variable is used to create imputations for missing values in a variable (Vermunt and Magidson, 2013b).

2.3 Step 3: Multiple imputation

To be able to create multiple imputations, joint posterior membership probabilities are calculated for every person in the original dataset. They represent the probability that a unit is part of a combination of latent classes from the different latent variables, given its combination of scores on the indicators and covariates used in the LC model. These probabilities can be used to create multiple imputations of the latent variables which contain their “true scores”.

The joint posterior membership probabilities can be calculated by applying Bayes’ rule to the conditional response probabilities obtained from the $M$ LC models:

$P (X_{1} = x_{1}, \dots, X_{v} = x_{v} | Y = y, Q = q) = \frac{P (X_{1} = x_{1}, \dots, X_{v} = x_{v}, Y = y | Q = q)}{P (Y = y | Q = q)}, \qquad (2.2)$

where

$\begin{array}{rl} P (X_{1} = x_{1}, \dots, X_{v} = x_{v}, Y = y | Q = q) = & P (X_{1} = x_{1}, \dots, X_{v} = x_{v} | Q = q) \\ & \times \prod_{l_{1} = 1}^{L_{1}} P (Y_{l_{1}, 1} = y_{l_{1}, 1} | X_{1} = x_{1}) \\ & \times \dots \\ & \times \prod_{l_{v} = 1}^{L_{v}} P (Y_{l_{v}, v} = y_{l_{v}, v} | X_{v} = x_{v}) \qquad (2.3) \end{array}$

and $P (Y = y | Q = q)$ is defined in equation (2.1). For one profile (that is, one set of scores on all indicator and covariate variables), the joint posterior membership probabilities sum to one.
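A minimal sketch of equations (2.2) and (2.3) for one profile is given below, again for a hypothetical model with two binary latent variables and two binary indicators each. The parameter values and the function name `posterior` are assumptions made for illustration, with categories coded from 0.

```python
import numpy as np

# Hypothetical P(X1 = x1, X2 = x2 | Q = q), indexed as [q, x1, x2].
p_x_given_q = np.array([
    [[0.50, 0.20], [0.20, 0.10]],
    [[0.10, 0.20], [0.30, 0.40]],
])
# Hypothetical cond[h][l][x, y] = P(Y_{l,h} = y | X_h = x).
cond = [
    [np.array([[0.90, 0.10], [0.20, 0.80]]), np.array([[0.85, 0.15], [0.10, 0.90]])],
    [np.array([[0.95, 0.05], [0.15, 0.85]]), np.array([[0.80, 0.20], [0.25, 0.75]])],
]

def posterior(y, q):
    """Joint posterior P(X1, X2 | Y = y, Q = q) as a K1 x K2 array."""
    # lik[h][x] = product over indicators l of P(Y_{l,h} = y_{l,h} | X_h = x)
    lik = [
        np.prod([table[:, y[h][l]] for l, table in enumerate(cond[h])], axis=0)
        for h in range(2)
    ]
    joint = p_x_given_q[q] * np.outer(lik[0], lik[1])  # numerator, equation (2.3)
    return joint / joint.sum()                         # the sum is P(Y = y | Q = q), equation (2.1)

post = posterior(y=((0, 0), (1, 1)), q=0)
```

For this profile, both indicators of the first latent variable point to its first class and both indicators of the second point to its second class, so most of the posterior mass falls on that combination.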

To be able to include parameter uncertainty in our variance estimates, we perform the model estimation on $M$ bootstrap samples of the dataset, resulting in $M$ different LC models. We generate imputations in the original dataset accounting for the parameter uncertainty by using the resulting $M$ sets of bootstrap parameter estimates. More specifically, with each of these $M$ parameter sets we compute the posterior class membership probabilities for the original sample, and use these to generate the imputations. In other words, the $M$ imputations are based on $M$ different sets of posterior probabilities.
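The imputation step can be sketched as drawing, for every unit and every one of the $M$ parameter sets, one class from the corresponding posterior distribution. In the example below the posteriors are randomly generated placeholders, and the combinations of latent classes are assumed to be flattened into a single index; the function name `impute` is hypothetical.

```python
import numpy as np

def impute(posteriors, seed=0):
    """posteriors: array of shape (M, n, C); returns an (M, n) array of drawn class labels."""
    rng = np.random.default_rng(seed)
    M, n, C = posteriors.shape
    imputations = np.empty((M, n), dtype=int)
    for m in range(M):          # one imputation per bootstrap parameter set
        for i in range(n):      # one draw per unit in the original dataset
            imputations[m, i] = rng.choice(C, p=posteriors[m, i])
    return imputations

# Placeholder posteriors: M = 3 parameter sets, n = 4 units, C = 2 classes.
rng = np.random.default_rng(42)
posteriors = rng.dirichlet(alpha=[1.0, 1.0], size=(3, 4))
imputed = impute(posteriors)

# With degenerate posteriors the draws are deterministic.
deterministic = impute(np.array([[[1.0, 0.0], [0.0, 1.0]]]))
```

Note that the units are the same in all $M$ imputations; only the posterior probabilities differ, because they are computed from different bootstrap parameter estimates.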

2.4 Step 4: Pooling

The next step is to obtain estimates of interest for every imputation, and to pool them using Rubin’s Rules (Rubin, 1987, page 76). For this research, the main interest is producing a frequency table. Therefore, the frequency table of interest is obtained for the $M$ imputations and they are pooled, which means taking the average over the imputations for every cell in the frequency table:

${\hat{\theta}}_{j} = \frac{1}{M} \sum_{i = 1}^{M} {\hat{\theta}}_{i j}, \qquad (2.4)$

where $j$ refers to a specific cell in the frequency table.
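As a worked illustration of equation (2.4), the sketch below tabulates each of $M = 3$ hypothetical imputations and averages the resulting frequencies cell by cell. The imputed values and the function name `pooled_table` are invented for the example; in practice a cell may also be a cross-classification of several variables.

```python
import numpy as np

def pooled_table(imputations, n_classes):
    """Return the M per-imputation frequency tables and their cell-wise average."""
    tables = np.array([np.bincount(imp, minlength=n_classes) for imp in imputations])
    return tables, tables.mean(axis=0)

# Hypothetical imputations: M = 3 rows, n = 6 units, classes coded 0, 1, 2.
imputations = np.array([
    [0, 0, 1, 2, 1, 0],
    [0, 1, 1, 2, 1, 0],
    [0, 0, 1, 2, 2, 0],
])
tables, theta_hat = pooled_table(imputations, n_classes=3)
# tables    -> [[3, 2, 1], [2, 3, 1], [3, 1, 2]]
# theta_hat -> [8/3, 2, 4/3]
```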

Next, an estimate of the uncertainty around these frequencies is of interest. In general, the variance of the pooled estimate ${\hat{\theta}}_{j}$ can be estimated by Rubin’s total variance formula for multiple imputation (Rubin, 1987, page 76):

${VAR}_{{total}_{j}} = {\overline{VAR}}_{{within}_{j}} + {VAR}_{{between}_{j}} + \frac{{VAR}_{{between}_{j}}}{M} . \qquad (2.5)$

Here, ${VAR}_{{between}_{j}}$ can be estimated as

${VAR}_{{between}_{j}} = \frac{1}{M - 1} \sum_{i = 1}^{M} ({\hat{\theta}}_{i j} - {\hat{\theta}}_{j}) {({\hat{\theta}}_{i j} - {\hat{\theta}}_{j})}^{'} . \qquad (2.6)$

The within variance ${VAR}_{{within}_{j}}$ reflects the average sampling variance of ${\hat{\theta}}_{i j}$ when the imputed values are treated as observed. In our application, as the population is finite and imputations are generated for the complete population, this within variance component is zero and can be omitted (Vink and van Buuren, 2014). Note that this is a property of multiple imputation and is due to the fact that the complete population is imputed. This should not be confused with the decision to only use a sample for LC model estimation. Hence, formula (2.5) is reduced in this case to:

${VAR}_{{total}_{j}} = {VAR}_{{between}_{j}} + \frac{{VAR}_{{between}_{j}}}{M} . (2.7)$
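Formulas (2.6) and (2.7) can be illustrated with the short sketch below, which treats each cell of the frequency table as a scalar estimate. The three frequency tables are hypothetical and the function name `total_variance` is an assumption for the example.

```python
import numpy as np

def total_variance(tables):
    """Total variance per cell when the within component is zero, formula (2.7)."""
    M = tables.shape[0]
    between = tables.var(axis=0, ddof=1)  # formula (2.6), computed cell by cell
    return between + between / M

# Hypothetical frequency tables from M = 3 imputations (rows) with 3 cells (columns).
tables = np.array([[3, 2, 1], [2, 3, 1], [3, 1, 2]], dtype=float)
var_total = total_variance(tables)
# between   -> [1/3, 1, 1/3]
# var_total -> [4/9, 4/3, 4/9]
```

In other words, the total variance equals the between-imputation variance inflated by the factor $1 + 1 / M$.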

2.5 A note on bootstrapping for multiple imputation in finite populations

The aim of a census is to estimate certain target parameters of a finite population (e.g., all persons currently living in the Netherlands). Hence, a natural idea might be to apply a finite-population bootstrap procedure in this context; see Mashreghi, Haziza and Léger (2016) for an overview of bootstrap methods for finite populations. However, when determining the appropriate bootstrap approach, it should be noted that the bootstrap in MILC is specifically implemented to account for the between-imputation variance component of formula (2.5) in Section 2.4. In general, variability in the target parameters due to the fact that a sample was drawn from a finite population is incorporated in the within variance component of formula (2.5). As we use mass imputation here, the within variance component in fact reduces to zero; cf. formula (2.7). More generally, this component would be estimated separately from the bootstrap method at hand; see Boeschoten et al. (2017) for an example.

Furthermore, the reason for incorporating the bootstrap in the MILC approach is to account for uncertainty in the estimated parameters of the latent class model. Note that these parameters are not associated with a finite population, but with a model. Even if we had observed the entire finite population, there would still be uncertainty about the true parameter values of the latent class model. This uncertainty can be considered as drawing from an infinite distribution. Therefore, we select the classical with-replacement bootstrap. We argue that bootstrap methods for finite populations should not be used in this context. For large samples, such methods would result in a substantial underestimation of the variance when combined with the usual approach to multiple imputation. We also checked this empirically in the simulation study to be discussed in Section 3. As an example, when a pseudo-population bootstrap method for finite populations was used, the resulting se/sd ratios in Table 4.7 for the condition MAR, $M = 5$ were 0.7217, 0.7887, 0.7536 and 0.8607, respectively, all pointing to a non-negligible underestimation of the true variance.

In the simulation study in this paper, we will restrict attention to surveys based on simple random sampling and stratified simple random sampling. For more complex survey designs, e.g. involving cluster sampling or sampling with unequal probabilities, it is unclear whether the proposed bootstrap approach is always appropriate. It is possible that in some cases such complex design features could indirectly affect the uncertainty of estimated parameters of the latent class model and therefore become relevant for variance estimation. We will return to this point in the discussion section.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-06-21
