Using Multiple Imputation of Latent Classes to construct population census tables with data from multiple sources
Section 2. Methodology
When applying the MILC method, the starting point is a
unit-linked combined dataset, which can consist of combinations of
administrative population registries and survey samples. To account
for uncertainty about the parameters of the LC model estimated at a later
step in MILC, a non-parametric bootstrap procedure is first applied to this dataset
(step 1). This involves creating bootstrap samples by drawing observations from
the observed dataset with replacement. Subsequently, for each bootstrap sample,
the LC model of interest is estimated (step 2) using Latent GOLD software (Vermunt
and Magidson, 2013a). Here, model parameters are estimated by Maximum
Likelihood using a combination of the Expectation-Maximization and
Newton-Raphson algorithms. Constrained estimation is used here, by explicitly stating which cells
should be restricted. Next, imputations are created using the sets of parameter values obtained from the latent class models (step 3). If
imputations were instead created from the maximum-likelihood estimates obtained
directly from the original observed data, sampling uncertainty regarding the
estimated parameters of the latent class model would be ignored. Finally, the estimates
obtained from the imputed datasets are pooled (step 4).
In the following subsections, we explain each of the steps
of MILC in more detail and present the extension for the estimation of multiple
latent variables for a finite population from register and sample survey data.
2.1 Step 1: Creating bootstrap samples
We propose to use the “classical” bootstrap procedure here,
which consists of repeatedly drawing samples with replacement from the original
dataset, of the same size as the original dataset. A motivation for using this
classical with-replacement bootstrap here, as opposed to an adapted bootstrap
procedure for a finite population, is provided in Section 2.5 below.
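As a minimal sketch of the classical bootstrap described above, the following Python snippet draws resamples with replacement, each of the same size as the original dataset. The function name `bootstrap_samples`, the toy records and the fixed seed are illustrative assumptions and are not taken from the paper.

```python
import random

def bootstrap_samples(rows, m, seed=0):
    """Return m bootstrap samples, each drawn with replacement from `rows`
    and of the same size as `rows` (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    n = len(rows)
    # rng.choices samples with replacement; k=n keeps the original size.
    return [rng.choices(rows, k=n) for _ in range(m)]

# Toy unit-linked dataset: (register indicator, survey indicator) per unit.
toy_rows = [(1, 1), (1, 2), (2, 2), (2, 1), (1, 1), (2, 2)]
boot = bootstrap_samples(toy_rows, m=5)
```

Because rows are drawn with replacement, some units appear several times in a given bootstrap sample and others not at all.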
The bootstrap should be applied to the dataset that is used
to estimate the LC models. When register data and survey data are combined, the
indicator variables from the survey will typically be missing for a large part
(e.g., 90% or more) of the population. The LC models could then be estimated by
two different approaches:
- using only the subset of persons observed in both the survey and the register (complete cases);
- using all available data, including cases with missing indicators.
Under the second approach, full information maximum
likelihood can be used to handle missing values when estimating the LC models.
This has the advantage of using all available information. Since this amounts
to estimating the LC model on datasets with the size of the target
population, a practical drawback of this approach is that it may be
computationally demanding in terms of time and memory. Therefore, the first
approach may be more attractive, in particular when the associations among the
covariates and target variables are relatively weak. In that situation,
the cases with missing survey data contain relatively little information
about the parameters of the LC model. Note that under both approaches, the
estimated LC models are used to impute predictions of the latent classes
throughout the population. Depending on which approach is chosen to estimate
the LC models, bootstrapping is applied either to the subset of complete cases
or to the target population. In the simulation study in this paper, the
complete-case approach will be used.
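The complete-case approach can be sketched as follows: first keep only the units observed in both the register and the survey, then bootstrap within that subset. The field names (`register_y`, `survey_y`, `q`) and the toy records are hypothetical; `None` marks a survey indicator that is missing because the unit was not sampled.

```python
import random

# Toy linked population (hypothetical values): most units lack survey data.
population = [
    {"id": 1, "register_y": 1, "survey_y": 1,    "q": 1},
    {"id": 2, "register_y": 2, "survey_y": None, "q": 1},
    {"id": 3, "register_y": 1, "survey_y": None, "q": 2},
    {"id": 4, "register_y": 2, "survey_y": 2,    "q": 2},
    {"id": 5, "register_y": 1, "survey_y": None, "q": 1},
    {"id": 6, "register_y": 1, "survey_y": 1,    "q": 2},
]

# Approach 1: complete cases only (observed in both survey and register).
complete_cases = [r for r in population if r["survey_y"] is not None]

# The bootstrap is then applied to the complete cases, not to the population.
rng = random.Random(1)
boot_sample = rng.choices(complete_cases, k=len(complete_cases))
```

The full `population` list is still needed later, because the estimated model is used to impute latent classes for every unit.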
2.2 Step 2: Estimating the latent class model
The second step is the estimation of the LC model, which is explained
below for multiple latent variables. As described in the previous section,
the LC model is typically estimated $m$ times, once for each of the $m$
bootstrapped datasets. In the situation under evaluation in this paper,
the LC model is estimated $m$ times on subsets of complete observations
coming from the $m$ bootstrap samples. An extensive discussion of
the model and the assumptions made when using the model to correct for
measurement error can be found in Boeschoten et al. (2017). Multiple
latent variables can be estimated simultaneously in one model, which yields the
following model structure for the joint probability of the response variables
given covariate values, denoted by $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Q} = \mathbf{q})$.
The number of latent variables is denoted as $T$ and $C_t$ is the number of
classes of latent variable $X_t$ (scalar), where $t = 1, \ldots, T$. Furthermore,
$\mathbf{Y}$ are the observed target variables, i.e. the indicator variables,
$L_t$ is the number of indicator variables for $X_t$ and $\mathbf{Q}$ are the
(also observed) covariate variables:
$$
P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Q} = \mathbf{q}) = \sum_{x_1 = 1}^{C_1} \cdots \sum_{x_T = 1}^{C_T} P(X_1 = x_1, \ldots, X_T = x_T \mid \mathbf{Q} = \mathbf{q}) \prod_{t = 1}^{T} \prod_{l = 1}^{L_t} P(Y_{tl} = y_{tl} \mid X_t = x_t). \tag{2.1}
$$
Here, local independence is assumed (the indicators are mutually independent
given the latent variables), as well as independence of covariates (the
covariates affect the indicators only through the latent variables).
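To make the model structure concrete, the sketch below evaluates the joint probability of a response pattern given a covariate value by summing, over all combinations of latent classes, the class probability times the product of the conditional response probabilities. All parameter values are made up for illustration (two latent variables with two classes each, two binary indicators per latent variable, one binary covariate); they are not estimates from the paper.

```python
from itertools import product

# p_x_given_q[q][(x1, x2)] = P(X1 = x1, X2 = x2 | Q = q)  (hypothetical values)
p_x_given_q = {
    0: {(0, 0): 0.50, (0, 1): 0.20, (1, 0): 0.20, (1, 1): 0.10},
    1: {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.40},
}
# p_y_given_x[t][l][x][y] = P(Y_tl = y | X_t = x)  (hypothetical values)
p_y_given_x = [
    [[[0.90, 0.10], [0.20, 0.80]], [[0.85, 0.15], [0.10, 0.90]]],
    [[[0.95, 0.05], [0.30, 0.70]], [[0.80, 0.20], [0.25, 0.75]]],
]

def response_given_classes(y, x):
    """P(Y = y | X = x) under local independence; y[t][l] is indicator l
    of latent variable t."""
    prob = 1.0
    for t, x_t in enumerate(x):
        for l, y_tl in enumerate(y[t]):
            prob *= p_y_given_x[t][l][x_t][y_tl]
    return prob

def response_given_covariate(y, q):
    """P(Y = y | Q = q): sum over all combinations of latent classes."""
    return sum(p_x * response_given_classes(y, x)
               for x, p_x in p_x_given_q[q].items())

# Sanity check: the probabilities of all 2^4 response patterns sum to one.
patterns = [((a, b), (c, d)) for a, b, c, d in product([0, 1], repeat=4)]
total_q0 = sum(response_given_covariate(y, 0) for y in patterns)
total_q1 = sum(response_given_covariate(y, 1) for y in patterns)
```

In practice these probabilities are parameterized and estimated by the LC software; the tables above only stand in for such estimates.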
Constrained parameter estimation is used when certain cells
within the probability tables of the model are restricted. This can be used to specify
that certain combinations of scores between covariates and latent variables are
logically impossible, or when a “quasi-latent” variable is used to create
imputations for missing values in a variable (Vermunt and Magidson, 2013b).
2.3 Step 3: Multiple imputation
To be able to create multiple imputations, joint posterior
membership probabilities are calculated for every person in the original
dataset. They represent the probability that a unit is part of a combination of
latent classes from the different latent variables, given its combination of
scores on the indicators and covariates used in the LC model. These probabilities
can be used to create multiple imputations of the latent variables which
contain their “true scores”.
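The joint posterior membership probabilities can be
calculated by applying Bayes’ rule to the conditional response probabilities
obtained from the LC models:
$$
P(\mathbf{X} = \mathbf{x} \mid \mathbf{Y} = \mathbf{y}, \mathbf{Q} = \mathbf{q}) = \frac{P(\mathbf{X} = \mathbf{x} \mid \mathbf{Q} = \mathbf{q}) \, P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})}{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Q} = \mathbf{q})}, \tag{2.2}
$$
where
$$
P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \prod_{t = 1}^{T} \prod_{l = 1}^{L_t} P(Y_{tl} = y_{tl} \mid X_t = x_t) \tag{2.3}
$$
and $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{Q} = \mathbf{q})$ is defined in equation 2.1.
Here $\mathbf{X} = (X_1, \ldots, X_T)$ collects the latent variables, $Y_{tl}$ is
the $l$-th of the $L_t$ indicators of $X_t$, and $\mathbf{Q}$ are the covariates. For one
profile (so one set of scores on all indicator and covariate variables), the
joint posterior membership probabilities sum to one over all combinations of
latent classes.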
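The Bayes' rule calculation can be sketched as follows for a single profile: the numerator multiplies the class probability given the covariates by the conditional response probabilities, and the denominator is the sum of these numerators over all combinations of latent classes. The parameter values are hypothetical and apply to one fixed covariate value.

```python
# p_x_given_q[(x1, x2)] = P(X1 = x1, X2 = x2 | Q = q) for one fixed q (hypothetical)
p_x_given_q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# p_y_given_x[t][l][x][y] = P(Y_tl = y | X_t = x)  (hypothetical values)
p_y_given_x = [
    [[[0.90, 0.10], [0.20, 0.80]], [[0.80, 0.20], [0.10, 0.90]]],
    [[[0.70, 0.30], [0.30, 0.70]], [[0.95, 0.05], [0.15, 0.85]]],
]

def joint_posterior(y):
    """Posterior probability of every combination of latent classes,
    given the response pattern y (and the fixed covariate value)."""
    numerators = {}
    for x, prior in p_x_given_q.items():
        likelihood = 1.0
        for t, x_t in enumerate(x):
            for l, y_tl in enumerate(y[t]):
                likelihood *= p_y_given_x[t][l][x_t][y_tl]
        numerators[x] = prior * likelihood
    denominator = sum(numerators.values())  # P(Y = y | Q = q)
    return {x: value / denominator for x, value in numerators.items()}

# Profile: both indicators of the first latent variable score 0,
# both indicators of the second latent variable score 1.
post = joint_posterior(((0, 0), (1, 1)))
most_likely = max(post, key=post.get)
```

For this profile the posterior mass concentrates on the combination in which the first latent variable is in class 0 and the second in class 1, which is what the indicators suggest.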
To be able to include parameter uncertainty in our variance
estimates, we perform the model estimation on $m$ bootstrap samples of the dataset, resulting in
$m$ different LC models. We generate $m$ imputations
in the original dataset accounting for the parameter uncertainty by using the
resulting $m$ sets of bootstrap parameter estimates. More
specifically, with each of these parameter sets we compute the posterior class
membership probabilities for the original sample, and use these to generate the
imputations. In other words, the $m$ imputations are based on $m$ different sets of posterior probabilities.
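A minimal sketch of the imputation step, assuming the posterior probabilities have already been computed for every unit under each bootstrap parameter set: for each set, one combination of latent classes is drawn per unit, with the posterior probabilities as sampling weights. The function name `impute` and the toy posteriors are illustrative assumptions.

```python
import random

def impute(posteriors_per_set, seed=0):
    """posteriors_per_set[i][u] maps class combinations to posterior
    probabilities for unit u under bootstrap parameter set i.
    Returns one imputed dataset (a list of class combinations) per set."""
    rng = random.Random(seed)
    imputations = []
    for unit_posteriors in posteriors_per_set:
        imputed = []
        for post in unit_posteriors:
            classes = list(post.keys())
            weights = list(post.values())
            # Draw a single class combination with the posterior as weights.
            imputed.append(rng.choices(classes, weights=weights, k=1)[0])
        imputations.append(imputed)
    return imputations

# Toy input: three parameter sets, two units, one latent variable with two
# classes. The second unit has a degenerate posterior, so it is always class "b".
toy_posteriors = [
    [{"a": 0.7, "b": 0.3}, {"a": 0.0, "b": 1.0}],
    [{"a": 0.6, "b": 0.4}, {"a": 0.0, "b": 1.0}],
    [{"a": 0.8, "b": 0.2}, {"a": 0.0, "b": 1.0}],
]
imputed_sets = impute(toy_posteriors)
```

Each imputed dataset uses a different set of posterior probabilities, so the variation between the imputed datasets reflects both classification uncertainty and parameter uncertainty.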
2.4 Step 4: Pooling
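The next step is to obtain estimates of interest for every
imputation, and to pool them using Rubin’s Rules (Rubin, 1987, page 76).
For this research, the main interest is producing a frequency table. Therefore,
the frequency table of interest is obtained for each of the $m$ imputations and these tables are pooled, which means
taking the average over the $m$ imputations for every cell in the frequency table:
$$
\hat{\theta}_j = \frac{1}{m} \sum_{i = 1}^{m} \hat{\theta}_{ij}, \tag{2.4}
$$
where $j$ refers to a specific cell in the frequency
table and $\hat{\theta}_{ij}$ is the frequency of that cell in imputation $i$.
Next, an estimate of the uncertainty around these
frequencies is of interest. In general, the variance of the pooled estimate $\hat{\theta}_j$ can be estimated by Rubin’s total variance
formula for multiple imputation (Rubin, 1987, page 76):
$$
\mathrm{VAR}_{\text{total}} = \overline{\mathrm{VAR}}_{\text{within}} + \mathrm{VAR}_{\text{between}} + \frac{\mathrm{VAR}_{\text{between}}}{m}. \tag{2.5}
$$
Here, $\mathrm{VAR}_{\text{between}}$ can be estimated as
$$
\mathrm{VAR}_{\text{between}} = \frac{1}{m - 1} \sum_{i = 1}^{m} \left( \hat{\theta}_{ij} - \hat{\theta}_j \right)^2. \tag{2.6}
$$
The within variance $\overline{\mathrm{VAR}}_{\text{within}}$ reflects the average sampling variance of $\hat{\theta}_{ij}$ when the imputed values are treated as
observed. In our application, as the population is finite and imputations are
generated for the complete population, this within variance component is zero
and can be omitted (Vink and van Buuren, 2014). Note that this is a
property of multiple imputation and is due to the fact that the complete
population is imputed. This should not be confused with the decision to only
use a sample for LC model estimation. Hence, formula (2.5) is reduced in
this case to:
$$
\mathrm{VAR}_{\text{total}} = \mathrm{VAR}_{\text{between}} + \frac{\mathrm{VAR}_{\text{between}}}{m}. \tag{2.7}
$$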
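The pooling step can be sketched in a few lines: for every cell, average the imputed frequencies, compute the between-imputation variance with divisor $m - 1$, and add the correction term for the finite number of imputations, with the within variance set to zero as argued above. The function name `pool_frequencies` and the toy tables are illustrative assumptions.

```python
def pool_frequencies(tables):
    """tables: list of m dicts, each mapping a cell label to its frequency
    in one imputed dataset. Returns (pooled frequencies, total variances)."""
    m = len(tables)
    pooled, total_var = {}, {}
    for cell in tables[0]:
        values = [table[cell] for table in tables]
        mean = sum(values) / m
        between = sum((v - mean) ** 2 for v in values) / (m - 1)
        pooled[cell] = mean
        # Within variance is zero here because the whole population is imputed.
        total_var[cell] = between + between / m
    return pooled, total_var

# Toy frequency tables from m = 3 imputations (hypothetical counts).
toy_tables = [
    {"A": 100, "B": 50},
    {"A": 104, "B": 46},
    {"A": 102, "B": 48},
]
pooled, total_var = pool_frequencies(toy_tables)
```

With these toy counts, cell "A" has a pooled frequency of 102 and a between variance of 4, so its total variance is 4 + 4/3.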
2.5 A note on bootstrapping for multiple imputation in finite populations
The aim of a census is to estimate certain target parameters
of a finite population (e.g., all persons currently living in the Netherlands).
Hence, a natural idea might be to apply a finite-population bootstrap procedure
in this context; see Mashreghi, Haziza and Léger (2016) for an overview of
bootstrap methods for finite populations. However, when determining the
appropriate bootstrap approach, it should be noted that the bootstrap in MILC
is specifically implemented to account for the between imputation variance component
of formula (2.5) in Section 2.4. In general, variability in the
target parameters due to the fact that a sample was drawn from a finite population
is incorporated in the within variance component of formula (2.5). As we
use mass imputation here, the within variance component in fact reduces to
zero; cf. formula (2.7). More generally, this component would be estimated
separately from the bootstrap method at hand; see Boeschoten et al. (2017)
for an example.
Furthermore, the reason for incorporating the bootstrap in
the MILC approach is to account for uncertainty in the estimated parameters of
the latent class model. Note that these parameters are not associated with a
finite population, but with a model. Even if we had observed the entire finite
population, there would still be uncertainty about the true parameter values of
the latent class model. This uncertainty can be regarded as arising from sampling from an
infinite population. Therefore, we select the classical with-replacement
bootstrap. We argue that bootstrap methods for finite populations should not be
used in this context. For large samples, such methods would result in a
substantial underestimation of the variance when combined with the usual
approach to multiple imputation. We also checked this empirically in the
simulation study to be discussed in Section 3. As an example, when a
pseudo-population bootstrap method for finite populations was used, the
resulting se/sd ratios in Table 4.7 for the MAR condition were 0.7217, 0.7887, 0.7536 and 0.8607, respectively,
all pointing to a non-negligible underestimation of the true variance.
In the simulation study in this paper, we will restrict
attention to surveys based on simple random sampling and stratified simple
random sampling. For more complex survey designs, e.g. involving cluster
sampling or sampling with unequal probabilities, it is unclear whether the
proposed bootstrap approach is always appropriate. It is possible that in some
cases such complex design features could indirectly affect the uncertainty of
estimated parameters of the latent class model and therefore become relevant
for variance estimation. We will return to this point in the discussion
section.
ISSN : 1492-0921
Survey Methodology, Catalogue No. 12-001-X
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.