Multiple imputation of missing values in household data with structural zeros
Section 4. Strategies for speeding up the MCMC sampler
The rejection sampling step in the Gibbs sampler in Section 2.2 can be
inefficient when $|\mathcal{S}|$, the number of impossible combinations, is
large (Manrique-Vallier and Reiter, 2014; Hu et al., 2018), because the sampler
tends to generate many impossible households before obtaining enough feasible
ones. In addition, it takes computing time to check whether each sampled
household satisfies all the structural zero rules. These computational costs
are compounded when the sampler also incorporates missing values. In this
section, we present two strategies that reduce the number of impossible
households that the algorithm generates, thereby speeding up the sampler. The
Appendix includes simulation studies showing that both strategies can speed up
the MCMC significantly.
4.1 Moving the household head to the household level
Many datasets include a variable recording the relationship of each individual
to the household head. There can be only one household head in any household.
This restriction can account for a large proportion of the combinations in
$\mathcal{S}$, the set of impossible combinations. As a simple working example,
consider a dataset that contains $n$ households of size two, resulting in a
total of $N = 2n$ individuals. Suppose the data contain no household-level
variables and two individual-level variables, age and relationship to household
head. Also, suppose age has 100 levels while relationship to household head has
13 levels, which include household head, spouse of the household head, etc.
Then the set of all combinations, $\mathcal{C}$, contains
$13^2 \times 100^2 = 1.69 \times 10^6$ combinations. Suppose the rule “each
household must contain exactly one head” is the only structural zero rule
defined on the dataset. Only $2 \times 12 = 24$ of the $13^2 = 169$
relationship pairs satisfy this rule, so $\mathcal{S}$ contains
$(169 - 24) \times 100^2 = 1.45 \times 10^6$ impossible combinations,
approximately 86% of the size of $\mathcal{C}$. If, for example, the model
assigns uniform probability to all combinations in $\mathcal{C}$, we would
expect to sample about $6n$ impossible households (a ratio of
$145/24 \approx 6$ per feasible household) at every iteration to augment the
$n$ feasible households.
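The counts in this working example can be checked by brute-force enumeration.
The following Python sketch is illustrative only: it assumes that relationship
code 0 stands for "household head" (our own coding, not the dataset's), and it
relies on the fact that age does not enter the rule, so each relationship pair
corresponds to 100 x 100 age pairs.

```python
from itertools import product

N_REL = 13   # relationship levels; code 0 is ASSUMED to mean "household head"
N_AGE = 100  # age levels
HEAD = 0

def exactly_one_head(relationships):
    """Structural zero rule: a household must contain exactly one head."""
    return sum(r == HEAD for r in relationships) == 1

# Enumerate relationship pairs for households of size two.
rel_pairs = list(product(range(N_REL), repeat=2))
feasible_pairs = sum(exactly_one_head(p) for p in rel_pairs)

# Age is unconstrained, so every relationship pair has N_AGE**2 age pairs.
ages_per_pair = N_AGE ** 2
total = len(rel_pairs) * ages_per_pair
feasible = feasible_pairs * ages_per_pair
impossible = total - feasible

# Under a uniform model, this is the chance that a draw is impossible.
p_impossible = impossible / total
# Expected impossible draws per feasible draw when sampling until success.
impossible_per_feasible = p_impossible / (1 - p_impossible)

print(total, impossible, round(p_impossible, 3), round(impossible_per_feasible, 2))
```

Under these assumptions, the enumeration gives 1,690,000 combinations in total,
of which 1,450,000 are impossible (a share of about 0.858), or roughly 6.04
impossible draws for every feasible one.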
Instead, we treat the variables for the household head as a household-level
characteristic. This eliminates structural zero rules defined on the household
head alone. Using the working example, moving the household head to the
household level results in one new household-level variable, age of household
head, which has 100 levels. The relationship to household head variable can be
ignored for household heads. For others in the household, the relationship to
household head variable now has 12 levels, with the level corresponding to
“household head” removed. Thus, the set of all combinations, $\mathcal{C}$, now
contains $100 \times 100 \times 12 = 1.2 \times 10^5$ combinations, and the set
of impossible combinations, $\mathcal{S}$, contains zero impossible
combinations. We would not even need to sample impossible households in the
Gibbs sampler in Section 2.2.
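The effect of this reformatting on the working example can be verified in the
same way. The sketch below is a hedged illustration: the coding (0 for head, 1
to 12 for the other relationship levels) and the helper `to_original` are
hypothetical, introduced only to map a reformatted household back to the
original relationship vector and check the rule.

```python
HEAD = 0                      # ASSUMED code for "household head"
NONHEAD_LEVELS = range(1, 13)  # the 12 remaining relationship levels
N_AGE = 100

def to_original(rel_other):
    """Rebuild the original relationship vector: implicit head plus the other member."""
    return (HEAD, rel_other)

def exactly_one_head(relationships):
    """Structural zero rule: a household must contain exactly one head."""
    return sum(r == HEAD for r in relationships) == 1

# Household level: age of head (100 levels).
# Individual level, for the one other member: age (100) and relationship (12).
total_reformatted = N_AGE * N_AGE * len(NONHEAD_LEVELS)

# Count reformatted households that would violate the rule.
impossible_reformatted = N_AGE * N_AGE * sum(
    not exactly_one_head(to_original(r)) for r in NONHEAD_LEVELS
)

print(total_reformatted, impossible_reformatted)
```

With this coding, every one of the 120,000 reformatted combinations satisfies
the rule, because the head is implicit and can never be duplicated or omitted.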
In general, this strategy can reduce the size of $\mathcal{S}$, the set of
impossible combinations, significantly, albeit usually not to zero as in the
simple example here, since $\mathcal{S}$ usually contains combinations
resulting from other types of structural zero rules. This strategy is not a
replacement for the rejection sampler in Section 2.2; rather, it is a data
reformatting technique that can be combined with the sampler.
4.2 Setting an upper bound on the number of impossible households to sample
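To reduce computation time, we can put an upper bound on the number of
impossible households sampled at each iteration. One way to achieve this is to
replace $n_h$, the number of feasible households of size $h$ that must be
generated, in step S1(f) of Section 2.2 with $\lceil n_h \psi_h \rceil$ for
some $\psi_h$ such that $1/\psi_h$ is a positive integer, so that we sample
only approximately a fraction $\psi_h$ of the impossible households that the
uncapped sampler would generate for each $h$. However, doing so underestimates
the actual probability mass assigned to the set of impossible combinations,
$\mathcal{S}$, by the model. We can illustrate this using the simple example of
Section 4.1. Suppose the model assigns uniform probability to all combinations
as before, and suppose we set, for instance, $\psi_2 = 1/2$, so that we sample
approximately $3n$ rather than $6n$ impossible households in every iteration of
the MCMC sampler. The probability of generating one impossible household is
then about $3n/(3n + n) = 0.75$, a decrease from the actual value of 0.86.
Therefore, we would underestimate the true contribution of the impossible
households to the likelihood.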
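The size of this underestimation can be explored numerically. The following
sketch rests on several assumptions of ours: the cap is read as stopping the
rejection sampler once a fraction `psi` of the feasible households has been
generated, the model is uniform as in the working example, and the values
`n_feasible = 1000` and `psi = 0.5` are hypothetical choices made purely for
illustration.

```python
import math

# Share of impossible combinations in the working example (uniform model).
P_IMPOSSIBLE = 145 / 169

def capped_share(n_feasible, psi, p_impossible=P_IMPOSSIBLE):
    """Expected impossible draws under the cap, and their implied share.

    The sampler is ASSUMED to stop after ceil(n_feasible * psi) feasible draws.
    The implied share compares the expected number of impossible draws with the
    full set of n_feasible observed households.
    """
    target = math.ceil(n_feasible * psi)
    expected_impossible = target * p_impossible / (1 - p_impossible)
    share = expected_impossible / (expected_impossible + n_feasible)
    return expected_impossible, share

uncapped_count, uncapped_share = capped_share(1000, 1.0)
capped_count, capped_share_half = capped_share(1000, 0.5)

print(round(uncapped_count), round(uncapped_share, 3))
print(round(capped_count), round(capped_share_half, 3))
```

With these hypothetical inputs, the uncapped sampler recovers the share of
about 0.858, while halving the cap lowers the implied share to about 0.751,
which is why the capped draws need to be re-weighted.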
To use the cap-and-weight approach, we need to apply a correction that
re-weights the contribution of the sampled impossible households to the full
joint likelihood. We do so using ideas akin to those used by Chambers and
Skinner (2003) and Savitsky and Toth (2016), approximating the likelihood of
the full unobserved data with a “pseudo” likelihood using weights (the
$1/\psi_h$, where $\psi_h$ is the cap fraction for households of size $h$).
The impossible households only contribute to the full joint likelihood through
the discrete distributions in (2.3) to (2.6). The sufficient statistics for
estimating the parameters of the discrete distributions in (2.3) to (2.6) are
the observed counts for the corresponding variables in the augmented data (the
observed households together with the sampled impossible households), within
each latent class for the household-level variables and within each latent
class pair for the individual-level variables. Thus, for each $h$, we can
re-weight the contribution of impossible households by multiplying the observed
counts for the sampled impossible households of size $h$ by $1/\psi_h$ for the
corresponding variable and latent classes. This raises the likelihood
contribution of impossible households of size $h$ to the power of $1/\psi_h$.
Clearly, $1/\psi_h$ need not be a positive integer; we require that only so
that its product with the observed counts is free of decimals. We modify the
Gibbs sampler to incorporate the cap-and-weight approach by replacing steps S1,
S3, S4, S5 and S6; see the Appendix for the modified steps.
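The re-weighting of the counts can be sketched as follows. This is a minimal
illustration under our own assumptions, not the modified steps given in the
Appendix: each case is represented as a hypothetical (latent class, level)
pair for a single variable, `psi` is a hypothetical dictionary of cap fractions
by household size, and the function name `reweighted_counts` is ours.

```python
from collections import Counter

def reweighted_counts(feasible_cases, impossible_cases_by_size, psi):
    """Combine feasible counts with impossible counts scaled by 1 / psi[h].

    feasible_cases: iterable of (latent_class, level) pairs from the
        observed (feasible) households.
    impossible_cases_by_size: dict mapping household size h to the
        (latent_class, level) pairs from sampled impossible households.
    psi: dict mapping household size h to its cap fraction (ASSUMED name).
    """
    counts = Counter(feasible_cases)
    for h, cases in impossible_cases_by_size.items():
        weight = 1 / psi[h]  # each capped impossible case counts 1/psi[h] times
        for key, count in Counter(cases).items():
            counts[key] += weight * count
    return counts

# Toy data: three feasible cases and two impossible cases of size two.
feasible = [(1, "a"), (1, "a"), (2, "b")]
impossible = {2: [(1, "a"), (2, "b")]}
counts = reweighted_counts(feasible, impossible, psi={2: 0.5})

print(dict(counts))
```

In this toy input, each impossible case counts twice because the cap fraction
is one half, so the weighted counts stand in for the counts that an uncapped
sampler would have produced on average.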
Setting each weight $\psi_h = 1$ corresponds to the original rejection sampler,
so the two approaches should provide very similar results when each $\psi_h$ is
near 1. Based on our experience, results of the cap-and-weight approach become
significantly less accurate than those of the regular rejection sampler when
$\psi_h$ is set too far below 1. The time gained using this speedup approach in
comparison to the regular sampler depends on the features of the data and the
specified values for the weights $\psi_h$.
To select the weights $\psi_h$, we suggest trying out different values,
starting with values close to one, in initial runs of the MCMC sampler on a
small random sample of the data. Analysts should examine the convergence and
mixing behavior of the chains in comparison to the chain with all the $\psi_h$
set to one, and select values that offer
reasonable speedup while preserving convergence and mixing. This can be done
quickly by comparing trace plots of a random set of parameters from the model
that are not subject to label switching, such as $\alpha$ and $\beta$, or by
examining marginal, bivariate and trivariate probabilities estimated from
synthetic data generated from the MCMC.
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
All rights reserved. Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa