Multiple imputation of missing values in household data with structural zeros
Section 2. Review of the NDPMPM model
Hu et al. (2018) present the NDPMPM model, including motivation for how it can
preserve associations across variables and account for structural zeros. Here,
we summarize the model without detailed motivations, referring the reader to Hu
et al. (2018) for more information. We begin with notation needed to
understand the model and the Gibbs sampler, assuming complete data. The
presentation closely follows that in Hu et al. (2018).
2.1 Notation and model specification
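Suppose the data contain $n$ households. Each household $i = 1, \dots, n$
contains $n_i$ individuals, so that there are $\sum_{i=1}^{n} n_i = N$
individuals in the data. Let $X_{ik} \in \{1, \dots, d_k\}$ be the value of
categorical variable $k$ for household $i$, which is assumed to be identical
for all $n_i$ individuals in household $i$, where $k \in \{p+1, \dots, p+q\}$.
Let $X_{ijk} \in \{1, \dots, d_k\}$ be the value of categorical variable $k$
for person $j$ in household $i$, where $j = 1, \dots, n_i$ and
$k = 1, \dots, p$. Let
$\mathbf{X}_i = (X_{i,p+1}, \dots, X_{i,p+q}, X_{i11}, \dots, X_{i n_i p})$
include all household-level and individual-level variables for the $n_i$
individuals in household $i$.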
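Let $\mathcal{H}$ be the set of all household sizes that are possible in the
population. For all $h \in \mathcal{H}$, let $\mathcal{C}_h$ represent the set
of all combinations of individual-level and household-level variables for
households of size $h$, including impossible combinations; that is,
$\mathcal{C}_h = \prod_{k=p+1}^{p+q} \{1, \dots, d_k\} \times \prod_{j=1}^{h} \prod_{k=1}^{p} \{1, \dots, d_k\}$.
Let $\mathcal{S}_h \subset \mathcal{C}_h$ represent the set of impossible
combinations, i.e., those that are structural zeros, for households of size
$h$. These include combinations of variables within any individual, e.g., a
three-year-old person cannot be a spouse, or across individuals in the same
household, e.g., a person cannot be older than his or her biological parents.
Let $\mathcal{C} = \bigcup_{h \in \mathcal{H}} \mathcal{C}_h$ and
$\mathcal{S} = \bigcup_{h \in \mathcal{H}} \mathcal{S}_h$. Although the NDPMPM
model we use restricts the support of $\mathbf{X}_i$ to
$\mathcal{C} - \mathcal{S}$, it is helpful for understanding the model to
begin with no restrictions on the support of $\mathbf{X}_i$.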
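To make these sets concrete, the following minimal sketch counts the combinations available to a household of a given size and flags one kind of structural zero. The level counts, the dictionary keys `age` and `relate`, and the age cutoff are hypothetical choices made for illustration; they are not taken from the paper or its data.

```python
from math import prod

# Hypothetical toy schema (an assumption for illustration only):
# one household-level variable with 2 levels, and two individual-level
# variables with 2 and 3 levels respectively.
HOUSEHOLD_LEVELS = [2]
INDIVIDUAL_LEVELS = [2, 3]


def count_combinations(h, household_levels, individual_levels):
    """Number of combinations for a household of size h, impossible ones included."""
    return prod(household_levels) * prod(individual_levels) ** h


def is_structural_zero(people):
    """Toy rule: a person under age 15 cannot be coded as a spouse.

    `people` is a list of dicts with the hypothetical keys "age" and "relate".
    """
    return any(p["age"] < 15 and p["relate"] == "spouse" for p in people)
```

The count grows geometrically in the household size, which is why the impossible combinations cannot simply be enumerated and removed for realistic data.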
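Each household $i$ belongs to one of $F$ classes representing latent household
types. For $i = 1, \dots, n$, let $G_i \in \{1, \dots, F\}$ indicate the
household class for household $i$. Let $\pi_g = \Pr(G_i = g)$ be the
probability that household $i$ belongs to class $g$. Within any class, all
household-level variables follow independent multinomial distributions. For
any $k \in \{p+1, \dots, p+q\}$ and any $c \in \{1, \dots, d_k\}$, let
$\lambda_{gc}^{(k)} = \Pr(X_{ik} = c \mid G_i = g)$ for any class $g$, where
$\lambda_{gc}^{(k)}$ is the same value for every household in class $g$. Let
$\pi = (\pi_1, \dots, \pi_F)$ and
$\lambda = \{\lambda_{gc}^{(k)} : c = 1, \dots, d_k;\ k = p+1, \dots, p+q;\ g = 1, \dots, F\}$.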
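Within each household class, each individual belongs to one of $M$
individual-level latent classes. For $i = 1, \dots, n$ and
$j = 1, \dots, n_i$, let $M_{ij} \in \{1, \dots, M\}$ represent the
individual-level latent class of individual $j$ in household $i$. Let
$\omega_{gm} = \Pr(M_{ij} = m \mid G_i = g)$ be the probability that individual
$j$ in household $i$ belongs to individual-level class $m$ nested within
household-level class $g$. Within any individual-level class, all
individual-level variables follow independent multinomial distributions. For
any $k \in \{1, \dots, p\}$ and any $c \in \{1, \dots, d_k\}$, let
$\phi_{gmc}^{(k)} = \Pr(X_{ijk} = c \mid (G_i, M_{ij}) = (g, m))$ for the class
pair $(g, m)$, where $\phi_{gmc}^{(k)}$ is the same value for every individual
in the class pair $(g, m)$. Let
$\omega = \{\omega_{gm} : g = 1, \dots, F;\ m = 1, \dots, M\}$ and
$\phi = \{\phi_{gmc}^{(k)} : c = 1, \dots, d_k;\ k = 1, \dots, p;\ g = 1, \dots, F;\ m = 1, \dots, M\}$.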
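For purposes of the Gibbs sampler in Section 2.2, it is useful to distinguish
values of $\mathbf{X}_i$ that satisfy all the structural zero constraints from
those that do not. Let the superscript "1" indicate that a random variable has
support only on $\mathcal{C} - \mathcal{S}$. For example, $\mathbf{X}_i^1$
represents data for a household with values restricted to
$\mathcal{C} - \mathcal{S}$, i.e., not an impossible household, whereas
$\mathbf{X}_i$ represents data for a household with any values in
$\mathcal{C}$. Let $\mathcal{X}^1$ be the observed data comprising $n$
households, that is, a realization of $(\mathbf{X}_1^1, \dots, \mathbf{X}_n^1)$.
The kernel of the NDPMPM, $L(\mathcal{X}^1 \mid \theta)$, is

$$L(\mathcal{X}^1 \mid \theta) = \prod_{i=1}^{n} \sum_{h \in \mathcal{H}} \mathbb{1}\{n_i = h\}\, \mathbb{1}\{\mathbf{X}_i^1 \notin \mathcal{S}_h\} \left[ \sum_{g=1}^{F} \pi_g \prod_{k=p+1}^{p+q} \lambda_{g X_{ik}^1}^{(k)} \left( \prod_{j=1}^{h} \sum_{m=1}^{M} \omega_{gm} \prod_{k=1}^{p} \phi_{g m X_{ijk}^1}^{(k)} \right) \right] \tag{2.1}$$

where $\theta$ includes all the parameters, and $\mathbb{1}\{\cdot\}$ equals
one when the condition inside the $\{\}$ is true and equals zero otherwise.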
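For all $h \in \mathcal{H}$, let $n_{1h} = \sum_{i=1}^{n} \mathbb{1}\{n_i = h\}$
be the number of households of size $h$ in $\mathcal{X}^1$ and
$\pi_{0h}(\theta) = \Pr(\mathbf{X}_i \in \mathcal{S}_h \mid \theta)$. As stated
in Hu et al. (2018), the normalizing constant in the likelihood in (2.1) is
$\prod_{h \in \mathcal{H}} (1 - \pi_{0h}(\theta))^{n_{1h}}$. Therefore, the
posterior distribution is

$$\Pr(\theta \mid \mathcal{X}^1, T(\mathcal{S})) \propto \Pr(\mathcal{X}^1 \mid \theta) \Pr(\theta) = \frac{1}{\prod_{h \in \mathcal{H}} (1 - \pi_{0h}(\theta))^{n_{1h}}}\, L(\mathcal{X}^1 \mid \theta) \Pr(\theta) \tag{2.2}$$

where $T(\mathcal{S})$ emphasizes that the density is for the NDPMPM
with support restricted to $\mathcal{C} - \mathcal{S}$.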
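The likelihood in (2.1) can be written as a generative model of the form

$$X_{ik}^1 \mid G_i, \lambda \sim \text{Discrete}(\lambda_{G_i 1}^{(k)}, \dots, \lambda_{G_i d_k}^{(k)}) \quad \text{for all } i \text{ and } k = p+1, \dots, p+q \tag{2.3}$$

$$X_{ijk}^1 \mid G_i, M_{ij}, n_i, \phi \sim \text{Discrete}(\phi_{G_i M_{ij} 1}^{(k)}, \dots, \phi_{G_i M_{ij} d_k}^{(k)}) \quad \text{for all } i, j \text{ and } k = 1, \dots, p \tag{2.4}$$

$$G_i \mid \pi \sim \text{Discrete}(\pi_1, \dots, \pi_F) \quad \text{for all } i \tag{2.5}$$

$$M_{ij} \mid G_i, n_i, \omega \sim \text{Discrete}(\omega_{G_i 1}, \dots, \omega_{G_i M}) \quad \text{for all } i, j \tag{2.6}$$

where the Discrete distribution refers to the multinomial distribution
with sample size equal to one. We restrict the support of each
$\mathbf{X}_i^1$ to $\mathcal{C} - \mathcal{S}$ to ensure the model assigns
zero probability to all combinations in $\mathcal{S}$, as desired. The model in
(2.3) to (2.6) can be used without restricting the support to
$\mathcal{C} - \mathcal{S}$. This ignores all structural zeros. While not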
appropriate for the joint distribution of household data, this model turns out
to be useful for the Gibbs sampler. We refer to the generative model in (2.3) to (2.6)
with support on all of $\mathcal{C}$ as the untruncated NDPMPM. For contrast, we
call the model in (2.1) the truncated NDPMPM.
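The untruncated generative process can be sketched in a few lines: draw a household class, draw the household-level values given that class, then for each person draw an individual class and that person's values. The function names, the nested-list parameter layout, and the 0-based category coding below are assumptions made for this illustration and do not reflect any particular software implementation.

```python
import random


def draw(probs, rng):
    """Draw a 0-based category index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_household(size, pi, lam, omega, phi, rng):
    """Draw one household of `size` people with no structural-zero check.

    Assumed (hypothetical) parameter layout:
      pi[g]        : household-class probabilities
      lam[k][g]    : category probabilities of household variable k in class g
      omega[g]     : individual-class probabilities within household class g
      phi[k][g][m] : category probabilities of individual variable k
                     in the class pair (g, m)
    """
    g = draw(pi, rng)                                   # household class
    household_values = [draw(lam_k[g], rng) for lam_k in lam]
    people = []
    for _ in range(size):
        m = draw(omega[g], rng)                         # individual class
        people.append([draw(phi_k[g][m], rng) for phi_k in phi])
    return g, household_values, people
```

Because nothing in this sketch checks the drawn household against the structural zero rules, it can return impossible households, which is exactly the behavior the data augmentation step relies on.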
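For prior distributions, we follow the recommendations of Hu et al. (2018). We
use independent uniform Dirichlet distributions as priors for $\lambda$ and
$\phi$, and the truncated stick-breaking representation of the Dirichlet
process as priors for $\pi$ and $\omega$ (Sethuraman, 1994; Dunson and Xing,
2009; Si and Reiter, 2013; Manrique-Vallier and Reiter, 2014),

$$\lambda_g^{(k)} = (\lambda_{g1}^{(k)}, \dots, \lambda_{g d_k}^{(k)}) \sim \text{Dirichlet}(\alpha_{k1}, \dots, \alpha_{k d_k}) \tag{2.7}$$

$$\phi_{gm}^{(k)} = (\phi_{gm1}^{(k)}, \dots, \phi_{gm d_k}^{(k)}) \sim \text{Dirichlet}(\alpha_{k1}, \dots, \alpha_{k d_k}) \tag{2.8}$$

$$\pi_g = u_g \prod_{f<g} (1 - u_f) \quad \text{for } g = 1, \dots, F \tag{2.9}$$

$$u_g \sim \text{Beta}(1, \alpha) \quad \text{for } g = 1, \dots, F-1, \qquad u_F = 1 \tag{2.10}$$

$$\alpha \sim \text{Gamma}(a_\alpha, b_\alpha) \tag{2.11}$$

$$\omega_{gm} = v_{gm} \prod_{s<m} (1 - v_{gs}) \quad \text{for } m = 1, \dots, M \tag{2.12}$$

$$v_{gm} \sim \text{Beta}(1, \beta_g) \quad \text{for } m = 1, \dots, M-1, \qquad v_{gM} = 1 \tag{2.13}$$

$$\beta_g \sim \text{Gamma}(a_\beta, b_\beta). \tag{2.14}$$

We set the parameters for the Dirichlet distributions in (2.7) and (2.8) to
$\mathbf{1}_{d_k}$ (a $d_k$-dimensional vector of ones)
and the parameters for the Gamma distributions in (2.11) and (2.14) to 0.25 to
represent vague prior specifications. We also set $\beta_g = \beta$
for computational expedience. For further
discussion on prior specifications, see Hu et al. (2018).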
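As an illustration of the truncated stick-breaking construction, the sketch below draws one set of class probabilities for a fixed number of classes. The function name and arguments are hypothetical, and the concentration parameter is passed in directly rather than drawn from its Gamma prior.

```python
import random


def stick_breaking(concentration, n_classes, rng):
    """Draw class probabilities from a truncated stick-breaking prior.

    Each of the first n_classes - 1 sticks takes a Beta(1, concentration)
    fraction of the remaining mass; the last stick takes whatever is left,
    so the probabilities sum to one.
    """
    probabilities, remaining = [], 1.0
    for g in range(n_classes):
        fraction = rng.betavariate(1.0, concentration) if g < n_classes - 1 else 1.0
        probabilities.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return probabilities
```

Small concentration values tend to put most of the mass on the first few classes, while larger values spread it over more classes.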
Conceptually,
the latent household-level classes can be interpreted as clusters of households
with similar compositions, e.g., households with children or households in
which no one is related. Similarly, the latent individual-level classes can be
interpreted as clusters of individuals with similar characteristics, e.g.,
older male spouses or young female children. However, for purposes of
imputation, we do not care much about interpreting the classes, as they serve
mainly to induce dependence across variables and individuals in the joint
distribution.
It is important to select the numbers of household-level and individual-level
classes, $F$ and $M$, to be large enough to ensure accurate
estimation of the joint distribution. However, we also do not want to make
$F$ and $M$ so large as to produce many empty classes in
the model estimation. Allowing many empty classes increases computational
running time without any corresponding increase in estimation accuracy. This
can be especially problematic in the Gibbs sampler for the truncated NDPMPM, as
these empty classes can introduce mass in regions of the space where impossible
combinations are likely to be generated. This slows down the convergence of the
Gibbs sampler.
We therefore recommend following the strategy in Hu et al. (2018) when
setting $(F, M)$. Analysts can start with moderate values for
both, say between 10 and 15, in initial tuning runs. After convergence,
analysts examine posterior samples of the latent classes to check how many
individual-level and household-level latent classes are occupied. Such
posterior predictive checks can provide evidence for the case that larger
values for $F$ and $M$ are needed. If the number of occupied
household-level classes hits $F$, we suggest increasing $F$. If the number of
occupied individual-level classes hits $M$,
we suggest increasing
first but then increasing
possibly in addition to
if increasing
alone does not suffice. When posterior
predictive checks do not provide evidence that larger values of $F$ and $M$
are needed, analysts need not increase the
number of classes, as doing so is not expected to improve the accuracy of the
estimation. We note that similar logic is used in other mixture model contexts (Walker,
2007; Si and Reiter, 2013; Manrique-Vallier and Reiter, 2014; Murray and
Reiter, 2016).
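The occupancy check described above can be sketched as follows, assuming the class indicators from each saved MCMC draw are available as a list of class labels. The function names and the input layout are hypothetical.

```python
def occupied_classes(class_draws):
    """Number of distinct occupied classes in each saved MCMC draw."""
    return [len(set(draw)) for draw in class_draws]


def cap_reached(class_draws, n_classes):
    """True if any saved draw occupies all `n_classes` available classes."""
    return max(occupied_classes(class_draws)) >= n_classes
```

If `cap_reached` returns `True` for the household-level indicators, that is the signal for rerunning the sampler with more household-level classes.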
2.2 MCMC sampler for the NDPMPM
Hu et al. (2018) use a data augmentation strategy (Manrique-Vallier and
Reiter, 2014) to estimate the posterior distribution in (2.2). They assume that
the observed data $\mathcal{X}^1$, which includes only feasible households, is
a subset from a hypothetical sample $\mathcal{X}$ of $(n + n_0)$ households
directly generated from the untruncated NDPMPM. That is, $\mathcal{X}$ is
generated on the support $\mathcal{C}$, where all combinations are possible and
structural zero rules are not enforced, but we only observe the sample of $n$
households $\mathcal{X}^1$ that satisfy the structural zero rules and do
not observe the sample of $n_0$ households $\mathcal{X}^0$ that fail the rules.
We use the strategy of Hu et al. (2018) and augment the data as follows. We
simulate $\mathcal{X}$ from the untruncated NDPMPM, stopping when the number of
simulated feasible households of size $h$ in $\mathcal{X}$ exactly matches
$n_{1h}$ for every $h \in \mathcal{H}$. We replace the simulated feasible
households in $\mathcal{X}$ with $\mathcal{X}^1$, thus assuming that
$\mathcal{X}$ already contains $\mathcal{X}^1$, so that we only need to
generate the part $\mathcal{X}^0$ that falls in $\mathcal{S}$. Given a draw of
$\mathcal{X}$, we draw $\theta$ from the posterior distribution defined by the
untruncated NDPMPM, treating $\mathcal{X}$ as the observed data. This posterior
distribution can be estimated using a blocked Gibbs sampler (Ishwaran and
James, 2001; Si and Reiter, 2013).
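The augmentation step for a single household size amounts to a rejection loop. The following is a minimal sketch under stated assumptions: `simulate` stands for any function that draws one household of the given size from the untruncated model, and `is_impossible` stands for a check against the structural zero rules. Both are hypothetical placeholders supplied by the caller, not functions from the paper or its software.

```python
import random


def simulate_impossible(n_feasible_target, simulate, is_impossible, rng):
    """Simulate households until `n_feasible_target` feasible ones are seen.

    Returns only the impossible households generated along the way. The
    feasible draws are discarded because the observed households take
    their place in the augmented data.
    """
    impossible = []
    feasible_seen = 0
    while feasible_seen < n_feasible_target:
        household = simulate(rng)
        if is_impossible(household):
            impossible.append(household)
        else:
            feasible_seen += 1
    return impossible
```

The loop does not terminate if the model puts all of its mass on impossible households, so the expected running time grows as the probability of drawing a feasible household shrinks.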
We now present the full MCMC sampler for fitting the truncated NDPMPM. Let
$\mathbf{G}^0$ and $\mathbf{M}^0$ be vectors of the latent class membership
indicators for the households in $\mathcal{X}^0$, and $n_{0h}$ be the number of
households of size $h$ in $\mathcal{X}^0$, with
$n_0 = \sum_{h \in \mathcal{H}} n_{0h}$. In each full conditional, let “$-$”
represent conditioning on all other variables and parameters in the model. At
each MCMC iteration, we do the following steps.
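- S1. Set $\mathcal{X}^0 = \mathbf{G}^0 = \mathbf{M}^0 = \emptyset$. For each
$h \in \mathcal{H}$, repeat the following:
- (a) Set $t_0 = 0$ and $t_1 = 0$.
- (b) Sample $G_i^0 \in \{1, \dots, F\} \sim \text{Discrete}(\pi_1^{**}, \dots, \pi_F^{**})$,
where $\pi_g^{**} \propto \lambda_{gh}^{(k)} \pi_g$ and $k$ is the index for
the household-level variable “household size”.
- (c) For $j = 1, \dots, h$, sample
$M_{ij}^0 \in \{1, \dots, M\} \sim \text{Discrete}(\omega_{G_i^0 1}, \dots, \omega_{G_i^0 M})$.
- (d) Set $X_{ik}^0 = h$, where $X_{ik}^0$ corresponds to the variable for
household size. Sample the remaining household-level and individual-level
values using the likelihoods in (2.3) and (2.4). Set the household’s simulated
value to $\mathbf{X}_i^0$.
- (e) If $\mathbf{X}_i^0 \in \mathcal{S}_h$, let $t_0 = t_0 + 1$,
$\mathcal{X}^0 = \mathcal{X}^0 \cup \mathbf{X}_i^0$,
$\mathbf{G}^0 = \mathbf{G}^0 \cup G_i^0$ and
$\mathbf{M}^0 = \mathbf{M}^0 \cup \{M_{i1}^0, \dots, M_{ih}^0\}$. Otherwise set
$t_1 = t_1 + 1$.
- (f) If $t_1 < n_{1h}$, return to step (b). Otherwise, set $n_{0h} = t_0$.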
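- S2. For observations in $\mathcal{X}^1$:
- (a) Sample $G_i^1 \in \{1, \dots, F\} \sim \text{Discrete}(\pi_1^{*}, \dots, \pi_F^{*})$
for $i = 1, \dots, n$, where
$$\pi_g^{*} = \Pr(G_i^1 = g \mid -) = \frac{\pi_g \left[ \prod_{k=p+1}^{p+q} \lambda_{g X_{ik}^1}^{(k)} \left( \prod_{j=1}^{n_i} \sum_{m=1}^{M} \omega_{gm} \prod_{k=1}^{p} \phi_{g m X_{ijk}^1}^{(k)} \right) \right]}{\sum_{f=1}^{F} \pi_f \left[ \prod_{k=p+1}^{p+q} \lambda_{f X_{ik}^1}^{(k)} \left( \prod_{j=1}^{n_i} \sum_{m=1}^{M} \omega_{fm} \prod_{k=1}^{p} \phi_{f m X_{ijk}^1}^{(k)} \right) \right]}$$
for $g = 1, \dots, F$. Set $\mathbf{G}^1 = \{G_1^1, \dots, G_n^1\}$.
- (b) Sample $M_{ij}^1 \in \{1, \dots, M\} \sim \text{Discrete}(\omega_{G_i^1 1}^{*}, \dots, \omega_{G_i^1 M}^{*})$
for $i = 1, \dots, n$ and $j = 1, \dots, n_i$, where
$$\omega_{G_i^1 m}^{*} = \Pr(M_{ij}^1 = m \mid -) = \frac{\omega_{G_i^1 m} \prod_{k=1}^{p} \phi_{G_i^1 m X_{ijk}^1}^{(k)}}{\sum_{s=1}^{M} \omega_{G_i^1 s} \prod_{k=1}^{p} \phi_{G_i^1 s X_{ijk}^1}^{(k)}}$$
for $m = 1, \dots, M$. Set
$\mathbf{M}^1 = \{M_{ij}^1 : i = 1, \dots, n;\ j = 1, \dots, n_i\}$.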
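- S3. Set $u_F = 1$. Sample
$$u_g \mid - \sim \text{Beta}\left(1 + U_g,\ \alpha + \sum_{f=g+1}^{F} U_f\right), \qquad \pi_g = u_g \prod_{f<g} (1 - u_f),$$
where $U_g = \sum_{i=1}^{n} \mathbb{1}\{G_i^1 = g\} + \sum_{i=1}^{n_0} \mathbb{1}\{G_i^0 = g\}$
for $g = 1, \dots, F-1$.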
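- S4. Set $v_{gM} = 1$ for $g = 1, \dots, F$. Sample
$$v_{gm} \mid - \sim \text{Beta}\left(1 + V_{gm},\ \beta + \sum_{s=m+1}^{M} V_{gs}\right), \qquad \omega_{gm} = v_{gm} \prod_{s<m} (1 - v_{gs}),$$
where $V_{gm} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \mathbb{1}\{M_{ij}^1 = m, G_i^1 = g\} + \sum_{i=1}^{n_0} \sum_{j=1}^{n_i} \mathbb{1}\{M_{ij}^0 = m, G_i^0 = g\}$
for $m = 1, \dots, M-1$ and $g = 1, \dots, F$.
- S5. Sample
$$\lambda_g^{(k)} \mid - \sim \text{Dirichlet}\left(1 + \eta_{g1}^{(k)}, \dots, 1 + \eta_{g d_k}^{(k)}\right),$$
where $\eta_{gc}^{(k)} = \sum_{i: G_i^1 = g} \mathbb{1}\{X_{ik}^1 = c\} + \sum_{i: G_i^0 = g} \mathbb{1}\{X_{ik}^0 = c\}$
for $g = 1, \dots, F$ and $k = p+1, \dots, p+q$.
- S6. Sample
$$\phi_{gm}^{(k)} \mid - \sim \text{Dirichlet}\left(1 + \nu_{gm1}^{(k)}, \dots, 1 + \nu_{gm d_k}^{(k)}\right),$$
where $\nu_{gmc}^{(k)} = \sum_{(i,j): G_i^1 = g, M_{ij}^1 = m} \mathbb{1}\{X_{ijk}^1 = c\} + \sum_{(i,j): G_i^0 = g, M_{ij}^0 = m} \mathbb{1}\{X_{ijk}^0 = c\}$
for $g = 1, \dots, F$, $m = 1, \dots, M$ and $k = 1, \dots, p$.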
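As an illustration of the update for the household-level class probabilities in step S3, the sketch below performs one draw of the standard blocked Gibbs update for truncated stick-breaking weights (Ishwaran and James, 2001). The function name and arguments are hypothetical; `counts[g]` is assumed to hold the number of households currently assigned to class `g`, pooled over the observed and the augmented impossible households.

```python
import random


def update_class_probabilities(counts, concentration, rng):
    """One blocked-Gibbs draw of truncated stick-breaking weights.

    counts[g] is the number of households currently assigned to class g,
    pooled over the observed and the augmented (impossible) households.
    """
    n_classes = len(counts)
    weights, remaining = [], 1.0
    for g in range(n_classes):
        if g < n_classes - 1:
            later = sum(counts[g + 1:])   # households in higher-indexed classes
            fraction = rng.betavariate(1.0 + counts[g], concentration + later)
        else:
            fraction = 1.0                # the last stick takes all remaining mass
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights
```

The same function would serve for the individual-level class probabilities within a single household class if it were given the individual-level counts for that class.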
This Gibbs sampler is
implemented in the R software package “NestedCategBayesImpute” (Wang, Akande,
Hu, Reiter and Barrientos, 2016). The software can be used to generate
synthetic versions of the original data, but it requires all data to be
complete.