# Multiple imputation of missing values in household data with structural zeros

## Section 3. Handling missing data using the NDPMPM

We modify the Gibbs sampler for the truncated NDPMPM to incorporate missing data. For $i = 1, \dots, n,$ let $a_i = \left(a_{i(p+1)}, \dots, a_{i(p+q)}\right)$ be a vector with $a_{ik} = 1$ when household-level variable $k \in \{p+1, \dots, p+q\}$ in $X_i^1$ is missing, and $a_{ik} = 0$ otherwise. For $i = 1, \dots, n$ and $j = 1, \dots, n_i,$ let $b_{ij} = \left(b_{ij1}, \dots, b_{ijp}\right)$ be a vector with $b_{ijk} = 1$ when individual-level variable $k \in \{1, \dots, p\}$ for individual $j \in \{1, \dots, n_i\}$ in $X_i^1$ is missing, and $b_{ijk} = 0$ otherwise. For each household $i,$ let $X_i^1 = \left(X_i^{\text{obs}}, X_i^{\text{mis}}\right),$ where $X_i^{\text{obs}}$ comprises all data values corresponding to $a_{ik} = 0$ and $b_{ijk} = 0,$ and $X_i^{\text{mis}}$ comprises all data values corresponding to $a_{ik} = 1$ and $b_{ijk} = 1.$ We assume that the data are missing at random (Rubin, 1976).
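
As an illustration of the indicators $a_i$ and $b_{ij}$, the following minimal sketch builds them for a single toy household. The array names, the toy values, and the use of `np.nan` to mark a missing entry are assumptions made for this example only; they are not part of the NDPMPM specification.

```python
import numpy as np

# Hypothetical toy household: q = 2 household-level variables and
# n_i = 3 individuals, each with p = 2 individual-level variables.
# np.nan marks a missing entry (an assumption of this sketch).
household_vars = np.array([1.0, np.nan])        # shape (q,)
individual_vars = np.array([[2.0, np.nan],
                            [np.nan, 1.0],
                            [3.0, 2.0]])        # shape (n_i, p)

# a_ik = 1 when household-level variable k is missing, 0 otherwise.
a_i = np.isnan(household_vars).astype(int)

# b_ijk = 1 when variable k of individual j is missing, 0 otherwise.
b_i = np.isnan(individual_vars).astype(int)

print(a_i)
print(b_i)
```

Entries with indicator 0 make up $X_i^{\text{obs}}$ and entries with indicator 1 make up $X_i^{\text{mis}}$.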

To incorporate missing values in the Gibbs sampler, we need to sample, at every iteration, from the full conditional of each variable in $X_i^{\text{mis}},$ conditioning on the variables for which $a_{ik} = 0$ and $b_{ijk} = 0.$ We therefore add a ninth step to the sampler:

- S9. For $i = 1, \dots, n,$ sample $X_i^{\text{mis}}$ from its full conditional distribution

$$\Pr\left(X_i^{\text{mis}} \mid -\right) \propto 1\left\{X_i^1 \notin \mathcal{S}_h\right\} \left( \pi_{G_i^1} \prod_{k \mid a_{ik} = 1}^{p+q} \lambda^{(k)}_{G_i^1 X_{ik}^1} \prod_{j=1}^{n_i} \omega_{G_i^1 M_{ij}^1} \prod_{k \mid b_{ijk} = 1}^{p} \varphi^{(k)}_{G_i^1 M_{ij}^1 X_{ijk}^1} \right).$$

Sampling from this conditional distribution is nontrivial because the structural zero rules in each $\mathcal{S}_h$ induce dependence among the variables. Because of this dependence, we cannot simply sample each variable independently using the likelihoods in (2.3) and (2.4). If we could generate the set of all possible completions for every household with missing entries, conditional on the observed values, then calculating the probability of each completion and sampling from the set would be straightforward. Unfortunately, this approach is not practical when the size of each $\mathcal{S}_h$ is large. Even when the size of each $\mathcal{S}_h$ is modest, each household could have a different set of completions, requiring substantial computation, storage, and memory.

However, the full conditional in S9 takes a form similar to the kernel of the truncated NDPMPM in (2.1), so we can generate the desired samples through a second rejection sampling scheme. Essentially, we sample from an untruncated version of the full conditional, $P^{*}_{X_i^{\text{mis}}} = \pi_{G_i^1} \prod_{k \mid a_{ik} = 1}^{p+q} \lambda^{(k)}_{G_i^1 X_{ik}^1} \left( \prod_{j=1}^{n_i} \omega_{G_i^1 M_{ij}^1} \prod_{k \mid b_{ijk} = 1}^{p} \varphi^{(k)}_{G_i^1 M_{ij}^1 X_{ijk}^1} \right),$ until we obtain a valid sample that satisfies $X_i^1 \notin \mathcal{S}_h;$ see the Appendix for a proof that this rejection sampling scheme results in a valid Gibbs sampler. Since $P^{*}_{X_i^{\text{mis}}}$ itself is untruncated, we can generate samples from it by sampling each variable independently using (2.3) and (2.4). We therefore replace step S9 with S9'.

- S9'. For $i = 1, \dots, n,$ sample $X_i^{\text{mis}}$ as follows.

  - (a) For each missing household-level variable, that is, each variable $k \in \{p+1, \dots, p+q\}$ with $a_{ik} = 1,$ sample $X_{ik}^1$ using (2.3).
  - (b) For each missing individual-level variable, that is, each variable with $j = 1, \dots, n_i$ and $k \in \{1, \dots, p\}$ such that $b_{ijk} = 1,$ sample $X_{ijk}^1$ using (2.4).
  - (c) Let $X_i^{\text{mis}*}$ denote the sampled household-level and individual-level values.
  - (d) Combine $X_i^{\text{mis}*}$ with the observed $X_i^{\text{obs}},$ that is, set $X_i^{1*} = \left(X_i^{\text{obs}}, X_i^{\text{mis}*}\right).$ If $X_i^{1*} \notin \mathcal{S}_h,$ set $X_i^{\text{mis}} = X_i^{\text{mis}*}.$ Otherwise, return to step (a).
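
The rejection loop in S9' can be sketched for a single household as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that (2.3) and (2.4) amount to categorical draws with probabilities $\lambda^{(k)}$ and $\varphi^{(k)}$ given the current latent classes, as the full conditional suggests. The function `sample_missing`, the containers `lam` and `phi`, the rule checker `is_impossible`, and the cap `max_tries` are hypothetical names introduced for the example, and household-level variables are indexed from 0 rather than from $p+1$.

```python
import numpy as np

def sample_missing(hh, ind, a, b, g, m, lam, phi, is_impossible, rng,
                   max_tries=10_000):
    """Draw one valid completion of a household by rejection sampling.

    hh  : (q,) int array of household-level values (entries with a == 1 are redrawn)
    ind : (n_i, p) int array of individual-level values (entries with b == 1 are redrawn)
    g   : current household-level latent class G_i
    m   : (n_i,) current individual-level latent classes M_ij
    lam[k][g]          : probability vector for household-level variable k
    phi[k][g][m[j]]    : probability vector for individual-level variable k
    is_impossible(hh, ind) : True when the completed household is a structural zero
    """
    hh, ind = hh.copy(), ind.copy()
    for _ in range(max_tries):
        # Draw each missing household-level variable independently.
        for k in np.flatnonzero(a):
            probs = lam[k][g]
            hh[k] = rng.choice(len(probs), p=probs)
        # Draw each missing individual-level variable independently.
        for j, k in zip(*np.nonzero(b)):
            probs = phi[k][g][m[j]]
            ind[j, k] = rng.choice(len(probs), p=probs)
        # Accept the proposal only if the completed household is possible.
        if not is_impossible(hh, ind):
            return hh, ind
    raise RuntimeError("no valid completion found within max_tries")

# Toy setting (all values hypothetical): one household-level variable,
# two binary individual-level variables, one latent class at each level.
rng = np.random.default_rng(0)
lam = [np.array([[0.5, 0.5]])]
phi = [np.array([[[0.7, 0.3]]]), np.array([[[0.4, 0.6]]])]

def head_is_child(hh, ind):
    # Hypothetical structural zero: individual 0 (the household head)
    # cannot take level 0 ("child") on individual-level variable 0.
    return ind[0, 0] == 0

hh = np.array([0])                  # placeholder value, flagged missing below
ind = np.array([[0, 1], [1, 0]])    # entries flagged in b are placeholders
a = np.array([1])
b = np.array([[1, 0], [0, 1]])

new_hh, new_ind = sample_missing(hh, ind, a, b, g=0, m=np.array([0, 0]),
                                 lam=lam, phi=phi,
                                 is_impossible=head_is_child, rng=rng)
print(new_hh, new_ind)
```

Observed entries are never touched, and every missing entry is redrawn on each attempt, which corresponds to returning to step (a) after a rejection. The cap on attempts is a safeguard added in this sketch and is not part of S9'.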

To initialize each $X_i^{\text{mis}},$ we suggest sampling from the empirical marginal distribution of each variable $k$ using the available cases for each variable, and requiring that the household satisfies $X_i^1 \notin \mathcal{S}_h.$
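
One way to carry out this initialization is sketched below, restricted to individual-level variables for brevity. The function `init_from_marginals`, the household identifier array `hid`, the rule checker `is_impossible`, the cap `max_tries`, and the use of `-1` as a placeholder for missing entries are all assumptions of this example. Drawing uniformly from the observed values of a variable is equivalent to sampling from its empirical marginal distribution over the available cases.

```python
import numpy as np

def init_from_marginals(ind, b, hid, is_impossible, rng, max_tries=10_000):
    """Fill missing entries from empirical marginals, one household at a time.

    ind : (N, p) int array of individual-level values for all individuals
    b   : (N, p) 0/1 array, 1 = missing
    hid : (N,) household identifier of each individual
    is_impossible(block) : True when a completed household is a structural zero
    """
    # Available cases for each variable k: the values observed for that variable.
    pools = [ind[b[:, k] == 0, k] for k in range(ind.shape[1])]
    out = ind.copy()
    for h in np.unique(hid):
        rows = np.flatnonzero(hid == h)
        miss = np.argwhere(b[rows] == 1)   # (row within household, variable) pairs
        if miss.size == 0:
            continue
        for _ in range(max_tries):
            block = ind[rows].copy()
            for j, k in miss:
                block[j, k] = rng.choice(pools[k])
            if not is_impossible(block):   # keep only valid households
                out[rows] = block
                break
        else:
            raise RuntimeError(f"no valid initial values for household {h}")
    return out

# Toy data (hypothetical): two households of two individuals each,
# two binary variables, -1 as a placeholder where b == 1.
rng = np.random.default_rng(1)
ind = np.array([[-1, 0], [0, 1], [1, 1], [0, -1]])
b = np.array([[1, 0], [0, 0], [0, 0], [0, 1]])
hid = np.array([0, 0, 1, 1])

def head_is_child(block):
    # Hypothetical structural zero: the first individual in a household
    # cannot take level 0 on variable 0.
    return block[0, 0] == 0

out = init_from_marginals(ind, b, hid, head_is_child, rng)
print(out)
```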
