# Multiple imputation of missing values in household data with structural zeros

## 6. Discussion

The empirical study suggests that the NDPMPM can provide high quality imputations for categorical data nested within households. To our knowledge, this is the first parametric imputation engine for nested multivariate categorical data. The study also illustrates that, with modest sample sizes, agencies should not expect the NDPMPM to preserve all features of the joint distribution; of course, this is the case with any imputation engine. For the NDPMPM, agencies may be able to improve accuracy for targeted quantities by recoding the data used to fit the model. For example, one can create a new household-level variable that equals one when everyone in the household has the same race and zero otherwise, and replace the individual race variable with a new variable whose levels are “1 = race is the same as race of household head”, “2 = race is white and differs from race of household head”, “3 = race is black and differs from race of household head”, and so on. The NDPMPM would then be estimated with the household-level same race variable and the new individual-level race variable. This would encourage the NDPMPM to estimate the percentage of households in which everyone has the same race very accurately, as it would be just another household-level variable like home ownership. It also would add structural zeros involving race to the computation. Evaluating the trade-offs in accuracy and computational cost of such recodings is a topic for future research.
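As a concrete illustration of this recoding, the sketch below builds the two new variables for one household. The field names and the collapsed "other" category (code 4) are illustrative assumptions, not part of the paper.

```python
# Sketch of the recoding described above. Field names and the collapsed
# "other" category (code 4) are hypothetical, not the paper's.
# Each household is a list of person records; person 0 is the household head.

def recode_race(household):
    """Return (same_race_flag, recoded_races) for one household.

    same_race_flag: household-level variable; 1 if every member shares
                    the head's race, 0 otherwise.
    recoded_races:  per-person codes: 1 = same race as head,
                    2 = white and differs from head,
                    3 = black and differs from head,
                    4 = any other race that differs from head.
    """
    head_race = household[0]["race"]
    same_race_flag = int(all(p["race"] == head_race for p in household))
    recoded = []
    for p in household:
        if p["race"] == head_race:
            recoded.append(1)
        elif p["race"] == "white":
            recoded.append(2)
        elif p["race"] == "black":
            recoded.append(3)
        else:
            recoded.append(4)
    return same_race_flag, recoded
```

The NDPMPM would then be fit with `same_race_flag` as a household-level variable and `recoded_races` in place of the original individual race variable.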

The NDPMPM can be computationally expensive, even with the speed-ups presented in this article. The expensive parts of the algorithm are the rejection sampling steps. Fortunately, these are easily parallelized. For example, we can require each processor to generate a fraction of the impossible cases in Section 2.2, and we can spread the rejection steps for the imputations over many processors. This should cut the run time by a factor roughly equal to the number of processors available.
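A minimal sketch of this division of labor follows, with a stand-in acceptance test in place of the structural zero check; a real implementation would spread full rejection draws over separate processes or nodes rather than threads.

```python
# Sketch: each worker generates its share of the required "impossible"
# cases, as in the parallelization described above. The acceptance test
# here is a stand-in for checking membership in the structural zero set.
import random
from concurrent.futures import ThreadPoolExecutor

def sample_impossible_batch(seed, n_needed):
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_needed:
        x = rng.random()       # propose a candidate household (stand-in)
        if x < 0.3:            # stand-in for "violates a structural zero"
            draws.append(x)    # keep only the impossible cases
    return draws

def parallel_impossible(total, n_workers=4):
    # Split the required count evenly across workers (ceiling division),
    # then truncate the pooled draws back to the requested total.
    share = -(-total // n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(sample_impossible_batch, seed, share)
                   for seed in range(n_workers)]
        batches = [f.result() for f in futures]
    pooled = [d for batch in batches for d in batch]
    return pooled[:total]
```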

The empirical study used households up to size four. We have run the model on data with households up to size seven in reasonable time (a few hours on a standard laptop); the accuracy results are qualitatively similar. As household sizes grow, the model can generate hundreds or even thousands of times as many impossible households as there are feasible ones, slowing the algorithm. In such cases, the cap-and-weight approach is essential for practical applications.

## Acknowledgements

This research was supported by grants from the National Science Foundation (NSF SES 1131897) and the Alfred P. Sloan Foundation (G-2-15-20166003).

## Appendix

This Appendix contains a proof that the rejection sampling step S9' in Section 3 generates samples from the correct posterior distribution, the modified Gibbs sampler for the cap-and-weight approach, and the list of structural zero rules used in fitting the NDPMPM model. It also includes empirical results for the speedup approaches described in the paper, using synthetic data, and additional results on handling missing data with the NDPMPM under a missing completely at random scenario.

### A.1 Proof that the rejection sampling step S9' in Section 3 generates samples from the correct posterior distribution

The $X_{ik}^1$ and $X_{ijk}^1$ values generated using the rejection sampler in Step S9' are draws from the full conditionals, resulting in a valid Gibbs sampler. The proof follows from the properties of rejection sampling (simple accept-reject). The target distribution is the full conditional for $X_i^{\text{mis}}$. It can be re-expressed as

$$p\left(X_i^{\text{mis}}\right) = \frac{1\left\{X_i^1 \notin \mathcal{S}_h\right\}}{\Pr\left(X_i \notin \mathcal{S}_h \mid \theta\right)}\, g\left(X_i^{\text{mis}}\right),$$

where

$$g\left(X_i^{\text{mis}}\right) = \pi_{G_i^1} \prod_{k \mid a_{ik}=1}^{p+q} \lambda_{G_i^1 X_{ik}^1}^{(k)} \left( \prod_{j=1}^{n_i} \omega_{G_i^1 M_{ij}^1} \prod_{k \mid b_{ijk}=1}^{p} \varphi_{G_i^1 M_{ij}^1 X_{ijk}^1}^{(k)} \right).$$

Our rejection scheme uses $g\left(X_i^{\text{mis}}\right)$ as a proposal for $p\left(X_i^{\text{mis}}\right)$. To show that the draws are indeed from $p\left(X_i^{\text{mis}}\right)$, we need to verify that $w\left(X_i^{\text{mis}}\right) = p\left(X_i^{\text{mis}}\right)/g\left(X_i^{\text{mis}}\right) < M$, where $1 < M < \infty$, and that we accept each sample with probability $w\left(X_i^{\text{mis}}\right)/M$. In our case,

- $w\left(X_i^{\text{mis}}\right) = p\left(X_i^{\text{mis}}\right)/g\left(X_i^{\text{mis}}\right) = 1\left\{X_i^1 \notin \mathcal{S}_h\right\}/\Pr\left(X_i \notin \mathcal{S}_h \mid \theta\right) \le 1/\Pr\left(X_i \notin \mathcal{S}_h \mid \theta\right)$, and $0 < \Pr\left(X_i \notin \mathcal{S}_h \mid \theta\right) < 1 \Rightarrow 1 < 1/\Pr\left(X_i \notin \mathcal{S}_h \mid \theta\right) < \infty$ necessarily.
- By sampling until we obtain a valid sample satisfying $X_i^1 \notin \mathcal{S}_h$, we are indeed accepting each sample with probability $w\left(X_i^{\text{mis}}\right)/M = 1\left\{X_i^1 \notin \mathcal{S}_h\right\}$.
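A toy numerical check of this argument: resampling from a proposal $g$ until the draw leaves the "structural zero" set $\mathcal{S}$ yields draws from the renormalized (truncated) target. The discrete distribution below is a hypothetical stand-in, not the NDPMPM full conditional.

```python
# Toy check: rejection until the draw is outside S gives samples from
# p(x) = g(x) / Pr(X not in S). The distribution is a stand-in.
import random

SUPPORT = [1, 2, 3, 4]
G_PROBS = [0.4, 0.3, 0.2, 0.1]
S = {1}                                   # "impossible" values, like S_h

def draw_p(rng):
    # Rejection step S9': sample from g until the draw is outside S.
    while True:
        x = rng.choices(SUPPORT, weights=G_PROBS, k=1)[0]
        if x not in S:
            return x

rng = random.Random(7)
n = 200_000
counts = {v: 0 for v in SUPPORT}
for _ in range(n):
    counts[draw_p(rng)] += 1

# Renormalizing g over the allowed values gives the target frequencies.
z = sum(p for v, p in zip(SUPPORT, G_PROBS) if v not in S)   # = 0.6
empirical = {v: counts[v] / n for v in SUPPORT}
target = {v: (p / z if v not in S else 0.0)
          for v, p in zip(SUPPORT, G_PROBS)}
```

With these stand-in probabilities, the accepted draws settle near the renormalized frequencies 0.5, 1/3 and 1/6 for the values 2, 3 and 4.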

### A.2 Modified Gibbs sampler for the cap-and-weight approach

The modified Gibbs sampler for the cap-and-weight approach replaces steps S1, S3, S4, S5 and S6 of the Gibbs sampler in the main text as follows.

- S1*. For each $h \in \mathcal{H}$, repeat steps S1(a) to S1(e) as before, but modify step S1(f) to: if $t_1 < \lceil n_{1h} \times \psi_h \rceil$, return to step (b); otherwise, set $n_{0h} = t_0$.

- S3*. Set $u_F = 1$. Sample

$$u_g \mid - \sim \text{Beta}\left(1 + U_g,\ \alpha + \sum_{f=g+1}^{F} U_f\right), \qquad \pi_g = u_g \prod_{f<g} \left(1 - u_f\right),$$

$$U_g = \sum_{i=1}^{n} 1\left(G_i^1 = g\right) + \sum_{h \in \mathcal{H}} \frac{1}{\psi_h} \sum_{i \mid n_i^0 = h} 1\left(G_i^0 = g\right).$$

- S4*. Set $v_{gS} = 1$ for $g = 1, \dots, F$. Sample

$$v_{gm} \mid - \sim \text{Beta}\left(1 + V_{gm},\ \beta + \sum_{s=m+1}^{S} V_{gs}\right), \qquad \omega_{gm} = v_{gm} \prod_{s<m} \left(1 - v_{gs}\right),$$

$$V_{gm} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} 1\left(M_{ij}^1 = m,\ G_i^1 = g\right) + \sum_{h \in \mathcal{H}} \frac{1}{\psi_h} \sum_{i \mid n_i^0 = h} \sum_{j=1}^{h} 1\left(M_{ij}^0 = m,\ G_i^0 = g\right).$$

- S5*. Sample

$$\lambda_g^{(k)} \mid - \sim \text{Dirichlet}\left(1 + \eta_{g1}^{(k)}, \dots, 1 + \eta_{g d_k}^{(k)}\right),$$

$$\eta_{gc}^{(k)} = \sum_{i \mid G_i^1 = g} 1\left(X_{ik}^1 = c\right) + \sum_{h \in \mathcal{H}} \frac{1}{\psi_h} \sum_{i \mid n_i^0 = h,\ G_i^0 = g} 1\left(X_{ik}^0 = c\right).$$

- S6*. Sample

$$\varphi_{gm}^{(k)} \mid - \sim \text{Dirichlet}\left(1 + \nu_{gm1}^{(k)}, \dots, 1 + \nu_{gm d_k}^{(k)}\right),$$

$$\nu_{gmc}^{(k)} = \sum_{i, j \mid G_i^1 = g,\ M_{ij}^1 = m} 1\left(X_{ijk}^1 = c\right) + \sum_{h \in \mathcal{H}} \frac{1}{\psi_h} \sum_{i, j \mid n_i^0 = h,\ G_i^0 = g,\ M_{ij}^0 = m} 1\left(X_{ijk}^0 = c\right).$$
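To make the weighted counts concrete, the sketch below implements step S3* with hypothetical cluster assignments: sampled households contribute weight one to $U_g$, while capped impossible households of size $h$ contribute weight $1/\psi_h$.

```python
# Sketch of step S3*: weighted cluster counts and the stick-breaking draw.
# Inputs are hypothetical stand-ins for the Gibbs sampler's state.
import random

def weighted_counts(G1, G0, sizes0, psi, F):
    """U_g = sum_i 1(G_i^1 = g) + sum_h (1/psi_h) sum_{i | n_i^0 = h} 1(G_i^0 = g)."""
    U = [0.0] * F
    for g in G1:                       # sampled households: weight 1
        U[g] += 1.0
    for g, h in zip(G0, sizes0):       # generated impossible households
        U[g] += 1.0 / psi[h]           # capped cases are upweighted by 1/psi_h
    return U

def sample_pi(U, alpha, rng):
    """Draw pi_1..pi_F via stick breaking, with u_F = 1 closing the stick."""
    F = len(U)
    tail = [sum(U[g + 1:]) for g in range(F)]     # sum over f > g of U_f
    pi, remaining = [], 1.0
    for g in range(F):
        u = 1.0 if g == F - 1 else rng.betavariate(1 + U[g], alpha + tail[g])
        pi.append(remaining * u)
        remaining *= 1.0 - u
    return pi
```

Steps S4*, S5* and S6* follow the same pattern, downweighting each generated impossible household's contribution by the factor 1/psi_h.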

### A.3 List of structural zeros

We fit the NDPMPM model using structural zeros that involve the ages and relationships of individuals in the same household. The full list of rules is presented in Table A.1. These rules were derived from the 2012 ACS by identifying combinations involving the relationship variable that do not appear in the constructed population. This list should not be interpreted as a “true” list of impossible combinations in census data.

**Table A.1** Structural zero rules.

| | Description |
|---|---|
| | **Rules common to generating both the synthetic and imputed datasets** |
| 1 | Each household must contain exactly one head and he/she must be at least 16 years old. |
| 2 | Each household cannot contain more than one spouse and he/she must be at least 16 years old. |
| 3 | Married couples are of opposite sex, and the age difference between individuals in the couple cannot exceed 49. |
| 4 | The youngest parent must be older than the household head by at least 4. |
| 5 | The youngest parent-in-law must be older than the household head by at least 4. |
| 6 | The age difference between the household head and siblings cannot exceed 37. |
| 7 | The household head must be at least 31 years old to be a grandparent and his/her spouse must be at least 17. Also, he/she must be older than the oldest grandchild by at least 26. |
| | **Rule specific to generating the synthetic datasets** |
| 8 | The household head must be older than the oldest child by at least 7. |
| | **Rules specific to generating the imputed datasets** |
| 9 | The household head must be older than the oldest biological child by at least 7. |
| 10 | The household head must be older than the oldest adopted child by at least 11. |
| 11 | The household head must be older than the oldest stepchild by at least 9. |
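A check of this kind can be sketched as a predicate over one household's records. The field names are hypothetical, the relationship codes follow Table A.2 (1 = head, 2 = spouse, 5 = parent), and only a few of the rules are shown.

```python
# Sketch of checking a household against a few of the Table A.1 rules.
# Field names are illustrative; only rules 1, 2 and 4 are encoded here.

def violates_structural_zeros(persons):
    heads = [p for p in persons if p["rel"] == 1]
    if len(heads) != 1 or heads[0]["age"] < 16:
        return True                        # rule 1: exactly one head, age >= 16
    spouses = [p for p in persons if p["rel"] == 2]
    if len(spouses) > 1 or any(p["age"] < 16 for p in spouses):
        return True                        # rule 2: at most one spouse, age >= 16
    head_age = heads[0]["age"]
    parents = [p for p in persons if p["rel"] == 5]
    if any(p["age"] < head_age + 4 for p in parents):
        return True                        # rule 4: parents at least 4 years older
    return False
```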

### A.4 Empirical study of the speedup approaches

We evaluate the performance of the two speedup approaches mentioned in the main text using synthetic data. We use the public use microdata files from the 2012 ACS, available for download from the United States Census Bureau (http://www2.census.gov/acs2012_1yr/pums/), to construct a population of 857,018 households of sizes $\mathcal{H} = \{2, 3, 4, 5, 6\}$, from which we sample $n = 10{,}000$ households comprising $N = 29{,}117$ individuals. We work with the variables described in Table A.2. We evaluate the approaches using probabilities that depend on within-household relationships and the household head.

**Table A.2** Variables used in the empirical study.

| | Description of variable | Categories |
|---|---|---|
| Household-level variables | Ownership of dwelling | 1 = owned or being bought, 2 = rented |
| | Household size | 2 = 2 people, 3 = 3 people, 4 = 4 people, 5 = 5 people, 6 = 6 people |
| Individual-level variables | Gender | 1 = male, 2 = female |
| | Race | 1 = white, 2 = black, 3 = American Indian or Alaska native, 4 = Chinese, 5 = Japanese, 6 = other Asian/Pacific islander, 7 = other race, 8 = two major races, 9 = three or more major races |
| | Hispanic origin | 1 = not Hispanic, 2 = Mexican, 3 = Puerto Rican, 4 = Cuban, 5 = other |
| | Age | 1 = less than one year old, 2 = 1 year old, 3 = 2 years old, ..., 96 = 95 years old |
| | Relationship to head of household | 1 = household head, 2 = spouse, 3 = child, 4 = child-in-law, 5 = parent, 6 = parent-in-law, 7 = sibling, 8 = sibling-in-law, 9 = grandchild, 10 = other relative, 11 = partner/friend/visitor, 12 = other non-relative |

We consider the NDPMPM under two approaches, both of which move the values of the household head to the household level as in Section 4.1 of the main text and use the cap-and-weight sampler of Section 4.2 of the main text. The first approach sets $\psi_2 = \psi_3 = \psi_4 = \psi_5 = \psi_6 = 1$, while the second sets $\psi_2 = \psi_3 = 1/2$ and $\psi_4 = \psi_5 = \psi_6 = 1/3$. We compare these approaches to the NDPMPM as presented in Hu et al. (2018). For each approach, we create $L = 50$ synthetic datasets, $Z = (Z^{(1)}, \dots, Z^{(50)})$. We generate the synthetic datasets so that the number of households of size $h \in \mathcal{H}$ in each $Z^{(l)}$ exactly matches $n_h$ from the observed data. Thus, $Z$ comprises partially synthetic data (Little, 1993; Reiter, 2003), even though every released $Z_{ijk}$ is a simulated value. We combine the estimates using the approach in Reiter (2003).
As a brief review, let $q$ be the point estimator of some estimand $Q$, and let $u$ be the estimator of the variance associated with $q$. For $l = 1, \dots, L$, let $q_l$ and $u_l$ be the values of $q$ and $u$ in synthetic dataset $Z^{(l)}$. We use $\bar{q} = \sum_{l=1}^{L} q_l / L$ as the point estimate of $Q$ and $T = \bar{u} + b/L$ as the estimated variance of $\bar{q}$, where $b = \sum_{l=1}^{L} (q_l - \bar{q})^2 / (L - 1)$ and $\bar{u} = \sum_{l=1}^{L} u_l / L$. We make inferences about $Q$ using $(\bar{q} - Q) \sim t_v(0, T)$, where $t_v$ is a $t$-distribution with $v = (L - 1)(1 + L\bar{u}/b)^2$ degrees of freedom.
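These combining rules are simple to implement. A sketch follows, with the normal approximation 1.96 used as the default in place of the exact $t_v$ quantile; pass the exact quantile for small $v$.

```python
# The combining rules above (Reiter, 2003). q and u hold the per-dataset
# point estimates q_l and variance estimates u_l. The default quantile
# 1.96 is the normal approximation to the t_v quantile.
from statistics import mean
from math import sqrt

def combine_synthetic(q, u, quantile=1.96):
    L = len(q)
    q_bar = mean(q)                                    # point estimate
    u_bar = mean(u)
    b = sum((ql - q_bar) ** 2 for ql in q) / (L - 1)   # between variance
    T = u_bar + b / L                                  # total variance
    v = (L - 1) * (1 + L * u_bar / b) ** 2             # degrees of freedom
    half = quantile * sqrt(T)
    return q_bar, T, v, (q_bar - half, q_bar + half)
```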

For each approach, we run the MCMC sampler for 20,000 iterations, discarding the first 10,000 as burn-in and thinning the remaining samples by keeping every fifth iteration, resulting in 2,000 post burn-in MCMC iterates. We create the $L = 50$ synthetic datasets by randomly sampling from the 2,000 iterates. We set $F = 40$ and $S = 15$ for each approach based on initial tuning runs. To assess convergence, we examined trace plots of $\alpha$, $\beta$ and weighted averages of a random sample of the multinomial probabilities in the NDPMPM likelihood. Across the approaches, the effective number of occupied household-level clusters usually ranges from 20 to 33 with a maximum of 38, while the effective number of occupied individual-level clusters across all household-level clusters ranges from 5 to 9 with a maximum of 12.

Based on MCMC runs on a standard laptop, moving household heads’ data values to the household level alone results in a speedup of about 63% over the default rejection sampler, while the cap-and-weight approach alone results in a speedup of about 40%.

Table A.3 shows the 95% confidence intervals for each approach. All three approaches result in similar confidence intervals, suggesting little loss in accuracy from the speedups. Most intervals also are reasonably similar to confidence intervals based on the original data, except for the percentage of same age couples. The last row is a demanding test of how well each method can estimate a probability that is difficult to estimate accurately: the probability that a household head and spouse are the same age is hard to pin down because each individual’s age can take 96 different values. All three approaches are therefore off from the estimate based on the original data in this case. These results suggest that we can significantly speed up the sampler with minimal loss in accuracy for estimates and confidence intervals of population estimands.

**Table A.3** 95% confidence intervals under each approach.

| Estimand | | Original | NDPMPM | NDPMPM w/ HH moved | NDPMPM capped w/ HH moved |
|---|---|---|---|---|---|
| All same race | $n_i = 2$ | (0.939, 0.951) | (0.918, 0.932) | (0.912, 0.928) | (0.910, 0.925) |
| | $n_i = 3$ | (0.896, 0.920) | (0.859, 0.888) | (0.845, 0.875) | (0.844, 0.874) |
| | $n_i = 4$ | (0.885, 0.912) | (0.826, 0.860) | (0.813, 0.848) | (0.817, 0.852) |
| | $n_i = 5$ | (0.879, 0.922) | (0.786, 0.841) | (0.786, 0.841) | (0.777, 0.834) |
| | $n_i = 6$ | (0.831, 0.910) | (0.701, 0.803) | (0.718, 0.819) | (0.660, 0.768) |
| SP present | | (0.693, 0.711) | (0.678, 0.697) | (0.676, 0.695) | (0.677, 0.695) |
| SP with white HH | | (0.589, 0.608) | (0.577, 0.597) | (0.576, 0.595) | (0.575, 0.595) |
| SP with black HH | | (0.036, 0.043) | (0.035, 0.043) | (0.034, 0.042) | (0.034, 0.042) |
| White couple | | (0.570, 0.589) | (0.560, 0.580) | (0.553, 0.573) | (0.552, 0.572) |
| White couple, own | | (0.495, 0.514) | (0.468, 0.488) | (0.461, 0.481) | (0.463, 0.483) |
| Same race couple | | (0.655, 0.673) | (0.636, 0.655) | (0.626, 0.645) | (0.625, 0.644) |
| White-nonwhite couple | | (0.028, 0.035) | (0.028, 0.035) | (0.034, 0.041) | (0.036, 0.044) |
| Nonwhite couple, own | | (0.057, 0.067) | (0.047, 0.056) | (0.045, 0.053) | (0.045, 0.054) |
| Only mother present | | (0.017, 0.022) | (0.014, 0.019) | (0.014, 0.019) | (0.013, 0.018) |
| Only one parent present | | (0.021, 0.026) | (0.026, 0.032) | (0.026, 0.033) | (0.027, 0.033) |
| Children present | | (0.507, 0.527) | (0.493, 0.512) | (0.517, 0.537) | (0.511, 0.531) |
| Siblings present | | (0.022, 0.028) | (0.027, 0.034) | (0.027, 0.033) | (0.027, 0.033) |
| Grandchild present | | (0.041, 0.049) | (0.051, 0.060) | (0.049, 0.058) | (0.050, 0.059) |
| Three generations present | | (0.036, 0.044) | (0.037, 0.045) | (0.042, 0.050) | (0.040, 0.048) |
| White HH, older than SP | | (0.309, 0.327) | (0.283, 0.301) | (0.294, 0.313) | (0.302, 0.321) |
| Nonhisp HH | | (0.882, 0.894) | (0.875, 0.888) | (0.879, 0.891) | (0.876, 0.889) |
| White, Hisp HH | | (0.071, 0.082) | (0.074, 0.085) | (0.072, 0.082) | (0.073, 0.084) |
| Same age couple | | (0.087, 0.098) | (0.027, 0.034) | (0.023, 0.029) | (0.024, 0.031) |

### A.5 Empirical study of missing data imputation under MCAR

We also evaluate the performance of the NDPMPM as an imputation method under a missing completely at random (MCAR) scenario. We use the same data as in Section 5 of the main text. As a reminder, the data contain $n = 5{,}000$ households of sizes $\mathcal{H} = \{2, 3, 4\}$, comprising $N = 13{,}181$ individuals. We introduce missing values under an MCAR mechanism: we randomly select 80% of households to be complete cases for all variables. For the remaining 20%, we let the variable “household size” be fully observed and randomly, and independently, blank 50% of the values of each remaining household-level and individual-level variable. We use these low rates to mimic the actual rates of item nonresponse in census data.
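This missingness mechanism is straightforward to simulate. The sketch below assumes a hypothetical data layout: each household is a dict mapping a variable name to its list of values (length one for household-level variables).

```python
# Sketch of the MCAR mechanism described above. The data layout is a
# hypothetical stand-in, not the paper's.
import random

def blank_mcar(households, rng, p_complete=0.8, p_blank=0.5):
    """Blank values completely at random; household size stays observed."""
    for hh in households:
        if rng.random() < p_complete:
            continue                       # complete case: leave untouched
        for var, values in hh.items():
            if var == "household_size":
                continue                   # size is always fully observed
            hh[var] = [v if rng.random() >= p_blank else None
                       for v in values]
    return households
```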

As in the main text, we estimate the NDPMPM using two approaches, both combining the rejection step in Section 4.1 of the main text with the cap-and-weight approach in Section 4.2 of the main text. The first approach sets $\psi_2 = \psi_3 = \psi_4 = 1$, while the second sets $\psi_2 = \psi_3 = 1/2$ and $\psi_4 = 1/3$. For each approach, we run the MCMC sampler for 10,000 iterations, discarding the first 5,000 as burn-in and thinning the remaining samples by keeping every fifth iteration, resulting in 1,000 post burn-in MCMC iterates. We set $F = 30$ and $S = 15$ for each approach based on initial tuning runs. We monitor convergence as in the main text. For both methods, we generate $L = 50$ completed datasets, $Z = (Z^{(1)}, \dots, Z^{(50)})$, using the posterior predictive distribution of the NDPMPM, from which we estimate the same probabilities as in the main text.

Figures A.1 and A.2 display each estimated marginal, bivariate and trivariate probability $\bar{q}_{50}$ plotted against its corresponding estimate from the original data without missing values. Figure A.1 shows the results for the NDPMPM with the rejection sampler, and Figure A.2 shows the results for the NDPMPM using the cap-and-weight approach. Under both approaches, the NDPMPM captures important features of the joint distribution of the variables, as the point estimates are very close to those from the data before introducing missing values. In short, the results are very similar to those in the main text, though more accurate.

Table A.4 displays 95% confidence intervals for selected probabilities involving within-household relationships, as well as the value in the full population of 764,580 households. The intervals include the two based on the NDPMPM imputation engines and the interval from the data before introducing missingness. The intervals are generally more accurate than those presented in the main text. This is expected since we use lower rates of missingness in the MCAR scenario. For the most part, the intervals from the NDPMPM with the two approaches tend to include the true population quantity. Again, the NDPMPM imputation engine results in downward bias for the percentages of households where everyone is the same race. As mentioned in the main text, this is a challenging estimand to estimate accurately via imputation, particularly for larger households.

## Description for Figure A.1

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM with the rejection sampler (household heads’ data values moved to the household level). There are three scatter plots with a 45° straight line. The three graphs illustrate the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. For all three graphs, estimations from imputed data are close to those from the sample, almost on the line.

## Description for Figure A.2

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM using the cap-and-weight approach (household heads’ data values moved to the household level). There are three scatter plots with a 45° straight line. The three graphs illustrate the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. For all three graphs, estimations from imputed data are close to those from the sample, almost on the line.

**Table A.4** Population values $Q$ and 95% confidence intervals under MCAR.

| Estimand | | $Q$ | No missing | NDPMPM | NDPMPM capped |
|---|---|---|---|---|---|
| All same race household | $n_i = 2$ | 0.942 | (0.932, 0.949) | (0.924, 0.944) | (0.925, 0.946) |
| | $n_i = 3$ | 0.908 | (0.907, 0.937) | (0.887, 0.924) | (0.890, 0.925) |
| | $n_i = 4$ | 0.901 | (0.879, 0.917) | (0.854, 0.900) | (0.855, 0.900) |
| SP present | | 0.696 | (0.682, 0.707) | (0.683, 0.709) | (0.683, 0.709) |
| Same race CP | | 0.656 | (0.641, 0.668) | (0.637, 0.664) | (0.638, 0.665) |
| SP present, HH is White | | 0.600 | (0.589, 0.616) | (0.590, 0.618) | (0.590, 0.618) |
| White CP | | 0.580 | (0.569, 0.596) | (0.568, 0.596) | (0.568, 0.597) |
| CP with age difference less than five | | 0.488 | (0.465, 0.492) | (0.422, 0.451) | (0.422, 0.450) |
| Male HH, home owner | | 0.476 | (0.456, 0.484) | (0.455, 0.483) | (0.456, 0.485) |
| HH over 35, no CH present | | 0.462 | (0.441, 0.468) | (0.438, 0.466) | (0.438, 0.466) |
| At least one biological CH present | | 0.437 | (0.431, 0.458) | (0.432, 0.460) | (0.432, 0.460) |
| HH older than SP, White HH | | 0.322 | (0.309, 0.335) | (0.308, 0.335) | (0.306, 0.333) |
| Adult female w/ at least one CH under 5 | | 0.078 | (0.070, 0.085) | (0.068, 0.084) | (0.067, 0.083) |
| White HH with Hisp origin | | 0.066 | (0.064, 0.078) | (0.064, 0.079) | (0.064, 0.079) |
| Non-White CP, home owner | | 0.058 | (0.050, 0.063) | (0.048, 0.061) | (0.048, 0.061) |
| Two generations present, Black HH | | 0.057 | (0.053, 0.066) | (0.053, 0.066) | (0.053, 0.067) |
| Black HH, home owner | | 0.052 | (0.046, 0.058) | (0.046, 0.059) | (0.046, 0.059) |
| SP present, HH is Black | | 0.039 | (0.032, 0.042) | (0.032, 0.043) | (0.032, 0.042) |
| White-nonwhite CP | | 0.034 | (0.029, 0.039) | (0.032, 0.044) | (0.032, 0.044) |
| Hisp HH over 50, home owner | | 0.029 | (0.025, 0.034) | (0.025, 0.035) | (0.025, 0.035) |
| One grandchild present | | 0.028 | (0.023, 0.033) | (0.024, 0.034) | (0.024, 0.034) |
| Adult Black female w/ at least one CH under 18 | | 0.027 | (0.028, 0.038) | (0.027, 0.037) | (0.027, 0.037) |
| At least two generations present, Hisp CP | | 0.027 | (0.022, 0.031) | (0.022, 0.031) | (0.022, 0.031) |
| Hisp CP with at least one biological CH | | 0.025 | (0.020, 0.028) | (0.019, 0.028) | (0.019, 0.028) |
| At least three generations present | | 0.023 | (0.020, 0.028) | (0.019, 0.028) | (0.019, 0.028) |
| Only one parent | | 0.020 | (0.016, 0.024) | (0.016, 0.024) | (0.016, 0.024) |
| At least one stepchild | | 0.019 | (0.018, 0.026) | (0.018, 0.027) | (0.018, 0.027) |
| Adult Hisp male w/ at least one CH under 10 | | 0.018 | (0.017, 0.025) | (0.016, 0.025) | (0.016, 0.025) |
| At least one adopted CH, White CP | | 0.008 | (0.005, 0.010) | (0.005, 0.010) | (0.005, 0.010) |
| Black CP with at least two biological children | | 0.006 | (0.003, 0.007) | (0.003, 0.007) | (0.003, 0.007) |
| Black HH under 40, home owner | | 0.005 | (0.005, 0.009) | (0.005, 0.010) | (0.005, 0.011) |
| Three generations present, White CP | | 0.005 | (0.004, 0.008) | (0.004, 0.010) | (0.004, 0.009) |
| White HH under 25, home owner | | 0.003 | (0.002, 0.005) | (0.004, 0.009) | (0.004, 0.009) |

## References

Andridge, R.R., and Little, R.J.A. (2010). A review of
hot deck imputation for survey non-response. *International Statistical Review*, 78(1), 40-64.

Bennink, M., Croon, M.A., Kroon, B. and Vermunt, J.K. (2016).
Micro-macro multilevel latent class models with multiple discrete
individual-level variables. *Advances in
Data Analysis and Classification*.

Chambers, R., and Skinner, C. (2003). *Analysis of Survey Data*, Wiley Series in Survey Methodology, Wiley.

Dunson, D.B., and Xing, C. (2009). Nonparametric Bayes
modeling of multivariate categorical data. *Journal
of the American Statistical Association*, 104, 1042-1051.

Hu, J., Reiter, J.P. and Wang, Q. (2018). Dirichlet
process mixture models for modeling and generating synthetic versions of nested
categorical data. *Bayesian Analysis*,
13, 183-200.

Ishwaran, H., and James, L.F. (2001). Gibbs sampling
methods for stick-breaking priors. *Journal
of the American Statistical Association*, 96, 161-173.

Kalton, G., and Kasprzyk, D. (1986). The treatment of
missing survey data. *Survey Methodology*,
12, 1, 1-16. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1986001/article/14404-eng.pdf.

Little, R.J.A. (1993). Statistical analysis of masked
data. *Journal of Official Statistics*,
9, 407-426.

Manrique-Vallier, D., and Reiter, J.P. (2014). Bayesian
estimation of discrete multivariate latent structure models with structural
zeros. *Journal of Computational and
Graphical Statistics*, 23, 1061-1079.

Murray, J.S., and Reiter, J.P. (2016). Multiple
imputation of missing categorical and continuous values via Bayesian mixture
models with local dependence. *Journal
of the American Statistical Association* (forthcoming).

Raghunathan, T.E., and Rubin, D.B. (2001). Multiple imputation for statistical disclosure limitation. Technical Report.

Reiter, J.P. (2003). Inference for partially synthetic,
public use microdata sets. *Survey
Methodology*, 29, 2, 181-188. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2003002/article/6785-eng.pdf.

Reiter, J.P., and Raghunathan, T.E. (2007). The multiple
adaptations of multiple imputation. *Journal
of the American Statistical Association*, 102, 1462-1471.

Rubin, D.B. (1976). Inference and missing data (with
discussion). *Biometrika*, 63, 581-592.

Rubin, D.B. (1987). *Multiple Imputation for Nonresponse in Surveys*, New York: John Wiley &amp; Sons, Inc.

Rubin, D.B. (1993). Discussion: Statistical disclosure
limitation. *Journal of Official
Statistics*, 9, 462-468.

Savitsky, T.D., and Toth, D. (2016). Bayesian estimation
under informative sampling. *Electronic
Journal of Statistics*, 10, 1, 1677-1708.

Sethuraman, J. (1994). A constructive definition of
Dirichlet priors. *Statistica Sinica*,
4, 639-650.

Si, Y., and Reiter, J.P. (2013). Nonparametric Bayesian
multiple imputation for incomplete categorical variables in large-scale
assessment surveys. *Journal of Educational
and Behavioral Statistics*, 38, 5, 499-521.

Vermunt, J.K. (2003). Multilevel latent class models. *Sociological Methodology*, 33, 213-239.

Vermunt, J.K. (2008). Latent class and finite mixture
models for multilevel data sets. *Statistical
Methods in Medical Research*, 17, 33-51.

Walker, S.G. (2007). Sampling the Dirichlet mixture
model with slices. *Communications in
Statistics - Simulation and Computation*, 36, 45-54.

Wang, Q., Akande, O., Hu, J., Reiter, J. and Barrientos,
A. (2016). NestedCategBayesImpute: Modeling and Generating Synthetic Versions
of Nested Categorical Data in the Presence of Impossible Combinations. *The Comprehensive R Archive Network*.
