Multiple imputation of missing values in household data with structural zeros
Section 6. Discussion

The empirical study suggests that the NDPMPM can provide high-quality imputations for categorical data nested within households. To our knowledge, this is the first parametric imputation engine for nested multivariate categorical data. The study also illustrates that, with modest sample sizes, agencies should not expect the NDPMPM to preserve all features of the joint distribution. Of course, this is the case with any imputation engine. For the NDPMPM, agencies may be able to improve accuracy for targeted quantities by recoding the data used to fit the model. For example, one can create a new household-level variable that equals one when everyone has the same race and equals zero otherwise, and replace the individual race variable with a new variable that has levels “1 = race is the same as race of household head”, “2 = race is white and differs from race of household head”, “3 = race is black and differs from race of household head”, and so on. The NDPMPM would be estimated with the household-level same race variable and the new individual-level race variable. This would encourage the NDPMPM to estimate the percentages with the same race very accurately, as it would be just another household-level variable like home ownership. It also would add structural zeros involving race to the computation. Evaluating the trade-offs in accuracy and computational costs of such recodings is a topic for future research.
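The recoding described above can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `recode_household` and the data layout (a list of race codes with the household head first) are assumptions, as is the choice to keep the head's own race as a separate household-level value. Race codes are assumed to start at 1 = white, 2 = black, so a member whose race differs from the head's receives level "original code + 1".

```python
def recode_household(races):
    """Recode race within one household.

    races: list of integer race codes, household head first (assumed layout).
    Returns (same_race, head_race, recoded_members), where
      same_race       = 1 if everyone shares the head's race, else 0
      head_race       = the head's original race code
      recoded_members = for each non-head member, 1 if the race equals the
                        head's race, otherwise original code + 1
    """
    head_race = races[0]
    same_race = 1 if all(r == head_race for r in races) else 0
    recoded = [1 if r == head_race else r + 1 for r in races[1:]]
    return same_race, head_race, recoded


# A white head living with one white and one black member.
print(recode_household([1, 1, 2]))
```

Under this coding, a household with `same_race = 1` can only contain members at level 1, which is the kind of additional structural zero mentioned in the text.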

The NDPMPM can be computationally expensive, even with the speed-ups presented in this article. The expensive parts of the algorithm are the rejection sampling steps. Fortunately, these can be done easily by parallel processing. For example, we can require each processor to generate a fraction of the impossible cases in Section 2.2. We also can spread the rejection steps for the imputations over many processors. These steps should cut run time by a factor roughly equal to the number of processors available.
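As a rough sketch of how the generation of impossible cases might be divided among workers, the toy example below splits a target count evenly and lets each worker run its own rejection loop with its own random stream. Everything here is hypothetical: `draw_impossible`, `split_evenly` and `parallel_impossible` are illustrative names, and a uniform draw stands in for proposing a household and checking it against the structural zeros. Threads are used only to keep the sketch short; CPU-bound work in CPython would typically need a process pool to obtain a real speed-up.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def draw_impossible(n_target, seed, p_impossible=0.3):
    """Toy worker: propose until n_target 'impossible' cases are collected."""
    rng = random.Random(seed)            # independent stream per worker
    kept = []
    while len(kept) < n_target:
        candidate = rng.random()         # stand-in for a proposed household
        if candidate < p_impossible:     # stand-in for a structural zero check
            kept.append(candidate)
    return kept


def split_evenly(total, n_workers):
    """Split `total` cases into n_workers shares that differ by at most one."""
    base, extra = divmod(total, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


def parallel_impossible(total, n_workers=4):
    """Generate `total` impossible cases, a fraction on each worker."""
    shares = split_evenly(total, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = pool.map(draw_impossible, shares, range(n_workers))
        return [x for chunk in chunks for x in chunk]


print(len(parallel_impossible(10, n_workers=4)))
```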

The empirical study used households up to size four. We have run the model on data with households up to size seven in reasonable time (a few hours on a standard laptop). Accuracy results are qualitatively similar. As the household sizes get large, the model can generate hundreds or even thousands of times as many impossible households as there are feasible ones, slowing the algorithm. In such cases, the cap-and-weight approach is essential for practical applications.

Acknowledgements

This research was supported by grants from the National Science Foundation (NSF SES 1131897) and the Alfred P. Sloan Foundation (G-2-15-20166003).

Appendix

This is an Appendix to the paper. It contains a proof that the rejection sampling step S9' in Section 3 generates samples from the correct posterior distribution. It also contains the modified Gibbs sampler for the cap-and-weight approach and a list of the structural zero rules used in fitting the NDPMPM model. Finally, we include empirical results for the speedup approaches mentioned in the paper, using synthetic data, and additional results for handling missing data using the NDPMPM under a missing completely at random scenario.

A.1 Proof that the rejection sampling step S9' in Section 3 generates samples from the correct posterior distribution

The $X_{i k}^{1}$ and $X_{i j k}^{1}$ values generated using the rejection sampler in Step S9' are generated from the full conditionals, resulting in a valid Gibbs sampler. The proof follows from the properties of rejection sampling (or simple accept reject). The target distribution is the full conditional for $X_{i}^{mis} .$ It can be re-expressed as

$p (X_{i}^{mis}) = \frac{1 \{X_{i}^{1} \notin S_{h}\}}{\Pr (X_{i} \notin S_{h} | θ)} g (X_{i}^{mis})$

where

$g (X_{i}^{mis}) = π_{G_{i}^{1}} \prod_{k | a_{i k} =1}^{p + q} λ_{G_{i}^{1} X_{i k}^{1}}^{(k)} (\prod_{j =1}^{n_{i}} ω_{G_{i}^{1} M_{i j}^{1}} \prod_{k | b_{i j k} =1}^{p} ϕ_{G_{i}^{1} M_{i j}^{1} X_{i j k}^{1}}^{(k)}) .$

Our rejection scheme uses $g (X_{i}^{mis})$ as a proposal for $p (X_{i}^{mis}) .$ To show that the draws are indeed from $p (X_{i}^{mis}),$ we need to verify that $w (X_{i}^{mis}) = p (X_{i}^{mis}) / g (X_{i}^{mis}) \leq M$ for some constant $1< M < \infty,$ and that we are accepting each sample with probability $w (X_{i}^{mis}) / M .$ In our case,

$w (X_{i}^{mis}) = p (X_{i}^{mis}) / g (X_{i}^{mis}) = 1 \{X_{i}^{1} \notin S_{h}\} / \Pr (X_{i} \notin S_{h} | θ) \leq 1 / \Pr (X_{i} \notin S_{h} | θ),$

and $0< \Pr (X_{i} \notin S_{h} | θ) <1 \Rightarrow 1< 1 / \Pr (X_{i} \notin S_{h} | θ) < \infty$ necessarily, so we can take $M = 1 / \Pr (X_{i} \notin S_{h} | θ) .$ By sampling until we obtain a valid sample that satisfies $X_{i}^{1} \notin S_{h},$ we are indeed accepting each proposal with probability $w (X_{i}^{mis}) / M = 1 \{X_{i}^{1} \notin S_{h}\} .$
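The argument can be checked numerically on a toy example. The sketch below is not the NDPMPM sampler: it assumes a three-outcome proposal `g` and an impossible set containing a single outcome (both made up for illustration), resamples until the draw is feasible, and compares the empirical frequencies with the truncated distribution $g (x) / \Pr (x \notin S)$.

```python
import random
from collections import Counter


def sample_truncated(g, impossible, rng):
    """Draw from proposal g, resampling until the draw is not impossible."""
    outcomes = list(g)
    weights = [g[x] for x in outcomes]
    while True:
        x = rng.choices(outcomes, weights=weights, k=1)[0]
        if x not in impossible:      # accept with probability 1{x not in S}
            return x


# Toy proposal and impossible set (illustrative values only).
g = {"a": 0.5, "b": 0.3, "c": 0.2}
impossible = {"c"}

rng = random.Random(0)
n_draws = 20000
draws = Counter(sample_truncated(g, impossible, rng) for _ in range(n_draws))

# Target: g renormalized over the feasible outcomes.
p_feasible = 1 - sum(g[x] for x in impossible)
target = {x: g[x] / p_feasible for x in g if x not in impossible}

for x in sorted(target):
    print(x, round(draws[x] / n_draws, 3), round(target[x], 3))
```

The empirical frequencies should settle near 0.625 and 0.375, the proposal probabilities of the feasible outcomes divided by 0.8.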

A.2 Modified Gibbs sampler for the cap-and-weight approach

The modified Gibbs sampler for the cap-and-weight approach replaces steps S1, S3, S4, S5 and S6 of the Gibbs sampler in the main text as follows.

S1*. For each $h \in H,$ repeat steps S1(a) to S1(e) as before but modify step S1(f) to: if $t_{1} < \lceil n_{1 h} \times ψ_{h} \rceil,$ return to step (b). Otherwise, set $n_{0 h} = t_{0} .$
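To illustrate the effect of the cap in step S1*, the following toy sketch assumes that steps S1(a) to S1(e) of the main text can be summarized as "propose one household and check whether it is feasible", and replaces that check with a Bernoulli draw. The function name `capped_augmentation` and the feasibility probability `p_feasible` are assumptions made for illustration only.

```python
import math
import random


def capped_augmentation(n_1h, psi_h, p_feasible, rng):
    """Propose households until t1 reaches ceil(n_1h * psi_h).

    Returns (cap, n_0h), where n_0h = t0 is the number of impossible
    households generated before the cap on feasible ones was reached.
    """
    cap = math.ceil(n_1h * psi_h)
    t0 = t1 = 0
    while t1 < cap:
        if rng.random() < p_feasible:   # stand-in for "household is feasible"
            t1 += 1
        else:
            t0 += 1
    return cap, t0


rng = random.Random(1)
cap, n_0h = capped_augmentation(n_1h=1000, psi_h=1 / 3, p_feasible=0.05, rng=rng)
print(cap, n_0h)
```

With $ψ_{h} = 1 / 3,$ the loop stops after 334 feasible households rather than 1,000, so roughly a third as many impossible households are generated; each one is then counted with weight $1 / ψ_{h} = 3$ in steps S3* to S6*.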

S3*. Set $u_{F} =1.$ Sample

$u_{g} | - \sim \mathrm{Beta} (1 + U_{g} , α + \sum_{f = g + 1}^{F} U_{f}), \quad π_{g} = u_{g} \prod_{f < g} (1 - u_{f})$

where

$U_{g} = \sum_{i =1}^{n} 1 (G_{i}^{1} = g) + \sum_{h \in H} \frac{1}{ψ_{h}} \sum_{i | n_{i}^{0} = h} 1 (G_{i}^{0} = g)$

for $g =1, \dots, F - 1.$
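A minimal sketch of step S3* follows, assuming cluster labels are stored as 0-based integers and that the labels of the impossible households are grouped by household size in a dictionary. The function names and data layout are illustrative assumptions; only the weighted counts $U_{g}$ and the stick-breaking update come from the step above.

```python
import random


def weighted_cluster_counts(g_feasible, g_impossible_by_size, psi, F):
    """U_g: feasible counts plus impossible counts scaled by 1 / psi_h."""
    U = [0.0] * F
    for g in g_feasible:
        U[g] += 1.0
    for h, labels in g_impossible_by_size.items():
        for g in labels:
            U[g] += 1.0 / psi[h]
    return U


def sample_stick_weights(U, alpha, rng):
    """Sample u_g ~ Beta(1 + U_g, alpha + sum_{f > g} U_f) and form pi_g."""
    F = len(U)
    pi, remaining = [], 1.0
    for g in range(F):
        tail = sum(U[g + 1:])
        # The last stick is fixed at one so the weights sum to one.
        u = 1.0 if g == F - 1 else rng.betavariate(1.0 + U[g], alpha + tail)
        pi.append(u * remaining)
        remaining *= 1.0 - u
    return pi


# Toy labels: three feasible households, three impossible ones.
U = weighted_cluster_counts(
    g_feasible=[0, 0, 1],
    g_impossible_by_size={2: [1, 2], 3: [2]},
    psi={2: 1 / 2, 3: 1 / 3},
    F=3,
)
pi = sample_stick_weights(U, alpha=1.0, rng=random.Random(0))
print(U, pi)
```

In the toy input, each impossible household of size 2 counts twice and the impossible household of size 3 counts three times, which is how the capped sample stands in for the full set of impossible households.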

S4*. Set $v_{g S} =1$ for $g =1, \dots, F .$ Sample

$v_{g m} | - \sim \mathrm{Beta} (1 + V_{g m}, β + \sum_{s = m + 1}^{S} V_{g s}), \quad ω_{g m} = v_{g m} \prod_{s < m} (1 - v_{g s})$

where

$V_{g m} = \sum_{i =1}^{n} 1 (M_{i j}^{1} = m, G_{i}^{1} = g) + \sum_{h \in H} \frac{1}{ψ_{h}} \sum_{i | n_{i}^{0} = h} 1 (M_{i j}^{0} = m, G_{i}^{0} = g)$

for $m =1, \dots, S - 1$ and $g =1, \dots, F .$

S5*. Sample

$λ_{g}^{(k)} | - \sim \mathrm{Dirichlet} (1 + η_{g 1}^{(k)} , \dots, 1 + η_{g d_{k}}^{(k)})$

where

$η_{g c}^{(k)} = \sum_{i | G_{i}^{1} = g}^{n} 1 (X_{i k}^{1} = c) + \sum_{h \in H} \frac{1}{ψ_{h}} \sum_{i | n_{i}^{0} = h, G_{i}^{0} = g} 1 (X_{i k}^{0} = c)$

for $g =1, \dots, F$ and $k = p + 1, \dots, p + q .$

S6*. Sample

$ϕ_{g m}^{(k)} | - \sim \mathrm{Dirichlet} (1 + ν_{g m 1}^{(k)}, \dots, 1 + ν_{g m d_{k}}^{(k)})$

where

$ν_{g m c}^{(k)} = \sum_{i | G_{i}^{1} = g, M_{i j}^{1} = m}^{n} 1 (X_{i j k}^{1} = c) + \sum_{h \in H} \frac{1}{ψ_{h}} \sum_{i | n_{i}^{0} = h, G_{i}^{0} = g, M_{i j}^{0} = m} 1 (X_{i j k}^{0} = c)$

for $g =1, \dots, F,$ $m =1, \dots, S$ and $k =1, \dots, p .$

A.3 List of structural zeros

We fit the NDPMPM model using structural zeros which involve ages and relationships of individuals in the same house. The full list of the rules used is presented in Table A.1. These rules were derived from the 2012 ACS by identifying combinations involving the relationship variable that do not appear in the constructed population. This list should not be interpreted as a “true” list of impossible combinations in census data.

Table A.1
List of structural zeros
Applies to	Rule	Structural zero
Rules common to generating both the synthetic and imputed datasets	1	Each household must contain exactly one head and he/she must be at least 16 years old.
	2	Each household cannot contain more than one spouse and he/she must be at least 16 years old.
	3	Married couples are of opposite sex, and age difference between individuals in the couples cannot exceed 49.
	4	The youngest parent must be older than the household head by at least 4.
	5	The youngest parent-in-law must be older than the household head by at least 4.
	6	The age difference between the household head and siblings cannot exceed 37.
	7	The household head must be at least 31 years old to be a grandparent and his/her spouse must be at least 17. Also, he/she must be older than the oldest grandchild by at least 26.
Rules specific to generating the synthetic datasets	8	The household head must be older than the oldest child by at least 7.
Rules specific to generating the imputed datasets	9	The household head must be older than the oldest biological child by at least 7.
	10	The household head must be older than the oldest adopted child by at least 11.
	11	The household head must be older than the oldest stepchild by at least 9.
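Rules of this kind translate directly into predicates on a household. The sketch below checks rules 1 to 3 only and is an illustration rather than the implementation used for the paper: it assumes each person is a dictionary with hypothetical keys `relate`, `age` (in years) and `sex`, and that relationship codes 1 and 2 denote household head and spouse.

```python
HEAD, SPOUSE = 1, 2  # assumed relationship codes: 1 = head, 2 = spouse


def violates_rule_1(household):
    """Rule 1: exactly one head, who must be at least 16 years old."""
    heads = [p for p in household if p["relate"] == HEAD]
    return len(heads) != 1 or heads[0]["age"] < 16


def violates_rule_2(household):
    """Rule 2: at most one spouse, who must be at least 16 years old."""
    spouses = [p for p in household if p["relate"] == SPOUSE]
    return len(spouses) > 1 or any(p["age"] < 16 for p in spouses)


def violates_rule_3(household):
    """Rule 3: couples are of opposite sex with age difference at most 49."""
    heads = [p for p in household if p["relate"] == HEAD]
    spouses = [p for p in household if p["relate"] == SPOUSE]
    return any(
        h["sex"] == s["sex"] or abs(h["age"] - s["age"]) > 49
        for h in heads
        for s in spouses
    )


def is_impossible(household):
    """True if the household violates any of rules 1 to 3."""
    return (
        violates_rule_1(household)
        or violates_rule_2(household)
        or violates_rule_3(household)
    )


couple = [
    {"relate": HEAD, "age": 40, "sex": 1},
    {"relate": SPOUSE, "age": 38, "sex": 2},
]
print(is_impossible(couple))
```

A household rejected by `is_impossible` would be treated as a structural zero; the remaining rules in Table A.1 could be added as further predicates in the same style.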

A.4 Empirical study of the speedup approaches

We evaluate the performance of the two speedup approaches mentioned in the main text using synthetic data. We use data from the public use microdata files from the 2012 ACS, available for download from the United States Census Bureau (http://www2.census.gov/acs2012_1yr/pums/) to construct a population of 857,018 households of sizes $H = \{2, 3, 4, 5, 6\},$ from which we sample $n = 10,000$ households comprising $N = 29,117$ individuals. We work with the variables described in Table A.2. We evaluate the approaches using probabilities that depend on within household relationships and the household head.

Table A.2
Description of variables used in the synthetic data illustration
Level	Variable	Categories
Household-level variables	Ownership of dwelling	1 = owned or being bought, 2 = rented
	Household size	2 = 2 people, 3 = 3 people, 4 = 4 people, 5 = 5 people, 6 = 6 people
Individual-level variables	Gender	1 = male, 2 = female
	Race	1 = white, 2 = black, 3 = American Indian or Alaska native, 4 = Chinese, 5 = Japanese, 6 = other Asian/Pacific islander, 7 = other race, 8 = two major races, 9 = three or more major races
	Hispanic origin	1 = not Hispanic, 2 = Mexican, 3 = Puerto Rican, 4 = Cuban, 5 = other
	Age	1 = less than one year old, 2 = 1 year old, 3 = 2 years old, ..., 96 = 95 years old
	Relationship to head of household	1 = household head, 2 = spouse, 3 = child, 4 = child-in-law, 5 = parent, 6 = parent-in-law, 7 = sibling, 8 = sibling-in-law, 9 = grandchild, 10 = other relative, 11 = partner/friend/visitor, 12 = other non-relative

We consider the NDPMPM using two approaches, both moving the values of the household head to the household level as in Section 4.1 of the main text and also using the cap-and-weight approach in Section 4.2 of the main text. The first approach considers $ψ_{2} = ψ_{3} = ψ_{4} = ψ_{5} = ψ_{6} =1$ while the second approach considers $ψ_{2} = ψ_{3} = 1 / 2$ and $ψ_{4} = ψ_{5} = ψ_{6} = 1 / 3 .$ We compare these approaches to the NDPMPM as presented in Hu et al. (2018). For each approach, we create $L =50$ synthetic datasets, $Z = (Z^{(1)} , \dots, Z^{(50)}) .$ We generate the synthetic datasets so that the number of households of size $h \in H$ in each $Z^{(l)}$ exactly matches $n_{h}$ from the observed data. Thus, $Z$ comprises partially synthetic data (Little, 1993; Reiter, 2003), even though every released $Z_{i j k}$ is a simulated value. We combine the estimates using the approach in Reiter (2003). As a brief review, let $q$ be the point estimator of some estimand $Q,$ and let $u$ be the estimator of variance associated with $q .$ For $l =1, \dots, L,$ let $q_{l}$ and $u_{l}$ be the values of $q$ and $u$ in synthetic dataset $Z^{(l)} .$ We use $\bar{q} = \sum_{l =1}^{L} q_{l} / L$ as the point estimate of $Q$ and $T = \bar{u} + b / L$ as the estimated variance of $\bar{q},$ where $b = \sum_{l =1}^{L} {(q_{l} - \bar{q})}^{2} / (L - 1)$ and $\bar{u} = \sum_{l =1}^{L} u_{l} / L .$ We make inference about $Q$ using $(\bar{q} - Q) \sim t_{v} (0, T),$ where $t_{v}$ is a $t$-distribution with $v = (L - 1) {(1 + L \bar{u} / b)}^{2}$ degrees of freedom.
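The combining rules reviewed above amount to a few lines of arithmetic. The sketch below is a minimal illustration with a hypothetical function name and made-up inputs; it returns $\bar{q},$ $T$ and $v$ and leaves the $t$ quantile needed for an interval to a statistics library.

```python
from statistics import mean, variance


def combine_partially_synthetic(q, u):
    """Combine point estimates q_l and variance estimates u_l, l = 1..L.

    Returns (q_bar, T, v) with T = u_bar + b / L and
    v = (L - 1) * (1 + L * u_bar / b) ** 2.
    """
    L = len(q)
    q_bar = mean(q)
    u_bar = mean(u)
    b = variance(q)              # between-dataset variance, divisor L - 1
    T = u_bar + b / L
    v = (L - 1) * (1 + L * u_bar / b) ** 2
    return q_bar, T, v


# Made-up estimates from L = 3 synthetic datasets.
q_bar, T, v = combine_partially_synthetic(
    q=[0.50, 0.52, 0.48],
    u=[0.0004, 0.0004, 0.0004],
)
print(q_bar, T, v)
```

In this toy input $b = \bar{u} = 0.0004,$ so $T = 0.0004 + 0.0004 / 3$ and $v = 2 \times (1 + 3)^{2} = 32.$ A 95% interval would then be $\bar{q} \pm t_{v, 0.975} \sqrt{T}.$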

For each approach, we run the MCMC sampler for 20,000 iterations, discarding the first 10,000 as burn-in and thinning the remaining samples every five iterations, resulting in 2,000 MCMC post burn-in iterates. We create the $L =50$ synthetic datasets by randomly sampling from the 2,000 iterates. We set $F =40$ and $S =15$ for each approach based on initial tuning runs. For convergence, we examined trace plots of $α,$ $β$ and weighted averages of a random sample of the multinomial probabilities in the NDPMPM likelihood. Across the approaches, the effective number of occupied household-level clusters usually ranges from 20 to 33 with a maximum of 38, while the effective number of occupied individual-level clusters across all household-level clusters ranges from 5 to 9 with a maximum of 12.

Based on MCMC runs on a standard laptop, moving household heads’ data values to the household level alone results in a speedup of about 63% on the default rejection sampler while the cap-and-weight approach alone results in a speedup of about 40%.

Table A.3 shows the 95% confidence intervals for each approach. Essentially, all three approaches result in similar confidence intervals, suggesting not much loss in accuracy from the speedups. Most intervals also are reasonably similar to confidence intervals based on the original data, except for the percentage of same age couples. The last row is a rigorous test of how well each method can estimate a probability that can be fairly difficult to estimate accurately. In this case, the probability that a household head and spouse are the same age can be difficult to estimate since each individual’s age can take 96 different values. All three approaches are thus off from the estimate from the original data in this case. These results suggest that we can significantly speed up the sampler with minimal loss in accuracy of estimates and confidence intervals of population estimands.

Table A.3
Confidence intervals for selected probabilities that depend on within-household relationships in the original and synthetic datasets. “Original” is based on the sampled data, “NDPMPM” is the default MCMC sampler described in Section 2.2 of the main text, “NDPMPM w/ HH moved” is the default sampler, moving household heads’ data values to the household level, “NDPMPM capped w/ HH moved” uses the cap-and-weight approach and moving household heads’ data values to the household level. “HH” means household head and “SP” means spouse
Estimand	Household size	Original	NDPMPM	NDPMPM w/ HH moved	NDPMPM capped w/ HH moved
All same race	$n_{i} =2$	(0.939, 0.951)	(0.918, 0.932)	(0.912, 0.928)	(0.910, 0.925)
	$n_{i} =3$	(0.896, 0.920)	(0.859, 0.888)	(0.845, 0.875)	(0.844, 0.874)
	$n_{i} =4$	(0.885, 0.912)	(0.826, 0.860)	(0.813, 0.848)	(0.817, 0.852)
	$n_{i} =5$	(0.879, 0.922)	(0.786, 0.841)	(0.786, 0.841)	(0.777, 0.834)
	$n_{i} =6$	(0.831, 0.910)	(0.701, 0.803)	(0.718, 0.819)	(0.660, 0.768)
SP present		(0.693, 0.711)	(0.678, 0.697)	(0.676, 0.695)	(0.677, 0.695)
SP with white HH		(0.589, 0.608)	(0.577, 0.597)	(0.576, 0.595)	(0.575, 0.595)
SP with black HH		(0.036, 0.043)	(0.035, 0.043)	(0.034, 0.042)	(0.034, 0.042)
White couple		(0.570, 0.589)	(0.560, 0.580)	(0.553, 0.573)	(0.552, 0.572)
White couple, own		(0.495, 0.514)	(0.468, 0.488)	(0.461, 0.481)	(0.463, 0.483)
Same race couple		(0.655, 0.673)	(0.636, 0.655)	(0.626, 0.645)	(0.625, 0.644)
White-nonwhite couple		(0.028, 0.035)	(0.028, 0.035)	(0.034, 0.041)	(0.036, 0.044)
Nonwhite couple, own		(0.057, 0.067)	(0.047, 0.056)	(0.045, 0.053)	(0.045, 0.054)
Only mother present		(0.017, 0.022)	(0.014, 0.019)	(0.014, 0.019)	(0.013, 0.018)
Only one parent present		(0.021, 0.026)	(0.026, 0.032)	(0.026, 0.033)	(0.027, 0.033)
Children present		(0.507, 0.527)	(0.493, 0.512)	(0.517, 0.537)	(0.511, 0.531)
Siblings present		(0.022, 0.028)	(0.027, 0.034)	(0.027, 0.033)	(0.027, 0.033)
Grandchild present		(0.041, 0.049)	(0.051, 0.060)	(0.049, 0.058)	(0.050, 0.059)
Three generations present		(0.036, 0.044)	(0.037, 0.045)	(0.042, 0.050)	(0.040, 0.048)
White HH, older than SP		(0.309, 0.327)	(0.283, 0.301)	(0.294, 0.313)	(0.302, 0.321)
Nonhisp HH		(0.882, 0.894)	(0.875, 0.888)	(0.879, 0.891)	(0.876, 0.889)
White, Hisp HH		(0.071, 0.082)	(0.074, 0.085)	(0.072, 0.082)	(0.073, 0.084)
Same age couple		(0.087, 0.098)	(0.027, 0.034)	(0.023, 0.029)	(0.024, 0.031)

A.5 Empirical study of missing data imputation under MCAR

We also evaluate the performance of the NDPMPM as an imputation method under a missing completely at random (MCAR) scenario. We use the same data as in Section 5 of the main text. As a reminder, the data contain $n = 5,000$ households of sizes $H = \{2, 3, 4\},$ comprising $N = 13,181$ individuals. We introduce missing values using an MCAR scenario. We randomly select 80% of households to be complete cases for all variables. For the remaining 20%, we let the variable “household size” be fully observed and randomly, and independently, blank 50% of each variable for the remaining household-level and individual-level variables. We use these low rates to mimic the actual rates of item nonresponse in census data.
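The missingness mechanism can be sketched as follows, under simplifying assumptions that are not from the paper: records are flat dictionaries rather than nested households and individuals, the key `size` stands for household size, and "blank 50% of each variable" is implemented as an independent draw with probability 0.5 for every value rather than as an exact 50% share. The function name `make_mcar` is hypothetical.

```python
import random


def make_mcar(records, rng, complete_frac=0.8, blank_prob=0.5, keep=("size",)):
    """Return a copy of records with values blanked completely at random.

    A fraction complete_frac of records is left fully observed. In the rest,
    every variable not listed in `keep` is set to None independently with
    probability blank_prob.
    """
    n = len(records)
    n_incomplete = n - round(complete_frac * n)
    incomplete = set(rng.sample(range(n), n_incomplete))
    out = []
    for i, rec in enumerate(records):
        new = dict(rec)                     # leave the input untouched
        if i in incomplete:
            for var in new:
                if var not in keep and rng.random() < blank_prob:
                    new[var] = None
        out.append(new)
    return out


# Toy data: 100 identical records with three variables.
records = [{"size": 2, "own": 1, "race": 1} for _ in range(100)]
masked = make_mcar(records, random.Random(1))
n_with_missing = sum(any(v is None for v in r.values()) for r in masked)
print(n_with_missing)
```

Because blanking does not depend on any observed or unobserved value, the missingness is MCAR by construction.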

Similar to the main text, we estimate the NDPMPM using two approaches, both combining the rejection step in Section 4.1 of the main text with the cap-and-weight approach in Section 4.2 of the main text. The first approach considers $ψ_{2} = ψ_{3} = ψ_{4} =1$ while the second approach considers $ψ_{2} = ψ_{3} = 1 / 2$ and $ψ_{4} = 1 / 3 .$ For each approach, we run the MCMC sampler for 10,000 iterations, discarding the first 5,000 as burn-in and thinning the remaining samples every five iterations, resulting in 1,000 MCMC post burn-in iterates. We set $F =30$ and $S =15$ for each approach based on initial tuning runs. We monitor convergence as in the main text. For both methods, we generate $L =50$ completed datasets, $Z = (Z^{(1)} , \dots, Z^{(50)}),$ using the posterior predictive distribution of the NDPMPM, from which we estimate the same probabilities as in the main text.

Figures A.1 and A.2 display each estimated marginal, bivariate and trivariate probability ${\bar{q}}_{50}$ plotted against its corresponding estimate from the original data, without missing values. Figure A.1 shows the results for the NDPMPM with the rejection sampler, and Figure A.2 shows the results for the NDPMPM using the cap-and-weight approach. For both approaches, the NDPMPM does a good job of capturing important features of the joint distribution of the variables as the point estimates are very close to those from the data before introducing missing values. In short, the results are very similar to those in the main text, though more accurate.

Table A.4 displays 95% confidence intervals for selected probabilities involving within-household relationships, as well as the value in the full population of 764,580 households. The intervals include the two based on the NDPMPM imputation engines and the interval from the data before introducing missingness. The intervals are generally more accurate than those presented in the main text. This is expected since we use lower rates of missingness in the MCAR scenario. For the most part, the intervals from the NDPMPM with the two approaches tend to include the true population quantity. Again, the NDPMPM imputation engine results in downward bias for the percentages of households where everyone is the same race. As mentioned in the main text, this is a challenging estimand to estimate accurately via imputation, particularly for larger households.

Figure A.1 Marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM with the rejection sampler. Household heads’ data values moved to the household level

Description for Figure A.1

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM with the rejection sampler (household heads’ data values moved to the household level). There are three scatter plots with a 45° straight line. The three graphs illustrate the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. For all three graphs, estimations from imputed data are close to those from the sample, almost on the line.

Figure A.2 Marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM using the cap-and-weight approach. Household heads’ data values moved to the household level

Description for Figure A.2

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets under MCAR from the truncated NDPMPM using the cap-and-weight approach (household heads’ data values moved to the household level). There are three scatter plots with a 45° straight line. The three graphs illustrate the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. For all three graphs, estimations from imputed data are close to those from the sample, almost on the line.

Table A.4
Confidence intervals for selected probabilities that depend on within-household relationships in the original and imputed datasets under MCAR. “No missing” is based on the sampled data before introducing missing values, “NDPMPM” uses the truncated NDPMPM, moving household heads’ data values to the household level, and “NDPMPM Capped” uses the truncated NDPMPM with the cap-and-weight approach and moving household heads’ data values to the household level. “HH” means household head, “SP” means spouse, “CH” means child, and “CP” means couple. Q is the value in the full population of 764,580 households
Estimand	Household size	Q	No Missing	NDPMPM	NDPMPM Capped
All same race household	$n_{i} =2$	0.942	(0.932, 0.949)	(0.924, 0.944)	(0.925, 0.946)
	$n_{i} =3$	0.908	(0.907, 0.937)	(0.887, 0.924)	(0.890, 0.925)
	$n_{i} =4$	0.901	(0.879, 0.917)	(0.854, 0.900)	(0.855, 0.900)
SP present		0.696	(0.682, 0.707)	(0.683, 0.709)	(0.683, 0.709)
Same race CP		0.656	(0.641, 0.668)	(0.637, 0.664)	(0.638, 0.665)
SP present, HH is White		0.600	(0.589, 0.616)	(0.590, 0.618)	(0.590, 0.618)
White CP		0.580	(0.569, 0.596)	(0.568, 0.596)	(0.568, 0.597)
CP with age difference less than five		0.488	(0.465, 0.492)	(0.422, 0.451)	(0.422, 0.450)
Male HH, home owner		0.476	(0.456, 0.484)	(0.455, 0.483)	(0.456, 0.485)
HH over 35, no CH present		0.462	(0.441, 0.468)	(0.438, 0.466)	(0.438, 0.466)
At least one biological CH present		0.437	(0.431, 0.458)	(0.432, 0.460)	(0.432, 0.460)
HH older than SP, White HH		0.322	(0.309, 0.335)	(0.308, 0.335)	(0.306, 0.333)
Adult female w/ at least one CH under 5		0.078	(0.070, 0.085)	(0.068, 0.084)	(0.067, 0.083)
White HH with Hisp origin		0.066	(0.064, 0.078)	(0.064, 0.079)	(0.064, 0.079)
Non-White CP, home owner		0.058	(0.050, 0.063)	(0.048, 0.061)	(0.048, 0.061)
Two generations present, Black HH		0.057	(0.053, 0.066)	(0.053, 0.066)	(0.053, 0.067)
Black HH, home owner		0.052	(0.046, 0.058)	(0.046, 0.059)	(0.046, 0.059)
SP present, HH is Black		0.039	(0.032, 0.042)	(0.032, 0.043)	(0.032, 0.042)
White-nonwhite CP		0.034	(0.029, 0.039)	(0.032, 0.044)	(0.032, 0.044)
Hisp HH over 50, home owner		0.029	(0.025, 0.034)	(0.025, 0.035)	(0.025, 0.035)
One grandchild present		0.028	(0.023, 0.033)	(0.024, 0.034)	(0.024, 0.034)
Adult Black female w/ at least one CH under 18		0.027	(0.028, 0.038)	(0.027, 0.037)	(0.027, 0.037)
At least two generations present, Hisp CP		0.027	(0.022, 0.031)	(0.022, 0.031)	(0.022, 0.031)
Hisp CP with at least one biological CH		0.025	(0.020, 0.028)	(0.019, 0.028)	(0.019, 0.028)
At least three generations present		0.023	(0.020, 0.028)	(0.019, 0.028)	(0.019, 0.028)
Only one parent		0.020	(0.016, 0.024)	(0.016, 0.024)	(0.016, 0.024)
At least one stepchild		0.019	(0.018, 0.026)	(0.018, 0.027)	(0.018, 0.027)
Adult Hisp male w/ at least one CH under 10		0.018	(0.017, 0.025)	(0.016, 0.025)	(0.016, 0.025)
At least one adopted CH, White CP		0.008	(0.005, 0.010)	(0.005, 0.010)	(0.005, 0.010)
Black CP with at least two biological children		0.006	(0.003, 0.007)	(0.003, 0.007)	(0.003, 0.007)
Black HH under 40, home owner		0.005	(0.005, 0.009)	(0.005, 0.010)	(0.005, 0.011)
Three generations present, White CP		0.005	(0.004, 0.008)	(0.004, 0.010)	(0.004, 0.009)
White HH under 25, home owner		0.003	(0.002, 0.005)	(0.004, 0.009)	(0.004, 0.009)

References

Andridge, R.R., and Little, R.J.A. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40-64.

Bennink, M., Croon, M.A., Kroon, B. and Vermunt, J.K. (2016). Micro-macro multilevel latent class models with multiple discrete individual-level variables. Advances in Data Analysis and Classification.

Chambers, R., and Skinner, C. (2003). Analysis of Survey Data, Wiley Series in Survey Methodology, Wiley.

Dunson, D.B., and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042-1051.

Hu, J., Reiter, J.P. and Wang, Q. (2018). Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. Bayesian Analysis, 13, 183-200.

Ishwaran, H., and James, L.F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 161-173.

Kalton, G., and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, 12, 1, 1-16. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1986001/article/14404-eng.pdf.

Little, R.J.A. (1993). Statistical analysis of masked data. Journal of Official Statistics, 9, 407-426.

Manrique-Vallier, D., and Reiter, J.P. (2014). Bayesian estimation of discrete multivariate latent structure models with structural zeros. Journal of Computational and Graphical Statistics, 23, 1061-1079.

Murray, J.S., and Reiter, J.P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence (forthcoming). Journal of the American Statistical Association.

Raghunathan, T.E., and Rubin, D.B. (2001). Multiple imputation for statistical disclosure limitation. Technical Report.

Reiter, J.P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 2, 181-188. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2003002/article/6785-eng.pdf.

Reiter, J.P., and Raghunathan, T.E. (2007). The multiple adaptations of multiple imputation. Journal of the American Statistical Association, 102, 1462-1471.

Rubin, D.B. (1976). Inference and missing data (with discussion). Biometrika, 63, 581-592.

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys, New York: John Wiley & Sons, Inc.

Rubin, D.B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9, 462-468.

Savitsky, T.D., and Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 10.1, 1677-1708.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Si, Y., and Reiter, J.P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38(5), 499-521.

Vermunt, J.K. (2003). Multilevel latent class models. Sociological Methodology, 213-239.

Vermunt, J.K. (2008). Latent class and finite mixture models for multilevel data sets. Statistical Methods in Medical Research, 33-51.

Walker, S.G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 1, 45-54.

Wang, Q., Akande, O., Hu, J., Reiter, J. and Barrientos, A. (2016). NestedCategBayesImpute: Modeling and Generating Synthetic Versions of Nested Categorical Data in the Presence of Impossible Combinations. The Comprehensive R Archive Network.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2019-07-04
