# Multiple imputation of missing values in household data with structural zeros

## Section 5. Empirical study

To evaluate the performance of the NDPMPM as an imputation method, as well as the speed-up strategies, we use data from the public use microdata files of the 2012 ACS, available for download from the United States Census Bureau (http://www2.census.gov/acs2012_1yr/pums/). We construct a population of 764,580 households of sizes $\mathcal{H}=\{2, 3, 4\},$ from which we sample $n=5{,}000$ households comprising $N=13{,}181$ individuals. We work with the variables described in Table 5.1, which mimic those in the U.S. decennial census. The structural zeros involve ages and relationships of individuals in the same household; see the Appendix for a full list of the rules that we used. We move the household head to the household level as in Section 4.1 to take advantage of the computational gains.

We introduce missing values using the following scenario. We let household size and age of household heads be fully observed. We randomly and independently blank 30% of each variable for the remaining household-level variables. For individuals other than the household head, we randomly and independently blank 30% of the values for gender, race and Hispanic origin. We make age missing with rates 50%, 20%, 40% and 30% for values of the relationship variable in the sets {2}, {3, 4, 5, 10}, {7, 9} and {6, 8, 11, 12, 13}, respectively. We make the relationship variable missing with rates 40%, 25%, 10%, and 55% for values of age in the sets $\{x : x \le 20\},$ $\{x : 20 < x \le 50\},$ $\{x : 50 < x \le 70\}$ and $\{x : x > 70\},$ respectively. This results in approximately 30% missing values for both variables. About 8% of the individuals in the sample are missing both the age and relationship variable, and 2% are missing gender, age, and relationship jointly. This mechanism results in data that technically are not missing at random, but we use the NDPMPM approach regardless to examine its potential in a complicated missingness mechanism. Actual rates of item nonresponse in census data tend to be smaller than what we use here, but we use high rates to put the NDPMPM through a challenging stress test. We also introduce missing values using a missing completely at random scenario with rates in the 10% range across all the variables. In short, the results are similar to those here, though more accurate due to the lower rates of missingness. See the Appendix for the results.
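As a rough illustration, the blanking of age and relationship for one individual could be simulated along the following lines. This is a minimal sketch, not the code used in the study: the record layout (a dict with keys `age` and `relate`), the function names, and the treatment of relationship codes not listed in the text are all assumptions. The text does not say whether the age thresholds refer to age in years or to the coded values of Table 5.1, so the sketch simply applies them to whatever value is stored, with the interval boundaries treated as inclusive on the right.

```python
import random

# Missingness rates for age, keyed by relationship code (rates from the text).
AGE_RATE_BY_RELATE = {}
for codes, rate in [({2}, 0.50), ({3, 4, 5, 10}, 0.20),
                    ({7, 9}, 0.40), ({6, 8, 11, 12, 13}, 0.30)]:
    for code in codes:
        AGE_RATE_BY_RELATE[code] = rate


def relate_missing_rate(age):
    """Missingness rate for the relationship variable given the true age."""
    if age <= 20:
        return 0.40
    if age <= 50:
        return 0.25
    if age <= 70:
        return 0.10
    return 0.55


def blank_age_and_relate(person, rng):
    """Return a copy of `person` with age and/or relate possibly set to None.

    Both decisions use the true values, so the chance that one variable is
    missing depends on another variable that may itself be missing.
    """
    out = dict(person)
    # Assumption: codes not listed in the text are never blanked here.
    if rng.random() < AGE_RATE_BY_RELATE.get(person["relate"], 0.0):
        out["age"] = None
    if rng.random() < relate_missing_rate(person["age"]):
        out["relate"] = None
    return out
```

Because each blanking probability depends on a variable that can itself be blanked, a sketch like this makes it easy to see why the resulting data are not missing at random.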

Table 5.1
Description of variables used in the study. “HH” means household head.

| Description of variable | Categories |
|---|---|
| Household-level variables | |
| Ownership of dwelling | 1 = owned or being bought, 2 = rented |
| Household size | 2 = 2 people, 3 = 3 people, 4 = 4 people |
| Gender of HH | 1 = male, 2 = female |
| Race of HH | 1 = white, 2 = black, 3 = American Indian or Alaska native, 4 = Chinese, 5 = Japanese, 6 = other Asian/Pacific islander, 7 = other race, 8 = two major races, 9 = three or more major races |
| Hispanic origin of HH | 1 = not Hispanic, 2 = Mexican, 3 = Puerto Rican, 4 = Cuban, 5 = other |
| Age of HH | 1 = less than one year old, 2 = 1 year old, 3 = 2 years old, ..., 96 = 95 years old |
| Individual-level variables | |
| Gender | same as “Gender of HH” |
| Race | same as “Race of HH” |
| Hispanic origin | same as “Hispanic origin of HH” |
| Age | same as “Age of HH” |
| Relationship to head of household | 1 = spouse, 2 = biological child, 3 = adopted child, 4 = stepchild, 5 = sibling, 6 = parent, 7 = grandchild, 8 = parent-in-law, 9 = child-in-law, 10 = other relative, 11 = boarder, roommate or partner, 12 = other non-relative or foster child |

We estimate the NDPMPM using two approaches, both using the rejection step S9' in Section 3. The first approach considers ${\psi }_{2}={\psi }_{3}={\psi }_{4}=1,$ i.e., without using the cap-and-weight approach, while the second approach considers ${\psi }_{2}={\psi }_{3}=1/2$ and ${\psi }_{4}=1/3.$ For each approach, we run the MCMC sampler for 10,000 iterations, discarding the first 5,000 as burn-in and thinning the remaining samples every five iterations, resulting in 1,000 MCMC post burn-in iterates. We set $F=30$ and $S=15$ for each approach based on initial tuning runs. Across the approaches, the effective number of occupied household-level clusters usually ranges from 13 to 16 with a maximum of 25, while the effective number of occupied individual-level clusters across all household-level clusters ranges from 3 to 5 with a maximum of 10. For convergence, we examined trace plots of $\alpha ,$ $\beta ,$ and weighted averages of a random sample of the multinomial probabilities in (2.3) and (2.4) (since the multinomial probabilities themselves are prone to label switching).
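The burn-in and thinning arithmetic can be checked with a few lines of code. This is only a sketch: the function name is hypothetical, and the choice of exactly which iterations are retained (every fifth one, counted from the end of burn-in) is an assumption, since the text gives only the counts.

```python
def retained_iterations(n_iter, burn_in, thin):
    """1-based indices of the MCMC iterations kept after burn-in and thinning."""
    return list(range(burn_in + thin, n_iter + 1, thin))


# Settings from the text: 10,000 iterations, 5,000 burn-in, thin by five.
kept = retained_iterations(n_iter=10_000, burn_in=5_000, thin=5)
```

With these settings `len(kept)` is 1,000, matching the number of post burn-in iterates reported above.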

For both methods, we generate $L=50$ completed datasets, $Z=(Z^{(1)}, \dots, Z^{(50)}),$ using the posterior predictive distribution of the NDPMPM, from which we estimate all marginal distributions, bivariate distributions of all possible pairs of variables, and trivariate distributions of all possible triplets of variables. We also estimate several probabilities that depend on within-household relationships and the household head to investigate the performance of the NDPMPM in estimating complex relationships. We obtain confidence intervals using multiple imputation inferences (Rubin, 1987). As a brief review, let $q$ be the completed-data point estimator of some estimand $Q,$ and let $u$ be the estimator of variance associated with $q.$ For $l=1, \dots, L,$ let $q^{(l)}$ and $u^{(l)}$ be the values of $q$ and $u$ in completed dataset $Z^{(l)}.$ We use $\overline{q}_{L}=\sum_{l=1}^{L} q^{(l)}/L$ as the point estimate of $Q.$ We use $T_{L}=(1+1/L)\,b_{L}+\overline{u}_{L}$ as the estimated variance of $\overline{q}_{L},$ where $b_{L}=\sum_{l=1}^{L}(q^{(l)}-\overline{q}_{L})^{2}/(L-1)$ and $\overline{u}_{L}=\sum_{l=1}^{L} u^{(l)}/L.$ We make inference about $Q$ using $(\overline{q}_{L}-Q)\sim t_{v}(0, T_{L}),$ where $t_{v}$ is a $t$-distribution with $v=(L-1)\left(1+\overline{u}_{L}/\left[(1+1/L)\,b_{L}\right]\right)^{2}$ degrees of freedom.
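The combining rules reviewed above translate directly into code. The following is a minimal sketch using only the Python standard library; the function name `combine_mi` and its return convention are hypothetical, and the handling of the degenerate case with no between-imputation variability is an assumption added for safety rather than something discussed in the text.

```python
from statistics import mean, variance


def combine_mi(q, u):
    """Combine completed-data estimates q[l] and variances u[l], l = 1..L.

    Returns (q_bar, T, v): the point estimate, its estimated variance, and
    the degrees of freedom of the reference t-distribution.
    """
    L = len(q)
    if L < 2 or len(u) != L:
        raise ValueError("need L >= 2 estimates with matching variances")
    q_bar = mean(q)          # average of the completed-data estimates
    u_bar = mean(u)          # average within-imputation variance
    b = variance(q)          # between-imputation variance, divisor L - 1
    if b == 0:
        # Assumption: with no between-imputation variability the reference
        # distribution is treated as normal (infinite degrees of freedom).
        return q_bar, u_bar, float("inf")
    T = (1 + 1 / L) * b + u_bar
    v = (L - 1) * (1 + u_bar / ((1 + 1 / L) * b)) ** 2
    return q_bar, T, v
```

A 95% interval would then be the point estimate plus or minus the 0.975 quantile of a $t_{v}$ distribution times $\sqrt{T_{L}}$; computing that quantile requires a statistics library and is left out of the sketch.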

Figures 5.1 and 5.2 display the value of ${\overline{q}}_{50}$ for each estimated marginal, bivariate and trivariate probability plotted against its corresponding estimate from the original data, without missing values. Figure 5.1 shows the results for the NDPMPM with the rejection sampler, and Figure 5.2 shows the results for the NDPMPM using the cap-and-weight approach. For both approaches, the point estimates are close to those from the data before introducing missing values, suggesting that the NDPMPM does a good job of capturing important features of the joint distribution of the variables. Figure 5.2 in particular also shows that the cap-and-weight approach did not degrade the estimates.
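To make the comparison concrete, the marginal, bivariate and trivariate probabilities can be computed as relative frequencies over all subsets of one, two or three variables, and then averaged across the completed datasets. The sketch below assumes each dataset is a flat list of dicts with one entry per record; the function names are hypothetical, and the sketch does not address how household-level and individual-level variables were combined in the study.

```python
from collections import Counter
from itertools import combinations


def kway_probabilities(records, variables, k):
    """Relative frequency of every observed level combination,
    for every subset of k variables."""
    n = len(records)
    probs = {}
    for subset in combinations(variables, k):
        counts = Counter(tuple(r[v] for v in subset) for r in records)
        for levels, c in counts.items():
            probs[(subset, levels)] = c / n
    return probs


def average_over_imputations(completed, variables, k):
    """Average the k-way probabilities over the completed datasets.

    A cell that does not occur in a given dataset contributes zero.
    """
    total = Counter()
    for records in completed:
        total.update(kway_probabilities(records, variables, k))
    return {cell: s / len(completed) for cell, s in total.items()}
```

Calling `average_over_imputations` with `k` set to 1, 2 and 3 gives the quantities on the vertical axes of the three panels, while `kway_probabilities` applied to the data without missing values gives the horizontal axes.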

Table 5.2 displays 95% confidence intervals for several probabilities involving within-household relationships, as well as the value in the full population of 764,580 households. The intervals include the two based on the NDPMPM imputation engines and the interval from the data before introducing missingness. For the latter, we use the usual Wald interval, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n},$ where $\hat{p}$ is the corresponding sample percentage. For the most part, the intervals from the NDPMPM with the full rejection sampling are close to those based on the data without any missingness. They tend to include the true population quantity. The NDPMPM imputation engine results in noticeable downward bias for the percentages of households where everyone is the same race, with bias increasing as the household size gets bigger. This is a challenging estimand to estimate accurately via imputation, particularly for larger households. Hu et al. (2018) identified biases in the same direction when using the NDPMPM (with household head data treated as individual-level variables) to generate fully synthetic data, noting that the bias gets smaller as the sample size increases. The NDPMPM fits the joint distribution of the data better and better as the sample size grows. Hence, we expect the NDPMPM imputation engine to be more accurate with larger sample sizes, as well as with smaller fractions of missing values.
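For reference, the Wald interval used for the data without missing values can be written as a short function. This is a minimal sketch with a hypothetical function name; `p_hat` is expressed as a proportion between 0 and 1, and `n` is whatever sample size applies to the estimand, which the text does not spell out for each row of the table.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Wald interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and p_hat between 0 and 1")
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width
```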

The interval estimates from the cap-and-weight method are generally similar to those for the full rejection sampler, with some degradation particularly for the percentages of same race households by household size. This degradation comes with a benefit, however. Based on MCMC runs on a standard laptop, the NDPMPM using the cap-and-weight approach and moving household heads’ data values to the household level is about 42% faster than the NDPMPM with household heads’ data values moved to the household level.

Description for Figure 5.1

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets from the truncated NDPMPM with the rejection sampler (household heads’ data values moved to the household level). There are three scatter plots, each with a 45° straight line, showing the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. In all three graphs, the estimates from the imputed data are close to those from the sample, lying almost on the line.

Description for Figure 5.2

Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets from the truncated NDPMPM using the cap-and-weight approach (household heads’ data values moved to the household level). There are three scatter plots, each with a 45° straight line, showing the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. In all three graphs, the estimates from the imputed data are close to those from the sample, lying almost on the line. The cap-and-weight approach did not degrade the estimates.

Table 5.2
95% confidence intervals for probabilities involving within-household relationships. “Q” is the value in the full population of 764,580 households; “No Missing” is based on the data before introducing missingness; “NDPMPM” uses the full rejection sampler and “NDPMPM Capped” uses the cap-and-weight approach.

|  | Q | No Missing | NDPMPM | NDPMPM Capped |
|---|---|---|---|---|
| All same race household: | | | | |
| | 0.942 | (0.932, 0.949) | (0.891, 0.917) | (0.884, 0.911) |
| | 0.908 | (0.907, 0.937) | (0.843, 0.890) | (0.821, 0.870) |
| | 0.901 | (0.879, 0.917) | (0.793, 0.851) | (0.766, 0.828) |
| | 0.696 | (0.682, 0.707) | (0.695, 0.722) | (0.695, 0.722) |
| | 0.656 | (0.641, 0.668) | (0.640, 0.669) | (0.634, 0.664) |
| | 0.600 | (0.589, 0.616) | (0.603, 0.632) | (0.604, 0.634) |
| | 0.580 | (0.569, 0.596) | (0.577, 0.606) | (0.574, 0.604) |
| | 0.488 | (0.465, 0.492) | (0.341, 0.371) | (0.324, 0.355) |
| | 0.476 | (0.456, 0.484) | (0.450, 0.479) | (0.451, 0.480) |
| | 0.462 | (0.441, 0.468) | (0.442, 0.470) | (0.443, 0.471) |
| | 0.437 | (0.431, 0.458) | (0.430, 0.459) | (0.428, 0.456) |
| | 0.322 | (0.309, 0.335) | (0.307, 0.339) | (0.311, 0.343) |
| | 0.078 | (0.070, 0.085) | (0.062, 0.078) | (0.061, 0.077) |
| | 0.066 | (0.064, 0.078) | (0.062, 0.079) | (0.062, 0.078) |
| | 0.058 | (0.050, 0.063) | (0.038, 0.052) | (0.037, 0.051) |
| | 0.057 | (0.053, 0.066) | (0.052, 0.066) | (0.052, 0.067) |
| | 0.052 | (0.046, 0.058) | (0.044, 0.058) | (0.044, 0.059) |
| | 0.039 | (0.032, 0.042) | (0.032, 0.044) | (0.031, 0.043) |
| | 0.034 | (0.029, 0.039) | (0.038, 0.053) | (0.043, 0.059) |
| | 0.029 | (0.025, 0.034) | (0.023, 0.034) | (0.024, 0.034) |
| | 0.028 | (0.023, 0.033) | (0.024, 0.035) | (0.023, 0.035) |
| | 0.027 | (0.028, 0.038) | (0.025, 0.036) | (0.025, 0.036) |
| | 0.027 | (0.022, 0.031) | (0.022, 0.032) | (0.023, 0.033) |
| | 0.025 | (0.020, 0.028) | (0.019, 0.029) | (0.020, 0.030) |
| | 0.023 | (0.020, 0.028) | (0.017, 0.026) | (0.017, 0.026) |
| | 0.020 | (0.016, 0.024) | (0.013, 0.021) | (0.013, 0.021) |
| | 0.019 | (0.018, 0.026) | (0.019, 0.030) | (0.019, 0.030) |
| | 0.018 | (0.017, 0.025) | (0.014, 0.022) | (0.014, 0.022) |
| | 0.008 | (0.005, 0.010) | (0.004, 0.010) | (0.004, 0.011) |
| | 0.006 | (0.003, 0.007) | (0.003, 0.007) | (0.003, 0.007) |
| | 0.005 | (0.005, 0.009) | (0.006, 0.013) | (0.007, 0.013) |
| | 0.005 | (0.004, 0.008) | (0.004, 0.010) | (0.004, 0.009) |
| | 0.003 | (0.002, 0.005) | (0.003, 0.007) | (0.003, 0.007) |
