Multiple imputation of missing values in household data with structural zeros
Section 5. Empirical study
To evaluate the performance of
the NDPMPM as an imputation method, as well as the speed-up strategies, we use
data from the public use microdata files of the 2012 American Community Survey (ACS), available for
download from the United States Census Bureau (http://www2.census.gov/acs2012_1yr/pums/).
We construct a population of 764,580 households of sizes
from which we sample
households comprising
individuals. We work with the variables
described in Table 5.1, which mimic those in the U.S. decennial census.
The structural zeros involve ages and relationships of individuals in the same
household; see the Appendix for a full list of the rules that we used. We move the
household head to the household level as in Section 4.1 to take advantage
of the computational gains.
We introduce missing values
using the following scenario. We let household size and age of household heads
be fully observed. We randomly and independently blank 30% of the values of each of
the remaining household-level variables. For individuals other than the
household head, we randomly and independently blank 30% of the values for
gender, race and Hispanic origin. We make age missing with rates 50%, 20%, 40%
and 30% for values of the relationship variable in the sets {2}, {3, 4, 5, 10},
{7, 9} and {6, 8, 11, 12, 13}, respectively. We make
the relationship variable missing with rates 40%, 25%, 10%, and 55% for values
of age in the sets
and
respectively. This results in approximately
30% missing values for both variables. About 8% of the individuals in the
sample are missing both age and relationship, and 2% are missing
gender, age, and relationship jointly. This mechanism results in data that
technically are not missing at random, but we use the NDPMPM approach
regardless to examine its potential under a complicated missingness mechanism.
Actual rates of item nonresponse in census data tend to be smaller than those we
use here, but we use high rates to put the NDPMPM through a challenging stress
test. We also introduce missing values using a missing completely at random
scenario with rates in the 10% range across all the variables. In short, the
results are similar to those here, though more accurate due to the lower rates
of missingness. See the Appendix for the results.
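The blanking scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout (a dictionary with keys `gender`, `race`, `hisp`, `age` and `relate`), the function name `blank_individual` and the toy records are all hypothetical. Only the rates and the relationship code sets are taken from the text. The age-dependent rates for the relationship variable are not included in the sketch.

```python
import random

# Missingness rates for age, keyed by the relationship codes quoted in the text.
AGE_MISSING_RATE = {}
for codes, rate in [({2}, 0.50), ({3, 4, 5, 10}, 0.20),
                    ({7, 9}, 0.40), ({6, 8, 11, 12, 13}, 0.30)]:
    for code in codes:
        AGE_MISSING_RATE[code] = rate

FLAT_RATE = 0.30  # gender, race and Hispanic origin are blanked independently


def blank_individual(person, rng):
    """Return a copy of `person` with values set to None at the stated rates.

    Assumes `person` is not a household head, so its relationship code
    appears in AGE_MISSING_RATE.
    """
    out = dict(person)
    for var in ("gender", "race", "hisp"):
        if rng.random() < FLAT_RATE:
            out[var] = None
    # Age missingness depends on the true relationship value.
    if rng.random() < AGE_MISSING_RATE[person["relate"]]:
        out["age"] = None
    return out


rng = random.Random(2012)
# Hypothetical toy records: 20,000 individuals, all with relationship code 2.
people = [{"gender": 1, "race": 1, "hisp": 1, "age": 40, "relate": 2}
          for _ in range(20000)]
blanked = [blank_individual(p, rng) for p in people]
age_rate = sum(p["age"] is None for p in blanked) / len(blanked)
gender_rate = sum(p["gender"] is None for p in blanked) / len(blanked)
```

Because the probability that age is blanked depends on the relationship value, which can itself be missing, a mechanism of this kind is not missing at random in the technical sense noted above.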
Table 5.1
Description of variables used in the study. “HH” means household head
| Description of variable | Categories |
|---|---|
| Household-level variables | |
| Ownership of dwelling | 1 = owned or being bought, 2 = rented |
| Household size | 2 = 2 people, 3 = 3 people, 4 = 4 people |
| Gender of HH | 1 = male, 2 = female |
| Race of HH | 1 = white, 2 = black, 3 = American Indian or Alaska native, 4 = Chinese, 5 = Japanese, 6 = other Asian/Pacific islander, 7 = other race, 8 = two major races, 9 = three or more major races |
| Hispanic origin of HH | 1 = not Hispanic, 2 = Mexican, 3 = Puerto Rican, 4 = Cuban, 5 = other |
| Age of HH | 1 = less than one year old, 2 = 1 year old, 3 = 2 years old, ..., 96 = 95 years old |
| Individual-level variables | |
| Gender | same as “Gender of HH” |
| Race | same as “Race of HH” |
| Hispanic origin | same as “Hispanic origin of HH” |
| Age | same as “Age of HH” |
| Relationship to head of household | 1 = spouse, 2 = biological child, 3 = adopted child, 4 = stepchild, 5 = sibling, 6 = parent, 7 = grandchild, 8 = parent-in-law, 9 = child-in-law, 10 = other relative, 11 = boarder, roommate or partner, 12 = other non-relative or foster child |
We
estimate the NDPMPM using two approaches, both using the rejection step S9' in
Section 3. The first approach considers
i.e., without using the cap-and-weight
approach, while the second approach considers
and
For each approach, we run the MCMC sampler for
10,000 iterations, discarding the first 5,000 as burn-in and thinning the
remaining samples every five iterations, resulting in 1,000 MCMC post burn-in
iterates. We set
and
for each approach based on initial tuning
runs. Across the approaches, the effective number of occupied household-level
clusters usually ranges from 13 to 16 with a maximum of 25, while the effective
number of occupied individual-level clusters across all household-level
clusters ranges from 3 to 5 with a maximum of 10. For convergence, we examined
trace plots of
and weighted averages of a random sample of
the multinomial probabilities in (2.3) and (2.4) (since the multinomial
probabilities themselves are prone to label switching).
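As a quick check of the iteration counts quoted above, the following sketch computes which iterations are retained after burn-in and thinning. The indexing convention (zero-based, keeping the first post burn-in draw) is an assumption; the text only gives the counts.

```python
# Iteration counts quoted in the text.
n_iter, burn_in, thin = 10_000, 5_000, 5

# Zero-based indices of the retained draws: drop the first `burn_in`
# iterations, then keep every `thin`-th one (assumed convention).
kept = list(range(burn_in, n_iter, thin))
n_kept = len(kept)
```

With these settings `n_kept` equals 1,000, matching the number of post burn-in iterates reported.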
For both methods, we generate $L = 50$ completed datasets, indexed by
$l = 1, \ldots, L$, using the posterior
predictive distribution of the NDPMPM, from which we estimate all marginal
distributions, bivariate distributions of all possible pairs of variables, and
trivariate distributions of all possible triplets of variables. We also
estimate several probabilities that depend on within-household relationships
and the household head to investigate the performance of the NDPMPM in
estimating complex relationships. We obtain confidence intervals using multiple
imputation inferences (Rubin, 1987). As a brief review, let $q$
be the completed-data point estimator of some estimand $Q$,
and let $u$ be the estimator of variance associated with $q$.
For $l = 1, \ldots, L$, let $q_l$ and $u_l$ be the values of $q$ and $u$
in completed dataset $l$. We use $\bar{q}_L = \sum_{l=1}^{L} q_l / L$
as the point estimate of $Q$. We use $T_L = (1 + 1/L) b_L + \bar{u}_L$
as the estimated variance of $\bar{q}_L$, where
$b_L = \sum_{l=1}^{L} (q_l - \bar{q}_L)^2 / (L - 1)$ and
$\bar{u}_L = \sum_{l=1}^{L} u_l / L$. We make inference about $Q$
using $(\bar{q}_L - Q) \sim t_v(0, T_L)$, where $t_v$ is a $t$
distribution with $v = (L - 1)\left(1 + \bar{u}_L / [(1 + 1/L) b_L]\right)^2$
degrees of freedom.
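The combining rules of Rubin (1987) reviewed above can be sketched as follows. This is an illustrative helper, not code from the study: the function name `combine_mi` and the toy inputs are hypothetical, and the sketch stops at the point estimate, total variance and degrees of freedom (an interval additionally needs a $t$ quantile, which the Python standard library does not provide).

```python
from math import inf
from statistics import mean, variance


def combine_mi(q, u):
    """Combine completed-data estimates q[l] and variances u[l], l = 1..L.

    Returns (q_bar, T, df): the multiple imputation point estimate, its
    estimated total variance, and the degrees of freedom of the reference
    t distribution.
    """
    L = len(q)
    if L < 2 or len(u) != L:
        raise ValueError("need at least two datasets and matching lengths")
    q_bar = mean(q)            # average of the point estimates
    u_bar = mean(u)            # average within-imputation variance
    b = variance(q)            # between-imputation variance (divisor L - 1)
    T = (1 + 1 / L) * b + u_bar
    if b == 0:
        df = inf               # no between-imputation variability
    else:
        df = (L - 1) * (1 + u_bar / ((1 + 1 / L) * b)) ** 2
    return q_bar, T, df


# Hypothetical toy input: three completed datasets.
q_bar, T, df = combine_mi([0.50, 0.52, 0.48], [0.0004, 0.0004, 0.0004])
```

The between-imputation term is inflated by the factor (1 + 1/L) to account for using a finite number of completed datasets.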
Figures 5.1 and 5.2
display the value of $\bar{q}_L$, the average over the 50 completed datasets,
for each estimated marginal, bivariate and
trivariate probability plotted against its corresponding estimate from the
original data, without missing values. Figure 5.1 shows the results for
the NDPMPM with the rejection sampler, and Figure 5.2 shows the results
for the NDPMPM using the cap-and-weight approach. For both approaches, the
point estimates are close to those from the data before introducing missing
values, suggesting that the NDPMPM does a good job of capturing important
features of the joint distribution of the variables. Figure 5.2 in
particular also shows that the cap-and-weight approach did not degrade the
estimates.
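The marginal, bivariate and trivariate probabilities compared in these figures are cell proportions over every subset of one, two or three variables. A minimal sketch of that enumeration, assuming a completed dataset stored as a list of dictionaries (the layout, the variable names and the function `joint_probabilities` are hypothetical):

```python
from collections import Counter
from itertools import combinations


def joint_probabilities(records, variables, order):
    """Empirical cell probabilities for every `order`-way subset of variables.

    Returns a dict keyed by (subset, cell), where `subset` is a tuple of
    variable names and `cell` is the tuple of their observed values.
    """
    n = len(records)
    out = {}
    for subset in combinations(variables, order):
        counts = Counter(tuple(r[v] for v in subset) for r in records)
        for cell, count in counts.items():
            out[(subset, cell)] = count / n
    return out


# Hypothetical toy dataset with three categorical variables.
records = [
    {"own": 1, "gender": 1, "race": 1},
    {"own": 1, "gender": 2, "race": 1},
    {"own": 2, "gender": 2, "race": 2},
    {"own": 1, "gender": 2, "race": 1},
]
variables = ["own", "gender", "race"]
marginal = joint_probabilities(records, variables, 1)
bivariate = joint_probabilities(records, variables, 2)
trivariate = joint_probabilities(records, variables, 3)
```

Applying such a function to each completed dataset and averaging the results cell by cell gives the quantity on the vertical axis of the plots; applying it to the data before blanking gives the horizontal axis.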
Table 5.2 displays 95%
confidence intervals for several probabilities involving within-household
relationships, as well as the value in the full population of 764,580
households. The intervals include the two based on the NDPMPM imputation
engines and the interval from the data before introducing missingness. For the
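latter, we use the usual Wald interval, $\hat{p} \pm 1.96\sqrt{\hat{p}(1 - \hat{p})/n}$,
where $\hat{p}$
is the corresponding sample percentage and $n$ is the corresponding sample size. For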
the most part, the intervals from the NDPMPM with the full rejection sampling
are close to those based on the data without any missingness. They tend to
include the true population quantity. The NDPMPM imputation engine results in
noticeable downward bias for the percentages of households where everyone is
the same race, with bias increasing as the household size gets bigger. This is
a challenging estimand to estimate accurately via imputation, particularly for
larger households. Hu et al. (2018) identified biases in the same
direction when using the NDPMPM (with household head data treated as
individual-level variables) to generate fully synthetic data, noting that the
bias gets smaller as the sample size increases. The NDPMPM fits the joint
distribution of the data better and better as the sample size grows. Hence, we
expect the NDPMPM imputation engine to be more accurate with larger sample
sizes, as well as with smaller fractions of missing values.
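For reference, the usual 95% Wald interval used for the data without missing values can be computed as in the sketch below. The function name and the example inputs are hypothetical; the sample size behind each estimate in the study is not restated here.

```python
from math import sqrt


def wald_interval(p_hat, n, z=1.96):
    """Wald interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    if not 0.0 <= p_hat <= 1.0 or n <= 0:
        raise ValueError("p_hat must be in [0, 1] and n must be positive")
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical example: a sample proportion of 0.5 from 100 households.
lower, upper = wald_interval(0.5, 100)
```

The interval can extend below 0 or above 1 for proportions near the boundary with small samples, which is worth keeping in mind for the rarer estimands.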
The interval estimates from
the cap-and-weight method are generally similar to those for the full rejection
sampler, with some degradation, particularly for the percentages of same-race
households by household size. This degradation comes with a benefit, however.
Based on MCMC runs on a standard laptop, the NDPMPM using the cap-and-weight
approach and moving household heads’ data values to the household level is
about 42% faster than the NDPMPM with the full rejection sampler and household
heads’ data values moved to the household level.
Description for Figure 5.1
Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets from the truncated NDPMPM with the rejection sampler (household heads’ data values moved to the household level). There are three scatter plots, each with a 45° straight line, showing the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. In all three plots, the estimates from the imputed data are close to those from the sample and lie almost on the line.
Description for Figure 5.2
Figure presenting the marginal, bivariate and trivariate probabilities computed in the sample and imputed datasets from the truncated NDPMPM using the cap-and-weight approach (household heads’ data values moved to the household level). There are three scatter plots, each with a 45° straight line, showing the marginal, bivariate and trivariate probabilities respectively. The average from 50 imputed datasets is on the y-axis, ranging from 0.0 to 1.0. The sample estimate is on the x-axis, ranging from 0.0 to 0.6. In all three plots, the estimates from the imputed data are close to those from the sample and lie almost on the line. The cap-and-weight approach did not degrade the estimates.
Table 5.2
Confidence intervals for selected probabilities that depend on within-household relationships in the original and imputed datasets. “No missing” is based on the sampled data before introducing missing values, “NDPMPM” uses the truncated NDPMPM, moving household heads’ data values to the household level, and “NDPMPM Capped” uses the truncated NDPMPM with the cap-and-weight approach and moving household heads’ data values to the household level. “HH” means household head, “SP” means spouse, “CH” means child, and “CP” means couple. Q is the value in the full population of 764,580 households.
| | Q | No Missing | NDPMPM | NDPMPM Capped |
|---|---|---|---|---|
| All same race household: | | | | |
| Household size 2 | 0.942 | (0.932, 0.949) | (0.891, 0.917) | (0.884, 0.911) |
| Household size 3 | 0.908 | (0.907, 0.937) | (0.843, 0.890) | (0.821, 0.870) |
| Household size 4 | 0.901 | (0.879, 0.917) | (0.793, 0.851) | (0.766, 0.828) |
| SP present | 0.696 | (0.682, 0.707) | (0.695, 0.722) | (0.695, 0.722) |
| Same race CP | 0.656 | (0.641, 0.668) | (0.640, 0.669) | (0.634, 0.664) |
| SP present, HH is White | 0.600 | (0.589, 0.616) | (0.603, 0.632) | (0.604, 0.634) |
| White CP | 0.580 | (0.569, 0.596) | (0.577, 0.606) | (0.574, 0.604) |
| CP with age difference less than five | 0.488 | (0.465, 0.492) | (0.341, 0.371) | (0.324, 0.355) |
| Male HH, home owner | 0.476 | (0.456, 0.484) | (0.450, 0.479) | (0.451, 0.480) |
| HH over 35, no CH present | 0.462 | (0.441, 0.468) | (0.442, 0.470) | (0.443, 0.471) |
| At least one biological CH present | 0.437 | (0.431, 0.458) | (0.430, 0.459) | (0.428, 0.456) |
| HH older than SP, White HH | 0.322 | (0.309, 0.335) | (0.307, 0.339) | (0.311, 0.343) |
| Adult female w/ at least one CH under 5 | 0.078 | (0.070, 0.085) | (0.062, 0.078) | (0.061, 0.077) |
| White HH with Hisp origin | 0.066 | (0.064, 0.078) | (0.062, 0.079) | (0.062, 0.078) |
| Non-White CP, home owner | 0.058 | (0.050, 0.063) | (0.038, 0.052) | (0.037, 0.051) |
| Two generations present, Black HH | 0.057 | (0.053, 0.066) | (0.052, 0.066) | (0.052, 0.067) |
| Black HH, home owner | 0.052 | (0.046, 0.058) | (0.044, 0.058) | (0.044, 0.059) |
| SP present, HH is Black | 0.039 | (0.032, 0.042) | (0.032, 0.044) | (0.031, 0.043) |
| White-nonwhite CP | 0.034 | (0.029, 0.039) | (0.038, 0.053) | (0.043, 0.059) |
| Hisp HH over 50, home owner | 0.029 | (0.025, 0.034) | (0.023, 0.034) | (0.024, 0.034) |
| One grandchild present | 0.028 | (0.023, 0.033) | (0.024, 0.035) | (0.023, 0.035) |
| Adult Black female w/ at least one CH under 18 | 0.027 | (0.028, 0.038) | (0.025, 0.036) | (0.025, 0.036) |
| At least two generations present, Hisp CP | 0.027 | (0.022, 0.031) | (0.022, 0.032) | (0.023, 0.033) |
| Hisp CP with at least one biological CH | 0.025 | (0.020, 0.028) | (0.019, 0.029) | (0.020, 0.030) |
| At least three generations present | 0.023 | (0.020, 0.028) | (0.017, 0.026) | (0.017, 0.026) |
| Only one parent | 0.020 | (0.016, 0.024) | (0.013, 0.021) | (0.013, 0.021) |
| At least one stepchild | 0.019 | (0.018, 0.026) | (0.019, 0.030) | (0.019, 0.030) |
| Adult Hisp male w/ at least one CH under 10 | 0.018 | (0.017, 0.025) | (0.014, 0.022) | (0.014, 0.022) |
| At least one adopted CH, White CP | 0.008 | (0.005, 0.010) | (0.004, 0.010) | (0.004, 0.011) |
| Black CP with at least two biological children | 0.006 | (0.003, 0.007) | (0.003, 0.007) | (0.003, 0.007) |
| Black HH under 40, home owner | 0.005 | (0.005, 0.009) | (0.006, 0.013) | (0.007, 0.013) |
| Three generations present, White CP | 0.005 | (0.004, 0.008) | (0.004, 0.010) | (0.004, 0.009) |
| White HH under 25, home owner | 0.003 | (0.002, 0.005) | (0.003, 0.007) | (0.003, 0.007) |
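Estimands in Table 5.2 such as the share of households in which everyone is the same race are proportions of household-level indicators, so they depend on the joint values of all members of a household. A minimal sketch of one such computation, assuming each household is stored as a list of member dictionaries with a `race` key and with the household head included as a member (the layout and the function name `same_race_share` are hypothetical):

```python
def same_race_share(households, size):
    """Share of households of the given size whose members all share one race."""
    eligible = [h for h in households if len(h) == size]
    if not eligible:
        raise ValueError("no households of the requested size")
    same = sum(1 for h in eligible if len({m["race"] for m in h}) == 1)
    return same / len(eligible)


# Hypothetical toy data: three households of size 2 and one of size 3.
households = [
    [{"race": 1}, {"race": 1}],
    [{"race": 1}, {"race": 2}],
    [{"race": 2}, {"race": 2}],
    [{"race": 1}, {"race": 1}, {"race": 1}],
]
share_size_2 = same_race_share(households, 2)
share_size_3 = same_race_share(households, 3)
```

A single imputed race value that differs from the rest of the household flips the indicator to zero, and larger households offer more opportunities for this to happen, which is consistent with the downward bias growing with household size.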
Survey Methodology, ISSN 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa