Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 3. Simulation-based evaluation of imputation methods
Methods for missing data imputation are usually evaluated via real-data-based simulations (van Buuren, 2018). Namely, one creates missing values from a complete dataset according to a missing data mechanism (Little and Rubin, 2014), imputes the missing values by a specific method, and then compares these imputed values with the original "true" values based on some metrics.
We first give a quick review of Rubin's MI combination rules. Let $Q$ be the target estimand in the population, and $\hat{q}^{(m)}$ and $\hat{u}^{(m)}$ be the point and variance estimate of $Q$ based on the $m$-th imputed dataset ($m = 1, \ldots, M$), respectively. The MI point estimate of $Q$ is $\bar{q}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{q}^{(m)}$, and the corresponding estimate of the variance is equal to
$$T_M = \bar{u}_M + \left(1 + \frac{1}{M}\right) b_M, \tag{3.1}$$
where $\bar{u}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{u}^{(m)}$ and $b_M = \frac{1}{M-1}\sum_{m=1}^{M} \left(\hat{q}^{(m)} - \bar{q}_M\right)^2$. The confidence interval of $Q$ is constructed using $\bar{q}_M \pm t_{\nu, 1-\alpha/2}\sqrt{T_M}$, where $t_{\nu}$ is a $t$-distribution with $\nu = (M-1)\left\{1 + \bar{u}_M/\left[(1 + 1/M)\,b_M\right]\right\}^2$ degrees of freedom.
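As a concrete illustration, below is a minimal sketch of these combination rules in Python, assuming the $M$ per-dataset point and variance estimates have already been computed and are passed in as arrays; the helper name `rubin_combine` and its interface are our own illustration, not part of any imputation package.

```python
# A minimal sketch of Rubin's combining rules; rubin_combine is an
# illustrative helper, not part of any existing package.
import numpy as np
from scipy import stats

def rubin_combine(q_hat, u_hat, alpha=0.05):
    """Combine M point and variance estimates from M imputed datasets."""
    q_hat = np.asarray(q_hat, dtype=float)   # one point estimate per imputation
    u_hat = np.asarray(u_hat, dtype=float)   # one variance estimate per imputation
    M = len(q_hat)
    q_bar = q_hat.mean()                     # MI point estimate
    u_bar = u_hat.mean()                     # within-imputation variance
    b = q_hat.var(ddof=1)                    # between-imputation variance
    T = u_bar + (1.0 + 1.0 / M) * b          # total variance, as in (3.1)
    # Degrees of freedom of the t reference distribution
    nu = (M - 1) * (1.0 + u_bar / ((1.0 + 1.0 / M) * b)) ** 2
    half_width = stats.t.ppf(1.0 - alpha / 2.0, df=nu) * np.sqrt(T)
    return q_bar, T, (q_bar - half_width, q_bar + half_width)
```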
The first step in our simulation-based evaluation procedure is choosing a dataset with all values observed, which is taken as the "population". We then choose a set of target estimands and compute their values from this population data, which are taken as the "ground truth". The estimands are usually summary statistics of the variables or parameters in a downstream analysis model, e.g., a coefficient in a regression model (Tang, Song, Belin and Unützer, 2005; Huque, Carlin, Simpson and Lee, 2018). Second, we randomly draw without replacement $H$ samples of size $n$ from the population data, and in each of the $H$ samples create missing data according to a specific missing data mechanism and a pre-fixed proportion of missingness. Third, for each simulated sample with missing data, we create $M$ imputed datasets using the imputation method under consideration and construct the point and interval estimate of each estimand using Rubin's rules. Lastly, we compute performance metrics of each estimand from the quantities obtained in the previous step.
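To make the four steps concrete, the sketch below runs the procedure for a single estimand (the marginal probability of a binary variable), using a simple random draw from the observed values as a stand-in for the imputation method under consideration; the population, sample size, number of simulations, and missingness proportion are all illustrative choices, and `rubin_combine` refers to the sketch above.

```python
# A schematic of the evaluation loop for one estimand; the imputation step is
# a simple random draw from observed values, standing in for the method under
# study.
import numpy as np

rng = np.random.default_rng(0)
population = rng.binomial(1, 0.3, size=1_000_000).astype(float)
Q = population.mean()                      # step 1: ground-truth estimand
H, n, M, miss_prop = 100, 1_000, 10, 0.3   # illustrative settings

estimates = []
for h in range(H):
    # Step 2: draw a sample without replacement and create MCAR missingness.
    sample = rng.choice(population, size=n, replace=False)
    missing = rng.random(n) < miss_prop
    # Step 3: create M imputed datasets and collect point/variance estimates.
    q_hat, u_hat = [], []
    for m in range(M):
        imputed = sample.copy()
        imputed[missing] = rng.choice(sample[~missing], size=missing.sum())
        q_hat.append(imputed.mean())
        u_hat.append(imputed.var(ddof=1) / n)
    estimates.append(rubin_combine(q_hat, u_hat))  # Rubin's rules (see above)
# Step 4: compute performance metrics from `estimates` (see the metrics below).
```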
In the empirical application, we select a large complete subsample from the American Community Survey (ACS), a national survey that bears the hallmarks of many big survey datasets, as our population. Since discrete variables are prevalent in the ACS, as well as in most survey data, we focus on the marginal probabilities of binary and categorical variables; e.g., a categorical variable with $C$ categories has $C$ estimands. To evaluate how well the imputation methods preserve the multivariate distributional properties, similar to Akande et al. (2017), we also consider the bivariate probabilities of all two-way combinations of categories in binary and categorical variables. Another useful metric is the finite-sample pairwise correlations between continuous variables. For continuous variables, the common estimands are the mean, median or variance. To facilitate meaningful comparisons of the results between the categorical and continuous variables, we propose to discretize each continuous variable into a fixed number of categories based on the sample quantiles. We then evaluate these binned continuous variables as categorical variables based on the aforementioned estimands of marginal and bivariate probabilities.
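For instance, the binning can be done with quantile cuts; the sketch below uses pandas, where the choice of five bins and the toy values are purely illustrative.

```python
# Discretize a continuous variable into quantile-based categories; the choice
# of five bins is illustrative.
import pandas as pd

income = pd.Series([12.0, 35.5, 48.2, 20.1, 90.0, 55.3, 27.8, 64.4])
binned = pd.qcut(income, q=5, labels=False, duplicates="drop")
# `binned` can now be evaluated like any other categorical variable, via its
# marginal and bivariate probabilities.
```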
For each estimand $Q$, we consider three metrics. The first metric focuses on bias. To accommodate close-to-zero estimands that are prevalent in probabilities of categorical variables, we consider the absolute standardized bias (ASB) of each estimand:
$$\mathrm{ASB} = \frac{\left|\frac{1}{H}\sum_{h=1}^{H} \hat{q}_h - Q\right|}{\mathrm{sd}(\hat{q}_1, \ldots, \hat{q}_H)}, \tag{3.2}$$
where $\hat{q}_h$ is the MI point estimate of $Q$ in simulation $h$, and $\mathrm{sd}(\cdot)$ denotes the sample standard deviation across the $H$ simulations. The second metric is the relative mean squared error (Rel.MSE), which is the ratio between the MSE of estimating $Q$ from the imputed data and that from the sampled data before introducing the missing data:
$$\mathrm{Rel.MSE} = \frac{\sum_{h=1}^{H} (\hat{q}_h - Q)^2}{\sum_{h=1}^{H} (\tilde{q}_h - Q)^2}, \tag{3.3}$$
where $\hat{q}_h$ is defined earlier, and $\tilde{q}_h$ is the prototype estimator of $Q$, i.e., the point estimate from the complete sampled data in simulation $h$. The third metric is the coverage rate, which is the proportion of the $(1-\alpha)$ (e.g., 95%) confidence intervals, denoted by $\mathcal{I}_h$, among the $H$ simulations that contain the true $Q$.
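Given the quantities collected over the $H$ simulations, the three metrics for a single estimand could be computed as in the sketch below, which assumes arrays of length $H$ holding the MI point estimates, the complete-data (prototype) estimates, and the lower and upper confidence limits from Rubin's rules; the function name is our own illustration.

```python
# A sketch of the three metrics for one estimand Q, given per-simulation
# quantities; evaluate_estimand is an illustrative helper.
import numpy as np

def evaluate_estimand(Q, q_hat, q_tilde, ci_lo, ci_hi):
    q_hat, q_tilde = np.asarray(q_hat), np.asarray(q_tilde)
    ci_lo, ci_hi = np.asarray(ci_lo), np.asarray(ci_hi)
    asb = abs(q_hat.mean() - Q) / q_hat.std(ddof=1)                  # ASB (3.2)
    rel_mse = np.sum((q_hat - Q) ** 2) / np.sum((q_tilde - Q) ** 2)  # Rel.MSE (3.3)
    coverage = np.mean((ci_lo <= Q) & (Q <= ci_hi))                  # coverage rate
    return asb, rel_mse, coverage
```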
We recommend conducting a large number of simulations to obtain reliable estimates of MSE and coverage. This would not be a problem for deep learning algorithms, which can typically be completed in seconds even with large sample sizes. However, it can be computationally prohibitive for the MICE algorithms when each of the simulated datasets is large, as in some of our simulations. In the situation that one has to rely on only a few or even a single simulation for evaluation, we propose a modified metric of bias. Specifically, for each categorical variable or binned continuous variable $j$, we define the weighted absolute bias (WAB) as the sum of the absolute bias weighted by the true marginal probability in each category:
$$\mathrm{WAB}_{j,h} = \sum_{c=1}^{C_j} Q_{jc} \left|\hat{q}_{jc,h} - Q_{jc}\right|, \tag{3.4}$$
where $C_j$ is the total number of categories, $Q_{jc}$ is the population marginal probability of category $c$ in variable $j$, and $\hat{q}_{jc,h}$ is its corresponding point estimate in simulation $h$. We can also average the weighted absolute bias over a number of repeatedly simulated samples.
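Below is a minimal sketch of the weighted absolute bias for a single variable in one simulated sample, assuming the population marginal probabilities and their MI point estimates are aligned arrays over the $C_j$ categories; the four-category example values are illustrative.

```python
# Weighted absolute bias (3.4) for one categorical or binned continuous
# variable; example values are illustrative.
import numpy as np

def weighted_absolute_bias(Q_j, q_hat_j):
    Q_j, q_hat_j = np.asarray(Q_j), np.asarray(q_hat_j)
    # Absolute biases weighted by the true marginal probability per category
    return float(np.sum(Q_j * np.abs(q_hat_j - Q_j)))

wab = weighted_absolute_bias([0.40, 0.30, 0.20, 0.10],   # population probabilities
                             [0.42, 0.27, 0.21, 0.10])   # MI point estimates
```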
The above procedure and metrics differ from the common practice in the machine learning literature. For example, many machine learning papers on missing data imputation conduct simulations on benchmark datasets, but these data often have vastly different structures and features from survey data and thus are less informative for the goal of this paper. One such dataset is the Breast Cancer dataset in the UCI Machine Learning Repository (Dua and Graff, 2017), which has only 569 sample units and no categorical variables. Also, these simulations are usually based on randomly creating missing values of a single dataset repeatedly rather than on drawing repeated samples from a population, and thus fail to account for the sampling mechanism. Moreover, these evaluations often use metrics focusing on the accuracy of individual predictions rather than distributional features. Specifically, the most commonly used metrics are the root mean squared error (RMSE) and accuracy (Gondara and Wang, 2018; Yoon, Jordon and Schaar, 2018; Lu et al., 2020). Both metrics can be defined in an overall or variable-specific fashion, but the machine learning literature usually focuses on the overall version. Let $r_{ij} = 1$ if the value of variable $j$ for individual $i$ is observed and $r_{ij} = 0$ if it is missing. The overall RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j \in \mathcal{C}} (1 - r_{ij})\,(x_{ij} - \hat{x}_{ij})^2}{\sum_{i=1}^{n}\sum_{j \in \mathcal{C}} (1 - r_{ij})}}, \tag{3.5}$$
where $\mathcal{C}$ is the set of continuous variables, $x_{ij}$ is the value of continuous variable $j$ for individual $i$ in the complete data before introducing missing data, and $\hat{x}_{ij}$ is the corresponding imputed value. For non-missing values (i.e., $r_{ij} = 1$), $\hat{x}_{ij} = x_{ij}$. The (overall) accuracy is defined for categorical variables, namely it is the proportion of the imputed values being equal to the corresponding original "true" value:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n}\sum_{j \in \mathcal{D}} (1 - r_{ij})\,\mathbf{1}(\hat{x}_{ij} = x_{ij})}{\sum_{i=1}^{n}\sum_{j \in \mathcal{D}} (1 - r_{ij})}, \tag{3.6}$$
where $\mathcal{D}$ is the set of categorical variables and $\mathbf{1}(\cdot)$ is the indicator function.
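For completeness, the two overall metrics could be computed as below, assuming the complete data, the imputed data, and a missingness mask are aligned numpy arrays; as in (3.5) and (3.6), only the originally missing cells contribute, and the helper names are illustrative.

```python
# Overall RMSE (3.5) over continuous cells and accuracy (3.6) over categorical
# cells, both restricted to originally missing entries; illustrative helpers.
import numpy as np

def overall_rmse(x_cont, x_cont_imputed, missing_mask):
    diff = (x_cont - x_cont_imputed)[missing_mask]   # only imputed cells
    return float(np.sqrt(np.mean(diff ** 2)))

def overall_accuracy(x_cat, x_cat_imputed, missing_mask):
    # Proportion of imputed categorical values matching the original values
    return float(np.mean(x_cat[missing_mask] == x_cat_imputed[missing_mask]))
```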
A number of caveats are in order for the RMSE and accuracy metrics. First, they are usually computed on a single imputed sample as an overall measure of an imputation method, but this ignores the uncertainty of the imputations. Second, both RMSE and accuracy are single-value summaries and do not capture the multivariate distributional features of the data. Third, RMSE does not adjust for the different scales of the variables and can be easily dominated by a few outliers; also, it is often computed without differentiating between continuous and categorical variables. Lastly, when there are multiple imputed datasets, a common way is to use the mean of the $M$ imputed values as $\hat{x}_{ij}$ in (3.5), but the statistical meaning of the resulting metrics is opaque. This is particularly problematic for categorical variables. For these reasons, we warn against using the overall RMSE and accuracy as the only metrics for comparing imputation methods, and one should exercise caution when interpreting them.