Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 3. Simulation-based evaluation of imputation methods
Methods for missing data imputation are usually evaluated via real-data-based simulations (van Buuren, 2018). Namely, one creates missing values from a complete dataset according to a missing data mechanism (Little and Rubin, 2014), imputes the missing values by a specific method, and then compares these imputed values with the original "true" values based on some metrics.
We first give a quick review of Rubin's MI combination rules. Let $Q$ be the target estimand in the population, and $\hat{q}^{(m)}$ and $\hat{u}^{(m)}$ be the point and variance estimate of $Q$ based on the $m$-th imputed dataset ($m = 1, \ldots, M$), respectively. The MI point estimate of $Q$ is $\bar{q}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{q}^{(m)}$, and the corresponding estimate of the variance is equal to
$$T_M = \bar{u}_M + \left(1 + \frac{1}{M}\right) b_M, \tag{3.1}$$
where $\bar{u}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{u}^{(m)}$ and $b_M = \frac{1}{M-1}\sum_{m=1}^{M} \left(\hat{q}^{(m)} - \bar{q}_M\right)^2$. The confidence interval of $Q$ is constructed using $\bar{q}_M \pm t_{\nu, 1-\alpha/2}\sqrt{T_M}$, where $t_{\nu}$ is a $t$-distribution with $\nu = (M-1)\left\{1 + \bar{u}_M/\left[(1 + 1/M)\,b_M\right]\right\}^2$ degrees of freedom.
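As a concrete illustration, below is a minimal sketch of these combination rules in Python, assuming the $M$ per-dataset point and variance estimates have already been computed and are passed in as arrays; the helper name `rubin_combine` and its interface are our own illustration, not part of any imputation package.

```python
# A minimal sketch of Rubin's combining rules; rubin_combine is an
# illustrative helper, not part of any existing package.
import numpy as np
from scipy import stats

def rubin_combine(q_hat, u_hat, alpha=0.05):
    """Combine M point and variance estimates from M imputed datasets."""
    q_hat = np.asarray(q_hat, dtype=float)   # one point estimate per imputation
    u_hat = np.asarray(u_hat, dtype=float)   # one variance estimate per imputation
    M = len(q_hat)
    q_bar = q_hat.mean()                     # MI point estimate
    u_bar = u_hat.mean()                     # within-imputation variance
    b = q_hat.var(ddof=1)                    # between-imputation variance
    T = u_bar + (1.0 + 1.0 / M) * b          # total variance, as in (3.1)
    # Degrees of freedom of the t reference distribution
    nu = (M - 1) * (1.0 + u_bar / ((1.0 + 1.0 / M) * b)) ** 2
    half_width = stats.t.ppf(1.0 - alpha / 2.0, df=nu) * np.sqrt(T)
    return q_bar, T, (q_bar - half_width, q_bar + half_width)
```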
The first step in our simulation-based evaluation procedure is choosing a dataset with all values observed, which is taken as the "population". We then choose a set of target estimands and compute their values from this population data, which are taken as the "ground truth". The estimands are usually summary statistics of the variables or parameters in a downstream analysis model, e.g., a coefficient in a regression model (Tang, Song, Belin and Unützer, 2005; Huque, Carlin, Simpson and Lee, 2018). Second, we randomly draw without replacement $H$ samples of size $n$ from the population data, and in each of the $H$ samples create missing data according to a specific missing data mechanism and a pre-fixed proportion of missingness. Third, for each simulated sample with missing data, we create $M$ imputed datasets using the imputation method under consideration and construct the point and interval estimate of each estimand using Rubin's rules. Lastly, we compute performance metrics of each estimand from the quantities obtained in the previous step.
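To make the four steps concrete, the sketch below runs the procedure for a single estimand (the marginal probability of a binary variable), using a simple random draw from the observed values as a stand-in for the imputation method under consideration; the population, sample size, number of simulations, and missingness proportion are all illustrative choices, and `rubin_combine` refers to the sketch above.

```python
# A schematic of the evaluation loop for one estimand; the imputation step is
# a simple random draw from observed values, standing in for the method under
# study.
import numpy as np

rng = np.random.default_rng(0)
population = rng.binomial(1, 0.3, size=1_000_000).astype(float)
Q = population.mean()                      # step 1: ground-truth estimand
H, n, M, miss_prop = 100, 1_000, 10, 0.3   # illustrative settings

estimates = []
for h in range(H):
    # Step 2: draw a sample without replacement and create MCAR missingness.
    sample = rng.choice(population, size=n, replace=False)
    missing = rng.random(n) < miss_prop
    # Step 3: create M imputed datasets and collect point/variance estimates.
    q_hat, u_hat = [], []
    for m in range(M):
        imputed = sample.copy()
        imputed[missing] = rng.choice(sample[~missing], size=missing.sum())
        q_hat.append(imputed.mean())
        u_hat.append(imputed.var(ddof=1) / n)
    estimates.append(rubin_combine(q_hat, u_hat))  # Rubin's rules (see above)
# Step 4: compute performance metrics from `estimates` (see the metrics below).
```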
In the empirical application, we select a large complete subsample from the American Community Survey (ACS), a national survey that bears the hallmarks of many big survey datasets, as our population. Since discrete variables are prevalent in the ACS, as well as in most survey data, we focus on the marginal probabilities of binary and categorical variables; e.g., a categorical variable with $C$ categories has $C$ estimands. To evaluate how well the imputation methods preserve the multivariate distributional properties, similar to Akande et al. (2017), we also consider the bivariate probabilities of all two-way combinations of categories in binary and categorical variables. Another useful metric is the finite-sample pairwise correlations between continuous variables. For continuous variables, the common estimands are the mean, median or variance. To facilitate meaningful comparisons of the results between the categorical and continuous variables, we propose to discretize each continuous variable into a fixed number of categories based on the sample quantiles. We then evaluate these binned continuous variables as categorical variables based on the aforementioned estimands of marginal and bivariate probabilities.
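For instance, the binning can be done with quantile cuts; the sketch below uses pandas, where the choice of five bins and the toy values are purely illustrative.

```python
# Discretize a continuous variable into quantile-based categories; the choice
# of five bins is illustrative.
import pandas as pd

income = pd.Series([12.0, 35.5, 48.2, 20.1, 90.0, 55.3, 27.8, 64.4])
binned = pd.qcut(income, q=5, labels=False, duplicates="drop")
# `binned` can now be evaluated like any other categorical variable, via its
# marginal and bivariate probabilities.
```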
For each estimand $Q$, we consider three metrics. The first metric focuses on bias. To accommodate close-to-zero estimands that are prevalent in probabilities of categorical variables, we consider the absolute standardized bias (ASB) of each estimand:
$$\mathrm{ASB} = \frac{\left|\frac{1}{H}\sum_{h=1}^{H} \hat{q}_h - Q\right|}{\mathrm{sd}(\hat{q}_1, \ldots, \hat{q}_H)}, \tag{3.2}$$
where $\hat{q}_h$ is the MI point estimate of $Q$ in simulation $h$, and $\mathrm{sd}(\cdot)$ denotes the sample standard deviation across the $H$ simulations. The second metric is the relative mean squared error (Rel.MSE), which is the ratio between the MSE of estimating $Q$ from the imputed data and that from the sampled data before introducing the missing data:
$$\mathrm{Rel.MSE} = \frac{\sum_{h=1}^{H} (\hat{q}_h - Q)^2}{\sum_{h=1}^{H} (\tilde{q}_h - Q)^2}, \tag{3.3}$$
where $\hat{q}_h$ is defined earlier, and $\tilde{q}_h$ is the prototype estimator of $Q$, i.e., the point estimate from the complete sampled data in simulation $h$. The third metric is the coverage rate, which is the proportion of the $(1-\alpha)$ (e.g., 95%) confidence intervals, denoted by $\mathcal{I}_h$, among the $H$ simulations that contain the true $Q$.
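Given the quantities collected over the $H$ simulations, the three metrics for a single estimand could be computed as in the sketch below, which assumes arrays of length $H$ holding the MI point estimates, the complete-data (prototype) estimates, and the lower and upper confidence limits from Rubin's rules; the function name is our own illustration.

```python
# A sketch of the three metrics for one estimand Q, given per-simulation
# quantities; evaluate_estimand is an illustrative helper.
import numpy as np

def evaluate_estimand(Q, q_hat, q_tilde, ci_lo, ci_hi):
    q_hat, q_tilde = np.asarray(q_hat), np.asarray(q_tilde)
    ci_lo, ci_hi = np.asarray(ci_lo), np.asarray(ci_hi)
    asb = abs(q_hat.mean() - Q) / q_hat.std(ddof=1)                  # ASB (3.2)
    rel_mse = np.sum((q_hat - Q) ** 2) / np.sum((q_tilde - Q) ** 2)  # Rel.MSE (3.3)
    coverage = np.mean((ci_lo <= Q) & (Q <= ci_hi))                  # coverage rate
    return asb, rel_mse, coverage
```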
We recommend conducting a large number of simulations to obtain reliable estimates of MSE and coverage. This would not be a problem for deep learning algorithms, which can typically be completed in seconds even with large sample sizes. However, it can be computationally prohibitive for the MICE algorithms when each of the simulated datasets is large, as in some of our simulations. In the situation that one has to rely on only a few or even a single simulation for evaluation, we propose a modified metric of bias. Specifically, for each categorical variable or binned continuous variable $j$, we define the weighted absolute bias (WAB) as the sum of the absolute bias weighted by the true marginal probability in each category:
$$\mathrm{WAB}_{j,h} = \sum_{c=1}^{C_j} Q_{jc} \left|\hat{q}_{jc,h} - Q_{jc}\right|, \tag{3.4}$$
where $C_j$ is the total number of categories, $Q_{jc}$ is the population marginal probability of category $c$ in variable $j$, and $\hat{q}_{jc,h}$ is its corresponding point estimate in simulation $h$. We can also average the weighted absolute bias over a number of repeatedly simulated samples.
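Below is a minimal sketch of the weighted absolute bias for a single variable in one simulated sample, assuming the population marginal probabilities and their MI point estimates are aligned arrays over the $C_j$ categories; the four-category example values are illustrative.

```python
# Weighted absolute bias (3.4) for one categorical or binned continuous
# variable; example values are illustrative.
import numpy as np

def weighted_absolute_bias(Q_j, q_hat_j):
    Q_j, q_hat_j = np.asarray(Q_j), np.asarray(q_hat_j)
    # Absolute biases weighted by the true marginal probability per category
    return float(np.sum(Q_j * np.abs(q_hat_j - Q_j)))

wab = weighted_absolute_bias([0.40, 0.30, 0.20, 0.10],   # population probabilities
                             [0.42, 0.27, 0.21, 0.10])   # MI point estimates
```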
The above procedure and metrics differ from the common practice in the machine learning literature. For example, many machine learning papers on missing data imputation conduct simulations on benchmark datasets, but these data often have vastly different structures and features from survey data and thus are less informative for the goal of this paper. One such dataset is the Breast Cancer dataset in the UCI Machine Learning Repository (Dua and Graff, 2017), which has only 569 sample units and no categorical variables. Also, these simulations are usually based on randomly creating missing values of a single dataset repeatedly rather than on drawing repeated samples from a population, and thus fail to account for the sampling mechanism. Moreover, these evaluations often use metrics focusing on the accuracy of individual predictions rather than distributional features. Specifically, the most commonly used metrics are the root mean squared error (RMSE) and accuracy (Gondara and Wang, 2018; Yoon, Jordon and Schaar, 2018; Lu et al., 2020). Both metrics can be defined in an overall or variable-specific fashion, but the machine learning literature usually focuses on the overall version. Let $r_{ij} = 1$ if the value of variable $j$ for individual $i$ is observed and $r_{ij} = 0$ if it is missing. The overall RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j \in \mathcal{C}} (1 - r_{ij})\,(x_{ij} - \hat{x}_{ij})^2}{\sum_{i=1}^{n}\sum_{j \in \mathcal{C}} (1 - r_{ij})}}, \tag{3.5}$$
where $\mathcal{C}$ is the set of continuous variables, $x_{ij}$ is the value of continuous variable $j$ for individual $i$ in the complete data before introducing missing data, and $\hat{x}_{ij}$ is the corresponding imputed value. For non-missing values (i.e., $r_{ij} = 1$), $\hat{x}_{ij} = x_{ij}$. The (overall) accuracy is defined for categorical variables, namely it is the proportion of the imputed values being equal to the corresponding original "true" value:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n}\sum_{j \in \mathcal{D}} (1 - r_{ij})\,\mathbf{1}(\hat{x}_{ij} = x_{ij})}{\sum_{i=1}^{n}\sum_{j \in \mathcal{D}} (1 - r_{ij})}, \tag{3.6}$$
where $\mathcal{D}$ is the set of categorical variables and $\mathbf{1}(\cdot)$ is the indicator function.
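For completeness, the two overall metrics could be computed as below, assuming the complete data, the imputed data, and a missingness mask are aligned numpy arrays; as in (3.5) and (3.6), only the originally missing cells contribute, and the helper names are illustrative.

```python
# Overall RMSE (3.5) over continuous cells and accuracy (3.6) over categorical
# cells, both restricted to originally missing entries; illustrative helpers.
import numpy as np

def overall_rmse(x_cont, x_cont_imputed, missing_mask):
    diff = (x_cont - x_cont_imputed)[missing_mask]   # only imputed cells
    return float(np.sqrt(np.mean(diff ** 2)))

def overall_accuracy(x_cat, x_cat_imputed, missing_mask):
    # Proportion of imputed categorical values matching the original values
    return float(np.mean(x_cat[missing_mask] == x_cat_imputed[missing_mask]))
```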
A number of caveats are in order for the RMSE and accuracy metrics. First, they are usually computed on a single imputed sample as an overall measure of an imputation method, but this ignores the uncertainty of the imputations. Second, both RMSE and accuracy are single-value summaries and do not capture the multivariate distributional features of the data. Third, RMSE does not adjust for the different scales of the variables and can be easily dominated by a few outliers; also, it is often computed without differentiating between continuous and categorical variables. Lastly, when there are multiple imputed datasets, a common way is to use the mean of the $M$ imputed values as $\hat{x}_{ij}$ in (3.5), but the statistical meaning of the resulting metrics is opaque. This is particularly problematic for categorical variables. For these reasons, we warn against using the overall RMSE and accuracy as the only metrics for comparing imputation methods, and one should exercise caution when interpreting them.