Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 4. Evaluation based on ACS
In this section, we evaluate the four imputation methods described in Section 2 following the procedure and metrics described in Section 3. For simplicity, in the following discussions we use CART and RF to denote MICE-CART and MICE-RF, respectively.
4.1 The “population” data
We use the one-year Public Use Microdata Sample from the 2018 ACS to construct our population. The 2018 ACS data contains both household-level variables ‒ for example, whether a house is owned or rented ‒ and individual-level variables ‒ for example, the age, income and sex of the individuals within each household. Since individuals nested within a household are often dependent, and the imputation methods we evaluate generally assume independence across all observations, we set our unit of observation at the household level, where independence is more likely to hold. We first remove units corresponding to vacant houses. Next, we delete units with any missing values, so that we keep only the complete cases. Within each household, we also retain individual-level data corresponding only to the household head and merge them with the household-level variables, resulting in a rich set of variables with potentially complex joint relationships.
It is often challenging to generate plausible imputations for ordinal variables with many levels when there is very low mass at the highest levels, as is the case for some variables in the ACS data. Following Li, Baccini, Mealli, Zell, Frangakis and Rubin (2014), we treat ordinal variables with more than 10 levels as continuous variables. We also follow the approach in Akande et al. (2017) and exclude binary variables with extreme marginal probabilities; this eliminates estimands for which the central limit theorem is not likely to hold. For each categorical variable with more than two but fewer than 10 levels where this might also be a problem, we merge the levels with a small number of observations in the population data. For example, for the household language variable, we recode the levels from five to three (English, Spanish, and other), because the probability of speaking neither English nor Spanish in the full population is less than 8.8%.
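The rare-level merging described above can be sketched as follows. This is an illustration only: the level names, counts, and the frequency cutoff are made up, not the paper's actual preprocessing code.

```python
import pandas as pd

# Toy stand-in for a categorical survey variable; the level names, counts,
# and the 9% cutoff below are illustrative assumptions.
s = pd.Series(["english"] * 70 + ["spanish"] * 21 + ["french"] * 5
              + ["chinese"] * 3 + ["sign"] * 1)
freq = s.value_counts(normalize=True)          # marginal probabilities
rare = set(freq[freq < 0.09].index)            # levels with low mass
merged = s.where(~s.isin(rare), "other")       # collapse them into "other"
print(merged.value_counts().to_dict())
# {'english': 70, 'spanish': 21, 'other': 9}
```

The same idea applies to any low-mass category: levels below a chosen probability threshold are pooled so that every retained level has enough observations to support plausible imputations.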
The final population data contains 1,257,501 units, with 18 binary variables, 20 categorical variables with 3 to 9 levels, and 8 continuous variables. We describe the variables in more detail in the supplementary material. We compute the population values of the estimands described in Section 3, including all marginal and bivariate probabilities of discrete and binned continuous variables. We vary the size of the simulated samples from 10,000 to 100,000, and simulate missing data according to either missing completely at random (MCAR) or missing at random (MAR) mechanisms in each of these scenarios.
4.2 Simulations with n = 10,000
We first randomly draw 100 samples of size 10,000, and set 30% of each sample to be missing under either MCAR or MAR. CART and RF take around 2.8 and 9.2 hours, respectively, to create 10 imputed datasets with default parameters on a standard desktop computer with a single central processing unit (CPU). The deep learning methods are much faster because they leverage GPU computing power when implemented on the GPU-enabled TensorFlow software framework (Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mané, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viégas, Vinyals, Warden, Wattenberg, Wicke, Yu and Zheng, 2015). GAIN takes roughly 1.5 minutes and MIDA roughly 4 minutes to create 10 completed datasets using a GeForce GTX 1660 Ti GPU. Note that it is infeasible to manually tune the hyperparameters of the deep learning models in each of the 100 simulations in each scenario. Therefore, for each scenario, we randomly select one simulation, tune the hyperparameters using the procedure described in Section 2, and then apply the selected hyperparameters to all simulations.
4.2.1 MCAR scenario
To create the MCAR scenario, we randomly set 30% of the values of each variable to be missing independently. Table 4.1 displays the distributions of the estimated ASB and relative MSE of all the marginal and bivariate probabilities in the imputed data by the four imputation methods.
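The MCAR mechanism above is simple to express in code. The helper below is a sketch (not the paper's simulation code) that sets 30% of each column to missing, independently across columns:

```python
import numpy as np

def ampute_mcar(X, prop=0.3, seed=None):
    """Set `prop` of each column's entries to NaN, independently (MCAR)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n = X.shape[0]
    for j in range(X.shape[1]):
        miss = rng.choice(n, size=int(prop * n), replace=False)
        X[miss, j] = np.nan
    return X

X = np.arange(50.0).reshape(10, 5)
X_mcar = ampute_mcar(X, prop=0.3, seed=0)
print(np.isnan(X_mcar).mean(axis=0))  # 0.3 missing rate in every column
```

Because the missingness indicators depend on nothing in the data, any subsequent analysis of the complete cases would remain unbiased under this mechanism; the interest here is in how well each method reconstructs the deleted values.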
Overall, for the estimands of marginal and bivariate probabilities of the categorical and binned continuous variables, MICE with CART significantly outperforms the other three methods, consistently yielding the smallest ASB and relative MSE. RF is the second best, also consistently outperforming the deep learning methods. The advantage of the MICE algorithms is particularly pronounced in the upper (e.g., 75% and 90%) quantiles, indicating that GAIN and MIDA imputations have large variations over repeated samples and variables. Indeed, MIDA and GAIN lead to extremely long tails in estimating the summary statistics of the variables. For example, for bivariate probabilities of binned continuous variables, the 90% percentile of the ASB from MIDA and GAIN is approximately 20 and 27 times, respectively, that from CART. The discrepancy is even bigger for relative MSE. There is no consistent pattern when comparing MIDA and GAIN. Specifically, for continuous variables, MIDA generally outperforms GAIN, but the difference is small except at the upper percentiles, where GAIN tends to produce very large bias and relative MSE. For categorical variables, GAIN outperforms MIDA half of the time, but again leads to the largest variation in imputations across the variables. Moreover, an interesting and somewhat surprising observation is that MICE with CART consistently outperforms RF ‒ sometimes by a large margin ‒ regardless of the choice of estimand or metric.
| Metric | Variables | Quantile | CART | RF | GAIN | MIDA | CART | RF | GAIN | MIDA |
|---|---|---|---|---|---|---|---|---|---|---|
| ASB | Cat. | 10% | 0.05 | 0.47 | 0.76 | 0.98 | 0.15 | 1.14 | 1.21 | 1.54 |
| ASB | Cat. | 25% | 0.13 | 1.25 | 1.48 | 2.22 | 0.40 | 2.83 | 3.08 | 3.93 |
| ASB | Cat. | 50% | 0.27 | 2.80 | 3.22 | 4.69 | 1.05 | 6.74 | 7.14 | 8.47 |
| ASB | Cat. | 75% | 0.64 | 5.86 | 7.18 | 8.86 | 2.51 | 13.59 | 17.03 | 15.23 |
| ASB | Cat. | 90% | 1.14 | 10.01 | 19.55 | 14.41 | 5.34 | 22.33 | 26.92 | 21.90 |
| ASB | B.Cont. | 10% | 0.06 | 0.24 | 7.25 | 2.73 | 0.19 | 1.30 | 6.05 | 4.80 |
| ASB | B.Cont. | 25% | 0.10 | 1.05 | 12.86 | 8.36 | 0.43 | 3.24 | 17.61 | 12.01 |
| ASB | B.Cont. | 50% | 0.21 | 3.59 | 27.30 | 18.51 | 1.02 | 6.61 | 34.29 | 24.07 |
| ASB | B.Cont. | 75% | 0.43 | 5.43 | 30.21 | 26.84 | 1.90 | 11.76 | 49.38 | 39.54 |
| ASB | B.Cont. | 90% | 0.81 | 8.49 | 46.41 | 31.36 | 3.42 | 20.79 | 90.90 | 64.65 |
| Rel.MSE | Cat. | 10% | 1.05 | 1.67 | 2.50 | 3.38 | 0.96 | 1.11 | 2.75 | 2.98 |
| Rel.MSE | Cat. | 25% | 1.16 | 2.40 | 4.97 | 9.03 | 1.08 | 1.61 | 4.33 | 4.75 |
| Rel.MSE | Cat. | 50% | 1.37 | 5.99 | 10.37 | 14.89 | 1.25 | 3.35 | 7.40 | 8.16 |
| Rel.MSE | Cat. | 75% | 1.49 | 10.25 | 27.73 | 26.16 | 1.48 | 9.07 | 14.87 | 15.80 |
| Rel.MSE | Cat. | 90% | 1.62 | 16.22 | 97.33 | 40.16 | 1.89 | 23.91 | 36.37 | 27.92 |
| Rel.MSE | B.Cont. | 10% | 1.19 | 1.50 | 44.06 | 4.35 | 0.82 | 0.86 | 7.40 | 2.05 |
| Rel.MSE | B.Cont. | 25% | 1.30 | 1.77 | 74.42 | 13.82 | 0.92 | 1.11 | 14.80 | 4.90 |
| Rel.MSE | B.Cont. | 50% | 1.44 | 3.31 | 139.24 | 72.57 | 1.07 | 1.90 | 32.26 | 13.76 |
| Rel.MSE | B.Cont. | 75% | 1.55 | 6.71 | 284.00 | 150.35 | 1.26 | 4.09 | 88.78 | 47.56 |
| Rel.MSE | B.Cont. | 90% | 1.64 | 19.69 | 603.38 | 451.44 | 1.54 | 10.80 | 282.29 | 127.15 |

The first block of CART‒MIDA columns reports marginal probabilities; the second block reports bivariate probabilities. “Cat.” means categorical variables and “B.Cont.” means binned continuous variables. CART = Classification and regression trees; RF = Random forests; GAIN = Generative adversarial imputation network; MIDA = Multiple imputation using denoising autoencoders.
All methods generally yield less biased estimates (i.e., smaller ASB) of the marginal probabilities than of the bivariate probabilities. This illustrates that preserving multivariate distributional features is more challenging than preserving univariate ones. The advantage of CART over the other methods is comparatively larger when estimating bivariate estimands than univariate ones. Interestingly, the relative MSE tends to be higher for the marginal probabilities than for the bivariate probabilities. This is likely because the denominator in the definition of relative MSE in (3.2) is the MSE from the sampled data before introducing missing data, which tends to be smaller for marginal probabilities than for bivariate probabilities. CART yields MSEs that are very close to the corresponding MSEs from the sampled data before introducing missing data; i.e., its relative MSE is close to 1. In contrast, both deep learning methods, and GAIN in particular, can result in exceedingly large relative MSE for many estimands.
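The two metrics discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: ASB is read here as the absolute bias of the Monte Carlo average estimate in percentage points, and relative MSE divides the imputed-data MSE by the complete-sample MSE, matching the denominator described for (3.2); the paper's exact formulas in Section 3 are not quoted here, and the numbers are simulated.

```python
import numpy as np

def asb(est, truth):
    """Absolute bias of the average estimate over simulations,
    in percentage points (one plausible reading of ASB)."""
    return 100 * abs(np.mean(est) - truth)

def relative_mse(est_imputed, est_complete, truth):
    """MSE from imputed-data estimates divided by the MSE from the
    sampled data before introducing missing values."""
    mse_imp = np.mean((np.asarray(est_imputed) - truth) ** 2)
    mse_full = np.mean((np.asarray(est_complete) - truth) ** 2)
    return mse_imp / mse_full

rng = np.random.default_rng(1)
truth = 0.25                                        # population probability
complete = truth + rng.normal(0, 0.004, size=100)   # sampling error only
imputed = truth + 0.01 + rng.normal(0, 0.006, 100)  # bias plus extra noise
print(asb(imputed, truth), relative_mse(imputed, complete, truth))
```

An unbiased, efficient imputation method would keep ASB near zero and relative MSE near 1, which is the behavior the tables show for CART.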
Figure 4.1 displays the estimated coverage rates of the 95% confidence intervals for the marginal and bivariate probabilities. The patterns in coverage across the methods are similar to those for bias and MSE. Specifically, CART tends to result in coverage rates close to the nominal 95% level, with the median consistently around 95% and a tight interquartile range. In contrast, RF, GAIN and MIDA all result in coverage rates that are much farther from the nominal 95% level. For example, the median coverage rates under both GAIN and MIDA are all under 0.60, and are even less than 0.30 for continuous variables. A closer look at the prediction accuracy of each variable reveals that GAIN and MIDA tend to generate imputations that are biased toward the most frequent levels, and GAIN in particular generally produces narrower intervals than the other methods. This once again provides evidence of substantial bias under the deep learning methods. All methods tend to result in higher median coverage rates for the bivariate probabilities than for the marginal probabilities, although the left tails are generally longer for the former than the latter.

Description of Figure 4.1
Coverage rates of the 95% confidence intervals for all marginal (left panels) and bivariate (right panels) probabilities, for categorical (upper panels) and binned continuous (lower panels) variables, obtained from the four imputation methods (CART, RF, GAIN, and MIDA) in the 100 simulations with a sample size of 10,000 and 30% of values MCAR. The red dashed line marks the nominal 95% coverage level. CART tends to result in coverage rates close to the nominal level, while RF, GAIN and MIDA all result in coverage rates much farther from it. All methods tend to result in higher median coverage rates for the bivariate probabilities than for the marginal probabilities, although the left tails are generally longer for the former than the latter.
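The coverage rates above can be sketched as follows, assuming the intervals are formed by Rubin's combining rules for multiple imputation (the standard MI procedure; the paper's exact interval construction from Section 3 is not quoted here). For brevity this sketch uses a normal critical value rather than Rubin's t reference distribution, and the simulated estimates are made up.

```python
import numpy as np

def rubin_interval(q, u, z=1.96):
    """Combine m completed-data estimates q and their variances u with
    Rubin's rules; normal critical value used instead of the t quantile."""
    q, u = np.asarray(q), np.asarray(u)
    m = len(q)
    qbar = q.mean()                      # MI point estimate
    b = q.var(ddof=1)                    # between-imputation variance
    t = u.mean() + (1 + 1 / m) * b       # total variance
    half = z * np.sqrt(t)
    return qbar - half, qbar + half

def coverage(intervals, truth):
    """Share of simulation intervals containing the population value."""
    return float(np.mean([lo <= truth <= hi for lo, hi in intervals]))

rng = np.random.default_rng(2)
truth = 0.4
intervals = []
for _ in range(200):                     # repeated samples (paper uses 100)
    q = truth + rng.normal(0, 0.01, size=10)   # m = 10 imputed estimates
    u = np.full(10, 0.01 ** 2)                 # within-imputation variances
    intervals.append(rubin_interval(q, u))
print(coverage(intervals, truth))
```

With unbiased imputations the empirical coverage stays near (or, as here, above) the nominal level; the systematic bias of GAIN and MIDA, combined with GAIN's narrow intervals, is what drives their coverage far below 95%.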
4.2.2 MAR scenario
We also consider a MAR scenario, which is more plausible than MCAR in practice. We set six variables ‒ age, gender, marital status, race, educational attainment and class of worker ‒ to be fully observed. It would be cumbersome to specify a different MAR mechanism for each of the remaining 40 variables, so we randomly divide them into three groups of 10, 15, and 15 variables. We then specify a separate nonresponse model to generate the missing data for the variables in each group. Specifically, we postulate a logistic model per group, conditional on the six fully observed variables, from which we generate binary missingness indicators for each variable in that group. This process results in a missing rate of approximately 30% for each of the 40 variables. We describe the models in more detail in the supplementary material.
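A logistic nonresponse model of this kind can be sketched as follows. The covariates, coefficients, and intercept here are made up for illustration (they are not the paper's models, which are in the supplementary material); the intercept is tuned so the overall missing rate lands near 30%.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(3)
n = 10_000
age = rng.normal(45, 15, size=n)        # fully observed covariates
female = rng.integers(0, 2, size=n)     # (illustrative coding, not ACS's)

# Assumed logistic model for the missingness indicator of one variable
# in a group: P(missing) depends only on fully observed covariates (MAR).
logit = -0.85 + 0.01 * (age - 45) + 0.3 * (female - 0.5)
miss = rng.random(n) < sigmoid(logit)
print(round(miss.mean(), 2))  # close to 0.3
```

Because the missingness probability depends only on fully observed variables, the mechanism is MAR by construction; units with different ages and genders have systematically different nonresponse rates, which is what makes this scenario harder than MCAR for the imputation models.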
Table 4.2 displays the distributions of the ASB and relative MSE of all the marginal and bivariate probabilities from the four methods. All methods yield larger ASB and relative MSE under the MAR scenario than under the previous MCAR scenario. This is expected because MAR is a more challenging mechanism than MCAR, requiring the imputation models to condition on more information. Nonetheless, the overall patterns of relative performance between the methods remain the same as under MCAR. Specifically, CART once again produces estimates with the smallest ASB and relative MSE ‒ by an even larger margin than under MCAR ‒ among the four methods, followed by RF, and then MIDA and GAIN. One notable observation is the deteriorating performance of the deep learning methods, particularly GAIN, in imputing continuous variables, sometimes resulting in relative MSE several hundred times that of CART. This indicates the large uncertainty associated with GAIN in imputing continuous variables.
| Metric | Variables | Quantile | CART | RF | GAIN | MIDA | CART | RF | GAIN | MIDA |
|---|---|---|---|---|---|---|---|---|---|---|
| ASB | Cat. | 10% | 0.05 | 0.13 | 0.15 | 0.14 | 0.15 | 0.71 | 0.76 | 0.89 |
| ASB | Cat. | 25% | 0.11 | 0.44 | 0.62 | 0.61 | 0.40 | 2.23 | 2.55 | 3.20 |
| ASB | Cat. | 50% | 0.29 | 2.13 | 3.05 | 4.55 | 1.08 | 6.06 | 6.85 | 8.14 |
| ASB | Cat. | 75% | 1.04 | 4.98 | 6.63 | 10.22 | 2.49 | 13.43 | 16.78 | 16.19 |
| ASB | Cat. | 90% | 1.80 | 10.49 | 18.91 | 17.00 | 5.68 | 24.06 | 28.04 | 25.36 |
| ASB | B.Cont. | 10% | 0.07 | 0.29 | 0.33 | 0.33 | 0.27 | 1.17 | 10.87 | 6.18 |
| ASB | B.Cont. | 25% | 0.17 | 1.07 | 9.64 | 3.13 | 0.69 | 3.49 | 23.67 | 16.26 |
| ASB | B.Cont. | 50% | 0.67 | 3.14 | 32.86 | 23.85 | 1.58 | 7.83 | 38.52 | 31.17 |
| ASB | B.Cont. | 75% | 1.20 | 6.95 | 39.57 | 36.09 | 3.40 | 15.20 | 53.59 | 47.34 |
| ASB | B.Cont. | 90% | 3.40 | 12.39 | 63.45 | 41.99 | 5.94 | 25.16 | 97.47 | 85.44 |
| Rel.MSE | Cat. | 10% | 1.00 | 1.00 | 1.00 | 1.00 | 0.97 | 1.00 | 1.53 | 1.93 |
| Rel.MSE | Cat. | 25% | 1.08 | 1.82 | 2.56 | 4.75 | 1.04 | 1.39 | 3.78 | 4.03 |
| Rel.MSE | Cat. | 50% | 1.33 | 4.33 | 19.03 | 15.13 | 1.25 | 3.00 | 10.42 | 8.38 |
| Rel.MSE | Cat. | 75% | 1.72 | 13.08 | 55.07 | 33.36 | 1.59 | 9.56 | 27.45 | 16.95 |
| Rel.MSE | Cat. | 90% | 2.27 | 18.70 | 101.91 | 48.44 | 2.23 | 27.44 | 64.01 | 32.85 |
| Rel.MSE | B.Cont. | 10% | 1.00 | 1.00 | 1.00 | 1.00 | 0.88 | 0.90 | 11.19 | 2.96 |
| Rel.MSE | B.Cont. | 25% | 1.38 | 1.83 | 90.98 | 8.49 | 1.00 | 1.16 | 20.15 | 6.87 |
| Rel.MSE | B.Cont. | 50% | 1.70 | 4.57 | 207.58 | 96.08 | 1.18 | 2.29 | 45.25 | 21.33 |
| Rel.MSE | B.Cont. | 75% | 2.12 | 11.47 | 692.67 | 239.69 | 1.50 | 6.95 | 125.39 | 70.90 |
| Rel.MSE | B.Cont. | 90% | 3.12 | 50.56 | 1342.23 | 806.43 | 2.12 | 18.07 | 459.78 | 205.14 |

The first block of CART‒MIDA columns reports marginal probabilities; the second block reports bivariate probabilities. “Cat.” means categorical variables and “B.Cont.” means binned continuous variables. CART = Classification and regression trees; RF = Random forests; GAIN = Generative adversarial imputation network; MIDA = Multiple imputation using denoising autoencoders.
Figure 4.2 displays the estimated coverage rates of the 95% confidence intervals for the marginal and bivariate probabilities under each method. As with bias and MSE, all methods generally result in lower coverage rates under MAR than under MCAR, with visibly longer left tails in some cases, but the overall patterns comparing the methods remain the same. Specifically, CART still tends to result in coverage rates above 90%, while the other three methods have consistently lower coverage rates. In particular, both GAIN and MIDA result in extremely low median coverage rates ‒ below 7% ‒ for continuous variables. This is closely related to the earlier observation of the large uncertainty of the deep learning methods in imputing continuous variables.

Description of Figure 4.2
Coverage rates of the 95% confidence intervals for all marginal (left panels) and bivariate (right panels) probabilities, for categorical (upper panels) and binned continuous (lower panels) variables, obtained from the four imputation methods (CART, RF, GAIN, and MIDA) in the 100 simulations with a sample size of 10,000 and 30% of values MAR. The red dashed line marks the nominal 95% coverage level. All methods generally result in lower coverage rates under MAR than under MCAR. CART still tends to result in coverage rates above 90%, while the other three methods have consistently lower coverage rates.
Finally, to illustrate that evaluating only the overall RMSE and accuracy metrics may be misleading, we display the means and empirical standard errors of the overall RMSE and accuracy over the 100 simulations in Table 4.3, with MCAR in the top panel and MAR in the bottom panel. Under both missing data mechanisms, for the continuous variables, MIDA leads to the smallest overall RMSE, followed by CART, with RF and GAIN last. For the categorical variables, CART and GAIN lead to the highest overall accuracy, with MIDA close behind and RF last. These patterns, not surprisingly, differ from those reported earlier based on marginal and bivariate probabilities and different metrics. As discussed in Section 3, overall RMSE and accuracy do not capture the distributional features of multivariate data or the repeated sampling properties of the imputation methods.
| Mechanism | Metric | CART | RF | GAIN | MIDA |
|---|---|---|---|---|---|
| MCAR | RMSE | 0.128 (0.002) | 0.159 (0.003) | 0.161 (0.008) | 0.112 (0.002) |
| MCAR | Accuracy | 0.785 (0.001) | 0.658 (0.003) | 0.782 (0.002) | 0.752 (0.004) |
| MAR | RMSE | 0.130 (0.003) | 0.154 (0.004) | 0.145 (0.009) | 0.110 (0.002) |
| MAR | Accuracy | 0.819 (0.001) | 0.704 (0.003) | 0.820 (0.002) | 0.780 (0.007) |

Empirical standard errors are in parentheses. The top panel is under MCAR and the bottom panel under MAR, both with 30% missing data. CART = Classification and regression trees; RF = Random forests; GAIN = Generative adversarial imputation network; MIDA = Multiple imputation using denoising autoencoders.
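The two aggregate metrics in Table 4.3 can be sketched as follows; this is a common way of computing them (RMSE over imputed continuous entries, proportion of matches over imputed categorical entries), with any scaling conventions from Section 3 assumed rather than quoted.

```python
import numpy as np

def overall_rmse(true_vals, imputed_vals):
    """RMSE over the continuous entries that were imputed."""
    d = np.asarray(imputed_vals) - np.asarray(true_vals)
    return float(np.sqrt(np.mean(d ** 2)))

def overall_accuracy(true_labels, imputed_labels):
    """Share of imputed categorical entries matching the held-out truth."""
    return float(np.mean(np.asarray(true_labels) == np.asarray(imputed_labels)))

# Toy held-out values and their imputations (made up for illustration).
print(overall_rmse([0.2, 0.5, 0.9], [0.25, 0.45, 0.9]))
print(overall_accuracy(["a", "b", "b", "c"], ["a", "b", "c", "c"]))  # 0.75
```

Both metrics reward reproducing individual entries, so a method that imputes every entry at the conditional mean or mode can score well while badly distorting marginal and joint distributions ‒ exactly the failure mode the distributional metrics above expose.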
4.3 Simulations with n = 100,000 and 30% MCAR
Deep learning models usually require a large sample size to train. Therefore, to give MIDA and GAIN a more favorable setting, as well as to investigate the sensitivity of our results to the sample size, we generate a simulation scenario of 10 samples of size 100,000 under MCAR; that is, we randomly and independently set 30% of the values of each variable to be missing. We generate only 10 simulations here because of the huge computational cost of MICE for samples of this size. In this scenario, we omit RF because the results in Section 4.2 have shown that RF is consistently inferior to CART in both performance and computation. We use CART, GAIN, and MIDA to create 10 completed datasets.
Because reliably calculating MSE and coverage usually requires a much larger number of simulations, here we focus on the weighted absolute bias metric (3.4). Table 4.4 displays the distributions of the estimated weighted absolute bias, averaged over the 10 simulations, of the marginal probabilities of the categorical and binned continuous variables. Overall, the patterns comparing the methods remain consistent with those observed in Section 4.2. Specifically, CART again results in the smallest weighted absolute bias for both categorical and continuous variables, and its advantage is particularly pronounced for continuous variables. For example, for categorical variables, MIDA and GAIN result in median weighted absolute biases at least 9 and 11 times larger, respectively, than that of CART. The advantage of CART grows to about 30 and 60 times over MIDA and GAIN, respectively, for continuous variables. Moreover, CART performs robustly across variables, as evident from the small variation in the weighted absolute bias, e.g., 0.07 at the 10% percentile and 0.33 at the 90% percentile among the categorical variables. In contrast, both deep learning models result in much larger variation across variables, e.g., 0.57 at the 10% percentile and 2.92 at the 90% percentile among the categorical variables under MIDA, and even larger under GAIN. In summary, other than computational time, MICE with CART significantly outperforms MIDA and GAIN in terms of bias and variance regardless of the sample size.
| Quantile | CART | GAIN | MIDA | CART | GAIN | MIDA |
|---|---|---|---|---|---|---|
| 10% | 0.07 | 0.43 | 0.57 | 0.10 | 5.52 | 1.98 |
| 25% | 0.11 | 1.11 | 1.02 | 0.11 | 6.65 | 2.78 |
| 50% | 0.15 | 1.74 | 1.40 | 0.12 | 7.36 | 4.04 |
| 75% | 0.24 | 3.77 | 2.07 | 0.13 | 9.40 | 6.50 |
| 90% | 0.33 | 4.63 | 2.92 | 0.15 | 11.31 | 7.72 |

The first block of CART‒MIDA columns refers to categorical variables; the second block to binned continuous variables. CART = Classification and regression trees; GAIN = Generative adversarial imputation network; MIDA = Multiple imputation using denoising autoencoders.
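The rows of Table 4.4 summarize a per-estimand bias metric by its quantiles across all estimands. Given a vector of per-estimand (weighted) absolute biases ‒ the exact weighting in (3.4) is assumed, not quoted, and the values below are made up ‒ those rows can be produced directly:

```python
import numpy as np

# Hypothetical per-estimand weighted absolute biases for one method.
bias = np.array([0.07, 0.09, 0.12, 0.15, 0.18, 0.22, 0.26, 0.30, 0.33])

# The 10/25/50/75/90% quantiles reported in each table row.
qs = np.quantile(bias, [0.10, 0.25, 0.50, 0.75, 0.90])
print(np.round(qs, 2))
```

Summarizing by quantiles rather than a single average is what makes the long right tails of the deep learning methods visible: a method can have a moderate median bias yet a 90% quantile many times larger.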
4.4 Role of hyperparameters in tree-based MICE
The pattern that CART outperforms RF is surprising, because the common wisdom is that ensemble methods are usually superior to single-tree methods. However, the same pattern was also observed in another recent study (Wongkamthong and Akande, 2021). We investigate the role of the key hyperparameter in RF ‒ the maximum number of trees ‒ in the simulations. We randomly select a simulated dataset of size 10,000 with 30% of entries MCAR. We use the mice package to fit RF with different numbers of trees, ntree ∈ {2, 5, 10, 15, 20}, where ntree = 10 is the default setting. The relative MSE of the imputed categorical variables fitted using each ntree value, as well as that using CART, is shown as trajectories in Figure 4.3, which reveals a consistent pattern: the upper quantiles ‒ particularly those above 50% ‒ of the relative MSE deteriorate rapidly as the maximum number of trees in RF increases, while the lower quantiles, e.g., 10% and 25%, remain stable. We found a similar pattern with the standardized bias metric and with continuous variables, and thus omit those results here. This suggests that a larger number of trees in RF ‒ at least as implemented in the mice package ‒ leads to much longer tails in the distribution of the bias and MSEs, likely due to overfitting. We cannot exclude the possibility that more customized hyperparameter tuning of RF may outperform CART in some applications. However, such case-specific fine-tuning of the MICE algorithm is generally not available to the vast majority of MI consumers, who rely on the default settings of popular packages like mice.
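An experiment of this shape ‒ varying the tree count and comparing imputation error ‒ can be sketched in Python. This is emphatically not the paper's implementation: mice is an R package, and scikit-learn's IterativeImputer with a random forest estimator imputes by conditional-mean prediction rather than by drawing from a predictive distribution as mice's rf method does, so this sketch only illustrates the mechanics of sweeping the tree count on toy data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.5, size=n)])  # toy data
mask = rng.random(X.shape) < 0.3          # 30% MCAR
X_miss = np.where(mask, np.nan, X)

for ntree in (2, 10, 20):                 # 10 mirrors mice's rf default
    imp = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=ntree, random_state=0),
        max_iter=3, random_state=0)
    X_imp = imp.fit_transform(X_miss)
    rmse = float(np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2)))
    print(ntree, round(rmse, 3))
```

In the paper's setting the relevant comparison is distributional (bias and MSE of downstream estimands over repeated samples), which a single-dataset RMSE sweep like this cannot capture; it serves only to show how such a hyperparameter study is set up.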

Description of Figure 4.3
Quantiles of the relative mean squared error (MSE), on the y-axis, over all marginal (left panels) and bivariate (right panels) probabilities of categorical (upper panels) and binned continuous (lower panels) variables, under CART and RF with various numbers of trees (2, 5, 10, 15 and 20) on the x-axis, for a simulation with a sample size of 10,000 and 30% of values MCAR. Each trajectory represents the relative MSE of a method (CART, or RF with a given number of trees) for quantiles 0.1 (blue), 0.25 (green), 0.5 (yellow), 0.75 (red) and 0.9 (purple). The trajectories reveal a consistent pattern: the upper quantiles of the relative MSE deteriorate rapidly as the maximum number of trees in RF increases, while the lower quantiles remain stable.