Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 2. Missing data imputation methods
We first introduce notation. Consider a sample with $n$ units, each of which is associated with $p$ variables. Let $Y_{ij}$ be the value of variable $j$ for individual $i$, where $i = 1, \dots, n$ and $j = 1, \dots, p$. Here, $Y_{ij}$ can be continuous, binary, categorical or mixed binary-continuous. For each individual $i$, let $Y_i = (Y_{i1}, \dots, Y_{ip})$. For each variable $j$, let $Y_j = (Y_{1j}, \dots, Y_{nj})$. Let $\mathbf{Y} = (Y_1, \dots, Y_p)$ be the $n \times p$ matrix comprising the data for all records included in the sample. We write $\mathbf{Y} = (\mathbf{Y}^{\mathrm{obs}}, \mathbf{Y}^{\mathrm{mis}})$, where $\mathbf{Y}^{\mathrm{obs}}$ and $\mathbf{Y}^{\mathrm{mis}}$ are respectively the observed and missing parts of $\mathbf{Y}$. We write $\mathbf{Y}^{\mathrm{mis}} = (Y_1^{\mathrm{mis}}, \dots, Y_p^{\mathrm{mis}})$, where $Y_j^{\mathrm{mis}}$ represents all missing values for variable $j$, with $j = 1, \dots, p$. Similarly, we write $Y_j^{\mathrm{obs}}$ for the corresponding observed data.
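To fix ideas, the following toy Python snippet (purely illustrative; the object names are ours and not part of the notation above) builds a small data matrix with missing entries together with the corresponding observed and missing index sets.

```python
import numpy as np

# Toy n x p data matrix Y with missing entries coded as np.nan (illustrative only).
Y = np.array([
    [25.0, 1.0, 50000.0],
    [32.0, 0.0, np.nan],
    [np.nan, 1.0, 72000.0],
    [41.0, np.nan, 61000.0],
])

M = ~np.isnan(Y)              # M[i, j] is True where Y_ij is observed
Y_obs = Y[M]                  # observed part of Y (flattened)
mis_idx = np.argwhere(~M)     # (i, j) positions of the missing values Y^mis
```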
In MI, the analyst generates values of the missing data $\mathbf{Y}^{\mathrm{mis}}$ using pre-specified models estimated with $\mathbf{Y}^{\mathrm{obs}}$, resulting in a completed dataset. The analyst then repeats the process to generate $m$ completed datasets, $\mathbf{Z}^{(1)}, \dots, \mathbf{Z}^{(m)}$, that are available for inference or dissemination. For inference, the analyst can compute sample estimates for population estimands in each completed dataset $\mathbf{Z}^{(l)}$, $l = 1, \dots, m$, and combine them using MI inference rules developed by Rubin (1987), which will be reviewed in Section 3.
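As a minimal illustration of the combining step for a scalar estimand, the following Python sketch implements the standard Rubin (1987) rules; the function name and interface are ours, and Section 3 gives the formal statement.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates of a scalar estimand and their
    within-imputation variances using Rubin's (1987) rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()               # pooled point estimate
    u_bar = variances.mean()               # within-imputation variance
    b = estimates.var(ddof=1)              # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b    # total variance of q_bar
    return q_bar, total_var

# Example: estimates of a population mean from m = 5 completed datasets.
q_hat, t_var = rubin_combine([4.9, 5.1, 5.0, 5.2, 4.8],
                             [0.04, 0.05, 0.04, 0.05, 0.04])
```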
2.1 MICE with classification tree models
Under MICE, the analyst begins by specifying a separate univariate conditional model for each variable with missing values. The analyst then specifies an order in which to iterate through the sequence of conditional models when doing imputation. We write the ordered list of the variables as $Y_{(1)}, \dots, Y_{(p)}$. Next, the analyst initializes each $Y_{(j)}^{\mathrm{mis}}$. The most popular options are to sample from (i) the marginal distribution of the corresponding $Y_{(j)}^{\mathrm{obs}}$, or (ii) the conditional distribution of $Y_{(j)}$ given all the other variables, constructed using only available cases.

After initialization, the MICE algorithm follows an iterative process that cycles through the sequence of univariate models. For each variable $Y_{(j)}$ at each iteration $t = 1, \dots, T$, one fits the conditional model $f(Y_{(j)} \mid \mathbf{Y}_{-(j)})$, where $\mathbf{Y}_{-(j)}$ denotes all variables other than $Y_{(j)}$ at their current completed values. Next, one replaces $Y_{(j)}^{\mathrm{mis}}$ with draws from the implied model $f(Y_{(j)}^{\mathrm{mis}} \mid \mathbf{Y}_{-(j)})$. The iterative process continues for $T$ total iterations until convergence, and the values at the final iteration make up a completed dataset. The entire process is then repeated $m$ times to create the $m$ completed datasets. We provide pseudocode detailing each step of the MICE algorithm in the supplementary material.
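As a rough companion to that pseudocode, the Python sketch below shows only the cycling structure of one MICE chain; it is conceptual (our evaluations use the mice R package), and fit_and_draw is a hypothetical stand-in for whatever univariate conditional model is chosen.

```python
import numpy as np

def mice_cycle(Y, M, fit_and_draw, T=10):
    """Conceptual sketch of one MICE chain (not the mice R package).

    Y            : n x p array with initial fills already in place
    M            : boolean mask, True where the value was originally observed
    fit_and_draw : user-supplied function that fits a univariate conditional
                   model for column j given the other columns and returns
                   draws for the rows where column j is missing
    """
    Z = Y.copy()
    n, p = Z.shape
    for t in range(T):                      # T total iterations
        for j in range(p):                  # cycle through the ordered variables
            mis = ~M[:, j]
            if not mis.any():
                continue
            others = np.delete(Z, j, axis=1)
            # Fit the conditional model on rows where column j is observed,
            # then draw replacements for the rows where it is missing.
            Z[mis, j] = fit_and_draw(others[~mis], Z[~mis, j], others[mis])
    return Z   # one completed dataset; repeat m times for m imputations
```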
Under MICE-CART, the analyst uses CART (Breiman et al., 1984) for the univariate conditional models in the MICE algorithm. CART follows a decision tree structure that uses recursive binary splits to partition the predictor space into distinct non-overlapping regions. The top of the tree represents its root, and each successive binary split divides the predictor space into two new branches as one moves down the tree. The splitting criterion at each node is usually chosen to minimize an information-theoretic entropy measure. Splits that do not decrease the lack of fit by a reasonable amount, based on a set threshold, are pruned off. The tree is built until a stopping criterion is met, e.g., a minimum number of observations in each leaf.
Once the tree has been fully constructed, one generates $Y_{(j)}^{\mathrm{mis}}$ by traversing down the tree to the appropriate leaf using the combinations in $\mathbf{Y}_{-(j)}$, and then sampling from the $Y_{(j)}^{\mathrm{obs}}$ values in that leaf. That is, given any combination in $\mathbf{Y}_{-(j)}$, one uses the proportion of values of $Y_{(j)}$ in the corresponding leaf to approximate the conditional distribution $f(Y_{(j)} \mid \mathbf{Y}_{-(j)})$. The iterative process again continues for $T$ total iterations, and the values at the final iteration make up a completed dataset.
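A minimal sketch of this leaf-sampling step is given below, using scikit-learn's decision tree in place of the CART routine inside the mice package; the hyperparameter values are illustrative and do not correspond exactly to the defaults described in the implementation details later in this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_impute_draws(X_obs, y_obs, X_mis, rng=np.random.default_rng(0)):
    """Sketch of a CART imputation draw for one continuous variable Y_(j):
    fit a tree of Y_(j) on the other variables, send each incomplete case to
    its leaf, and sample from the observed Y_(j) values in that leaf."""
    tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=1e-4)
    tree.fit(X_obs, y_obs)
    leaf_obs = tree.apply(X_obs)   # leaf id of each complete case
    leaf_mis = tree.apply(X_mis)   # leaf id of each case needing imputation
    draws = np.empty(len(X_mis))
    for i, leaf in enumerate(leaf_mis):
        donors = y_obs[leaf_obs == leaf]   # observed Y_(j) values in that leaf
        draws[i] = rng.choice(donors)      # sample one donor value
    return draws
```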
MICE-RF instead uses random forests for the univariate conditional models in MICE (e.g., Stekhoven and Bühlmann, 2012; Shah, Bartlett, Carpenter, Nicholas and Hemingway, 2014). Random forests (Ho, 1995; Breiman, 2001) is an ensemble tree method that fits multiple decision trees to the data, instead of a single tree as in CART. Specifically, random forests constructs multiple decision trees using bootstrapped samples of the original data, and uses only a random sample of the predictors for the recursive partitions in each tree. This approach can significantly reduce the prevalence of unstable trees as well as the correlation among individual trees, since it prevents the same variables from dominating the partitioning process across all trees. Theoretically, this decorrelation should result in predictions with less variance (Hastie, Tibshirani and Friedman, 2009).
For imputation, the analyst first trains a random forests model for each $Y_{(j)}$ using available cases, given all other variables. Next, the analyst generates predictions for $Y_{(j)}^{\mathrm{mis}}$ under that model. Specifically, for any categorical $Y_{(j)}$ and given any particular combination in $\mathbf{Y}_{-(j)}$, the analyst first generates a prediction from each tree based on the values of $Y_{(j)}$ in the corresponding leaf for that tree, and then uses the most commonly occurring level (i.e., the majority vote) among the predictions from all the trees. For a continuous $Y_{(j)}$, the analyst instead uses the average of all the predictions from all the trees. The iterative process again cycles through all the variables for $T$ total iterations, and the values at the final iteration make up a completed dataset. A particularly important hyperparameter in random forests is the maximum number of trees $k$.
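The aggregation of per-tree predictions can be illustrated with scikit-learn's random forests; this is only a sketch of the majority-vote and averaging rules described above, not the implementation used in our evaluations, and the data are simulated for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X_obs = rng.normal(size=(200, 3))
y_cat = (X_obs[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy categorical Y_(j)
y_num = X_obs[:, 1] * 2 + rng.normal(size=200)                 # toy continuous Y_(j)
X_mis = rng.normal(size=(5, 3))                                # cases needing imputation

# Categorical Y_(j): each of the k trees votes, and the majority level is used.
rf_c = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_obs, y_cat)
votes = np.array([t.predict(X_mis) for t in rf_c.estimators_])  # k votes per case
majority = (votes.mean(axis=0) > 0.5).astype(int)               # majority vote (binary)

# Continuous Y_(j): the imputation is the average of the per-tree predictions.
rf_r = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_obs, y_num)
avg_pred = np.mean([t.predict(X_mis) for t in rf_r.estimators_], axis=0)
```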
For our evaluations, we use the mice R package to implement both MICE-CART and MICE-RF, and retain the default hyperparameter settings in the package to mimic common practice in real-world applications. Specifically, we set the minimum number of observations in each terminal leaf to 5 and the pruning threshold to 0.0001 in MICE-CART. In MICE-RF, the maximum number of trees $k$ is set to 10.
2.2 Generative Adversarial Imputation Network (GAIN)
GAIN (Yoon, Jordon and Schaar, 2018) is an imputation method based on GANs (Goodfellow et al., 2014), which consist of a generator function $G$ and a discriminator function $D$. For any data matrix $\mathbf{Y}$, we replace $\mathbf{Y}^{\mathrm{mis}}$ with random noise sampled from a uniform distribution. The generator $G$ takes as input this initialized data and a mask matrix $\mathbf{M}$, with $M_{ij} = 1$ indicating observed values of $\mathbf{Y}$, and outputs predicted values for both the observed data and the missing data, $\bar{\mathbf{Y}}$. The discriminator $D$ utilizes $\bar{\mathbf{Y}}$ and a hint matrix $\mathbf{H}$ of the same dimension to identify which values are observed and which are imputed by $G$, which results in a predicted mask matrix $\hat{\mathbf{M}}$. The hint matrix, sampled from the Bernoulli distribution with success probability equal to a “hint rate” hyperparameter, reveals to $D$ partial information about $\mathbf{M}$ in order to help guide $G$ to learn the underlying distribution of $\mathbf{Y}$.
We first train $D$ to minimize the loss function
$$\mathcal{L}_D = -\sum_{i,j}\left[ M_{ij}\log \hat{M}_{ij} + (1 - M_{ij})\log\left(1 - \hat{M}_{ij}\right)\right], \qquad (2.1)$$
computed over each mini-batch. Next, $G$ is trained to minimize the loss function
$$\mathcal{L}_G + \alpha\,\mathcal{L}_M, \qquad (2.2)$$
which is composed of a generator loss,
$$\mathcal{L}_G = -\sum_{(i,j):\, M_{ij}=0} \log \hat{M}_{ij}, \qquad (2.3)$$
and a reconstruction loss,
$$\mathcal{L}_M = \sum_{(i,j):\, M_{ij}=1} \ell\!\left(Y_{ij}, \bar{Y}_{ij}\right), \qquad (2.4)$$
where $\ell$ is the squared error for numerical variables and the cross-entropy for categorical variables. The generator loss (2.3) is minimized when $D$ incorrectly identifies imputed values as being observed. The reconstruction loss (2.4) is minimized when the predicted values are similar to the observed values, and is weighted by the hyperparameter $\alpha$, where $\alpha > 0$.
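The following numpy sketch spells out these losses for one mini-batch of numerical data; it is a simplified stand-in for the actual GAIN implementation (categorical variables would use a cross-entropy reconstruction term, and the default value of alpha shown is arbitrary).

```python
import numpy as np

def gain_losses(Y, Y_bar, M, M_hat, alpha=10.0, eps=1e-8):
    """Numpy sketch of the GAIN losses for one mini-batch of numerical data.

    Y     : data matrix with missing cells already filled by noise
    Y_bar : generator output (predictions for observed and missing cells)
    M     : mask matrix, 1 = observed, 0 = missing
    M_hat : discriminator output, estimated probability that a cell is observed
    """
    # Discriminator loss (2.1): cross-entropy between M and M_hat.
    d_loss = -np.sum(M * np.log(M_hat + eps) + (1 - M) * np.log(1 - M_hat + eps))

    # Generator loss (2.3): G wants imputed cells (M = 0) to be judged observed.
    g_loss = -np.sum((1 - M) * np.log(M_hat + eps))

    # Reconstruction loss (2.4): fit to the truly observed cells only.
    m_loss = np.sum(M * (Y_bar - Y) ** 2)

    # Combined generator objective (2.2), weighted by alpha > 0.
    return d_loss, g_loss + alpha * m_loss
```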
In our experiments, we model both $G$ and $D$ as fully-connected neural networks, each with three hidden layers and $h$ hidden units per hidden layer. The hidden layer weights are initialized uniformly at random with the Xavier initialization method (Glorot and Bengio, 2010). We use the leaky ReLU activation function (Maas, Hannun and Ng, 2013) for each hidden layer, and a softmax activation function for the output layer of $G$ in the case of categorical variables, or a sigmoid activation function in the case of numerical variables and for the output of $D$. We facilitate this choice of output layer for numerical variables by transforming all continuous variables to be within the range $(0, 1)$ using the MinMax normalization $Y_{ij}' = \frac{Y_{ij} - \min(Y_j)}{\max(Y_j) - \min(Y_j)}$, where $\min(Y_j)$ and $\max(Y_j)$ are the minimum and maximum of variable $j$, respectively. After imputation, we transform each value back to its original scale. We generate multiple imputations using several runs of the model with varying initial imputations of the missing values.
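A small numpy sketch of this column-wise normalization and the back-transformation (ignoring missing entries when computing the column minima and maxima) is given below; the function names are ours.

```python
import numpy as np

def minmax_scale(Y):
    """Column-wise MinMax normalization, ignoring missing entries (np.nan)."""
    col_min = np.nanmin(Y, axis=0)
    col_max = np.nanmax(Y, axis=0)
    return (Y - col_min) / (col_max - col_min), col_min, col_max

def minmax_unscale(Y_scaled, col_min, col_max):
    """Map imputed values back to the original scale after imputation."""
    return Y_scaled * (col_max - col_min) + col_min
```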
To implement GAIN in our evaluations, we use the same architecture as the one in Yoon, Jordon and Schaar (2018). We set $h$ equal to the number of features of the input data, and tune the hint rate on a single simulation. Following common practice in the GAN literature (Berthelot, Schumm and Metz, 2017; Ham, Jun and Kim, 2020), we track the evolution of GAIN’s generator and discriminator losses, and manually tune the hint rate so that the two losses are qualitatively similar. Specifically, we first coarsely select the hint rate among {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and then determine the final value by an additional fine-tuning step. In the MAR scenario, for example, after observing that the optimal value lies in the range (0.1, 0.2), we perform a search among {0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19}. Finally, we set the optimal hint rates for the MCAR and MAR scenarios to 0.3 and 0.13, respectively. We train the networks for 200 epochs using stochastic gradient descent (SGD) and mini-batches of size 512 to learn the parameter weights. We use the Adam optimizer to adapt the learning rate, with an initial rate of 0.001 (Kingma and Ba, 2014).
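For completeness, the sketch below shows one way a hint matrix can be sampled given the mask and a tuned hint rate, roughly following the construction in Yoon, Jordon and Schaar (2018); the exact construction varies across GAIN implementations, so this should be read as an assumption rather than a description of our code.

```python
import numpy as np

def sample_hint(M, hint_rate, rng=np.random.default_rng(0)):
    """Sample a hint matrix for one mini-batch: cells where B = 1 reveal the
    true mask value to the discriminator, while the remaining cells are set
    to the uninformative value 0.5."""
    B = rng.binomial(1, hint_rate, size=M.shape)
    return B * M + 0.5 * (1 - B)

# e.g., using the tuned hint rates from the text: 0.3 (MCAR) or 0.13 (MAR)
# H = sample_hint(M, hint_rate=0.3)
```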
2.3 Multiple Imputation using Denoising Autoencoders
(MIDA)
MIDA (Gondara and Wang, 2018; Lu et al., 2020) extends a class of neural networks, denoising autoencoders, for MI. An autoencoder is a neural network model trained to learn the identity function of the input data. Denoising autoencoders intentionally corrupt the input data in order to prevent the network from simply learning the identity function and to instead learn a useful low-dimensional representation of the input data. The MIDA architecture consists of an encoder and a decoder, each modeled as a fully-connected neural network with three hidden layers and $h$ hidden units per hidden layer. We first perform an initial imputation of the missing values using the mean for continuous variables and the most frequent label for categorical variables, which results in a completed dataset. The encoder takes this completed dataset as input and corrupts it by randomly dropping out half of the variables. The corrupted input data are mapped to a higher-dimensional representation by adding $\theta$ hidden units to each successive hidden layer of the encoder. The decoder receives the output from the encoder and symmetrically scales the encoding back to the original input dimension. All hidden layers use a hyperbolic tangent (tanh) activation function, while the output layer of the decoder uses a softmax (sigmoid) activation function in the case of categorical (numerical) variables. Multiple imputations are generated from multiple runs of the model, with the hidden layer weights initialized as Gaussian random variables.
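A numpy sketch of the initial imputation and the corruption step described above follows; it is conceptual only and not the MIDA code used in our evaluations.

```python
import numpy as np

def initial_impute(Y, is_categorical):
    """Fill missing entries with the column mean (continuous) or the most
    frequent observed label (categorical), as the starting point for MIDA."""
    Z = Y.copy()
    for j in range(Y.shape[1]):
        obs = Y[~np.isnan(Y[:, j]), j]
        if is_categorical[j]:
            values, counts = np.unique(obs, return_counts=True)
            fill = values[np.argmax(counts)]      # most frequent label
        else:
            fill = obs.mean()                     # column mean
        Z[np.isnan(Y[:, j]), j] = fill
    return Z

def corrupt(Z, drop_fraction=0.5, rng=np.random.default_rng(0)):
    """Denoising step: randomly zero out a fraction of the variables so the
    autoencoder cannot simply learn the identity function."""
    keep = rng.random(Z.shape[1]) >= drop_fraction
    return Z * keep   # dropped columns are set to zero
```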
Following Lu et al. (2020), we train MIDA in two phases: a primary phase and a fine-tuning phase. In the primary phase, we feed the initially imputed data to MIDA and train for $s_1$ epochs. In the fine-tuning phase, MIDA is trained for a further $s_2$ epochs on the output of the primary phase, and produces the final output. The same loss function is used in both phases and closely resembles the reconstruction loss (2.4) in GAIN.
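The two-phase schedule can be summarized as follows; `model` stands for any MIDA-like object with a hypothetical fit/transform interface, so this is a structural sketch rather than the actual training code.

```python
import numpy as np

def reconstruction_loss(Y_hat, Y, M):
    """Squared-error loss on the originally observed cells, in the spirit of
    the GAIN reconstruction loss (2.4)."""
    return np.sum(M * (Y_hat - Y) ** 2)

def train_two_phase(model, Z0, M, s1, s2):
    """Conceptual two-phase schedule.

    Z0 : initially imputed data (mean/mode fills)
    M  : mask of originally observed cells
    s1 : number of primary-phase epochs, s2 : number of fine-tuning epochs
    """
    model.fit(Z0, M, epochs=s1)     # primary phase on the initial fills
    Z1 = model.transform(Z0)        # output of the primary phase
    model.fit(Z1, M, epochs=s2)     # fine-tuning phase on that output
    return model.transform(Z1)      # final completed dataset
```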
To implement MIDA in our evaluations, we use the same architecture and tune the hyperparameters in a single simulation, as in Lu et al. (2020). We plot the evolution of the loss function and select the number of additional units $\theta$ among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} so as to reduce the loss. In our experiments, we set $h$ equal to the number of features of the input data and add the selected number $\theta$ of hidden units to each of the three hidden layers of the encoder. We train the model for $s_1$ epochs in the primary phase and $s_2$ epochs in the fine-tuning phase. As in GAIN, we learn the model parameters using SGD with mini-batches of size 512, and use the Adam optimizer to adapt the learning rate with an initial rate of 0.001.