Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 2. Missing data imputation methods
We first introduce notation. Consider a sample with $n$ units, each of which is associated with $p$ variables. Let $Y_{ij}$ be the value of variable $j$ for individual $i$, where $i = 1, \dots, n$ and $j = 1, \dots, p$. Here, $Y_{ij}$ can be continuous, binary, categorical or mixed binary-continuous. For each individual $i$, let $Y_i = (Y_{i1}, \dots, Y_{ip})$. For each variable $j$, let $Y_j = (Y_{1j}, \dots, Y_{nj})$. Let $\mathbf{Y} = (Y_1, \dots, Y_p)$ be the $n \times p$ matrix comprising the data for all records included in the sample. We write $\mathbf{Y} = (\mathbf{Y}^{\mathrm{obs}}, \mathbf{Y}^{\mathrm{mis}})$, where $\mathbf{Y}^{\mathrm{obs}}$ and $\mathbf{Y}^{\mathrm{mis}}$ are respectively the observed and missing parts of $\mathbf{Y}$. We write $\mathbf{Y}^{\mathrm{mis}} = (Y_1^{\mathrm{mis}}, \dots, Y_p^{\mathrm{mis}})$, where $Y_j^{\mathrm{mis}}$ represents all missing values for variable $j$, with $j = 1, \dots, p$. Similarly, we write $Y_j^{\mathrm{obs}}$ for the corresponding observed data.
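To fix ideas, the following toy Python snippet (purely illustrative; the object names are ours and not part of the notation above) builds a small data matrix with missing entries together with the corresponding observed and missing index sets.

```python
import numpy as np

# Toy n x p data matrix Y with missing entries coded as np.nan (illustrative only).
Y = np.array([
    [25.0, 1.0, 50000.0],
    [32.0, 0.0, np.nan],
    [np.nan, 1.0, 72000.0],
    [41.0, np.nan, 61000.0],
])

M = ~np.isnan(Y)              # M[i, j] is True where Y_ij is observed
Y_obs = Y[M]                  # observed part of Y (flattened)
mis_idx = np.argwhere(~M)     # (i, j) positions of the missing values Y^mis
```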
In MI, the analyst generates values of the missing data $\mathbf{Y}^{\mathrm{mis}}$ using pre-specified models estimated with $\mathbf{Y}^{\mathrm{obs}}$, resulting in a completed dataset. The analyst then repeats the process to generate $m$ completed datasets, $\mathbf{Z}^{(1)}, \dots, \mathbf{Z}^{(m)}$, that are available for inference or dissemination. For inference, the analyst can compute sample estimates for population estimands in each completed dataset $\mathbf{Z}^{(l)}$, $l = 1, \dots, m$, and combine them using MI inference rules developed by Rubin (1987), which will be reviewed in Section 3.
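As a minimal illustration of the combining step for a scalar estimand, the following Python sketch implements the standard Rubin (1987) rules; the function name and interface are ours, and Section 3 gives the formal statement.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates of a scalar estimand and their
    within-imputation variances using Rubin's (1987) rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()               # pooled point estimate
    u_bar = variances.mean()               # within-imputation variance
    b = estimates.var(ddof=1)              # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b    # total variance of q_bar
    return q_bar, total_var

# Example: estimates of a population mean from m = 5 completed datasets.
q_hat, t_var = rubin_combine([4.9, 5.1, 5.0, 5.2, 4.8],
                             [0.04, 0.05, 0.04, 0.05, 0.04])
```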
2.1 MICE with classification tree models
Under MICE, the analyst begins by specifying a separate univariate conditional model for each variable with missing values. The analyst then specifies an order in which to iterate through the sequence of conditional models when doing imputation. We write the ordered list of the variables as $Y_{(1)}, \dots, Y_{(p)}$. Next, the analyst initializes each $Y_{(j)}^{\mathrm{mis}}$. The most popular options are to sample from (i) the marginal distribution of the corresponding $Y_{(j)}^{\mathrm{obs}}$, or (ii) the conditional distribution of $Y_{(j)}$ given all the other variables, constructed using only available cases.

After initialization, the MICE algorithm follows an iterative process that cycles through the sequence of univariate models. For each variable $Y_{(j)}$ at each iteration $t = 1, \dots, T$, one fits the conditional model $f(Y_{(j)} \mid \mathbf{Y}_{-(j)})$, where $\mathbf{Y}_{-(j)}$ denotes all variables other than $Y_{(j)}$ at their current completed values. Next, one replaces $Y_{(j)}^{\mathrm{mis}}$ with draws from the implied model $f(Y_{(j)}^{\mathrm{mis}} \mid \mathbf{Y}_{-(j)})$. The iterative process continues for $T$ total iterations until convergence, and the values at the final iteration make up a completed dataset. The entire process is then repeated $m$ times to create the $m$ completed datasets. We provide pseudocode detailing each step of the MICE algorithm in the supplementary material.
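As a rough companion to that pseudocode, the Python sketch below shows only the cycling structure of one MICE chain; it is conceptual (our evaluations use the mice R package), and fit_and_draw is a hypothetical stand-in for whatever univariate conditional model is chosen.

```python
import numpy as np

def mice_cycle(Y, M, fit_and_draw, T=10):
    """Conceptual sketch of one MICE chain (not the mice R package).

    Y            : n x p array with initial fills already in place
    M            : boolean mask, True where the value was originally observed
    fit_and_draw : user-supplied function that fits a univariate conditional
                   model for column j given the other columns and returns
                   draws for the rows where column j is missing
    """
    Z = Y.copy()
    n, p = Z.shape
    for t in range(T):                      # T total iterations
        for j in range(p):                  # cycle through the ordered variables
            mis = ~M[:, j]
            if not mis.any():
                continue
            others = np.delete(Z, j, axis=1)
            # Fit the conditional model on rows where column j is observed,
            # then draw replacements for the rows where it is missing.
            Z[mis, j] = fit_and_draw(others[~mis], Z[~mis, j], others[mis])
    return Z   # one completed dataset; repeat m times for m imputations
```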
Under MICE-CART, the analyst uses CART (Breiman et al., 1984) for the univariate conditional models in the MICE algorithm. CART follows a decision tree structure that uses recursive binary splits to partition the predictor space into distinct non-overlapping regions. The top of the tree represents its root, and each successive binary split divides the predictor space into two new branches as one moves down the tree. The splitting criterion at each node is usually chosen to minimize an information-theoretic entropy measure. Splits that do not decrease the lack of fit by a reasonable amount, based on a set threshold, are pruned off. The tree is built until a stopping criterion is met, e.g., a minimum number of observations in each leaf.
Once the tree has been fully constructed, one generates $Y_{(j)}^{\mathrm{mis}}$ by traversing down the tree to the appropriate leaf using the combinations in $\mathbf{Y}_{-(j)}$, and then sampling from the $Y_{(j)}^{\mathrm{obs}}$ values in that leaf. That is, given any combination in $\mathbf{Y}_{-(j)}$, one uses the proportion of values of $Y_{(j)}$ in the corresponding leaf to approximate the conditional distribution $f(Y_{(j)} \mid \mathbf{Y}_{-(j)})$. The iterative process again continues for $T$ total iterations, and the values at the final iteration make up a completed dataset.
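A minimal sketch of this leaf-sampling step is given below, using scikit-learn's decision tree in place of the CART routine inside the mice package; the hyperparameter values are illustrative and do not correspond exactly to the defaults described in the implementation details later in this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_impute_draws(X_obs, y_obs, X_mis, rng=np.random.default_rng(0)):
    """Sketch of a CART imputation draw for one continuous variable Y_(j):
    fit a tree of Y_(j) on the other variables, send each incomplete case to
    its leaf, and sample from the observed Y_(j) values in that leaf."""
    tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=1e-4)
    tree.fit(X_obs, y_obs)
    leaf_obs = tree.apply(X_obs)   # leaf id of each complete case
    leaf_mis = tree.apply(X_mis)   # leaf id of each case needing imputation
    draws = np.empty(len(X_mis))
    for i, leaf in enumerate(leaf_mis):
        donors = y_obs[leaf_obs == leaf]   # observed Y_(j) values in that leaf
        draws[i] = rng.choice(donors)      # sample one donor value
    return draws
```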
MICE-RF instead uses random forests for the univariate conditional models in MICE (e.g., Stekhoven and Bühlmann, 2012; Shah, Bartlett, Carpenter, Nicholas and Hemingway, 2014). Random forests (Ho, 1995; Breiman, 2001) is an ensemble tree method that fits multiple decision trees to the data, instead of a single tree as in CART. Specifically, random forests constructs multiple decision trees using bootstrapped samples of the original data, and uses only a random sample of the predictors for the recursive partitions in each tree. This approach can significantly reduce the prevalence of unstable trees as well as the correlation among individual trees, since it prevents the same variables from dominating the partitioning process across all trees. Theoretically, this decorrelation should result in predictions with less variance (Hastie, Tibshirani and Friedman, 2009).
For imputation, the analyst first trains a random forests model for each $Y_{(j)}$ using available cases, given all other variables. Next, the analyst generates predictions for $Y_{(j)}^{\mathrm{mis}}$ under that model. Specifically, for any categorical $Y_{(j)}$ and given any particular combination in $\mathbf{Y}_{-(j)}$, the analyst first generates a prediction from each tree based on the values of $Y_{(j)}$ in the corresponding leaf for that tree, and then uses the most commonly occurring level (i.e., the majority vote) among the predictions from all the trees. For a continuous $Y_{(j)}$, the analyst instead uses the average of all the predictions from all the trees. The iterative process again cycles through all the variables for $T$ total iterations, and the values at the final iteration make up a completed dataset. A particularly important hyperparameter in random forests is the maximum number of trees $k$.
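The aggregation of per-tree predictions can be illustrated with scikit-learn's random forests; this is only a sketch of the majority-vote and averaging rules described above, not the implementation used in our evaluations, and the data are simulated for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X_obs = rng.normal(size=(200, 3))
y_cat = (X_obs[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy categorical Y_(j)
y_num = X_obs[:, 1] * 2 + rng.normal(size=200)                 # toy continuous Y_(j)
X_mis = rng.normal(size=(5, 3))                                # cases needing imputation

# Categorical Y_(j): each of the k trees votes, and the majority level is used.
rf_c = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_obs, y_cat)
votes = np.array([t.predict(X_mis) for t in rf_c.estimators_])  # k votes per case
majority = (votes.mean(axis=0) > 0.5).astype(int)               # majority vote (binary)

# Continuous Y_(j): the imputation is the average of the per-tree predictions.
rf_r = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_obs, y_num)
avg_pred = np.mean([t.predict(X_mis) for t in rf_r.estimators_], axis=0)
```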
For our evaluations, we use the mice R package to implement both MICE-CART and MICE-RF, and retain the default hyperparameter settings in the package to mimic common practice in real-world applications. Specifically, we set the minimum number of observations in each terminal leaf to 5 and the pruning threshold to 0.0001 in MICE-CART. In MICE-RF, the maximum number of trees $k$ is set to 10.
2.2 Generative Adversarial Imputation Network (GAIN)
GAIN (Yoon, Jordon and Schaar, 2018) is an imputation method based on GANs (Goodfellow et al., 2014), which consist of a generator function $G$ and a discriminator function $D$. For any data matrix $\mathbf{Y}$, we replace $\mathbf{Y}^{\mathrm{mis}}$ with random noise sampled from a uniform distribution. The generator $G$ takes as input this initialized data and a mask matrix $\mathbf{M}$, with $M_{ij} = 1$ indicating observed values of $\mathbf{Y}$, and outputs predicted values for both the observed data and the missing data, $\bar{\mathbf{Y}}$. The discriminator $D$ utilizes $\bar{\mathbf{Y}}$ and a hint matrix $\mathbf{H}$ of the same dimension to identify which values are observed and which are imputed by $G$, which results in a predicted mask matrix $\hat{\mathbf{M}}$. The hint matrix, sampled from the Bernoulli distribution with success probability equal to a “hint rate” hyperparameter, reveals to $D$ partial information about $\mathbf{M}$ in order to help guide $G$ to learn the underlying distribution of $\mathbf{Y}$.
We first train $D$ to minimize the loss function
$$\mathcal{L}_D = -\sum_{i,j}\left[ M_{ij}\log \hat{M}_{ij} + (1 - M_{ij})\log\left(1 - \hat{M}_{ij}\right)\right], \qquad (2.1)$$
computed over each mini-batch. Next, $G$ is trained to minimize the loss function
$$\mathcal{L}_G + \alpha\,\mathcal{L}_M, \qquad (2.2)$$
which is composed of a generator loss,
$$\mathcal{L}_G = -\sum_{(i,j):\, M_{ij}=0} \log \hat{M}_{ij}, \qquad (2.3)$$
and a reconstruction loss,
$$\mathcal{L}_M = \sum_{(i,j):\, M_{ij}=1} \ell\!\left(Y_{ij}, \bar{Y}_{ij}\right), \qquad (2.4)$$
where $\ell$ is the squared error for numerical variables and the cross-entropy for categorical variables. The generator loss (2.3) is minimized when $D$ incorrectly identifies imputed values as being observed. The reconstruction loss (2.4) is minimized when the predicted values are similar to the observed values, and is weighted by the hyperparameter $\alpha$, where $\alpha > 0$.
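The following numpy sketch spells out these losses for one mini-batch of numerical data; it is a simplified stand-in for the actual GAIN implementation (categorical variables would use a cross-entropy reconstruction term, and the default value of alpha shown is arbitrary).

```python
import numpy as np

def gain_losses(Y, Y_bar, M, M_hat, alpha=10.0, eps=1e-8):
    """Numpy sketch of the GAIN losses for one mini-batch of numerical data.

    Y     : data matrix with missing cells already filled by noise
    Y_bar : generator output (predictions for observed and missing cells)
    M     : mask matrix, 1 = observed, 0 = missing
    M_hat : discriminator output, estimated probability that a cell is observed
    """
    # Discriminator loss (2.1): cross-entropy between M and M_hat.
    d_loss = -np.sum(M * np.log(M_hat + eps) + (1 - M) * np.log(1 - M_hat + eps))

    # Generator loss (2.3): G wants imputed cells (M = 0) to be judged observed.
    g_loss = -np.sum((1 - M) * np.log(M_hat + eps))

    # Reconstruction loss (2.4): fit to the truly observed cells only.
    m_loss = np.sum(M * (Y_bar - Y) ** 2)

    # Combined generator objective (2.2), weighted by alpha > 0.
    return d_loss, g_loss + alpha * m_loss
```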
In our experiments, we model both $G$ and $D$ as fully-connected neural networks, each with three hidden layers and $h$ hidden units per hidden layer. The hidden layer weights are initialized uniformly at random with the Xavier initialization method (Glorot and Bengio, 2010). We use the leaky ReLU activation function (Maas, Hannun and Ng, 2013) for each hidden layer, and a softmax activation function for the output layer of $G$ in the case of categorical variables, or a sigmoid activation function in the case of numerical variables and for the output of $D$. We facilitate this choice of output layer for numerical variables by transforming all continuous variables to be within the range $(0, 1)$ using the MinMax normalization $Y_{ij}' = \frac{Y_{ij} - \min(Y_j)}{\max(Y_j) - \min(Y_j)}$, where $\min(Y_j)$ and $\max(Y_j)$ are the minimum and maximum of variable $j$, respectively. After imputation, we transform each value back to its original scale. We generate multiple imputations using several runs of the model with varying initial imputations of the missing values.
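A small numpy sketch of this column-wise normalization and the back-transformation (ignoring missing entries when computing the column minima and maxima) is given below; the function names are ours.

```python
import numpy as np

def minmax_scale(Y):
    """Column-wise MinMax normalization, ignoring missing entries (np.nan)."""
    col_min = np.nanmin(Y, axis=0)
    col_max = np.nanmax(Y, axis=0)
    return (Y - col_min) / (col_max - col_min), col_min, col_max

def minmax_unscale(Y_scaled, col_min, col_max):
    """Map imputed values back to the original scale after imputation."""
    return Y_scaled * (col_max - col_min) + col_min
```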
To implement GAIN in our evaluations, we use the same architecture as the one in Yoon, Jordon and Schaar (2018). We set $h$ equal to the number of features of the input data, and tune the hint rate on a single simulation. Following common practice in the GAN literature (Berthelot, Schumm and Metz, 2017; Ham, Jun and Kim, 2020), we track the evolution of GAIN’s generator and discriminator losses, and manually tune the hint rate so that the two losses are qualitatively similar. Specifically, we first coarsely select the hint rate among {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and then determine the final value by an additional fine-tuning step. In the MAR scenario, for example, after observing that the optimal value lies in the range (0.1, 0.2), we perform a search among {0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19}. Finally, we set the optimal hint rates for the MCAR and MAR scenarios to 0.3 and 0.13, respectively. We train the networks for 200 epochs using stochastic gradient descent (SGD) and mini-batches of size 512 to learn the parameter weights. We use the Adam optimizer to adapt the learning rate, with an initial rate of 0.001 (Kingma and Ba, 2014).
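For completeness, the sketch below shows one way a hint matrix can be sampled given the mask and a tuned hint rate, roughly following the construction in Yoon, Jordon and Schaar (2018); the exact construction varies across GAIN implementations, so this should be read as an assumption rather than a description of our code.

```python
import numpy as np

def sample_hint(M, hint_rate, rng=np.random.default_rng(0)):
    """Sample a hint matrix for one mini-batch: cells where B = 1 reveal the
    true mask value to the discriminator, while the remaining cells are set
    to the uninformative value 0.5."""
    B = rng.binomial(1, hint_rate, size=M.shape)
    return B * M + 0.5 * (1 - B)

# e.g., using the tuned hint rates from the text: 0.3 (MCAR) or 0.13 (MAR)
# H = sample_hint(M, hint_rate=0.3)
```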
2.3 Multiple Imputation using Denoising Autoencoders
(MIDA)
MIDA (Gondara and Wang, 2018; Lu et al., 2020) extends a class of neural networks, denoising autoencoders, for MI. An autoencoder is a neural network model trained to learn the identity function of the input data. Denoising autoencoders intentionally corrupt the input data in order to prevent the network from simply learning the identity function and to instead learn a useful low-dimensional representation of the input data. The MIDA architecture consists of an encoder and a decoder, each modeled as a fully-connected neural network with three hidden layers and $h$ hidden units per hidden layer. We first perform an initial imputation of the missing values using the mean for continuous variables and the most frequent label for categorical variables, which results in a completed dataset. The encoder takes this completed dataset as input and corrupts it by randomly dropping out half of the variables. The corrupted input data are mapped to a higher-dimensional representation by adding $\theta$ hidden units to each successive hidden layer of the encoder. The decoder receives the output from the encoder and symmetrically scales the encoding back to the original input dimension. All hidden layers use a hyperbolic tangent (tanh) activation function, while the output layer of the decoder uses a softmax (sigmoid) activation function in the case of categorical (numerical) variables. Multiple imputations are generated from multiple runs of the model, with the hidden layer weights initialized as Gaussian random variables.
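A numpy sketch of the initial imputation and the corruption step described above follows; it is conceptual only and not the MIDA code used in our evaluations.

```python
import numpy as np

def initial_impute(Y, is_categorical):
    """Fill missing entries with the column mean (continuous) or the most
    frequent observed label (categorical), as the starting point for MIDA."""
    Z = Y.copy()
    for j in range(Y.shape[1]):
        obs = Y[~np.isnan(Y[:, j]), j]
        if is_categorical[j]:
            values, counts = np.unique(obs, return_counts=True)
            fill = values[np.argmax(counts)]      # most frequent label
        else:
            fill = obs.mean()                     # column mean
        Z[np.isnan(Y[:, j]), j] = fill
    return Z

def corrupt(Z, drop_fraction=0.5, rng=np.random.default_rng(0)):
    """Denoising step: randomly zero out a fraction of the variables so the
    autoencoder cannot simply learn the identity function."""
    keep = rng.random(Z.shape[1]) >= drop_fraction
    return Z * keep   # dropped columns are set to zero
```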
Following Lu et al. (2020), we train MIDA in two phases: a primary phase and a fine-tuning phase. In the primary phase, we feed the initially imputed data to MIDA and train for $s_1$ epochs. In the fine-tuning phase, MIDA is trained for a further $s_2$ epochs on the output of the primary phase, and produces the final output. The same loss function is used in both phases and closely resembles the reconstruction loss (2.4) in GAIN.
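The two-phase schedule can be summarized as follows; `model` stands for any MIDA-like object with a hypothetical fit/transform interface, so this is a structural sketch rather than the actual training code.

```python
import numpy as np

def reconstruction_loss(Y_hat, Y, M):
    """Squared-error loss on the originally observed cells, in the spirit of
    the GAIN reconstruction loss (2.4)."""
    return np.sum(M * (Y_hat - Y) ** 2)

def train_two_phase(model, Z0, M, s1, s2):
    """Conceptual two-phase schedule.

    Z0 : initially imputed data (mean/mode fills)
    M  : mask of originally observed cells
    s1 : number of primary-phase epochs, s2 : number of fine-tuning epochs
    """
    model.fit(Z0, M, epochs=s1)     # primary phase on the initial fills
    Z1 = model.transform(Z0)        # output of the primary phase
    model.fit(Z1, M, epochs=s2)     # fine-tuning phase on that output
    return model.transform(Z1)      # final completed dataset
```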
To implement MIDA in our evaluations, we use the same architecture and tune the hyperparameters in a single simulation, as in Lu et al. (2020). We plot the evolution of the loss function and select the number of additional units $\theta$ among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} so as to reduce the loss. In our experiments, we set $h$ equal to the number of features of the input data and add the selected number $\theta$ of hidden units to each of the three hidden layers of the encoder. We train the model for $s_1$ epochs in the primary phase and $s_2$ epochs in the fine-tuning phase. As in GAIN, we learn the model parameters using SGD with mini-batches of size 512, and use the Adam optimizer to adapt the learning rate with an initial rate of 0.001.