A note on multiply robust predictive mean matching imputation with complex survey data
Section 1. Introduction

Predictive mean matching (PMM), a procedure closely related to nearest-neighbour imputation (NNI; Chen and Shao, 2000; Beaumont and Bocci, 2009; Yang and Kim, 2019), is a popular imputation procedure in practice (Little, 1988; Yang and Kim, 2020). In NNI, a missing value of a survey variable $y$ is replaced by the $y$-value of the closest respondent with respect to a vector of fully observed variables $\mathbf{x}$. However, with NNI, the resulting imputed estimator may suffer from a non-negligible bias when the dimension of $\mathbf{x}$ is large (Yang and Kim, 2019), a problem often referred to as the curse of dimensionality. 
In contrast, PMM starts by fitting a parametric model (e.g., a linear regression model) to the responding units, with $y$ as the response variable and $\mathbf{x}$ as the set of explanatory variables. This leads to a set of predicted values, or scores, $\hat{m}$, for all the sample units (respondents and nonrespondents). 
A missing value of the survey variable $y$ is then replaced by the $y$-value of the closest respondent with respect to $\hat{m}$. The latter may be viewed as a scalar summary of the information contained in the vector $\mathbf{x}$. 
Therefore, unlike NNI, PMM is not sensitive to the dimension of $\mathbf{x}$, which is a desirable feature.
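
The contrast between the two procedures can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear working model fitted by unweighted least squares, Euclidean distance for matching, and simulated data, and it ignores survey weights. The function names (`nearest_donor`, `nni_impute`, `pmm_impute`) are hypothetical.

```python
import numpy as np

def nearest_donor(targets, donors):
    """For each row of `targets`, return the index of the closest row of
    `donors` (Euclidean distance; ties go to the first donor)."""
    diffs = targets[:, None, :] - donors[None, :, :]
    return np.argmin((diffs ** 2).sum(axis=2), axis=1)

def nni_impute(x, y, responded):
    """NNI: match nonrespondents to respondents on the full vector x."""
    y_imp = y.copy()
    donors = nearest_donor(x[~responded], x[responded])
    y_imp[~responded] = y[responded][donors]
    return y_imp

def pmm_impute(x, y, responded):
    """PMM: fit a linear working model on respondents (an assumption of
    this sketch), then match on the scalar score m_hat."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design[responded], y[responded], rcond=None)
    m_hat = design @ beta  # scores for respondents and nonrespondents
    y_imp = y.copy()
    donors = nearest_donor(m_hat[~responded].reshape(-1, 1),
                           m_hat[responded].reshape(-1, 1))
    y_imp[~responded] = y[responded][donors]
    return y_imp

# Simulated sample (illustrative values only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))
y = 1.0 + x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)
responded = rng.random(n) < 0.7
y_obs = np.where(responded, y, np.nan)  # y is missing for nonrespondents

y_nni = nni_impute(x, y_obs, responded)
y_pmm = pmm_impute(x, y_obs, responded)
```

In both functions every imputed value is an observed $y$-value borrowed from a respondent; the only difference is whether the distance is computed on $\mathbf{x}$ or on the scalar score $\hat{m}$.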

Both NNI and PMM belong to the class of nonparametric procedures. Therefore, both procedures are less vulnerable to model misspecification than parametric methods (e.g., linear regression imputation). Also, both NNI and PMM belong to the class of donor imputation procedures; that is, they produce eligible imputed values because they use actual observed values “borrowed” from the respondents.

In the first step of PMM, the information contained in the vector $\mathbf{x}$ is compressed into a single score $\hat{m}$ through the use of a parametric model (e.g., a linear regression model). If the specified model provides an accurate description of the relationship linking $y$ and $\mathbf{x}$, we expect PMM to perform well in terms of bias. On the other hand, if the specified model is grossly misspecified, PMM may yield biased estimators.

Multiply robust approaches with multiple outcome regression and nonresponse models have been shown to improve robustness against model misspecification; see Han and Wang (2013), Han (2014), and Chen and Haziza (2019a), among others. In this note, we propose a novel PMM procedure that allows for multiple models, each of which may be based on a different functional form and/or a different set of explanatory variables. Postulating multiple models may prove useful in a number of situations; e.g., see Chen and Haziza (2017) and Chen and Haziza (2019b) for a discussion. The specified models may be parametric or nonparametric. The rationale behind the proposed method is to fit each of these specified models based on the responding units, which leads to multiple sets of predicted values (scores) for all the sample units. After describing the theoretical setup in Section 2, we show how to combine these scores to construct the imputed values in Section 3. The proposed PMM procedure is multiply robust in the sense that the resulting estimator remains consistent even if all but one of the models are misspecified. Because the true model linking $y$ and $\mathbf{x}$ is unknown, the proposed approach is attractive because it provides some protection against model misspecification. 
Also, unlike the multiply robust imputation procedure considered in Chen and Haziza (2017), the proposed method belongs to the class of donor imputation procedures. In Section 4, we conduct a simulation study to assess the performance of the proposed method in terms of bias and efficiency.
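
The first stage of this idea, fitting several working models on the respondents and predicting for every sample unit, can be sketched as follows. This is an illustration under stated assumptions, not the proposed procedure: the two working models (linear and quadratic in $\mathbf{x}$), the unweighted least-squares fit, the simulated data, and the helper name `fit_scores` are all hypothetical choices. The sketch stops at the matrix of scores; how the scores are combined into imputed values is the subject of Section 3 and is not reproduced here.

```python
import numpy as np

def fit_scores(design, y, responded):
    """Fit a least-squares working model on respondents only and return
    predicted values (scores) for all sample units."""
    beta, *_ = np.linalg.lstsq(design[responded], y[responded], rcond=None)
    return design @ beta

# Simulated sample (illustrative values only).
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=(n, 2))
y = 1.0 + x[:, 0] + x[:, 1] ** 2 + rng.normal(scale=0.3, size=n)
responded = rng.random(n) < 0.6
y_obs = np.where(responded, y, np.nan)  # y is missing for nonrespondents

# Two candidate working models with different functional forms.
ones = np.ones(n)
designs = {
    "linear": np.column_stack([ones, x]),
    "quadratic": np.column_stack([ones, x, x ** 2]),
}

# One column of scores per working model, one row per sample unit.
scores = np.column_stack(
    [fit_scores(d, y_obs, responded) for d in designs.values()]
)
```

Each column of `scores` plays the role of one $\hat{m}$; with $J$ working models, every sample unit carries a $J$-dimensional vector of scores rather than a single one.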

