Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 6. Conclusions
We have evaluated the performance of several model-assisted survey regression estimators, in the context of both probability and non-probability sampling, through a simulation study. First, we discuss the overall conclusions from our simulation study using probability samples with a stratified SRS design. In the context of our business survey data with all categorical auxiliary variables, the regression tree estimator and the lasso (2-way) estimator with two-factor interaction effects are the only model-assisted estimators that provide any efficiency gains, relative to the HT estimator, when the sample size is small and the number of categories of auxiliary variables used is large. As well, the variance estimator for the regression tree estimator is the least biased in this scenario. As the sample size increases, the difference in efficiency between the model-assisted survey regression estimators becomes negligible and all are slightly more efficient than the HT estimator. In general, the potential gains in efficiency for model-assisted estimators over the HT estimator depend on the predictive power of the model. In our simulation population, the relationship between the study variable and the available categorical auxiliary variables is somewhat weak, as judged by an adjusted coefficient of determination of around 0.20. We therefore generated study variables leading to larger values, around 0.50, by making the model error variance smaller. As expected, model-assisted estimators led to significant efficiency gains over the HT estimator in all cases, as reported in Table 4.2, which shows that the regression tree estimator and the lasso estimator with interaction effects yield improved efficiency over the commonly used GREG estimator if two-factor interactions are present.
Moreover, the regression weights for the tree estimator and the calibration weights for the lasso calibration estimators are much less variable, particularly for small sample sizes, than the weights for the GREG estimator. We also examined the performance of the lasso-based and regression tree estimators under a scenario with no main effects and only two-factor interactions present, and under another scenario where multicollinearity among the auxiliary variables is present. In the latter scenario, GREG is not applicable, and we show that the regression tree and lasso estimators provide an automatic way of removing collinear auxiliary variables without impacting the potential efficiency gains. Overall, we recommend using either the lasso (2-way) or the regression tree estimator in terms of efficiency when two-factor interactions are likely to be present among the categorical auxiliary variables. Even in the case of models with only main effects, both methods perform well relative to GREG in terms of MSE because the lasso (2-way) estimator automatically shrinks regression coefficients associated with the interactions to zero while the regression tree estimator does not require specification of the mean function. In other contexts where there is evidence of complex non-linear and non-additive relationships between the survey variable of interest and auxiliary variables, the use of other tree-based machine learning methods, such as xgboost and random forests, should be studied.
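The dependence of efficiency gains on the predictive power of the model, noted above, can be illustrated with a small Monte Carlo sketch. It is not the simulation design of this paper: it assumes simple random sampling rather than stratified SRS, a hypothetical population with a single categorical auxiliary variable, and the post-stratified estimator as the simplest model-assisted estimator. The function name relative_efficiency and all population values are illustrative assumptions.

```python
import numpy as np

def relative_efficiency(target_r2, n=200, reps=2000, seed=1):
    """Return (population R^2, Var(HT) / Var(post-stratified)) under SRS."""
    rng = np.random.default_rng(seed)
    N, G = 10_000, 5                              # hypothetical population size and number of categories
    g = rng.integers(0, G, size=N)                # one categorical auxiliary variable
    signal = np.linspace(0.0, 4.0, G)[g]          # category means (illustrative values)
    # Choose the error standard deviation so that R^2 is close to the target.
    noise_sd = np.sqrt(signal.var() * (1 - target_r2) / target_r2)
    y = signal + rng.normal(0.0, noise_sd, size=N)
    N_g = np.bincount(g, minlength=G)             # known population counts per category

    ht, ps = np.empty(reps), np.empty(reps)
    for r in range(reps):
        s = rng.choice(N, size=n, replace=False)  # simple random sample without replacement
        ht[r] = y[s].mean()                       # HT estimator of the mean under SRS
        n_g = np.bincount(g[s], minlength=G)
        # Sample cell means; empty cells are guarded against but are very unlikely at this n.
        cell_means = np.bincount(g[s], weights=y[s], minlength=G) / np.maximum(n_g, 1)
        ps[r] = (N_g * cell_means).sum() / N      # post-stratified estimator of the mean

    # Population R^2 of the one-way (category means) model.
    pop_cell_means = np.bincount(g, weights=y, minlength=G) / N_g
    r2 = 1.0 - ((y - pop_cell_means[g]) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return r2, ht.var() / ps.var()

r2_low, eff_low = relative_efficiency(0.20)
r2_high, eff_high = relative_efficiency(0.50)
print(f"R^2 = {r2_low:.2f}: Var(HT) / Var(PS) = {eff_low:.2f}")
print(f"R^2 = {r2_high:.2f}: Var(HT) / Var(PS) = {eff_high:.2f}")
```

Under these assumptions the variance ratio is expected to grow as the error variance shrinks, which is the qualitative pattern described in the text; the magnitudes are specific to this toy population.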
In Section 4.3, we studied the performance of variance estimators in terms of relative bias and showed that all the variance estimators exhibit significant underestimation for sample size and 28 -categories. The relative bias of the regression tree variance estimator did not decrease as the sample size increased, unlike in the other cases, which could be due to overfitting. In the context of the random forest method, Dagdoug, Goga and Haziza (2021) examined a procedure based on cross-validation which led to small relative biases and good coverage rates. It would be worthwhile to study a similar procedure for variance estimation of the regression tree estimator.
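For readers who want to reproduce this kind of diagnostic, the following sketch computes a Monte Carlo relative bias of a variance estimator. It assumes one common definition, the average variance estimate minus the Monte Carlo variance of the point estimator, divided by that Monte Carlo variance; the definition used in Section 4.3 may differ in detail. The helper name relative_bias_percent and the toy population are assumptions, and the example uses the textbook SRS variance estimator of a mean rather than any estimator studied here.

```python
import numpy as np

def relative_bias_percent(var_estimates, point_estimates):
    """Monte Carlo relative bias (in %) of a variance estimator.

    Assumed definition: 100 * (mean of variance estimates - MC variance) / MC variance.
    """
    mc_var = np.var(point_estimates, ddof=1)
    return 100.0 * (np.mean(var_estimates) - mc_var) / mc_var

rng = np.random.default_rng(0)
N, n, reps = 5_000, 100, 5_000
y = rng.gamma(2.0, 1.0, size=N)                       # hypothetical skewed study variable

est, v = np.empty(reps), np.empty(reps)
for r in range(reps):
    s = rng.choice(N, size=n, replace=False)          # simple random sample
    est[r] = y[s].mean()                              # point estimate of the mean
    v[r] = (1 - n / N) * y[s].var(ddof=1) / n         # standard SRS variance estimator

rb = relative_bias_percent(v, est)
print(f"Relative bias of the variance estimator: {rb:.1f}%")
```

A strongly negative value would indicate underestimation of the kind reported for the model-assisted variance estimators; for the design-unbiased estimator used in this sketch the value should be close to zero up to Monte Carlo error.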
A major drawback of the regression tree and lasso-based approaches is that the estimation procedures do not yield a set of generic weights that can be applied to all study variables. A possible alternative approach is to derive regression weights based on a primary variable of interest and apply that set of weights to related study variables. In the survey context considered here, using a single set of weights for a group of related variables resulted in little loss of efficiency, relative to the use of variable-specific weights. As well, the bias of the estimators remained negligible. Under this approach, the desirable properties of the regression weights (low variability and, in the case of the regression tree estimator, strictly positive weights) are maintained. However, the asymptotic properties of the lasso and regression tree survey estimators have not been derived for a single set of weights applied to multiple study variables.
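The single-set-of-weights idea can be sketched as follows. This is a simplified stand-in, not the regression tree estimator evaluated in the paper: instead of growing a tree, it picks, from two hypothetical categorical variables, the one whose cells best explain the primary variable, builds post-stratified weights from those cells, and reuses the weights for a related variable. The function names, the population and the SRS design are all assumptions made for illustration.

```python
import numpy as np

def choose_cells(y, candidates):
    """Name of the candidate variable whose cells give the smallest within-cell sum of squares of y."""
    def within_ss(c):
        return sum(((y[c == g] - y[c == g].mean()) ** 2).sum() for g in np.unique(c))
    return min(candidates, key=lambda name: within_ss(candidates[name]))

def poststratified_weights(d, cells, N_cells):
    """w_k = d_k * N_g / N_hat_g, where g is the cell of unit k and N_hat_g its design-weighted count."""
    N_hat = np.bincount(cells, weights=d, minlength=len(N_cells))
    return d * N_cells[cells] / N_hat[cells]

rng = np.random.default_rng(7)
N, n = 5_000, 250
a = rng.integers(0, 3, size=N)                    # hypothetical auxiliary variable, 3 categories
b = rng.integers(0, 4, size=N)                    # hypothetical auxiliary variable, 4 categories
y1 = 2.0 * b + rng.normal(0.0, 1.0, size=N)       # primary study variable
y2 = 0.5 * y1 + rng.normal(0.0, 1.0, size=N)      # related study variable

s = rng.choice(N, size=n, replace=False)          # simple random sample
d = np.full(n, N / n)                             # design weights under SRS

# Cells are selected using the primary variable only.
chosen = choose_cells(y1[s], {"a": a[s], "b": b[s]})
pop_cells = {"a": a, "b": b}[chosen]
N_cells = np.bincount(pop_cells)                  # known population cell counts
w = poststratified_weights(d, pop_cells[s], N_cells)

# The same weights are applied to both study variables.
est_y1 = (w * y1[s]).sum() / N
est_y2 = (w * y2[s]).sum() / N
print(chosen, w.min() > 0, est_y1 - y1.mean(), est_y2 - y2.mean())
```

Because the weights are ratios of positive quantities, they are strictly positive by construction, and they reproduce the known cell counts for whichever variable defined the cells.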
We also considered the use of model-assisted survey regression estimators for data from mis-specified probability sampling, treated as a non-probability sample. When the probability of selection depends on an observed auxiliary variable, the bias of the model-assisted estimators decreases as the sample size increases. Including the appropriate auxiliary variable in the working model for the GREG estimator effectively removes the selection bias. Achieving this in practice is difficult as the selection process is unknown. Performing variable selection can increase the bias for model-assisted survey regression estimators as the auxiliary variables related to the selection probability may not be included in the regression model. In fact, in our simulations, correctly including revenue as a potential auxiliary variable did not necessarily decrease the bias of the lasso estimators.
When the probability of selection depends on the survey variable of interest, all the estimators are heavily biased. The magnitude of the bias is similar across estimators and does not greatly decrease as the sample size increases. In our simulation population, the auxiliary variables are not highly predictive for the survey variables of interest. Examining the impact of the strength of the relationship between the auxiliary variables and the variable of interest when informative selection is present warrants more investigation.
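The contrast between the two selection mechanisms discussed above can be reproduced in a toy setting. The sketch below is not the simulation of this paper: it assumes a hypothetical population with one categorical auxiliary variable x, Poisson-type self-selection, and a post-stratified estimator on x standing in for a model-assisted estimator that includes the relevant auxiliary variable. All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
N, G = 50_000, 4
x = rng.integers(0, G, size=N)                          # observed categorical auxiliary variable
y = 10.0 + 2.0 * x + rng.normal(0.0, 3.0, size=N)       # hypothetical study variable
N_g = np.bincount(x, minlength=G)                       # known population counts

def poststratified_mean(ys, xs):
    """Post-stratified estimator of the population mean using cells defined by x."""
    cell_means = np.bincount(xs, weights=ys, minlength=G) / np.bincount(xs, minlength=G)
    return (N_g * cell_means).sum() / N

def biases(p, reps=200):
    """Monte Carlo bias of the naive sample mean and of the post-stratified mean."""
    naive, ps = [], []
    for _ in range(reps):
        s = rng.random(N) < p                           # each unit participates independently
        naive.append(y[s].mean())
        ps.append(poststratified_mean(y[s], x[s]))
    return np.mean(naive) - y.mean(), np.mean(ps) - y.mean()

# Scenario 1: participation depends only on the observed auxiliary variable.
p_aux = 0.01 * (1 + x)
# Scenario 2: participation depends on the study variable itself (informative selection).
p_inf = 0.04 / (1.0 + np.exp(-(y - y.mean()) / 3.0))

naive_aux, ps_aux = biases(p_aux)
naive_inf, ps_inf = biases(p_inf)
print(f"selection on x: naive bias = {naive_aux:.2f}, post-stratified bias = {ps_aux:.2f}")
print(f"selection on y: naive bias = {naive_inf:.2f}, post-stratified bias = {ps_inf:.2f}")
```

In this sketch, adjusting on x removes essentially all of the bias in the first scenario but leaves substantial bias in the second, because selection still depends on y within each cell of x.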
Sample selection bias may not be reduced by using a non-probability sample alone, as demonstrated in our simulation study. Methods based on integrating a non-probability sample observing the study variables and associated auxiliary variables with a probability sample observing only the same auxiliary variables have the potential of reducing selection bias through modeling the participation probabilities (Chen, Li and Wu, 2020). Dual frame screening methods, which do not require modeling the participation probabilities, are also available when the study variable is observed in both samples and the units in the probability sample belonging to the non-probability sample can be identified without linkage errors (Kim and Tam, 2020; Rao, 2021; Beaumont, 2020). However, the dual frame method is effective only when the sampling fraction for the non-probability sample is large. We are studying the above methods in the context of business surveys, for example integrating survey data with incomplete administrative data treated as a non-probability sample.
Acknowledgements
We thank Dr. Wesley Yung for initiating this work and for his constructive comments and suggestions. We also thank the reviewers, the editor and the associate editor for constructive comments and suggestions.
References
Beaumont, J.-F. (2020). Are probability surveys bound to disappear for the production of official statistics? Survey Methodology, 46, 1, 1-28. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2020001/article/00001-eng.pdf.
Breidt, F.J., and Opsomer, J.D. (2017). Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32(2), 190-205.
Buskirk, T.D., Kirchner, A., Eck, A. and Signorino, C.S. (2018). An introduction to machine learning methods for survey researchers. Survey Practice, 11(1), 1-10.
Cassel, C.M., Särndal, C.-E. and Wretman, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite population. Biometrika, 63(3), 615-620.
Chen, Y., Li, P. and Wu, C. (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association, 115(523), 2011-2021.
Chen, J.K.T., Valliant, R.L. and Elliott, M.R. (2018). Model-assisted calibration of non-probability sample survey data using adaptive LASSO. Survey Methodology, 44, 1, 117-144. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2018001/article/54963-eng.pdf.
Chen, J.K.T., Valliant, R.L. and Elliott, M.R. (2019). Calibrating non-probability surveys to estimated control totals using LASSO, with an application to political polling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(3), 657-681.
Dagdoug, M., Goga, C. and Haziza, D. (2021). Model-assisted estimation through random forests in finite population sampling. Journal of the American Statistical Association (to appear).
Friedman, J., Hastie, T., Simon, N., Qian, J. and Tibshirani, R. (2017). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 2.0-13.
Kern, C., Klausch, T. and Kreuter, F. (2019). Tree-based machine learning methods for survey research. Survey Research Methods, 13(1), 73-93.
Kern, C., Li, Y. and Wang, L. (2020). Boosted kernel weighting - using statistical learning to improve inference from nonprobability samples. Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smaa028.
Kim, J.K., and Tam, S.M. (2020). Data integration by combining big data and survey sample data for finite population inference. International Statistical Review, 89(2), 382-401.
McConville, K.S. (2011). Improved Estimation for Complex Surveys Using Modern Regression Techniques, unpublished Ph.D. thesis, Department of Statistics, Colorado State University.
McConville, K.S., and Toth, D. (2019). Automated selection of post-strata using a model-assisted regression tree estimator. Scandinavian Journal of Statistics, 46(2), 389-413.
McConville, K.S., Breidt, F.J., Lee, T.C.M. and Moisen, G.G. (2017). Model-assisted survey regression estimation with the LASSO. Journal of Survey Statistics and Methodology, 5(2), 131-158.
McConville, K.S., Tang, B., Zhu, G., Li, S., Cheung, S. and Toth, D. (2018). mase: Model-Assisted Survey Estimators. R package version 0.1.1.
Rafei, A., Flannagan, C.A. and Elliott, M.R. (2020). Big data for finite population inference: Applying quasi-random approaches to naturalistic driving data using Bayesian additive regression trees. Journal of Survey Statistics and Methodology, 8(1), 148-180.
Rao, J.N.K. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhyā B, 83(1), 242-272 (published online April 2020).
Ripley, B., Venables, B., Bates, D.M., Hornik, K., Gebhardt, A. and Firth, D. (2017). MASS: Modern Applied Statistics with S. R package version 7.3-47.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag Publishing.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.
Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429.