Statistical inference with non-probability survey samples
Section 8. Concluding remarks
In the early years of the 21st century, web-based surveys started to become popular, which generated a substantial amount of research interest in the topic (Tourangeau, Conrad and Couper, 2013). Issues and challenges faced by web-based and other non-probability survey samples led to the “Summary Report of the AAPOR Task Force on Non-probability Sampling” by Baker, Brick, Bates, Battaglia, Couper, Dever, Gile and Tourangeau (2013). Among other things, the report indicated that (i) unlike probability sampling, there is no single framework that adequately encompasses all of non-probability sampling; (ii) making inferences for any probability or non-probability survey requires some reliance on modeling assumptions; and (iii) if non-probability samples are to gain wider acceptance among survey researchers, there must be a more coherent framework and an accompanying set of measures for evaluating their quality.
Survey sampling researchers have been answering the call with intensified exploration of statistical inference with non-probability survey samples. The current two-sample setting, in which the non-probability sample has measurements on both the study variable and the auxiliary variables while the probability sample provides information on the auxiliary variables, was first considered by Rivers (2007) in work on sample matching using nearest neighbor imputation, which is the original idea leading to the mass imputation method (Kim et al., 2021). The weighted logistic regression using the pooled sample for estimating the propensity scores, proposed by Valliant and Dever (2011), was the first serious attempt on the topic and served as a motivation for the pseudo maximum likelihood method developed by Chen et al. (2020). Brick (2015) considered compositional model inference under the same setting. Elliott and Valliant (2017) provided informed discussions on inference for non-probability samples. Yang, Kim and Song (2020) addressed issues with high dimensional data in combining probability and non-probability survey samples.
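The nearest neighbor matching idea mentioned above can be illustrated with a small sketch. It is only an illustration under simplifying assumptions, not the estimator of Rivers (2007) or Kim et al. (2021): the auxiliary variable is a single number, distance is absolute difference, and the imputed values are combined in a weighted (Hájek-type) mean. The function name `nn_mass_imputation_mean` and the toy data are hypothetical.

```python
def nn_mass_imputation_mean(nonprob_xy, prob_xw):
    """Estimate a population mean by nearest neighbor mass imputation.

    nonprob_xy: list of (x, y) pairs from the non-probability sample.
    prob_xw:    list of (x, design_weight) pairs from the probability sample,
                which carries no measurement of y.
    Assumption for this sketch: x is a scalar and distance is |x_a - x_b|.
    """
    if not nonprob_xy or not prob_xw:
        raise ValueError("both samples must be non-empty")

    weighted_total = 0.0
    weight_sum = 0.0
    for x_b, weight in prob_xw:
        # Donor: the non-probability unit whose x is closest to x_b.
        _, y_imputed = min(nonprob_xy, key=lambda pair: abs(pair[0] - x_b))
        weighted_total += weight * y_imputed
        weight_sum += weight

    # Design-weighted mean of the imputed y values.
    return weighted_total / weight_sum


# Toy data, made up for illustration only.
nonprob_sample = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
prob_sample = [(1.1, 2.0), (2.9, 1.0), (2.2, 1.0)]
estimate = nn_mass_imputation_mean(nonprob_sample, prob_sample)
```

In the toy data the three probability-sample units receive the imputed values 10, 30 and 20, so the estimate is (2·10 + 30 + 20)/4 = 17.5. The representativeness comes from the design weights of the probability sample, while the study variable comes entirely from the non-probability sample.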
Statistical inference with non-probability survey samples is part of the more general topic of combining data from multiple sources. The term “data integration” is frequently used in this context. Combining information from independent probability survey samples has been studied extensively in the survey literature; see, for instance, Wu (2004), Kim and Rao (2012) and references therein. Inference with samples from multiple frame surveys is another topic that has been heavily investigated by survey statisticians; see Lohr and Rao (2006), Rao and Wu (2010a) and references therein. In her recent Waksberg award invited paper, Lohr (2021) provided an overview of multiple-frame surveys and some fascinating discussions on using a multiple-frame structure as an organizing principle for other data combination methods. With emerging new data sources and reshaped views on traditional data sources such as administrative records, data integration has become a very broad area that calls for continued research. Further discussions are provided by Lohr and Raghunathan (2017) on combining survey data with other data sources and by Thompson (2019) on combining new and traditional sources in population surveys. Kim and Tam (2021) and Yang, Kim and Hwang (2021) discussed data integration by combining big data and survey sample data for finite population inference. Yang and Kim (2020) provided a review of statistical data integration in survey sampling.
One of the essential messages of the current paper concerns the concepts of validity and efficiency in analyzing non-probability survey samples. Validity refers to the consistency of point estimators, and efficiency is measured by the asymptotic variance of the point estimator. Validity is the primary concern; the pursuit of efficiency is a secondary goal that becomes relevant when several valid approaches are available. Discussions on validity and efficiency require a suitable inferential framework and rigorous development of statistical procedures, which is another main message of this paper. Non-probability samples do not fit into the traditional design-based or model-based inferential framework for probability survey samples. Standard statistical concepts and inferential procedures, however, can be built into a suitable framework for valid and efficient inference with non-probability survey samples.
Non-probability samples may have a very large sample size. Large sample sizes are a double-edged sword: when the inferential procedures are valid, large sample sizes lead to more efficient inference; when the estimators are biased, large sample sizes make the bias even more pronounced. A non-probability survey sample with an 80% sampling fraction over the population does not necessarily provide better estimation results than a small probability sample (Meng, 2018).
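The point about large biased samples can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper or from Meng (2018): the study variable is standard normal, participation in the non-probability sample follows a logistic function of the study variable itself (so selection is informative), and the constants 1.7, 100,000, 500 and 200 are arbitrary illustrative choices. The function name `simulate` is hypothetical.

```python
import math
import random
import statistics


def simulate(pop_size=100_000, srs_size=500, reps=200, seed=2018):
    """Compare a large self-selected sample with small simple random samples."""
    rng = random.Random(seed)

    # Hypothetical finite population of the study variable y.
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    # Assumed participation mechanism: units with larger y are more likely
    # to join. The intercept 1.7 gives a sampling fraction of roughly 80%.
    def participates(y):
        prob = 1.0 / (1.0 + math.exp(-(1.7 + y)))
        return rng.random() < prob

    nonprob = [y for y in population if participates(y)]
    nonprob_error = statistics.fmean(nonprob) - true_mean

    # Root mean squared error of the sample mean under repeated
    # simple random sampling without replacement.
    squared_errors = []
    for _ in range(reps):
        srs = rng.sample(population, srs_size)
        squared_errors.append((statistics.fmean(srs) - true_mean) ** 2)
    srs_rmse = math.sqrt(statistics.fmean(squared_errors))

    return {
        "sampling_fraction": len(nonprob) / pop_size,
        "nonprob_error": nonprob_error,
        "srs_rmse": srs_rmse,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

Under this assumed mechanism, the error of the naive mean from the large sample is driven by selection bias that does not shrink with sample size, whereas the error of the simple random sample is pure sampling variability of order one over the square root of 500.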
The large sample sizes also connect non-probability samples to modern big data problems. The role of traditional statistical methods in the era of big data was convincingly argued by Richard Lockhart (2018): “Huge new computing resources do not put an end to the need for careful modelling, for honest assessment of uncertainty, or for good experiment design. Classical statistical ideas continue to have a crucial role to play in keeping data analysis honest, efficient, and effective.”
Jean-François Beaumont (2020) raised the question “Are probability surveys bound to disappear for the production of official statistics?” The short answer is that probability sampling methods and probability survey samples will remain an important data collection tool for many fields, including official statistics, and design-based inference will play a crucial role in any evolving inferential framework. The current trend of using non-probability samples and data from other sources will continue. Valid and efficient statistical inference with non-probability samples requires auxiliary information from the target population. A few high quality national probability surveys with carefully designed survey variables can play a pivotal role in the analysis of non-probability survey samples.
Acknowledgements
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Canadian Statistical Sciences Institute. An early version of the paper was presented at the SSC 2021 Annual Meeting as the Special Presidential Invited Address by the Survey Methods Section of the SSC. The author thanks the Editor of Survey Methodology, Jean-François Beaumont, for the invitation and for organizing the discussions on the emerging topic of statistical inference with non-probability survey samples. Thanks are also due to the two anonymous reviewers who provided constructive comments on the initial submission, which led to improvements of the paper.
References
Baker, R., Brick, J.M., Bates, N.A., Battaglia, M., Couper, M.P., Dever, J.A., Gile, K.J. and Tourangeau, R. (2013). Report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology, 1, 90-143.
Beaumont, J.-F. (2020). Are probability surveys bound to disappear for the production of official statistics? Survey Methodology, 46, 1, 1-28. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2020001/article/00002-eng.pdf.
Binder, D.A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees, second edition. Wadsworth & Brooks/Cole Advanced Books & Software.
Brick, J.M. (2015). Compositional model inference. In Proceedings of the Survey Research Methods Section, Joint Statistical Meetings, American Statistical Association, Alexandria, VA, 299-307.
Chen, J., and Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113-131.
Chen, J., and Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of the American Statistical Association, 96, 260-269.
Chen, J., and Sitter, R.R. (1999). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80, 107-116.
Chen, Y. (2020). Statistical Analysis with Non-probability Survey Samples, PhD Dissertation, Department of Statistics and Actuarial Science, University of Waterloo.
Chen, Y., Li, P. and Wu, C. (2020). Doubly robust inference with non-probability survey samples. Journal of the American Statistical Association, 115, 2011-2021.
Chen, Y., Li, P., Rao, J.N.K. and Wu, C. (2022). Pseudo empirical likelihood inference for non-probability survey samples. The Canadian Journal of Statistics, accepted.
Chu, K.C.K., and Beaumont, J.-F. (2019). The use of classification trees to reduce selection bias for a non-probability sample with help from a probability sample. Proceedings of the Survey Methods Section of SSC.
Elliott, M., and Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32, 249-264.
Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics, 31, 1208-1212.
Godambe, V.P., and Thompson, M.E. (1986). Parameters of superpopulation and survey population: Their relationships and estimation. International Statistical Review, 54, 127-138.
Kim, J.K., and Haziza, D. (2014). Doubly robust inference with missing data in survey sampling. Statistica Sinica, 24, 375-394.
Kim, J.K., and Rao, J.N.K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99, 85-100.
Kim, J.K., and Tam, S. (2021). Data integration by combining big data and survey sample data for finite population inference. International Statistical Review, 89, 382-401.
Kim, J.K., Park, S., Chen, Y. and Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society, Series A, 184, 941-963.
Liu, Z., and Valliant, R. (2021). Investigating an alternative for estimation from a nonprobability sample: Matching plus calibration. arXiv:2112.00855v1 [stat.ME]. Dec. 2021.
Lockhart, R. (2018). Special issue on big data and the statistical sciences: Guest editor’s introduction. The Canadian Journal of Statistics, 46, 4-9.
Lohr, S.L. (2021). Multiple-frame surveys for a multiple-data-source world. Survey Methodology, 47, 2, 229-263. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2021002/article/00008-eng.pdf.
Lohr, S.L., and Raghunathan, T.E. (2017). Combining survey data with other data sources. Statistical Science, 32, 293-312.
Lohr, S.L., and Rao, J.N.K. (2006). Estimation in multiple frame surveys. Journal of the American Statistical Association, 101, 1019-1030.
McCullagh, P., and Nelder, J.A. (1989). Generalized Linear Models, second edition, New York: Chapman and Hall.
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Annals of Applied Statistics, 12, 685-726.
Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and its Applications, 9, 141-142.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-606.
Rao, J.N.K. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhyā B, 83, 242-272.
Rao, J.N.K., and Molina, I. (2015). Small Area Estimation, second edition. Hoboken, NJ: Wiley.
Rao, J.N.K., and Wu, C. (2010a). Pseudo empirical likelihood inference for multiple frame surveys. Journal of the American Statistical Association, 105, 1494-1503.
Rao, J.N.K., and Wu, C. (2010b). Bayesian pseudo empirical likelihood intervals for complex surveys. Journal of the Royal Statistical Society, Series B, 72, 533-544.
Rivers, D. (2007). Sampling for web surveys. In Proceedings of the Survey Research Methods Section, Joint Statistical Meetings, American Statistical Association, Alexandria, VA, 1-26.
Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
Rosenbaum, P.R., and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Tourangeau, R., Conrad, F.G. and Couper, M.P. (2013). The Science of Web Surveys, first edition. Oxford: Oxford University Press.
Thompson, M.E. (1997). Theory of Sample Surveys. London: Chapman & Hall.
Thompson, M.E. (2019). Combining data from new and traditional sources in population surveys. International Statistical Review, 87, S79-89.
Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data. New York: Springer.
Valliant, R., and Dever, J.A. (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research, 40, 105-137.
Wang, L., Graubard, B.I., Katki, H.A. and Li, Y. (2020). Improving external validity of epidemiologic cohort analysis: A kernel weighting approach. Journal of the Royal Statistical Society, Series A, 183, 1293-1311.
Wang, L., Valliant, R. and Li, Y. (2021). Adjusted logistic propensity weighting methods for population inference using nonprobability volunteer-based epidemiologic cohorts. Statistics in Medicine, 40, 5237-5250.
Watson, G.S. (1964). Smooth regression analysis. Sankhyā A, 26, 359-372.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.
Wu, C. (2004). Combining information from multiple surveys through the empirical likelihood method. The Canadian Journal of Statistics, 32, 15-26.
Wu, C., and Rao, J.N.K. (2006). Pseudo-empirical likelihood ratio confidence intervals for complex surveys. The Canadian Journal of Statistics, 34, 359-375.
Wu, C., and Sitter, R.R. (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96, 185-193.
Wu, C., and Thompson, M.E. (2020). Sampling Theory and Practice. Springer, Cham.
Yang, S., and Kim, J.K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3, 625-650.
Yang, S., Kim, J.K. and Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47, 1, 29-58. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.pdf.
Yang, S., Kim, J.K. and Song, R. (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society, Series B, 82, 445-465.
Yuan, M., Li, P. and Wu, C. (2022). Nonparametric estimation of propensity scores for non-probability survey samples. Working paper.
Zhao, P., and Wu, C. (2019). Some theoretical and practical aspects of empirical likelihood methods for complex surveys. International Statistical Review, 87, S239-256.
Zhao, P., Rao, J.N.K. and Wu, C. (2020a). Empirical likelihood methods for public-use survey data. Electronic Journal of Statistics, 14, 2484-2509.
Zhao, P., Ghosh, M., Rao, J.N.K. and Wu, C. (2020b). Bayesian empirical likelihood inference with complex survey data. Journal of the Royal Statistical Society, Series B, 82, 155-174.