Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 7. Probability sampling as aspiration, not prescription

As should be clear from its definition, ddc is not directly estimable from the biased sample alone. One therefore naturally would (and should) question how useful ddc is or could be. The answer turns out to be an increasingly long one, thanks to ddc being model-free and hence a versatile data quality metric for both probability samples and non-probability samples. Its usefulness for generating theoretical insights is demonstrated by its role in quantifying the data quality-quantity trade-off via effective sample size as seen in (6.1), in understanding simulation errors in quasi-Monte Carlo as explored in Hickernell (2016), and in anticipating the “double-plus robustness” phenomenon as presented in Section 5. Its methodological usages are illustrated by the scenario analyses for the 2020 US Presidential election (Isakov and Kuriwaki, 2020) and for the COVID-19 vaccination assessments (Bradley et al., 2021). Its practical implications can be found in epidemiological studies (Dempsey, 2020), particle physics (Courtoy, Houston, Nadolsky, Xie, Yan and Yuan, 2022), and political polling (Bailey, 2023).

Not surprisingly, these practical applications found the notion of ddc and the underlying error decomposition (2.2) helpful because of the non-probability samples they need to deal with, either due to distortions to the probability samples, such as by a biased non-response mechanism, or due to selection biases in the first place, such as selective COVID-19 testing. Professor Wu’s overview, and the many references cited there and in this discussion, should make it clear that non-probability samples are almost surely everywhere. I am invoking this strong probabilistic phrase not merely for its humorous value. When we consider the uncountably many possible values for the mean of ddc, the probability that it will land precisely on zero must be zero, however we construct that probability to capture the wild west of data collection processes out there. This zero mean is a necessary condition for the sample to be a probability sample, because a probability sample implies that ddc must be of order $N^{-1/2}$ (Meng, 2018), which is impossible when its mean is non-zero (asymptotically). This observation suggests that we should move away from our tradition of treating probability sampling as a centerpiece and then trying to model the much larger world of non-probability samples as “deviations” from it. Instead, we should start with studying samples with general collection mechanisms using tools or concepts such as ddc, and then treat (design) probability samples as the very special, ideal case: always an aspiration, but never the only prescription for action.
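The order-of-magnitude argument above can be checked numerically. The following is a minimal sketch, not part of the original discussion. It assumes the error decomposition takes the form given in Meng (2018), namely (sample mean minus population mean) = ddc × sqrt((1 − f)/f) × σ_Y, where f = n/N is the sampling fraction and ddc is the finite-population correlation between the recording indicator R and the outcome Y. The function names, the normal outcome, and the logistic selection mechanism are all illustrative assumptions.

```python
import numpy as np

def ddc(y, r):
    """Finite-population correlation between recording indicator r and outcome y."""
    return float(np.corrcoef(r, y)[0, 1])

def error_identity(y, r):
    """Return (actual error of the sample mean, ddc * sqrt((1-f)/f) * sigma_Y)."""
    f = r.mean()                      # realized sampling fraction n/N
    sigma_y = y.std()                 # population SD (divisor N)
    actual = y[r == 1].mean() - y.mean()
    predicted = ddc(y, r) * np.sqrt((1 - f) / f) * sigma_y
    return actual, predicted

def srs_indicator(N, n, rng):
    """Recording indicator for a simple random sample of size n without replacement."""
    r = np.zeros(N, dtype=int)
    r[rng.choice(N, size=n, replace=False)] = 1
    return r

def biased_indicator(y, rng, intercept=-2.0, slope=0.5):
    """Hypothetical selection mechanism: inclusion probability increases with y."""
    p = 1.0 / (1.0 + np.exp(-(intercept + slope * y)))
    return (rng.random(y.size) < p).astype(int)

def simulate(N, reps=200, f=0.1, seed=0):
    """Compare the scale of ddc under SRS with its mean under biased selection."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=N)            # illustrative finite population
    n = int(round(f * N))
    srs = np.array([ddc(y, srs_indicator(N, n, rng)) for _ in range(reps)])
    biased = np.array([ddc(y, biased_indicator(y, rng)) for _ in range(reps)])
    return {
        # Stays near 1 as N grows if ddc is of order N^{-1/2} under SRS.
        "srs_rms_times_sqrtN": float(np.sqrt(N * np.mean(srs ** 2))),
        # Does not shrink with N when selection depends on y.
        "biased_mean_ddc": float(biased.mean()),
    }

if __name__ == "__main__":
    for N in (1_000, 100_000):
        print(N, simulate(N))
```

Under simple random sampling, the root-mean-square ddc multiplied by sqrt(N) should stay roughly constant as N grows, whereas under the outcome-dependent mechanism the mean ddc stays away from zero regardless of N, which is the sense in which a non-zero mean rules out a probability sample.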

Acknowledgements

I am grateful to Editor Jean-François Beaumont for inviting me to discuss Changbao Wu’s timely and thought-provoking overview. I thank James Bailie, Radu Craiu, Adel Daoud, Andrew Gelman, Stas Kolenikov, Rod Little, Cory McCartan, Kelly McConville, James Robins, Zhiqiang Tan, and Li-Chun Zhang for moral endorsement and for constructive criticisms. I also thank NSF for partial financial support, and Steve Finch for careful proofreading.

References

Bailey, M.A. (2023). Polling at a Crossroads ‒ Rethinking Modern Survey Research. Cambridge University Press.

Beaumont, J.-F., and Rao, J.N.K. (2021). Pitfalls of making inferences from non-probability samples: Can data integration through probability samples provide remedies? Survey Statistician, 83, 11-22.

Blei, D.M., Kucukelbir, A. and McAuliffe, J.D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.

Bradley, V.C., Kuriwaki, S., Isakov, M., Sejdinovic, D., Meng, X.-L. and Flaxman, S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature, 600(7890), 695-700.

Buelens, B., Burger, J. and van den Brakel, J.A. (2018). Comparing inference methods for nonprobability samples. International Statistical Review, 86(2), 322-343.

Courtoy, A., Houston, J., Nadolsky, P., Xie, K., Yan, M. and Yuan, C.-P. (2022). Parton distributions need representative sampling. arXiv preprint arXiv:2205.10444.

Craiu, R.V., Gong, R. and Meng, X.-L. (2022). Six statistical senses. arXiv preprint arXiv:2204.05313.

David Peat, F. (2002). From Certainty to Uncertainty: The Story of Science and Ideas in the Twentieth Century. Joseph Henry Press.

Dempsey, W. (2020). The hypothesis of testing: Paradoxes arising out of reported coronavirus case-counts. arXiv preprint arXiv:2005.10425.

Dwork, C. (2008). Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, Springer, 1-19.

Elliott, M.R., and Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32(2), 249-264.

Gelman, A. (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2), 153-164.

Gong, R. (2022). Transparent privacy is principled privacy. Harvard Data Science Review, (Special Issue 2), June 24, 2022. https://hdsr.mitpress.mit.edu/pub/ld4smnnf.

Gong, R., Groshen, E.L. and Vadhan, S. (2022). Harnessing the known unknowns: Differential privacy and the 2020 Census. Harvard Data Science Review, (Special Issue 2), June 24 2022. https://hdsr.mitpress.mit.edu/pub/fgyf5cne.

Han, P., and Wang, L. (2013). Estimation with missing data: Beyond double robustness. Biometrika, 100(2), 417-430.

Hartley, H.O., and Ross, A. (1954). Unbiased ratio estimators. Nature, 174(4423), 270-271.

Hickernell, F.J. (2016). The trio identity for Quasi-Monte Carlo error. In International Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, Springer, 3-27.

Hinkins, S., Oh, H.L. and Scheuren, F. (1997). Inverse sampling design algorithms. Survey Methodology, 23, 1, 11-21. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1997001/article/3101-eng.pdf.

Isakov, M., and Kuriwaki, S. (2020). Towards principled unskewing: Viewing 2020 election polls through a corrective Lens from 2016. Harvard Data Science Review, 2(4), Nov. 3, 2020. https://hdsr.mitpress.mit.edu/pub/cnxbwum6.

Kang, J.D.Y., and Schafer, J.L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523-539.

Kish, L. (1965). Survey Sampling. New York: John Wiley & Sons, Inc.

Li, X., and Meng, X.-L. (2021). A multi-resolution theory for approximating infinite-p-zero-n: Transitional inference, individualized predictions, and a world without bias-variance tradeoff. Journal of the American Statistical Association, 116(533), 353-367.

Little, R., and An, H. (2004). Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica, 14(3), 949-968.

Liu, Y., Gelman, A. and Chen, Q. (2021). Inference from non-random samples using Bayesian machine learning. arXiv preprint arXiv:2104.05192.

Lo, A.W. (2017). Adaptive markets. In Adaptive Markets. Princeton University Press.

Lohr, S., and Rao, J.N.K. (2006). Estimation in multiple-frame surveys. Journal of the American Statistical Association, 101(475), 1019-1030.

Lohr, S.L. (2021). Sampling: Design and Analysis. Chapman and Hall/CRC.

Luque-Fernandez, M.A., Schomaker, M., Rachet, B. and Schnitzer, M.E. (2018). Targeted maximum likelihood estimation for a binary treatment: A tutorial. Statistics in Medicine, 37(16), 2530-2546.

Meng, X.-L. (2014). A trio of inference problems that could win you a Nobel prize in statistics (if you help fund it). In Past, Present, and Future of Statistical Science, (Eds., Lin et al.), CRC Press.

Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (i) Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685-726.

Meng, X.-L. (2021). Enhancing (publications on) data quality: Deeper data minding and fuller data confession. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4), 1161-1175.

Msaouel, P. (2022). The big data paradox in clinical practice. Cancer Investigation, 1-27.

Pfeffermann, D. (2017). Bayes-based non-bayesian inference on finite populations from non-representative samples: A unified approach. Calcutta Statistical Association Bulletin, 69(1), 35-63.

Rao, J.N.K., Scott, A.J. and Benhin, E. (2003). Undoing complex survey data structures: Some theory and applications of inverse sampling. Survey Methodology, 29, 2, 107-128. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2003002/article/6787-eng.pdf.

Robins, J.M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, Indianapolis, IN, 1999, 6-10.

Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846-866.

Rubin, D.B. (1976). Inference and missing data. Biometrika, 63(3), 581-592.

Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussions). Journal of the American Statistical Association, 94(448), 1096-1146.

Slavkovic, A., and Seeman, J. (2022). Statistical data privacy: A song of privacy and utility. arXiv preprint arXiv:2205.03336.

Tan, Y.V., Flannagan, C.A.C. and Elliott, M.R. (2019). “Robust-Squared” imputation models using BART. Journal of Survey Statistics and Methodology, 7(4), 465-497.

Tan, Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science, 22(4), 560-568.

Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3), 661-682.

Tan, Z. (2013). Simple design-efficient calibration estimators for rejective and high-entropy sampling. Biometrika, 100(2), 399-415.

Van Buuren, S., and Oudshoorn, K. (1999). Flexible Multivariate Imputation by MICE. Leiden: TNO.

van der Laan, M.J., and Gruber, S. (2009). Collaborative double robust targeted penalized maximum likelihood estimation. UC Berkeley Division of Biostatistics Working Paper Series, 246.

van der Laan, M.J., and Gruber, S. (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics, 6(1).

van der Laan, M.J., and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1).

Wang, W., Rothschild, D., Goel, S. and Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980-991.

Wu, C. (2022). Statistical inference with non-probability survey samples (with discussions). Survey Methodology, 48, 2, 283-311. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2022002/article/00002-eng.pdf.

Wu, C., and Thompson, M.E. (2020). Sampling Theory and Practice. Springer.

Yang, S., Kim, J.K. and Song, R. (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2), 445-465.

Zhang, G., and Little, R. (2009). Extensions of the penalized spline of propensity prediction method of imputation. Biometrics, 65(3), 911-918.

Zhang, L.-C. (2019). On valid descriptive inference from non-probability sample. Statistical Theory and Related Fields, 3(2), 103-113.

