- Release date: December 15, 2022
Non-probability samples are deprived of the powerful design probability for randomization-based inference. This deprivation, however, encourages us to take advantage of a natural divine probability that comes with any finite population. A key metric from this perspective is the data defect correlation (ddc), which is the model-free finite-population correlation between the individual’s sample inclusion indicator and the individual’s attribute being sampled. A data generating mechanism is equivalent to a probability sampling, in terms of design effect, if and only if its corresponding ddc is of $N^{-1/2}$ (stochastic) order, where $N$ is the population size (Meng, 2018). Consequently, existing valid linear estimation methods for non-probability samples can be recast as various strategies to miniaturize the ddc down to the $N^{-1/2}$ order. The quasi design-based methods accomplish this task by diminishing the variability among the $N$ inclusion propensities via weighting. The super-population model-based approach achieves the same goal through reducing the variability of the $N$ individual attributes by replacing them with their residuals from a regression model. The doubly robust estimators enjoy their celebrated property because a correlation is zero whenever one of the variables being correlated is constant, regardless of which one. Understanding the commonality of these methods through ddc also helps us see clearly the possibility of “double-plus robustness”: a valid estimation without relying on the full validity of either the regression model or the estimated inclusion propensity, neither of which is guaranteed because both rely on device probability. The insight generated by ddc also suggests counterbalancing sub-sampling, a strategy aimed at creating a miniature of the population out of a non-probability sample, and with favorable quality-quantity trade-off because mean-squared errors are much more sensitive to ddc than to the sample size, especially for large populations.
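To make the ddc concrete, the following is a minimal Python sketch, not code from the paper. It computes the ddc on a simulated finite population and checks the deterministic identity from Meng (2018), in which the actual error of the unweighted sample mean equals ddc × sqrt((1 − f)/f) × σ_Y, with f the sampling fraction and σ_Y the finite-population standard deviation. The population, the logistic self-selection mechanism, and the function names `ddc` and `error_from_identity` are assumptions chosen purely for illustration.

```python
import numpy as np


def ddc(included, y):
    """Finite-population correlation between the inclusion indicator and the attribute."""
    included = np.asarray(included, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(included, y)[0, 1])


def error_from_identity(included, y):
    """Error of the sample mean as ddc * sqrt((1 - f) / f) * sigma_Y."""
    included = np.asarray(included, dtype=float)
    y = np.asarray(y, dtype=float)
    f = included.mean()   # sampling fraction n / N
    sigma_y = y.std()     # finite-population sd (divides by N)
    return ddc(included, y) * np.sqrt((1.0 - f) / f) * sigma_y


rng = np.random.default_rng(0)
N = 100_000
y = rng.normal(size=N)  # hypothetical attribute values for the whole population

# Hypothetical self-selection: the chance of inclusion increases with y.
propensity = 1.0 / (1.0 + np.exp(-(-3.0 + 0.5 * y)))
selected = rng.random(N) < propensity

# Benchmark: inclusion independent of y, at the same expected sampling fraction.
srs_like = rng.random(N) < selected.mean()

actual_error = y[selected].mean() - y.mean()
identity_error = error_from_identity(selected, y)

print("ddc, self-selected sample:", ddc(selected, y))
print("ddc, independent inclusion:", ddc(srs_like, y))
print("N ** -0.5:", N ** -0.5)
print("actual error:", actual_error)
print("error from identity:", identity_error)
```

Under these assumptions the two error figures agree up to floating-point rounding, since the identity is algebraic rather than probabilistic. The ddc under independent inclusion should be comparable in size to N^{-1/2}, whereas the self-selected ddc is typically far larger, which is the gap that weighting, regression adjustment, or sub-sampling would aim to close.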
Key Words: Data defect index; Design probability; Divine probability; Device probability; Design-based inference; Model-assisted survey estimators; Non-response bias.
Table of contents
- Section 1. Distinguish among design, divine, and device probabilities
- Section 2. A finite-population deterministic identity for actual error
- Section 3. A unifying strategy based on data defect correlation
- Section 4. Quasi-randomization or super-population implementations
- Section 5. Quasi-randomization and super-population implementations
- Section 6. Counterbalancing sub-sampling
- Section 7. Probability sampling as aspiration, not prescription
How to cite
Meng, X.-L. (2022). Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 2. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2022002/article/00006-eng.htm.