Comments on “Statistical inference with non-probability survey samples”
Section 2. Additional approaches to combining data from probability and non-probability surveys
Dr. Wu’s paper follows the general prescription of 1) using model estimation and
subsequent calibration to probability-sample-estimated covariate distributions,
2) developing propensity score estimates based on discrepancies between the
probability and non-probability sample data, and 3) using doubly-robust methods
that combine 1) and 2) in such a manner that only one of the two underlying
models needs to be correct.
2.1 Propensity score estimators
Rivers (2007) appears to have been the first to suggest estimating propensity
scores using logistic regression with membership in the non-probability sample
as the outcome, and taking the reciprocal of the resulting propensity scores to
use as inclusion weights. This approach was formalized further in Valliant and
Dever (2011). Separately, using simple results from Bayes’ theorem and
discriminant analysis first described in Elliott and Davis (2005), Elliott,
Resler, Flannagan and Rupp (2010) and Elliott (2013) developed a somewhat
different estimator of the form
$$\pi_i^A \propto P(S_i^B = 1 \mid x_i) \, \frac{P(Z_i = 1 \mid x_i)}{P(Z_i = 0 \mid x_i)}, \qquad (2.1)$$
where $\pi_i^A$ is the probability that element $i$ with covariates $x_i$ is
included in the non-probability sample, $S_i^B$ indicates selection into the
probability sample, and $Z_i$ indicates, in the combined data, membership in the
non-probability sample ($Z_i = 1$) versus the probability sample ($Z_i = 0$).
Here $P(Z_i = 1 \mid x_i)$ can be obtained using logistic regression, or using
one of the suite of machine learning-type approaches such as support vector
machines (Soentpiet, 1999), targeted maximum likelihood estimation (Van Der Laan
and Rubin, 2006), or Bayesian Additive Regression Trees (BART) (Chipman, George
and McCulloch, 2010), and $P(Z_i = 0 \mid x_i)$ obtained as
$1 - P(Z_i = 1 \mid x_i)$. In principle $P(S_i^B = 1 \mid x_i)$ is known, since
sampling probabilities are known for all elements of the population, including
those in the non-probability sample, but in practice analysts with access only
to public use data may have to estimate this as well. (In addition, the
probability sample weights may include calibration and non-response adjustments
that are not known for the non-probability sample elements.) This last point is
critical, as use of the probability sample to develop propensity scores using
only the discrepancies between the non-probability sample and the probability
sample will be biased unless the probability sample used an equal probability
(epsem) design, as noted by Wu.
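To make the structure of (2.1) concrete, the following is a minimal sketch, not the authors' implementation, that assumes a single categorical covariate so that the odds of belonging to the non-probability sample reduce to a ratio of cell counts in the stacked data. The function name, the cell labels, the toy data and the assumption that the probability-sample inclusion probability is constant within a cell are all illustrative.

```python
from collections import Counter


def pseudo_inclusion_scores(x_nonprob, x_prob, prob_incl_by_cell):
    """Per covariate cell, return a score proportional to the
    non-probability-sample propensity:
        P(S_B = 1 | x) * P(Z = 1 | x) / P(Z = 0 | x).
    With a categorical x, the odds P(Z=1|x)/P(Z=0|x) in the stacked data
    are simply the ratio of cell counts n_A(x) / n_B(x).
    """
    n_a = Counter(x_nonprob)  # cell counts in the non-probability sample
    n_b = Counter(x_prob)     # cell counts in the probability sample
    scores = {}
    for cell, incl_prob in prob_incl_by_cell.items():
        if n_b[cell] == 0:
            raise ValueError(f"no probability-sample units in cell {cell!r}")
        odds = n_a[cell] / n_b[cell]
        scores[cell] = incl_prob * odds
    return scores


# Toy data (made up): the non-probability sample over-represents "young".
x_nonprob = ["young"] * 8 + ["old"] * 2
y_nonprob = [1.0] * 8 + [3.0] * 2
x_prob = ["young"] * 5 + ["old"] * 10
prob_incl_by_cell = {"young": 0.01, "old": 0.02}  # assumed known design

scores = pseudo_inclusion_scores(x_nonprob, x_prob, prob_incl_by_cell)
weights = [1.0 / scores[cell] for cell in x_nonprob]

naive_mean = sum(y_nonprob) / len(y_nonprob)
weighted_mean = sum(w * y for w, y in zip(weights, y_nonprob)) / sum(weights)
print(naive_mean, weighted_mean)
```

In this toy setting the probability sample implies equally sized "young" and "old" groups (5/0.01 and 10/0.02 elements), and the weighted mean recovers the corresponding balanced average. Dropping the `incl_prob` factor would be appropriate only under an equal probability design.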
In contrast, Chen, Li and Wu (2020) show that using a pseudo-likelihood approach
to estimating the propensity scores $\pi_i^A$ directly from the population
likelihood for the non-probability sample inclusion indicators as a function of
the covariates $x_i$ yields an estimator that does not require the probability
sample inclusion probabilities for elements in the non-probability sample,
under the restriction that $\pi_i^A$ follows a generalized linear model with a
canonical link, i.e., logistic regression.
(None of these approaches actually has the correct intercept to obtain a true
propensity score; however, as noted in Wu, weighted estimation usually uses
Hájek-type estimators [using weights to estimate a population total for
denominators; Hájek, 1971], so that propensity scores estimated up to a
normalizing constant are sufficient.)
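The point about normalizing constants can be checked numerically. The sketch below uses made-up outcomes and propensity scores (the names and values are illustrative only) and shows that a Hájek-type mean is unchanged when every propensity score is multiplied by the same unknown constant, whereas an unnormalized weighted total is not.

```python
def hajek_mean(y, propensities):
    """Weighted mean whose denominator is the weighted estimate of N."""
    weights = [1.0 / p for p in propensities]
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)


def weighted_total(y, propensities):
    """Unnormalized inverse-propensity-weighted total."""
    return sum(v / p for v, p in zip(y, propensities))


y = [2.0, 4.0, 9.0]
true_scores = [0.10, 0.20, 0.40]
# Scores known only up to a constant (here 0.5, chosen arbitrarily).
scaled_scores = [0.5 * p for p in true_scores]

mean_true = hajek_mean(y, true_scores)
mean_scaled = hajek_mean(y, scaled_scores)
total_true = weighted_total(y, true_scores)
total_scaled = weighted_total(y, scaled_scores)
print(mean_true, mean_scaled, total_true, total_scaled)
```

The two means agree, while the two totals differ by the factor 1/0.5, which is why an incorrect intercept matters for totals but not for Hájek-type means.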
2.2 Doubly-robust estimators
If inference is focused on a particular variable $y$ available only in the
non-probability sample, we can return to the model-assisted estimators that date
back to Cassel, Särndal and Wretman (1976), which posit a model for the
expectation $m(x_i) = E(y_i \mid x_i)$. Combining this with propensity score
estimates $\hat{\pi}_i^A$ of the probability of being in the non-probability
sample (which we are treating as an “unknown probability sample”; more about
this under Assumptions below) yields estimators of the form
$$\hat{\mu}_{DR} = \frac{1}{\hat{N}^A} \sum_{i \in S_A} \frac{y_i - \hat{m}(x_i)}{\hat{\pi}_i^A} + \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, \hat{m}(x_i), \qquad (2.2)$$
where $S_A$ and $S_B$ denote the non-probability and probability samples,
$d_i^B$ are the probability sample weights, $\hat{N}^A = \sum_{i \in S_A} 1/\hat{\pi}_i^A$
and $\hat{N}^B = \sum_{i \in S_B} d_i^B$, corresponding to the doubly-robust
estimator of (4.11) in Wu. The intuition is that any bias due to model
misspecification in the estimation of $m$ in the second term will be equal to
and opposite in sign of that in the first term if the model for $\pi_i^A$ is
correctly specified. Conversely, if the model for $\pi_i^A$ is misspecified but
$m$ is correctly specified, the residuals $y_i - \hat{m}(x_i)$ will be iid with
mean zero and consequently the first term will also have mean 0, yielding an
unbiased estimator. Chen, Valliant and Elliott (2019) used LASSO for prediction
in combination with generalized regression (GREG) estimators (McConville,
Breidt, Lee and Moisen, 2017) when $x_i$ is of high dimension. As Wu notes, Wu
and Sitter (2001) show the equivalence between GREG applied to predicted values
and DR estimators of the form in (2.2), which indicates that the Chen et al.
(2019) approach was equivalent to (2.2) with LASSO estimation for $m$ and an
assumption of simple random sampling for the non-probability sample.
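The doubly-robust form referred to as (2.2) can be sketched as a residual correction over the non-probability sample plus a weighted mean of predictions over the probability sample, each normalized by its own estimated population size. This is a minimal illustration under that reading; the function name, argument names and toy values are assumptions, and the predictions and propensity scores are taken as given rather than estimated.

```python
def doubly_robust_mean(y_a, m_a, pi_a, m_b, d_b):
    """Doubly-robust mean.

    y_a, m_a, pi_a: outcomes, model predictions and estimated propensity
                    scores for the non-probability sample.
    m_b, d_b:       model predictions and design weights for the
                    probability sample.
    """
    n_hat_a = sum(1.0 / p for p in pi_a)  # estimated N from sample A
    n_hat_b = sum(d_b)                    # estimated N from sample B
    residual_term = sum(
        (y - m) / p for y, m, p in zip(y_a, m_a, pi_a)
    ) / n_hat_a
    prediction_term = sum(d * m for d, m in zip(d_b, m_b)) / n_hat_b
    return residual_term + prediction_term


y_a = [1.0, 1.0, 3.0]
pi_a = [0.016, 0.016, 0.004]
d_b = [100.0, 100.0]

# Case 1: perfect outcome model, so the residual term vanishes.
dr_perfect_model = doubly_robust_mean(y_a, y_a, pi_a, [1.0, 3.0], d_b)

# Case 2: useless outcome model (all predictions zero), so the estimator
# falls back on the inverse-propensity-weighted Hajek mean.
dr_no_model = doubly_robust_mean(y_a, [0.0] * 3, pi_a, [0.0, 0.0], d_b)
print(dr_perfect_model, dr_no_model)
```

The two cases show the two "routes" to a sensible estimate: with a correct outcome model the propensity scores are irrelevant, and with no outcome model at all the estimator is exactly the propensity-weighted mean.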
A disadvantage of using (2.1) as opposed to Chen et al. (2020) as the estimator
of the propensity scores $\pi_i^A$, and thus of the doubly-robust mean in (2.2),
is the requirement that the probability sample weights be known or at least
estimated for the non-probability sample. An advantage of using (2.1) is that
non-linear models and machine learning methods can be used in estimation. Rafei,
Flannagan and Elliott (2020) use BART to estimate both $\pi_i^A$ and $m(x_i)$,
reducing the impact of potential model misspecification. Simulations showed
considerable reductions in bias and variance relative to the method of Chen et
al. (2020) when the linear models are misspecified. Variance estimation can
proceed by adapting Rubin’s multiple imputation rules: from $M$ independent
draws from BART, the mean of the variances computed treating each draw of
$\pi_i^A$ and $m(x_i)$ as known using standard complex sample design estimators,
added to $(1 + 1/M)$ times the variance of the point estimates computed across
the draws, yields an approximately unbiased variance estimator.
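The combining rule described above follows the standard form of Rubin's multiple imputation rules: total variance equals the average within-draw variance plus (1 + 1/M) times the between-draw variance of the point estimates. A minimal sketch follows; the function name and the numbers are illustrative, and the per-draw point estimates and design-based variances are assumed to have been computed elsewhere.

```python
from statistics import mean, variance


def combine_draws(point_estimates, within_variances):
    """Combine M per-draw estimates using Rubin's rules.

    point_estimates:  one point estimate per draw.
    within_variances: the design-based variance of each point estimate,
                      computed treating that draw as known.
    Returns (combined point estimate, total variance).
    """
    m = len(point_estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two draws with matching variances")
    within = mean(within_variances)
    between = variance(point_estimates)  # sample variance, divisor M - 1
    total = within + (1.0 + 1.0 / m) * between
    return mean(point_estimates), total


estimates = [10.0, 12.0, 11.0, 13.0]   # made-up per-draw estimates
variances = [0.5, 0.6, 0.4, 0.5]       # made-up per-draw variances
combined_estimate, total_variance = combine_draws(estimates, variances)
print(combined_estimate, total_variance)
```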
An alternative approach to doubly-robust estimation uses the fact that the
propensity score is the coarsest possible “balancing score” that contains all
of the information about the association between the sampling indicator and the
outcome of interest. This has led to the development of mean estimators that use
smooth functions of weights to produce consistent estimators that can be more
efficient when weights are highly variable or only weakly related to the outcome
(Elliott and Little, 2000; Zheng and Little, 2005). Zhou, Elliott and Little
(2019) extended this idea to causal inference in non-randomized settings, in
which the probability of assignment to a treatment or exposure (the propensity
score) is estimated as a function of covariates using logistic regression, and
then non-observed potential outcomes $Y^{(z)}$ under treatment arm $z$ for units
observed under the other treatment are imputed from
$$Y^{(z)} \mid X, Z = z \sim N\left( s(\hat{P}_z^* \mid \theta_z) + g_z(\hat{P}_z^*, X; \beta_z), \; \sigma_z^2 \right), \qquad (2.3)$$
where $\hat{P}_z^*$ is the logit transformation of the estimated propensity
score, $s(\cdot)$ denotes a penalized spline with fixed knots (Eilers and Marx,
1996) of propensity, and $g_z(\cdot)$ is a general function of covariates
including the propensity scores. The resulting estimator is doubly robust in the
sense that if either the propensity model or the mean function $g_z$ is
correctly specified, the resulting estimate will be approximately unbiased; see
Zhang and Little (2009). This can be implemented in the non-probability setting
by replacing $\hat{P}_z^*$ in the mean model for (2.3) with the logit of
$\hat{\pi}_i^A$ estimated using (2.1) to obtain a draw of $y_i$ for each
probability sample element. (Note this requires obtaining $\hat{\pi}_i^A$ for
the probability sample elements requiring prediction.) Inference can proceed by
obtaining draws from the posterior distribution of the estimated population
quantity of interest, e.g., for the population mean, a weighted mean of the
drawn outcomes in which each weight is now an estimate of the number of
population elements represented by the sampled element, obtained from a finite
population Bayesian bootstrap (FPBB) (Little and Zheng, 2007); more complete
FPBB extensions to complex sample designs that include clustering and
stratification are available in Dong, Elliott and Raghunathan (2014).
As in the estimation of (2.1), the non-parametric (spline) component of (2.3)
can be replaced with other machine-learning estimators; see Chapter 4 of Rafei
(2021) for an implementation using Gaussian processes. Also, extensions to
non-normal models are direct, although not necessarily computationally easy.
2.3 Poststratified estimators
Wu also describes the use of poststratified estimators in the context of quota
sampling, which is not only a very old form of non-probability sampling but
indeed the standard before Neyman made the case for stratified random sampling
(Neyman, 1934). Wu’s Section 5 suggests a robust alternative to the propensity
score estimates obtained by ordering observations in the probability sample by
their estimated propensity scores $\hat{\pi}_i^A$, stratifying into $K$ strata
based on this ordering, and computing the predicted proportion $\hat{W}_k$ of
the population belonging to stratum $k$ as the proportion of the sample weights
in this stratum using the probability sample, with
$$\hat{\mu}_{PST} = \sum_{k=1}^{K} \hat{W}_k \, \bar{y}_k, \qquad (2.4)$$
where $\bar{y}_k$ is the mean within stratum $k$ in the non-probability sample.
Wu notes the tradeoff between choosing $K$ to be large enough to retain
homogeneity of units within strata but small enough to obtain stable estimates
of $\bar{y}_k$, suggesting 30 as the old “rule of thumb” for “large [enough]
sample sizes”. I would add that a more formal approach discussed in Little
(1986) suggests a method to generate strata (there in the context of
non-response adjustment) that minimizes mean square error by maximizing the
ratio of between-stratum to within-stratum variance. It would seem such an
approach would be appropriate to consider in the non-probability post-stratified
estimator as well.
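The poststratified estimator described above can be sketched as follows, assuming estimated propensity scores are already available for the units of both samples. The equal-count rule used to place the stratum boundaries, the function names and the toy data are illustrative choices, not prescriptions from Wu or from this discussion.

```python
from bisect import bisect_right


def stratum_cuts(scores_b, n_strata):
    """Lower boundaries of strata 2..K from the ordered probability sample."""
    ordered = sorted(scores_b)
    n = len(ordered)
    return [ordered[(k * n) // n_strata] for k in range(1, n_strata)]


def poststratified_mean(scores_a, y_a, scores_b, d_b, n_strata):
    """Sum over strata of (share of probability-sample weights in the
    stratum) times (non-probability-sample mean in the stratum)."""
    cuts = stratum_cuts(scores_b, n_strata)
    weight_sum = [0.0] * n_strata
    for s, d in zip(scores_b, d_b):
        weight_sum[bisect_right(cuts, s)] += d
    y_sum = [0.0] * n_strata
    y_count = [0] * n_strata
    for s, y in zip(scores_a, y_a):
        k = bisect_right(cuts, s)
        y_sum[k] += y
        y_count[k] += 1
    total_weight = sum(weight_sum)
    estimate = 0.0
    for k in range(n_strata):
        if weight_sum[k] == 0.0:
            continue  # stratum carries no estimated population share
        if y_count[k] == 0:
            raise ValueError(f"stratum {k} has no non-probability units")
        estimate += (weight_sum[k] / total_weight) * (y_sum[k] / y_count[k])
    return estimate


# Toy data (made up): high-propensity units dominate sample A.
scores_b = [0.1, 0.1, 0.1, 0.4, 0.4, 0.4]
d_b = [100.0] * 6
scores_a = [0.1, 0.4, 0.4, 0.4, 0.4]
y_a = [1.0, 3.0, 3.0, 3.0, 3.0]

pst_mean = poststratified_mean(scores_a, y_a, scores_b, d_b, n_strata=2)
naive_mean = sum(y_a) / len(y_a)
print(naive_mean, pst_mean)
```

A stratum with few non-probability units yields an unstable stratum mean, which is the tradeoff in the choice of the number of strata noted above; the sketch simply raises an error when a stratum with positive estimated population share is empty.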
A more direct approach to obtain estimates using a post-stratified type
estimator is multilevel regression and poststratification (Wang, Rothschild,
Goel and Gelman, 2015; Downes and Carlin, 2020). Here only data from the
non-probability sample are used in the outcome model:
$$y_i \mid \mu_{k[i]} \sim N(\mu_{k[i]}, \sigma^2), \qquad \mu_k = \beta_0 + \sum_{j=1}^{J} \alpha^{(j)}_{l_j(k)}, \qquad (2.5)$$
where $k$ indexes the poststratum developed from $J$ variables,
$\alpha^{(j)}_l \sim N(0, \tau_j^2)$ for $j = 1, \ldots, J$, and $l_j(k)$ maps
the poststratum cell $k$ to the appropriate category of variable $j$. The
poststratified estimator is still given by (2.4), with $\hat{W}_k$ now replaced
with known population proportions $N_k / N$; posterior inference is obtained
through posterior draws of $\beta_0$ and the $\alpha^{(j)}_l$ to obtain a draw
of $\mu_k$. Though not technically doubly-robust, it has been shown to work well
in some applications where $J$ is large enough to capture all of the important
discrepancies between the probability and non-probability samples, and the
non-probability sample is sufficiently large to allow reasonably accurate
estimation of the model parameters. In the absence of known joint distributions
of a high-dimensional $x$, this approach has the weakness of relying on
estimated distributions, which are unstable. A possible alternative might be to
replace the simple $\bar{y}_k$ with (2.5) in Wu’s poststratified estimator
(2.4), using the fact that the sampling weights summarize the information about
$x$ in the probability sample in a manner similar to that of the propensity
score for the non-probability sample.
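The poststratification step of this approach is simple once posterior draws of the cell means are in hand. The sketch below assumes those draws have already been produced by some fitted multilevel model (not shown, and not specified here) and that population cell counts are known; the function name, cell labels and numbers are made up for illustration.

```python
def poststratified_draws(cell_mean_draws, cell_counts):
    """Turn posterior draws of cell means into draws of the population mean.

    cell_mean_draws: list of dicts, one per posterior draw, mapping each
                     poststratum cell to its drawn mean.
    cell_counts:     dict mapping each cell to its known population count.
    """
    total = sum(cell_counts.values())
    return [
        sum(cell_counts[cell] * draw[cell] for cell in cell_counts) / total
        for draw in cell_mean_draws
    ]


# Hypothetical posterior draws for two cells and their population counts.
cell_mean_draws = [
    {"a": 1.0, "b": 3.0},
    {"a": 1.2, "b": 2.8},
]
cell_counts = {"a": 600, "b": 400}

mean_draws = poststratified_draws(cell_mean_draws, cell_counts)
posterior_mean = sum(mean_draws) / len(mean_draws)
print(mean_draws, posterior_mean)
```

Summaries such as the posterior mean or credible intervals are then taken over `mean_draws`. When the cell counts are themselves estimated rather than known, their uncertainty is not reflected in this sketch.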
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa