Comments on “Statistical inference with non-probability survey samples”
Section 2. Additional approaches to combining data from probability and non-probability surveys

Dr. Wu’s paper follows the general prescription of 1) using model estimation and subsequent calibration to probability-sample-estimated covariate distributions, 2) developing propensity score estimates based on discrepancies between the probability and non-probability sample data, and 3) using doubly-robust methods that combine 1) and 2) in such a manner that only one of the two underlying models needs to be correct.

2.1 Propensity score estimators

Rivers (2007) appears to have been the first to suggest estimating propensity scores using logistic regression with membership in the non-probability sample as the outcome, and taking the reciprocal of the resulting propensity scores to use as inclusion weights. This approach was formalized further in Valliant and Dever (2011). Separately, using simple results from Bayes’ theorem and discriminant analysis first described in Elliott and Davis (2005), Elliott, Resler, Flannagan and Rupp (2010) and Elliott (2013) developed a somewhat different estimator of the form

${\hat{π}}_{i}^{A} (x_{i}, α) = \hat{P} (i \in S_{A}) \propto P (i \in S_{B}) \frac{\hat{P} (i \in S_{A} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α)}{\hat{P} (i \in S_{B} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α)} . \qquad (2.1)$

$\hat{P} (i \in S_{A} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α)$ can be obtained using logistic regression, or using one of the suite of machine learning-type approaches such as support vector machines (Soentpiet, 1999), targeted maximum likelihood estimation (Van Der Laan and Rubin, 2006), or Bayesian Additive Regression Trees (BART) (Chipman, George and McCulloch, 2010), and $\hat{P} (i \in S_{B} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α)$ is obtained as $1 - \hat{P} (i \in S_{A} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α) .$ In principle $P (i \in S_{B}) = 1 / d_{i}^{B}$ is known since sampling probabilities are known for all elements of the population, including those in the non-probability sample, but in practice analysts with access only to public use data may have to estimate this as well. (In addition, $d_{i}^{B}$ may include calibration and non-response adjustments that are not known for the non-probability sample elements.) This last point is critical: using the probability sample to develop propensity scores based only on the discrepancies between the non-probability sample and the probability sample will yield biased estimates unless the probability sample used an equal probability (epsem) design, as noted by Wu.

In contrast, Chen, Li and Wu (2020) show that using a pseudo-likelihood approach to estimate ${\hat{π}}_{i}^{A} (x_{i}, α)$ directly from the population likelihood for the indicators $I (i \in S_{A})$ as a function of $x_{i}$ yields an estimator that does not require $P (i \in S_{B})$ for elements in the non-probability sample, under the restriction that $π_{i}^{A} (x_{i}, α)$ follows a generalized linear model with a canonical link, i.e., logistic regression.

(None of these approaches actually has the correct intercept to obtain a true propensity score; however, as noted in Wu, weighted estimation usually uses Hájek-type estimators [using weights to estimate a population total for denominators; Hájek, 1971] so that propensity scores estimated up to a normalizing constant are sufficient.)
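The arithmetic behind (2.1) can be sketched in a few lines. This is only an illustrative sketch, not code from any of the cited papers: the function names `pseudo_weights` and `hajek_mean` are hypothetical, the fitted probabilities $\hat{P} (i \in S_{A} \mid i \in S_{A} \text{ or } i \in S_{B}, x_{i}, α)$ are assumed to have been produced beforehand by whatever classifier is preferred, and $d_{i}^{B}$ is assumed to be available (known or estimated) for each non-probability sample unit.

```python
def pseudo_weights(p_a_given_combined, d_b):
    """Pseudo-weights 1 / pi_hat_i^A for non-probability sample units, following (2.1).

    p_a_given_combined[i]: fitted P(i in S_A | i in S_A or S_B, x_i) from any classifier
                           (assumed to be supplied by the analyst).
    d_b[i]: design weight unit i would carry in the probability sample,
            so that P(i in S_B) = 1 / d_b[i] (assumed known or estimated).
    """
    weights = []
    for p, d in zip(p_a_given_combined, d_b):
        if not 0.0 < p < 1.0:
            raise ValueError("fitted probabilities must lie strictly between 0 and 1")
        if d <= 0.0:
            raise ValueError("design weights must be positive")
        # (2.1): pi_i^A is proportional to P(i in S_B) * p / (1 - p)
        pi_a = (1.0 / d) * p / (1.0 - p)
        weights.append(1.0 / pi_a)
    return weights


def hajek_mean(y, weights):
    """Hajek-type weighted mean; invariant to rescaling of the weights."""
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)
```

Because the Hájek form divides by the sum of the weights, multiplying every pseudo-weight by the same constant leaves the estimate unchanged, which is why propensities known only up to a normalizing constant suffice here.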

2.2 Doubly-robust estimators

If inference is focused on a particular variable $Y$ available only in the non-probability sample, we can return to the model-assisted estimators that date back to Cassel, Särndal and Wretman (1976), which posit a model for the expectation $E (y_{i} | x_{i}) = m_{i} .$ Combining this with propensity score estimates of the probability of being in the non-probability sample (which we are treating as an “unknown probability sample”; more about this under Assumptions below) yields estimators of the form

$\frac{1}{{\hat{N}}^{A}} \sum_{i \in S_{A}} \frac{y_{i} - {\hat{m}}_{i}}{{\hat{π}}_{i}^{A}} + \frac{1}{{\hat{N}}^{B}} \sum_{i \in S_{B}} d_{i}^{B} {\hat{m}}_{i} \qquad (2.2)$

corresponding to ${\hat{μ}}_{DR 2}$ of (4.11) in Wu. The intuition is that, if the model for $π_{i}^{A}$ is correctly specified, any bias in $\frac{1}{{\hat{N}}^{B}} \sum_{i \in S_{B}} d_{i}^{B} {\hat{m}}_{i}$ due to misspecification of the model for $m_{i}$ will be equal in magnitude and opposite in sign to the expectation of $\frac{1}{{\hat{N}}^{A}} \sum_{i \in S_{A}} \frac{y_{i} - {\hat{m}}_{i}}{{\hat{π}}_{i}^{A}} .$ Conversely, if the model for $π_{i}^{A}$ is misspecified but $m_{i}$ is correctly specified, $y_{i} - {\hat{m}}_{i}$ will be iid with mean zero and consequently $\frac{1}{{\hat{N}}^{A}} \sum_{i \in S_{A}} \frac{y_{i} - {\hat{m}}_{i}}{{\hat{π}}_{i}^{A}}$ will also have mean 0, yielding an unbiased estimator. Chen, Valliant and Elliott (2019) used LASSO for prediction in combination with generalized regression estimators (McConville, Breidt, Lee and Moisen, 2017) when $X$ is of high dimension. As Wu notes, Wu and Sitter (2001) show the equivalence between GREG applied to predicted values and DR estimators of the form in (2.2), which indicates that the Chen et al. (2019) approach was equivalent to (2.2) with LASSO estimation for $m_{i}$ and an assumption of simple random sampling for the non-probability sample.
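A minimal sketch of the point estimate in (2.2) may help fix ideas. It assumes that ${\hat{m}}_{i}$ and ${\hat{π}}_{i}^{A}$ have already been fitted elsewhere, and that ${\hat{N}}^{A}$ and ${\hat{N}}^{B}$ are the sums of the pseudo-weights and of the design weights respectively; the function name `dr_mean` and its argument names are hypothetical.

```python
def dr_mean(y_a, m_a, pi_a, m_b, d_b):
    """Doubly-robust mean of the form (2.2).

    y_a, m_a, pi_a: outcome, fitted mean and fitted propensity for units in S_A.
    m_b, d_b:       fitted mean and design weight for units in S_B.
    Assumption: N_hat^A = sum of 1 / pi_a and N_hat^B = sum of d_b.
    """
    w_a = [1.0 / p for p in pi_a]
    n_hat_a = sum(w_a)
    n_hat_b = sum(d_b)
    # Propensity-weighted residuals from the non-probability sample
    residual_term = sum(w * (y - m) for w, y, m in zip(w_a, y_a, m_a)) / n_hat_a
    # Design-weighted predictions from the probability sample
    prediction_term = sum(d * m for d, m in zip(d_b, m_b)) / n_hat_b
    return residual_term + prediction_term
```

Two limiting cases illustrate the structure: when the fitted means reproduce the outcomes exactly the residual term vanishes and only the design-weighted prediction term remains, and when all fitted means are zero the estimator reduces to the Hájek-type propensity-weighted mean over the non-probability sample.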

A disadvantage of using (2.1) as opposed to Chen et al. (2020) as the estimator of $π_{i}^{A},$ and thus of $d_{i}^{A},$ is the requirement that the probability sample weights $d_{i}^{B}$ be known or at least estimated for the non-probability sample. An advantage of using (2.1) is that non-linear models and machine learning methods can be used in estimation. Rafei, Flannagan and Elliott (2020) use BART to estimate both $m_{i}$ and $π_{i}^{A},$ reducing the impact of potential model misspecification. Simulations showed considerable reductions in bias and variance relative to the method of Chen et al. (2020) when the linear model is misspecified. Variance estimation can proceed by adapting Rubin’s multiple imputation rules: from $M$ independent draws from BART, the mean of the variances computed using standard complex sample design estimators, treating each draw of $d_{i}^{A}$ as known, is added to $\frac{M + 1}{M}$ times the variance of the point estimates across the draws of $d_{i}^{A},$ yielding an approximately unbiased variance estimator.
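The combining rule just described can be written out as follows. This is a sketch only: the function name `combine_draws` is hypothetical, and the per-draw point estimates and design-based variances are assumed to have been computed already, one pair for each of the $M$ draws of $d_{i}^{A}$.

```python
from statistics import mean, variance


def combine_draws(point_estimates, within_variances):
    """Combine M per-draw results with Rubin-type multiple imputation rules.

    point_estimates[m]:  estimate computed treating draw m of d_i^A as known.
    within_variances[m]: design-based variance of that estimate.
    Returns (combined estimate, total variance).
    """
    m = len(point_estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need at least two draws and one variance per draw")
    estimate = mean(point_estimates)
    within = mean(within_variances)        # average within-draw variance
    between = variance(point_estimates)    # sample variance, divisor M - 1
    total = within + (m + 1) / m * between
    return estimate, total
```

For example, three draws with point estimates 1, 2 and 3 and a common within-draw variance of 0.5 give a combined estimate of 2 and a total variance of $0.5 + \frac{4}{3} \times 1 \approx 1.83 .$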

An alternative approach to doubly-robust estimation uses the fact that the propensity score is the coarsest possible “balancing score” that contains all of the information about the association between the sampling indicator and the outcome of interest. This has led to the development of mean estimators that use smooth functions of weights to produce consistent estimators that can be more efficient when weights are highly variable or only weakly related to the outcome (Elliott and Little, 2000; Zheng and Little, 2005). Zhou, Elliott and Little (2019) extended this idea to causal inference in non-randomized settings, in which the probability of assignment to a treatment or exposure (the propensity score) is estimated as a function of covariates $P_{Z} (x_{i}, α)$ using logistic regression, and then non-observed potential outcomes $Y^{z}$ under treatment arm $z_{i}' \neq z_{i}$ for observed treatment $z_{i}$ are imputed from

$Y_{i}^{Z} \sim N \left( s ({\hat{P}}_{Z}^{*} (x_{i}, \hat{α}) | θ_{Z}) + g_{Z} ({\hat{P}}^{*} (x_{i}, \hat{α}), x_{i} | β_{Z}), σ^{2} \right) \qquad (2.3)$

where $P^{*}$ is the logit transformation of $P,$ $s ({\hat{P}}_{Z}^{*} | θ_{Z})$ denotes a penalized spline with fixed knots (Eilers and Marx, 1996) of propensity, and $g_{Z} ({\hat{P}}^{*}, x_{i} | β_{Z})$ is a general function of covariates including the propensity scores. The resulting estimator is doubly robust in the sense that if either $P_{Z} (x_{i}, α)$ or $E (Y^{z}) = g_{Z} ({\hat{P}}^{*} , x_{i} | β_{Z})$ is correctly specified, $Y^{(z)}$ will be approximately unbiased; see Zhang and Little (2009). This can be implemented in the non-probability setting by replacing ${\hat{P}}_{Z} (x_{i}, α)$ in the mean model for (2.3) with ${\hat{π}}_{i}^{A}$ estimated using (2.1) to obtain a draw of $Y_{i}^{(b)} .$ (Note this requires obtaining ${\hat{π}}_{i}^{A}$ for the probability sample elements requiring prediction.) Inference can proceed by obtaining $b = 1, \dots, B$ draws from the posterior distribution of the estimated population quantity of interest, e.g., for the population mean

$Y^{(b)} = \frac{\sum_{i \in S_{R}} N_{i}^{(b)} Y_{i}^{(b)} + \sum_{i \in S_{A}} (y_{i} - Y_{i}^{(b)})}{N}$

where now $N_{i}^{(b)}$ is an estimate of the population represented by the weight $d_{i}^{R}$ obtained from a finite population Bayesian bootstrap (FPBB) (Little and Zheng, 2007); more complete FPBB extensions to complex sample designs that include clustering and stratification are available in Dong, Elliott and Raghunathan (2014).

As in the estimation of (2.1), the non-parametric (spline) component of (2.3) can be replaced with other machine-learning estimators; see Chapter 4 of Rafei (2021) for an implementation using Gaussian processes. Also, extensions to non-normal models are direct, although not necessarily computationally easy.

2.3 Poststratified estimators

Wu also describes the use of poststratified estimators in the context of quota sampling, which is not only a very old form of non-probability sampling but indeed was the standard before Neyman made the case for stratified random sampling (Neyman, 1934). Wu’s Section 5 suggests a robust alternative to the propensity score estimates, obtained by ordering observations in the probability sample by ${\hat{π}}_{i},$ stratifying into $K$ strata based on this ordering, and computing the predicted proportion of the population belonging to the $k^{th}$ stratum as the proportion of the sample weights ${\hat{W}}_{k}$ in this stratum using the probability sample, with

${\hat{μ}}_{PST} = \sum_{k} {\hat{W}}_{k} {\bar{y}}_{k} \qquad (2.4)$

where ${\bar{y}}_{k}$ is the mean within the $k^{th}$ stratum in the non-probability sample. Wu notes the tradeoff between choosing $K$ to be large enough to retain homogeneity within strata but small enough to obtain stable estimates of ${\bar{y}}_{k},$ suggesting 30 as the old “rule of thumb” for “large [enough] sample sizes”. I would add that Little (1986) discusses a more formal method for generating strata (there in the context of non-response adjustment) that minimizes mean square error by maximizing the between-stratum-to-within-stratum variance ratio. It would seem such an approach would be appropriate to consider in the non-probability poststratified estimator as well.
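The steps leading to (2.4) can be sketched as below. The sketch makes several assumptions that are not spelled out above: strata are formed as $K$ roughly equal-sized groups of the ordered probability sample, the resulting cut points are then applied to the non-probability sample, and estimated propensities are available for the units of both samples. The function name `poststratified_mean` and its arguments are hypothetical.

```python
from bisect import bisect_right


def poststratified_mean(pi_b, d_b, pi_a, y_a, k_strata):
    """Poststratified estimator (2.4) with strata built from ordered propensities.

    pi_b, d_b: estimated propensity and design weight for probability sample units.
    pi_a, y_a: estimated propensity and outcome for non-probability sample units.
    k_strata:  number of strata K (assumed much smaller than either sample size).
    """
    ordered = sorted(pi_b)
    n_b = len(ordered)
    # K - 1 cut points splitting the ordered probability sample into K groups
    cuts = [ordered[(k * n_b) // k_strata] for k in range(1, k_strata)]

    def stratum(p):
        return bisect_right(cuts, p)

    # W_hat_k: share of the probability sample weights falling in stratum k
    weight_sums = [0.0] * k_strata
    for p, d in zip(pi_b, d_b):
        weight_sums[stratum(p)] += d
    total_weight = sum(weight_sums)

    # y_bar_k: unweighted mean of y within stratum k of the non-probability sample
    y_sums = [0.0] * k_strata
    counts = [0] * k_strata
    for p, y in zip(pi_a, y_a):
        y_sums[stratum(p)] += y
        counts[stratum(p)] += 1
    if min(counts) == 0:
        raise ValueError("a stratum has no non-probability units; reduce K")

    return sum((weight_sums[k] / total_weight) * (y_sums[k] / counts[k])
               for k in range(k_strata))
```

Raising an error when a stratum contains no non-probability units is one possible design choice; it makes the tradeoff in the choice of $K$ explicit rather than silently dropping a stratum.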

A more direct approach to obtain estimates using a post-stratified type estimator is multilevel regression and poststratification (Wang, Rothschild, Goel and Gelman, 2015; Downes and Carlin, 2020). Here only data from the non-probability sample is used in the outcome model:

$E (Y_{k [i]}) = β_{0} + x_{k}^{T} β + \sum_{j} a_{l [k]}^{j} \qquad (2.5)$

where $k = 1, \dots, K$ indexes the poststratum developed from $j = 1, \dots, J$ variables, $a_{l [k]}^{j} \sim N (0, σ_{j}^{2})$ for $l = 1, \dots, L_{j},$ and $l [k]$ maps the poststratum cell $k$ to the appropriate category $l$ of variable $j .$ The poststratified estimator is still given by (2.4) with ${\hat{W}}_{k}$ now replaced with known population totals $W_{k};$ posterior inference is obtained through posterior draws of $β_{0},$ $β,$ and $a_{l [k]}^{j}$ to obtain a draw of

${\hat{μ}}_{PST}^{(b)} = \sum_{k} W_{k} \left[ \frac{1}{n_{k}} \sum_{i \in k} \left( β_{0}^{(b)} + x_{k}^{T} β^{(b)} + \sum_{j} a_{l [k]}^{j (b)} \right) \right] .$

Though not technically doubly-robust, it has been shown to work well in some applications where $J$ is large enough to capture all of the important discrepancies between the probability and non-probability sample, and the non-probability sample is sufficiently large to allow reasonably accurate estimation of $a_{l [k]}^{j} .$ In the absence of known joint distributions of a high dimensional $X,$ this approach has the weakness of relying on estimated distributions, which are unstable. A possible alternative might be to replace the simple ${\bar{y}}_{k}$ with (2.5) in Wu’s poststratified estimator (2.4), using the fact that the sampling weights $d_{i}^{R}$ summarize the information about $X$ in the probability sample in a manner similar to that of the propensity score for the non-probability sample.
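The poststratification step of the multilevel regression approach can be sketched as follows, assuming posterior draws of $β_{0},$ $β$ and $a_{l [k]}^{j}$ have already been obtained from a separately fitted model with identity link as in (2.5). The data layout (dictionaries with keys `beta0`, `beta`, `a`, `x`, `levels` and `W`) and the function names are hypothetical, and the division by $\sum_{k} W_{k}$ is an assumption made here so that cell totals $W_{k}$ produce a mean rather than a total.

```python
def cell_mean(draw, cell):
    """Linear predictor (2.5) for one poststratum cell under one posterior draw.

    draw: {"beta0": float, "beta": [floats], "a": a[j][l] random effects}
    cell: {"x": cell-level covariates, "levels": category l of each variable j, "W": total}
    """
    fixed = draw["beta0"] + sum(b * x for b, x in zip(draw["beta"], cell["x"]))
    random_effects = sum(draw["a"][j][l] for j, l in enumerate(cell["levels"]))
    return fixed + random_effects


def mrp_draws(draws, cells):
    """One poststratified mean per posterior draw, weighting cells by W_k."""
    total_w = sum(cell["W"] for cell in cells)
    return [sum(cell["W"] * cell_mean(draw, cell) for cell in cells) / total_w
            for draw in draws]
```

Because the linear predictor is constant within a cell, the inner average over units in cell $k$ reduces to the cell-level predictor itself; the returned list can then be summarized by its mean and quantiles for posterior inference.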

ISSN: 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-12-15
