Statistical inference with non-probability survey samples
Section 4. Propensity scores based approach
The propensity scores $\pi_i^A$ for the non-probability survey sample $S_A$ are theoretically defined for all the units in the target population $U$. Estimation of the propensity scores for units in $S_A$, which plays the most crucial role for propensity scores based methods, requires an assumed model on the propensity scores and auxiliary information at the population level. In this section, we first discuss estimation procedures for the propensity scores under the setting and assumptions described in Section 2, and then provide an overview of estimation methods proposed in the recent literature on the finite population mean $\mu_y$ involving the estimated propensity scores.
4.1 Estimation of propensity scores
Under assumption A1, the propensity scores $\pi_i^A = \pi(\mathbf{x}_i)$ are a function of the auxiliary variables $\mathbf{x}_i$, but the functional form can be complicated and is completely unknown. Three popular parametric forms $\pi(\mathbf{x}, \boldsymbol{\theta})$ for dealing with a binary response can be considered: (i) the inverse logit function $\pi(\mathbf{x}, \boldsymbol{\theta}) = \exp(\mathbf{x}^{\top}\boldsymbol{\theta})/\{1 + \exp(\mathbf{x}^{\top}\boldsymbol{\theta})\}$; (ii) the inverse probit function $\pi(\mathbf{x}, \boldsymbol{\theta}) = \Phi(\mathbf{x}^{\top}\boldsymbol{\theta})$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution $N(0,1)$; and (iii) the inverse complementary log-log function $\pi(\mathbf{x}, \boldsymbol{\theta}) = 1 - \exp\{-\exp(\mathbf{x}^{\top}\boldsymbol{\theta})\}$. Nonparametric techniques without assuming an explicit functional form for $\pi(\mathbf{x})$ are attractive alternatives for the estimation of propensity scores.
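As a quick illustration, the three inverse link functions can be coded directly; the following is a minimal sketch (the probit form uses the standard normal cdf from scipy):

```python
import numpy as np
from scipy.stats import norm

def inv_logit(eta):
    """Inverse logit: pi = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def inv_probit(eta):
    """Inverse probit: pi = Phi(eta), with Phi the N(0, 1) cdf."""
    return norm.cdf(eta)

def inv_cloglog(eta):
    """Inverse complementary log-log: pi = 1 - exp(-exp(eta))."""
    return -np.expm1(-np.exp(eta))

# In each case eta = x @ theta for auxiliary variables x and parameters theta.
```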
4.1.1 The pseudo maximum likelihood method
Let $\pi_i^A = \pi(\mathbf{x}_i, \boldsymbol{\theta})$ be a specified parametric form with unknown model parameters $\boldsymbol{\theta}$. Under the ideal situation where the complete auxiliary information $\{\mathbf{x}_i, i \in U\}$ is available and with the independence assumption A3, the full log-likelihood function for $\boldsymbol{\theta}$ can be written as (Chen et al., 2020)

\[
\ell(\boldsymbol{\theta}) = \sum_{i \in S_A} \log\left(\frac{\pi_i^A}{1 - \pi_i^A}\right) + \sum_{i \in U} \log\left(1 - \pi_i^A\right). \tag{4.1}
\]

The maximum likelihood estimator of $\boldsymbol{\theta}$ is the maximizer of $\ell(\boldsymbol{\theta})$. Under the current setting where the population auxiliary information is supplied by the reference probability sample $S_B$, we replace $\ell(\boldsymbol{\theta})$ by the pseudo log-likelihood function (Chen et al., 2020)

\[
\ell^{*}(\boldsymbol{\theta}) = \sum_{i \in S_A} \log\left(\frac{\pi_i^A}{1 - \pi_i^A}\right) + \sum_{i \in S_B} d_i^B \log\left(1 - \pi_i^A\right). \tag{4.2}
\]

The maximum pseudo-likelihood estimator $\hat{\boldsymbol{\theta}}$ is the maximizer of $\ell^{*}(\boldsymbol{\theta})$ and can be obtained as the solution to the pseudo score equations given by $U(\boldsymbol{\theta}) = \partial \ell^{*}(\boldsymbol{\theta})/\partial \boldsymbol{\theta} = \boldsymbol{0}$. If the inverse logit function is assumed for $\pi(\mathbf{x}, \boldsymbol{\theta})$, the pseudo score functions are given by

\[
U(\boldsymbol{\theta}) = \sum_{i \in S_A} \mathbf{x}_i - \sum_{i \in S_B} d_i^B\, \pi(\mathbf{x}_i, \boldsymbol{\theta})\, \mathbf{x}_i. \tag{4.3}
\]

In general, the pseudo score functions $U(\boldsymbol{\theta})$ at the true values of the model parameters $\boldsymbol{\theta}_0$ are unbiased under the joint $pq$ randomization in the sense that $E_{pq}\{U(\boldsymbol{\theta}_0)\} = \boldsymbol{0}$, which implies that the estimator $\hat{\boldsymbol{\theta}}$ is root-$n$ consistent for $\boldsymbol{\theta}_0$ (Tsiatis, 2006).
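To make the computation concrete, here is a minimal sketch of the maximum pseudo-likelihood estimator under the logistic model, obtained by solving the pseudo score equations (4.3) numerically; the array names x_A, x_B and d_B are illustrative assumptions, not notation from the paper:

```python
import numpy as np
from scipy.optimize import fsolve

def pseudo_score(theta, x_A, x_B, d_B):
    """Pseudo score functions U(theta) in (4.3) under the logistic model:
    the sum of x_i over S_A minus the design-weighted sum over S_B of
    pi(x_i, theta) * x_i."""
    pi_B = 1.0 / (1.0 + np.exp(-(x_B @ theta)))
    return x_A.sum(axis=0) - (d_B * pi_B) @ x_B

def fit_pseudo_mle(x_A, x_B, d_B):
    """Solve U(theta) = 0 for the maximum pseudo-likelihood estimator.
    x_A: (n_A, p) auxiliary variables for the non-probability sample S_A;
    x_B: (n_B, p) auxiliary variables for the reference sample S_B;
    d_B: (n_B,) design weights for S_B."""
    theta0 = np.zeros(x_A.shape[1])
    return fsolve(pseudo_score, theta0, args=(x_A, x_B, d_B))
```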
Valliant and Dever (2011) made an earlier attempt to estimate the propensity scores by pooling the non-probability sample $S_A$ with the reference probability sample $S_B$. Let $S_{AB} = S_A \cup S_B$ be the pooled sample without removing any potential duplicated units. Let $R_i = 1$ if $i \in S_A$ and $R_i = 0$ if $i \in S_B$. Valliant and Dever (2011) proposed to fit a survey weighted logistic regression model to the pooled dataset $\{(R_i, \mathbf{x}_i), i \in S_{AB}\}$, where the weights are defined as $w_i = 1$ if $i \in S_A$ and $w_i = d_i^B (\hat{N}_B - n_A)/\hat{N}_B$ if $i \in S_B$, with $\hat{N}_B = \sum_{i \in S_B} d_i^B$. The key motivation behind the creation of the weights $w_i$ is that the total weight $\sum_{i \in S_{AB}} w_i = \hat{N}_B$ for the pooled sample matches the estimated population size, and the hope is that the survey weighted logistic regression model would lead to valid estimates for the propensity scores. It was shown by Chen et al. (2020) that the pooled sample approach of Valliant and Dever (2011) does not lead to consistent estimators for the parameters of the propensity scores model unless the non-probability sample $S_A$ is a simple random sample from the target population.
The method of Valliant and Dever (2011) reveals a fundamental difficulty with approaches based on the pooled sample $S_{AB}$. If the units in the non-probability sample $S_A$ are treated as exchangeable in the pooled sample $S_{AB}$, which was reflected by the equal weights $w_i = 1$, $i \in S_A$, used in the method of Valliant and Dever (2011), the resulting estimates for the propensity scores will be invalid unless $S_A$ is a simple random sample. This observation has implications for the validity of nonparametric methods or regression tree-based methods to be discussed in Section 4.1.3.
In a recent paper, Wang, Valliant and Li (2021) proposed an adjusted logistic propensity (ALP) weighting method. The method involves two steps for computing the estimated propensity scores. The initial estimates, denoted as $\hat{p}_i$ for $i \in S_A$, are obtained by fitting the survey weighted logistic regression model to the pooled sample $S_{AB}$, similar to Valliant and Dever (2011), with the weights defined as $w_i = 1$ if $i \in S_A$ and $w_i = d_i^B$ if $i \in S_B$. The final estimated propensity scores are computed as $\hat{\pi}_i^A = \hat{p}_i/(1 - \hat{p}_i)$. The key theoretical argument is the equation $p_i = \pi_i^A/(1 + \pi_i^A)$, where $p_i = P(R_i = 1 \mid \mathbf{x}_i, i \in S_A \cup U^{*})$ and $U^{*}$ is a copy of $U$ but is viewed as a different set. However, there are conceptual issues with the arguments since the probabilities $\pi_i^A$ are defined under the assumed propensity scores model with the given finite population $U$, and the assumed model does not lead to a meaningful interpretation of the probabilities $p_i$. The latter require a different probability space and are conditional on the given $S_A$. As a matter of fact, one can easily argue that under the assumed propensity scores model and conditional on the given $S_A$, we have $p_i = 1/2$ if $i \in S_A$ and $p_i = 0$ otherwise.
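For concreteness, a minimal sketch of the two ALP steps described above, using a weighted logistic regression fit from statsmodels on the pooled sample; the variable names are illustrative, and the conceptual caveats noted above still apply:

```python
import numpy as np
import statsmodels.api as sm

def alp_propensity(x_A, x_B, d_B):
    """Two-step ALP estimate of the propensity scores for units in S_A.
    Step 1: weighted logistic regression of R on x over the pooled
            sample, with weight 1 for S_A and d_i^B for S_B.
    Step 2: adjust the fitted probabilities via pi = p / (1 - p)."""
    n_A = len(x_A)
    X = sm.add_constant(np.vstack([x_A, x_B]))
    R = np.r_[np.ones(n_A), np.zeros(len(x_B))]
    w = np.r_[np.ones(n_A), np.asarray(d_B)]
    fit = sm.GLM(R, X, family=sm.families.Binomial(), freq_weights=w).fit()
    p_A = fit.predict(X[:n_A])          # initial estimates on S_A
    return p_A / (1.0 - p_A)            # ALP adjustment
```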
4.1.2 Estimating equations based methods
The pseudo score equations $U(\boldsymbol{\theta}) = \boldsymbol{0}$ derived from the pseudo likelihood function $\ell^{*}(\boldsymbol{\theta})$ may be replaced by a system of general estimating equations. Let $\mathbf{g}(\mathbf{x})$ be a user-specified vector of functions with the same dimension as $\boldsymbol{\theta}$. Let

\[
G(\boldsymbol{\theta}) = \sum_{i \in S_A} \mathbf{g}(\mathbf{x}_i) - \sum_{i \in S_B} d_i^B\, \pi(\mathbf{x}_i, \boldsymbol{\theta})\, \mathbf{g}(\mathbf{x}_i). \tag{4.4}
\]

It follows that $E_{pq}\{G(\boldsymbol{\theta}_0)\} = \boldsymbol{0}$ for any chosen $\mathbf{g}(\mathbf{x})$. In principle, an estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ can be obtained by solving $G(\boldsymbol{\theta}) = \boldsymbol{0}$ with the chosen parametric form $\pi(\mathbf{x}, \boldsymbol{\theta})$ and the chosen functions $\mathbf{g}(\mathbf{x})$, and the estimator $\hat{\boldsymbol{\theta}}$ is consistent. The estimator $\hat{\boldsymbol{\theta}}$ using arbitrary user-specified functions $\mathbf{g}(\mathbf{x})$ is typically less efficient than the one based on the pseudo score functions, due to the optimality of the maximum likelihood estimator (Godambe, 1960). Some limited empirical results also show that the solution to $G(\boldsymbol{\theta}) = \boldsymbol{0}$ can be unstable for certain choices of $\mathbf{g}(\mathbf{x})$.
Nevertheless, the estimating equations based methods provide a useful tool for the estimation of the propensity scores under more restricted scenarios. For instance, if we let $\mathbf{g}(\mathbf{x}) = \mathbf{x}/\pi(\mathbf{x}, \boldsymbol{\theta})$, the estimating functions given in (4.4) reduce to

\[
G(\boldsymbol{\theta}) = \sum_{i \in S_A} \frac{\mathbf{x}_i}{\pi(\mathbf{x}_i, \boldsymbol{\theta})} - \sum_{i \in S_B} d_i^B\, \mathbf{x}_i. \tag{4.5}
\]

The form of $G(\boldsymbol{\theta})$ in (4.5) looks like a “distorted” version of the pseudo score functions given in (4.3) under a logistic regression model for the propensity scores. The most practically important difference between the two versions, however, is the fact that the $G(\boldsymbol{\theta})$ given in (4.5) only requires the estimated population totals $\hat{\mathbf{T}}_x = \sum_{i \in S_B} d_i^B \mathbf{x}_i$ for the auxiliary variables $\mathbf{x}$. There are scenarios where the population totals of the auxiliary variables $\mathbf{x}$ can be accessed or estimated from an existing source but values of $\mathbf{x}_i$ at the unit level for the entire population or even a probability sample are not available. The use of the estimating functions $G(\boldsymbol{\theta})$ given in (4.5) makes it possible to obtain valid estimates of the propensity scores for units in the non-probability sample. Section 6.3 describes an example where the estimating equations based approach leads to a valid variance estimator for the doubly robust estimator of the population mean.
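A minimal sketch of solving (4.5) under the logistic model when only an estimated vector of population totals for the auxiliary variables is available; T_x is a hypothetical input holding $\hat{\mathbf{T}}_x$:

```python
import numpy as np
from scipy.optimize import fsolve

def estimating_functions(theta, x_A, T_x):
    """G(theta) in (4.5) under the logistic model: the inverse
    propensity weighted total of x over S_A minus the (estimated)
    population totals T_x of the auxiliary variables."""
    pi_A = 1.0 / (1.0 + np.exp(-(x_A @ theta)))
    return (x_A / pi_A[:, None]).sum(axis=0) - T_x

# theta_hat = fsolve(estimating_functions, np.zeros(x_A.shape[1]),
#                    args=(x_A, T_x))
```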
4.1.3 Nonparametric methods and regression-tree based methods
The propensity scores $\pi(\mathbf{x})$ are the mean function $E(R \mid \mathbf{x})$ for the binary response $R$. Nonparametric methods for estimating $\pi(\mathbf{x})$ can be an attractive alternative. The major challenge is to develop estimation procedures which provide valid estimates of the propensity scores. As noted in Section 4.1.1, estimation methods based on the pooled sample $S_{AB}$ may lead to invalid estimates. Strategies similar to the one used by Chen et al. (2020) can be theoretically justified under the joint $pq$ framework, where the estimation procedures are first derived using data from the entire finite population and unknown population quantities are then replaced by estimates obtained from the reference probability sample.
We consider the kernel regression estimator of $\pi(\mathbf{x})$. Suppose that the dataset $\{(R_i, \mathbf{x}_i), i \in U\}$ is available for the finite population. Let $K_h(\cdot)$ be a chosen kernel with a bandwidth $h$. The Nadaraya-Watson kernel regression estimator (Nadaraya, 1964; Watson, 1964) of $\pi(\mathbf{x})$ is given by

\[
\hat{\pi}(\mathbf{x}) = \frac{\sum_{i \in U} R_i K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{i \in U} K_h(\mathbf{x} - \mathbf{x}_i)}. \tag{4.6}
\]

A kernel estimator in the form of $\hat{\pi}(\mathbf{x})$ given in (4.6) usually has no practical value since we do not have complete auxiliary information for the finite population. It turns out that for the estimation of propensity scores the numerator in (4.6) only requires observations from the non-probability sample due to the binary variable $R_i$, and the denominator is a population total and can be estimated by using the reference probability sample. The nonparametric kernel regression estimator of the propensity scores is given by (Yuan, Li and Wu, 2022)

\[
\hat{\pi}(\mathbf{x}) = \frac{\sum_{i \in S_A} K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{i \in S_B} d_i^B K_h(\mathbf{x} - \mathbf{x}_i)}. \tag{4.7}
\]
The estimator $\hat{\pi}(\mathbf{x})$ given in (4.7) is consistent under the joint $pq$ framework, and the $q$-model for the propensity scores is very flexible due to the nonparametric assumption on $\pi(\mathbf{x})$. The estimated propensity scores are easy to compute when the dimension of $\mathbf{x}$ is not too high. Issues with high dimensional $\mathbf{x}$ and the choices of the kernel $K(\cdot)$ and the bandwidth $h$ remain as in general applications of kernel-based estimation methods. Simulation results reported by Yuan et al. (2022) show that the kernel estimation method provides robust results for the propensity scores using the normal kernel and popular choices for the bandwidth.
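As an illustration, a minimal sketch of the kernel regression estimator (4.7) with a product normal kernel (one common choice; the kernel's normalizing constant cancels between the numerator and the denominator):

```python
import numpy as np

def kernel_propensity(x_eval, x_A, x_B, d_B, h):
    """Kernel regression estimator (4.7) of the propensity scores.
    x_eval: (m, p) points at which the propensity scores are wanted;
    x_A, x_B: auxiliary variables for S_A and S_B; d_B: design weights;
    h: bandwidth (e.g., from a rule of thumb or cross-validation)."""
    def K(u):
        # product normal kernel evaluated at u / h, up to a constant
        return np.exp(-0.5 * np.sum((u / h) ** 2, axis=-1))
    num = np.array([K(x - x_A).sum() for x in x_eval])            # over S_A
    den = np.array([np.sum(d_B * K(x - x_B)) for x in x_eval])    # over S_B
    return num / den
```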
Chu and Beaumont (2019)
considered regression-tree based methods for estimating the propensity scores.
Their proposed TrIPW method is a variant of the CART algorithm (Breiman,
Friedman, Olshen and Stone, 1984) and uses data from the combined sample of the
non-probability sample and the reference probability sample. The method aims to
construct a classification tree with the terminal nodes of the final tree
treated as homogeneous groups in terms of the propensity scores. The estimator
of the propensity scores $\pi_i^A$ is constructed based on the final tree and post-stratification. Section 5 contains further details on post-stratified estimators.
Statistical learning
techniques such as classification and regression trees and random forests have
been developed primarily for the purpose of prediction. Their use for
estimating the propensity scores of non-probability samples requires further
research. It is not a desirable approach to naively apply the methods to the pooled sample $S_{AB}$ without theoretical justifications on the consistency of the final estimators. Further research in this direction should be encouraged.
4.2 Inverse probability weighting
Let $\hat{\pi}_i^A$ be an estimate of $\pi_i^A$ under a chosen method for the estimation of the propensity scores. Two versions of the inverse probability weighted (IPW) estimator of $\mu_y$ are constructed as

\[
\hat{\mu}_{\mathrm{IPW1}} = \frac{1}{N} \sum_{i \in S_A} \frac{y_i}{\hat{\pi}_i^A} \quad \text{and} \quad \hat{\mu}_{\mathrm{IPW2}} = \frac{1}{\hat{N}_A} \sum_{i \in S_A} \frac{y_i}{\hat{\pi}_i^A}, \tag{4.8}
\]

where $N$ is the population size and $\hat{N}_A = \sum_{i \in S_A} 1/\hat{\pi}_i^A$ is the estimated population size. The estimator $\hat{\mu}_{\mathrm{IPW1}}$ is a version of the Horvitz-Thompson estimator and $\hat{\mu}_{\mathrm{IPW2}}$ corresponds to the Hájek estimator as discussed in design-based estimation theory. There is ample evidence from both theoretical justifications and practical observations that the Hájek estimator $\hat{\mu}_{\mathrm{IPW2}}$ performs better than the Horvitz-Thompson estimator and should be used in practice even if the population size $N$ is known.
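A minimal sketch of the two IPW estimators in (4.8); the inputs are the outcome values and estimated propensity scores for units in $S_A$:

```python
import numpy as np

def ipw_estimators(y_A, pi_A_hat, N=None):
    """The two IPW estimators of the population mean in (4.8).
    Returns (Horvitz-Thompson version, Hajek version); the former
    needs the known population size N."""
    w = 1.0 / pi_A_hat
    mu_ipw2 = np.sum(w * y_A) / np.sum(w)            # Hajek: uses N_A-hat
    mu_ipw1 = np.sum(w * y_A) / N if N is not None else None
    return mu_ipw1, mu_ipw2
```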
The validity of the IPW estimators $\hat{\mu}_{\mathrm{IPW1}}$ and $\hat{\mu}_{\mathrm{IPW2}}$ depends on the validity of the estimated propensity scores. Under the assumptions A1 and A2 and the parametric model $\pi_i^A = \pi(\mathbf{x}_i, \boldsymbol{\theta})$, the consistency of $\hat{\mu}_{\mathrm{IPW1}}$ follows a standard two-step argument. Let

\[
\hat{\mu}_{\mathrm{IPW1}}(\boldsymbol{\theta}_0) = \frac{1}{N} \sum_{i \in S_A} \frac{y_i}{\pi(\mathbf{x}_i, \boldsymbol{\theta}_0)},
\]

which is not a computable estimator but an analytic tool useful for asymptotic purposes. It follows that $E_q\{\hat{\mu}_{\mathrm{IPW1}}(\boldsymbol{\theta}_0)\} = \mu_y$ and the order $\hat{\mu}_{\mathrm{IPW1}}(\boldsymbol{\theta}_0) - \mu_y = O_p(n_A^{-1/2})$ holds under the condition that $N\pi(\mathbf{x}_i, \boldsymbol{\theta}_0)/n_A$ is bounded away from zero. As a consequence, we have $\hat{\mu}_{\mathrm{IPW1}}(\boldsymbol{\theta}_0) - \mu_y \to 0$ in probability as $n_A \to \infty$. Under the correctly specified model $\pi(\mathbf{x}, \boldsymbol{\theta})$ for the propensity scores, the typical root-$n$ order $\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0 = O_p(n^{-1/2})$ holds for commonly encountered scenarios. We can show by treating $\hat{\mu}_{\mathrm{IPW1}} = \hat{\mu}_{\mathrm{IPW1}}(\hat{\boldsymbol{\theta}})$ as a function of $\hat{\boldsymbol{\theta}}$ and using a Taylor series expansion that $\hat{\mu}_{\mathrm{IPW1}} - \hat{\mu}_{\mathrm{IPW1}}(\boldsymbol{\theta}_0) = O_p(n^{-1/2})$ under some mild finite moment conditions. The consistency of $\hat{\mu}_{\mathrm{IPW2}}$ can be established using standard arguments for a ratio estimator (Section 5.3, Wu and Thompson, 2020), where $\hat{\mu}_{\mathrm{IPW2}} = (N/\hat{N}_A)\,\hat{\mu}_{\mathrm{IPW1}}$.
4.3 Doubly robust estimation
The dependence of the IPW
estimator on the validity of the assumed propensity score model is viewed as a
weakness of the method. The issue is not unique to the IPW estimators and is
faced by many other approaches involving an assumed statistical model. Robust
estimation procedures which provide certain degrees of protection against model
misspecifications have been pursued by researchers, and the so-called doubly
robust estimators have been a successful story since the work of Robins,
Rotnitzky, and Zhao (1994).
The doubly robust (DR) estimator of $\mu_y$ is constructed using both the propensity score model $q$: $\pi_i^A = \pi(\mathbf{x}_i, \boldsymbol{\theta})$ and the outcome regression model $\xi$: $m_i = E(y_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$. The DR estimator with the given propensity scores $\pi_i^A$ and the mean responses $m_i$ has the following general form,

\[
\hat{\mu}_{\mathrm{DR}} = \frac{1}{N} \sum_{i \in S_A} \frac{y_i - m_i}{\pi_i^A} + \frac{1}{N} \sum_{i \in U} m_i. \tag{4.9}
\]

The second term on the right hand side of (4.9) is the model-based prediction of $\mu_y$. The first term is a propensity score based adjustment using the errors $y_i - m_i$ from the outcome regression model. The magnitude of the adjustment term is negatively correlated with the “goodness-of-fit” of the outcome regression model. It can be shown that $\hat{\mu}_{\mathrm{DR}}$ is an exactly unbiased estimator of $\mu_y$ if one of the two models, $q$ and $\xi$, is correctly specified, and hence it is doubly robust. The estimator $\hat{\mu}_{\mathrm{DR}}$ has an identical structure to the generalized difference estimator of Wu and Sitter (2001). It is important to note that the double robustness property of $\hat{\mu}_{\mathrm{DR}}$ does not require knowledge of which of the two models is correctly specified. It is also apparent that the estimator $\hat{\mu}_{\mathrm{DR}}$ given in (4.9) is not computable in practical applications.
Let $\hat{m}_i = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$ and $\hat{\pi}_i^A = \pi(\mathbf{x}_i, \hat{\boldsymbol{\theta}})$ be respectively the estimators of $m_i$ and $\pi_i^A$ under the assumed models $\xi$ and $q$. Under the two-sample setting described in Section 2, the two DR estimators of $\mu_y$ proposed by Chen et al. (2020) are given by

\[
\hat{\mu}_{\mathrm{DR1}} = \frac{1}{N} \sum_{i \in S_A} \frac{y_i - \hat{m}_i}{\hat{\pi}_i^A} + \frac{1}{N} \sum_{i \in S_B} d_i^B\, \hat{m}_i \tag{4.10}
\]

and

\[
\hat{\mu}_{\mathrm{DR2}} = \frac{1}{\hat{N}_A} \sum_{i \in S_A} \frac{y_i - \hat{m}_i}{\hat{\pi}_i^A} + \frac{1}{\hat{N}_B} \sum_{i \in S_B} d_i^B\, \hat{m}_i, \tag{4.11}
\]

where $d_i^B$ are the design weights for the probability sample $S_B$, $\hat{N}_A = \sum_{i \in S_A} 1/\hat{\pi}_i^A$ and $\hat{N}_B = \sum_{i \in S_B} d_i^B$. The estimator $\hat{\mu}_{\mathrm{DR2}}$ using the estimated population sizes has better performance in terms of bias and mean squared error and should be used in practice.
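A minimal sketch of the estimator $\hat{\mu}_{\mathrm{DR2}}$ in (4.11), assuming the estimated propensity scores, the fitted outcome-model values on both samples, and the design weights are supplied:

```python
import numpy as np

def dr2_estimator(y_A, m_A_hat, pi_A_hat, m_B_hat, d_B):
    """Doubly robust estimator (4.11): a propensity score weighted
    adjustment of the outcome-model errors over S_A plus the
    design-weighted prediction term over S_B, each normalized by an
    estimated population size."""
    w_A = 1.0 / pi_A_hat
    adjustment = np.sum(w_A * (y_A - m_A_hat)) / np.sum(w_A)  # / N_A-hat
    prediction = np.sum(d_B * m_B_hat) / np.sum(d_B)          # / N_B-hat
    return adjustment + prediction
```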
The probability survey design, denoted as $p$, is an integral part of the theoretical framework for assessing the two estimators $\hat{\mu}_{\mathrm{DR1}}$ and $\hat{\mu}_{\mathrm{DR2}}$. It is assumed that $S_A$ and $S_B$ are selected independently, which implies that $E_{pq}(\cdot) = E_p\{E_q(\cdot)\} = E_q\{E_p(\cdot)\}$. Consistency of the estimators $\hat{\mu}_{\mathrm{DR1}}$ and $\hat{\mu}_{\mathrm{DR2}}$ can be established under either the $\xi p$ or the $pq$ framework. It should be noted that even if the non-probability sample $S_A$ is a simple random sample with $\pi_i^A = n_A/N$, the doubly robust estimator in the form of (4.9) does not reduce to the model-based prediction estimator given in (3.3).
4.4 The pseudo empirical likelihood approach
The pseudo empirical
likelihood (PEL) methods for probability survey samples have been under
development over the past two decades. Two early papers on the topic are Chen
and Sitter (1999) on point estimation incorporating auxiliary information and
Wu and Rao (2006) on PEL ratio confidence intervals. The PEL approaches are
further used for multiple frame surveys (Rao and Wu, 2010a) and Bayesian
inferences with survey data (Rao and Wu, 2010b; Zhao, Ghosh, Rao and Wu,
2020b). Using the PEL methods for general inferential problems with complex
surveys has been studied in two recent papers (Zhao and Wu, 2019; Zhao, Rao and
Wu, 2020a).
Chen, Li, Rao and Wu (2022) showed that the PEL provides an attractive alternative approach to inference with non-probability survey samples. Let $\hat{\pi}_i^A$ be the estimated propensity scores under an assumed parametric or non-parametric model, $q$. The PEL function for the non-probability survey sample $S_A$ is defined as

\[
\ell_{\mathrm{PEL}}(\mathbf{p}) = \frac{n_A}{\hat{N}_A} \sum_{i \in S_A} \hat{d}_i^A \log p_i, \tag{4.12}
\]

where $\mathbf{p} = (p_1, p_2, \ldots, p_{n_A})$ is a discrete probability measure over the $n_A$ selected units in $S_A$, $\hat{d}_i^A = 1/\hat{\pi}_i^A$, and $\hat{N}_A = \sum_{i \in S_A} \hat{d}_i^A$, which is defined earlier in Section 4. Without using any additional information, maximizing $\ell_{\mathrm{PEL}}(\mathbf{p})$ under the normalization constraint $\sum_{i \in S_A} p_i = 1$ leads to $\hat{p}_i = \hat{d}_i^A/\hat{N}_A$. The maximum PEL estimator of $\mu_y$ is given by

\[
\hat{\mu}_{\mathrm{PEL}} = \sum_{i \in S_A} \hat{p}_i\, y_i, \tag{4.13}
\]

which is identical to the IPW estimator $\hat{\mu}_{\mathrm{IPW2}}$ given in (4.8).
The PEL approach to non-probability survey samples provides flexibility in combining information through additional constraints, and in constructing confidence intervals and conducting hypothesis tests using the PEL ratio statistic. The maximum PEL estimator $\hat{\mu}_{\mathrm{PEL}}$ is doubly robust if $\hat{\mathbf{p}} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{n_A})$ is the maximizer of $\ell_{\mathrm{PEL}}(\mathbf{p})$ under both the normalization constraint and the model-calibration constraint given by

\[
\sum_{i \in S_A} p_i\, \hat{m}_i = \frac{1}{\hat{N}_B} \sum_{i \in S_B} d_i^B\, \hat{m}_i, \tag{4.14}
\]

where the right hand side is computed using the fitted values $\hat{m}_i = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$ from an assumed outcome regression model, $\xi$. The equation (4.14) is a modified version of the original model-calibration constraint of Wu and Sitter (2001) using the probability sample $S_B$. Chen et al. (2022) contain further details on the asymptotic distributions of the PEL ratio statistic and simulation studies on the performance of PEL ratio confidence intervals on a finite population proportion.
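As a computational note, the constrained maximizer under (4.14) can be obtained from the standard empirical likelihood Lagrange multiplier equation. The following Newton-type sketch is one way to do it, under the assumption that the calibration constraint is feasible; the variable names are illustrative:

```python
import numpy as np

def pel_calibrated_weights(d_A_hat, m_A_hat, mu_m_B, n_iter=50):
    """Maximizer of the PEL function under the normalization and
    model-calibration constraints, via Newton's method on the usual
    empirical likelihood Lagrange multiplier equation.
    d_A_hat: estimated basic weights 1 / pi_A_hat for units in S_A;
    m_A_hat: fitted outcome-model values on S_A;
    mu_m_B : right hand side of (4.14), the design-weighted mean of
             the fitted values over S_B."""
    w = d_A_hat / d_A_hat.sum()   # normalized weights; p_i = w_i at lambda = 0
    u = m_A_hat - mu_m_B          # calibration variable
    lam = 0.0
    for _ in range(n_iter):
        g = np.sum(w * u / (1.0 + lam * u))             # constraint residual
        H = -np.sum(w * u ** 2 / (1.0 + lam * u) ** 2)  # derivative of g
        step = g / H
        lam_new = lam - step
        while np.any(1.0 + lam_new * u <= 0):           # keep all p_i > 0
            step /= 2.0
            lam_new = lam - step
        lam = lam_new
    p = w / (1.0 + lam * u)
    return p / p.sum()

# mu_PEL = np.sum(pel_calibrated_weights(d_A_hat, m_A_hat, mu_m_B) * y_A)
```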