Statistical inference with non-probability survey samples
Section 6. Variance estimation
Variance estimation under the two-sample setup with $\mathcal{S}_A$ and $\mathcal{S}_B$ involves at least two different sources of variation. The probability sampling design for the reference sample $\mathcal{S}_B$ remains one of the sources regardless of the approach used for the non-probability survey sample. Estimation of the variance component due to the use of $\mathcal{S}_B$ requires either suitable variance approximation formulas or replication weights as part of the dataset from the reference probability sample. Our discussion in this section assumes that a design-based variance estimator for the survey weighted point estimator based on $\mathcal{S}_B$ is available.
6.1 Variance estimation for mass imputation estimators
Variance estimation for a model-based prediction estimator of the population mean $\mu_y$ involves first deriving the asymptotic variance formula for the estimator under the assumed outcome regression model or the imputation model, $\xi$, and the probability sampling design, $p$, and then using plug-in estimators for the various unknown population quantities.
The mass imputation estimator $\hat{\mu}_{y\,\mathrm{MI}}$ given in (3.5) is a special type of model-based prediction estimator, where the model refers to the one used for imputation and is not necessarily the same as the outcome regression model. The imputation method plays a key role in deriving the asymptotic variance formula, and the variance estimator needs to be constructed accordingly. Noting that $\hat{\mu}_{y\,\mathrm{MI}}$ is a Hájek type estimator due to the use of the estimated population size $\hat{N}^B = \sum_{i \in \mathcal{S}_B} d_i^B$, derivations of the asymptotic variance formula start with putting the true value $N$ in first and then dealing with $\hat{\mu}_{y\,\mathrm{MI}}$ as a ratio estimator. Kim et al. (2021) considered variance estimation for $\hat{\mu}_{y\,\mathrm{MI}} = N^{-1} \sum_{i \in \mathcal{S}_B} d_i^B y_i^*$, where $y_i^* = m(x_i, \hat{\beta})$ is the imputed value for $y_i$ based on the semiparametric model (3.1). The asymptotic variance formula is developed in two steps. First, a linearized version of $\hat{\mu}_{y\,\mathrm{MI}}$ is obtained by using a Taylor series expansion at $\beta^*$, where $\beta^*$ is the probability limit of $\hat{\beta}$ such that $\hat{\beta} = \beta^* + O_p(n_A^{-1/2})$. Second, two variance components are derived for $\hat{\mu}_{y\,\mathrm{MI}} - \mu_y$ based on the linearized version, using the semiparametric model (3.1) and the sampling design for $\mathcal{S}_B$. The process is tedious, as is the case for most model-based variance estimation methods. A bootstrap variance estimator turns out to be more attractive for practical applications. See Kim et al. (2021) for further details.
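As a rough illustration of the bootstrap idea mentioned above, the following Python sketch resamples the non-probability sample and the reference sample separately and recomputes a Hájek-type mass imputation estimate on each replicate. It is a minimal sketch under stated assumptions, not the procedure of Kim et al. (2021): the working model is taken to be linear, the reference sample is treated as a single-stage sample drawn with replacement, and the function names `mass_imputation` and `bootstrap_variance` as well as the synthetic data are hypothetical.

```python
import numpy as np

def mass_imputation(xa, ya, xb, db):
    """Hajek-type mass imputation estimate of the mean (assumed linear working model)."""
    Xa = np.column_stack([np.ones(len(xa)), xa])
    Xb = np.column_stack([np.ones(len(xb)), xb])
    # Fit the imputation model on the non-probability sample S_A.
    beta, *_ = np.linalg.lstsq(Xa, ya, rcond=None)
    # Impute y for every unit in the reference sample S_B.
    y_star = Xb @ beta
    return np.sum(db * y_star) / np.sum(db)

def bootstrap_variance(xa, ya, xb, db, n_boot=500, seed=0):
    """With-replacement bootstrap applied separately to S_A and S_B (assumption)."""
    rng = np.random.default_rng(seed)
    na, nb = len(ya), len(db)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        ia = rng.integers(0, na, na)   # resample S_A
        ib = rng.integers(0, nb, nb)   # resample S_B, carrying the design weights
        reps[b] = mass_imputation(xa[ia], ya[ia], xb[ib], db[ib])
    return reps.var(ddof=1)

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
xa = rng.normal(1.0, 1.0, 300)                    # covariate in S_A
ya = 2.0 + 3.0 * xa + rng.normal(0.0, 1.0, 300)   # outcome observed in S_A only
xb = rng.normal(0.0, 1.0, 200)                    # covariate in S_B
db = np.full(200, 50.0)                           # design weights of S_B

est = mass_imputation(xa, ya, xb, db)
var_boot = bootstrap_variance(xa, ya, xb, db)
print(est, var_boot)
```

For a stratified or multi-stage reference sample the resampling step for $\mathcal{S}_B$ would have to respect the design, for instance through replication weights supplied with the dataset.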
6.2 Variance estimation for IPW estimators
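The commonly used IPW estimator $\hat{\mu}_{\mathrm{IPW2}}$ given in (4.8) is valid under the assumed model $q$ for the propensity scores. An explicit asymptotic variance formula for $\hat{\mu}_{\mathrm{IPW2}}$ can be derived under the joint $qp$-framework when the propensity scores are estimated using the pseudo maximum likelihood method or an estimating equation based method as discussed in Section 4.1. The theoretical tool is the sandwich-type variance formula for point estimators defined as the solution to a combined system of estimating equations for both $\mu_y$ and $\alpha$. Consider the parametric form $\pi_i^A = \pi(x_i, \alpha)$ for the propensity scores, where the model parameters $\alpha$ are estimated through the estimating equations (4.4) with user-specified functions $h(x, \alpha)$. The first major step in deriving the asymptotic variance formula for $\hat{\mu}_{\mathrm{IPW2}}$ is to write down the system of joint estimating equations for both $\mu_y$ and $\alpha$. Let $\eta = (\mu, \alpha^{\top})^{\top}$ be the vector of the combined parameters. The estimator $\hat{\eta} = (\hat{\mu}_{\mathrm{IPW2}}, \hat{\alpha}^{\top})^{\top}$ is the solution to the system of joint estimating equations $\Phi_n(\eta) = 0$, where
$$\Phi_n(\eta) = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} R_i (y_i - \mu) / \pi_i^A \\ R_i \, h(x_i, \alpha) \end{pmatrix} - \frac{1}{N} \sum_{i \in \mathcal{S}_B} d_i^B \begin{pmatrix} 0 \\ \pi_i^A \, h(x_i, \alpha) \end{pmatrix}, \tag{6.1}$$
with $R_i = 1$ if $i \in \mathcal{S}_A$ and $R_i = 0$ otherwise. The factor $N^{-1}$ is redundant but useful in facilitating asymptotic orders. The estimating functions defined by (6.1) are unbiased under the joint $qp$-framework, i.e., $E_{qp}\{\Phi_n(\eta_0)\} = 0$, where $\eta_0 = (\mu_y, \alpha_0^{\top})^{\top}$. There are two major consequences of the unbiasedness of the estimating equations system. First, consistency of the estimator $\hat{\eta}$ can be argued using the theory of general estimating functions similar to that presented in Section 3.2 of Tsiatis (2006). Second, the asymptotic variance-covariance matrix of $\hat{\eta}$, denoted as $AV(\hat{\eta})$, has the standard sandwich form and is given by
$$AV(\hat{\eta}) = \big[ E\{\phi_n(\eta_0)\} \big]^{-1} \, \mathrm{Var}\{\Phi_n(\eta_0)\} \, \big[ E\{\phi_n(\eta_0)\}^{\top} \big]^{-1},$$
where $\phi_n(\eta) = \partial \Phi_n(\eta) / \partial \eta$, which depends on the forms of $\pi(x, \alpha)$ and $h(x, \alpha)$. The term $\mathrm{Var}\{\Phi_n(\eta_0)\}$ consists of two components, one due to the propensity score model $q$ and the other from the probability sampling design for $\mathcal{S}_B$. More specifically, we have
$$\mathrm{Var}\{\Phi_n(\eta_0)\} = V_q(A_1) + V_p(A_2),$$
where $V_q(\cdot)$ denotes the variance under the propensity score model $q$ and $V_p(\cdot)$ represents the design-based variance under the probability sampling design $p$, and $A_1$ and $A_2$ are respectively the first and the second sum on the right hand side of (6.1) evaluated at $\eta_0$. The analytic expression for $V_q(A_1)$ follows immediately from $V_q(R_i) = \pi_i^A (1 - \pi_i^A)$ and the independence among $R_1, \ldots, R_N$. The design-based variance component $V_p(A_2)$ requires additional information on the survey design for $\mathcal{S}_B$ or a suitable variance approximation formula for the given design.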
The asymptotic variance formula for the IPW estimator $\hat{\mu}_{\mathrm{IPW2}}$ is the first diagonal element of the matrix $AV(\hat{\eta})$. The final variance estimator for $\hat{\mu}_{\mathrm{IPW2}}$ can then be obtained by replacing the various population quantities with sample-based moment estimators. Chen et al. (2020) presented the variance estimator with explicit expressions when the propensity scores $\pi_i^A$ are modelled by logistic regression and $\hat{\alpha}$ is obtained by the pseudo maximum likelihood method.
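The plug-in construction described above can be sketched numerically. The Python code below is an illustrative sketch only and should not be read as the explicit expressions of Chen et al. (2020). It assumes a logistic propensity model with the pseudo maximum likelihood score (so the user-specified function is taken to be $h(x, \alpha) = x$), a Hájek-type IPW estimator, a with-replacement approximation for the design-based variance component, and a population size estimated by the sum of the design weights. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_propensity(Xa, Xb, db, n_iter=50, tol=1e-10):
    """Newton solve of sum_A x_i - sum_B d_i * pi(x_i, alpha) * x_i = 0.
    Assumes the first column of X is an intercept and n_A < sum(db)."""
    alpha = np.zeros(Xa.shape[1])
    alpha[0] = np.log(len(Xa) / (db.sum() - len(Xa)))   # logit(n_A / N_hat) start
    for _ in range(n_iter):
        pb = expit(Xb @ alpha)
        score = Xa.sum(axis=0) - (db * pb) @ Xb
        hess = (Xb * (db * pb * (1 - pb))[:, None]).T @ Xb
        step = np.linalg.solve(hess, score)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha

def ipw2_sandwich(Xa, ya, Xb, db):
    """Hajek-type IPW estimate and a plug-in sandwich variance estimate."""
    N = db.sum()                       # assumption: N estimated from S_B
    alpha = fit_propensity(Xa, Xb, db)
    pa, pb = expit(Xa @ alpha), expit(Xb @ alpha)
    mu = np.sum(ya / pa) / np.sum(1 / pa)
    k = Xa.shape[1]
    # Derivative of the joint estimating functions with respect to (mu, alpha).
    phi = np.zeros((k + 1, k + 1))
    phi[0, 0] = -np.sum(1 / pa) / N
    phi[0, 1:] = -(((ya - mu) * (1 - pa) / pa) @ Xa) / N
    phi[1:, 1:] = -((Xb * (db * pb * (1 - pb))[:, None]).T @ Xb) / N
    # Component under the propensity model: V_q(R_i) = pi_i (1 - pi_i),
    # estimated from S_A with weights 1 / pi_i.
    g = np.column_stack([(ya - mu) / pa, Xa])
    Vq = (g * (1 - pa)[:, None]).T @ g / N**2
    # Design-based component for S_B, with-replacement approximation (assumption).
    z = np.column_stack([np.zeros(len(db)), Xb * (db * pb)[:, None]]) / N
    zc = z - z.mean(axis=0)
    nb = len(db)
    Vp = nb / (nb - 1) * (zc.T @ zc)
    phi_inv = np.linalg.inv(phi)
    av = phi_inv @ (Vq + Vp) @ phi_inv.T
    return mu, av[0, 0]                # first diagonal element is for mu

# Synthetic population and samples, for illustration only.
rng = np.random.default_rng(2022)
N = 20000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
in_a = rng.random(N) < expit(-3.0 + 0.5 * x)          # non-probability sample S_A
idx_b = rng.choice(N, size=1000, replace=False)       # reference sample S_B (SRS)
db = np.full(1000, N / 1000)

mu_hat, var_hat = ipw2_sandwich(X[in_a], y[in_a], X[idx_b], db)
print(mu_hat, np.sqrt(var_hat))
```

The population size cancels in the sandwich product, so the choice of the scaling constant affects only the intermediate matrices. For a complex design the with-replacement term would be replaced by a design-appropriate variance estimator for the weighted total from $\mathcal{S}_B$.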
6.3 Variance estimation for doubly robust estimators
Variance estimation for the doubly robust estimator turns out to be a challenging problem. While double robustness is a desirable property for point estimation, it creates a dilemma for variance estimation. The estimator $\hat{\mu}_{\mathrm{DR2}}$ given in (4.11) is consistent if either the propensity score model $q$ or the outcome regression model $\xi$ is correctly specified. There is no need to know which model is correctly specified, which is the most crucial feature behind double robustness. This ambiguity, however, becomes a problem for variance estimation. The asymptotic variance formula under the model $q$ is usually different from the one under the model $\xi$ and, consequently, it is difficult to construct a consistent variance estimator when it is unknown which of the two models is correctly specified.
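Several strategies have been proposed in the literature for variance estimation with doubly robust estimators. A naive approach is to use the variance estimator derived under the assumed propensity score model $q$ and take the risk that such a variance estimator might have non-negligible bias under the outcome regression model. The good news is that, under the propensity score model, the estimation of the parameters $\beta$ for the outcome regression model has no impact asymptotically on the variance of doubly robust estimators. This can be seen by using $\hat{\mu}_{\mathrm{DR1}}$ of (4.10) as an example. Let $\hat{m}_i = m(x_i, \hat{\beta})$, where $\hat{\beta}$ is obtained based on the working model (3.1), which is not necessarily correct. Let $\beta^*$ be the probability limit of $\hat{\beta}$ such that $\hat{\beta} = \beta^* + O_p(n_A^{-1/2})$ regardless of the true outcome regression model (White, 1982). Let $\hat{\mu}_{\mathrm{DR1}}(\beta)$ denote the estimator $\hat{\mu}_{\mathrm{DR1}}$ with $\hat{\beta}$ replaced by a generic $\beta$ and let $\hat{\mu}^*_{\mathrm{DR1}} = \hat{\mu}_{\mathrm{DR1}}(\beta^*)$. It can be seen that
$$\frac{\partial}{\partial \beta} \hat{\mu}_{\mathrm{DR1}}(\beta) \Big|_{\beta = \beta^*} = \frac{1}{N} \sum_{i \in \mathcal{S}_B} d_i^B \, \dot{m}(x_i, \beta^*) - \frac{1}{N} \sum_{i \in \mathcal{S}_A} \frac{\dot{m}(x_i, \beta^*)}{\hat{\pi}_i^A}, \tag{6.2}$$
where $\dot{m}(x, \beta) = \partial m(x, \beta) / \partial \beta$. Since the two terms on the right hand side of (6.2) are both consistent estimators of $N^{-1} \sum_{i=1}^{N} \dot{m}(x_i, \beta^*)$ under the propensity score model, we conclude that the derivative in (6.2) is of order $o_p(1)$ and that its inner product with $\hat{\beta} - \beta^*$ is of order $o_p(n_A^{-1/2})$. It follows that $\hat{\mu}_{\mathrm{DR1}} = \hat{\mu}^*_{\mathrm{DR1}} + o_p(n_A^{-1/2})$. The same arguments apply to $\hat{\mu}_{\mathrm{DR2}}$. We can treat $\hat{\beta}$ as if it is fixed in deriving the asymptotic variance for $\hat{\mu}_{\mathrm{DR1}}$ and $\hat{\mu}_{\mathrm{DR2}}$ under the assumed propensity score model. The techniques described in Section 6.2 can be directly used, with the first estimating function in (6.1) replaced by the one defining $\hat{\mu}_{\mathrm{DR1}}$ or $\hat{\mu}_{\mathrm{DR2}}$. See Theorem 2 of Chen et al. (2020) for further details. The variance estimator derived under the propensity score model, however, is generally biased under the outcome regression model.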
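Chen et al. (2020) also described a technique using the original idea presented in Kim and Haziza (2014) for the construction of the so-called doubly robust variance estimator. The technique is a delicate one with some theoretical attractiveness but has various issues for practical applications. We use $\hat{\mu}_{\mathrm{DR1}}$ as an example to illustrate the steps for the construction of the doubly robust variance estimator. Let
$$\hat{\mu}_{\mathrm{DR1}}(\alpha, \beta) = \frac{1}{N} \sum_{i \in \mathcal{S}_A} \frac{y_i - m(x_i, \beta)}{\pi(x_i, \alpha)} + \frac{1}{N} \sum_{i \in \mathcal{S}_B} d_i^B \, m(x_i, \beta).$$
It follows that $\hat{\mu}_{\mathrm{DR1}} = \hat{\mu}_{\mathrm{DR1}}(\hat{\alpha}, \hat{\beta})$ if $\hat{\alpha}$ and $\hat{\beta}$ are from the original estimation methods. The first step is to modify the estimation of $\alpha$ and $\beta$ such that $\hat{\alpha}$ and $\hat{\beta}$ are obtained as solutions to
$$\frac{\partial}{\partial \alpha} \hat{\mu}_{\mathrm{DR1}}(\alpha, \beta) = 0 \quad \text{and} \quad \frac{\partial}{\partial \beta} \hat{\mu}_{\mathrm{DR1}}(\alpha, \beta) = 0. \tag{6.3}$$
Under the logistic regression model $q$, where $\pi_i^A = \exp(x_i^{\top} \alpha) / \{1 + \exp(x_i^{\top} \alpha)\}$, and the linear regression model $\xi$, where $m(x_i, \beta) = x_i^{\top} \beta$, the equation system (6.3) becomes
$$\sum_{i \in \mathcal{S}_A} \Big\{ \frac{1}{\pi(x_i, \alpha)} - 1 \Big\} (y_i - x_i^{\top} \beta) \, x_i = 0, \tag{6.4}$$
$$\sum_{i \in \mathcal{S}_A} \frac{x_i}{\pi(x_i, \alpha)} - \sum_{i \in \mathcal{S}_B} d_i^B x_i = 0. \tag{6.5}$$
The estimating equations in (6.5) are unbiased under the joint $qp$-framework. They are identical to (4.5) discussed in Section 4.1.2. The estimating equations in (6.4) are also unbiased under the outcome regression model, but they are different from the quasi score equations given in (3.2). The estimators $\hat{\alpha}$ and $\hat{\beta}$ obtained as solutions to (6.4) and (6.5) are less stable than those from standard methods. In addition, the equation system (6.4) and (6.5) will not have a solution if $\alpha$ and $\beta$ are not of the same dimension, since the number of equations in (6.4) is decided by the dimension of $\alpha$ and the number of equations in (6.5) is the same as the dimension of $\beta$. The final estimator $\hat{\mu}_{\mathrm{DR1}}$ also suffers from efficiency losses when $\alpha$ and $\beta$ are estimated by solving (6.4) and (6.5).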
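To make the modified estimation step concrete, the sketch below solves the two sets of equations numerically in Python. It assumes the logistic propensity model and the linear outcome model with the same covariate vector in both, under which the equations obtained by differentiating with respect to $\beta$ are calibration equations in $\alpha$ alone, $\sum_{\mathcal{S}_A} x_i / \pi(x_i, \alpha) = \sum_{\mathcal{S}_B} d_i^B x_i$, and the equations obtained by differentiating with respect to $\alpha$ are, for a given $\alpha$, weighted normal equations in $\beta$ with weights $1/\pi_i^A - 1$. The function names, starting values and synthetic data are hypothetical, and no safeguards against non-convergence are included.

```python
import numpy as np

def solve_calibration(Xa, Xb, db, n_iter=100, tol=1e-10):
    """Newton solve of sum_A x_i / pi(x_i, alpha) - sum_B d_i x_i = 0 (logistic pi).
    Assumes the first column of X is an intercept and n_A < sum(db)."""
    target = db @ Xb
    alpha = np.zeros(Xa.shape[1])
    alpha[0] = np.log(len(Xa) / (db.sum() - len(Xa)))
    for _ in range(n_iter):
        w = np.exp(-(Xa @ alpha))            # equals 1/pi_i - 1 under the logistic model
        resid = (1.0 + w) @ Xa - target      # left hand side of the calibration equations
        neg_jac = (Xa * w[:, None]).T @ Xa   # minus the Jacobian of resid
        step = np.linalg.solve(neg_jac, resid)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha

def solve_outcome(Xa, ya, alpha):
    """Solve sum_A (1/pi_i - 1)(y_i - x_i' beta) x_i = 0, which is linear in beta."""
    w = np.exp(-(Xa @ alpha))
    lhs = (Xa * w[:, None]).T @ Xa
    rhs = (Xa * w[:, None]).T @ ya
    return np.linalg.solve(lhs, rhs)

# Synthetic population and samples, for illustration only.
rng = np.random.default_rng(7)
N = 20000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
in_a = rng.random(N) < 1.0 / (1.0 + np.exp(3.0 - 0.5 * x))
idx_b = rng.choice(N, size=1000, replace=False)
Xa, ya = X[in_a], y[in_a]
Xb, db = X[idx_b], np.full(1000, N / 1000)

alpha = solve_calibration(Xa, Xb, db)
beta = solve_outcome(Xa, ya, alpha)
print(alpha, beta)
```

The two-stage structure of the sketch relies on the covariate vectors of the two models having the same dimension, which is the restriction noted above.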
The reason behind the use of the equation system (6.3) is purely technical. It can be shown through a first order Taylor series expansion that the estimators $\hat{\alpha}$ and $\hat{\beta}$ obtained from (6.3) have no impact asymptotically on the variance of $\hat{\mu}_{\mathrm{DR1}}$. This technical maneuver allows simple explicit expressions to be obtained easily for the variance $V_{qp}(\hat{\mu}_{\mathrm{DR1}} - \mu_y)$ under the $qp$ framework and for the prediction variance $V_{\xi p}(\hat{\mu}_{\mathrm{DR1}} - \mu_y)$ under the $\xi p$ framework. Construction of the doubly robust variance estimator for $\hat{\mu}_{\mathrm{DR1}}$ starts with the plug-in estimator for $V_{qp}(\hat{\mu}_{\mathrm{DR1}} - \mu_y)$ under the propensity score model $q$. A bias-correction term is then added to obtain a valid estimator for $V_{\xi p}(\hat{\mu}_{\mathrm{DR1}} - \mu_y)$ under the outcome regression model $\xi$. The happy ending of the story is that the bias-correction term has the analytic form
$$\frac{1}{N^2} \sum_{i=1}^{N} \Big( 1 - \frac{R_i}{\pi_i^A} \Big) \sigma_i^2,$$
where $\sigma_i^2 = V_{\xi}(y_i \mid x_i)$, which is negligible under the propensity score model $q$. The bias-corrected variance estimator is valid under either the propensity score model or the outcome regression model.
A doubly robust variance estimator for the commonly used $\hat{\mu}_{\mathrm{DR2}}$ is not available in the literature. A practical solution is to use bootstrap methods. Chen et al. (2022) demonstrated that standard with-replacement bootstrap procedures applied separately to $\mathcal{S}_A$ and $\mathcal{S}_B$ provide doubly robust confidence intervals using the pseudo empirical likelihood approach to non-probability survey samples when the reference sample is selected by single-stage unequal probability sampling designs. Complications will arise when the probability sample $\mathcal{S}_B$ uses stratified multi-stage sampling methods, a known challenge for variance estimation with complex surveys. Construction of doubly robust variance estimators for the doubly robust estimator $\hat{\mu}_{\mathrm{DR2}}$ under general settings deserves further research effort.
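The separate resampling of the two samples can be sketched as follows. This is a simplified Python illustration that bootstraps a Hájek-type doubly robust point estimate directly. It is not the pseudo empirical likelihood procedure of Chen et al. (2022), and it assumes a logistic propensity model fitted by pseudo maximum likelihood, a linear outcome model fitted by least squares, and a single-stage reference sample resampled with replacement. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_propensity(Xa, Xb, db, n_iter=50, tol=1e-10):
    """Newton solve of sum_A x_i - sum_B d_i * pi(x_i, alpha) * x_i = 0 (logistic pi).
    Assumes the first column of X is an intercept and n_A < sum(db)."""
    alpha = np.zeros(Xa.shape[1])
    alpha[0] = np.log(len(Xa) / (db.sum() - len(Xa)))
    for _ in range(n_iter):
        pb = 1.0 / (1.0 + np.exp(-(Xb @ alpha)))
        score = Xa.sum(axis=0) - (db * pb) @ Xb
        hess = (Xb * (db * pb * (1 - pb))[:, None]).T @ Xb
        step = np.linalg.solve(hess, score)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha

def dr2(Xa, ya, Xb, db):
    """Hajek-type doubly robust estimate of the population mean."""
    alpha = fit_propensity(Xa, Xb, db)
    inv_pa = 1.0 + np.exp(-(Xa @ alpha))              # 1 / pi_i for units in S_A
    beta, *_ = np.linalg.lstsq(Xa, ya, rcond=None)    # working linear outcome model
    resid_part = np.sum(inv_pa * (ya - Xa @ beta)) / np.sum(inv_pa)
    pred_part = np.sum(db * (Xb @ beta)) / np.sum(db)
    return resid_part + pred_part

def bootstrap_var(Xa, ya, Xb, db, n_boot=200, seed=0):
    """With-replacement bootstrap applied separately to S_A and S_B."""
    rng = np.random.default_rng(seed)
    na, nb = len(ya), len(db)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        ia = rng.integers(0, na, na)
        ib = rng.integers(0, nb, nb)
        reps[b] = dr2(Xa[ia], ya[ia], Xb[ib], db[ib])
    return reps.var(ddof=1)

# Synthetic population and samples, for illustration only.
rng = np.random.default_rng(11)
N = 20000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
in_a = rng.random(N) < 1.0 / (1.0 + np.exp(3.0 - 0.5 * x))
idx_b = rng.choice(N, size=1000, replace=False)
db = np.full(1000, N / 1000)

mu_dr = dr2(X[in_a], y[in_a], X[idx_b], db)
var_dr = bootstrap_var(X[in_a], y[in_a], X[idx_b], db)
print(mu_dr, np.sqrt(var_dr))
```

Both working models are refitted on every replicate, so the replicate-to-replicate variation does not depend on knowing which of the two models is correct. A stratified multi-stage reference sample would require a design-specific resampling scheme in place of the simple resampling of $\mathcal{S}_B$ used here.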
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa