Replication variance estimation after sample-based calibration
Section 2. Sample-based regression calibration
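We consider a survey of a population $U$ with sample $A$, weights $w_i$ and target variables $y_i$. For a given survey estimator $\hat{\theta}$ constructed using the weights $w_i$, inference is conducted by replication, implemented through the provision of $R$ sets of replicate weights $w_i^{(r)}$, $r = 1, \ldots, R$, and the variance estimation formula

$$\hat{V}(\hat{\theta}) = C \sum_{r=1}^{R} \left( \hat{\theta}^{(r)} - \hat{\theta} \right)^2, \qquad (2.1)$$

where the $\hat{\theta}^{(r)}$ are computed in the same manner as $\hat{\theta}$ but replacing $w_i$ by the $w_i^{(r)}$. The constant $C$ depends on the replication method. For simplicity, we focus in what follows on the Horvitz-Thompson estimator of $t_y = \sum_{i \in U} y_i$, denoted by $\hat{t}_y = \sum_{i \in A} w_i y_i$. In this case, many replication methods of the form (2.1) lead to a design consistent estimator of $\mathrm{Var}(\hat{t}_y)$. We will refer to this survey as the “primary survey”.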
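As an illustration of a variance estimator of the form (2.1), the following minimal sketch applies it to a Horvitz-Thompson total. The delete-one jackknife, its constant, the toy data and the function names are assumptions made for the example; they are not prescribed by the text.

```python
def ht_total(weights, y):
    """Horvitz-Thompson estimate of a total: sum of w_i * y_i."""
    return sum(w * v for w, v in zip(weights, y))


def replication_variance(weights, replicate_weights, y, C):
    """Variance estimate of the form (2.1): C * sum_r (theta_r - theta)^2."""
    theta = ht_total(weights, y)
    return C * sum((ht_total(w_r, y) - theta) ** 2 for w_r in replicate_weights)


# Toy sample of n = 4 units with equal weights (hypothetical values).
y = [3.0, 5.0, 4.0, 8.0]
n = len(y)
w = [10.0] * n

# Delete-one jackknife: replicate r drops unit r and rescales the others.
reps = [[0.0 if i == r else w[i] * n / (n - 1) for i in range(n)]
        for r in range(n)]
C = (n - 1) / n  # jackknife constant; other replication methods use another C

v_hat = replication_variance(w, reps, y, C)
```

With equal weights, this jackknife reproduces the usual estimate $N^2 s^2 / n$ for an estimated total, which gives a convenient check of the constant.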
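We are interested in creating adjusted weights $w_i^*$ that are calibrated to a set of control totals from a secondary survey of $U$, with sample $A_c$ and weights $w_{ci}$. An estimator from this survey is denoted by $\hat{\theta}_c$. For the second survey, a replication-based variance estimator is also provided, with replicate weights $w_{ci}^{(r)}$, $r = 1, \ldots, R_c$, and replication-specific constant $C_c$. The control variables will be denoted by $x_i$, with estimated totals $\hat{t}_{cx} = \sum_{i \in A_c} w_{ci} x_i$. Using regression estimation as a framework for calibration, the adjusted estimator is

$$\hat{t}_{y,\mathrm{reg}} = \hat{t}_y + \left( \hat{t}_{cx} - \hat{t}_x \right)^{\top} \hat{\beta}, \qquad (2.2)$$

where $\hat{t}_x = \sum_{i \in A} w_i x_i$ and

$$\hat{\beta} = \left( X^{\top} W X \right)^{-1} X^{\top} W Y,$$

with $X$ a matrix with $i$th row equal to $x_i^{\top}$, $W$ a diagonal matrix with $i$th entry $w_i$, and $Y$ a vector containing the $y_i$. Hence, the calibrated weights can be written as

$$w_i^* = w_i \left( 1 + \left( \hat{t}_{cx} - \hat{t}_x \right)^{\top} \left( X^{\top} W X \right)^{-1} x_i \right).$$

We note that post-stratification is a special case of regression estimation; see Särndal, Swensson and Wretman (1992, Chapter 7.6).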
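The regression calibration described above can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the function name `calibrate`, the toy data and the control totals are hypothetical. It computes weights of the form $w_i (1 + (\hat{t}_{cx} - \hat{t}_x)^{\top} (X^{\top} W X)^{-1} x_i)$.

```python
import numpy as np


def calibrate(w, X, controls):
    """Regression-calibrated weights w_i * (1 + (controls - X'w)' (X'WX)^{-1} x_i)."""
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = X.T @ w                              # HT totals of the control variables
    XtWX = X.T @ (w[:, None] * X)              # X' W X
    lam = np.linalg.solve(XtWX, np.asarray(controls, dtype=float) - t_x)
    return w * (1.0 + X @ lam)


# Toy primary sample: an intercept and one auxiliary variable (hypothetical values).
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])
y = np.array([4.0, 6.0, 9.0, 15.0])
w = np.array([10.0, 12.0, 8.0, 10.0])
controls = np.array([42.0, 180.0])             # estimated totals from the secondary survey

w_star = calibrate(w, X, controls)

# The calibrated weights reproduce the regression estimator of the total of y.
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
t_reg = w @ y + (controls - X.T @ w) @ beta_hat
```

By construction, `X.T @ w_star` equals `controls`, and `w_star @ y` equals `t_reg`, so the adjustment can be delivered to data users as a single set of weights.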
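To obtain a variance estimator, we follow the traditional linearization approach for regression estimators with respect to the sampling design (see, e.g., Särndal et al., 1992, Chapter 5.5). Under mild regularity conditions (such as design consistency of Horvitz-Thompson estimators and invertibility of required matrices), the linearized version of the regression estimator (2.2) is equal to the difference estimator,

$$\hat{t}_{y,\mathrm{diff}} = \hat{t}_y + \left( \hat{t}_{cx} - \hat{t}_x \right)^{\top} \beta_N = \hat{t}_e + \hat{t}_{cx}^{\top} \beta_N,$$

where $\beta_N$ is the population target of $\hat{\beta}$ and $\hat{t}_e = \sum_{i \in A} w_i e_i$ with $e_i = y_i - x_i^{\top} \beta_N$. The variance of $\hat{t}_{y,\mathrm{diff}}$ is equal to

$$\mathrm{Var}\left( \hat{t}_{y,\mathrm{diff}} \right) = \mathrm{Var}\left( \hat{t}_e \right) + \beta_N^{\top} \mathrm{Var}\left( \hat{t}_{cx} \right) \beta_N, \qquad (2.5)$$

since the two surveys are independent. This “linearized variance” is the variance of the asymptotic distribution of the regression estimator (2.2). In expression (2.5), the first variance term can be estimated using the replicates from the primary survey, and the variance-covariance matrix of the control totals in the second term can be estimated using the replicates from the secondary survey. Hence, the plug-in variance estimator

$$\hat{V}\left( \hat{t}_{y,\mathrm{reg}} \right) = \hat{V}\left( \hat{t}_{\hat{e}} \right) + \hat{\beta}^{\top} \hat{V}\left( \hat{t}_{cx} \right) \hat{\beta}, \qquad (2.6)$$

where $\hat{t}_{\hat{e}} = \sum_{i \in A} w_i \hat{e}_i$ with $\hat{e}_i = y_i - x_i^{\top} \hat{\beta}$, can be used for asymptotically valid inference for $t_y$.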
However, it is often not practical to maintain the two
datasets and associated sets of replicates for variance estimation purposes. In
the context of survey calibration, the organization in charge of creating the
adjusted weights for the primary survey would often prefer to continue
providing their dataset unchanged except for the new calibrated weights and
associated replicate weights, so that data users can perform their analyses
using traditional survey tools. Hence, it is of interest to create a single set
of replicates for the primary survey that can be used to estimate the variance,
while accounting for the fact that the control totals are themselves estimated
from a different survey.
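For reference, the two-dataset plug-in approach that a single set of replicates is meant to replace can be sketched as follows. This is a minimal illustration assuming NumPy and assuming that each term is estimated with a replication formula of the form (2.1); the function name, the jackknife replicates and all numerical values are hypothetical.

```python
import numpy as np


def plug_in_variance(w, w_reps, C, X, y, t_cx, t_cx_reps, C_c):
    """Return the two terms of a plug-in variance estimator and the fitted beta."""
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    e_hat = y - X @ beta                       # estimated regression residuals
    # Term 1: replication variance of the HT total of the residuals (primary survey).
    v_resid = C * np.sum((w_reps @ e_hat - w @ e_hat) ** 2)
    # Term 2: beta' V(t_cx) beta, with V estimated from the secondary replicates.
    v_ctrl = C_c * np.sum(((t_cx_reps - t_cx) @ beta) ** 2)
    return v_resid, v_ctrl, beta


# Toy inputs (all values hypothetical); delete-one jackknife in the primary survey.
n = 6
X = np.column_stack([np.ones(n), [2.0, 3.0, 5.0, 7.0, 4.0, 6.0]])
y = np.array([4.0, 6.0, 9.0, 15.0, 7.0, 12.0])
w = np.full(n, 10.0)
w_reps = np.array([np.where(np.arange(n) == r, 0.0, w * n / (n - 1))
                   for r in range(n)])
t_cx = np.array([62.0, 280.0])                 # control totals, secondary survey
t_cx_reps = np.array([[61.0, 276.0], [63.5, 283.0], [62.2, 281.5]])

v_resid, v_ctrl, beta = plug_in_variance(
    w, w_reps, (n - 1) / n, X, y, t_cx, t_cx_reps, 0.5)
total_variance = v_resid + v_ctrl
```

The sketch needs the replicate weights of the primary survey and the replicate control totals of the secondary survey at the same time, which is the practical burden described above.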
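We therefore propose to construct new replicates for the primary survey to estimate (2.5). Assume for now that $R \geq R_c$, where $R$ and $R_c$ are the numbers of replicates in the primary and secondary surveys, respectively. Starting from the replicate weights $w_i^{(r)}$ for the primary survey variance estimator, a replicate variance estimator of the first term in (2.6) is obtained by using the calibrated replicate weights

$$w_i^{*(r)} = w_i^{(r)} \left( 1 + \left( \hat{t}_{cx} - \hat{t}_x^{(r)} \right)^{\top} \left( X^{\top} W^{(r)} X \right)^{-1} x_i \right), \qquad (2.7)$$

where $\hat{t}_x^{(r)} = \sum_{i \in A} w_i^{(r)} x_i$ and $W^{(r)}$ is the diagonal matrix with $i$th entry $w_i^{(r)}$. These replicate weights are obtained by repeating the calibration for each of the replicate weights $w_i^{(r)}$ and lead to consistent variance estimation for regression estimators, as discussed for the general case in Fuller (2009, Chapter 4). See also Valliant (1993) for the special case of post-stratification.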
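The replicate weights $w_i^{*(r)}$ can be further modified to capture the second term in (2.6) as follows:

$$w_i^{**(r)} = w_i^{*(r)} + a_r \, w_i^{(r)} \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right)^{\top} \left( X^{\top} W^{(r)} X \right)^{-1} x_i, \qquad (2.8)$$

with $\hat{t}_{cx}^{(r)} = \sum_{i \in A_c} w_{ci}^{(r)} x_i$ the replicate control totals from the secondary survey and the constants $a_r$ to be further defined below. Combining (2.7) and (2.8), the resulting replicate weights are

$$w_i^{**(r)} = w_i^{(r)} \left( 1 + \left( \hat{t}_{cx} + a_r \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right) - \hat{t}_x^{(r)} \right)^{\top} \left( X^{\top} W^{(r)} X \right)^{-1} x_i \right). \qquad (2.9)$$

These weights are again obtained by applying the same calibration as for the original weights to each of the replicates, but with replicate control totals $\hat{t}_{cx} + a_r ( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} )$. The resulting replicate estimates are

$$\hat{t}_{y,\mathrm{reg}}^{(r)} = \sum_{i \in A} w_i^{**(r)} y_i.$$

The constants $a_r$ are chosen to account for the difference between the primary and control replication methods, in particular between $R$ and $R_c$ and $C$ and $C_c$, by letting

$$a_r = \begin{cases} \sqrt{C_c / C}, & r = 1, \ldots, R_c, \\ 0, & r = R_c + 1, \ldots, R. \end{cases} \qquad (2.10)$$

This implies that for $r > R_c$ the replicate weights $w_i^{**(r)} = w_i^{*(r)}$ in (2.8), i.e., the unadjusted control totals $\hat{t}_{cx}$ are used to calibrate the replicate weights. While the $a_r$ are written with the first $R_c$ values non-zero, this is for notational convenience only. The assignment of the replicates from the control survey to those of the primary survey should be randomized, to ensure that estimators and replicate estimators from both surveys remain independent regardless of the replication methods.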
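The construction of the combined replicate weights can be sketched as follows. This is a minimal illustration assuming NumPy and assuming that the non-zero constants are set so that their square equals the ratio of the secondary to the primary replication constant, $C_c / C$. The function names, the random seed, the jackknife replicates and all numerical values are hypothetical.

```python
import numpy as np


def calibrate(w, X, controls):
    """Regression calibration of the weights w to the given control totals."""
    lam = np.linalg.solve(X.T @ (w[:, None] * X), controls - X.T @ w)
    return w * (1.0 + X @ lam)


def combined_replicates(w_reps, X, t_cx, t_cx_reps, C, C_c, seed=0):
    """Calibrate each primary replicate to its own replicate control totals.

    w_reps: (R, n) primary replicate weights; t_cx_reps: (R_c, p) replicate
    control totals from the secondary survey, with R >= R_c.
    """
    R, R_c = w_reps.shape[0], t_cx_reps.shape[0]
    if R < R_c:
        raise ValueError("requires R >= R_c")
    a = np.sqrt(C_c / C)                       # assumed non-zero value of a_r
    # Randomly choose which primary replicates receive a control replicate.
    slots = np.random.default_rng(seed).permutation(R)[:R_c]
    targets = np.tile(t_cx, (R, 1))            # a_r = 0: unadjusted control totals
    targets[slots] = t_cx + a * (t_cx_reps - t_cx)
    new_reps = np.stack([calibrate(w_reps[r], X, targets[r]) for r in range(R)])
    return new_reps, targets


# Toy inputs (all values hypothetical); delete-one jackknife in the primary survey.
n = 6
X = np.column_stack([np.ones(n), [2.0, 3.0, 5.0, 7.0, 4.0, 6.0]])
w = np.full(n, 10.0)
w_reps = np.array([np.where(np.arange(n) == r, 0.0, w * n / (n - 1))
                   for r in range(n)])
C = (n - 1) / n
t_cx = np.array([62.0, 280.0])
t_cx_reps = np.array([[61.0, 276.0], [63.5, 283.0], [62.2, 281.5]])
C_c = 0.5

new_reps, targets = combined_replicates(w_reps, X, t_cx, t_cx_reps, C, C_c)
```

Each row of `new_reps` is calibrated to the corresponding row of `targets`. Only $R_c$ of the $R$ rows of `targets` differ from the full-sample control totals, and the rows that receive a control replicate are drawn at random.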
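Using the replicate weights (2.9) with constants (2.10), the replicate variance estimator (2.1) becomes

$$\hat{V}\left( \hat{t}_{y,\mathrm{reg}} \right) = C \sum_{r=1}^{R} \left( \hat{t}_{y,\mathrm{reg}}^{(r)} - \hat{t}_{y,\mathrm{reg}} \right)^2. \qquad (2.11)$$

Ignoring terms of smaller order as well as those with $a_r = 0$, this is approximately equal to

$$C \sum_{r=1}^{R} \left( \hat{t}_e^{(r)} - \hat{t}_e \right)^2 + C_c \sum_{r=1}^{R_c} \left( \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right)^{\top} \beta_N \right)^2 + 2 C \sum_{r=1}^{R_c} a_r \left( \hat{t}_e^{(r)} - \hat{t}_e \right) \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right)^{\top} \beta_N,$$

where $\hat{t}_e^{(r)} = \sum_{i \in A} w_i^{(r)} e_i$. The cross-term is likewise of smaller order because of the independence of the two surveys and the fact that the replicate deviations $\hat{t}_e^{(r)} - \hat{t}_e$ and $\hat{t}_{cx}^{(r)} - \hat{t}_{cx}$ each have expectation approximately equal to zero. Hence, the replicate variance estimator (2.11) inherits the design consistency of the original replication methods for both surveys and is design consistent for the linearized variance (2.5).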
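The approximation used in this argument can be made explicit with one intermediate step. The following is a sketch in notation assumed for illustration: $\hat{\beta}^{(r)}$ denotes the regression coefficient computed with the $r$th replicate weights, $a_r$ the constant applied to the $r$th replicate control totals, and $\hat{t}_e$ the Horvitz-Thompson total of the residuals $e_i = y_i - x_i^{\top} \beta_N$.

```latex
\begin{align*}
\hat{t}_{y,\mathrm{reg}}^{(r)} - \hat{t}_{y,\mathrm{reg}}
  &= \left( \hat{t}_y^{(r)} - \hat{t}_y \right)
     - \left( \hat{t}_x^{(r)} - \hat{t}_x \right)^{\top} \hat{\beta}^{(r)}
     + a_r \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right)^{\top} \hat{\beta}^{(r)}
     + \left( \hat{t}_{cx} - \hat{t}_x \right)^{\top}
       \left( \hat{\beta}^{(r)} - \hat{\beta} \right) \\
  &\approx \left( \hat{t}_e^{(r)} - \hat{t}_e \right)
     + a_r \left( \hat{t}_{cx}^{(r)} - \hat{t}_{cx} \right)^{\top} \beta_N .
\end{align*}
```

The second line replaces $\hat{\beta}^{(r)}$ by $\beta_N$ and drops the last term of the first line, which is a product of two small deviations. Squaring, summing over $r$ and multiplying by $C$ then yields a residual term, a control-total term weighted by $C a_r^2$, and a cross-term.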
Finally, we discuss the case when $R < R_c$. The above approach is readily extended to this case by repeating the $R$ replicates of the primary survey $K$ times, such that $KR \geq R_c$, with $K$ the smallest positive integer for which this inequality is satisfied. The resulting replicate variance estimator is of the same form as (2.1) but with $R$ replaced by $KR$ and $C$ replaced by $C / K$. Then, the method discussed above applies directly to this new replicate variance estimator for the primary survey. For instance, if $R = 120$ and $R_c = 150$, each replicate in the primary survey will be repeated $K = 2$ times, leading to 240 replicates for the primary survey, of which 150 will be modified.
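The bookkeeping for this case can be sketched as follows. The function name and the returned dictionary are hypothetical, and the inputs (120 primary replicates, 150 control replicates) are assumed values chosen to be consistent with the 240 and 150 quoted in the example.

```python
def repetition_plan(R, R_c, C):
    """Smallest K with K * R >= R_c, plus the adjusted replicate count and constant."""
    K = max(1, -(-R_c // R))          # integer ceiling of R_c / R
    return {
        "K": K,
        "replicates": K * R,          # replaces R in the variance formula
        "constant": C / K,            # replaces C in the variance formula
        "modified": R_c,              # replicates that receive a control replicate
    }


# Assumed inputs: 120 primary replicates, 150 control replicates, C = 1.
plan = repetition_plan(R=120, R_c=150, C=1.0)
```

Dividing the constant by $K$ keeps the variance estimate of any unmodified estimator unchanged, since each squared deviation then appears $K$ times in the sum.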