“Optimal” calibration weights under unit nonresponse in survey sampling
Section 2. Calibration estimation
2.1 Calibration estimators under full response
Starting with the full response situation and following the procedure established by Deville and Särndal (1992), the calibration estimator is defined as
where the sample-dependent weights
are chosen so that
while also minimizing the quadratic distance measure
where
and
is diagonal. (Alternative distance measures are considered in both Deville and Särndal (1992) and Haziza and Lesage (2016).)
In other words, given the constraint (2.1), the
should be “as close as possible” to the design weights
which is desirable since
is an unbiased estimator of
The resulting weights are
It turns out that the model-assisted homoskedastic GREG estimator (Särndal, Swensson and Wretman (1992)) is a calibration estimator for which
where
is the unit diagonal matrix of size
Another calibration estimator is the optimal regression estimator (see e.g., Rao (1994) and Montanari (1998)), for which
as shown by Andersson and Thorburn (2005).
Asymptotically, this estimator has (in a design-based sense) minimum variance among linear regression type estimators.
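As a minimal numerical sketch of the full response calibration step, the code below minimizes a quadratic distance (w - d)' R (w - d) subject to X' w = t_x, whose closed-form solution is w = d + R^{-1} X (X' R^{-1} X)^{-1} (t_x - X' d). The function name `calibration_weights`, the parametrization through `R_inv`, and the simulated population are assumptions made for illustration only; with `R_inv = diag(d)` the weights reproduce a homoskedastic GREG-type estimator.

```python
import numpy as np

def calibration_weights(d, X, t_x, R_inv=None):
    """Minimize (w - d)' R (w - d) subject to X' w = t_x.

    d     : design weights, shape (n,)
    X     : auxiliary values for the sampled units, shape (n, p)
    t_x   : known population totals of the auxiliary variables, shape (p,)
    R_inv : inverse of the distance matrix R; defaults to diag(d) (assumption)
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    if R_inv is None:
        R_inv = np.diag(d)
    # Lagrange multiplier step: solve (X' R^{-1} X) lam = t_x - X' d
    lam = np.linalg.solve(X.T @ R_inv @ X, t_x - X.T @ d)
    return d + R_inv @ X @ lam

# Hypothetical population and a simple random sample of size n
rng = np.random.default_rng(0)
N, n = 1000, 100
x_pop = np.column_stack([np.ones(N), rng.gamma(2.0, 2.0, N)])
y_pop = 5.0 + 3.0 * x_pop[:, 1] + rng.normal(0.0, 1.0, N)
t_x = x_pop.sum(axis=0)

idx = rng.choice(N, size=n, replace=False)
X_s, y_s = x_pop[idx], y_pop[idx]
d = np.full(n, N / n)

w = calibration_weights(d, X_s, t_x)
t_cal = w @ y_s

# GREG form with the same ingredients, for comparison
B = np.linalg.solve((X_s * d[:, None]).T @ X_s, (X_s * d[:, None]).T @ y_s)
t_greg = d @ y_s + (t_x - X_s.T @ d) @ B
```

The calibrated weights reproduce the auxiliary totals exactly, and `t_cal` agrees with `t_greg` up to floating-point error, which illustrates why the GREG estimator can be read as a calibration estimator.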
2.2 Calibration estimators under nonresponse
In the nonresponse case, a possible calibration estimator is
where it should hold that
where
if the auxiliary information is known up to the population level. Otherwise,
the unbiased estimator of
(We can also combine the two types of information in the constraint
For a variety of cases, weights fulfilling the requirement (2.2) are presented by e.g., Särndal and Lundström (2005). Using the direct approach, where all information is used in one single calibration, we get
The resulting estimator will henceforth be denoted
(Other approaches, including two-step procedures, are presented and investigated by e.g., Andersson and Särndal (2016).)
An evident question to ask is: What is the underlying distance measure generating these weights? Särndal and Lundström (2005) do not comment on this particular issue, but according to Lundström and Särndal (1999), we should choose
‘as close as possible’ to the
which does not seem quite adequate under nonresponse. Going back to Lundström (1997), we find that the corresponding distance measure is actually
where
and
If we have a random mechanism generating the response set
from the sample
with probabilities
of inclusion, we can view the nonresponse situation as a two-phase design, and this is the assumption we will make in the following. Then we should minimize the distance between
and
Using some modelling
can be estimated by
to be put to use for the distance minimization. But in this paper we will not go in the direction of model-based inference. In order to reduce the bias effect under nonresponse, one could instead, in the distance measure, compare
not with
but with
where
is a constant larger than 1, aiming to compensate for the “average” nonresponse effect.
However, Lundström (1997) shows that in many important cases, namely when one can find a vector
for which
for all
the multiplicative increase in
implies the same resulting calibration weights
This follows from the result that if
for all
we can simplify the expression (2.3) of
as
Thus, we have an invariance property for the weights. The result also holds when the population is partitioned into groups and the initial weights are inflated by a constant within each group. Note that if we include a constant, e.g., “1”, as a first component of the auxiliary vector
we can simply let
to achieve
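The invariance property can be checked numerically. The sketch below assumes the linear calibration weights in the form commonly given by Särndal and Lundström (2005), w_k = d_k {1 + (t_x - sum_r d_k x_k)' (sum_r d_k x_k x_k')^{-1} x_k}, with a constant as first component of the auxiliary vector; the function name `nonresponse_weights`, the response-set data and the totals are hypothetical.

```python
import numpy as np

def nonresponse_weights(d_r, X_r, t_x):
    """Linear calibration weights on the response set (assumed form):
    w_k = d_k * (1 + (t_x - sum_r d_k x_k)' (sum_r d_k x_k x_k')^{-1} x_k).
    """
    T = (X_r * d_r[:, None]).T @ X_r             # sum_r d_k x_k x_k'
    lam = np.linalg.solve(T, t_x - X_r.T @ d_r)  # calibration multiplier
    return d_r * (1.0 + X_r @ lam)

# Hypothetical response set of m units; the first auxiliary component is 1
rng = np.random.default_rng(1)
m = 60
X_r = np.column_stack([np.ones(m), rng.uniform(1.0, 10.0, m)])
d_r = rng.uniform(5.0, 20.0, m)
t_x = np.array([1000.0, 5500.0])                 # assumed known totals

w_base = nonresponse_weights(d_r, X_r, t_x)
w_inflated = nonresponse_weights(2.5 * d_r, X_r, t_x)  # initial weights times a constant
```

Under these assumptions `w_base` and `w_inflated` coincide up to floating-point error, so inflating the initial weights by a common constant leaves the calibrated weights unchanged.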
With this as a background we propose to use alternative “optimal” weights resulting from the distance measure
leading to
denotes the inclusion probability for the pair
It is to be observed that, as for the full response situation, there are cases for which the “optimal” weights are identical to (2.3), e.g., under simple random sampling.
The quotation marks around “optimal” are deliberate. Under full response, optimal has a very clear meaning: as mentioned earlier, the optimal regression estimator has asymptotically minimum variance among linear regression estimators. Adding nonresponse, where the nonresponse mechanism is at least partially unknown, makes it difficult to define optimality criteria in a proper way.
For this “optimal” measure it might be fruitful to replace
with
where we include in
the reciprocal of an estimate of the average response probability
One simple candidate is
thus yielding
Another natural choice is
since
and
which lead to
The resulting modified estimator is denoted by
(Also observe that
In the following simulation study we will focus on a sampling design where generally
namely Poisson sampling. The independence of drawings simplifies the “optimal” distance measure:
and minimization yields
For the modified “optimal” estimator
is replaced by
with
as in (2.4).
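To illustrate why independent drawings simplify quadratic forms built from joint inclusion probabilities, consider the standard variance of the Horvitz-Thompson estimator. This is a generic textbook identity, not the distance measure displayed in this section, and the symbols $\pi_k$, $\pi_{kl}$, $y_k$ and $U$ follow the usual design-based notation, which is assumed here. Under Poisson sampling $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$, so all cross terms vanish and only the diagonal terms with $\pi_{kk} = \pi_k$ remain.

```latex
\begin{align*}
V\bigl(\hat{t}_{HT}\bigr)
  &= \sum_{k \in U} \sum_{l \in U}
     \bigl(\pi_{kl} - \pi_k \pi_l\bigr)
     \frac{y_k}{\pi_k} \frac{y_l}{\pi_l} \\
  &= \sum_{k \in U} \bigl(\pi_k - \pi_k^2\bigr) \frac{y_k^2}{\pi_k^2}
   \;=\; \sum_{k \in U} \frac{1 - \pi_k}{\pi_k}\, y_k^2
   \qquad \text{(Poisson sampling).}
\end{align*}
```

The same collapse of a double sum into a single sum is what one would expect to make the corresponding distance matrix diagonal under this design.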
2.2.1 Bias for calibration estimators under nonresponse
We can write
as
where
In order to arrive at an approximate expression for the bias of
and subsequently
and
we follow the derivation in Särndal and Lundström (2005) and first note that
can be rewritten as
where
If we let
where
and
it can further be shown that
where
and
Then
since it can be argued that
is a consistent estimator of
and therefore
The approximation for the bias of
is called the nearbias:
The nearbias of
is zero if
for all
and/or
for all
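The two zero-nearbias conditions can be illustrated numerically. The sketch below assumes the nearbias in the form given by Särndal and Lundström (2005) for the standard calibration estimator with population-level auxiliary totals, namely minus the sum over the population of (1 - theta_k) e_k, where e_k = y_k - x_k' B and B is the theta-weighted least squares coefficient. The function name `nearbias`, the simulated population and the response probabilities are hypothetical.

```python
import numpy as np

def nearbias(y, X, theta):
    """Assumed form: -sum_U (1 - theta_k) * e_k, with
    e_k = y_k - x_k' B and
    B = (sum_U theta_k x_k x_k')^{-1} sum_U theta_k x_k y_k.
    """
    Xt = X * theta[:, None]
    B = np.linalg.solve(Xt.T @ X, Xt.T @ y)
    e = y - X @ B
    return -np.sum((1.0 - theta) * e)

# Hypothetical population with auxiliary vector x_k = (1, x_k)'
rng = np.random.default_rng(2)
N = 500
x = rng.uniform(0.0, 1.0, N)
X = np.column_stack([np.ones(N), x])

theta_general = 0.4 + 0.5 * x**2                 # response probabilities in (0.4, 0.9)
theta_inverse_linear = 1.0 / (1.2 + x)           # 1/theta_k = 1 + lambda' x_k, lambda = (0.2, 1)

y_linear = X @ np.array([3.0, 2.0])              # y_k exactly linear in x_k
y_nonlinear = 3.0 + 2.0 * x + 4.0 * x**2         # y_k not linear in x_k

nb_linear_y = nearbias(y_linear, X, theta_general)
nb_linear_phi = nearbias(y_nonlinear, X, theta_inverse_linear)
nb_generic = nearbias(y_nonlinear, X, theta_general)
```

Under these assumptions `nb_linear_y` and `nb_linear_phi` vanish up to rounding, whereas `nb_generic` does not, because neither condition holds in that case.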
Then, if we consider
we have that
where
Since
can be written as (2.6), which is of the same form as for
in (2.5), we will again arrive at the nearbias expression
where
and with
denoting the response probability for the pair
If we use the alternative weighting
we get that
nearbias
where
to be compared with (2.7), where
Unless
for all
an equivalent expression can be obtained for
On the other hand, if the restriction
for all
does hold, it can be shown (Särndal and Lundström (2005)) that
which holds independently of the sampling design and which is a result completely in line with the aforementioned invariance property of the calibration weights.