Comparison of the conditional bias and Kokic and Bell methods for Poisson and stratified sampling
Section 2. The processing of influential units by winsorization following the approach of Kokic and Bell
In this section,
we present the method initially proposed by Kokic and Bell (1994), which
applies to samples selected through stratified simple random sampling, and an
extension of this method to the case of samples selected through Poisson
sampling.
2.1 Case of stratified simple random sampling
Consider a finite population $U$ of size $N$ and a variable of interest $y$ observed on a sample $s$ of fixed size $n$, for which we are looking to estimate the total $t_y = \sum_{i \in U} y_i$ on the population. The approach of Kokic and Bell (1994) is based on the following hypotheses:
- $y$ is a non-negative variable;
- $s$ is selected according to a stratified simple random sampling design $p(\cdot)$ with strata $U_h$, $h = 1, \ldots, H$. In each stratum $U_h$ of size $N_h$, a sample $s_h$ of size $n_h$ is selected by simple random sampling without replacement. The expectation with respect to the sampling design will be denoted $E_p$ hereafter;
- in each stratum $h$, the values $y_i$ of $y$ in the population are realizations of random variables $Y_{hi}$ that are independent and identically distributed according to the same law (the model $m$), with expectation $\mu_h$. The expectation and the variance with respect to this model will be denoted $E_m$ and $V_m$, respectively, hereafter;
- we have, for each stratum $h$, $r_h$ realizations $\tilde{y}_{hj}$, $j = 1, \ldots, r_h$, of the variable $y$, drawn from the same law but independent of the sample $s$.
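In this context, Kokic and Bell (1994) propose applying a Type II winsorization: they associate with each stratum $h$ a threshold $K_h$ independent of the sample $s$ and define the winsorized variable $y_i^*$, for $i \in s_h$, by:
$$y_i^* = \begin{cases} y_i & \text{if } y_i \le K_h, \\ \dfrac{n_h}{N_h}\, y_i + \left(1 - \dfrac{n_h}{N_h}\right) K_h & \text{if } y_i > K_h. \end{cases}$$
The winsorized estimator of the total $t_y$ is then the expansion estimator of the total of the winsorized variable:
$$\hat{t}_y^* = \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{i \in s_h} y_i^*.$$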
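The Type II winsorization and the resulting expansion estimator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the winsorized value of an observation above the threshold is taken to be $(n_h/N_h)\,y_i + (1 - n_h/N_h)\,K_h$, and the function names `winsorize_type2` and `winsorized_total`, the dictionary layout and the toy data are hypothetical, not taken from Kokic and Bell (1994).

```python
from typing import Dict, Sequence


def winsorize_type2(y: float, K: float, n_h: int, N_h: int) -> float:
    """Type II winsorized value of y for a threshold K (assumed form)."""
    if y <= K:
        return y
    f_h = n_h / N_h  # sampling fraction of the stratum
    # The observation keeps a share f_h of its own value; the rest is cut at K.
    return f_h * y + (1.0 - f_h) * K


def winsorized_total(
    samples: Dict[str, Sequence[float]],
    N: Dict[str, int],
    K: Dict[str, float],
) -> float:
    """Expansion estimator of the total of the winsorized variable."""
    total = 0.0
    for h, values in samples.items():
        n_h = len(values)
        weight = N[h] / n_h  # stratified SRS weight N_h / n_h
        total += weight * sum(
            winsorize_type2(y, K[h], n_h, N[h]) for y in values
        )
    return total


# Toy data (hypothetical): two strata, one large value in stratum "a".
samples = {"a": [1.0, 2.0, 10.0], "b": [5.0, 5.0]}
N = {"a": 6, "b": 4}
K = {"a": 4.0, "b": 100.0}
print(winsorized_total(samples, N, K))
```

On this toy data the winsorized estimate is 40, against 46 for the usual expansion estimator: only the value 10 in stratum "a" exceeds its threshold and is pulled back to 7.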
The thresholds $K_h$ are determined so as to obtain the estimator $\hat{t}_y^*$ with the lowest mean square error with respect to both the sampling design and the law of $y$ in each stratum, i.e.,
$$(K_1, \ldots, K_H) = \arg\min\, E_m E_p\left\{\left(\hat{t}_y^* - t_y\right)^2\right\}.$$
The optimal thresholds must therefore protect the winsorized estimator on average over all possible samples in the population, and on average over the law of the variable of interest, i.e., on average over all the possible populations given the law of $y$.
Kokic and Bell (1994) place themselves in an asymptotic framework by considering a set of populations, sampling designs and samples indexed by $\nu$ such that:
-
-
-
- the number of strata $H$ is fixed.
They also propose denoting $J_{hi} = 1(Y_{hi} > K_h)$ the winsorization indicator. To lighten the notation, we will omit in the rest of the article the index $\nu$, as well as the index $i$ in the expression of the expectations and variances $E_m$ and $V_m$ of the random variables $Y_{hi}$ and $J_{hi}$ under the law of $y$ in the stratum $h$. Insofar as these variables are assumed to be independent and identically distributed in each stratum, $E_m(J_{hi})$, for example, is indeed the same regardless of the observation considered, and is written $E_m(J_h)$.
In this context, Kokic and Bell (1994) show that, at the optimum and asymptotically, all the thresholds are linked to one another by the relation:
$$\left(\frac{N_h}{n_h} - 1\right)(K_h - \mu_h) \sim -B, \quad h = 1, \ldots, H, \qquad (2.1)$$
with $B = E_m E_p(\hat{t}_y^* - t_y)$ the bias of the winsorized estimator. The notation $\sim$ corresponds to an asymptotic equivalence when $\nu$ tends toward infinity (which is equivalent to saying when $n$ tends toward infinity).
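If we denote $D_{hi} = (N_h/n_h - 1)(Y_{hi} - \mu_h)$ and $L = -B$, then we can notice that at the optimum given by (2.1), $J_{hi} = 1(D_{hi} > L)$ and the bias $B$ is the opposite of the zero of the function $F$ defined by:
$$F(L) = L\left(1 + \sum_{h=1}^{H} n_h\, E_m(J_h)\right) - \sum_{h=1}^{H} n_h\, E_m(D_h J_h), \quad \text{where } J_h = 1(D_h > L). \qquad (2.2)$$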
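The link between the bias and the zero of $F$ can be made explicit with a short computation. The derivation below is only a sketch under assumptions that may differ from the original presentation: the Type II winsorized value for $y_i > K_h$ is taken to be $(n_h/N_h)\,y_i + (1 - n_h/N_h)\,K_h$, the notation $J_h = 1(Y_h > K_h)$, $D_h = (N_h/n_h - 1)(Y_h - \mu_h)$ and $L = -B$ is assumed, and the asymptotic relation $(N_h/n_h - 1)(K_h - \mu_h) \sim L$ between the thresholds is used in the last line of the first display.

```latex
\begin{align*}
B &= E_m E_p\left(\hat{t}_y^* - t_y\right)
   = -\sum_{h=1}^{H} n_h \left(\frac{N_h}{n_h} - 1\right)
      E_m\left\{(Y_h - K_h)\, J_h\right\} \\
  &= -\sum_{h=1}^{H} n_h\,
      E_m\left\{\left[D_h - \left(\frac{N_h}{n_h} - 1\right)(K_h - \mu_h)\right] J_h\right\} \\
  &\sim -\sum_{h=1}^{H} n_h\, E_m\left\{(D_h - L)\, J_h\right\}.
\end{align*}
% Replacing B by -L and moving every term to the left-hand side gives
\begin{equation*}
L\left(1 + \sum_{h=1}^{H} n_h\, E_m(J_h)\right)
  - \sum_{h=1}^{H} n_h\, E_m(D_h J_h) = 0 .
\end{equation*}
```

Since $(D_h - L)\,J_h = \max(D_h - L, 0)$, the left-hand side can also be written $L - \sum_h n_h\, E_m\{\max(D_h - L, 0)\}$, which makes it apparent that it is continuous and increasing in $L$.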
Determining the zero of the function $F$ requires estimates of the model expectations $E_m(J_h)$ and $E_m(D_h J_h)$ that appear in it. To do this, Kokic and Bell (1994) rely on observations of the variable $y$ in each stratum. These observations must come from a source independent of the sample, since the proof of formulas (2.1) and (2.2) is based on the fact that the thresholds $K_h$ are assumed to be independent of the sample $s$.
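If we assume that for each stratum $h$ we have $r_h$ realizations $\tilde{y}_{hj}$, $j = 1, \ldots, r_h$, of $y$, then we can estimate $F$ by:
$$\hat{F}(L) = L\left(1 + \sum_{h=1}^{H} \frac{n_h}{r_h} \sum_{j=1}^{r_h} \hat{J}_{hj}\right) - \sum_{h=1}^{H} \frac{n_h}{r_h} \sum_{j=1}^{r_h} \hat{D}_{hj}\, \hat{J}_{hj},$$
with $\hat{D}_{hj} = (N_h/n_h - 1)(\tilde{y}_{hj} - \hat{\mu}_h)$, $\hat{\mu}_h = r_h^{-1} \sum_{j=1}^{r_h} \tilde{y}_{hj}$ and $\hat{J}_{hj} = 1(\hat{D}_{hj} > L)$, and estimate the optimal bias $B$ as the opposite of the zero of $\hat{F}$.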
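Now, $\hat{F}$ is an increasing and piecewise linear function, which admits only one zero. This zero can be found simply by denoting $D_{(1)} \le D_{(2)} \le \cdots \le D_{(r)}$ the values $\hat{D}_{hj}$ computed from the available realizations, pooled over all strata and sorted in ascending order, and by calculating $\hat{F}(D_{(k)})$ for $k = 1, 2, \ldots$ until its sign changes. Indeed, $\hat{F}(D_{(1)})$ is negative because $D_{(1)}$ is by definition lower than all the others and because $D_{(1)}$ is negative, since the $\hat{D}_{hj}$ are centred within each stratum. However, $\hat{F}(D_{(r)}) = D_{(r)} > 0$ for similar reasons, by denoting $r$ the total number of realizations. By denoting $k^*$ the index such that $\hat{F}(D_{(k^*)}) < 0$ and $\hat{F}(D_{(k^*+1)}) \ge 0$, $L$ can be estimated by linear interpolation, i.e., by
$$\hat{L} = D_{(k^*)} - \hat{F}(D_{(k^*)})\, \frac{D_{(k^*+1)} - D_{(k^*)}}{\hat{F}(D_{(k^*+1)}) - \hat{F}(D_{(k^*)})}.$$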
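The search for the zero can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that $\hat{F}(L) = L - \sum_h (n_h/r_h) \sum_j \max(\hat{D}_{hj} - L, 0)$ with $\hat{D}_{hj} = (N_h/n_h - 1)(\tilde{y}_{hj} - \hat{\mu}_h)$, that $r_h$ denotes the number of independent realizations available in stratum $h$, and that $n_h < N_h$ in every stratum. The names `estimate_L` and `thresholds` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def estimate_L(
    hist: Dict[str, Sequence[float]],  # independent realizations per stratum
    n: Dict[str, int],                 # sample sizes n_h
    N: Dict[str, int],                 # stratum sizes N_h
) -> float:
    """Zero of the estimated function F_hat, found by linear interpolation."""
    pairs: List[Tuple[float, float]] = []  # (D_hj, weight n_h / r_h)
    for h, ys in hist.items():
        r_h = len(ys)
        mu_h = sum(ys) / r_h
        scale = N[h] / n[h] - 1.0
        pairs.extend((scale * (y - mu_h), n[h] / r_h) for y in ys)

    def F_hat(L: float) -> float:
        # Assumed form: L minus the weighted excesses of the D values over L.
        return L - sum(w * (d - L) for d, w in pairs if d > L)

    D = sorted(d for d, _ in pairs)
    # Walk through consecutive sorted values until the sign of F_hat changes.
    for lo, hi in zip(D, D[1:]):
        f_lo, f_hi = F_hat(lo), F_hat(hi)
        if f_lo < 0.0 <= f_hi:
            # F_hat is linear between lo and hi, so interpolation is exact.
            return lo - f_lo * (hi - lo) / (f_hi - f_lo)
    raise ValueError("no sign change found: check the input data")


def thresholds(
    hist: Dict[str, Sequence[float]],
    n: Dict[str, int],
    N: Dict[str, int],
) -> Dict[str, float]:
    """Thresholds K_h = mu_h + L / (N_h / n_h - 1), assuming n_h < N_h."""
    L = estimate_L(hist, n, N)
    return {
        h: sum(ys) / len(ys) + L / (N[h] / n[h] - 1.0)
        for h, ys in hist.items()
    }


# Toy data (hypothetical): one stratum with a single large value.
hist = {"a": [0.0, 0.0, 0.0, 4.0]}
print(estimate_L(hist, {"a": 2}, {"a": 4}))
print(thresholds(hist, {"a": 2}, {"a": 4}))
```

The sketch re-evaluates $\hat{F}$ from scratch at every sorted value, which is quadratic in the number of realizations; cumulative sums would make it linear, at the cost of a less direct reading.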
2.2 Extension to the case of the Poisson sampling design
We now place ourselves in the situation in which the sampling design $p(\cdot)$ by which $s$ is selected is a Poisson sampling design, in which each unit $i$ of the population can belong to the sample with a probability $\pi_i$. We are still interested in estimating the total in the population $t_y$ of a variable $y$. The extension of the Kokic and Bell method to this sampling design assumes:
- that $y$ is a non-negative variable;
- that it is possible to partition the population and the sample into $H$ subpopulations $U_h$ and $s_h$ in which all the values are independent realizations from the same model satisfying:
-
- with
-
- where $E_m$ and $V_m$ designate the expectation and the variance with respect to the model (2.5).
In this context, we propose, as in the original method applied to stratified simple random sampling, associating a threshold $K_h$ with each part $U_h$ and defining:
- the winsorized variable $y_{hi}^*$ by
-
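- where $d_{hi} = 1/\pi_{hi}$ is the weight of the unit $i$ in part $h$;
- the winsorized estimator of the total $t_y$ as the usual expansion estimator of the total of the winsorized variable:
- $\hat{t}_y^* = \sum_{h=1}^{H} \sum_{i \in s_h} d_{hi}\, y_{hi}^*$.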
In the article by Kokic and Bell (1994), the subpopulations with which the thresholds are associated are the sampling strata, which satisfy two properties: the draws are independent between strata, and the authors postulate an identical population model for all observations in the same stratum. In the case of Poisson sampling, the draws are by nature independent between units.
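The independence of the draws under Poisson sampling can be illustrated with a short Python sketch: each unit enters the sample through its own Bernoulli trial, and the usual expansion estimator weights each sampled value by the inverse of its inclusion probability. The function names `poisson_sample` and `expansion_estimator` and the toy values are hypothetical and serve only as an illustration.

```python
import random
from typing import List, Sequence


def poisson_sample(pi: Sequence[float], rng: random.Random) -> List[int]:
    """Indices selected by independent Bernoulli(pi_i) draws, one per unit."""
    return [i for i, p in enumerate(pi) if rng.random() < p]


def expansion_estimator(
    y: Sequence[float], pi: Sequence[float], sample: Sequence[int]
) -> float:
    """Expansion estimator of the total: sampled values weighted by 1 / pi_i."""
    return sum(y[i] / pi[i] for i in sample)


# Toy population (hypothetical): inclusion probabilities proportional to y.
y = [2.0, 4.0, 8.0, 16.0]
pi = [0.1, 0.2, 0.4, 0.8]
s = poisson_sample(pi, random.Random(2024))
print(s, expansion_estimator(y, pi, s))
```

Because the draws are independent, the sample size is random. In this toy population every ratio $y_i / \pi_i$ equals 20, so the estimate is 20 times the realized sample size.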
The strong hypothesis underlying model (2.5) is that the values $y_{hi}$ multiplied by the weights $d_{hi}$ are assumed to have constant expectation in each stratum. This means that the inclusion probabilities within each stratum are defined proportionally to the variable of interest $y$. In practice, these inclusion probabilities are often defined proportionally to a known auxiliary variable that is strongly correlated with $y$, which makes it possible to be close to the hypothesis underlying model (2.5).
Note also that
model (2.5) is the one under which the Horvitz-Thompson estimator is optimal in
the sense of minimizing the mean square error with respect to the model.
In the following,
the random variables
being
assumed to be independent and identically distributed within each stratum, we
will denote
We also place
ourselves in the same asymptotic framework as Kokic and Bell (1994) by adapting
the hypothesis on the inclusion probabilities:
As in the approach presented in the previous section, the thresholds $K_h$ are determined so as to minimize the mean square error of the winsorized estimator $\hat{t}_y^*$ with respect to both the model of the variable $y$ and the sampling design $p(\cdot)$, i.e., on average across all possible populations, given the super-population model applied to $y$, and on average over all samples drawn from these populations, given the sampling design $p(\cdot)$.
It is possible to
show (see Appendix A) that at the optimum and asymptotically, denoting as
previously
and
omitting the index $i$ in the expression of expectations and variances under model (2.5) of the variables
and
with
and
is the
bias of the optimal winsorized estimator
at the optimum, the threshold $K_h$ is therefore equal, up to a positive term, to the opposite of the bias multiplied by the term
If we denote
and
then
asymptotically
using
relation (2.9).
By substituting the equivalence relation (2.9) into formula (2.10), we obtain that, at the optimum and asymptotically, the bias $B$ is the opposite of the zero of the function $F$ defined by:
As in the previous section, we finally assume that we have, for each subpopulation $U_h$ of $U$, realizations drawn from the law of $y$ in $U_h$ and independent of the sample $s$. With these observations, we can estimate $F$ by:
and estimate $B$ by the opposite of the zero of $\hat{F}$.
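We will denote $D_{(1)} \le D_{(2)} \le \cdots$ the values of the $D$ placed in ascending order. Then, between two successive values $D_{(k)}$ and $D_{(k+1)}$, the indicators $J$, as functions of $L$, remain constant, so that $\hat{F}$ is linear with a positive slope on this interval. $\hat{F}$ is therefore a piecewise linear and increasing function of $L$.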
In addition,
and,
when
exceeds
with
To determine the zero of $\hat{F}$, we proceed with a method similar to that proposed by Kokic and Bell (1994) in the case of stratified simple random sampling:
- calculate
- identify the value
such as
and
assuming that
-
is then estimated by interpolation, as in the
previous section:
-
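The three steps above can be sketched generically in Python. The sketch assumes only what the text states, namely that the estimated function is increasing and linear between two successive sorted values; the function itself is passed as an argument rather than spelled out, and the name `zero_by_interpolation` and the toy function are hypothetical.

```python
from typing import Callable, Sequence


def zero_by_interpolation(
    F: Callable[[float], float], breakpoints: Sequence[float]
) -> float:
    """Zero of an increasing function that is linear between breakpoints."""
    D = sorted(breakpoints)
    values = [F(d) for d in D]  # step 1: evaluate F at every sorted value
    for k in range(len(D) - 1):
        # Step 2: locate the pair of successive values where the sign changes.
        if values[k] < 0.0 <= values[k + 1]:
            # Step 3: linear interpolation between the two breakpoints.
            slope = (values[k + 1] - values[k]) / (D[k + 1] - D[k])
            return D[k] - values[k] / slope
    raise ValueError("no sign change between the supplied breakpoints")


# Toy function (hypothetical): increasing, with a single kink at L = 3.
def F_toy(L: float) -> float:
    return L - 0.5 * max(3.0 - L, 0.0)


print(zero_by_interpolation(F_toy, [-1.0, -1.0, -1.0, 3.0]))
```

Tied breakpoints are harmless here: two equal values give equal function values, so the sign-change test never selects them and no division by zero occurs.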