2 A general procedure for constructing fully efficient replication weights
Jae Kwang Kim and Changbao Wu
In principle,
we can construct replication weights for any measurable sampling design, using
the method outlined in Fay (1984) and Fay and Dippo (1989), such that the
resulting replication variance estimators are algebraically equivalent to the
standard linearization variance estimators.
Let $U = \{1, \ldots, N\}$ be the set of units in the finite population and $s$ be the set of $n$ units in the sample, selected according to a
probability sampling design. Let $d_i = 1/\pi_i$ be the basic design weight, where $\pi_i = P(i \in s)$ is the first order inclusion probability for
unit $i$.
Let $y_i$ be the value of the study variable $y$ for unit $i$ and $T_y = \sum_{i \in U} y_i$ be the population total of interest. The
Horvitz-Thompson estimator of $T_y$ is given by
$$\hat{T}_y = \sum_{i \in s} d_i y_i. \tag{2.1}$$
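The
estimator given in (2.1) is also called the expansion
estimator, with the basic design weight $d_i$ denoting the number of units in the population
represented by unit $i$ in the sample. The standard variance estimator
of $\hat{T}_y$ can be written as
$$v(\hat{T}_y) = \sum_{i \in s} \sum_{j \in s} \Omega_{ij}\, y_i y_j, \tag{2.2}$$
where $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j)/(\pi_{ij} \pi_i \pi_j)$ and $\pi_{ij} = P(i, j \in s)$ is the second order joint inclusion
probability for units $i$ and $j$. It is assumed that $\pi_{ij} > 0$ for all $i, j$. Note that $\pi_{ii} = \pi_i$, so that $\Omega_{ii} = (1 - \pi_i)/\pi_i^2$. The standard
variance estimator $v(\hat{T}_y)$ is often viewed as fully efficient since it is
the Horvitz-Thompson estimator of the design-based variance $V(\hat{T}_y)$.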
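As a small numerical illustration of (2.1) and (2.2), the following sketch evaluates the Horvitz-Thompson estimator and its standard variance estimator for simple random sampling without replacement, using the coefficient (pi_ij - pi_i pi_j)/(pi_ij pi_i pi_j) on y_i y_j with pi_ii = pi_i. The function names (`ht_total`, `omega`, `ht_variance`) and the toy data are assumptions made for this example; they do not come from the paper.

```python
from itertools import product

def ht_total(y, pi):
    """Horvitz-Thompson estimate: sum of d_i * y_i with d_i = 1 / pi_i."""
    return sum(yi / p for yi, p in zip(y, pi))

def omega(i, j, pi, pi2):
    """Coefficient of y_i * y_j in (2.2); the diagonal uses pi_ii = pi_i."""
    pij = pi[i] if i == j else pi2[i][j]
    return (pij - pi[i] * pi[j]) / (pij * pi[i] * pi[j])

def ht_variance(y, pi, pi2):
    """Double sum over the sample of omega_ij * y_i * y_j."""
    n = len(y)
    return sum(omega(i, j, pi, pi2) * y[i] * y[j]
               for i, j in product(range(n), repeat=2))

# Toy SRSWOR sample (hypothetical data): N = 10, n = 4.
N, n = 10, 4
y = [3.0, 7.0, 1.0, 5.0]
pi = [n / N] * n                                  # first order probabilities
pij = n * (n - 1) / (N * (N - 1))                 # second order probabilities
pi2 = [[pij] * n for _ in range(n)]

t_hat = ht_total(y, pi)
v_hat = ht_variance(y, pi, pi2)

# Textbook SRSWOR variance estimator N^2 (1 - n/N) s^2 / n for comparison.
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
v_srs = N ** 2 * (1 - n / N) * s2 / n
```

For this design the double sum and the textbook formula agree up to floating point error, which is a convenient check on any implementation of (2.2).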
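Let $\Delta = (\Omega_{ij})$ be an $n \times n$ matrix. We can re-write (2.2) as $v(\hat{T}_y) = \mathbf{y}^\top \Delta \mathbf{y}$, where $\mathbf{y} = (y_1, \ldots, y_n)^\top$ is the vector of sampled $y_i$. The matrix $\Delta$ is nonnegative definite and can be decomposed
as
$$\Delta = \sum_{k=1}^{m} \lambda_k \boldsymbol{\delta}_k \boldsymbol{\delta}_k^\top \tag{2.3}$$
for some $m \le n$ and some $n$-dimensional
vectors $\boldsymbol{\delta}_k$, with $\lambda_k > 0$. The most well-known decomposition (2.3) is
given by the spectral decomposition, where $\boldsymbol{\delta}_k$ is the eigenvector associated with the
eigenvalue $\lambda_k$. In practice, very small eigenvalues are often
ignored for computational reasons. For stratified sampling, the matrix $\Delta$ is block-diagonal so the computational burden
may be alleviated. However, we do not restrict (2.3) to the spectral
decomposition. Any decomposition satisfying (2.3) can be used.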
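To illustrate that (2.3) need not be a spectral decomposition, the sketch below uses simple random sampling without replacement, for which the matrix of the quadratic form (2.2) equals C (I_n - 1 1'/n) with C = N(N - n)/(n(n - 1)). It is rebuilt from n rank-one terms with lambda_k = C and delta_k equal to the k-th unit vector minus (1/n) times the vector of ones. This particular decomposition and the helper names are assumptions chosen for the example, not the authors' recommendation.

```python
def srswor_delta(N, n):
    """Matrix of the quadratic form (2.2) under SRSWOR: C * (I_n - 1 1' / n)."""
    C = N * (N - n) / (n * (n - 1))
    return [[C * ((i == j) - 1.0 / n) for j in range(n)] for i in range(n)]

def centred_unit_vectors(N, n):
    """A non-spectral decomposition: lambda_k = C, delta_k = e_k - (1/n) 1."""
    C = N * (N - n) / (n * (n - 1))
    lams = [C] * n
    deltas = [[(i == k) - 1.0 / n for i in range(n)] for k in range(n)]
    return lams, deltas

def rebuild(lams, deltas):
    """Sum over k of lambda_k * delta_k * delta_k', returned as a nested list."""
    n = len(deltas[0])
    return [[sum(lam * dk[i] * dk[j] for lam, dk in zip(lams, deltas))
             for j in range(n)] for i in range(n)]

N, n = 10, 4
delta_mat = srswor_delta(N, n)
lams, deltas = centred_unit_vectors(N, n)
rebuilt = rebuild(lams, deltas)

# Largest entrywise discrepancy between the matrix and its rank-one expansion.
max_abs_err = max(abs(rebuilt[i][j] - delta_mat[i][j])
                  for i in range(n) for j in range(n))
```

This version uses n terms rather than the n - 1 nonzero eigenvalues a spectral decomposition would give, so it trades one extra replicate for not needing an eigen-solver.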
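Suppose that we
want to express the fully efficient variance estimator $v(\hat{T}_y)$ given by (2.2) as a replication variance
estimator in the form of
$$v_{\mathrm{rep}}(\hat{T}_y) = \sum_{k=1}^{L} c_k \left( \hat{T}_y^{(k)} - \hat{T}_y \right)^2, \tag{2.4}$$
where $\hat{T}_y^{(k)} = \sum_{i \in s} w_i^{(k)} y_i$, $\mathbf{w}^{(k)} = (w_1^{(k)}, \ldots, w_n^{(k)})^\top$ is the $k$th set of replication weights, $c_k$ is the factor associated with the $k$th set of replication weights and $L$ is the total number of replications; see Kim,
Navarro and Fuller (2006) for further discussion.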
The form given
by (2.4) does not include all replication variance estimators. For instance,
Campbell (1980) provided a jackknife variance estimator where the pseudovalues
are derived based on the von Mises approximation to the parameter of interest.
Nevertheless, most replication variance estimators can be put in this form.
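We have the
following result on the construction of $\mathbf{w}^{(k)}$ for $k = 1, \ldots, m$ based on the decomposition (2.3).
Theorem 1. The fully efficient variance estimator $v(\hat{T}_y)$ and the replication variance estimator $v_{\mathrm{rep}}(\hat{T}_y)$ are algebraically identical if we let $L = m$ and $\mathbf{w}^{(k)} = \mathbf{d} + (\lambda_k / c_k)^{1/2} \boldsymbol{\delta}_k$, where $\mathbf{d} = (d_1, \ldots, d_n)^\top$ is the set of original basic design weights.
Proof. The proof follows directly from the fact that $\hat{T}_y^{(k)} - \hat{T}_y = (\lambda_k / c_k)^{1/2} \boldsymbol{\delta}_k^\top \mathbf{y}$ and that $\mathbf{y}^\top \Delta \mathbf{y} = \sum_{k=1}^{m} \lambda_k (\boldsymbol{\delta}_k^\top \mathbf{y})^2$.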
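The construction in Theorem 1 can be checked numerically. The sketch below forms replication weights as the design weights plus sqrt(lambda_k / c_k) times delta_k, evaluates the replication variance estimator (2.4), and compares it with the sum over k of lambda_k (delta_k' y)^2. The decomposition (centred unit vectors for simple random sampling without replacement), the arbitrary factors c_k, the data and the function names are all assumptions made for the example.

```python
import math

def replication_weights(d, lams, deltas, c):
    """One weight vector per term: d + sqrt(lambda_k / c_k) * delta_k."""
    return [[di + math.sqrt(lam / ck) * dki for di, dki in zip(d, dk)]
            for lam, dk, ck in zip(lams, deltas, c)]

def replication_variance(y, d, weights, c):
    """Sum over k of c_k * (replicate total - full-sample total)^2, as in (2.4)."""
    t_hat = sum(di * yi for di, yi in zip(d, y))
    return sum(ck * (sum(wi * yi for wi, yi in zip(w, y)) - t_hat) ** 2
               for ck, w in zip(c, weights))

def quadratic_form(y, lams, deltas):
    """Sum over k of lambda_k * (delta_k' y)^2, i.e. the quadratic form in y."""
    return sum(lam * sum(dki * yi for dki, yi in zip(dk, y)) ** 2
               for lam, dk in zip(lams, deltas))

# Toy SRSWOR sample (hypothetical data): N = 10, n = 4.
N, n = 10, 4
y = [3.0, 7.0, 1.0, 5.0]
d = [N / n] * n
C = N * (N - n) / (n * (n - 1))
lams = [C] * n
deltas = [[(i == k) - 1.0 / n for i in range(n)] for k in range(n)]

c = [1.0, 2.0, 5.0, 10.0]      # arbitrary positive factors, one per replicate
weights = replication_weights(d, lams, deltas, c)
v_rep = replication_variance(y, d, weights, c)
v_lin = quadratic_form(y, lams, deltas)
```

The two values coincide for any positive c_k, since each factor cancels between the weight perturbation and the squared deviation.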
The choices of $c_k$ can be arbitrary and bear no impact on the
validity and efficiency of the replication variance estimators. However,
certain choices of $c_k$ will result in replication weights with
negative values, which is undesirable as it may produce negative replicates for
parameters that are always positive. In practical situations one can always
choose relatively large $c_k$ to avoid negative values for replication
weights. In our simulation study (Case I)
reported in Section 5, the problem of negative replication weights can be
eliminated with a suitable choice of $c_k$.
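One way to make "relatively large" concrete: the i-th weight in replicate k is d_i + sqrt(lambda_k / c_k) delta_ki, which is nonnegative for every i as soon as c_k is at least lambda_k times the largest value of (delta_ki / d_i)^2 over the units with delta_ki < 0. This bound is a simple consequence of the weight formula and is offered here as an assumption-laden sketch, not as the rule used in the paper. The helper `min_factor` and the numbers below are hypothetical.

```python
import math

def min_factor(d, lam, delta):
    """Smallest c_k with d_i + sqrt(lam / c_k) * delta_i >= 0 for all i.

    Returns 0.0 when delta has no negative entry, meaning any c_k > 0 works.
    """
    ratios = [(dl / di) ** 2 for di, dl in zip(d, delta) if dl < 0]
    return lam * max(ratios) if ratios else 0.0

def weights_for(d, lam, delta, ck):
    """Replication weights d + sqrt(lam / c_k) * delta for one replicate."""
    shift = math.sqrt(lam / ck)
    return [di + shift * dl for di, dl in zip(d, delta)]

def contribution(y, d, w, ck):
    """c_k * (replicate total - full-sample total)^2 for one replicate."""
    diff = sum(wi * yi for wi, yi in zip(w, y)) - sum(di * yi for di, yi in zip(d, y))
    return ck * diff ** 2

# Hypothetical replicate with a large lambda_k, so that c_k = 1 goes negative.
d = [2.5, 2.5, 2.5, 2.5]
y = [3.0, 7.0, 1.0, 5.0]
lam = 500.0
delta = [0.75, -0.25, -0.25, -0.25]

w_naive = weights_for(d, lam, delta, 1.0)
c_safe = max(1.0, min_factor(d, lam, delta))
w_safe = weights_for(d, lam, delta, c_safe)

contrib_naive = contribution(y, d, w_naive, 1.0)
contrib_safe = contribution(y, d, w_safe, c_safe)
```

Enlarging c_k shrinks the perturbation of the weights but leaves the replicate's contribution to the variance estimate unchanged, which is why the choice has no effect on efficiency.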
The replication
variance estimator with $L = m$ and replication weights $\mathbf{w}^{(k)}$ is fully efficient for an arbitrary variable $y$. It also provides a fully efficient variance
estimator for $\hat{\theta}$ when $\theta$ is a smooth function of population means or
totals. Practical implementation of the method depends crucially on two related
issues: (i) the feasibility of the decomposition of the matrix $\Delta$ specified in (2.3); and (ii) the number of
sets of replication weights required to achieve the full efficiency, determined
by $m$.
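For a smooth function of totals, the same replication weights are simply applied to every total before the function is evaluated. The sketch below does this for a ratio of two estimated totals under simple random sampling without replacement and compares the result with the usual linearization value based on the residuals y_i - theta_hat x_i. The data, the decomposition (centred unit vectors with lambda_k = C) and the large common factor c_k are assumptions for the example; it is a numerical illustration in one toy case, not a proof.

```python
import math

# Toy SRSWOR sample (hypothetical data): N = 10, n = 4.
N, n = 10, 4
y = [3.0, 7.0, 1.0, 5.0]
x = [2.0, 6.0, 2.0, 6.0]
d = [N / n] * n
C = N * (N - n) / (n * (n - 1))                  # common lambda_k
deltas = [[(i == k) - 1.0 / n for i in range(n)] for k in range(n)]

def total(w, z):
    """Weighted total: sum of w_i * z_i."""
    return sum(wi * zi for wi, zi in zip(w, z))

theta_hat = total(d, y) / total(d, x)            # estimated ratio

def v_rep_ratio(ck):
    """Replication variance of the ratio with the same factor c_k for all k."""
    shift = math.sqrt(C / ck)
    v = 0.0
    for dk in deltas:
        w = [di + shift * dki for di, dki in zip(d, dk)]
        v += ck * (total(w, y) / total(w, x) - theta_hat) ** 2
    return v

# Linearization variance: quadratic form in e_i = y_i - theta_hat * x_i,
# divided by the squared estimated total of x.
e = [yi - theta_hat * xi for yi, xi in zip(y, x)]
v_lin = C * sum(total(dk, e) ** 2 for dk in deltas) / total(d, x) ** 2

v_rep_large_c = v_rep_ratio(1e8)
```

With a large c_k the replicate weights stay close to the design weights and the replication value is very close to the linearization value; with a small c_k the two can differ because the ratio is nonlinear in the weights.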
As for the
first issue, modern advances in computational power and improved performance
of available software packages make it possible to do the spectral
decomposition with relatively large $n$. For instance, on a 12-CPU unix machine with 96
gigabytes of memory, the R function eigen()
can handle matrices of sizes at least as large as $n = 4{,}000$.
Note that the computational task involved here is for survey runners at the
data preparation stage and is not for users of the data files. As for the
second issue, the value of $m$ is related to the given sampling design. For
simple random sampling without replacement, we have
$$\Delta = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) \frac{1}{n-1} \left( I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top \right),$$
where $I_n$ is the $n \times n$ identity matrix and $\mathbf{1}_n$ is the $n$-vector of 1's. It follows that $m = n - 1$. This is typically the case for single stage
unequal probability sampling designs. For stratified simple random sampling, we
have $m = n - H$, where $H$ is the total number of strata.
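The counts n - 1 and n - H can be confirmed by computing the rank of the matrix of the quadratic form directly. The sketch below builds that matrix for simple random sampling without replacement as C (I_n - 1 1'/n) with C = N(N - n)/(n(n - 1)), stacks one such block per stratum, and finds the rank by exact Gaussian elimination over rationals. The stratum sizes and the helper names are hypothetical, and each stratum is assumed to have at least two sampled units and fewer sampled than population units.

```python
from fractions import Fraction

def srs_block(N, n):
    """Exact SRSWOR block C * (I_n - 1 1' / n) with C = N (N - n) / (n (n - 1))."""
    C = Fraction(N * (N - n), n * (n - 1))
    return [[C * (Fraction(int(i == j)) - Fraction(1, n)) for j in range(n)]
            for i in range(n)]

def block_diag(blocks):
    """Stack square blocks along the diagonal, zeros elsewhere."""
    size = sum(len(b) for b in blocks)
    out = [[Fraction(0)] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, val in enumerate(row):
                out[offset + i][offset + j] = val
        offset += len(b)
    return out

def rank(mat):
    """Rank by Gaussian elimination; exact because entries are Fractions."""
    a = [row[:] for row in mat]
    r = 0
    for col in range(len(a[0])):
        pivot = next((i for i in range(r, len(a)) if a[i][col] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, len(a)):
            f = a[i][col] / a[r][col]
            if f != 0:
                a[i] = [v - f * p for v, p in zip(a[i], a[r])]
        r += 1
    return r

# Single SRSWOR sample: expect rank n - 1.
rank_srs = rank(srs_block(10, 4))

# Stratified SRSWOR with H = 3 strata, given as (N_h, n_h): expect rank n - H.
strata = [(10, 4), (8, 3), (12, 5)]
n_total = sum(nh for _, nh in strata)
rank_strat = rank(block_diag([srs_block(Nh, nh) for Nh, nh in strata]))
```

Exact rational arithmetic avoids having to pick a tolerance for "very small" eigenvalues, at the cost of being practical only for small matrices.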
It should be
noted that $m \le n$ for any sampling design and the exact value of
$m$ is not required for the proposed procedure to
be implemented. However, since the values of $m$ and $n$ have the same order of magnitude, the proposed
method requires a large number of replicates whenever $n$ is large. Under the current practices in
sample surveys, the fully efficient replication weights described above become
immediately implementable if $n \le 500$
and the second order inclusion probabilities are available. When $n$ is large, a two-stage procedure to be
described in Section 3 can be used to produce a small number of sets of replication weights for public-use
data files.
In some cases,
the spectral decomposition (2.3) can be avoided. For example, Deville (1999)
argued that the variance estimator of $\hat{T}_y$ under unequal probability sampling designs
with fixed sample size can be approximated by
where and More generally, we consider the following form
of matrix in where
(2.6)
where for all and is a vector of design and auxiliary variables.
Many elementary single-stage sampling designs take the form (2.6) for variance
estimation. In particular, Deville's formula in (2.5) can be expressed as with given by (2.6), where in and The conditional Poisson sampling design to be
discussed in Section 5 also takes the form (2.6) where are the design variables in the design
constraint
For the matrix
given by (2.6), it can be shown that
where Thus, we have
(2.7)
which is
useful in deriving an expression for a replication variance estimator in the form
given by (2.4). The fully efficient variance estimator in (2.7) and the replication variance
estimator in (2.4) are algebraically identical if we let
and where is the set of original basic design weights
and with
The proof
follows directly from the fact that and that