4. Composite estimation for matrix sampling design (d)
Takis Merkouris
4.1 Core set of variables with known totals
We discuss first a special case of the matrix sampling design (d) in which the variables that are common to the three samples have known totals. In this very realistic sampling setting, all samples also collect information on the same vector of auxiliary variables, for which the vector of population totals is known. For illustration we consider again three samples, as in Figure 2.1, but with the auxiliary vector added in all subsamples. Then, the CGR estimator in (3.1) may be augmented with ordinary regression terms, each involving the known vector of auxiliary totals and its HT estimator based on the corresponding sample.
This estimator has improved efficiency, as it incorporates additional information, and is generated by a calibration procedure that includes three additional constraints on the known auxiliary totals, one for each sample, and has the design matrix in (2.7) augmented with a corresponding block-diagonal matrix. In the simplest case, when the sample auxiliary matrices reduce to unit columns (with corresponding total the size of the population), the calibration scheme is the one specified in Corollary 1 above. As shown in the proof of the next theorem, an application of Lemma 1 to the present calibration procedure, with the partitioned design matrix and calibration totals just described, gives a modified CGR form of (3.1) with GR estimators incorporating the auxiliary information in place of the HT estimators. This modified form can be written compactly in terms of these component GR estimators, defined analogously for each sample, and an associated matrix regression coefficient.
Replacing the weighting matrix in the calibration procedure by the corresponding variance matrix gives the optimal composite regression estimator, written compactly in the same form but with optimal regression estimators incorporating the auxiliary information in place of the GR estimators, and with the optimal coefficient given by (4.1). Noticing that the matrices entering this coefficient are the matrices of residuals from the regressions on the auxiliary variables, it follows that the coefficient is the analogue of (2.4), or of (2.5) in non-nested sampling. Thus, it is optimal in the sense of minimizing the approximate variance of the composite estimator, which is then asymptotically BLUE. An alternative estimator, of weaker optimality, has the same compact form but with a coefficient of the form (4.1) computed with GR estimators in place of OR estimators. This estimator, differing from the CGR only in the regression coefficient, is optimal in the restricted sense of being the composite of GR estimators incorporating the auxiliary information that has minimum approximate variance. In general, this latter composite estimator cannot be obtained as a calibration estimator. The following theorem gives conditions under which the CGR estimator is optimal in one of the two senses in non-nested matrix sampling; the proof is given in the Appendix. The nested sampling version of the theorem, with subsampling schemes and proof as in Theorem 1, is omitted for brevity.
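To fix ideas before stating the theorem, the following minimal sketch (Python with NumPy; the population, variables, sample sizes and the inverse-variance weighting are all illustrative assumptions, not the paper's notation or its full CGR machinery) computes a GR (generalized regression) estimator of a study-variable total in each of two independent SRS samples using a known auxiliary total, and then composites the two GR estimators with a variance-minimizing scalar coefficient, the scalar stand-in for the matrix regression coefficient discussed above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented finite population: auxiliary variable x with known total, study variable y.
N = 10_000
x = rng.gamma(shape=2.0, scale=3.0, size=N)
y = 4.0 + 1.5 * x + rng.normal(scale=2.0, size=N)
t_x = x.sum()                          # known auxiliary (calibration) total


def greg_total(y_s, x_s, d, t_x):
    """GR (GREG) estimator of the y-total using one auxiliary variable
    with known total t_x (regression-through-the-origin form)."""
    b = np.sum(d * x_s * y_s) / np.sum(d * x_s * x_s)   # design-weighted slope
    return np.sum(d * y_s) + b * (t_x - np.sum(d * x_s))


def srs_sample(n):
    idx = rng.choice(N, size=n, replace=False)
    return y[idx], x[idx], np.full(n, N / n)             # data and SRS design weights


# Two independent samples of different sizes; GR estimate and a crude variance proxy.
estimates, variances = [], []
for n in (400, 1_200):
    y_s, x_s, d = srs_sample(n)
    estimates.append(greg_total(y_s, x_s, d, t_x))
    resid = y_s - (np.sum(d * x_s * y_s) / np.sum(d * x_s * x_s)) * x_s
    variances.append(N**2 * (1 - n / N) * resid.var(ddof=1) / n)

# Inverse-variance composite of the two GR estimators.
phi = variances[1] / (variances[0] + variances[1])
composite = phi * estimates[0] + (1 - phi) * estimates[1]
print(f"true total {y.sum():.0f}, GR estimates {estimates[0]:.0f} and "
      f"{estimates[1]:.0f}, composite {composite:.0f}")
```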
Theorem 2. Consider the following sampling strategies.
- For all three samples assume SRS, with the respective sampling fractions, and specify all constants in the calibration accordingly. Consider the augmented design matrix in (2.7) and the corresponding augmented vector of calibration totals, and suppose further that the stated condition involving constant vectors holds. Then the calibration procedure gives the CGR estimator as the optimal composite of GR estimators incorporating the auxiliary information.
- For all three samples assume STRSRS, with the respective sampling fraction in each stratum of each sample and the stratum sizes as stated, and specify the constants stratum by stratum for all units of a stratum. Further, assume that within each sample the units are sorted by stratum, and consider the augmented design matrix in (2.7) with the corresponding augmented vector of calibration totals; the remaining definitions are as before. Then the calibration procedure gives the CGR estimator as the optimal composite of optimal regression estimators incorporating the auxiliary information.
- For all three samples assume stratified Poisson sampling, and specify the constants in the entries of the design matrix stratum by stratum for the units of each stratum. Then the calibration procedure, with the design matrix and calibration totals as in the previous part, gives the CGR estimator as the optimal composite of optimal regression estimators incorporating the auxiliary information; that is, GR and OR estimators are identical in this case.
The condition involving constant vectors in Theorem 2 is customarily satisfied when the auxiliary vector contains categorical variables. Results analogous to Corollaries 1 and 2 of the previous section also hold for the corresponding parts of Theorem 2. Here too, for sampling designs other than those assumed in Theorem 2, the appropriate value should be used for the constants in the entries of the design matrix.
Finally, by analogy with (3.2), and with the appropriate decomposition of the vector of calibrated weights, the composite estimator now takes a form in which the components are GR estimators using the auxiliary information and the information on the remaining variables from the samples in which they are observed, together with the corresponding matrix regression coefficient. The expression for the other composite estimator is similar. Of course, the composite estimators can also be obtained directly from the modified calibrated weights, in the simple linear form of weighted sample sums.
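As a companion to the weighted-sum remark above, the following sketch (again Python/NumPy, with an invented sample, auxiliary vector and assumed known totals, and handling a single sample rather than the combined-sample calibration of this section) shows the mechanics of linear calibration: the calibrated weights reproduce the known auxiliary totals exactly, and any estimator is then a simple weighted sample sum.

```python
import numpy as np

rng = np.random.default_rng(11)

# Invented single sample: design weights d, auxiliary matrix X (intercept and one
# covariate), and the vector t of known population totals of the columns of X.
n, N = 500, 20_000
x1 = rng.exponential(scale=5.0, size=n)
X = np.column_stack([np.ones(n), x1])
d = np.full(n, N / n)                      # SRS design weights
t = np.array([N, 101_000.0])               # assumed known totals: pop. size and x1-total

# Linear (chi-square distance) calibration:
#   w = d * (1 + X @ lam), with lam chosen so that X' w = t.
lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
w = d * (1.0 + X @ lam)

print(np.allclose(X.T @ w, t))             # calibration constraints hold exactly -> True

# Any estimator is then just a weighted sample sum, e.g. for a study variable y:
y = 2.0 + 0.8 * x1 + rng.normal(size=n)
print(f"calibrated (GR) estimate of the y-total: {np.sum(w * y):.0f}")
```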
4.2 Core set of variables with unknown totals
We turn now to the case of matrix sampling design (d) in which the variables that are common to the three samples have unknown totals. Estimation in this setting includes the construction of a composite estimator of the vector of totals of these common variables. In line with the formulation of Section 2, composite estimators that are best linear unbiased combinations of the component HT estimators are given in (4.2).
The estimators in (4.2) can be written in the matrix regression form (4.3), with the variance-minimizing matrix of coefficients given by (4.4). With estimated covariance and variance matrices we obtain the estimated optimal matrix of coefficients, and (4.3) then becomes an optimal multivariate regression estimator. Then, proceeding as in Section 2, it can be shown that the estimator has an equivalent representation involving the design matrix corresponding to the regression estimator (4.3), that matrix with the second column eliminated and the first two rows set equal to zero, and a remaining quantity that is as in Section 2.
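The variance-minimizing coefficient in (4.4) is a covariance-times-inverse-variance (GLS-type) matrix. The sketch below illustrates that generic construction on simulated replicates of a vector estimator and a zero-expectation contrast; the dimensions, covariance structure and replicate-based estimation are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated replicates of a vector estimator theta (the targets) and a
# zero-expectation contrast delta (e.g. differences between HT estimators of the
# common variables across samples).  Dimensions and covariances are invented.
R, p, q = 20_000, 2, 3
A = rng.normal(size=(p + q, p + q))
cov = A @ A.T + np.eye(p + q)                      # a valid joint covariance matrix
draws = rng.multivariate_normal(np.zeros(p + q), cov, size=R)
theta, delta = draws[:, :p], draws[:, p:]

# Variance-minimizing (GLS-type) matrix coefficient  B = Cov(theta, delta) Var(delta)^{-1}.
S = np.cov(draws.T)
B = S[:p, p:] @ np.linalg.inv(S[p:, p:])

adjusted = theta - delta @ B.T                     # regression-form composite
print("variances before adjustment:", theta.var(axis=0))
print("variances after  adjustment:", adjusted.var(axis=0))
```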
Replacing the variance matrix with the weighting matrix gives the generalized regression coefficient, and (4.3) becomes the CGR estimator (4.5) of the vector of common totals. The estimator (4.5) can be conveniently obtained through a calibration procedure that gives a vector of calibrated weights for the combined sample of the same form as before, but now satisfying an additional constraint on the common variables. Expression (4.5) is then obtained simply as the corresponding weighted sum over the relevant sample.
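A reduced, two-sample analogue of this calibration can be sketched as follows (Python/NumPy; the population, the designs and the particular constraints are invented for illustration): the combined sample is calibrated so that the two weighted estimates of the common variable agree, and the agreed value serves as a composite estimate of its unknown total.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented population and two independent SRS samples, each observing the same
# "common" variable x whose population total is unknown.
N = 50_000
x = rng.lognormal(mean=1.0, sigma=0.6, size=N)
n1, n2 = 800, 2_000
s1 = rng.choice(N, size=n1, replace=False)
s2 = rng.choice(N, size=n2, replace=False)

# Combined sample: design weights and sample-membership indicators.
d = np.concatenate([np.full(n1, N / n1), np.full(n2, N / n2)])
in1 = np.concatenate([np.ones(n1), np.zeros(n2)])
in2 = 1.0 - in1
xc = np.concatenate([x[s1], x[s2]])

# Calibration constraints on the combined sample: each sample's weights sum to N,
# and the two weighted x-sums agree (the "additional constraint" on the common variable).
Z = np.column_stack([in1, in2, (in1 - in2) * xc])
T = np.array([N, N, 0.0])

lam = np.linalg.solve(Z.T @ (d[:, None] * Z), T - Z.T @ d)
w = d * (1.0 + Z @ lam)

# The agreed weighted x-sum is a composite estimate of the unknown common total.
t_x_composite = np.sum(w * in1 * xc)
print(f"true total {x.sum():.0f}, HT from sample 1 {np.sum(d * in1 * xc):.0f}, "
      f"HT from sample 2 {np.sum(d * in2 * xc):.0f}, composite {t_x_composite:.0f}")
```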
The explicit expression (4.2), which differs between the optimal regression and the generalized regression variants only in the form of the linear coefficients, shows that the composite estimators of the remaining totals are more efficient than their counterparts in matrix sampling design (c), equation (2.2), because they incorporate information on the common variables, assuming non-zero correlation with the other variables. Particularly remarkable is the expression for the composite estimator of the common totals: it involves a linear combination of the three HT estimators of these totals derived from the three samples, plus two regression terms implying additional efficiency through the correlation of the common variables with the other two sets of variables. One would expect the additional terms to be zero, because an optimal combination of the three estimators should incorporate all the information on the common variables available in the three samples. In general, however, the associated coefficients are not zero. In non-nested sampling, conditions under which these coefficients are zero are given by the following proposition, the proof of which is given in the Appendix. The result should also hold in nested sampling.
Proposition 1. The coefficients of the two regression terms in the estimator of the common totals in (4.2) are zero only if the equalities in (4.6) hold. This can happen only if the sampling designs for the three samples are identical, including equal sample sizes, or if the sampling design is the same across samples with equal inclusion probability for all units, but not necessarily with the same sample size.
Noticing that the quantities on each side of the equations (4.6) are regression coefficients, Proposition 1 says that the terms of the estimator incorporating the correlation of the common variables with the other two sets of variables are zero only if the effect of the regression of those variables on the common variables is identical in the respective pairs of samples. The essence of this finding is that estimation of the common totals using only information on the common variables from the three samples, while ignoring information on the other variables, will be suboptimal when there is a differential regression effect of those variables on the common variables across the samples.

It was possible to gauge the efficiency of this composite estimator relative to the composite estimator that uses only information on the common variables in a simple setting involving scalar variables, simple random sampling for two of the samples, Bernoulli sampling for the third, and equal sampling rates for all three samples. In that setting only the first equation of (4.6) holds. After much tedious algebra, the relative efficiency was derived as an expression involving the population correlation coefficients and coefficients of variation of the variables. Although in this setting the departure from the conditions of Proposition 1 is minimal, different configurations of admissible values for these quantities show that the efficiency gain may be substantial, making up for the inefficiency of the HT estimator based on the Bernoulli sample; for one such configuration the efficiency gain is 23%. In the case of the composite optimal regression estimator with estimated coefficients, the regression coefficients in (4.6) are themselves estimated, and thus the equalities in (4.6) would never hold exactly because of sample differences. Likewise for the CGR estimator, for which equations formally identical to (4.6) are given in terms of sample generalized regression coefficients.
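For intuition, the following sketch (Python/NumPy, with invented data and designs, and not the exact quantities appearing in (4.6)) computes a design-weighted regression coefficient of a non-common variable on the common variable in two samples drawn under different designs; the estimated coefficients are close but, as noted above, never exactly equal.

```python
import numpy as np

rng = np.random.default_rng(9)

# Invented population: common variable x, and a variable y observed in only one sample.
N = 30_000
x = rng.gamma(shape=2.0, scale=2.0, size=N)
y = 1.0 + 0.7 * x + rng.normal(scale=1.5, size=N)


def weighted_slope(idx, d):
    """Design-weighted least-squares slope of y on x (with intercept)."""
    xs, ys = x[idx], y[idx]
    xbar = np.sum(d * xs) / np.sum(d)
    ybar = np.sum(d * ys) / np.sum(d)
    return np.sum(d * (xs - xbar) * (ys - ybar)) / np.sum(d * (xs - xbar) ** 2)


# Two samples drawn under different designs and with different sizes.
n1, n2 = 600, 3_000
s1 = rng.choice(N, size=n1, replace=False)          # SRS, weights N/n1
p = n2 / N                                          # Bernoulli inclusion probability
s2 = np.flatnonzero(rng.random(N) < p)              # Bernoulli sample, weights 1/p

b1 = weighted_slope(s1, np.full(n1, N / n1))
b2 = weighted_slope(s2, np.full(s2.size, 1.0 / p))
print(f"estimated regression coefficients: sample 1 = {b1:.3f}, sample 2 = {b2:.3f}")
```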
Regarding the efficiency of the CGR estimator (4.5), an exact analogue of Theorem 1 holds in the present setting, with the same sampling strategies under which the CGR estimator is an optimal regression estimator and asymptotically BLUE.
Composite estimation for a matrix sampling scheme involving a core set of variables with both known and unknown totals can be carried out using the obvious extended calibration scheme.