On combining independent probability samples
3. Combining samples
Here we derive the design elements (e.g., first and second order inclusion probabilities) for the combined sample. There are, however, different options for combining samples. We must, for example, choose between multiple or single count for the combined design. When combining independent samples selected from the same population, we need to know the inclusion probabilities of all units in the samples, for all designs. Second order inclusion probabilities are needed for variance estimation. In some cases we also need unique identifiers (labels) for the units so they can be matched, e.g., when we use single count or when at least one separate design has unequal probabilities. Bankier (1986) considered the single count approach for the special case of combining two independently selected stratified simple random samples from the same frame. Roberts and Binder (2009) and O'Muircheartaigh and Pedlow (2002) discussed different options for combining independent samples from the same frame, but not for general sampling designs.
A somewhat similar problem is estimation based on samples from multiple overlapping frames; see, e.g., the review articles by Lohr (2009, 2011) and the references therein. Even though sampling from the same frame can be considered a special case of multiple frames, we have not found derivations of the design elements (in particular, second order inclusion probabilities and second order of expected number of inclusions) for the combination of general sampling designs. Below we present in detail, for general probability sampling designs, two main ways to combine probability samples, and we derive the corresponding design features needed for unbiased estimation and unbiased variance estimation.
3.1 Combining with single count
Here we first combine two independent samples $s_1$ and $s_2$ selected from the same population $U$ of size $N$, and look at the union $s_\cup = s_1 \cup s_2$ of the two samples as our combined sample. Thus, the inclusion of a unit is only counted once even if it is included in more than one sample. The first order inclusion probabilities are

$$\pi_{i\cup} = 1 - (1 - \pi_{i1})(1 - \pi_{i2}) = \pi_{i1} + \pi_{i2} - \pi_{i1}\pi_{i2}, \qquad (3.1)$$

where $\pi_{id} = \Pr(i \in s_d)$, $i \in U$, for $d = 1, 2$. We let $I_{i1}$ and $I_{i2}$ be the inclusion indicators for unit $i$ in $s_1$ and $s_2$, respectively. The resulting design is no longer a fixed size design (even if the separate designs are of fixed size). The expected size of the union $s_\cup$ is given by $E(n_\cup) = \sum_{i \in U} \pi_{i\cup}$, where $n_\cup$ denotes the random size of the union. If we are interested in how much the samples will overlap on average, the expected size of the overlap is given by the sum $\sum_{i \in U} \pi_{i1}\pi_{i2}$ (e.g., for two simple random samples of sizes $n_1$ and $n_2$ the expected overlap is $n_1 n_2 / N$).
The second order inclusion probabilities $\pi_{ij\cup} = \Pr(i \in s_\cup, j \in s_\cup)$ for the union $s_\cup$ can be written in terms of first and second order inclusion probabilities of the two respective designs. Let $\pi_{ijd} = \Pr(i \in s_d, j \in s_d)$, with the convention $\pi_{iid} = \pi_{id}$; then $\pi_{ij\cup} = E(I_{i\cup} I_{j\cup})$, where $I_{i\cup} = \max(I_{i1}, I_{i2})$. By conditioning on the outcomes for $i$ and $j$ in $s_1$, we get the following four cases.
Table 1
The four cases obtained by conditioning on the outcomes for units $i$ and $j$ in $s_1$

Case $k$ | Event $A_k$                   | $\Pr(A_k)$                            | $\Pr(i, j \in s_\cup \mid A_k)$
1        | $i \in s_1,\ j \in s_1$       | $\pi_{ij1}$                           | $1$
2        | $i \in s_1,\ j \notin s_1$    | $\pi_{i1} - \pi_{ij1}$                | $\pi_{j2}$
3        | $i \notin s_1,\ j \in s_1$    | $\pi_{j1} - \pi_{ij1}$                | $\pi_{i2}$
4        | $i \notin s_1,\ j \notin s_1$ | $1 - \pi_{i1} - \pi_{j1} + \pi_{ij1}$ | $\pi_{ij2}$
where $A_k$ denotes the event in case $k$, for $k = 1, 2, 3, 4$. The events $A_1, \ldots, A_4$ are disjoint and $\sum_{k=1}^{4} \Pr(A_k) = 1$. Thus, by the law of total probability, we have

$$\pi_{ij\cup} = \sum_{k=1}^{4} \Pr(i, j \in s_\cup \mid A_k) \Pr(A_k).$$

This gives us

$$\pi_{ij\cup} = \pi_{ij1} + (\pi_{i1} - \pi_{ij1})\pi_{j2} + (\pi_{j1} - \pi_{ij1})\pi_{i2} + (1 - \pi_{i1} - \pi_{j1} + \pi_{ij1})\pi_{ij2}. \qquad (3.2)$$
The equations (3.1) and (3.2) can be generalized to recursively obtain first and second order inclusion probabilities of the union of an arbitrary number $m$ of independent samples. After having derived probabilities for the union of the first two samples, we can combine the result with the probabilities of the third design using the same formulas, and so on. To exemplify, let $\pi_i^{(k)}$ be the first order inclusion probability of unit $i$ in the union of the first $k$ samples. Then we have

$$\pi_i^{(k+1)} = \pi_i^{(k)} + \pi_{i,k+1} - \pi_i^{(k)}\pi_{i,k+1}$$

as the first order inclusion probability of unit $i$ in the union of the first $k + 1$ samples. Similarly, for the second order inclusion probabilities we get the recursive formula

$$\pi_{ij}^{(k+1)} = \pi_{ij}^{(k)} + \big(\pi_i^{(k)} - \pi_{ij}^{(k)}\big)\pi_{j,k+1} + \big(\pi_j^{(k)} - \pi_{ij}^{(k)}\big)\pi_{i,k+1} + \big(1 - \pi_i^{(k)} - \pi_j^{(k)} + \pi_{ij}^{(k)}\big)\pi_{ij,k+1}.$$
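To illustrate the recursion, the following is a minimal NumPy sketch (the function name and array layout are our own, not from the paper; the diagonal of each second order matrix is assumed to hold the first order probabilities, $\pi_{iid} = \pi_{id}$):

import numpy as np

def union_inclusion_probs(pi_list, pij_list):
    # pi_list[d]  : length-N array of first order probabilities of design d+1
    # pij_list[d] : N x N array of second order probabilities of design d+1,
    #               with the diagonal set to the first order probabilities
    pi, pij = pi_list[0].copy(), pij_list[0].copy()
    for pi_new, pij_new in zip(pi_list[1:], pij_list[1:]):
        # second order recursion, i.e., (3.2) applied to the current union
        pij = (pij
               + (pi[:, None] - pij) * pi_new[None, :]
               + (pi[None, :] - pij) * pi_new[:, None]
               + (1.0 - pi[:, None] - pi[None, :] + pij) * pij_new)
        # first order recursion, i.e., (3.1) applied to the current union
        pi = pi + pi_new - pi * pi_new
    return pi, pij

Note that the second order update must use the first order probabilities of the current union, so it is computed before the first order update.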
Henceforth, for the combination of $m$ independent samples, we use the simplified notation $\pi_i = \pi_i^{(m)}$ and $\pi_{ij} = \pi_{ij}^{(m)}$. Since the individual samples may overlap, the resulting design is not of fixed size. The unbiased combined single count (SC) estimator of the total $Y = \sum_{i \in U} y_i$, which has Horvitz-Thompson form, is given by

$$\hat{Y}_{SC} = \sum_{i \in s_\cup} \frac{y_i}{\pi_i}.$$

The variance of $\hat{Y}_{SC}$ is

$$V(\hat{Y}_{SC}) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) y_i y_j,$$

and an unbiased variance estimator is

$$\hat{V}(\hat{Y}_{SC}) = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}\left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) y_i y_j.$$

For the combination of independent samples with positive first order inclusion probabilities we always have $\pi_{ij} > 0$ for all pairs $i, j$ (by independence, $\pi_{ij} \geq \pi_{i1}\pi_{j2} > 0$), which is the requirement for the above variance estimator to be unbiased. In terms of MSE it may be beneficial not to use the single count estimator, but instead to use an estimator that accounts for the random sample size. However, here we restrict ourselves to using only unbiased estimators.
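A small continuation of the NumPy sketch above, computing $\hat{Y}_{SC}$ and its unbiased variance estimator from the union sample (again, names and layout are our own illustration):

def sc_estimate(y, s_union, pi, pij):
    # s_union : indices of the distinct units in the union of the samples
    s = np.asarray(s_union)
    t_hat = np.sum(y[s] / pi[s])                       # Horvitz-Thompson form
    P = pij[np.ix_(s, s)]                              # pair probabilities pi_ij
    kernel = P / np.outer(pi[s], pi[s]) - 1.0          # pi_ij / (pi_i pi_j) - 1
    v_hat = np.sum(kernel / P * np.outer(y[s], y[s]))  # unbiased variance estimator
    return t_hat, v_hat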
3.2 Combining with multiple count
We first look at how to combine two independent samples $s_1$ and $s_2$ selected from the same population, where we allow each unit to possibly be included multiple times. The number of inclusions of unit $i$ in the combined sample is denoted by $\nu_i$, and it is the sum of the numbers of inclusions of unit $i$ in the two samples we combine, i.e., $\nu_i = \nu_{i1} + \nu_{i2}$, where $\nu_{id}$ is the number of inclusions of unit $i$ in sample $s_d$. The expected number of inclusions of unit $i$ in the combination is given by

$$\mu_i = E(\nu_i) = \mu_{i1} + \mu_{i2}, \qquad (3.3)$$

where $\mu_{id} = E(\nu_{id})$ is the expected number of inclusions for unit $i$ in sample $s_d$. The (possibly random) sample size is the sum $n = \sum_{i \in U} \nu_i$ of all individual inclusions, and the expected sample size is the sum $E(n) = \sum_{i \in U} \mu_i$ of all individual expected numbers of inclusions. It can be shown that

$$\mu_{ij} = E(\nu_i \nu_j) = \mu_{ij1} + \mu_{ij2} + \mu_{i1}\mu_{j2} + \mu_{i2}\mu_{j1}, \qquad (3.4)$$

where $\mu_{ijd} = E(\nu_{id}\nu_{jd})$ are the second order of expected number of inclusions in sample $s_d$. Obviously $\mu_{ijd} = \pi_{ijd}$ if the design for sample $s_d$ is without replacement. Note that, as $\nu_i$ may take other values than 0 or 1, we have that $\mu_{ii} = E(\nu_i^2)$ is generally not equal to $\mu_i$, but $\mu_{ii} = \mu_{ii1} + \mu_{ii2} + 2\mu_{i1}\mu_{i2}$.
The equations (3.3) and (3.4) can be used recursively to obtain $\mu_i$ and $\mu_{ij}$ for the combination of an arbitrary number $m$ of independent samples. We then get the recursive formulas

$$\mu_i^{(k+1)} = \mu_i^{(k)} + \mu_{i,k+1}$$

and

$$\mu_{ij}^{(k+1)} = \mu_{ij}^{(k)} + \mu_{ij,k+1} + \mu_i^{(k)}\mu_{j,k+1} + \mu_j^{(k)}\mu_{i,k+1}.$$

The previous results and (3.4) follow from the fact that $\nu_i = \nu_{i1} + \nu_{i2}$ and that $\nu_{id}$ and $\nu_{je}$ are independent for $d \neq e$. For example, we have

$$E(\nu_i \nu_j) = E\big[(\nu_{i1} + \nu_{i2})(\nu_{j1} + \nu_{j2})\big] = E(\nu_{i1}\nu_{j1}) + E(\nu_{i2}\nu_{j2}) + E(\nu_{i1})E(\nu_{j2}) + E(\nu_{i2})E(\nu_{j1}).$$
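In the same NumPy sketch style, the recursions for the expected counts (our own illustration):

def mc_expected_counts(mu_list, muij_list):
    # mu_list[d]  : length-N array of expected inclusion counts for design d+1
    # muij_list[d]: N x N array of second order expected counts for design d+1
    #               (the diagonal holds E(nu_id^2))
    mu, muij = mu_list[0].copy(), muij_list[0].copy()
    for mu_new, muij_new in zip(mu_list[1:], muij_list[1:]):
        # second order recursion based on (3.4)
        muij = muij + muij_new + np.outer(mu, mu_new) + np.outer(mu_new, mu)
        # first order recursion based on (3.3)
        mu = mu + mu_new
    return mu, muij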
For the combination of $m$ independent samples we now use the simplified notation $\mu_i = \mu_i^{(m)}$ and $\mu_{ij} = \mu_{ij}^{(m)}$. The total $Y$ can be estimated without bias with the multiple count (MC) estimator, of which the Hansen-Hurwitz estimator (Hansen and Hurwitz, 1943) is a special case. It is given by

$$\hat{Y}_{MC} = \sum_{i \in U} \nu_i \frac{y_i}{\mu_i}.$$

We get the Hansen-Hurwitz estimator if $\mu_i = n p_i$, where $n$ is the number of units drawn and $p_i$, with $\sum_{i \in U} p_i = 1$, are the probabilities for a single independent draw. The variance of $\hat{Y}_{MC}$ can be shown to be

$$V(\hat{Y}_{MC}) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\mu_{ij}}{\mu_i \mu_j} - 1\right) y_i y_j.$$

A variance estimator is

$$\hat{V}(\hat{Y}_{MC}) = \sum_{i \in U}\sum_{j \in U}\frac{\nu_i \nu_j}{\mu_{ij}}\left(\frac{\mu_{ij}}{\mu_i \mu_j} - 1\right) y_i y_j.$$

It follows directly that the above variance estimator is unbiased, because when combining independent samples with positive first order inclusion probabilities we always have $\mu_{ij} > 0$ for all pairs $i, j$.
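Continuing the sketch, the MC estimator and its variance estimator (our own illustration; nu is zero for units not sampled, so the sums effectively run over the sample):

def mc_estimate(y, nu, mu, muij):
    # nu : length-N array with the realized number of inclusions of each unit
    t_hat = np.sum(nu * y / mu)                # multiple count estimator
    kernel = muij / np.outer(mu, mu) - 1.0     # mu_ij / (mu_i mu_j) - 1
    w = np.outer(nu, nu) / muij                # nu_i nu_j / mu_ij
    v_hat = np.sum(w * kernel * np.outer(y, y))
    return t_hat, v_hat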
3.3 Comparing the combined and separate estimators
We give two examples illustrating that the combined estimator is not necessarily as good as the best separate estimator.

Example 3: Assume that the first sample, $s_1$, is of fixed size with $\pi_{i1}$ proportional to $y_i$, and that the second is a simple random sample with $\pi_{i2} = n_2/N$. Then the Horvitz-Thompson estimator $\hat{Y}_1 = \sum_{i \in s_1} y_i/\pi_{i1}$ has zero variance, but the combined single count estimator with $\pi_i = \pi_{i1} + \pi_{i2} - \pi_{i1}\pi_{i2}$ has positive variance. Thus the combined estimator is worse than the best separate estimator.
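A small numerical check of Example 3, reusing union_inclusion_probs from above (our own toy construction: constant $y$, so a simple random sample is a degenerate case of a fixed size design with $\pi_{i1}$ proportional to $y_i$):

N, n1, n2 = 6, 3, 2
y = np.full(N, 10.0)

def srswor_probs(N, n):
    # first and second order inclusion probabilities for SRSWOR of size n
    pi = np.full(N, n / N)
    pij = np.full((N, N), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pij, n / N)
    return pi, pij

pi1, pij1 = srswor_probs(N, n1)
pi2, pij2 = srswor_probs(N, n2)
pi_u, pij_u = union_inclusion_probs([pi1, pi2], [pij1, pij2])

def ht_variance(y, pi, pij):
    # V = sum_ij (pi_ij / (pi_i pi_j) - 1) y_i y_j, with pi_ii = pi_i
    return np.sum((pij / np.outer(pi, pi) - 1.0) * np.outer(y, y))

print(ht_variance(y, pi1, pij1))    # ~0: fixed size, equal probabilities
print(ht_variance(y, pi_u, pij_u))  # > 0: the union has random size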
Example 4: Assume that the design for the first sample is stratified in such a way that there is no variation within strata. Then the separate estimator $\hat{Y}_1$ has zero variance. If the first sample is combined with a non-stratified second sample, then the resulting design does not have fixed sample sizes within the strata. Thus, the combined estimator has positive variance.
These examples tell us that we need to be careful before combining very different designs, such as an unequal probability design with an equal probability design, or a stratified with a non-stratified sampling design. In particular, we need to be careful if we plan to estimate the total directly based on the combined sample. When combining samples from relatively similar designs, however, it is likely that the combined estimator becomes better than the best of the separate estimators.

Next, we investigate how to use the combined approach for estimation of the separate variances, and then use the linear combination estimator. In fact, as we will see later, using the combined approach for estimation of the separate variances can stabilize the weights in a linear combination with weights based on estimated variances. There is a sort of pooling effect for the variance estimators when they are estimated from the same set of information.
3.4 Using the combined sample for estimation of variances of separate estimators
An alternative to estimating the total directly based on the combined design is to use the combined design to estimate the variances of the separate estimators, and then proceed with a linear combination of the separate estimators. We assume access to $m$ independent samples and that we want to estimate the variance of a separate estimator, whose variance is a double sum over the population units. There are two main options for the variance estimator: multiply the terms $(i, j)$ in the variance formula by $1/\pi_{ij}$ (single count) or by $\nu_i \nu_j / \mu_{ij}$ (multiple count) to obtain an unbiased estimator of the variance based on the combination of all the $m$ samples $s_1, s_2, \ldots, s_m$. For example, assuming that the variance of $\hat{Y}_1$ is

$$V(\hat{Y}_1) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j,$$

we can use the combination of $s_1, s_2, \ldots, s_m$ to estimate $V(\hat{Y}_1)$ by the single count estimator

$$\hat{V}_{SC}(\hat{Y}_1) = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j,$$

or the multiple count estimator

$$\hat{V}_{MC}(\hat{Y}_1) = \sum_{i \in U}\sum_{j \in U}\frac{\nu_i \nu_j}{\mu_{ij}}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j.$$

Note that $s_\cup = \bigcup_{d=1}^{m} s_d$ and $\nu_i = \sum_{d=1}^{m} \nu_{id}$, so the above variance estimators use all available information on the target variable. Hence, these variance estimators can be thought of as general pooled variance estimators. It follows directly that both estimators are unbiased, because all designs have positive first order inclusion probabilities, which implies that all $\pi_{ij}$ and all $\mu_{ij}$ are strictly positive. Interestingly, the above variance estimators are unbiased even if the separate design 1 has some second order inclusion probabilities $\pi_{ij1}$ that are zero, which prevents unbiased variance estimation based on the sample $s_1$ alone.
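Continuing the sketch, the single count pooled estimator of $V(\hat{Y}_1)$ might look as follows (our own illustration):

def pooled_var_sc(y, s_union, pi1, pij1, pi, pij):
    # kernel of the design-1 variance, evaluated on pairs from the union
    s = np.asarray(s_union)
    kernel = pij1[np.ix_(s, s)] / np.outer(pi1[s], pi1[s]) - 1.0
    weights = 1.0 / pij[np.ix_(s, s)]  # pair weights of the combined design
    return np.sum(weights * kernel * np.outer(y[s], y[s]))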
Despite the appealing property of producing an unbiased variance estimator for any design, the above variance estimators cannot be recommended for designs with a high proportion of zero second order inclusion probabilities (such as systematic sampling). The estimators can be very unstable for such designs and can produce a high proportion of negative variance estimates.
As we will see, if we intend to use a linear combination estimator, it is important that all variances are estimated in the same way. Then it is likely that the ratios, e.g., $\hat{V}(\hat{Y}_1)/\hat{V}(\hat{Y}_2)$, become stable (have small variance). The ratios become more stable because the estimators in the numerator and denominator are based on the same information and are estimated with the same weights for all the pairs $(i, j)$ in all estimators. With estimated variances we get, for two samples, the weight

$$\hat{w}_1 = \frac{1/\hat{V}(\hat{Y}_1)}{1/\hat{V}(\hat{Y}_1) + 1/\hat{V}(\hat{Y}_2)} = \left(1 + \frac{\hat{V}(\hat{Y}_1)}{\hat{V}(\hat{Y}_2)}\right)^{-1},$$

so if the ratio of the variance estimators has small variance, then $\hat{w}_1$ has small variance. The weighting in the linear combination $\hat{w}_1 \hat{Y}_1 + (1 - \hat{w}_1)\hat{Y}_2$ then becomes stabilized. As the following example demonstrates, the ratio of the variance estimators can even have zero variance. Thus it can sometimes provide the optimal weighting even if the variances are unknown.
Example 5: Assume we want to combine estimates resulting from two simple random samples of different sizes. This can of course be done optimally without estimating the variances, but as an example we will use the above approach to estimate the separate variances by use of the combined sample. In this case the use of the estimators $\hat{V}_{SC}(\hat{Y}_1)$ and $\hat{V}_{SC}(\hat{Y}_2)$ provides the optimal weighting, and so does the use of $\hat{V}_{MC}(\hat{Y}_1)$ and $\hat{V}_{MC}(\hat{Y}_2)$. This result follows from the fact that if both designs are simple random sampling we have

$$\frac{\hat{V}_{SC}(\hat{Y}_1)}{\hat{V}_{SC}(\hat{Y}_2)} = \frac{\hat{V}_{MC}(\hat{Y}_1)}{\hat{V}_{MC}(\hat{Y}_2)} = \frac{V(\hat{Y}_1)}{V(\hat{Y}_2)},$$

which is straightforward to verify. For two simple random samples the situation corresponds to using a pooled estimate for $S_y^2$ (the population variance of $y$) in the expressions for the variance estimates, and this pooled estimate is then cancelled out in the calculation of the weights.
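To make the verification explicit, here is a short sketch of the cancellation (our own derivation, for simple random sampling without replacement of sizes $n_1$ and $n_2$ from $N$ units). Since $\pi_{id} = n_d/N$ and $\pi_{ijd} = n_d(n_d - 1)/(N(N - 1))$ for $i \neq j$, the design-$d$ variance kernel factorizes as

$$\frac{\pi_{ijd}}{\pi_{id}\pi_{jd}} - 1 = \frac{N - n_d}{n_d}\, c_{ij}, \qquad c_{ii} = 1, \quad c_{ij} = -\frac{1}{N - 1} \ \ (i \neq j).$$

Hence

$$\hat{V}_{SC}(\hat{Y}_d) = \frac{N - n_d}{n_d} \sum_{i \in s_\cup}\sum_{j \in s_\cup} \frac{c_{ij}\, y_i y_j}{\pi_{ij}},$$

where the double sum is the same for $d = 1, 2$ (it plays the role of the pooled estimate), so it cancels in the ratio, which equals the known constant $\dfrac{(N - n_1)/n_1}{(N - n_2)/n_2} = \dfrac{V(\hat{Y}_1)}{V(\hat{Y}_2)}$ with zero variance.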
The conclusion is that this procedure is likely to provide a more stable weighting also for designs that deviate from simple random sampling, as long as the involved designs have large entropy (a high degree of randomness). The problem of bias for the linear combination estimator with estimated variances will then be reduced compared to using separate, and thus independent, variance estimators.

We believe that this can be a very interesting alternative, because the estimator of the total based on the combined design does not necessarily have smaller variance than the best of the separate estimators. With this strategy we can improve the separate variance estimators, especially for a smaller sample (if data are available from a larger sample). Hence the resulting linear combination with jointly estimated variances can be a very competitive strategy.
With single count we might use a ratio type variance estimator such as the following:

$$\hat{V}_{ratio}(\hat{Y}_1) = \frac{N^2}{\hat{N}^2}\,\hat{V}_{SC}(\hat{Y}_1), \qquad \text{where} \qquad \hat{N}^2 = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}.$$

For multiple count we can replace $\hat{N}^2$ with $\sum_{i \in U}\sum_{j \in U} \nu_i \nu_j / \mu_{ij}$. This ratio estimator uses the known size of the population of pairs $(i, j)$, which is $N^2$, and divides by the sum of the sample weights for the pairs. Note that $E(\hat{N}^2) = N^2$. This correction is useful because the number of pairs in the estimator may be random (since the union of the samples may have random size). It rescales the sample (of pairs) weights to sum to $N^2$. This will introduce some bias (as usual for ratio estimators), but the idea is that it will reduce the variance of the variance estimator. However, this approach is only useful if we are interested in the separate variance itself, as the correction term is the same for all separate variance estimators. Hence it does not change the weighting of a linear combination estimator with estimated variances.
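A final continuation of the sketch, for the ratio-adjusted single count variance estimator (our own illustration):

def pooled_var_sc_ratio(y, s_union, pi1, pij1, pi, pij, N):
    s = np.asarray(s_union)
    weights = 1.0 / pij[np.ix_(s, s)]  # pair weights of the combined design
    N2_hat = np.sum(weights)           # unbiased for N^2, the number of pairs
    kernel = pij1[np.ix_(s, s)] / np.outer(pi1[s], pi1[s]) - 1.0
    # rescale the pair weights so that they sum to the known N^2
    return (N ** 2 / N2_hat) * np.sum(weights * kernel * np.outer(y[s], y[s]))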