On combining independent probability samples
3. Combining samples
Here we derive the design elements (e.g., first and second order inclusion probabilities) for the combined sample. There are, however, different options for combining samples. We must, for example, choose between multiple or single count for the combined design. When combining independent samples selected from the same population, we need to know the inclusion probabilities of all units in the samples, for all designs. Second order inclusion probabilities are needed for variance estimation. In some cases we also need unique identifiers (labels) for the units so they can be matched, e.g., when we use single count or when at least one separate design has unequal probabilities. Bankier (1986) considered the single count approach for the special case of combining two independently selected stratified simple random samples from the same frame. Roberts and Binder (2009) and O'Muircheartaigh and Pedlow (2002) discussed different options for combining independent samples from the same frame, but not for general sampling designs.
A somewhat similar problem is estimation based on samples from multiple overlapping frames; see, e.g., the review articles by Lohr (2009, 2011) and the references therein. Even though sampling from the same frame can be considered a special case of multiple frames, we have not found derivations of the design elements (in particular, second order inclusion probabilities and second order of expected number of inclusions) for the combination of general sampling designs. Below we present in detail, for general probability sampling designs, two main ways to combine probability samples, and we derive the corresponding design features needed for unbiased estimation and unbiased variance estimation.
3.1 Combining with single count
Here we first combine two independent samples $s_1$ and $s_2$ selected from the same population $U$ of size $N$, and look at the union $s_\cup = s_1 \cup s_2$ of the two samples as our combined sample. Thus, the inclusion of a unit is only counted once even if it is included in more than one sample. The first order inclusion probabilities are

$$\pi_{i\cup} = 1 - (1 - \pi_{i1})(1 - \pi_{i2}) = \pi_{i1} + \pi_{i2} - \pi_{i1}\pi_{i2}, \qquad (3.1)$$

where $\pi_{id} = \Pr(i \in s_d)$, $i \in U$, for $d = 1, 2$. We let $I_{i1}$ and $I_{i2}$ be the inclusion indicators for unit $i$ in $s_1$ and $s_2$, respectively. The resulting design is no longer a fixed size design (even if the separate designs are of fixed size). The expected size of the union $s_\cup$ is given by $E(n_\cup) = \sum_{i \in U} \pi_{i\cup}$, where $n_\cup$ denotes the random size of the union. If we are interested in how much the samples will overlap on average, the expected size of the overlap is given by the sum $\sum_{i \in U} \pi_{i1}\pi_{i2}$ (e.g., for two simple random samples of sizes $n_1$ and $n_2$ the expected overlap is $n_1 n_2 / N$).
The second order inclusion probabilities $\pi_{ij\cup} = \Pr(i \in s_\cup, j \in s_\cup)$ for the union $s_\cup$ can be written in terms of first and second order inclusion probabilities of the two respective designs. Let $\pi_{ijd} = \Pr(i \in s_d, j \in s_d)$, with the convention $\pi_{iid} = \pi_{id}$; then $\pi_{ij\cup} = E(I_{i\cup} I_{j\cup})$, where $I_{i\cup} = \max(I_{i1}, I_{i2})$. By conditioning on the outcomes for $i$ and $j$ in $s_1$, we get the following four cases.
Table 1
The four cases obtained by conditioning on the outcomes for units $i$ and $j$ in $s_1$

Case $k$ | Event $A_k$                   | $\Pr(A_k)$                            | $\Pr(i, j \in s_\cup \mid A_k)$
1        | $i \in s_1,\ j \in s_1$       | $\pi_{ij1}$                           | $1$
2        | $i \in s_1,\ j \notin s_1$    | $\pi_{i1} - \pi_{ij1}$                | $\pi_{j2}$
3        | $i \notin s_1,\ j \in s_1$    | $\pi_{j1} - \pi_{ij1}$                | $\pi_{i2}$
4        | $i \notin s_1,\ j \notin s_1$ | $1 - \pi_{i1} - \pi_{j1} + \pi_{ij1}$ | $\pi_{ij2}$
where $A_k$ denotes the event in case $k$, for $k = 1, 2, 3, 4$. The events $A_1, \ldots, A_4$ are disjoint and $\sum_{k=1}^{4} \Pr(A_k) = 1$. Thus, by the law of total probability, we have

$$\pi_{ij\cup} = \sum_{k=1}^{4} \Pr(i, j \in s_\cup \mid A_k) \Pr(A_k).$$

This gives us

$$\pi_{ij\cup} = \pi_{ij1} + (\pi_{i1} - \pi_{ij1})\pi_{j2} + (\pi_{j1} - \pi_{ij1})\pi_{i2} + (1 - \pi_{i1} - \pi_{j1} + \pi_{ij1})\pi_{ij2}. \qquad (3.2)$$
The equations (3.1) and (3.2) can be generalized to recursively obtain first and second order inclusion probabilities of the union of an arbitrary number $m$ of independent samples. After having derived probabilities for the union of the first two samples, we can combine the result with the probabilities of the third design using the same formulas, and so on. To exemplify, let $\pi_i^{(k)}$ be the first order inclusion probability of unit $i$ in the union of the first $k$ samples. Then we have

$$\pi_i^{(k+1)} = \pi_i^{(k)} + \pi_{i,k+1} - \pi_i^{(k)}\pi_{i,k+1}$$

as the first order inclusion probability of unit $i$ in the union of the first $k + 1$ samples. Similarly, for the second order inclusion probabilities we get the recursive formula

$$\pi_{ij}^{(k+1)} = \pi_{ij}^{(k)} + \big(\pi_i^{(k)} - \pi_{ij}^{(k)}\big)\pi_{j,k+1} + \big(\pi_j^{(k)} - \pi_{ij}^{(k)}\big)\pi_{i,k+1} + \big(1 - \pi_i^{(k)} - \pi_j^{(k)} + \pi_{ij}^{(k)}\big)\pi_{ij,k+1}.$$
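To illustrate the recursion, the following is a minimal NumPy sketch (the function name and array layout are our own, not from the paper; the diagonal of each second order matrix is assumed to hold the first order probabilities, $\pi_{iid} = \pi_{id}$):

import numpy as np

def union_inclusion_probs(pi_list, pij_list):
    # pi_list[d]  : length-N array of first order probabilities of design d+1
    # pij_list[d] : N x N array of second order probabilities of design d+1,
    #               with the diagonal set to the first order probabilities
    pi, pij = pi_list[0].copy(), pij_list[0].copy()
    for pi_new, pij_new in zip(pi_list[1:], pij_list[1:]):
        # second order recursion, i.e., (3.2) applied to the current union
        pij = (pij
               + (pi[:, None] - pij) * pi_new[None, :]
               + (pi[None, :] - pij) * pi_new[:, None]
               + (1.0 - pi[:, None] - pi[None, :] + pij) * pij_new)
        # first order recursion, i.e., (3.1) applied to the current union
        pi = pi + pi_new - pi * pi_new
    return pi, pij

Note that the second order update must use the first order probabilities of the current union, so it is computed before the first order update.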
Henceforth, for the combination of $m$ independent samples, we use the simplified notation $\pi_i = \pi_i^{(m)}$ and $\pi_{ij} = \pi_{ij}^{(m)}$. Since the individual samples may overlap, the resulting design is not of fixed size. The unbiased combined single count (SC) estimator of the total $Y = \sum_{i \in U} y_i$, which has Horvitz-Thompson form, is given by

$$\hat{Y}_{SC} = \sum_{i \in s_\cup} \frac{y_i}{\pi_i}.$$

The variance of $\hat{Y}_{SC}$ is

$$V(\hat{Y}_{SC}) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) y_i y_j,$$

and an unbiased variance estimator is

$$\hat{V}(\hat{Y}_{SC}) = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}\left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) y_i y_j.$$

For the combination of independent samples with positive first order inclusion probabilities we always have $\pi_{ij} > 0$ for all pairs $i, j$ (by independence, $\pi_{ij} \geq \pi_{i1}\pi_{j2} > 0$), which is the requirement for the above variance estimator to be unbiased. In terms of MSE it may be beneficial not to use the single count estimator, but instead to use an estimator that accounts for the random sample size. However, here we restrict ourselves to using only unbiased estimators.
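A small continuation of the NumPy sketch above, computing $\hat{Y}_{SC}$ and its unbiased variance estimator from the union sample (again, names and layout are our own illustration):

def sc_estimate(y, s_union, pi, pij):
    # s_union : indices of the distinct units in the union of the samples
    s = np.asarray(s_union)
    t_hat = np.sum(y[s] / pi[s])                       # Horvitz-Thompson form
    P = pij[np.ix_(s, s)]                              # pair probabilities pi_ij
    kernel = P / np.outer(pi[s], pi[s]) - 1.0          # pi_ij / (pi_i pi_j) - 1
    v_hat = np.sum(kernel / P * np.outer(y[s], y[s]))  # unbiased variance estimator
    return t_hat, v_hat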
3.2 Combining with multiple count
We first look at how to combine two independent samples $s_1$ and $s_2$ selected from the same population, where we allow each unit to possibly be included multiple times. The number of inclusions of unit $i$ in the combined sample is denoted by $\nu_i$, and it is the sum of the numbers of inclusions of unit $i$ in the two samples we combine, i.e., $\nu_i = \nu_{i1} + \nu_{i2}$, where $\nu_{id}$ is the number of inclusions of unit $i$ in sample $s_d$. The expected number of inclusions of unit $i$ in the combination is given by

$$\mu_i = E(\nu_i) = \mu_{i1} + \mu_{i2}, \qquad (3.3)$$

where $\mu_{id} = E(\nu_{id})$ is the expected number of inclusions for unit $i$ in sample $s_d$. The (possibly random) sample size is the sum $n = \sum_{i \in U} \nu_i$ of all individual inclusions, and the expected sample size is the sum $E(n) = \sum_{i \in U} \mu_i$ of all individual expected numbers of inclusions. It can be shown that

$$\mu_{ij} = E(\nu_i \nu_j) = \mu_{ij1} + \mu_{ij2} + \mu_{i1}\mu_{j2} + \mu_{i2}\mu_{j1}, \qquad (3.4)$$

where $\mu_{ijd} = E(\nu_{id}\nu_{jd})$ are the second order of expected number of inclusions in sample $s_d$. Obviously $\mu_{ijd} = \pi_{ijd}$ if the design for sample $s_d$ is without replacement. Note that, as $\nu_i$ may take other values than 0 or 1, we have that $\mu_{ii} = E(\nu_i^2)$ is generally not equal to $\mu_i$, but $\mu_{ii} = \mu_{ii1} + \mu_{ii2} + 2\mu_{i1}\mu_{i2}$.
The equations (3.3) and (3.4) can be used recursively to obtain $\mu_i$ and $\mu_{ij}$ for the combination of an arbitrary number $m$ of independent samples. We then get the recursive formulas

$$\mu_i^{(k+1)} = \mu_i^{(k)} + \mu_{i,k+1}$$

and

$$\mu_{ij}^{(k+1)} = \mu_{ij}^{(k)} + \mu_{ij,k+1} + \mu_i^{(k)}\mu_{j,k+1} + \mu_j^{(k)}\mu_{i,k+1}.$$

The previous results and (3.4) follow from the fact that $\nu_i = \nu_{i1} + \nu_{i2}$ and that $\nu_{id}$ and $\nu_{je}$ are independent for $d \neq e$. For example, we have

$$E(\nu_i \nu_j) = E\big[(\nu_{i1} + \nu_{i2})(\nu_{j1} + \nu_{j2})\big] = E(\nu_{i1}\nu_{j1}) + E(\nu_{i2}\nu_{j2}) + E(\nu_{i1})E(\nu_{j2}) + E(\nu_{i2})E(\nu_{j1}).$$
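In the same NumPy sketch style, the recursions for the expected counts (our own illustration):

def mc_expected_counts(mu_list, muij_list):
    # mu_list[d]  : length-N array of expected inclusion counts for design d+1
    # muij_list[d]: N x N array of second order expected counts for design d+1
    #               (the diagonal holds E(nu_id^2))
    mu, muij = mu_list[0].copy(), muij_list[0].copy()
    for mu_new, muij_new in zip(mu_list[1:], muij_list[1:]):
        # second order recursion based on (3.4)
        muij = muij + muij_new + np.outer(mu, mu_new) + np.outer(mu_new, mu)
        # first order recursion based on (3.3)
        mu = mu + mu_new
    return mu, muij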
For the combination of $m$ independent samples we now use the simplified notation $\mu_i = \mu_i^{(m)}$ and $\mu_{ij} = \mu_{ij}^{(m)}$. The total $Y$ can be estimated without bias with the multiple count (MC) estimator, of which the Hansen-Hurwitz estimator (Hansen and Hurwitz, 1943) is a special case. It is given by

$$\hat{Y}_{MC} = \sum_{i \in U} \nu_i \frac{y_i}{\mu_i}.$$

We get the Hansen-Hurwitz estimator if $\mu_i = n p_i$, where $n$ is the number of units drawn and $p_i$, with $\sum_{i \in U} p_i = 1$, are the probabilities for a single independent draw. The variance of $\hat{Y}_{MC}$ can be shown to be

$$V(\hat{Y}_{MC}) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\mu_{ij}}{\mu_i \mu_j} - 1\right) y_i y_j.$$

A variance estimator is

$$\hat{V}(\hat{Y}_{MC}) = \sum_{i \in U}\sum_{j \in U}\frac{\nu_i \nu_j}{\mu_{ij}}\left(\frac{\mu_{ij}}{\mu_i \mu_j} - 1\right) y_i y_j.$$

It follows directly that the above variance estimator is unbiased, because when combining independent samples with positive first order inclusion probabilities we always have $\mu_{ij} > 0$ for all pairs $i, j$.
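Continuing the sketch, the MC estimator and its variance estimator (our own illustration; nu is zero for units not sampled, so the sums effectively run over the sample):

def mc_estimate(y, nu, mu, muij):
    # nu : length-N array with the realized number of inclusions of each unit
    t_hat = np.sum(nu * y / mu)                # multiple count estimator
    kernel = muij / np.outer(mu, mu) - 1.0     # mu_ij / (mu_i mu_j) - 1
    w = np.outer(nu, nu) / muij                # nu_i nu_j / mu_ij
    v_hat = np.sum(w * kernel * np.outer(y, y))
    return t_hat, v_hat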
3.3 Comparing the combined and separate estimators
We give two examples illustrating that the combined estimator is not necessarily as good as the best separate estimator.

Example 3: Assume that the first sample, $s_1$, is of fixed size with $\pi_{i1}$ proportional to $y_i$, and that the second is a simple random sample with $\pi_{i2} = n_2/N$. Then the Horvitz-Thompson estimator $\hat{Y}_1 = \sum_{i \in s_1} y_i/\pi_{i1}$ has zero variance, but the combined single count estimator with $\pi_i = \pi_{i1} + \pi_{i2} - \pi_{i1}\pi_{i2}$ has positive variance. Thus the combined estimator is worse than the best separate estimator.
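A small numerical check of Example 3, reusing union_inclusion_probs from above (our own toy construction: constant $y$, so a simple random sample is a degenerate case of a fixed size design with $\pi_{i1}$ proportional to $y_i$):

N, n1, n2 = 6, 3, 2
y = np.full(N, 10.0)

def srswor_probs(N, n):
    # first and second order inclusion probabilities for SRSWOR of size n
    pi = np.full(N, n / N)
    pij = np.full((N, N), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pij, n / N)
    return pi, pij

pi1, pij1 = srswor_probs(N, n1)
pi2, pij2 = srswor_probs(N, n2)
pi_u, pij_u = union_inclusion_probs([pi1, pi2], [pij1, pij2])

def ht_variance(y, pi, pij):
    # V = sum_ij (pi_ij / (pi_i pi_j) - 1) y_i y_j, with pi_ii = pi_i
    return np.sum((pij / np.outer(pi, pi) - 1.0) * np.outer(y, y))

print(ht_variance(y, pi1, pij1))    # ~0: fixed size, equal probabilities
print(ht_variance(y, pi_u, pij_u))  # > 0: the union has random size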
Example 4: Assume that the design for the first sample is stratified in such a way that there is no variation within strata. Then the separate estimator $\hat{Y}_1$ has zero variance. If the first sample is combined with a non-stratified second sample, then the resulting design does not have fixed sample sizes within the strata. Thus, the combined estimator has positive variance.
These examples tell us that we need to be careful before combining very different designs, such as an unequal probability design with an equal probability design, or a stratified with a non-stratified sampling design. In particular, we need to be careful if we plan to estimate the total directly based on the combined sample. When combining samples from relatively similar designs, however, it is likely that the combined estimator becomes better than the best of the separate estimators.

Next, we investigate how to use the combined approach for estimation of the separate variances, and then use the linear combination estimator. In fact, as we will see later, using the combined approach for estimation of the separate variances can stabilize the weights in a linear combination with weights based on estimated variances. There is a sort of pooling effect for the variance estimators when they are estimated from the same set of information.
3.4 Using the combined sample for estimation of variances of separate estimators
An alternative to estimating the total directly based on the combined design is to use the combined design to estimate the variances of the separate estimators, and then proceed with a linear combination of the separate estimators. We assume access to $m$ independent samples and that we want to estimate the variance of a separate estimator, whose variance is a double sum over the population units. There are two main options for the variance estimator: multiply the terms $(i, j)$ in the variance formula by $1/\pi_{ij}$ (single count) or by $\nu_i \nu_j / \mu_{ij}$ (multiple count) to obtain an unbiased estimator of the variance based on the combination of all the $m$ samples $s_1, s_2, \ldots, s_m$. For example, assuming that the variance of $\hat{Y}_1$ is

$$V(\hat{Y}_1) = \sum_{i \in U}\sum_{j \in U}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j,$$

we can use the combination of $s_1, s_2, \ldots, s_m$ to estimate $V(\hat{Y}_1)$ by the single count estimator

$$\hat{V}_{SC}(\hat{Y}_1) = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j,$$

or the multiple count estimator

$$\hat{V}_{MC}(\hat{Y}_1) = \sum_{i \in U}\sum_{j \in U}\frac{\nu_i \nu_j}{\mu_{ij}}\left(\frac{\pi_{ij1}}{\pi_{i1}\pi_{j1}} - 1\right) y_i y_j.$$

Note that $s_\cup = \bigcup_{d=1}^{m} s_d$ and $\nu_i = \sum_{d=1}^{m} \nu_{id}$, so the above variance estimators use all available information on the target variable. Hence, these variance estimators can be thought of as general pooled variance estimators. It follows directly that both estimators are unbiased, because all designs have positive first order inclusion probabilities, which implies that all $\pi_{ij}$ and all $\mu_{ij}$ are strictly positive. Interestingly, the above variance estimators are unbiased even if the separate design 1 has some second order inclusion probabilities $\pi_{ij1}$ that are zero, which prevents unbiased variance estimation based on the sample $s_1$ alone.
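Continuing the sketch, the single count pooled estimator of $V(\hat{Y}_1)$ might look as follows (our own illustration):

def pooled_var_sc(y, s_union, pi1, pij1, pi, pij):
    # kernel of the design-1 variance, evaluated on pairs from the union
    s = np.asarray(s_union)
    kernel = pij1[np.ix_(s, s)] / np.outer(pi1[s], pi1[s]) - 1.0
    weights = 1.0 / pij[np.ix_(s, s)]  # pair weights of the combined design
    return np.sum(weights * kernel * np.outer(y[s], y[s]))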
Despite the appealing property of producing an unbiased variance estimator for any design, the above variance estimators cannot be recommended for designs with a high proportion of zero second order inclusion probabilities (such as systematic sampling). The estimators can be very unstable for such designs and can produce a high proportion of negative variance estimates.
As we will see, if we intend to use a linear combination estimator, it is important that all variances are estimated in the same way. Then it is likely that the ratios, e.g., $\hat{V}(\hat{Y}_1)/\hat{V}(\hat{Y}_2)$, become stable (have small variance). The ratios become more stable because the estimators in the numerator and denominator are based on the same information and are estimated with the same weights for all the pairs $(i, j)$ in all estimators. With estimated variances we get, for two samples, the weight

$$\hat{w}_1 = \frac{1/\hat{V}(\hat{Y}_1)}{1/\hat{V}(\hat{Y}_1) + 1/\hat{V}(\hat{Y}_2)} = \left(1 + \frac{\hat{V}(\hat{Y}_1)}{\hat{V}(\hat{Y}_2)}\right)^{-1},$$

so if the ratio of the variance estimators has small variance, then $\hat{w}_1$ has small variance. The weighting in the linear combination $\hat{w}_1 \hat{Y}_1 + (1 - \hat{w}_1)\hat{Y}_2$ then becomes stabilized. As the following example demonstrates, the ratio of the variance estimators can even have zero variance. Thus it can sometimes provide the optimal weighting even if the variances are unknown.
Example 5: Assume we want to combine estimates resulting from two simple random samples of different sizes. This can of course be done optimally without estimating the variances, but as an example we will use the above approach to estimate the separate variances by use of the combined sample. In this case the use of the estimators $\hat{V}_{SC}(\hat{Y}_1)$ and $\hat{V}_{SC}(\hat{Y}_2)$ provides the optimal weighting, and so does the use of $\hat{V}_{MC}(\hat{Y}_1)$ and $\hat{V}_{MC}(\hat{Y}_2)$. This result follows from the fact that if both designs are simple random sampling we have

$$\frac{\hat{V}_{SC}(\hat{Y}_1)}{\hat{V}_{SC}(\hat{Y}_2)} = \frac{\hat{V}_{MC}(\hat{Y}_1)}{\hat{V}_{MC}(\hat{Y}_2)} = \frac{V(\hat{Y}_1)}{V(\hat{Y}_2)},$$

which is straightforward to verify. For two simple random samples the situation corresponds to using a pooled estimate for $S_y^2$ (the population variance of $y$) in the expressions for the variance estimates, and this pooled estimate is then cancelled out in the calculation of the weights.
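To make the verification explicit, here is a short sketch of the cancellation (our own derivation, for simple random sampling without replacement of sizes $n_1$ and $n_2$ from $N$ units). Since $\pi_{id} = n_d/N$ and $\pi_{ijd} = n_d(n_d - 1)/(N(N - 1))$ for $i \neq j$, the design-$d$ variance kernel factorizes as

$$\frac{\pi_{ijd}}{\pi_{id}\pi_{jd}} - 1 = \frac{N - n_d}{n_d}\, c_{ij}, \qquad c_{ii} = 1, \quad c_{ij} = -\frac{1}{N - 1} \ \ (i \neq j).$$

Hence

$$\hat{V}_{SC}(\hat{Y}_d) = \frac{N - n_d}{n_d} \sum_{i \in s_\cup}\sum_{j \in s_\cup} \frac{c_{ij}\, y_i y_j}{\pi_{ij}},$$

where the double sum is the same for $d = 1, 2$ (it plays the role of the pooled estimate), so it cancels in the ratio, which equals the known constant $\dfrac{(N - n_1)/n_1}{(N - n_2)/n_2} = \dfrac{V(\hat{Y}_1)}{V(\hat{Y}_2)}$ with zero variance.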
The conclusion is that this procedure is likely to provide a more stable weighting also for designs that deviate from simple random sampling, as long as the involved designs have large entropy (a high degree of randomness). The problem of bias for the linear combination estimator with estimated variances will then be reduced compared to using separate, and thus independent, variance estimators.

We believe that this can be a very interesting alternative, because the estimator of the total based on the combined design does not necessarily have smaller variance than the best of the separate estimators. With this strategy we can improve the separate variance estimators, especially for a smaller sample (if data are available from a larger sample). Hence the resulting linear combination with jointly estimated variances can be a very competitive strategy.
With single count we might use a ratio type variance estimator such as the following:

$$\hat{V}_{ratio}(\hat{Y}_1) = \frac{N^2}{\hat{N}^2}\,\hat{V}_{SC}(\hat{Y}_1), \qquad \text{where} \qquad \hat{N}^2 = \sum_{i \in s_\cup}\sum_{j \in s_\cup}\frac{1}{\pi_{ij}}.$$

For multiple count we can replace $\hat{N}^2$ with $\sum_{i \in U}\sum_{j \in U} \nu_i \nu_j / \mu_{ij}$. This ratio estimator uses the known size of the population of pairs $(i, j)$, which is $N^2$, and divides by the sum of the sample weights for the pairs. Note that $E(\hat{N}^2) = N^2$. This correction is useful because the number of pairs in the estimator may be random (since the union of the samples may have random size). It rescales the sample (of pairs) weights to sum to $N^2$. This will introduce some bias (as usual for ratio estimators), but the idea is that it will reduce the variance of the variance estimator. However, this approach is only useful if we are interested in the separate variance itself, as the correction term is the same for all separate variance estimators. Hence it does not change the weighting of a linear combination estimator with estimated variances.
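A final continuation of the sketch, for the ratio-adjusted single count variance estimator (our own illustration):

def pooled_var_sc_ratio(y, s_union, pi1, pij1, pi, pij, N):
    s = np.asarray(s_union)
    weights = 1.0 / pij[np.ix_(s, s)]  # pair weights of the combined design
    N2_hat = np.sum(weights)           # unbiased for N^2, the number of pairs
    kernel = pij1[np.ix_(s, s)] / np.outer(pi1[s], pi1[s]) - 1.0
    # rescale the pair weights so that they sum to the known N^2
    return (N ** 2 / N2_hat) * np.sum(weights * kernel * np.outer(y[s], y[s]))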