# 2. Estimation for multiple sampling frames

Guillaume Chauvet and Guylène Tandeau de Marsac

Consider a finite population $U$ on which a variable of interest $y$ is defined, taking value ${y}_{k}$ for individual $k$. If a sample $S$ is selected from $U$ with inclusion probabilities ${\pi}_{k}$, the estimator $\widehat{Y}={\sum}_{k\in S}{\pi}_{k}^{-1}{y}_{k}$ proposed by Narain (1951) and Horvitz and Thompson (1952) is unbiased for the total $Y={\sum}_{k\in U}{y}_{k}$, provided all probabilities ${\pi}_{k}$ are strictly positive.
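As a quick illustration, the unbiasedness of this estimator can be checked exhaustively on a small example. The population, sample size and $y$-values below are hypothetical, not taken from the text; under simple random sampling of size $n$ from $N$ units, every $\pi_k = n/N > 0$.

```python
import itertools
import numpy as np

# Hypothetical toy population: N = 4 units with illustrative y-values.
y = np.array([2.0, 5.0, 3.0, 10.0])
N, n = len(y), 2  # simple random sampling without replacement of size n

# Under SRS of size n, every unit has inclusion probability pi_k = n / N > 0.
pi = np.full(N, n / N)

def narain_horvitz_thompson(sample, y, pi):
    """The estimator sum_{k in S} y_k / pi_k of the population total."""
    return sum(y[k] / pi[k] for k in sample)

# Averaging the estimator over all C(N, n) equally likely samples recovers
# the true total exactly, illustrating unbiasedness when all pi_k > 0.
samples = itertools.combinations(range(N), n)
expectation = np.mean([narain_horvitz_thompson(s, y, pi) for s in samples])
```

Here `expectation` coincides with `y.sum()` because the design-based expectation of the estimator is computed exactly over all possible samples.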

We are interested in the scenario where the population is fully covered by two overlapping sampling frames, ${U}_{A}$ and ${U}_{B}$. We use Lohr’s (2011) notation: $a={U}_{A}\backslash {U}_{B}$ is the domain covered by ${U}_{A}$ only; $b={U}_{B}\backslash {U}_{A}$ the domain covered by ${U}_{B}$ only; $ab={U}_{A}\cap {U}_{B}$ the domain covered by both ${U}_{A}$ and ${U}_{B}$. A sample ${S}^{A}$ is selected from ${U}_{A}$ with inclusion probabilities ${\pi}_{k}^{A}>0$. For any domain $d\subset {U}_{A}$, the sub-total ${Y}_{d}={\sum}_{k\in d}{y}_{k}$ is unbiasedly estimated by ${\widehat{Y}}_{d}^{A}={\sum}_{k\in {S}^{A}}{d}_{k}^{A}{y}_{k}1\left(k\in d\right)$ with ${d}_{k}^{A}={\left({\pi}_{k}^{A}\right)}^{-1}$. Similarly, a sample ${S}^{B}$ is selected from ${U}_{B}$ with inclusion probabilities ${\pi}_{k}^{B}>0$, and for any domain $d\subset {U}_{B}$ the sub-total ${Y}_{d}$ is unbiasedly estimated by ${\widehat{Y}}_{d}^{B}={\sum}_{k\in {S}^{B}}{d}_{k}^{B}{y}_{k}1\left(k\in d\right)$ with ${d}_{k}^{B}={\left({\pi}_{k}^{B}\right)}^{-1}$. The objective is to combine the samples ${S}^{A}$ and ${S}^{B}$ to estimate $Y$ as accurately as possible.

## 2.1 Hartley estimator

Hartley (1962) proposes the class of unbiased estimators

$$\widehat{Y}_{\theta}=\widehat{Y}_{a}^{A}+\theta\,\widehat{Y}_{ab}^{A}+\left(1-\theta\right)\widehat{Y}_{ab}^{B}+\widehat{Y}_{b}^{B},\tag{2.1}$$

with $\theta$ a parameter to be determined. The choice $\theta =1/2$ gives the samples ${S}^{A}$ and ${S}^{B}$ the same weight for estimation on the intersection domain $ab$. Hartley (1962) proposes choosing the parameter that minimizes the variance of ${\widehat{Y}}_{\theta}$. This leads to

$$\theta_{opt}=\frac{Cov\left(\widehat{Y}_{a}^{A}+\widehat{Y}_{ab}^{B}+\widehat{Y}_{b}^{B},\,\widehat{Y}_{ab}^{B}-\widehat{Y}_{ab}^{A}\right)}{V\left(\widehat{Y}_{ab}^{B}-\widehat{Y}_{ab}^{A}\right)},\tag{2.2}$$

which can be re-expressed as

$$\theta_{opt}=\frac{V\left(\widehat{Y}_{ab}^{B}\right)+Cov\left(\widehat{Y}_{ab}^{B},\widehat{Y}_{b}^{B}\right)-Cov\left(\widehat{Y}_{a}^{A},\widehat{Y}_{ab}^{A}\right)}{V\left(\widehat{Y}_{ab}^{A}\right)+V\left(\widehat{Y}_{ab}^{B}\right)}\tag{2.3}$$

when the samples ${S}^{A}$ and ${S}^{B}$ are independent. As noted by Lohr (2007), the optimal coefficient ${\theta}_{opt}$ may not lie between 0 and 1 if a covariance term in (2.3) is large. To simplify, assume that $Cov\left({\widehat{Y}}_{ab}^{B},{\widehat{Y}}_{b}^{B}\right)=0$, which holds if $b$ and $ab$ are used as strata in the selection of ${S}^{B}$. Then ${\theta}_{opt}>1$ if and only if $Cov\left({\widehat{Y}}_{a}^{A},{\widehat{Y}}_{ab}^{A}\right)<-V\left({\widehat{Y}}_{ab}^{A}\right)$, which requires this covariance to be negative. When ${S}^{A}$ is selected by simple random sampling, this will be the case, for example, if in ${U}_{A}$ the low values of the variable $y$ are concentrated in the domain $ab$.

In practice, the variance and covariance terms are unknown and must be replaced by estimators, which introduces additional variability. Another disadvantage is that the optimal parameter depends on the variable of interest. If optimal estimators are computed for different variables of interest, the resulting estimates may be internally inconsistent (Lohr 2011).
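The Hartley estimator and its optimal parameter can be sketched by Monte Carlo on a hypothetical two-frame example (the frames, sample sizes and $y$-values below are illustrative assumptions, not from the text). The moments entering (2.2) are approximated over repeated independent simple random samples; the resulting $\theta_{opt}$ then minimizes the empirical variance of $\widehat{Y}_{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: U = {0,...,9}, frames U_A = {0..6} and U_B = {4..9},
# so a = {0..3}, ab = {4, 5, 6}, b = {7..9}; all values are illustrative.
y = np.array([1., 2., 3., 4., 50., 60., 70., 8., 9., 10.])
U_A, U_B = np.arange(7), np.arange(4, 10)
a, ab, b = np.arange(4), np.arange(4, 7), np.arange(7, 10)
n_A, n_B = 4, 4
pi_A, pi_B = n_A / len(U_A), n_B / len(U_B)

def sub_totals():
    """Draw independent SRS samples in each frame and return the four
    Horvitz-Thompson sub-total estimates entering the Hartley estimator."""
    S_A = rng.choice(U_A, n_A, replace=False)
    S_B = rng.choice(U_B, n_B, replace=False)
    ht = lambda S, pi, d: y[S[np.isin(S, d)]].sum() / pi
    return (ht(S_A, pi_A, a), ht(S_A, pi_A, ab),
            ht(S_B, pi_B, ab), ht(S_B, pi_B, b))

# Monte Carlo approximation of the covariance and variance terms in (2.2).
sims = np.array([sub_totals() for _ in range(20000)])
Ya_A, Yab_A, Yab_B, Yb_B = sims.T
D = Yab_B - Yab_A
theta_opt = np.cov(Ya_A + Yab_B + Yb_B, D)[0, 1] / np.var(D, ddof=1)

def var_hartley(theta):
    """Empirical variance of the Hartley estimator (2.1) for a given theta."""
    return np.var(Ya_A + theta * Yab_A + (1 - theta) * Yab_B + Yb_B, ddof=1)
```

By construction, `var_hartley(theta_opt)` is no larger than `var_hartley(0.5)`; in practice the moments would themselves have to be estimated from the samples, which is the additional source of variability mentioned above.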

## 2.2 Kalton and Anderson estimator

A more general class of estimators is obtained by noting that total $Y$ can be re-expressed as

$$Y={Y}_{a}+{\displaystyle \sum _{k\in ab}}{\theta}_{k}{y}_{k}+{\displaystyle \sum _{k\in ab}}\left(1-{\theta}_{k}\right){y}_{k}+{Y}_{b}\mathrm{,}$$

with ${\theta}_{k}$ a coefficient specific to the individual $k$. Kalton and Anderson (1986) propose the choice ${\theta}_{k}={\left({d}_{k}^{A}+{d}_{k}^{B}\right)}^{-1}{d}_{k}^{B}$ , which leads to the estimator

$$\widehat{Y}_{KA}={\displaystyle \sum _{k\in {S}^{A}}}{d}_{k}^{A}{m}_{k}^{A}{y}_{k}+{\displaystyle \sum _{k\in {S}^{B}}}{d}_{k}^{B}{m}_{k}^{B}{y}_{k}\tag{2.4}$$

with ${m}_{k}^{A}=1$ if $k\in a$ and ${m}_{k}^{A}={\theta}_{k}$ if $k\in ab$ on one hand, and ${m}_{k}^{B}=1$ if $k\in b$ and ${m}_{k}^{B}=1-{\theta}_{k}$ if $k\in ab$ on the other hand. The estimation weights are the same regardless of the variable of interest, which guarantees internal consistency of the estimates; on the other hand, the Kalton and Anderson estimator is less efficient than Hartley’s optimal estimator for a given variable of interest. Note that it is a Hansen-Hurwitz (1943) type estimator, which can be re-expressed as ${\widehat{Y}}_{KA}={\sum}_{k\in U}\left[{W}_{k}/E\left({W}_{k}\right)\right]{y}_{k}$, where ${W}_{k}=1\left(k\in {S}^{A}\right)+1\left(k\in {S}^{B}\right)$ denotes the number of times unit $k$ is selected in the pooled sample ${S}^{A}\cup {S}^{B}$. In particular, $E({W}_{k})={\pi}_{k}^{A}+{\pi}_{k}^{B}$.
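A minimal sketch of the Kalton and Anderson weights on a hypothetical two-frame example (frames, sample sizes and $y$-values are illustrative assumptions, not from the text). With constant inclusion probabilities within each frame, $\theta_k = d_k^B/(d_k^A + d_k^B)$ gives every overlap unit the combined weight $1/(\pi_k^A + \pi_k^B)$ from either sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical example: frames U_A = {0..6} and U_B = {4..9} with overlap
# ab = {4, 5, 6}; SRS of size 4 in each frame, so pi_k is constant per frame.
y = np.arange(1.0, 11.0)
U_A, U_B = np.arange(7), np.arange(4, 10)
ab = np.arange(4, 7)
n_A, n_B = 4, 4
pi_A, pi_B = n_A / len(U_A), n_B / len(U_B)

def kalton_anderson(S_A, S_B):
    """KA estimator (2.4). With theta_k = d_k^B / (d_k^A + d_k^B), a unit of ab
    gets weight d_A * theta_k = d_B * (1 - theta_k) = 1 / (pi_A + pi_B) from
    either sample, i.e. the Hansen-Hurwitz weight 1 / E(W_k)."""
    total = 0.0
    for k in S_A:
        total += y[k] / (pi_A + pi_B if k in ab else pi_A)
    for k in S_B:
        total += y[k] / (pi_A + pi_B if k in ab else pi_B)
    return total

est = kalton_anderson(rng.choice(U_A, n_A, replace=False),
                      rng.choice(U_B, n_B, replace=False))
```

The weights depend only on the design, not on $y$, which is exactly the internal-consistency property noted above.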

## 2.3 Bankier estimator

Bankier (1986) proposes using a Horvitz-Thompson type estimator, with inclusion probabilities computed in the pooled sample:

$$\pi_{k}^{HT}\equiv P\left(k\in S^{A}\cup S^{B}\right)=\pi_{k}^{A}+\pi_{k}^{B}-P\left(k\in S^{A}\cap S^{B}\right).$$

If the samples ${S}^{A}$ and ${S}^{B}$ are independent, we get ${\pi}_{k}^{HT}={\pi}_{k}^{A}+{\pi}_{k}^{B}-{\pi}_{k}^{A}{\pi}_{k}^{B}$ and the estimator

$$\widehat{Y}_{HT}={\displaystyle \sum _{k\in S^{A}\cup S^{B}}}\frac{y_{k}}{\pi_{k}^{HT}}={\displaystyle \sum _{k\in S^{A}\cap a}}\frac{y_{k}}{\pi_{k}^{A}}+{\displaystyle \sum _{k\in S^{B}\cap b}}\frac{y_{k}}{\pi_{k}^{B}}+{\displaystyle \sum _{k\in \left(S^{A}\cup S^{B}\right)\cap ab}}\frac{y_{k}}{\pi_{k}^{A}+\pi_{k}^{B}-\pi_{k}^{A}\pi_{k}^{B}}.\tag{2.5}$$
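A sketch of the Bankier estimator on a hypothetical two-frame example (frames, sample sizes and $y$-values are illustrative assumptions, not from the text), with independent simple random samples so that $\pi_k^{HT} = \pi_k^A + \pi_k^B - \pi_k^A\pi_k^B$ in the overlap:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical example: frames U_A = {0..6} and U_B = {4..9}, overlap
# ab = {4, 5, 6}, with independent SRS of size 4 in each frame.
y = np.arange(1.0, 11.0)
U_A, U_B = np.arange(7), np.arange(4, 10)
ab = np.arange(4, 7)
n_A, n_B = 4, 4
pi_A, pi_B = n_A / len(U_A), n_B / len(U_B)

def bankier(S_A, S_B):
    """Bankier estimator (2.5): Horvitz-Thompson weights in the pooled sample.
    For independent samples, pi_k^HT = pi_A + pi_B - pi_A * pi_B in ab,
    and the frame-specific pi elsewhere."""
    pooled = np.union1d(S_A, S_B)
    pi_ht = np.where(np.isin(pooled, ab), pi_A + pi_B - pi_A * pi_B,
                     np.where(np.isin(pooled, U_A), pi_A, pi_B))
    return float(np.sum(y[pooled] / pi_ht))

est = bankier(rng.choice(U_A, n_A, replace=False),
              rng.choice(U_B, n_B, replace=False))
```

Since every unit of the pooled sample is counted once with weight $1/\pi_k^{HT}$, the estimator is unbiased whenever all pooled inclusion probabilities are positive.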
