# 5. Conclusion

Guillaume Chauvet and Guylène Tandeau de Marsac

We examined the Hartley (1962), Kalton and Anderson (1986) and Bankier (1986) estimators for pooling the samples from two survey waves. In particular, we studied the case where the first sample represents the entire population (completely representative sample), while the second represents only part of it (partially representative sample). Within the framework considered in the simulations (see also the Appendix for a more general framework), using the partially representative sample did not noticeably improve accuracy: as its size increases, the accuracy of the estimators in the Hartley class remains stable or improves slightly, while that of the Kalton and Anderson and Bankier estimators deteriorates. Hartley’s optimal estimator, although more complex to calculate, is only slightly more accurate than the classic Horvitz-Thompson estimator calculated on the completely representative sample alone. Although our simulation study is limited, the results suggest that the estimator should be chosen carefully when there are multiple sampling frames, and that a simple estimator is sometimes preferable, even if it uses only part of the information collected.

# Acknowledgements

The authors would like to thank an associate editor and a referee for their careful reading and comments, which helped to significantly improve the article, and David Haziza for useful discussions.

# Appendix

## A1. Comparison of Hartley’s optimal estimator and the Horvitz-Thompson estimator

We use the framework and notation of Section 4: samples ${S}^{A}$ and ${S}^{B}$ are selected using a two-stage design with a common first-stage selection. Stratified simple random sampling is used at the first stage, and simple random sampling within each primary sampling unit at the second stage. The sampling frame ${U}_{A}$ corresponds to the entire population, while the sampling frame ${U}_{B}$ covers only part of the population.

For Hartley’s optimal estimator, formula (3.6) gives

$${\theta}_{opt|{S}_{I}}=\frac{EV\left({\widehat{Y}}_{ab}^{B}|{S}_{I}\right)-ECov\left({\widehat{Y}}_{a}^{A}\mathrm{,}{\widehat{Y}}_{ab}^{A}|{S}_{I}\right)}{EV\left({\widehat{Y}}_{ab}^{B}|{S}_{I}\right)+EV\left({\widehat{Y}}_{ab}^{A}|{S}_{I}\right)}\mathrm{.}$$
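As a minimal sketch, the displayed ratio can be evaluated once the three expected variance and covariance terms are available. The function name `theta_opt` and its argument names are hypothetical, and the numeric inputs are made up purely for illustration; they are not taken from the simulations.

```python
def theta_opt(ev_ab_B, ev_ab_A, ecov_a_ab_A):
    """Evaluate the ratio displayed above.

    ev_ab_B     -- EV(Y_ab^B | S_I)
    ev_ab_A     -- EV(Y_ab^A | S_I)
    ecov_a_ab_A -- ECov(Y_a^A, Y_ab^A | S_I)
    (Argument names are illustrative, not from the paper.)
    """
    denominator = ev_ab_B + ev_ab_A
    if denominator <= 0:
        raise ValueError("the sum of the two expected variances must be positive")
    return (ev_ab_B - ecov_a_ab_A) / denominator


# Made-up inputs: when EV(Y_ab^A) equals -ECov(Y_a^A, Y_ab^A),
# numerator and denominator coincide, so the ratio is 1.
theta = theta_opt(ev_ab_B=4.0, ev_ab_A=2.5, ecov_a_ab_A=-2.5)
print(theta)
```

The guard on the denominator is a design choice of this sketch; it simply rejects inputs for which the ratio is undefined.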

After some calculation, we get

$$\begin{array}{l}EV\left({\widehat{Y}}_{ab}^{A}|{S}_{I}\right)={\displaystyle \sum _{h=1}^{H}}\frac{{M}_{h}}{{m}_{h}}{\displaystyle \sum _{{u}_{hi}\in {U}_{Ih}}}{\left({N}_{hi}\right)}^{2}\frac{1-{f}_{hi}^{A}}{{n}_{hi}^{A}}\left\{\frac{{N}_{hi}^{B}-1}{{N}_{hi}-1}{S}_{{u}_{hi}^{B}}^{2}+\frac{{N}_{hi}^{B}\left({N}_{hi}-{N}_{hi}^{B}\right){\left({\overline{y}}_{{u}_{hi}^{B}}\right)}^{2}}{{N}_{hi}\left({N}_{hi}-1\right)}\right\},\qquad \text{(A.1)}\\ -ECov\left({\widehat{Y}}_{a}^{A}\mathrm{,}{\widehat{Y}}_{ab}^{A}|{S}_{I}\right)={\displaystyle \sum _{h=1}^{H}}\frac{{M}_{h}}{{m}_{h}}{\displaystyle \sum _{{u}_{hi}\in {U}_{Ih}}}{\left({N}_{hi}\right)}^{2}\frac{1-{f}_{hi}^{A}}{{n}_{hi}^{A}}\left\{\frac{{N}_{hi}^{B}\left({\overline{y}}_{{u}_{hi}^{B}}\right)\left({N}_{hi}{\overline{y}}_{{u}_{hi}}-{N}_{hi}^{B}{\overline{y}}_{{u}_{hi}^{B}}\right)}{{N}_{hi}\left({N}_{hi}-1\right)}\right\}\end{array}$$

with ${\overline{y}}_{{u}_{hi}}={\left({N}_{hi}\right)}^{-1}{\displaystyle {\sum}_{k\in {u}_{hi}}}{y}_{k}$, ${\overline{y}}_{{u}_{hi}^{B}}={\left({N}_{hi}^{B}\right)}^{-1}{\displaystyle {\sum}_{k\in {u}_{hi}^{B}}}{y}_{k}$ and ${S}_{{u}_{hi}^{B}}^{2}={\left({N}_{hi}^{B}-1\right)}^{-1}{\displaystyle {\sum}_{k\in {u}_{hi}^{B}}}{\left({y}_{k}-{\overline{y}}_{{u}_{hi}^{B}}\right)}^{2}$.
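One way to read the two terms in braces in (A.1) is as the finite-population variance of the variable equal to $y$ inside ${u}_{hi}^{B}$ and $0$ elsewhere, and minus its covariance with the complementary variable (equal to $y$ outside ${u}_{hi}^{B}$ and $0$ inside). The sketch below checks that reading numerically for a single primary sampling unit. The function names and the toy data are assumptions made for illustration only.

```python
import math
from statistics import mean, variance


def brace_terms(y, in_B):
    """The two terms in braces of (A.1) for one primary sampling unit.

    y    -- y-values of all N_hi units of the PSU
    in_B -- booleans, True when the unit also belongs to frame U_B
    Requires at least two units in B so that S^2 is defined.
    """
    N = len(y)
    y_B = [v for v, b in zip(y, in_B) if b]
    N_B = len(y_B)
    ybar, ybar_B = mean(y), mean(y_B)
    S2_B = variance(y_B)  # divisor N_B - 1, as in the definition above
    var_term = ((N_B - 1) / (N - 1) * S2_B
                + N_B * (N - N_B) * ybar_B ** 2 / (N * (N - 1)))
    cov_term = N_B * ybar_B * (N * ybar - N_B * ybar_B) / (N * (N - 1))
    return var_term, cov_term


def direct_terms(y, in_B):
    """Same two quantities computed directly from the domain variables."""
    N = len(y)
    z_ab = [v if b else 0.0 for v, b in zip(y, in_B)]  # y on B, 0 elsewhere
    z_a = [0.0 if b else v for v, b in zip(y, in_B)]   # y outside B, 0 on B
    m_a, m_ab = mean(z_a), mean(z_ab)
    cov = sum((p - m_a) * (q - m_ab) for p, q in zip(z_a, z_ab)) / (N - 1)
    return variance(z_ab), -cov


# Toy PSU (made-up values): six units, four of them covered by frame B.
y = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
in_B = [True, True, False, True, False, True]
formula = brace_terms(y, in_B)
direct = direct_terms(y, in_B)
print(formula, direct)
```

The two pairs agree up to floating-point error, which supports the reading above for this toy case; it is a check of the algebra, not of the simulation results.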

The Horvitz-Thompson estimator based on the single sample ${S}^{A}$ and Hartley’s optimal estimator coincide if the coefficient ${\theta}_{opt|{S}_{I}}$ is equal to $1$, which is the case if $EV\left({\widehat{Y}}_{ab}^{A}|{S}_{I}\right)=-ECov\left({\widehat{Y}}_{a}^{A}\mathrm{,}{\widehat{Y}}_{ab}^{A}|{S}_{I}\right)$. This condition holds in particular if, in (A.1), the terms in braces are equal for each primary sampling unit ${u}_{hi}$. We therefore get ${\theta}_{opt|{S}_{I}}\simeq 1$ if

$$\forall \, {u}_{hi}\in {U}_{I}\quad \frac{{N}_{hi}\left({N}_{hi}^{B}-1\right)}{{N}_{hi}^{B}}\frac{{S}_{{u}_{hi}^{B}}^{2}}{{\overline{y}}_{{u}_{hi}^{B}}\left({N}_{hi}{\overline{y}}_{{u}_{hi}}-{N}_{hi}^{B}{\overline{y}}_{{u}_{hi}^{B}}\right)}+\frac{\left({N}_{hi}-{N}_{hi}^{B}\right){\overline{y}}_{{u}_{hi}^{B}}}{{N}_{hi}{\overline{y}}_{{u}_{hi}}-{N}_{hi}^{B}{\overline{y}}_{{u}_{hi}^{B}}}\simeq 1.\qquad \text{(A.2)}$$

Suppose that the mean value of $y$ is approximately the same in the frames ${U}_{A}$ and ${U}_{B}$ within each primary sampling unit, i.e. that ${\overline{y}}_{{u}_{hi}^{B}}\simeq {\overline{y}}_{{u}_{hi}}$ for all ${u}_{hi}\in {U}_{I}$. Then condition (A.2) holds approximately if, for all ${u}_{hi}\in {U}_{I}$, $c{v}_{{u}_{hi}^{B}}^{2}$ is close to $0$, where $c{v}_{{u}_{hi}^{B}}=\sqrt{{S}_{{u}_{hi}^{B}}^{2}}/{\overline{y}}_{{u}_{hi}^{B}}.$
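To make this step explicit, here is a sketch of the substitution, taking ${\overline{y}}_{{u}_{hi}^{B}}={\overline{y}}_{{u}_{hi}}$ exactly (an idealization of the approximate equality) and assuming ${N}_{hi}^{B}<{N}_{hi}$. The common denominator becomes $\left({N}_{hi}-{N}_{hi}^{B}\right){\overline{y}}_{{u}_{hi}^{B}}$, so the second term on the left-hand side of (A.2) equals $1$ and the first reduces to a multiple of the squared coefficient of variation:

$$\frac{{N}_{hi}\left({N}_{hi}^{B}-1\right)}{{N}_{hi}^{B}}\frac{{S}_{{u}_{hi}^{B}}^{2}}{{\overline{y}}_{{u}_{hi}^{B}}\left({N}_{hi}-{N}_{hi}^{B}\right){\overline{y}}_{{u}_{hi}^{B}}}+\frac{\left({N}_{hi}-{N}_{hi}^{B}\right){\overline{y}}_{{u}_{hi}^{B}}}{\left({N}_{hi}-{N}_{hi}^{B}\right){\overline{y}}_{{u}_{hi}^{B}}}=\frac{{N}_{hi}\left({N}_{hi}^{B}-1\right)}{{N}_{hi}^{B}\left({N}_{hi}-{N}_{hi}^{B}\right)}\,c{v}_{{u}_{hi}^{B}}^{2}+1.$$

Under this idealization, the left-hand side of (A.2) exceeds $1$ by a positive multiple of $c{v}_{{u}_{hi}^{B}}^{2}$. The multiplier is roughly ${N}_{hi}/\left({N}_{hi}-{N}_{hi}^{B}\right)$ when ${N}_{hi}^{B}$ is large, so the approximation would also be expected to weaken when ${u}_{hi}^{B}$ covers almost all of ${u}_{hi}$.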

In summary, the Horvitz-Thompson estimator based on the single sample ${S}^{A}$ and Hartley’s optimal estimator will be close if, within each primary sampling unit ${u}_{hi}$: (a) the mean value of $y$ differs little between the two frames, and (b) the variable $y$ has low dispersion within ${u}_{hi}^{B}$. In the simulations, condition (a) is approximately met since the allocation of individuals between the sampling frames ${U}_{A}$ and ${U}_{B}$ is completely random; condition (b) is approximately met, with values of $c{v}_{{u}_{hi}^{B}}^{2}$ varying from $0.02$ to $0.10$ for population 1, and from $0.001$ to $0.005$ for population 2.
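To get a feel for how far the left-hand side of (A.2) moves from $1$ over the reported range of $c{v}_{{u}_{hi}^{B}}^{2}$, the following sketch evaluates it for a single hypothetical primary sampling unit. The unit sizes (100 units, 50 of them in ${U}_{B}$) and the common mean of $1.0$ are assumptions chosen for illustration, not values from the simulation study, and the function name `a2_lhs` is likewise hypothetical.

```python
import math


def a2_lhs(N, N_B, ybar, ybar_B, S2_B):
    """Left-hand side of (A.2) for one primary sampling unit."""
    total_outside_B = N * ybar - N_B * ybar_B  # y-total of units not in frame B
    first = N * (N_B - 1) / N_B * S2_B / (ybar_B * total_outside_B)
    second = (N - N_B) * ybar_B / total_outside_B
    return first + second


# Hypothetical PSU: sizes and mean are assumptions, not from the paper.
N, N_B, ybar = 100, 50, 1.0
for cv2 in (0.001, 0.005, 0.02, 0.10):
    S2_B = cv2 * ybar ** 2  # equal means, so S^2 = cv^2 * ybar^2
    print(cv2, round(a2_lhs(N, N_B, ybar, ybar, S2_B), 4))
```

With equal means the value is $1$ plus a term proportional to $c{v}_{{u}_{hi}^{B}}^{2}$, so the smaller coefficients of variation give values closer to $1$ in this toy configuration. Other unit sizes would change the proportionality factor.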

# References

Bankier, M.D. (1986). Estimators based on several stratified samples with applications to multiple frame surveys. *Journal of the American Statistical Association*, 81, p. 1074-1079.

Bourdalle, G., Christine, M. and Wilms, L. (2000). Échantillons maître et emploi. *Série INSEE Méthodes*, 21, p. 139-173.

Hansen, M.H. and Hurwitz, W.N. (1943). On the theory of sampling from finite populations. *Annals of Mathematical Statistics*, 14, p. 333-362.

Hartley, H.O. (1962). Multiple frame surveys. *Proceedings of the Social Statistics Section*, American Statistical Association, p. 203-206.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. *Journal of the American Statistical Association*, 47, p. 663-685.

Kalton, G. and Anderson, D.W. (1986). Sampling rare populations. *Journal of the Royal Statistical Society*, A, 149, p. 65-82.

Lavallée, P. (2002). *Le sondage indirect, ou la méthode généralisée du partage des poids*. Éditions de l'Université de Bruxelles (Belgium) and Éditions Ellipses (France).

Lavallée, P. (2007). *Indirect sampling*. New York: Springer.

Lohr, S.L. (2007). Recent developments in multiple frame surveys. *Proceedings of the Survey Research Methods Section*, American Statistical Association, p. 3257-3264.

Lohr, S.L. (2009). Multiple frame surveys. In *Handbook of Statistics, Sample Surveys: Design, Methods and Applications*, Eds., D. Pfeffermann and C.R. Rao. Amsterdam: North Holland, Vol. 29A, p. 71-88.

Lohr, S.L. (2011). Alternative survey sample designs: Sampling with multiple overlapping frames. *Survey Methodology*, Vol. 37, No. 2, p. 197-213.

Mecatti, F. (2007). A single frame multiplicity estimator for multiple frame surveys. *Survey Methodology*, Vol. 33, No. 2, p. 151-157.

Narain, R.D. (1951). On sampling without replacement with varying probabilities. *Journal of the Indian Society of Agricultural Statistics*, 3, p. 169-175.

Rao, J.N.K. and Wu, C. (2010). Pseudo-empirical likelihood inference for dual frame surveys. *Journal of the American Statistical Association*, 105, p. 1494-1503.

Saigo, H. (2010). Comparing four bootstrap methods for stratified three-stage sampling. *Journal of Official Statistics*, Vol. 26, No. 1, p. 193-207.
