# A note on the concept of invariance in two-phase sampling designs. Section 3: Implications of the invariance property

## 3.1 Weak invariance

For an arbitrary two-phase sampling design, the inclusion probability of unit $i$, $\pi_i$, $i \in s_1$, is generally unknown and is defined as

$$\begin{aligned}
\pi_i &= \text{E}\left(I_{1i} I_{2i}\right) \\
&= \text{E}\left\{I_{1i}\,\text{E}\left(I_{2i} \mid I_1\right)\right\} \\
&= \sum_{i_1 :\, i_{1i} = 1} \pi_{2i}\left(I_1\right) P\left(I_1 = i_1\right),
\end{aligned} \tag{3.1}$$

where $i_1$ denotes a realisation of the random vector $I_1$. The $\pi_i$'s are therefore generally unknown because they require knowledge not only of $P\left(I_1 = i_1\right)$ for every possible realisation $i_1$ (which is available in many cases) but also of $\pi_{2i}\left(I_1\right)$ for every $I_1$. The latter are generally unknown because $\pi_{2i}\left(I_1\right)$ may depend on the outcome of phase 1. However, if the sampling design is weakly invariant, then $\pi_{2i}\left(I_1\right) = \pi_{2i}$ and (3.1) reduces to

$$\pi_i = \pi_{2i} \sum_{i_1 :\, i_{1i} = 1} P\left(I_1 = i_1\right) = \pi_{1i}\pi_{2i}. \tag{3.2}$$
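The reduction from (3.1) to (3.2) can be checked numerically on a toy population. The sketch below is only an illustration under stated assumptions: phase 1 is simple random sampling without replacement of `n1` units out of `N`, the second-phase probabilities `fixed_pi2` and the size measure `x` are hypothetical values, and the helper names are not from the paper. It evaluates the sum in (3.1) exactly, once for fixed second-phase probabilities (weakly invariant) and once for probabilities that depend on the realised first-phase sample.

```python
from fractions import Fraction
from itertools import combinations


def overall_inclusion_probs(N, n1, pi2_given_s1):
    """Evaluate (3.1) exactly for every unit.

    Assumes phase 1 is SRSWOR of size n1 from units 0..N-1, so every
    phase-1 sample has the same probability. pi2_given_s1(i, s1) returns
    the second-phase inclusion probability of unit i given the sample s1.
    """
    samples = list(combinations(range(N), n1))
    p_s1 = Fraction(1, len(samples))
    pi = [Fraction(0)] * N
    for s1 in samples:
        for i in s1:  # only samples with i_{1i} = 1 contribute to pi_i
            pi[i] += pi2_given_s1(i, s1) * p_s1
    return pi


N, n1 = 5, 3
pi1 = [Fraction(n1, N)] * N  # first-phase inclusion probabilities under SRSWOR

# Weakly invariant case: pi_2i is fixed before phase 1 (hypothetical values).
fixed_pi2 = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4),
             Fraction(1, 2), Fraction(2, 3)]
pi_invariant = overall_inclusion_probs(N, n1, lambda i, s1: fixed_pi2[i])
product_invariant = [pi1[i] * fixed_pi2[i] for i in range(N)]

# Non-invariant case: pi_2i(s1) is proportional to a size measure within s1.
x = [1, 2, 3, 4, 5]  # hypothetical size measure


def size_dependent_pi2(i, s1):
    return Fraction(x[i], sum(x[j] for j in s1))


pi_dependent = overall_inclusion_probs(N, n1, size_dependent_pi2)

# The conditional probability of unit 0 changes with the phase-1 sample,
# so no single pi_20 exists and the factorisation (3.2) is not available.
values_for_unit0 = {size_dependent_pi2(0, s1)
                    for s1 in combinations(range(N), n1) if 0 in s1}

print(pi_invariant == product_invariant)
print(sorted(values_for_unit0))
```

In the non-invariant case the exact $\pi_i$ can be computed here only because the toy population is small enough to enumerate every first-phase sample; in practice that enumeration is rarely feasible.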

Suppose that we are interested in estimating the population total $t_y = \sum_{i \in U} y_i$. Since the $\pi_i$'s are generally unknown, the Horvitz-Thompson estimator of $t_y$,

$$\widehat{t}_{HT} = \sum_{i \in s_2} \pi_i^{-1} y_i,$$

cannot be used, in general. Instead, it is common practice to use the double expansion estimator

$$\widehat{t}_{DE} = \sum_{i \in s_2} \pi_{1i}^{-1}\, \pi_{2i}\left(I_1\right)^{-1} y_i.$$

In general, $\widehat{t}_{HT}$ and $\widehat{t}_{DE}$ differ. However, for weakly invariant two-phase designs, it is clear from (3.2) that the two estimators are identical.
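To make the comparison concrete, the following sketch computes both estimators on one realised sample from a toy population. Everything in it is an illustrative assumption rather than part of the paper: the values of `y`, the size measure `x`, the SRSWOR first phase, the realised samples `s1` and `s2`, and the function names. Case A uses fixed second-phase probabilities (weakly invariant), and case B lets them depend on the first-phase sample.

```python
from fractions import Fraction
from itertools import combinations


def double_expansion(y, pi1, pi2_s1, s2):
    """t_DE: weight each unit of s2 by 1 / (pi_1i * pi_2i(I_1)).

    pi2_s1[i] is the second-phase probability given the realised I_1.
    """
    return sum(y[i] / (pi1[i] * pi2_s1[i]) for i in s2)


def horvitz_thompson(y, pi, s2):
    """t_HT: weight each unit of s2 by 1 / pi_i (overall probability)."""
    return sum(y[i] / pi[i] for i in s2)


N, n1 = 5, 3
y = [10, 20, 30, 40, 50]      # hypothetical study variable
x = [1, 2, 3, 4, 5]           # hypothetical size measure
pi1 = [Fraction(n1, N)] * N   # SRSWOR at phase 1 (assumption)
s1, s2 = (0, 1, 2), [2]       # one realised pair of samples

# Case A: weakly invariant, pi_2i fixed in advance, so pi_i follows (3.2).
fixed_pi2 = [Fraction(1, 2)] * N
pi_a = [pi1[i] * fixed_pi2[i] for i in range(N)]
t_de_a = double_expansion(y, pi1, fixed_pi2, s2)
t_ht_a = horvitz_thompson(y, pi_a, s2)


# Case B: not invariant, pi_2i(s1) = x_i / sum of x over s1.
def pi2_b(i, s):
    return Fraction(x[i], sum(x[j] for j in s))


# Overall pi_i from (3.1), by enumerating all equally likely phase-1 samples.
all_s1 = list(combinations(range(N), n1))
pi_b = [sum(pi2_b(i, s) for s in all_s1 if i in s) / len(all_s1)
        for i in range(N)]
t_de_b = double_expansion(y, pi1, {i: pi2_b(i, s1) for i in s1}, s2)
t_ht_b = horvitz_thompson(y, pi_b, s2)

print(t_de_a == t_ht_a)   # the two estimators coincide in case A
print(t_de_b, t_ht_b)     # they differ in case B
```

Note that `t_ht_b` is computable here only because every first-phase sample can be listed; the double expansion estimator needs nothing beyond the realised sample.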

## 3.2 Strong invariance

Let $\theta $ be a finite population parameter and $\widehat{\theta}$ be an estimator of $\theta .$ The total variance of $\widehat{\theta}$ can be expressed as

$$V\left(\widehat{\theta}\right) = VE\left(\widehat{\theta} \mid I_1\right) + EV\left(\widehat{\theta} \mid I_1\right). \tag{3.3}$$

Decomposition (3.3) is often called the two-phase decomposition of the variance; e.g., Särndal et al. (1992). If the two-phase sampling design is strongly invariant, the total variance of $\widehat{\theta}$ can alternatively be decomposed as

$$V\left(\widehat{\theta}\right) = EV\left(\widehat{\theta} \mid I_2\right) + VE\left(\widehat{\theta} \mid I_2\right). \tag{3.4}$$

Decomposition (3.4) is often called the reverse decomposition of the variance because the order of sampling is reversed, which can be justified only if the two-phase design is strongly invariant. Decomposition (3.4) cannot be used for a two-phase design that is only weakly invariant, as the vector $I_2$ cannot be generated prior to the vector $I_1$. The reverse decomposition was studied in the context of nonresponse by Fay (1991), Shao and Steel (1999) and Kim and Rao (2009), among others. In a nonresponse context, assuming that the units respond independently of one another, the set of respondents can be viewed as a second-phase sample selected according to Poisson sampling with unknown inclusion probabilities, called response probabilities. If the latter remain the same from one realisation of the sample to another, we are essentially in the presence of a strongly invariant two-phase sampling design. Decomposition (3.4) can be used to justify simplified variance estimators for two-phase sampling designs; see Beaumont, Béliveau and Haziza (2015).
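Both decompositions can be verified by exact enumeration on a small example. The sketch below assumes a strongly invariant design of a particular kind: SRSWOR at phase 1 and Poisson sampling at phase 2, with fixed probabilities `p` defined for every unit of the population, so that $I_2$ can be generated over the whole population independently of $I_1$. The population values, the probabilities, and the helper names are hypothetical, and the estimator is the double expansion estimator of a total. As an algebraic identity, each decomposition is the law of total variance; strong invariance is what makes conditioning on a population-level $I_2$ meaningful.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations, product

N, n1 = 4, 2
y = [3, 5, 8, 13]                                  # hypothetical values
p = [Fraction(1, 2), Fraction(1, 3),
     Fraction(3, 4), Fraction(2, 5)]               # fixed phase-2 probabilities
pi1 = Fraction(n1, N)                              # SRSWOR at phase 1

# Joint distribution of (I_1, I_2): I_2 is drawn for ALL units of U,
# independently of I_1, which is what strong invariance allows here.
phase1 = list(combinations(range(N), n1))
outcomes = []                                      # (s1, i2, prob, estimate)
for s1 in phase1:
    for i2 in product((0, 1), repeat=N):
        prob = Fraction(1, len(phase1))
        for i in range(N):
            prob *= p[i] if i2[i] else 1 - p[i]
        # Double expansion estimator: units in s1 that are also in s2.
        est = sum(y[i] / (pi1 * p[i]) for i in s1 if i2[i])
        outcomes.append((s1, i2, prob, est))


def mean_var(pairs):
    """Total weight, mean and variance of (weight, value) pairs."""
    total = sum(w for w, _ in pairs)
    mean = sum(w * v for w, v in pairs) / total
    var = sum(w * (v - mean) ** 2 for w, v in pairs) / total
    return total, mean, var


def decompose(key_index):
    """Return (V of conditional mean, E of conditional variance)."""
    groups = defaultdict(list)
    for o in outcomes:
        groups[o[key_index]].append((o[2], o[3]))
    cond = [mean_var(g) for g in groups.values()]  # (P(key), E(.|key), V(.|key))
    _, _, v_of_e = mean_var([(w, m) for w, m, _ in cond])
    e_of_v = sum(w * v for w, _, v in cond)
    return v_of_e, e_of_v


_, mean_total, var_total = mean_var([(o[2], o[3]) for o in outcomes])
ve1, ev1 = decompose(0)   # condition on I_1, as in (3.3)
ve2, ev2 = decompose(1)   # condition on I_2, as in (3.4)

print(var_total == ve1 + ev1, var_total == ve2 + ev2)
```

The two decompositions split the same total variance differently, which is why the reverse form can lead to different, and sometimes simpler, variance estimators.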

## Acknowledgements

The authors are grateful to an Associate Editor and a reviewer for their comments and suggestions, which improved the quality of this paper. David Haziza’s research was funded by a grant from the Natural Sciences and Engineering Research Council of Canada.

## References

Beaumont, J.-F., Béliveau, A. and Haziza, D. (2015). Clarifying some aspects of variance estimation in two-phase sampling. *Journal of Survey Statistics and Methodology*, 3, 524-542.

Fay, R.E. (1991). A design-based perspective on missing data variance. *Proceedings of the 1991 Annual Research Conference*, US Bureau of the Census, 429-440.

Kim, J.K., and Rao, J.N.K. (2009). A unified approach to linearization variance estimation from survey data after imputation for item nonresponse. *Biometrika*, 96, 917-932.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). *Model Assisted Survey Sampling*. Springer-Verlag, New York.

Shao, J., and Steel, P. (1999). Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. *Journal of the American Statistical Association*, 94, 254-265.