# A note on the concept of invariance in two-phase sampling designs

## Section 1. Introduction

Two-phase sampling is often used in surveys when the sampling frame contains little or no auxiliary information. It consists of first selecting a large sample from the population (typically using a rudimentary sampling design) in order to collect data on variables that are inexpensive to obtain and that are related to the characteristics of interest. The idea is to create a pseudo-sampling frame that is richer in auxiliary information than the original sampling frame. Then, using the variables observed in the first phase, an efficient sampling procedure can be used to select a (typically small) subsample from the first-phase sample in order to collect the characteristics of interest. Two-phase sampling may also be helpful in the context of nonresponse, as the set of respondents is often viewed as a second-phase sample.

We adopt the following notation. Consider a population $U$ of size $N.$ A vector ${I}_{1}$ is generated according to the sampling design $F\left({I}_{1}\right)\mathrm{,}$ where ${I}_{1}\mathrm{=}{\left({I}_{11}\mathrm{,}\dots \mathrm{,}{I}_{1N}\right)}^{{\rm T}}$ denotes a vector of indicators such that each ${I}_{1i}$ is equal to either 0 or 1. The first-phase sample, denoted by ${s}_{1}\mathrm{,}$ is the set of population units for which ${I}_{1i}\mathrm{=1}\mathrm{,}$ and ${n}_{1}\mathrm{=}{\displaystyle {\sum}_{i\in U}}\text{\hspace{0.17em}}{I}_{1i}$ is the size of ${s}_{1}.$ Then, a vector ${I}_{2}$ is generated according to the sampling design $F\left({I}_{2}\text{\hspace{0.17em}}|\text{\hspace{0.17em}}{I}_{1}\right)\mathrm{,}$ where ${I}_{2}\mathrm{=}{\left({I}_{21}\mathrm{,}\dots \mathrm{,}{I}_{2N}\right)}^{{\rm T}}$ denotes a vector of indicators such that each ${I}_{2i}$ is equal to either 0 or 1. The second-phase sample, denoted by ${s}_{2}\mathrm{,}$ is the set of population units for which both ${I}_{1i}\mathrm{=1}$ and ${I}_{2i}\mathrm{=1}\mathrm{,}$ and ${n}_{2}\mathrm{=}{\displaystyle {\sum}_{i\in U}}\text{\hspace{0.17em}}{I}_{1i}{I}_{2i}$ is the size of ${s}_{2}.$ Note that, in practice, the indicators ${I}_{2i}$ are not generated for the population units belonging to the set $U-{s}_{1}.$ However, at least conceptually, nothing precludes defining these indicators for the units outside the first-phase sample.
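The notation above can be illustrated with a minimal simulation sketch. The specific designs are assumptions made only for illustration and are not taken from the text: the first phase uses Bernoulli sampling with a common probability `p1`, and the second phase draws a simple random sample without replacement of at most `n2` units from ${s}_{1}.$ The function name `two_phase_sample` and its parameters are hypothetical.

```python
import random

def two_phase_sample(N, p1, n2, seed=None):
    """Generate indicator vectors I1 and I2 for a toy two-phase design.

    Assumed designs (illustrative only):
      phase 1: Bernoulli sampling with probability p1 for every unit;
      phase 2: simple random sampling without replacement of
               min(n2, n1) units from the first-phase sample s1.
    """
    rng = random.Random(seed)

    # Phase 1: draw I1 according to F(I1).
    I1 = [1 if rng.random() < p1 else 0 for _ in range(N)]
    s1 = [i for i in range(N) if I1[i] == 1]

    # Phase 2: draw I2 according to F(I2 | I1).
    chosen = set(rng.sample(s1, min(n2, len(s1))))

    # Outside s1 the indicators are not generated in practice;
    # they are set to 0 here purely as a placeholder.
    I2 = [1 if i in chosen else 0 for i in range(N)]

    # s2 contains the units with I1i = 1 and I2i = 1.
    s2 = [i for i in range(N) if I1[i] * I2[i] == 1]
    return I1, I2, s1, s2

I1, I2, s1, s2 = two_phase_sample(N=20, p1=0.5, n2=3, seed=1)
n1 = sum(I1)                                      # size of s1
n2_realized = sum(a * b for a, b in zip(I1, I2))  # size of s2
```

Here `n1` is random because the assumed first-phase design is Bernoulli, while the realized second-phase size equals `min(n2, n1)`.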

Let ${\pi}_{1i}\mathrm{=}P\left({I}_{1i}\mathrm{=1}\right)$ and ${\pi}_{1ij}\mathrm{=}P\left({I}_{1i}\mathrm{=1,}{I}_{1j}\mathrm{=1}\right)$ be the first-order and second-order selection probabilities at the first phase. Similarly, let ${\pi}_{2i}\left({I}_{1}\right)\mathrm{=}P\left({I}_{2i}\mathrm{=1}\text{\hspace{0.17em}}|\text{\hspace{0.17em}}{I}_{1}\right)$ and ${\pi}_{2ij}\left({I}_{1}\right)\mathrm{=}P\left({I}_{2i}\mathrm{=1,}{I}_{2j}\mathrm{=1}\text{\hspace{0.17em}}|\text{\hspace{0.17em}}{I}_{1}\right)$ be the first-order and second-order selection probabilities at the second phase. Note that the (first-order and second-order) selection probabilities at the second phase may depend on the realized sample ${s}_{1}.$
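To see how the second-phase selection probabilities can depend on the realized ${s}_{1},$ consider the following sketch. It assumes, purely for illustration, that the second phase is simple random sampling without replacement of a fixed number `n2` of units from ${s}_{1},$ so that for distinct units $i$ and $j$ in ${s}_{1}$ the probabilities are ${n}_{2}/{n}_{1}$ and ${n}_{2}\left({n}_{2}-1\right)/\left\{{n}_{1}\left({n}_{1}-1\right)\right\}.$ The function name `second_phase_probs` is hypothetical.

```python
from fractions import Fraction

def second_phase_probs(I1, n2):
    """Return (pi_2i(I1), pi_2ij(I1)) for distinct units i, j in s1.

    Assumed second-phase design (illustrative only): simple random
    sampling without replacement of n2 units from s1.
    """
    n1 = sum(I1)  # realized first-phase sample size
    if n1 < 2 or not 1 <= n2 <= n1:
        raise ValueError("need n1 >= 2 and 1 <= n2 <= n1")
    pi_2i = Fraction(n2, n1)
    pi_2ij = Fraction(n2 * (n2 - 1), n1 * (n1 - 1))
    return pi_2i, pi_2ij

# Two realized first-phase samples of different sizes (n1 = 4 and n1 = 5).
print(second_phase_probs([1, 1, 0, 1, 1, 0], n2=2))  # probabilities 1/2 and 1/6
print(second_phase_probs([1, 1, 1, 1, 1, 0], n2=2))  # probabilities 2/5 and 1/10
```

Under this assumed design, the probabilities change with ${I}_{1}$ only through ${n}_{1};$ if the first-phase design had a fixed size, they would be the same for every realized ${s}_{1}.$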

The paper is organized as follows. In Section 2, we define the concepts of weak and strong invariance and provide some examples. In Section 3, we discuss the implications of weak and strong invariance from an inferential point of view. In particular, we discuss the reverse decomposition of the variance in the case of a strongly invariant two-phase sampling design.