# Register-based sampling for household panels 1. Introduction

Statistics Netherlands conducts two important sample surveys to describe the income and wealth situation of the Dutch population. First, the Dutch Regional Income Survey (RIS) describes income and wealth accurately at a very detailed regional level. It publishes yearly income distributions for persons and households at the level of neighbourhoods, using a large sample and a small set of the main income components that are derived in a relatively straightforward manner from the tax administration. Second, the Income Panel Survey (IPS) publishes yearly income and wealth characteristics of the Dutch population at a more aggregated regional level. This survey is based on a large set of variables covering all income components of households that can be derived from the administrative data available in the Netherlands. Deriving the variables for this survey is more time-consuming, so its sample size is considerably smaller than that of the RIS. Both surveys are designed as household panels in which both person-based and household-based variables on income and wealth are observed.

Households are often considered as the sampling units in panels conducted to collect information at the level of households and persons (Lynn 2009; Smith, Lynn and Elliot 2009). Such panels are used for longitudinal analysis as well as the production of cross-sectional estimates. Using households as the sampling units in a panel design has, however, some major disadvantages due to their instability over time. As time proceeds, households might disintegrate, join or split, new members might enter the households and other members might leave the households for different reasons. Kalton and Brick (1995) explain that these changes can affect the selection probabilities of the households in the sample. Reconstruction of the correct inclusion probabilities of the sampling units is essential to derive correct weights for analysis purposes, in particular if the panel is used for producing cross-sectional estimates.

Consider a panel where households are selected by means of simple random sampling, say at time $t=0$. In many panels, people who enter a sampled household at a later stage are also included in the panel. These individuals are called cohabitants by Lavallée (1995). As time proceeds, more and more cohabitants are included in the sample and disturb the equal probability design that was used to select the initial sample (Kalton and Brick 1995). Consider for example household A, which is selected in the sample when the panel starts at $t=0$. If after some period of time this household merges with another household B, which was not selected for the panel at time $t=0$, then the selection probability of this new household is the sum of the selection probabilities of households A and B at time $t=0$. Not correcting for differences in selection probabilities due to the gradually increasing share of cohabitants in the sample leads to biased inference. Ernst (1989) proposes the Weight Share method to overcome these problems. Lavallée (1995) extends this method to the Generalized Weight Share method as a solution for drawing inference about target populations that are sampled through a frame that refers to a different population.
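The merger of households A and B can be made concrete with a minimal sketch of a weight-share computation. It assumes the common formulation in which a household at time $t$ receives the mean of the initial weights of those members who belonged to the population at $t=0$, with non-sampled persons contributing a weight of zero. The function name, the data layout and the numbers are hypothetical and are not taken from the RIS or the IPS.

```python
def weight_share(households, initial_weight):
    """Share initial design weights over the members of each current household.

    households:     dict mapping household id -> list of person ids at time t
    initial_weight: dict mapping person id -> design weight at t=0
                    (0.0 for persons who were in the population at t=0 but
                    not sampled; persons missing from the dict, such as
                    births or immigrants, are treated as not being part of
                    the initial population)
    """
    shared = {}
    for hh, members in households.items():
        # Only members linked to the initial population count in the average.
        linked = [p for p in members if p in initial_weight]
        if not linked:
            shared[hh] = 0.0  # no link to the initial frame (illustrative choice)
            continue
        shared[hh] = sum(initial_weight[p] for p in linked) / len(linked)
    return shared


# Hypothetical example: household A (a1, a2) was sampled with weight 10,
# household B (b1) was not sampled; the two have merged into household "AB".
initial_weight = {"a1": 10.0, "a2": 10.0, "b1": 0.0}
households = {"AB": ["a1", "a2", "b1"]}
weights = weight_share(households, initial_weight)
```

In this sketch the merged household receives the weight (10 + 10 + 0) / 3. Only the initial weights of sampled persons are needed, not the selection probability of the non-sampled household B.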

The RIS and the IPS are both based on a panel and are conducted to collect information about households and persons. To avoid the problems with panels that use households as sampling units, an alternative design is applied. Instead of households, so-called core persons are drawn with an equal probability design and followed over time. All members of the household to which a core person belongs in a particular period are included in the sample. This results in a sample design where households are drawn proportionally to the household size and households can be selected more than once, but with a maximum that is equal to the household size. This design is an application of indirect sampling (Lavallée 1995, 2007; Deville and Lavallée 2006).
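The core-person design can be illustrated with a small sketch. It assumes, purely for illustration, that core persons are drawn by simple random sampling without replacement from a person register; in that case a household of size $m$ is selected on average $n m / N$ times when $n$ core persons are drawn from $N$ persons. The register, the function names and the sample size below are hypothetical.

```python
import random
from collections import Counter


def draw_core_persons(household_of, n, seed=0):
    """Draw n core persons with equal probability, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(household_of), n)


def selection_counts(core_persons, household_of):
    """Count how often each household is selected through its core persons."""
    return Counter(household_of[p] for p in core_persons)


def expected_selections(household_of, n):
    """Expected number of selections per household: n * size / N."""
    sizes = Counter(household_of.values())
    total_persons = len(household_of)
    return {hh: n * m / total_persons for hh, m in sizes.items()}


# Hypothetical register: person id -> household id.
household_of = {
    "p1": "H1", "p2": "H1", "p3": "H1",
    "p4": "H2",
    "p5": "H3", "p6": "H3",
}

core = draw_core_persons(household_of, n=3)
counts = selection_counts(core, household_of)
expected = expected_selections(household_of, n=3)
```

A household can appear in `counts` more than once, but never more often than it has members, and the expected counts are proportional to household size.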

The purpose of this paper is to describe a sample design with an estimation technique that is useful for panels that collect information at person and household level. The methodology employed in this paper is particularly useful for register-based sampling, since the core persons are included in the sample indefinitely. The sample design is also useful for Web panels, but might require some form of rotating design to avoid problems with panel attrition. In a rotating design, sampling units enter the panel, are observed multiple times and leave the panel according to a pre-specified pattern (Smith et al. 2009). The main contribution of this paper to the existing literature is that explicit expressions for the variance of the target parameters are derived using inclusion expectations instead of inclusion probabilities under the aforementioned sample design. A measure of the minimum accuracy for an estimated income distribution is proposed and explicit expressions for the minimum sample size are derived. The RIS is used throughout the paper to illustrate the described sampling techniques.

The paper is organized as follows. A description of the sample design of the RIS is given in Section 2. In Section 3 the concept of inclusion expectations is introduced as a convenient practical alternative for inclusion probabilities. Subsequently, first and second order inclusion expectations are derived for the proposed sampling design. These inclusion expectations are required to construct the $\pi$-estimator or Horvitz-Thompson (HT) estimator (Narain 1951; Horvitz and Thompson 1952). It is also shown that the same weights can be derived as a special case of the Generalized Weight Share method for indirect sampling (Lavallée 1995, 2007). The key target variables for the RIS are estimated income distributions. In Section 4 formulas for the minimum required sample size are derived based on a precision measure for estimated income distributions. Since households can be selected more than once, an expression for the expected number of unique households is derived in Section 4. The estimation procedure of the RIS is based on linear weighting using the general regression (GREG) estimator (Särndal, Swensson and Wretman 1992) and is described in Section 5. The integrated weighting method of Lemaître and Dufour (1987), Nieuwenbroek (1993) and Steel and Clark (2007) is applied to obtain equal weights for persons belonging to the same household. In Section 6 variance approximations for the GREG estimator under the proposed sample design are derived. An application to the RIS is provided in Section 7. The paper concludes with a discussion in Section 8.
