# Register-based sampling for household panels 3. Inclusion weights

## 3.1 Weighting with inclusion expectations

For design-based inference, first and second order inclusion probabilities for households and persons are required. Let $M$ denote the number of households in the population, $N$ the number of persons in the population aged 15 years or over and ${g}_{k}$ the number of persons aged 15 years or over that belong to the ${k}^{\text{th}}$ household. With the sample design described in Section 2, household $k$ can be included more than once, up to a maximum of ${g}_{k}$ times. This complicates the derivation of inclusion probabilities, since the probability of selecting household $k$ is equal to the selection probability of the union of its household members $\left(k,j\right)$ aged 15 years and over. This probability is defined as:

$\begin{array}{ll}P\left(k\in s\right)=P\left(\underset{j=1}{\overset{{g}_{k}}{\cup }}\left[\left(k,j\right)\in s\right]\right)\hfill & =\sum _{j=1}^{{g}_{k}}P\left(\left(k,j\right)\in s\right)\hfill \\ \hfill & -\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }=j+1}^{{g}_{k}}P\left(\left[\left(k,j\right)\cap \left(k,{j}^{\prime }\right)\right]\in s\right)\hfill \\ \hfill & +\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }=j+1}^{{g}_{k}}\sum _{{j}^{\prime \prime }={j}^{\prime }+1}^{{g}_{k}}P\left(\left[\left(k,j\right)\cap \left(k,{j}^{\prime }\right)\cap \left(k,{j}^{\prime \prime }\right)\right]\in s\right)-\dots \hfill \end{array}$

This kind of computation can be avoided by using the concept of inclusion expectations instead of inclusion probabilities. Bethlehem (2009), Chapter 2, generalizes the HT estimator to the concept of inclusion expectation for sampling with replacement. Let ${a}_{k}$ denote the number of times that household $k$ is selected in the sample. In the proposed sample design ${a}_{k}\in \left\{0,1,\dots ,{g}_{k}\right\}.$ Let $\text{E}\left(.\right)$ denote the expectation with respect to the sample design. Now ${\pi }_{k}=\text{E}\left({a}_{k}\right)$ denotes the inclusion expectation of sampling unit $k.$ Since ${a}_{k}$ can be larger than one, ${\pi }_{k}$ can also take values larger than one and can therefore no longer be interpreted as an inclusion probability. It can, however, be interpreted as an expectation.
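To give a sense of the computation that inclusion expectations avoid, the sketch below evaluates the inclusion-exclusion sum for one household. It assumes, for illustration only, that persons are drawn by simple random sampling without replacement from a single stratum. Under that assumption every set of $r$ household members has the same joint selection probability, so each multiple sum collapses to a binomial coefficient times one probability. The result is checked against the complement rule. The function names and the numbers are hypothetical.

```python
from math import comb


def prob_all_selected(r, n_h, N_h):
    """P(r specific persons are all in a sample of n_h drawn without
    replacement from N_h persons)."""
    if r > n_h:
        return 0.0
    return comb(N_h - r, n_h - r) / comb(N_h, n_h)


def household_prob_inclusion_exclusion(g_k, n_h, N_h):
    """Alternating sum over all non-empty subsets of the g_k eligible members.
    By symmetry there are comb(g_k, r) subsets of size r, all equally likely."""
    return sum(
        (-1) ** (r + 1) * comb(g_k, r) * prob_all_selected(r, n_h, N_h)
        for r in range(1, g_k + 1)
    )


def household_prob_complement(g_k, n_h, N_h):
    """1 - P(none of the g_k eligible members is selected)."""
    return 1 - comb(N_h - g_k, n_h) / comb(N_h, n_h)


# Hypothetical household with 3 eligible members; 2 persons drawn out of 10.
p_ie = household_prob_inclusion_exclusion(3, 2, 10)
p_c = household_prob_complement(3, 2, 10)
print(p_ie, p_c)
```

Both routes give the same inclusion probability, but the number of terms in the alternating sum grows with household size, and joint probabilities are harder to obtain once household members can belong to strata with different sampling fractions.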

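The parameter of interest is the population total of a target variable $y.$ With ${y}_{kj}$ the value of $y$ for person $j$ of household $k,$ ${N}_{k}$ the number of persons in household $k$ and ${y}_{k}$ the household total, it is defined as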

${t}_{y}=\sum _{k=1}^{M}\sum _{j=1}^{{N}_{k}}{y}_{kj}\equiv \sum _{k=1}^{M}{y}_{k}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.1\right)$

The HT estimator for the population total in (3.1) can be defined as

${\stackrel{^}{t}}_{y}=\sum _{k=1}^{M}\frac{{a}_{k}{y}_{k}}{{\pi }_{k}}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.2\right)$

Since $\text{E}\left({a}_{k}\right)={\pi }_{k},$ it follows that this HT estimator is design unbiased. Let ${\pi }_{k{k}^{\prime }}$ denote the inclusion expectation of units $k$ and ${k}^{\prime },$ i.e., ${\pi }_{k{k}^{\prime }}=\text{E}\left({a}_{k}{a}_{{k}^{\prime }}\right).$ The variance of the HT estimator is by definition equal to

$\begin{array}{ll}\text{V}\left({\stackrel{^}{t}}_{y}\right)\hfill & =\sum _{k=1}^{M}\sum _{{k}^{\prime }=1}^{M}\text{Cov}\left({a}_{k},{a}_{{k}^{\prime }}\right)\frac{{y}_{k}}{{\pi }_{k}}\frac{{y}_{{k}^{\prime }}}{{\pi }_{{k}^{\prime }}}\hfill \\ \hfill & =\sum _{k=1}^{M}\sum _{{k}^{\prime }=1}^{M}\left[\text{E}\left({a}_{k}{a}_{{k}^{\prime }}\right)-\text{E}\left({a}_{k}\right)\text{E}\left({a}_{{k}^{\prime }}\right)\right]\frac{{y}_{k}}{{\pi }_{k}}\frac{{y}_{{k}^{\prime }}}{{\pi }_{{k}^{\prime }}}\hfill \\ \hfill & =\sum _{k=1}^{M}\sum _{{k}^{\prime }=1}^{M}\left({\pi }_{k{k}^{\prime }}-{\pi }_{k}{\pi }_{{k}^{\prime }}\right)\frac{{y}_{k}}{{\pi }_{k}}\frac{{y}_{{k}^{\prime }}}{{\pi }_{{k}^{\prime }}}.\hfill \end{array}$

Note that in the case of sampling without replacement ${a}_{k}$ is a dummy taking values zero or one indicating whether unit $k$ is selected in the sample. In this case ${\pi }_{k}$ and ${\pi }_{k{k}^{\prime }}$ are the usual first and second order inclusion probabilities. This illustrates that the standard HT estimator, based on inclusion probabilities, can be extended easily to inclusion expectations. In the case of sample designs where units can be selected more than once, it is more convenient to work with inclusion expectations, since they are derived relatively easily. In the remainder of this subsection, first and second order inclusion expectations for the sample design described in Section 2 are derived.

Core persons are drawn by means of stratified simple random sampling. Since stratification is based on geographical regions, all members of a household $k$ belong to the same stratum $h$ at the moment of drawing core persons. Let ${N}_{h}$ denote the number of persons in the population of stratum $h$ aged 15 years or over, ${n}_{h}$ the number of core persons selected in the sample from stratum $h$ and ${g}_{k}$ the number of persons aged 15 years or over, belonging to household $k.$ Finally, ${a}_{jk}$ denotes an indicator that is equal to one if person $j$ from household $k$ is selected in the sample and zero otherwise. The first order inclusion expectation of the ${k}^{\text{th}}$ household equals

${\pi }_{kh}=\text{E}\left({a}_{k}\right)=E\left(\sum _{j=1}^{{g}_{k}}{a}_{jk}\right)=\sum _{j=1}^{{g}_{k}}E\left({a}_{jk}\right)={g}_{k}\frac{{n}_{h}}{{N}_{h}}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.3\right)$
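The first order result can be checked numerically on a toy example. The sketch below is illustrative only: it assumes a single stratum with four hypothetical households, draws core persons by simple random sampling without replacement, and enumerates every possible sample. It compares the exact value of $\text{E}\left({a}_{k}\right)$ with (3.3) and checks that the HT estimator (3.2) is design unbiased. All data values and variable names are made up for the example.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical stratum: (g_k, y_k) = eligible persons and household total of y.
households = [(1, 4), (2, 7), (3, 5), (2, 6)]
n_h = 3  # number of core persons drawn

# One entry per person aged 15 or over, holding the index of that person's household.
persons = [k for k, (g_k, _) in enumerate(households) for _ in range(g_k)]
N_h = len(persons)

# Inclusion expectations according to (3.3).
pi = [Fraction(g_k * n_h, N_h) for g_k, _ in households]
t_y = sum(y_k for _, y_k in households)

# Enumerate all equally likely samples of n_h persons.
samples = list(combinations(range(N_h), n_h))
expected_a = [Fraction(0)] * len(households)
expected_ht = Fraction(0)
for s in samples:
    a = [0] * len(households)
    for person in s:
        a[persons[person]] += 1  # count core persons per household
    ht = sum(Fraction(a_k * y_k) / pi_k
             for a_k, (_, y_k), pi_k in zip(a, households, pi))
    expected_ht += ht / len(samples)
    expected_a = [e + Fraction(a_k, len(samples)) for e, a_k in zip(expected_a, a)]

print(expected_a == pi, expected_ht == t_y)
```

In this example the three-person household has ${\pi }_{k}=9/8,$ which illustrates that an inclusion expectation can exceed one.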

Second order inclusion expectations for households $k$ and ${k}^{\prime }$ for $k\ne {k}^{\prime }$ belonging to the same stratum $h,$ equal

${\pi }_{k{k}^{\prime }}=\text{E}\left({a}_{k}{a}_{{k}^{\prime }}\right)=E\left(\sum _{j=1}^{{g}_{k}}{a}_{jk}\sum _{{j}^{\prime }=1}^{{g}_{{k}^{\prime }}}{a}_{{j}^{\prime }{k}^{\prime }}\right)=\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }=1}^{{g}_{{k}^{\prime }}}E\left({a}_{jk}{a}_{{j}^{\prime }{k}^{\prime }}\right)={g}_{k}{g}_{{k}^{\prime }}\frac{{n}_{h}\left({n}_{h}-1\right)}{{N}_{h}\left({N}_{h}-1\right)}.\text{ }\text{ }\text{ }\text{ }\left(3.4\right)$

The second order inclusion expectation for household $k={k}^{\prime }$ from the same stratum $h,$ is given by

$\begin{array}{ll}{\pi }_{kk}\hfill & =\text{E}\left({a}_{k}{a}_{k}\right)=E\left(\sum _{j=1}^{{g}_{k}}{a}_{jk}\sum _{{j}^{\prime }=1}^{{g}_{k}}{a}_{{j}^{\prime }k}\right)=E\left(\sum _{j=1}^{{g}_{k}}{a}_{jk}+\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }\ne j=1}^{{g}_{k}}{a}_{jk}{a}_{{j}^{\prime }k}\right)\hfill \\ \hfill & =\sum _{j=1}^{{g}_{k}}E\left({a}_{jk}\right)+\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }\ne j=1}^{{g}_{k}}E\left({a}_{jk}{a}_{{j}^{\prime }k}\right)={g}_{k}\frac{{n}_{h}}{{N}_{h}}+{g}_{k}\left({g}_{k}-1\right)\frac{{n}_{h}\left({n}_{h}-1\right)}{{N}_{h}\left({N}_{h}-1\right)}.\hfill \end{array}\text{ }\text{ }\text{ }\text{ }\left(3.5\right)$

Second order inclusion expectations for households $k$ and ${k}^{\prime }$ for $k\ne {k}^{\prime }$ belonging to two different strata $h$ and ${h}^{\prime }$ equal

${\pi }_{k{k}^{\prime }}=\text{E}\left({a}_{k}{a}_{{k}^{\prime }}\right)=\text{E}\left(\sum _{j=1}^{{g}_{k}}{a}_{jk}\sum _{{j}^{\prime }=1}^{{g}_{{k}^{\prime }}}{a}_{{j}^{\prime }{k}^{\prime }}\right)=\sum _{j=1}^{{g}_{k}}\sum _{{j}^{\prime }=1}^{{g}_{{k}^{\prime }}}E\left({a}_{jk}{a}_{{j}^{\prime }{k}^{\prime }}\right)={g}_{kh}{g}_{{k}^{\prime }{h}^{\prime }}\frac{{n}_{h}{n}_{{h}^{\prime }}}{{N}_{h}{N}_{{h}^{\prime }}}.\text{ }\text{ }\text{ }\text{ }\left(3.6\right)$

An alternative proof based on the definition of an expected value, which does not use the rule that the expected value of a sum of mutually dependent variables equals the sum of the expected values of these variables, is given by van den Brakel (2013).
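As a numerical cross-check of (3.4) through (3.6) and of the variance expression, the following sketch enumerates every stratified sample for a small hypothetical population with two strata. It compares the exact values of $\text{E}\left({a}_{k}{a}_{{k}^{\prime }}\right)$ and of the variance of the HT estimator with the closed-form expressions. The population, the stratum labels and the function names are assumptions made for the example, and household composition is taken as fixed at the moment of sampling.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical population: (stratum, g_k, y_k) per household.
households = [("A", 1, 4), ("A", 2, 7), ("A", 3, 5), ("B", 2, 6), ("B", 2, 9)]
n = {"A": 2, "B": 1}  # core persons drawn per stratum
M = len(households)

# Persons aged 15+ per stratum, each represented by the index of their household.
persons = {h: [k for k, (hh, g, _) in enumerate(households) if hh == h
               for _ in range(g)] for h in n}
N = {h: len(p) for h, p in persons.items()}


def pi1(k):
    h, g, _ = households[k]
    return Fraction(g * n[h], N[h])                       # (3.3)


def pi2(k, k2):
    (h, g, _), (h2, g2, _) = households[k], households[k2]
    if h != h2:
        return pi1(k) * pi1(k2)                           # (3.6)
    f2 = Fraction(n[h] * (n[h] - 1), N[h] * (N[h] - 1))
    if k != k2:
        return g * g2 * f2                                # (3.4)
    return pi1(k) + g * (g - 1) * f2                      # (3.5)


# Enumerate every equally likely stratified sample and record the counts a_k.
all_a = []
for picks in product(*(combinations(persons[h], n[h]) for h in n)):
    a = [0] * M
    for stratum_pick in picks:
        for k in stratum_pick:
            a[k] += 1
    all_a.append(a)
S = len(all_a)

exact_pi2 = [[Fraction(sum(a[k] * a[k2] for a in all_a), S)
              for k2 in range(M)] for k in range(M)]
formula_pi2 = [[pi2(k, k2) for k2 in range(M)] for k in range(M)]

y = [y_k for _, _, y_k in households]
ht = [sum(Fraction(a[k] * y[k]) / pi1(k) for k in range(M)) for a in all_a]
mean_ht = sum(ht) / S
exact_var = sum((t - mean_ht) ** 2 for t in ht) / S
formula_var = sum((pi2(k, k2) - pi1(k) * pi1(k2)) * y[k] / pi1(k) * y[k2] / pi1(k2)
                  for k in range(M) for k2 in range(M))

print(exact_pi2 == formula_pi2, exact_var == formula_var)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison does not depend on floating-point tolerances.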

As time proceeds the household composition of the core persons changes, which affects the inclusion expectations of the households in the sample. If sampling fractions differ between strata, the inclusion expectations (3.3) through (3.6) become more complicated and require information on the stratum membership of all persons belonging to the household of the core persons. This complication is avoided by choosing a self-weighted sampling design. In this case each household member of a core person has the same inclusion probability and the only household-specific information required to derive household inclusion expectations is the number of persons aged 15 years and over in the household of the core person.

Since all members of a selected household are included in the sample, it follows that the first order inclusion expectations for persons belonging to household $k$ are equal to the first order inclusion expectation of household $k$ defined in (3.3). The second order inclusion expectations for persons from two different households $k$ and ${k}^{\prime },$ are equal to (3.4) for two households from the same stratum or (3.6) for two households from two different strata. The second order inclusion expectations for persons from the same household are defined by (3.5).

During the review the question was raised whether the inclusion expectations themselves have a variance that should be taken into account in the variance of HT or GREG estimators when they are based on inclusion expectations instead of inclusion probabilities. In the finite population each person and each household has a pre-specified inclusion expectation. For the households observed in the sample these expectations can be calculated exactly, without uncertainty, since all information required to evaluate their true values is available. Replacing inclusion probabilities by inclusion expectations therefore does not result in an additional variance component.

## 3.2 Generalized Weight Share method

The sample design described in Section 2 can be considered as a special case of indirect sampling (Lavallée 2007). Indirect sampling refers to the situation where the population of interest is sampled through the use of a frame that refers to a different population. Lavallée (1995) develops the Generalized Weight Share method to construct weights for these situations. This method can be used to derive design weights for households and persons in the sample design described in Section 2.

Following the notation of Lavallée (1995) for the case of indirect sampling, there is a population ${U}^{A}$ of size ${N}^{A}$ from which a sample ${s}^{A}$ of size $n$ is drawn with selection probabilities ${\pi }_{i}^{A}.$ In addition, there is the target population ${U}^{B}$ of size ${N}^{B}.$ This population can be divided into ${M}^{B}$ clusters. Each cluster $k$ contains ${N}_{k}^{B}$ units, such that ${N}^{B}={\sum }_{k=1}^{{M}^{B}}{N}_{k}^{B}.$

The situation for the sample design described in Section 2 is depicted in Figure 3.1. The clusters are households, ${U}^{A}$ is the population of persons aged 15 years and over, and ${U}^{B}$ is the population of all persons residing in the Netherlands. Persons in ${U}^{A}$ and ${U}^{B}$ are depicted as circles, households in ${U}^{B}$ are depicted as shaded squares, and the circles within a shaded square visualise persons belonging to the same household. Figure 3.1 shows, in turn, a single-person household, a two-person household containing for example a divorced parent with a child younger than 15, a two-person household containing two adults without children, and a four-person household containing two parents and two children, one younger than 15 and the other 15 years or older. The arrows depict the links between the units of ${U}^{A}$ and ${U}^{B}.$ In the sample design considered in Section 2, each unit in ${U}^{A}$ has exactly one unique link with a unit in ${U}^{B}.$ Clusters in ${U}^{B}$ have at least one link with units in ${U}^{A}.$ Links are identified with an indicator variable

${l}_{kj}=\left\{\begin{array}{ll}1\hfill & \text{if unit }\left(k,j\right)\text{ in }{U}^{B}\text{ is linked to a unit in }{U}^{A},\hfill \\ 0\hfill & \text{otherwise.}\hfill \end{array}\right.$

If a unit $i$ in ${U}^{A}$ is selected in the sample, the entire cluster $k$ to which this unit belongs, is included in the sample. The parameter of interest is the population total in ${U}^{B}$ and is similar to (3.1) defined as ${t}_{y}={\sum }_{k=1}^{{M}^{B}}{\sum }_{j=1}^{{N}_{k}^{B}}{y}_{kj}.$ An estimator for ${t}_{y}$ is defined as

${\stackrel{^}{t}}_{y}={\sum }_{k=1}^{m}{\sum }_{j=1}^{{N}_{k}^{B}}{w}_{kj}{y}_{kj},\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.7\right)$

with $m$ the number of unique clusters (households) included in the sample and ${w}_{kj}$ the weight attached to each unit $j$ of cluster $k.$ Generally, the inverses of the selection probabilities of the units $\left(k,j\right)$ observed in the sample are used as weights in the HT estimator. In this situation, however, not all units in the sample have a known inclusion probability. First, not all units in ${U}^{B}$ have a link to ${U}^{A}.$ Second, household compositions change over time due to marriages, divorces, departures of children and cohabitation. As a result, units with a link to ${U}^{A}$ enter the clusters in the sample although they were not initially included in the sample drawn from ${U}^{A}.$ For these units inclusion probabilities are not necessarily known. They do, however, affect the inclusion expectations of the clusters included in the sample. Reconstructing the inclusion probabilities requires information on the selection probabilities of all units in the population at the moment that the sample is drawn. In many practical situations this information is not available.

Description of Figure 3.1

Figure representing the links between units from the sample frame ${U}^{A}$ and units from the target population ${U}^{B}.$ Persons in ${U}^{A}$ and ${U}^{B}$ are depicted as circles, households in ${U}^{B}$ are depicted as shaded squares, and the circles within a shaded square visualise persons belonging to the same household. Person 1 from ${U}^{A}$ is linked to person 1 from ${U}^{B},$ who is the only person in her shaded square (a single-person household). Person 2 from ${U}^{A}$ is linked to person 2 from ${U}^{B},$ who shares her shaded square with person 3 (a two-person household containing for example a divorced parent with a child younger than 15). Persons 3 and 4 from ${U}^{A}$ are linked to persons 4 and 5 from ${U}^{B},$ who share a shaded square (a two-person household containing two adults without children). Persons 5, 6 and 7 from ${U}^{A}$ are linked to persons 6, 7 and 9 from ${U}^{B},$ who share a shaded square (a four-person household containing two parents and two children, one younger than 15 and the other 15 years or older).

The Generalized Weight Share method can be used to derive non-zero weights for all units in the sample. This method starts by deriving initial weights, which are defined as

${w}_{kj}^{*}={l}_{kj}\frac{{\delta }_{i}^{A}}{{\pi }_{i}^{A}},$

where $i$ is the unit in ${U}^{A}$ linked to unit $\left(k,j\right),$ with ${\delta }_{i}^{A}$ an indicator variable that is equal to one if $i$ is included in the sample ${s}^{A}$ and zero otherwise. This expression follows directly from Lavallée (1995), equation (2) in combination with the fact that in this application each unit in ${U}^{A}$ has exactly one unique link with a unit in ${U}^{B},$ see Figure 3.1. In a second step a so-called basic weight for each cluster $k$ is derived as the mean of all initial weights within each cluster

${w}_{k}=\frac{{\sum }_{j=1}^{{N}_{k}^{B}}{w}_{kj}^{*}}{{\sum }_{j=1}^{{N}_{k}^{B}}{l}_{kj}},$

which follows from Lavallée (1995), equation (7). Finally, all persons $j$ that belong to the same household $k$ receive the weight assigned to their household, i.e., ${w}_{kj}={w}_{k}$ for all $j\in k.$ A proof that (3.7) with these basic weights is an unbiased estimator for the population total is also given by Lavallée (1995).

Let ${\sum }_{j=1}^{{N}_{k}^{B}}{l}_{kj}={g}_{k}$ denote the number of persons in household $k$ aged 15 years and older and ${a}_{k}$ the number of core persons in household $k,$ i.e., the number of persons in household $k$ that are included in sample ${s}^{A}.$ Since ${s}^{A}$ is drawn by means of stratified simple random sampling, it follows that ${\pi }_{i}^{A}={n}_{h}^{A}/{N}_{h}^{A}$ with ${N}_{h}^{A}$ the number of persons aged 15 years and older in the population of stratum $h,$ and ${n}_{h}^{A}$ the number of core persons selected in the sample from stratum $h.$ Then it follows that

${w}_{k}=\frac{{a}_{k}}{{g}_{k}}\frac{{N}_{h}^{A}}{{n}_{h}^{A}}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.8\right)$

Inserting the first order inclusion expectation (3.3) into (3.2) gives the same HT estimator as derived with the Generalized Weight Share method, i.e., inserting (3.8) into (3.7).
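The equivalence can be illustrated with a small sketch of the weight-share steps. It assumes that a linked person who is a core person receives an initial weight ${N}_{h}^{A}/{n}_{h}^{A}$ and every other person an initial weight of zero, and that all households come from one stratum. The dictionary keys (`linked`, `core`, `y`), the function names and the numbers are hypothetical.

```python
from fractions import Fraction


def gwsm_weight(members, n_h, N_h):
    """Basic weight of a household: sum of the initial weights divided by the
    number of linked members (persons aged 15 or over), as in (3.8)."""
    pi_A = Fraction(n_h, N_h)
    initial = [1 / pi_A if (m["linked"] and m["core"]) else Fraction(0)
               for m in members]
    g_k = sum(m["linked"] for m in members)
    return sum(initial) / g_k


def ht_weight(members, n_h, N_h):
    """a_k / pi_k, with pi_k = g_k * n_h / N_h taken from (3.3)."""
    a_k = sum(m["core"] for m in members)
    g_k = sum(m["linked"] for m in members)
    return a_k / Fraction(g_k * n_h, N_h)


# Two hypothetical sampled households from a stratum with n_h = 100, N_h = 10000.
household_1 = [  # two adults (one core person) and a child younger than 15
    {"linked": True, "core": True, "y": 3},
    {"linked": True, "core": False, "y": 2},
    {"linked": False, "core": False, "y": 1},
]
household_2 = [  # two adults, both drawn as core persons
    {"linked": True, "core": True, "y": 4},
    {"linked": True, "core": True, "y": 5},
]
sample = [household_1, household_2]

# Estimator (3.7): every member of a household carries the household weight.
t_gwsm = sum(gwsm_weight(hh, 100, 10000) * m["y"] for hh in sample for m in hh)
# Estimator (3.2): household totals weighted by a_k / pi_k.
t_ht = sum(ht_weight(hh, 100, 10000) * sum(m["y"] for m in hh) for hh in sample)
print(t_gwsm, t_ht)
```

The child in the first household has no link to ${U}^{A}$ but still receives the household weight, which is how the method assigns non-zero weights to all units in the sample.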

The derivation of the inclusion expectations in Subsection 3.1 applies to stratified sampling of households with inclusion expectations proportional to household size and is a special case of the Generalized Weight Share method. An argument to apply a design as outlined in Section 2 is that sampling households proportional to household size is efficient for target variables that are positively correlated with household size.

Lavallée (1995) also provides a variance expression for (3.7) based on the Generalized Weight Share method. This expression is based on the first and second order inclusion probabilities of the sample units drawn from ${U}^{A}$ and a transformation of the target variable. As a result, neither the property that clusters are drawn proportional to their size nor the fact that they are drawn partially with replacement is made explicit. In Section 6 it is pointed out that the variance expressions in Lavallée (1995) for this application are equal to the variance expressions based on the inclusion expectations derived in (3.3) through (3.6).
