# Register-based sampling for household panels 4. Sample size determination

The purpose of the RIS is to publish income distributions for households and persons at different geographical levels. Income distributions for households for region or area $r$  are defined as

${P}_{lr}=\frac{{M}_{lr}}{{M}_{+r}},\text{ }\text{ }l=1,\dots ,L,\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.1\right)$

where ${M}_{lr}$ denotes the number of households from region $r,$ belonging to the ${l}^{\text{th}}$ income category, and ${M}_{+r}={\sum }_{l}{M}_{lr},$ the total number of households in area $r.$ This income distribution is estimated as

${\stackrel{^}{P}}_{lr}=\frac{{\stackrel{^}{M}}_{lr}}{{M}_{+r}},\text{ }\text{ }l=1,\dots ,L,\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.2\right)$

where ${\stackrel{^}{M}}_{lr}$ denotes an appropriate direct estimator for the total number of households from area $r,$ classified to the ${l}^{\text{th}}$ income category. For the moment the HT estimator is assumed as an appropriate estimator for ${M}_{lr},$ i.e.,

${\stackrel{^}{M}}_{lr}=\sum _{h\in r}\sum _{k=1}^{{m}_{h}}\frac{{y}_{khl}}{{\pi }_{k}},$

where ${y}_{khl}=1$ if household $k$ from stratum $h$ is classified to the ${l}^{\text{th}}$ income class and ${y}_{khl}=0$ otherwise, and ${m}_{h}$ is the total number of households selected in stratum $h.$ In the RIS, $L=10.$ Income distributions for persons are defined and estimated analogously to (4.1) and (4.2), with ${M}_{lr}$ the number of persons from area $r$ belonging to the ${l}^{\text{th}}$ income category. The HT estimator for ${M}_{lr}$ is now defined as

${\stackrel{^}{M}}_{lr}=\sum _{h\in r}\sum _{k=1}^{{m}_{h}}\frac{1}{{\pi }_{k}}\sum _{j=1}^{{N}_{k}}{y}_{kjhl},$

where ${y}_{kjhl}=1$ if person $j$ from household $k$ and stratum $h$ is classified to the ${l}^{\text{th}}$ income class and ${y}_{kjhl}=0$ otherwise.
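
The household version of the estimator in (4.2) can be illustrated with a minimal sketch. The function name `ht_income_distribution` and the input layout (one pair of income class and inclusion probability per sampled household) are assumptions made for this illustration only; they do not come from the RIS production system.

```python
def ht_income_distribution(sample, total_households, n_classes):
    """Estimate P_l = M_hat_l / M_+ for one area, as in (4.2).

    sample: iterable of (income_class, inclusion_prob) pairs, one per
            sampled household, with income_class in 1..n_classes
            (hypothetical layout, chosen for this sketch).
    total_households: the known register total M_+ for the area.
    """
    m_hat = [0.0] * n_classes
    for income_class, pi_k in sample:
        # Each sampled household contributes y_khl / pi_k to its own class.
        m_hat[income_class - 1] += 1.0 / pi_k
    return [m / total_households for m in m_hat]


# Toy data: three sampled households in an area with 8 households.
toy_sample = [(1, 0.5), (1, 0.5), (2, 0.25)]
p_hat = ht_income_distribution(toy_sample, total_households=8, n_classes=2)
```

For persons, the same loop would add the number of household members in each income class, divided by the household inclusion probability, instead of a single indicator.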

For sample size determination, precision specifications for the estimated income distributions are required. For stratified sampling designs, Neyman allocations are often considered to determine minimum sample sizes and optimal allocations to meet precision requirements at aggregated levels (Cochran 1977). Power allocations are useful to find the right balance between precision requirements for aggregates and strata (Bankier 1988). In this application the minimum sample size is based on precision requirements for the individual strata, i.e., neighbourhoods, which is the most detailed publication level.

If precision requirements are specified for the separate classes of the income distributions, then the income class with the largest population variance determines the minimum required sample size, resulting in unnecessarily large sample sizes. As an alternative the square root of the mean over the variances of the estimated income classes of an income distribution is proposed as a precision measure for the estimated income distributions. With this measure the influence of the most imprecise income class on the minimum sample size will be reduced. The square root of the mean over the variances of the estimated income classes of an income distribution is called the average standard error measure and is defined as

$s=\sqrt{\frac{1}{L}\sum _{l=1}^{L}V\left({\stackrel{^}{P}}_{lr}\right)}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.3\right)$

In this section an exact expression for $s$ is derived, as well as an approximation that can be used to estimate the minimum required sample size without information about income distributions or variances.
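
As a small numerical illustration of (4.3), the sketch below computes the average standard error measure from a list of class variances. The function name and the variance values are invented for this example; they only show how one imprecise class is damped by the averaging.

```python
import math


def average_standard_error(class_variances):
    """Square root of the mean of the class variances, as in (4.3)."""
    return math.sqrt(sum(class_variances) / len(class_variances))


# Hypothetical variances for L = 10 classes: nine precise, one imprecise.
variances = [0.0001] * 9 + [0.0004]
s = average_standard_error(variances)       # about 0.0114
worst_class_se = math.sqrt(max(variances))  # about 0.02
```

In this toy case a requirement on the worst class would be driven by a standard error of about 0.02, whereas the average measure is about 0.011.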

Since neighbourhoods are the most detailed areas for which income distributions are published, precision requirements for sample size determination are specified at this level. Since neighbourhoods are used as the stratification variable in the sample design, expressions for $s$ can be derived under simple random sampling without replacement of core persons within each neighbourhood. It is proved in the appendix that an expression for the average standard error measure ${s}_{h}$ in (4.3) for an income distribution is given by

${s}_{h}=\sqrt{\frac{1}{L}\frac{{N}_{h}-{n}_{h}}{{n}_{h}}\frac{1}{{N}_{h}-1}\left(\frac{{N}_{h}}{{M}_{h}^{2}}\sum _{l=1}^{L}\sum _{k=1}^{{M}_{h}}\frac{{y}_{khl}}{{g}_{kh}}-{\sum _{l=1}^{L}\left(\frac{{M}_{lh}}{{M}_{h}}\right)}^{2}\right)},\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.4\right)$

with ${M}_{h}$ the number of households in stratum $h$ and ${M}_{lh}$ the number of households in stratum $h$ belonging to the ${l}^{\text{th}}$ income class. Note that if ${g}_{kh}=1$ for all households in the population of stratum $h,$ then it follows that ${M}_{h}={N}_{h}$ and that the variance of each estimated class fraction entering (4.4) simplifies to

$V\left({\stackrel{^}{P}}_{lh}\right)=\frac{{N}_{h}-{n}_{h}}{{n}_{h}}\frac{1}{{N}_{h}-1}\left({P}_{lh}\left(1-{P}_{lh}\right)\right),$

which can be recognized as the variance of an estimated fraction under simple random sampling without replacement (Cochran 1977, Chapter 3).
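
The expression in (4.4) can be evaluated directly when the full stratum population is known. The sketch below is an illustration under two assumptions that are not stated in this section: ${g}_{kh}$ is taken to be the number of persons in household $k,$ so that ${N}_{h}$ is the sum of the ${g}_{kh},$ and the population is passed as a plain list of pairs. The function name `exact_s_h` is hypothetical.

```python
import math
from collections import Counter


def exact_s_h(households, n_h, n_classes):
    """Average standard error measure of (4.4) for one stratum.

    households: list of (g_kh, income_class) for every household in the
                stratum population; g_kh is ASSUMED to be household size.
    n_h: number of core persons drawn by SRS without replacement.
    """
    M_h = len(households)
    N_h = sum(g for g, _ in households)
    # Every household is in exactly one class, so the double sum over
    # l and k of y_khl / g_kh reduces to the sum over k of 1 / g_kh.
    inv_size_sum = sum(1.0 / g for g, _ in households)
    counts = Counter(c for _, c in households)
    sum_sq = sum((counts.get(l, 0) / M_h) ** 2
                 for l in range(1, n_classes + 1))
    bracket = N_h / M_h ** 2 * inv_size_sum - sum_sq
    fpc = (N_h - n_h) / n_h / (N_h - 1)
    return math.sqrt(fpc * bracket / n_classes)


# Ten single-person households, split evenly over L = 2 classes.
population = [(1, 1)] * 5 + [(1, 2)] * 5
s_h = exact_s_h(population, n_h=5, n_classes=2)
```

With single-person households the result agrees with the familiar variance of an estimated fraction: here each class has variance $(5/5)(1/9)(0.25)=1/36,$ so ${s}_{h}=1/6.$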

Minimum sample sizes based on (4.4) require information about the income distribution and its variance from preceding periods. Since this information is generally not available at the design phase of a panel, it is useful to have an upper bound for the average standard error measure for the income distribution in (4.4). This is comparable to the common practice of calculating the minimum sample size for a survey from the variance of an estimated proportion at its maximum, which is reached when the proportion is 0.5. It is shown in the appendix that an upper bound for the average standard error measure ${s}_{h}$ for an income distribution, specified in (4.4), is given by

${s}_{h}\le \sqrt{\frac{1}{L}\frac{{N}_{h}-{n}_{h}}{{n}_{h}}\frac{1}{{N}_{h}-1}\left(\frac{{N}_{h}}{{M}_{h}^{2}}\sum _{t=1}^{T}\frac{{M}_{th}}{t}-\frac{1}{L}\right)},\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.5\right)$

with ${M}_{th}$ the number of households of size $t$ in stratum $h.$

If ${g}_{kh}=1$ for all households in the population of stratum $h$ and the number of classes of the income distribution $L=2,$ then it follows that the approximation for the average standard error measure ${s}_{h}$ in (4.5) can be simplified to

${s}_{h}\le \sqrt{\frac{{N}_{h}-{n}_{h}}{{n}_{h}}\frac{1}{\left({N}_{h}-1\right)}\frac{1}{4}},$

which equals the square root of the maximum variance of an estimated fraction at $\stackrel{^}{P}=0.5$ under simple random sampling. This illustrates that the approximation for the average standard error measure in (4.5) can be interpreted as a generalization of the approximation of the maximum variance of an estimated fraction at $\stackrel{^}{P}=0.5,$ often used in sample size determination. The average standard error measure has its maximum value in the case of an equal distribution of the households over the income categories, i.e., ${\stackrel{^}{P}}_{lh}=1/L$ for $l=1,\dots ,L.$ In this situation the approximation for ${s}_{h}$ is exact, which follows directly from equation (4.4), since the sum of squared class fractions then equals $1/L.$
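
The upper bound in (4.5) needs only the household size distribution of the stratum. The following sketch evaluates it; the function name and the dictionary layout `{t: M_th}` are assumptions for this illustration, and ${N}_{h}$ is computed as ${\sum }_{t}t{M}_{th},$ which presumes that every household member counts as a person in the stratum.

```python
import math


def s_h_upper_bound(households_by_size, n_h, n_classes):
    """Upper bound (4.5) for the average standard error measure.

    households_by_size: dict {t: M_th}, number of households of size t
                        (hypothetical input layout).
    """
    M_h = sum(households_by_size.values())
    N_h = sum(t * m for t, m in households_by_size.items())
    inv_size_sum = sum(m / t for t, m in households_by_size.items())
    bracket = N_h / M_h ** 2 * inv_size_sum - 1.0 / n_classes
    fpc = (N_h - n_h) / n_h / (N_h - 1)
    return math.sqrt(fpc * bracket / n_classes)


# Special case: 100 single-person households and L = 2 classes.
bound = s_h_upper_bound({1: 100}, n_h=20, n_classes=2)
# Maximum variance of a fraction at 0.5 under SRS without replacement.
classic = math.sqrt((100 - 20) / 20 / (100 - 1) * 0.25)
```

In this special case the two quantities coincide, which mirrors the simplification for ${g}_{kh}=1$ and $L=2$ described above.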

Equating the expression for ${s}_{h}$ in (4.5) to a pre-specified maximum value, say ${\Delta }_{h},$ results in the following expression for the minimum sample size of core persons

${n}_{h}\ge \frac{{\left(\frac{{N}_{h}}{{M}_{h}}\right)}^{2}\sum _{t=1}^{T}\frac{{M}_{th}}{t}-\frac{{N}_{h}}{L}}{\left({N}_{h}-1\right)L{\Delta }_{h}^{2}+\frac{{N}_{h}}{{M}_{h}^{2}}\sum _{t=1}^{T}\frac{{M}_{th}}{t}-\frac{1}{L}}.\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.6\right)$
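
The step from (4.5) to (4.6) can be made explicit. Write ${A}_{h}=\frac{{N}_{h}}{{M}_{h}^{2}}\sum _{t=1}^{T}\frac{{M}_{th}}{t},$ a shorthand introduced here for readability only. By the Cauchy-Schwarz inequality, $\left({\sum }_{t}t{M}_{th}\right)\left({\sum }_{t}{M}_{th}/t\right)\ge {M}_{h}^{2},$ so that ${A}_{h}\ge 1$ whenever ${N}_{h}={\sum }_{t}t{M}_{th},$ and hence ${A}_{h}-1/L>0$ for $L\ge 2.$ Requiring the right-hand side of (4.5) to be at most ${\Delta }_{h}$ and squaring gives

$\frac{{N}_{h}-{n}_{h}}{{n}_{h}}\le \frac{\left({N}_{h}-1\right)L{\Delta }_{h}^{2}}{{A}_{h}-\frac{1}{L}}\text{ }\text{ }\Longleftrightarrow \text{ }\text{ }\frac{{N}_{h}}{{n}_{h}}\le \frac{\left({N}_{h}-1\right)L{\Delta }_{h}^{2}+{A}_{h}-\frac{1}{L}}{{A}_{h}-\frac{1}{L}}.$

Inverting both sides yields

${n}_{h}\ge \frac{{N}_{h}\left({A}_{h}-\frac{1}{L}\right)}{\left({N}_{h}-1\right)L{\Delta }_{h}^{2}+{A}_{h}-\frac{1}{L}},$

which is (4.6), because ${N}_{h}{A}_{h}={\left(\frac{{N}_{h}}{{M}_{h}}\right)}^{2}\sum _{t=1}^{T}\frac{{M}_{th}}{t}.$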

The information required to estimate the minimum sample size is, for each neighbourhood, the total number of persons and the number of households of each size. No information about the expected income distribution or its variance is required. More precise estimates for the minimum sample size can be obtained with the expression in (4.4), but this requires sample information about the income distributions from, for example, previous periods.
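
A minimal sketch of the calculation in (4.6) is given below. The function name, the dictionary layout `{t: M_th}` and the rounding up to the next integer are choices made for this illustration; the toy strata are invented and do not correspond to real neighbourhoods.

```python
import math


def min_core_sample_size(households_by_size, n_classes, delta):
    """Smallest integer n_h satisfying (4.6) for one stratum.

    households_by_size: dict {t: M_th} (hypothetical input layout).
    delta: pre-specified maximum for the average standard error measure.
    """
    M_h = sum(households_by_size.values())
    N_h = sum(t * m for t, m in households_by_size.items())
    inv_size_sum = sum(m / t for t, m in households_by_size.items())
    numerator = (N_h / M_h) ** 2 * inv_size_sum - N_h / n_classes
    denominator = ((N_h - 1) * n_classes * delta ** 2
                   + N_h / M_h ** 2 * inv_size_sum - 1.0 / n_classes)
    # Round up, since n_h must be an integer at least as large as the bound.
    return math.ceil(numerator / denominator)


# 100 single-person households, L = 2, target delta = 0.05.
n_single = min_core_sample_size({1: 100}, n_classes=2, delta=0.05)
# 40 single-person and 30 two-person households, L = 10, delta = 0.01.
n_mixed = min_core_sample_size({1: 40, 2: 30}, n_classes=10, delta=0.01)
```

In very small strata the required size approaches the stratum size ${N}_{h},$ as the second toy example shows: 92 of the 100 persons would have to be drawn as core persons.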

Expression (4.6) gives the minimum sample size for core persons. Subsequently all household members of each core person are included in the sample. As a result, households can be included in the sample more than once and the sample size in terms of unique households and unique persons is random. To plan a survey and control survey costs, it is necessary to know the expected number of unique households and unique persons if a sample of core persons of size ${n}_{h}$ is drawn. In the appendix it is proved that the expected number of unique households in a sample of ${n}_{h}$ core persons, drawn by means of simple random sampling without replacement from a finite population of size ${N}_{h}$ is given by

${D}_{h}=\sum _{t=1}^{T}{M}_{th}\left(1-\frac{\prod _{i=0}^{t-1}\left({N}_{h}-{n}_{h}-i\right)}{\prod _{i=0}^{t-1}\left({N}_{h}-i\right)}\right).\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.7\right)$

The expected number of unique persons in a sample of ${n}_{h}$ core persons, drawn by means of simple random sampling without replacement from a finite population of size ${N}_{h}$ follows directly from (4.7) and is given by

${D}_{h}^{\left[p\right]}=\sum _{t=1}^{T}t{M}_{th}\left(1-\frac{\prod _{i=0}^{t-1}\left({N}_{h}-{n}_{h}-i\right)}{\prod _{i=0}^{t-1}\left({N}_{h}-i\right)}\right).\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.8\right)$
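
The expectations in (4.7) and (4.8) are straightforward to evaluate numerically. In the sketch below, the function names and the dictionary layout `{t: M_th}` are assumptions for illustration, and ${N}_{h}$ is again taken to be ${\sum }_{t}t{M}_{th}.$

```python
def prob_household_selected(t, N_h, n_h):
    """Probability that at least one of the t members of a household is
    among the n_h core persons: one minus the ratio of products in (4.7)."""
    miss = 1.0
    for i in range(t):
        miss *= (N_h - n_h - i) / (N_h - i)
    return 1.0 - miss


def expected_unique(households_by_size, n_h):
    """Return (D_h, D_h_persons): expected numbers of unique households
    (4.7) and unique persons (4.8) for a sample of n_h core persons."""
    N_h = sum(t * m for t, m in households_by_size.items())
    d_households = 0.0
    d_persons = 0.0
    for t, m_th in households_by_size.items():
        p = prob_household_selected(t, N_h, n_h)
        d_households += m_th * p
        d_persons += t * m_th * p
    return d_households, d_persons


# Two households of two persons each, one core person drawn.
d_h, d_p = expected_unique({2: 2}, n_h=1)
```

With a single core person exactly one household and its two members enter the sample, so the sketch returns 1 household and 2 persons, as expected.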

Since the numbers of unique households and persons in the sample are random variables, it would be useful to have an uncertainty measure to accompany the expected values in (4.7) and (4.8). Variance expressions for these quantities are, however, not straightforward and are therefore left for further research.

Sample size calculations are conducted at the level of neighbourhoods. It was finally decided to select core persons with a sampling fraction of 0.16. With this sample size, the maximum value for the average standard error measure ${s}_{h}$ at the level of neighbourhoods amounts to about 0.01 for the estimated household income distributions. With a total population of about 12 million persons, this resulted in a sample size of about 2.1 million core persons and an expected sample size of about 4.6 million unique persons. This sample was drawn in 1994, which was the start of the panel for the Dutch RIS.
