# 4. Simulation study

Guillaume Chauvet and Guylène Tandeau de Marsac

We use the artificial populations proposed by Saigo (2010). We generate two populations, each containing $M=200$ primary sampling units grouped in $H=4$ strata ${U}_{Ih}$ of size ${M}_{h}=50$. Each primary sampling unit ${u}_{hi}$ contains ${N}_{hi}=100$ secondary units. In each population, we generate for each primary sampling unit ${u}_{hi}\in {U}_{Ih}$:

$${\mu}_{hi}={\mu}_{h}+{\sigma}_{h}{v}_{hi}\qquad \text{(4.1)}$$

where the values ${\mu}_{h}$ and ${\sigma}_{h}$ are those used by Saigo (2010). The term ${\sigma}_{h}^{2}$ makes it possible to control dispersion between the primary sampling units. The ${v}_{hi}$ are iid, generated according to a standard normal distribution $N(0,1)$. For each unit $k\in {u}_{hi}$, we then generate the value ${y}_{k}$ according to the model

$${y}_{k}={\mu}_{hi}+{\left\{{\rho}^{-1}\left(1-\rho \right)\right\}}^{0.5}{\sigma}_{h}{v}_{k}\mathrm{,}\qquad \text{(4.2)}$$

where the ${v}_{k}$ are iid, generated according to a standard normal distribution. The variance term in model (4.2) yields an intra-cluster correlation coefficient approximately equal to $\rho $. In particular, the larger the coefficient $\rho $, the less the values ${y}_{k}$ are dispersed within the primary sampling units. We use $\rho =0.2$ for population 1 and $\rho =0.5$ for population 2, which reflects less dispersion of the variable $y$ in population 2.

The sampling frame ${U}_{A}$ corresponds to all secondary units, and the corresponding part of ${u}_{hi}$ is ${u}_{hi}^{A}={u}_{hi}$, of size ${N}_{hi}^{A}={N}_{hi}$. For each secondary unit $k$, a value ${u}_{k}$ is generated according to a uniform distribution over $\left[0,1\right]$. The sampling frame ${U}_{B}$ corresponds to the secondary units $k$ such that ${u}_{k}\le 0.5$, and the corresponding part of ${u}_{hi}$ is ${u}_{hi}^{B}={u}_{hi}\cap {U}_{B}$, of size ${N}_{hi}^{B}$. This therefore corresponds to the situation where $ab={U}_{B}$ and $b=\varnothing $, that is, the frame ${U}_{B}$ is entirely contained in the frame ${U}_{A}$.

The framework selected in the simulations is the one used in the INSEE household surveys, with an extension to target a specific sub-population. For these surveys, a sample ${S}_{I}$ of communes (or groups of communes) is selected in the first stage. A sub-sample ${S}_{i}^{A}$ of dwellings is then selected in each ${u}_{i}\in {S}_{I}$; the pooled sample ${S}^{A}={\displaystyle {\cup}_{{u}_{i}\in {S}_{I}}{S}_{i}^{A}}$ represents the entire population of dwellings ${U}_{A}=U$. A second sub-sample ${S}_{i}^{B}$ of dwellings is then selected from within a sub-population of each ${u}_{i}\in {S}_{I}$, in order to target a specific sub-population ${U}_{B}$ (for example, dwellings located in a Sensitive Urban Area); the pooled sample ${S}^{B}={\displaystyle {\cup}_{{u}_{i}\in {S}_{I}}}{S}_{i}^{B}$ represents only the targeted sub-population ${U}_{B}$.
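The generation of a population under models (4.1) and (4.2), together with the construction of the frame ${U}_{B}$, can be sketched as follows. This is a minimal illustration and not the authors' simulation code: the function name `generate_population`, the array layout and the seed are assumptions, and the parameter values are those of population 1.

```python
import numpy as np

def generate_population(mu, sigma, rho, M_h=50, N_hi=100, seed=0):
    """Generate one artificial population following models (4.1) and (4.2).

    Returns two arrays of shape (H, M_h, N_hi): the values y_k and a boolean
    flag that is True for the secondary units belonging to frame U_B.
    """
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    H = len(mu)

    # (4.1): mu_hi = mu_h + sigma_h * v_hi, with v_hi iid N(0, 1)
    v_hi = rng.standard_normal((H, M_h))
    mu_hi = mu[:, None] + sigma[:, None] * v_hi

    # (4.2): y_k = mu_hi + sqrt((1 - rho) / rho) * sigma_h * v_k, with v_k iid N(0, 1)
    v_k = rng.standard_normal((H, M_h, N_hi))
    scale = np.sqrt((1.0 - rho) / rho)
    y = mu_hi[:, :, None] + scale * sigma[:, None, None] * v_k

    # Frame U_B: secondary units k with u_k <= 0.5, u_k uniform on [0, 1]
    in_B = rng.uniform(size=(H, M_h, N_hi)) <= 0.5
    return y, in_B

# Population 1: rho = 0.2, with the mu_h and sigma_h values of Saigo (2010)
y, in_B = generate_population(mu=[200, 150, 120, 100],
                              sigma=[20, 15, 12, 10], rho=0.2)
```

Under this model the between-unit variance of ${\mu}_{hi}$ is ${\sigma}_{h}^{2}$ and the within-unit variance of ${y}_{k}$ is ${\sigma}_{h}^{2}\left(1-\rho \right)/\rho $, so the share of the between-unit variance in the total is ${\sigma}_{h}^{2}/\left\{{\sigma}_{h}^{2}+{\sigma}_{h}^{2}\left(1-\rho \right)/\rho \right\}=\rho $.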

In each of the two populations created, several sampling designs are considered; Table 4.1 presents, for each population, the eight possible combinations of sample sizes per stratum in the first and second stages, as well as the values ${\mu}_{h}$ and ${\sigma}_{h}$. In the first stage, we select independently in each stratum ${U}_{Ih}$ a sample ${S}_{Ih}$ of either ${m}_{h}=5$ or ${m}_{h}=25$ primary sampling units by simple random sampling. In the second stage, we select in each ${u}_{hi}\in {S}_{Ih}$ a sample ${S}_{hi}^{A}$ of size either ${n}_{hi}^{A}=10$ or ${n}_{hi}^{A}=40$ by simple random sampling in ${u}_{hi}^{A}$. In the second stage, we also select in each ${u}_{hi}\in {S}_{Ih}$ a sample ${S}_{hi}^{B}$ of size either ${n}_{hi}^{B}=5$ or ${n}_{hi}^{B}=20$ by simple random sampling in ${u}_{hi}^{B}$. We denote by ${f}_{hi}^{A}={\left({N}_{hi}^{A}\right)}^{-1}{n}_{hi}^{A}$ and ${f}_{hi}^{B}={\left({N}_{hi}^{B}\right)}^{-1}{n}_{hi}^{B}$ the sampling rates in ${u}_{hi}^{A}$ and ${u}_{hi}^{B}$.
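One replicate of the two-stage design described above can be sketched as follows. The function name `draw_two_stage_sample` and the array layout are hypothetical, and the frame indicator built at the end is only a stand-in for a generated population. The sketch assumes every primary sampling unit contains at least ${n}_{hi}^{B}$ units of ${U}_{B}$.

```python
import numpy as np

def draw_two_stage_sample(in_B, m_h, n_A, n_B, rng):
    """Draw S_I, then S^A and S^B in each selected PSU, by SRS without replacement.

    in_B: boolean array of shape (H, M_h, N_hi), True for units in frame U_B.
    Returns a list of (h, i, idx_A, idx_B) tuples, one per selected PSU.
    """
    H, M_h, N_hi = in_B.shape
    sample = []
    for h in range(H):
        # First stage: SRS of m_h primary sampling units in stratum h
        for i in rng.choice(M_h, size=m_h, replace=False):
            # Second stage, frame A: SRS of n_A units among all N_hi units
            idx_A = rng.choice(N_hi, size=n_A, replace=False)
            # Second stage, frame B: SRS of n_B units among the units of u_hi in U_B
            # (assumes at least n_B such units exist, otherwise choice() raises)
            units_B = np.flatnonzero(in_B[h, i])
            idx_B = rng.choice(units_B, size=n_B, replace=False)
            sample.append((h, int(i), idx_A, idx_B))
    return sample

# Stand-in frame indicator, for illustration only
rng = np.random.default_rng(2024)
in_B = rng.uniform(size=(4, 50, 100)) <= 0.5
sample = draw_two_stage_sample(in_B, m_h=5, n_A=10, n_B=5, rng=rng)
print(len(sample))  # 20 selected PSUs: 4 strata x 5 PSUs per stratum
```

The two second-stage samples are drawn independently within each selected unit, so a secondary unit of ${U}_{B}$ may appear in both ${S}_{hi}^{A}$ and ${S}_{hi}^{B}$.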

Population | Sample size ${m}_{h}$ | Sample size ${n}_{hi}^{A}$ | Sample size ${n}_{hi}^{B}$ | Stratum 1 ${\mu}_{h}$ | Stratum 1 ${\sigma}_{h}$ | Stratum 2 ${\mu}_{h}$ | Stratum 2 ${\sigma}_{h}$ | Stratum 3 ${\mu}_{h}$ | Stratum 3 ${\sigma}_{h}$ | Stratum 4 ${\mu}_{h}$ | Stratum 4 ${\sigma}_{h}$ |
---|---|---|---|---|---|---|---|---|---|---|---|
Population 1 | 5 or 25 | 10 or 40 | 5 or 20 | 200 | 20 | 150 | 15 | 120 | 12 | 100 | 10 |
Population 2 | 5 or 25 | 10 or 40 | 5 or 20 | 200 | 10 | 150 | 7.5 | 120 | 6 | 100 | 5 |

For each sample, Hartley’s estimator given in (3.4) is calculated either with $\theta =1/2$ (HART1) or with $\theta $ set to the estimator of the optimal coefficient given in (3.7) (HART2), with

$$\widehat{V}\left({\widehat{Y}}_{ab}^{A}\right)={\displaystyle \sum _{h=1}^{H}}{\left(\frac{{M}_{h}}{{m}_{h}}\right)}^{2}{\displaystyle \sum _{{u}_{hi}\in {S}_{Ih}}}{\left({N}_{hi}^{A}\right)}^{2}\frac{1-{f}_{hi}^{A}}{{n}_{hi}^{A}\left({n}_{hi}^{A}-1\right)}{\displaystyle \sum _{k\in {S}_{hi}^{A}}}{\left\{{y}_{k}1\left(k\in ab\right)-{\overline{y}}_{ab;{S}_{hi}^{A}}\right\}}^{2}\mathrm{,}$$

$$\widehat{V}\left({\widehat{Y}}_{ab}^{B}\right)={\displaystyle \sum _{h=1}^{H}}{\left(\frac{{M}_{h}}{{m}_{h}}\right)}^{2}{\displaystyle \sum _{{u}_{hi}\in {S}_{Ih}}}{\left({N}_{hi}^{B}\right)}^{2}\frac{1-{f}_{hi}^{B}}{{n}_{hi}^{B}\left({n}_{hi}^{B}-1\right)}{\displaystyle \sum _{k\in {S}_{hi}^{B}}}{\left\{{y}_{k}1\left(k\in ab\right)-{\overline{y}}_{ab;{S}_{hi}^{B}}\right\}}^{2}\mathrm{,}$$

$$\widehat{\mathrm{Cov}}\left({\widehat{Y}}_{a}^{A}\mathrm{,}{\widehat{Y}}_{ab}^{A}\right)={\displaystyle \sum _{h=1}^{H}}{\left(\frac{{M}_{h}}{{m}_{h}}\right)}^{2}{\displaystyle \sum _{{u}_{hi}\in {S}_{Ih}}}{\left({N}_{hi}^{A}\right)}^{2}\frac{1-{f}_{hi}^{A}}{{n}_{hi}^{A}\left({n}_{hi}^{A}-1\right)}{\displaystyle \sum _{k\in {S}_{hi}^{A}}}\left\{{y}_{k}1\left(k\in a\right)-{\overline{y}}_{a;{S}_{hi}^{A}}\right\}\left\{{y}_{k}1\left(k\in ab\right)-{\overline{y}}_{ab;{S}_{hi}^{A}}\right\}\mathrm{,}$$

where ${\overline{y}}_{d;V}$ denotes the mean of the variable ${y}_{k}1\left(k\in d\right)$ over a subset $V\subset U$. For each sample, the Kalton and Anderson estimator (KALT) given in (3.8) is also calculated, as well as the Bankier estimator (BANK) given in (3.9), and the Horvitz-Thompson estimator ${\widehat{Y}}^{A}$ based on the single sample ${S}^{A}$ (HTA). The sampling procedure is repeated 10,000 times. To measure the bias of an estimator $\widehat{Y}$, we calculate its relative Monte Carlo bias

$$R{B}_{MC}\left(\widehat{Y}\right)=\frac{{E}_{MC}\left(\widehat{Y}\right)-Y}{Y}\times 100$$

with ${E}_{MC}\left(\widehat{Y}\right)=\left(1/10,000\right){\displaystyle {\sum}_{b=1}^{10,000}}{\widehat{Y}}_{(b)}$, and ${\widehat{Y}}_{(b)}$ the value of estimator $\widehat{Y}$ for sample $b$. To measure the variability of $\widehat{Y}$, we calculate its Monte Carlo mean square error

$$MS{E}_{MC}\left(\widehat{Y}\right)=\frac{1}{10,000}{\displaystyle \sum _{b=1}^{10,000}}{\left({\widehat{Y}}_{(b)}-Y\right)}^{2}\mathrm{.}$$
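The two Monte Carlo criteria above translate directly into code. The following is a minimal sketch; the function name `monte_carlo_summary` and the toy values are illustrative assumptions, not results from the study.

```python
import numpy as np

def monte_carlo_summary(estimates, true_total):
    """Relative Monte Carlo bias (in %) and Monte Carlo mean square error."""
    estimates = np.asarray(estimates, dtype=float)
    e_mc = estimates.mean()                            # E_MC(Y_hat)
    rb_mc = 100.0 * (e_mc - true_total) / true_total   # RB_MC(Y_hat), in percent
    mse_mc = np.mean((estimates - true_total) ** 2)    # MSE_MC(Y_hat)
    return rb_mc, mse_mc

# Toy example with 4 replicates instead of 10,000 (made-up numbers)
rb, mse = monte_carlo_summary([90.0, 110.0, 100.0, 104.0], true_total=100.0)
print(rb, mse)  # 1.0 54.0
```

Note that the mean square error is computed around the true total $Y$ rather than around ${E}_{MC}\left(\widehat{Y}\right)$, so it includes the squared bias.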

The results are given in Table 4.2. As emphasized by a referee, the performance of the HTA estimator does not depend on the sample size ${n}_{hi}^{B}$ chosen. For consistency, the HTA columns of Table 4.2 report the results obtained in the simulations with ${n}_{hi}^{B}=5$ only: for identical sample sizes ${m}_{h}$ and ${n}_{hi}^{A}$, the same HTA results are repeated in the rows with ${n}_{hi}^{B}=20$.

All estimators are virtually unbiased. The HART2 estimator gives the best results overall in terms of mean square error, as could be expected. The HTA estimator gives almost equivalent results. This is explained by the fact that the optimal coefficient is near 1 (in the simulations, ${\widehat{\theta}}_{opt}$ is between $0.80$ and $1.06$), in which case formula (2.1) shows that the HART2 and HTA estimators are very close. In the appendix we present some general conditions under which this property holds approximately. Of the three remaining estimators (HART1, KALT and BANK), HART1 yields the best results, with a mean square error lower than or equivalent to that of KALT and BANK in 11 out of 16 cases.

Pop. | ${m}_{h}$ | ${n}_{hi}^{A}$ | ${n}_{hi}^{B}$ | HART1 RB (%) | HART1 MSE ($\times {10}^{9}$) | HART2 RB (%) | HART2 MSE ($\times {10}^{9}$) | KALT RB (%) | KALT MSE ($\times {10}^{9}$) | BANK RB (%) | BANK MSE ($\times {10}^{9}$) | HTA RB (%) | HTA MSE ($\times {10}^{9}$) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 10 | 5 | 0.05 | 7.76 | 0.01 | 5.70 | 0.05 | 7.79 | 0.06 | 8.56 | 0.04 | 5.75 |
1 | 5 | 10 | 20 | 0.01 | 7.57 | -0.05 | 5.57 | 0.03 | 11.36 | 0.04 | 12.75 | 0.04 | 5.75 |
1 | 5 | 40 | 5 | 0.01 | 5.01 | -0.02 | 4.51 | -0.02 | 4.57 | -0.02 | 4.81 | -0.02 | 4.52 |
1 | 5 | 40 | 20 | 0.00 | 4.65 | -0.01 | 4.33 | 0.00 | 4.66 | 0.00 | 5.22 | -0.02 | 4.52 |
1 | 25 | 10 | 5 | -0.03 | 1.19 | -0.02 | 0.78 | -0.03 | 1.20 | -0.02 | 1.34 | -0.01 | 0.78 |
1 | 25 | 10 | 20 | -0.01 | 1.17 | 0.00 | 0.78 | -0.03 | 1.94 | -0.03 | 2.22 | -0.01 | 0.78 |
1 | 25 | 40 | 5 | 0.00 | 0.62 | 0.01 | 0.51 | 0.00 | 0.52 | 0.00 | 0.57 | 0.01 | 0.51 |
1 | 25 | 40 | 20 | 0.02 | 0.58 | 0.01 | 0.51 | 0.02 | 0.58 | 0.02 | 0.68 | 0.01 | 0.51 |
2 | 5 | 10 | 5 | 0.00 | 3.59 | 0.01 | 1.15 | 0.00 | 3.56 | 0.02 | 4.38 | 0.01 | 1.15 |
2 | 5 | 10 | 20 | 0.00 | 3.60 | -0.02 | 1.15 | 0.00 | 7.38 | 0.00 | 8.76 | 0.01 | 1.15 |
2 | 5 | 40 | 5 | 0.00 | 1.48 | 0.01 | 1.07 | 0.00 | 1.13 | 0.01 | 1.35 | 0.01 | 1.07 |
2 | 5 | 40 | 20 | 0.00 | 1.49 | -0.01 | 1.09 | 0.00 | 1.49 | 0.00 | 2.03 | 0.01 | 1.07 |
2 | 25 | 10 | 5 | 0.00 | 0.63 | 0.00 | 0.14 | 0.00 | 0.63 | 0.00 | 0.78 | 0.00 | 0.14 |
2 | 25 | 10 | 20 | 0.00 | 0.62 | 0.00 | 0.13 | 0.00 | 1.38 | 0.00 | 1.67 | 0.00 | 0.14 |
2 | 25 | 40 | 5 | 0.00 | 0.20 | 0.00 | 0.12 | 0.00 | 0.13 | 0.00 | 0.18 | 0.00 | 0.12 |
2 | 25 | 40 | 20 | 0.00 | 0.20 | 0.00 | 0.12 | 0.00 | 0.20 | 0.01 | 0.31 | 0.00 | 0.12 |

For each estimator, all other things being equal, the mean square error is lower in population 2 than in population 1. This result comes from the fact that the variance due to the first-stage selection, which is the same for each estimator and is

$$V\left({\displaystyle \sum _{i\in {S}_{I}}}{d}_{Ii}{Y}_{i}\right)={\displaystyle \sum _{h=1}^{H}}{M}_{h}^{2}\left(\frac{1}{{m}_{h}}-\frac{1}{{M}_{h}}\right){S}_{Y;{U}_{Ih}}^{2}\mathrm{,}\qquad \text{(4.3)}$$

is larger in population 1: the dispersion term ${S}_{Y;{U}_{Ih}}^{2}={\left({M}_{h}-1\right)}^{-1}{\displaystyle {\sum}_{{u}_{i}\in {U}_{Ih}}{\left({Y}_{i}-{\overline{Y}}_{{U}_{Ih}}\right)}^{2}}$ increases with ${\sigma}_{h}^{2}$ and, to a lesser degree, increases when $\rho $ decreases. The mean square error decreases for each estimator when the number ${m}_{h}$ of primary sampling units selected in each stratum increases, since in this case the common variance term given in (4.3) decreases. Similarly, the mean square error decreases for each estimator when ${n}^{A}$ increases, since in this case the variance due to the second stage of selection decreases. For the HART1 and HART2 estimators, the mean square error is stable when ${n}^{B}$ increases and, more surprisingly, for the KALT and BANK estimators the mean square error increases when ${n}^{B}$ increases. This somewhat counterintuitive result is due to the combination of two facts. On the one hand, the contribution of sample ${S}^{B}$ to the variance due to the second stage of selection is low: increasing ${n}^{B}$ may reduce this contribution, but even then the overall reduction of the variance is marginal. On the other hand, with the KALT and BANK estimators, the contribution of sample ${S}^{A}$ to the variance due to the second stage of selection increases when ${n}^{B}$ increases.
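The effect of ${m}_{h}$ on the first-stage variance (4.3) can be illustrated numerically. This is a sketch under stated assumptions: the function name `first_stage_variance` is hypothetical and the PSU totals ${Y}_{i}$ below are made-up values, not those of the simulated populations.

```python
import numpy as np

def first_stage_variance(psu_totals, m_h):
    """Variance (4.3) of the first-stage expansion estimator under stratified SRS.

    psu_totals: array of shape (H, M_h) holding the PSU totals Y_i per stratum.
    m_h: number of PSUs selected in each stratum (the same in all strata here).
    """
    psu_totals = np.asarray(psu_totals, dtype=float)
    M_h = psu_totals.shape[1]
    S2 = psu_totals.var(axis=1, ddof=1)   # S^2_{Y;U_Ih}, with divisor M_h - 1
    return float(np.sum(M_h**2 * (1.0 / m_h - 1.0 / M_h) * S2))

# Hypothetical PSU totals for H = 4 strata of M_h = 50 PSUs
rng = np.random.default_rng(0)
psu_totals = rng.normal(loc=15000.0, scale=1500.0, size=(4, 50))
v5 = first_stage_variance(psu_totals, m_h=5)
v25 = first_stage_variance(psu_totals, m_h=25)
print(v5 / v25)  # (1/5 - 1/50) / (1/25 - 1/50) = 9, up to rounding
```

With ${M}_{h}=50$, moving from ${m}_{h}=5$ to ${m}_{h}=25$ therefore divides the term (4.3) by 9 whatever the values of ${S}_{Y;{U}_{Ih}}^{2}$.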

In the case of KALT, the estimator can be rewritten as

$${\widehat{Y}}_{KA}={\displaystyle \sum _{h=1}^{H}}\frac{{M}_{h}}{{m}_{h}}{\displaystyle \sum _{i\in {S}_{Ih}}}{\widehat{Y}}_{KA\mathrm{,}i}$$

with

$${\widehat{Y}}_{KA\mathrm{,}i}=\frac{1}{{f}_{hi}^{A}}{\displaystyle \sum _{k\in {S}_{i}^{A}}}{m}_{k|i}^{A}{y}_{k}+\frac{1}{{f}_{hi}^{A}+{f}_{hi}^{B}}{\displaystyle \sum _{k\in {S}_{i}^{B}}}{y}_{k}\quad \text{and}\quad {m}_{k|i}^{A}=\left\{\begin{array}{ll}1\hfill & \text{if }k\in a\cap {u}_{i}\mathrm{,}\hfill \\ \frac{{f}_{hi}^{A}}{{f}_{hi}^{A}+{f}_{hi}^{B}}\hfill & \text{if }k\in ab\cap {u}_{i}\mathrm{.}\hfill \end{array}\right.\qquad \text{(4.4)}$$

In (4.4), the dispersion of the variable ${m}_{k|i}^{A}$ (and therefore, that of ${m}_{k|i}^{A}{y}_{k}$ ) increases when the factor ${f}_{hi}^{A}/\left({f}_{hi}^{A}+{f}_{hi}^{B}\right)$ moves away from $1$. This factor is near 1 when ${f}_{hi}^{B}$ is small compared to ${f}_{hi}^{A}$ (and therefore, if ${n}^{B}$ is small compared to ${n}^{A}$ ), but moves away from 1 when ${n}^{B}$ increases. Note that the variance (conditional on ${S}_{I}$ ) of the second term of ${\widehat{Y}}_{KA\mathrm{,}i}$ is equal to

$$V\left(\frac{1}{{f}_{hi}^{A}+{f}_{hi}^{B}}{\displaystyle \sum _{k\in {S}_{i}^{B}}}{y}_{k}|{S}_{I}\right)={\left({N}_{hi}^{A}\right)}^{2}{N}_{hi}^{B}\times \frac{{n}_{hi}^{B}\left({N}_{hi}^{B}-{n}_{hi}^{B}\right)}{{\left({N}_{hi}^{B}{n}_{hi}^{A}+{N}_{hi}^{A}{n}_{hi}^{B}\right)}^{2}}\times {S}_{{u}_{hi}^{B}}^{2}$$

with ${S}_{{u}_{hi}^{B}}^{2}={\left({N}_{hi}^{B}-1\right)}^{-1}{\displaystyle {\sum}_{k\in {u}_{hi}^{B}}}{\left({y}_{k}-{\overline{y}}_{{u}_{hi}^{B}}\right)}^{2}$. This variance does not necessarily decrease when ${n}_{hi}^{B}$ increases. For example, one of the cases considered in the simulations corresponds to ${N}_{hi}^{A}=100$, ${N}_{hi}^{B}\simeq 50$ and ${n}_{hi}^{A}=40$. In this case, the term ${n}_{hi}^{B}\left({N}_{hi}^{B}-{n}_{hi}^{B}\right)/{\left({N}_{hi}^{B}{n}_{hi}^{A}+{N}_{hi}^{A}{n}_{hi}^{B}\right)}^{2}$ attains its maximum value for ${n}_{hi}^{B}=11$.
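The location of this maximum can be checked numerically by evaluating the factor over all admissible integer sample sizes. In this sketch the function name `kalt_second_term_factor` is hypothetical and ${N}_{hi}^{B}$ is taken as exactly 50, whereas it is only approximately 50 in the simulations.

```python
def kalt_second_term_factor(n_B, N_A=100, N_B=50, n_A=40):
    """Factor n_B (N_B - n_B) / (N_B n_A + N_A n_B)^2 from the variance formula."""
    return n_B * (N_B - n_B) / (N_B * n_A + N_A * n_B) ** 2

# Evaluate the factor for every integer sample size 1 <= n_B <= N_B
values = {n_B: kalt_second_term_factor(n_B) for n_B in range(1, 51)}
n_B_max = max(values, key=values.get)
print(n_B_max)  # 11

# The two sample sizes used in the simulations
print(kalt_second_term_factor(5), kalt_second_term_factor(20))
```

Treating ${n}_{hi}^{B}$ as continuous, the derivative of the factor vanishes at ${n}_{hi}^{B}=100/9\approx 11.1$, which agrees with the integer maximum. Under the same assumptions the factor is slightly larger for ${n}_{hi}^{B}=20$ than for ${n}_{hi}^{B}=5$.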

In the case of BANK, the estimator can be rewritten as

$${\widehat{Y}}_{HT}={\displaystyle \sum _{h=1}^{H}}\frac{{M}_{h}}{{m}_{h}}{\displaystyle \sum _{i\in {S}_{Ih}}}{\widehat{Y}}_{HT,i}$$

with

$${\widehat{Y}}_{HT\mathrm{,}i}={\displaystyle \sum _{k\in {S}_{i}^{A}\cup {S}_{i}^{B}}}\frac{{y}_{k}}{{\pi}_{k|i}^{HT}}\quad \text{and}\quad {\pi}_{k|i}^{HT}=\left\{\begin{array}{ll}{f}_{hi}^{A}\hfill & \text{if }k\in a\mathrm{,}\hfill \\ {f}_{hi}^{A}+{f}_{hi}^{B}\left(1-{f}_{hi}^{A}\right)\hfill & \text{if }k\in ab.\hfill \end{array}\right.\qquad \text{(4.5)}$$

In (4.5), dispersion of the variable ${\pi}_{k|i}^{HT}$ increases when the factor ${f}_{hi}^{B}\left(1-{f}_{hi}^{A}\right)$ increases. This factor is close to $0$ when ${n}_{hi}^{B}$ (and, therefore, ${f}_{hi}^{B}$ ) is low, but increases when ${n}_{hi}^{B}$ increases.
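To give orders of magnitude, the factors discussed for (4.4) and (4.5) can be evaluated for the four second-stage configurations of the simulations. This is an illustrative sketch: the function names are hypothetical and ${N}_{hi}^{B}$ is taken as exactly 50, although it varies around 50 across primary sampling units.

```python
N_A, N_B = 100, 50  # assumption: N_hi^B fixed at 50 (about 50 in the simulations)

def kalt_weight_factor(n_A, n_B):
    """Factor f_A / (f_A + f_B) applied to units of ab in (4.4)."""
    f_A, f_B = n_A / N_A, n_B / N_B
    return f_A / (f_A + f_B)

def bank_extra_probability(n_A, n_B):
    """Term f_B (1 - f_A) added to the inclusion probability of units of ab in (4.5)."""
    f_A, f_B = n_A / N_A, n_B / N_B
    return f_B * (1.0 - f_A)

for n_A in (10, 40):
    for n_B in (5, 20):
        print(n_A, n_B,
              round(kalt_weight_factor(n_A, n_B), 2),
              round(bank_extra_probability(n_A, n_B), 2))
```

Under these assumptions, for ${n}_{hi}^{A}=10$ the KALT factor drops from 0.5 to 0.2 and the BANK term rises from 0.09 to 0.36 when ${n}_{hi}^{B}$ goes from 5 to 20; for ${n}_{hi}^{A}=40$ the corresponding changes are from 0.8 to 0.5 and from 0.06 to 0.24.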
