4. Simulation study
Guillaume Chauvet and Guylène Tandeau de Marsac
We use the artificial populations proposed by Saigo (2010). We generate two populations, each containing primary
sampling units grouped in strata of size . Each primary
sampling unit contains secondary
units. In each population, we generate
for each primary sampling unit :
where the values and are those used
by Saigo (2010). The term makes it
possible to control dispersion between the primary sampling units. The are iid,
generated according to a standard normal distribution . For each unit , we then generate the value according to
the model
where the are iid,
generated according to standard normal distribution. The variance term in the model (4.2) can give an intra-cluster
correlation coefficient approximately equal to . In particular,
the larger the coefficient,
the less the values are dispersed
in the primary sampling units. We use for population
1 and for population
2, which reflects less dispersion of the variable in population
2. The sampling frame corresponds to
all secondary units, and the corresponding part of is , of size . For each secondary unit , a value is generated
according to uniform distribution over . The sampling
frame corresponds to
the secondary units such that , and the corresponding part of is of size . This gives,
therefore, the situation where the first frame covers the entire population and the second frame is nested within it. The framework
selected in the simulations is the one used in the INSEE household surveys,
with an extension to target a specific sub-population. For these surveys, a sample of communes (or
groups of communes) is selected in the first stage. A sub-sample of dwellings is
then selected in each selected commune; the pooled sample represents the
entire population of dwellings. A second
sub-sample of dwellings is
then selected from within a sub-population of each selected commune, in order to target a specific sub-population (for example, dwellings
located in a Sensitive Urban Area); the pooled sample represents only
the targeted sub-population.
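The population set-up described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: models (4.1) and (4.2) are not reproduced here, so the additive form used below (a stratum location plus a scaled normal effect per primary sampling unit, plus unit-level noise scaled so that the intra-cluster correlation is about `rho`) is an assumption, as is reading the two per-stratum values of Table 4.1 as `(mu_h, sigma_h)`. The number of primary sampling units, the number of secondary units, the value of `rho` and the threshold `second_frame_threshold` are hypothetical.

```python
import math
import random


def generate_population(strata, rho, n_psu, n_units, second_frame_threshold, seed=0):
    """Return a list of secondary units for a stratified two-stage population.

    strata: list of (mu_h, sigma_h) pairs, one per stratum (assumed roles).
    rho: target intra-cluster correlation, strictly between 0 and 1.
    """
    rng = random.Random(seed)
    # Assumed unit-level scale: a between-PSU variance of sigma_h^2 and a
    # within-PSU variance of sigma_h^2 * (1 - rho) / rho give a correlation rho.
    within_scale = math.sqrt((1.0 - rho) / rho)
    units = []
    for h, (mu_h, sigma_h) in enumerate(strata, start=1):
        for i in range(n_psu):
            psu_mean = mu_h + sigma_h * rng.gauss(0.0, 1.0)
            for _ in range(n_units):
                y = psu_mean + within_scale * sigma_h * rng.gauss(0.0, 1.0)
                # A uniform draw decides membership in the second (nested) frame.
                in_second_frame = rng.random() <= second_frame_threshold
                units.append({"stratum": h, "psu": (h, i), "y": y,
                              "in_second_frame": in_second_frame})
    return units


# Hypothetical sizes, rho and threshold, chosen only for illustration.
strata_pop1 = [(200, 20), (150, 15), (120, 12), (100, 10)]
population = generate_population(strata_pop1, rho=0.2, n_psu=50, n_units=40,
                                 second_frame_threshold=0.5, seed=1)
share_second_frame = sum(u["in_second_frame"] for u in population) / len(population)
```

Every unit belongs to the first frame, and the uniform draw marks the subset that also belongs to the second frame, which reproduces the nested-frame situation of the simulations.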
In each of the two populations
created, several sampling designs are considered; Table 4.1 presents, for each
population, the sample sizes per stratum in the first and second stages, which give eight possible combinations,
as well as the parameter values used in each stratum. In the first
stage, we independently select in each stratum a
sample of either 5 or 25 primary
sampling units by simple random sampling. In
the second stage, we select in each sampled primary sampling unit a
sample of either 10 or 40 secondary units by simple
random sampling from the part of the first frame it contains. Also in the second
stage, we select in each sampled primary sampling unit a
sample of either 5 or 20 secondary units by simple
random sampling from the part of the second frame it contains. We also define the sampling
rates within each of these two parts.
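As a rough illustration of this design, the sketch below draws a stratified two-stage sample by simple random sampling without replacement at both stages and attaches to each sampled unit the inverse of its inclusion probability. The function name `two_stage_srs`, the dictionary layout of `frame` and the toy sizes are assumptions made for the example. Taking all units when a primary sampling unit contains fewer than `n_units` eligible units is an illustrative choice, not something stated in the text.

```python
import random


def two_stage_srs(frame, n_psu, n_units, rng):
    """Stratified two-stage simple random sampling without replacement.

    frame: dict stratum -> dict psu_id -> list of y-values of eligible units.
    Returns a list of (y, weight) pairs, weight = inverse inclusion probability.
    """
    sample = []
    for psus in frame.values():
        psu_ids = sorted(psus)
        first_rate = n_psu / len(psu_ids)
        for psu_id in rng.sample(psu_ids, n_psu):
            units = psus[psu_id]
            if not units:
                continue  # no eligible unit in this primary sampling unit
            m = min(n_units, len(units))  # illustrative choice: take all if too few
            second_rate = m / len(units)
            weight = 1.0 / (first_rate * second_rate)
            sample.extend((y, weight) for y in rng.sample(units, m))
    return sample


# Toy frame: 4 strata, 50 primary sampling units each, 40 units equal to 1.
rng = random.Random(1)
toy_frame = {h: {i: [1.0] * 40 for i in range(50)} for h in range(1, 5)}
sample = two_stage_srs(toy_frame, n_psu=5, n_units=10, rng=rng)
estimated_total = sum(y * w for y, w in sample)
```

With every value equal to 1 and equal sizes throughout, the weighted sum recovers the population size of 8,000 units up to floating-point error, which is a quick check that the weights are the inverse of the product of the two sampling rates.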
Table 4.1
Parameters used in each stratum to generate both populations and select samples
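| Population | First-stage sample size per stratum | Second-stage sample size, first frame | Second-stage sample size, second frame | Stratum 1 parameters | Stratum 2 parameters | Stratum 3 parameters | Stratum 4 parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Population 1 | 5 or 25 | 10 or 40 | 5 or 20 | 200, 20 | 150, 15 | 120, 12 | 100, 10 |
| Population 2 | 5 or 25 | 10 or 40 | 5 or 20 | 200, 10 | 150, 7.5 | 120, 6 | 100, 5 |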
For each sample, Hartley’s
estimator given in (3.4) is calculated with either (HART1), or for
value of the optimal
coefficient estimator given in (3.7) (HART2), with
noting the average of
variable on a subset . For each
sample, the Kalton and Anderson estimator (KALT) given in (3.8) is also
calculated, as well as the Bankier estimator (BANK) given in (3.9), and the
Horvitz-Thompson estimator based on the first sample alone (HTA). The sampling procedure is repeated 10,000
times. To measure the bias of an
estimator, we calculate its relative Monte Carlo bias, expressed as a percentage of the true value
and computed from the average of the values taken by the
estimator over the simulated samples. To measure the
variability of the estimator, we calculate its Monte Carlo mean square error.
The results are given in Table 4.2. As emphasized by a referee, the
performance of the HTA estimator does not depend on the size of the sample selected in the second frame. For consistency, Table 4.2 reports the HTA results obtained in the
simulations with a second-frame sample size of 5 only; for identical first-stage and first-frame sample sizes, the same results are reported
in the case of a second-frame sample size of 20.
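The two Monte Carlo performance measures can be computed as in the short sketch below. The displayed formulas are not reproduced in this text, so the usual definitions are assumed: the relative bias is the difference between the Monte Carlo mean of the estimates and the true value, divided by the true value and multiplied by 100, and the mean square error is the average squared deviation from the true value. The function names are hypothetical, and any scaling factor applied to the mean square errors reported in Table 4.2 is not covered here.

```python
def relative_bias_percent(estimates, true_value):
    """Relative Monte Carlo bias, in percent of the true value."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value


def monte_carlo_mse(estimates, true_value):
    """Monte Carlo mean square error around the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


# Toy values standing in for the estimates obtained over repeated samples.
toy_estimates = [9.0, 11.0, 10.0, 12.0]
rb = relative_bias_percent(toy_estimates, true_value=10.0)
mse = monte_carlo_mse(toy_estimates, true_value=10.0)
```

In the simulations, the list of estimates would hold the 10,000 values of a given estimator, one per replicated sample.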
All estimators are virtually
unbiased. The HART2 estimator gives
the best results overall in terms of mean square error, as could be expected. The HTA estimator gives almost equivalent
results. This result is explained by the
fact that the optimal coefficient is near 1 (in the simulations, is between and ), and that in
this case, the formula (2.1) shows that the HART2 and HTA estimators are very
close. In the appendix we present some
general conditions under which this property approximately holds. Of the three remaining estimators (HART1, KALT and BANK), HART1 yields the best
results, with a mean square error lower than or equal to that of both KALT and
BANK in 11 out of 16 cases.
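The closeness of HART2 and HTA when the coefficient is near 1 can be illustrated with the textbook form of Hartley's dual-frame estimator, specialised to the case where the first frame covers the whole population. This is a sketch under that assumption and is not a transcription of (3.4) or (2.1); the function names, argument names and numerical values are hypothetical.

```python
def hartley_estimate(total_first_only, total_overlap_first, total_overlap_second, theta):
    """Textbook Hartley combination when the second frame is nested in the first.

    total_first_only: estimated total of units outside the second frame (first sample).
    total_overlap_first: estimated total of the overlap, from the first sample.
    total_overlap_second: estimated total of the overlap, from the second sample.
    theta: coefficient given to the first sample in the overlap domain.
    """
    return (total_first_only
            + theta * total_overlap_first
            + (1.0 - theta) * total_overlap_second)


def ht_first_sample(total_first_only, total_overlap_first):
    """Horvitz-Thompson estimate using the first sample alone."""
    return total_first_only + total_overlap_first


# Hypothetical estimated totals, chosen only for illustration.
ht = ht_first_sample(600.0, 400.0)
hartley_theta_one = hartley_estimate(600.0, 400.0, 380.0, theta=1.0)
hartley_near_one = hartley_estimate(600.0, 400.0, 380.0, theta=0.95)
```

In this form the two estimates differ by `(1 - theta)` times the gap between the two estimates of the overlap total, so a coefficient close to 1 leaves little room for the second sample to change the result.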
Table 4.2
Relative bias and mean squared error of five estimators
| Pop. | First-stage sample size | Second-stage sample size, first frame | Second-stage sample size, second frame | HART1 RB (%) | HART1 MSE | HART2 RB (%) | HART2 MSE | KALT RB (%) | KALT MSE | BANK RB (%) | BANK MSE | HTA RB (%) | HTA MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 5 | 10 | 5 | 0.05 | 7.76 | 0.01 | 5.70 | 0.05 | 7.79 | 0.06 | 8.56 | 0.04 | 5.75 |
| 1 | 5 | 10 | 20 | 0.01 | 7.57 | -0.05 | 5.57 | 0.03 | 11.36 | 0.04 | 12.75 | 0.04 | 5.75 |
| 1 | 5 | 40 | 5 | 0.01 | 5.01 | -0.02 | 4.51 | -0.02 | 4.57 | -0.02 | 4.81 | -0.02 | 4.52 |
| 1 | 5 | 40 | 20 | 0.00 | 4.65 | -0.01 | 4.33 | 0.00 | 4.66 | 0.00 | 5.22 | -0.02 | 4.52 |
| 1 | 25 | 10 | 5 | -0.03 | 1.19 | -0.02 | 0.78 | -0.03 | 1.20 | -0.02 | 1.34 | -0.01 | 0.78 |
| 1 | 25 | 10 | 20 | -0.01 | 1.17 | 0.00 | 0.78 | -0.03 | 1.94 | -0.03 | 2.22 | -0.01 | 0.78 |
| 1 | 25 | 40 | 5 | 0.00 | 0.62 | 0.01 | 0.51 | 0.00 | 0.52 | 0.00 | 0.57 | 0.01 | 0.51 |
| 1 | 25 | 40 | 20 | 0.02 | 0.58 | 0.01 | 0.51 | 0.02 | 0.58 | 0.02 | 0.68 | 0.01 | 0.51 |
| 2 | 5 | 10 | 5 | 0.00 | 3.59 | 0.01 | 1.15 | 0.00 | 3.56 | 0.02 | 4.38 | 0.01 | 1.15 |
| 2 | 5 | 10 | 20 | 0.00 | 3.60 | -0.02 | 1.15 | 0.00 | 7.38 | 0.00 | 8.76 | 0.01 | 1.15 |
| 2 | 5 | 40 | 5 | 0.00 | 1.48 | 0.01 | 1.07 | 0.00 | 1.13 | 0.01 | 1.35 | 0.01 | 1.07 |
| 2 | 5 | 40 | 20 | 0.00 | 1.49 | -0.01 | 1.09 | 0.00 | 1.49 | 0.00 | 2.03 | 0.01 | 1.07 |
| 2 | 25 | 10 | 5 | 0.00 | 0.63 | 0.00 | 0.14 | 0.00 | 0.63 | 0.00 | 0.78 | 0.00 | 0.14 |
| 2 | 25 | 10 | 20 | 0.00 | 0.62 | 0.00 | 0.13 | 0.00 | 1.38 | 0.00 | 1.67 | 0.00 | 0.14 |
| 2 | 25 | 40 | 5 | 0.00 | 0.20 | 0.00 | 0.12 | 0.00 | 0.13 | 0.00 | 0.18 | 0.00 | 0.12 |
| 2 | 25 | 40 | 20 | 0.00 | 0.20 | 0.00 | 0.12 | 0.00 | 0.20 | 0.01 | 0.31 | 0.00 | 0.12 |
For each estimator, all other
things being equal, the mean square error is lower in population 2 than in
population 1. This result comes from the fact that the variance due to the
first-stage selection, which is the same for each estimator and is given in (4.3),
is larger in population 1: the dispersion term in (4.3) increases with the parameter controlling the dispersion between primary sampling units and, to a
lesser degree, increases when the intra-cluster correlation coefficient decreases. The mean square error decreases for each
estimator when the number of primary
sampling units selected in each stratum increases, since in this case the
common variance term given in (4.3) decreases. Similarly,
the mean square error decreases for each estimator when the size of the sample selected in the first frame increases,
since in this case the variance due to the second stage of selection decreases. For the HART1 and HART2 estimators, the mean
square error is stable when the size of the sample selected in the second frame increases and,
more surprisingly, for the KALT and BANK estimators the mean square error
increases with it. This somewhat counterintuitive result is due
to the combination of two facts. On the one
hand, the contribution of the second-frame sample to the variance
due to the second stage of selection is low:
increasing its size may reduce this
variance, but even in this case the overall reduction of the variance is marginal. On the other hand, with the KALT and BANK
estimators, the contribution of the first-frame sample to the variance
due to the second stage of selection increases when the size of the second-frame sample increases.
In the case of KALT, the
estimator can be re-expressed
with
In (4.4), the dispersion of
the variable (and therefore,
that of ) increases
when the factor moves away from
. This factor is
near 1 when is small
compared to (and therefore,
if is small compared
to ), but moves
away from 1 when increases. Note that the variance (conditional on ) of the second
term of is equal to
with . This variance
does not necessarily decrease when increases. For example, one of the cases considered in
the simulations corresponds to , and . In this case,
the term attains its
maximum value for .
In the case of BANK, the
estimator can be re-expressed
with
In (4.5), dispersion of the
variable increases when
the factor increases. This factor is close to when (and,
therefore, ) is low, but
increases when increases.