6. Simulation studies
Cyril Favre Martinoz, David Haziza and Jean-François Beaumont
6.1 Winsorization in a simple random sampling without-replacement design
We carried out a simulation study to examine the properties of several robust estimators using 11 populations. Each of the first 10 populations, of size $N$, consists of the values of a variable of interest $y$. In each population, the $y$-values were generated according to a model whose random components have the distributions described in Table 6.1. Population 1 was generated according to a normal distribution. Populations 2 through 5 were generated using a mixture of normal distributions with contamination rates ranging from 0.5% to 5%. Populations 6 through 8 were generated according to skewed distributions. Populations 9 and 10 were generated using a mixture of lognormal distributions with contamination rates equal to 0.5% and 5%. Population 11 comes from the information technology survey produced by the French National Institute for Statistics and Economic Studies (INSEE) in 2011. One of the survey's objectives is to estimate the e-commerce sales of French companies. We use the "sales" variable in our simulation. The distribution of $y$ in each population is plotted in Figure 6.1. In addition, Table 6.2 presents a number of descriptive statistics for the simulated populations. For confidentiality reasons, the units for Population 11 are not shown in the plot. Similarly, there are no descriptive statistics for Population 11 in Table 6.2.
In each population, we selected $R$ samples according to a simple random sampling without-replacement design of size $n = 100$, $300$ and $500$. For each sample, we calculated the expansion estimator and the robust estimator (4.8). Let $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$ be the values of the $y$ variable arranged in ascending order. We also calculated the first-, second- and third-order winsorized estimators, where the $k$th-order winsorized estimator is obtained by replacing the $k$ largest values in the sample with the value $y_{(n-k)}$. In a classical statistical context, Rivest (1994) showed that the first-order winsorized estimator has good mean-square-error properties for a large class of skewed distributions.
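The winsorization rule described above can be sketched in a few lines. This is only an illustration, assuming that the target is a population total and that the sample comes from simple random sampling without replacement, so that every unit carries the weight $N/n$; the function name `winsorized_total` is hypothetical and is not taken from the paper.

```python
import numpy as np

def winsorized_total(y_sample, N, k=1):
    """k-th order winsorized expansion estimate of a total (illustrative).

    The k largest sample values are replaced by y_(n-k), the (k+1)-th
    largest value. With k = 0 this reduces to the expansion estimator.
    Assumption: equal weights N/n, as under simple random sampling
    without replacement.
    """
    y = np.sort(np.asarray(y_sample, dtype=float))  # y_(1) <= ... <= y_(n)
    n = y.size
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    if k > 0:
        # positions n-k, ..., n-1 (0-based) hold the k largest values;
        # position n-k-1 holds y_(n-k)
        y[n - k:] = y[n - k - 1]
    return N / n * y.sum()

# Toy sample of n = 4 units from a population of N = 8 units
sample = [1.0, 2.0, 3.0, 100.0]
print(winsorized_total(sample, N=8, k=0))  # expansion estimate
print(winsorized_total(sample, N=8, k=1))  # once-winsorized estimate
```

Because the largest values are only ever pulled down, the winsorized estimate never exceeds the expansion estimate, which is why the winsorized estimators tend to show a negative bias.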
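As a measure of the bias of an estimator $\hat{\theta}$ of a parameter $\theta$, we calculated the Monte Carlo relative bias (in percentage):

$$RB_{MC}(\hat{\theta}) = 100 \times \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{\theta}_{(r)} - \theta}{\theta},$$

where $\hat{\theta}_{(r)}$ denotes the estimator $\hat{\theta}$ in sample $r$, $r = 1, \dots, R$, and $R$ is the number of simulated samples. We also calculated the relative efficiency of the robust estimators with respect to the expansion estimator,

$$RE_{MC}(\hat{\theta}) = 100 \times \frac{MSE_{MC}(\hat{\theta})}{MSE_{MC}(\hat{\theta}_{\text{exp}})}, \qquad MSE_{MC}(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} \left(\hat{\theta}_{(r)} - \theta\right)^2,$$

where $\hat{\theta}_{\text{exp}}$ denotes the expansion estimator. A value below 100 therefore indicates an estimator that is more efficient than the expansion estimator. The results are shown in Table 6.3.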
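The two Monte Carlo measures can be computed with a short simulation loop. The sketch below is an illustration under stated assumptions: the relative efficiency is taken to be 100 times the ratio of Monte Carlo mean square errors (estimator over expansion estimator), the population is a hypothetical lognormal one rather than any of the 11 populations of the study, and the function names are invented for the example.

```python
import numpy as np

def once_winsorized_total(y, N):
    """Replace the largest sample value by the second largest (illustrative)."""
    y = np.sort(np.asarray(y, dtype=float))
    y[-1] = y[-2]
    return N / y.size * y.sum()

def monte_carlo_summary(population, n, estimator, R=1000, seed=0):
    """Return (relative bias in %, relative efficiency in %) of `estimator`.

    `estimator(sample, N)` must return an estimate of the population total.
    Relative efficiency is measured against the expansion estimator
    computed on the same R simple random samples drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    pop = np.asarray(population, dtype=float)
    N, true_total = pop.size, pop.sum()
    est = np.empty(R)
    exp_est = np.empty(R)
    for r in range(R):
        s = rng.choice(pop, size=n, replace=False)
        est[r] = estimator(s, N)
        exp_est[r] = N / n * s.sum()
    rb = 100 * np.mean(est - true_total) / true_total
    mse = np.mean((est - true_total) ** 2)
    mse_exp = np.mean((exp_est - true_total) ** 2)
    return rb, 100 * mse / mse_exp

# Hypothetical skewed population (parameters chosen for illustration only)
population = np.random.default_rng(123).lognormal(mean=7.0, sigma=1.2, size=5000)
rb, re = monte_carlo_summary(population, n=100, estimator=once_winsorized_total)
print(f"relative bias: {rb:.1f}%, relative efficiency: {re:.0f}")
```

Drawing the expansion and robust estimates from the same samples keeps the comparison paired, so differences in mean square error are not blurred by sampling noise between the two sets of replicates.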
The results presented in Table 6.3 show that the once-winsorized estimator has lower bias and is generally more efficient than the twice- and three-times-winsorized estimators, which is consistent with the results obtained by Rivest (1994). It is interesting to compare the robust estimator and the once-winsorized estimator. In the case of Population 1, which does not contain any influential values, we see that both estimators have low bias and are as efficient as the expansion estimator. In the case of the populations with a mixture of normal distributions (Populations 2 to 5), we observe that the once-winsorized estimator is less efficient than the robust estimator in every scenario except for Population 5 with $n = 300$. In fact, the once-winsorized estimator is less efficient than the expansion estimator in every scenario except for Population 2 with $n = 100$. The robust estimator is more efficient than the expansion estimator except in Populations 4 and 5, for which we observe values of relative efficiency ranging from 91% to 102%. In the case of the populations with a mixture of lognormal distributions (Populations 9 and 10), we see that the bias and efficiency of the once-winsorized estimator and the robust estimator are very similar in all scenarios. The same is true for the skewed populations (Populations 6 to 8), for which the two estimators produce similar results. In the case of Population 11, the robust estimator has a lower bias than the once-winsorized estimator for $n = 100$, though it is less efficient (a relative efficiency of 47% versus 41%). For $n = 300$ and $n = 500$, the robust estimator has a lower bias and is significantly more efficient than the once-winsorized estimator.
Table 6.1
Models used to generate the populations
| Population | distribution | Mixture | distribution | distribution |
| 1 | | No | | |
| 2 | | Yes | | |
| 3 | | Yes | | |
| 4 | | Yes | | |
| 5 | | Yes | | |
| 6 | | No | | |
| 7 | | No | | |
| 8 | | No | | |
| 9 | | Yes | | |
| 10 | | Yes | | |
Table 6.2
Descriptive statistics for the ten simulated populations
| Descriptive statistic | Population 1 | Population 2 | Population 3 | Population 4 | Population 5 | Population 6 | Population 7 | Population 8 | Population 9 | Population 10 |
| min | 132.3 | 314.9 | 105.3 | 275.9 | 187.4 | 23.6 | 7.6 | 2,000.9 | 20.5 | 26.6 |
| max | 3,968 | 79,506 | 78,526 | 80,540 | 78,690 | 252,612 | 379,751 | 2,159 | 305,612 | $1.3 \times 10^{6}$ |
| $Q_1$ | 1,639 | 1,667 | 1,664 | 1,666 | 1,685 | 883 | 743 | 200 | 920 | 913 |
| Median | 1,986 | 1,993 | 1,997 | 2,015 | 2,053 | 1,996 | 1,981 | 2,002 | 2,167 | 2,041 |
| $Q_3$ | 2,330 | 2,337 | 2,339 | 2,349 | 2,421 | 4,505 | 5,337 | 2,004 | 5,018 | 4,927 |
| Mean | 1,985 | 2,267 | 2,536 | 2,976 | 4,661 | 4,005 | 6,118 | 2,004 | 4,738 | 7,883 |
| Standard deviation | 503 | 3,709 | 5,506 | 7,119 | 11,470 | 7,353 | 17,190 | 5.89 | 9,796 | 33,111 |
| Skewness | 0.0 | 14.0 | 10.2 | 7.3 | 4.3 | 4.2 | 11.6 | 11.8 | 12.1 | 18.4 |
| Kurtosis | 3 | 209 | 109 | 56 | 20 | 19 | 196 | 228 | 267 | 570 |
| CV | 0.25 | 1.6 | 2.2 | 2.4 | 2.5 | 1.8 | 2.8 | $2.9 \times 10^{-3}$ | 2.0 | 4.2 |
Figure 6.1 Distribution of the variable of interest in the 11 populations

Table 6.3
Monte Carlo relative bias (in %) and relative efficiency (in parentheses) of several estimators
| Population | $n$ | Robust estimator (4.8) | Winsorization: once | Winsorization: two times | Winsorization: three times |
| 1 | 100 | -0.1(100) | -0.1(100) | -0.2(101) | -0.3(102) |
| 1 | 300 | 0.0(100) | -0.0(100) | -0.0(100) | -0.1(100) |
| 1 | 500 | 0.0(100) | -0.0(100) | -0.0(100) | -0.0(100) |
| 2 | 100 | -4.9(59) | -7.5(87) | -10.7(65) | -11.9(55) |
| 2 | 300 | -2.9(87) | -3.0(129) | -6.8(158) | -9.5(169) |
| 2 | 500 | -1.9(96) | -1.2(122) | -3.6(175) | -6.5(226) |
| 3 | 100 | -6.9(74) | -8.9(122) | -16.5(119) | -20.0(107) |
| 3 | 300 | -3.5(99) | -1.9(122) | -5.6(171) | -10.6(232) |
| 3 | 500 | -2.4(102) | -0.9(107) | -2.2(130) | -4.5(186) |
| 4 | 100 | -7.6(91) | -6.2(131) | -15.5(169) | -24.4(194) |
| 4 | 300 | -2.9(101) | -0.6(103) | -2.1(118) | -4.4(154) |
| 4 | 500 | -2.0(102) | -0.6(102) | -1.1(101) | -1.8(108) |
| 5 | 100 | -5.7(102) | -1.1(104) | -4.1(126) | -9.7(173) |
| 5 | 300 | -2.2(102) | -0.4(100) | -0.8(101) | -1.4(102) |
| 5 | 500 | -1.2(100) | -0.1(100) | -0.3(100) | -0.5(101) |
| 6 | 100 | -5.7(79) | -5.4(75) | -8.2(80) | -10.6(89) |
| 6 | 300 | -2.6(84) | -2.6(79) | -3.9(81) | -5.1(88) |
| 6 | 500 | -2.0(86) | -2.0(81) | -3.0(82) | -3.8(88) |
| 7 | 100 | -8.4(72) | -9.3(73) | -14.7(72) | -18.7(79) |
| 7 | 300 | -4.5(86) | -4.4(95) | -7.8(91) | -10.2(95) |
| 7 | 500 | -3.5(94) | -3.1(105) | -6.0(106) | -8.1(109) |
| 8 | 100 | -0.0(69) | -0.0(75) | -0.0(77) | -0.0(85) |
| 8 | 300 | -0.0(82) | -0.0(88) | -0.0(87) | -0.0(95) |
| 8 | 500 | -0.0(88) | -0.0(96) | -0.0(94) | -0.0(100) |
| 9 | 100 | -5.7(73) | -5.8(71) | -9.5(72) | -12.4(80) |
| 9 | 300 | -3.5(87) | -3.5(85) | -5.4(88) | -6.8(98) |
| 9 | 500 | -2.4(88) | -2.4(88) | -3.8(90) | -4.9(97) |
| 10 | 100 | -13.5(68) | -15.0(70) | -24.6(76) | -31.7(89) |
| 10 | 300 | -7.5(80) | -7.2(79) | -12.1(85) | -16.3(97) |
| 10 | 500 | -5.3(85) | -5.1(83) | -8.4(91) | -11.4(103) |
| 11 | 100 | -22.8(47) | -32.6(41) | -42.0(42) | -47.7(47) |
| 11 | 300 | -14.7(65) | -20.0(77) | -29.6(68) | -34.3(75) |
| 11 | 500 | -11.3(76) | -14.6(96) | -24.3(90) | -29.3(97) |
6.2 Winsorization in a stratified simple random sampling without-replacement design
We also tested the calibration method described in Section 5. We generated a population of size $N = 5,000$, which we divided into five strata of sizes $N_1, \dots, N_5$; see Table 6.4 for the values of $N_h$. In each stratum, we generated a variable of interest $y$ according to a lognormal distribution. From the population, we selected $R$ samples according to a stratified simple random sampling without-replacement design. In stratum $h$, we selected a sample $s_h$ of size $n_h$ according to a simple random sampling without-replacement design; see Table 6.4 for the sizes $n_h$ and the corresponding sampling fractions, $f_h = n_h / N_h$.
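The objective here is to estimate the total in the population, $t = \sum_{i \in U} y_i$, and the stratum totals $t_h = \sum_{i \in U_h} y_i$, $h = 1, \dots, 5$, where $U$ denotes the population and $U_h$ denotes stratum $h$. In other words, in our example, the strata correspond to domains of interest. Since the strata form a partition of the population, we have the consistency relation $t = \sum_{h=1}^{5} t_h$. Similarly, the expansion estimators satisfy the consistency relation $\hat{t} = \sum_{h=1}^{5} \hat{t}_h$, where $\hat{t} = \sum_{i \in s} d_i y_i$ and $\hat{t}_h = \sum_{i \in s_h} d_i y_i$, with $d_i = N_h / n_h$ if $i \in U_h$. Here $s_h$ is the sample selected in stratum $h$, $s$ is the union of the stratum samples, and $N_h$ and $n_h$ are the population and sample sizes in stratum $h$.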
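The consistency of the expansion estimators can be checked numerically. The sketch below uses the stratum sizes and sample sizes of Table 6.4; the lognormal parameters and all function and variable names are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

# Stratum population sizes and sample sizes, as given in Table 6.4
N_h = np.array([2000, 1500, 1000, 400, 100])
n_h = np.array([20, 75, 100, 80, 80])
f_h = n_h / N_h  # sampling fractions

rng = np.random.default_rng(1)

# Assumption: hypothetical lognormal parameters, identical in every stratum
strata = [rng.lognormal(mean=5.0, sigma=1.0, size=N) for N in N_h]

def stratified_sample(strata, n_h, rng):
    """Simple random sample without replacement in each stratum."""
    return [rng.choice(y, size=n, replace=False) for y, n in zip(strata, n_h)]

def expansion_totals(samples, N_h):
    """Stratum-level expansion estimates, with weights d_i = N_h / n_h."""
    return np.array([N / s.size * s.sum() for s, N in zip(samples, N_h)])

samples = stratified_sample(strata, n_h, rng)
t_hat_h = expansion_totals(samples, N_h)

# Population-level expansion estimate from the pooled weighted sample
d = np.concatenate([np.full(s.size, N / s.size) for s, N in zip(samples, N_h)])
t_hat = np.sum(d * np.concatenate(samples))

print(f_h)                               # sampling fractions per stratum
print(np.isclose(t_hat, t_hat_h.sum()))  # consistency of expansion estimates
```

The consistency holds by construction for the expansion estimators because the same weights are used at both levels. Robust estimators computed separately at the stratum level and at the population level do not share this property, which is what motivates the calibration step.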
For each sample, we first computed the robust estimator (4.8) in each stratum and aggregated the robust estimates to produce an aggregate robust estimate. Independently, we computed the robust estimator (4.8) at the population level. To ensure that the consistency relation (5.1) was satisfied, we performed the calibration procedure described in Section 5 to obtain the final robust estimates.
We used four systems of coefficients, numbered (1) to (4). We make the following remarks on the choice of the coefficients:
(i) For all four systems, the weight assigned to the population-level estimate is chosen so that the robust estimate at the population level is left unchanged. In other words, the final population-level estimate is equal to the initial population-level robust estimate.
(ii) The first system assigns an equal weight to all strata, regardless of the sample size or sampling fraction.
(iii) In the second system, the coefficient is a function of the sample size $n_h$ and the sampling fraction $f_h$, but it is independent of the intra-stratum variability.
(iv) In the third and fourth systems, the choice of the coefficient depends on the actual CV and the estimated CV, respectively, for the reasons mentioned in Section 5.
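To make the role of the coefficients concrete, here is a generic sketch of a calibration step that forces the stratum estimates to sum to a fixed population-level estimate. The distance function used below (a chi-square-type distance scaled by a coefficient `q_h` and by the initial estimate) is an assumption chosen for illustration; it is not necessarily the one defined in Section 5, and the function name `calibrate_to_total` is hypothetical. In this parameterization, a larger `q_h` lets the stratum estimate move more, and `q_h = 0` leaves it unchanged.

```python
import numpy as np

def calibrate_to_total(theta_init, theta_pop, q):
    """Adjust stratum estimates so that they sum to theta_pop (illustrative).

    Minimizes  sum_h (final_h - init_h)^2 / (q_h * init_h)
    subject to sum_h final_h = theta_pop, which has the closed form
        final_h = init_h + q_h * init_h / sum_k(q_k * init_k) * gap,
    where gap = theta_pop - sum_h init_h.
    Assumptions: positive initial estimates, non-negative coefficients.
    """
    theta_init = np.asarray(theta_init, dtype=float)
    q = np.asarray(q, dtype=float)
    share = q * theta_init
    if np.any(share < 0) or share.sum() <= 0:
        raise ValueError("need non-negative q * theta_init with a positive sum")
    gap = theta_pop - theta_init.sum()
    return theta_init + share / share.sum() * gap

# Toy values: three stratum estimates and a population-level estimate
init = [100.0, 200.0, 300.0]
print(calibrate_to_total(init, 630.0, q=[1, 1, 1]))  # equal coefficients
print(calibrate_to_total(init, 630.0, q=[1, 1, 0]))  # third stratum held fixed
```

With equal coefficients, every stratum estimate receives the same relative adjustment, including strata that are almost fully enumerated. Making the coefficient small for such strata concentrates the adjustment where the initial robust estimates are least reliable.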
Table 6.4
Characteristics of the strata
| Stratum | 1 | 2 | 3 | 4 | 5 |
| $N_h$ | 2,000 | 1,500 | 1,000 | 400 | 100 |
| $n_h$ | 20 | 75 | 100 | 80 | 80 |
| $f_h$ | 0.01 | 0.05 | 0.1 | 0.2 | 0.8 |
For each robust
estimator, we computed the Monte Carlo relative bias (as a percentage) and the
relative efficiency (with respect to the expansion estimator); see Section 6.1.
The results are presented in Table 6.5.
The results show that the initial robust estimators are biased, as expected. The bias is larger in strata with a small sampling fraction. For example, in Stratum 1, for which $f_1 = 0.01$, the relative bias of the initial robust estimator is -11.9%, compared with only -1.5% in Stratum 5, for which $f_5 = 0.8$. We also note that the initial robust estimators are all more efficient than the corresponding expansion estimator, with relative efficiency values ranging from 57% to 97%. The aggregate estimator, obtained by summing the initial stratum estimators, shows a modest bias of -5.7% but is more efficient than the population-level expansion estimator, with a relative efficiency of 87%.
The population-level winsorized estimator shows a small bias of -2.8% and is significantly more efficient than the expansion estimator, with a relative efficiency of 81%. The final estimators obtained using the first system of coefficients all have a lower bias than the corresponding initial estimators, except for Stratum 5. This is because the sum of the final estimates is forced to calibrate on a low-bias estimator. On the other hand, the decrease in bias is accompanied by a slight decrease in efficiency. For example, in Stratum 4, the relative efficiency is 63% for the initial robust estimator and 66% for the final estimator. In the case of Stratum 5, the first system of coefficients is clearly unsuitable, since it changes the estimate for this stratum in the same way as for all the other strata, even though this stratum has a very high sampling fraction of 80%. In fact, for this system of coefficients, the final estimator in Stratum 5 is less efficient than the expansion estimator, with a relative efficiency of 104%. The second choice of coefficients, which takes the sampling fraction $f_h$ and the sample size $n_h$ into account, leads to some interesting results. The final robust estimator in Stratum 1 has an appreciably lower bias (-0.9%) than the initial estimator (-11.9%) and the final estimator based on the first system of coefficients (-9.1%), at the cost of a slight loss of efficiency. For Stratum 5, the final estimator has a low bias (a relative bias of -0.8%) and the same 97% efficiency as the initial estimator.
The third and fourth weighting systems lead to similar relative bias and relative efficiency results. For Stratum 1, they lead to lower relative biases than the first weighting system, at the cost of a slight loss of efficiency. For Strata 2, 3 and 4, all four systems of coefficients exhibit similar relative bias and relative efficiency. For Stratum 5, the final estimators are virtually unbiased and no less efficient than the expansion estimator.
Table 6.5
Monte Carlo relative bias (in %) and relative efficiency (in parentheses) of the robust estimators at the global level and the stratum level
| Level | Initial robust estimator (summed over strata at the global level) | Final estimator, system (1) | Final estimator, system (2) | Final estimator, system (3) | Final estimator, system (4) |
| Global estimator | -5.7(87) | -2.8(81) | -2.8(81) | -2.8(81) | -2.8(81) |
| Stratum 1 | -11.9(57) | -9.1(60) | -0.9(67) | -5.7(62) | -6.7(64) |
| Stratum 2 | -6.3(74) | -3.4(76) | -3.3(76) | -3.3(76) | -3.1(78) |
| Stratum 3 | -6.0(69) | -3.1(70) | -3.8(69) | -3.2(70) | -3.2(70) |
| Stratum 4 | -6.6(63) | -3.7(66) | -4.2(65) | -3.3(66) | -3.4(70) |
| Stratum 5 | -1.5(97) | 1.5(104) | -0.8(97) | -0.2(98) | 0.1(99) |