4. Results
Eric Graf and Yves Tillé
Simulations were conducted on three sets of real data to compare and assess the different density function estimation methods: Gaussian kernel estimation, see (3.1); estimation using the logarithm, see (3.2); and the nearest neighbour method with minimum bandwidth, see (3.4). These methods are required to estimate the variance of certain poverty and inequality indicators.
- The first dataset contains equivalent household
incomes from the EU-SILC survey conducted by the Swiss Federal Statistical
Office in 2009. It includes 17,534 individuals with a non-zero income.
- The second dataset also comes from the 2009
EU-SILC survey, but is limited to salaried individuals. It contains salaries
from the register of the Central Compensation Office that has been linked with
the survey respondents. We therefore have no non-response issues, and there are
7,922 individuals with a non-zero income.
- The third test file, named Ilocos, comes with
the R package ineq (Zeileis 2012). It contains
632 observations, which are household incomes in Ilocos, one of the
16 regions of the Philippines. The data come from two surveys by the
National Statistics Office of the Philippines, in 1997 and in 1998.
The three datasets have a positive skewness
coefficient, which is typical of income distributions. Each data set is
considered to be one population, and we initially selected 10,000 simple random
samples without replacement of various sizes. The values of the various
indicators were calculated for each sample, giving us a Monte Carlo
estimate of their variance, $\mathrm{var}_{\mathrm{sim}}(\hat{\theta})$, for a poverty or inequality indicator $\theta$.
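The variance estimator using linearization is denoted $\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta})$ and is calculated using the linearization variable $\hat{z}_k$ estimated for each sample:

$$\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta}) = N\,\frac{N-n}{n}\,\frac{1}{n-1}\sum_{k \in S}\left(\hat{z}_k - \bar{\hat{z}}\right)^2,$$

where $N$ is the population size, $n$ is the size of the sample $S$ used for the simulations and where $\bar{\hat{z}} = n^{-1}\sum_{k \in S}\hat{z}_k$, see (2.1).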
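The quality of the variance estimator using linearization is assessed by comparing the expected Monte Carlo value of the variance estimated using linearization, denoted $E_{\mathrm{sim}}\{\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta})\}$, with the “true” Monte Carlo variance $\mathrm{var}_{\mathrm{sim}}(\hat{\theta})$ in terms of relative bias:

$$\mathrm{RB}\{\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta})\} = \frac{E_{\mathrm{sim}}\{\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta})\} - \mathrm{var}_{\mathrm{sim}}(\hat{\theta})}{\mathrm{var}_{\mathrm{sim}}(\hat{\theta})}. \qquad (4.1)$$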
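The Monte Carlo assessment described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the population is synthetic (lognormal), the indicator is the mean rather than one of the poverty indicators, so that its linearized variable $z_k = (y_k - \bar{y})/N$ is elementary, and the function names `srs_variance_of_total` and `relative_bias`, the sample size and the number of replications are all chosen for the example.

```python
import numpy as np

def srs_variance_of_total(z, N):
    """Variance estimator of the total of z under simple random sampling
    without replacement: N (N - n) / n times the sample variance of z."""
    n = len(z)
    return N * (N - n) / n * np.var(z, ddof=1)

def relative_bias(var_lin, estimates):
    """Relative bias: (mean of the linearization variances minus the
    Monte Carlo variance of the point estimates) / Monte Carlo variance."""
    var_sim = np.var(estimates, ddof=1)
    return (np.mean(var_lin) - var_sim) / var_sim

rng = np.random.default_rng(2009)
population = rng.lognormal(mean=10.0, sigma=0.6, size=5000)  # synthetic incomes
N, n, replications = len(population), 250, 2000

estimates = np.empty(replications)
var_lin = np.empty(replications)
for r in range(replications):
    y = rng.choice(population, size=n, replace=False)  # one SRS without replacement
    estimates[r] = y.mean()                            # point estimate (assumed indicator)
    z = (y - y.mean()) / N                             # linearized variable of the mean
    var_lin[r] = srs_variance_of_total(z, N)

rb = relative_bias(var_lin, estimates)
print(f"relative bias: {rb:+.3f}")
```

For the mean, the linearization variance estimator is unbiased, so the printed relative bias should be close to zero up to Monte Carlo error; for indicators that depend on a density estimate, the same loop would expose the biases reported in the tables of this section.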
For the second data set (EU-SILC 2009, income of
salaried individuals) we also, in a second step, selected 10,000 random samples
without replacement under a stratified sampling design, and then calibrated the
sampling weights to agree with the eight known sociodemographic marginal totals
for the population of 7,922 individuals. The five strata used correspond to the
age groups of the salaried individuals (see Table 4.1).
The eight calibration cells were obtained by crossing
the three following dichotomous variables (auxiliary calibration variables):
- MARIÉ, which indicates whether or not the
individual is married;
- CHEF, which indicates whether or not the
individual’s job is a management position; and
- HOMME, which indicates the individual’s sex.
The totals for the population of 7,922 individuals
for these calibration cells are shown in Table 4.2.
Table 4.1
Strata used in simulations with 2009 EU-SILC data and three sample sizes (income of salaried individuals, N = 7,922)
Stratum | Description | Population size | % | Sample size (n = 500) | Sample size (n = 750) | Sample size (n = 1,000)
1 | individuals under 25 | 1,187 | 15.0 | 75 | 112 | 150
2 | 26- to 35-year-olds | 1,359 | 17.2 | 86 | 129 | 171
3 | 36- to 45-year-olds | 2,137 | 27.0 | 135 | 202 | 270
4 | 46- to 55-year-olds | 1,864 | 23.5 | 117 | 177 | 235
5 | individuals over 55 | 1,375 | 17.4 | 87 | 130 | 174
TOTAL | | 7,922 | 100.0 | 500 | 750 | 1,000
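The text does not say how the sample was allocated among the strata. As a hedged illustration, the sketch below assumes proportional allocation with largest-remainder rounding; this assumption happens to reproduce the stratum sample sizes of Table 4.1 for the three total sample sizes, but it should not be read as the authors' documented procedure. The function name `proportional_allocation` is hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units to strata proportionally to their sizes,
    rounding with the largest-remainder rule so the total is exactly n."""
    N = sum(stratum_sizes)
    # Integer arithmetic: N_h * n = quotient * N + remainder.
    quotas = [divmod(N_h * n, N) for N_h in stratum_sizes]
    allocation = [q for q, _ in quotas]
    shortfall = n - sum(allocation)
    # Give one extra unit to the strata with the largest remainders.
    by_remainder = sorted(range(len(quotas)),
                          key=lambda h: quotas[h][1], reverse=True)
    for h in by_remainder[:shortfall]:
        allocation[h] += 1
    return allocation

stratum_sizes = [1187, 1359, 2137, 1864, 1375]  # population counts from Table 4.1
for n in (500, 750, 1000):
    print(n, proportional_allocation(stratum_sizes, n))
# 500 [75, 86, 135, 117, 87]
# 750 [112, 129, 202, 177, 130]
# 1000 [150, 171, 270, 235, 174]
```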
Table 4.2
Calibration margins in simulations with 2009 EU-SILC data (income of salaried individuals, N = 7,922)
Margin | Marié | Chef | Homme | Population total | %
1 | 0 | 0 | 0 | 1,487 | 18.8
2 | 0 | 0 | 1 | 1,208 | 15.2
3 | 0 | 1 | 0 | 323 | 4.1
4 | 0 | 1 | 1 | 457 | 5.8
5 | 1 | 0 | 0 | 1,759 | 22.2
6 | 1 | 0 | 1 | 1,278 | 16.1
7 | 1 | 1 | 0 | 328 | 4.1
8 | 1 | 1 | 1 | 1,082 | 13.7
TOTAL | | | | 7,922 | 100.0
For each stratified sample, a calibration (linear
method) was performed to make the sums of the weights agree with the eight
margins shown above. Point estimates of the indicators and their linearized
variable were computed for each sample using the calibrated weights.
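The linear calibration step can be sketched as follows. This is an illustrative example under assumptions, not the authors' implementation: the population is rebuilt synthetically from the cell counts of Table 4.2, the sample is a simple random sample rather than a stratified one (to keep the example short), and the function name `linear_calibration` is hypothetical. With the linear method, the calibrated weights take the form $w_k = d_k(1 + \mathbf{x}_k^\top \boldsymbol{\lambda})$, where $d_k$ are the design weights and $\mathbf{x}_k$ the calibration variables.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear-method calibration: w_k = d_k (1 + x_k' lam), with lam chosen so
    that the calibrated weights reproduce the known totals of the columns of X."""
    A = X.T @ (d[:, None] * X)                   # sum over the sample of d_k x_k x_k'
    lam = np.linalg.solve(A, totals - X.T @ d)   # solve the calibration equations
    return d * (1.0 + X @ lam)

# Population counts of the eight calibration cells (Table 4.2).
cell_totals = np.array([1487, 1208, 323, 457, 1759, 1278, 328, 1082], dtype=float)
N = int(cell_totals.sum())

# Synthetic population: each unit is labelled with its cell number 0..7.
cells = np.repeat(np.arange(8), cell_totals.astype(int))

rng = np.random.default_rng(42)
n = 500
sample_cells = rng.choice(cells, size=n, replace=False)

X = np.eye(8)[sample_cells]    # one indicator column per calibration cell
d = np.full(n, N / n)          # design weights of the simple random sample
w = linear_calibration(d, X, cell_totals)

print(np.round(X.T @ w, 6))    # calibrated totals, one per cell
```

Because the eight cells partition the population, linear calibration on the cell indicators reduces here to post-stratification: within each cell, every sampled unit receives the cell total divided by the number of sampled units in that cell.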
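Variance was estimated using the method developed by Deville (2000), which also linearizes with respect to the calibration by calculating the residuals $\hat{e}_k$ of the regression (weighted by the sampling weights) of the linearized variables of the indicators on the auxiliary calibration variables. The variance of the total of the residuals thus calculated, under a stratified random sampling design without replacement, is therefore an estimator of the variance of the estimated indicator; it is the quantity of interest:

$$\widehat{\mathrm{var}}_{\mathrm{lin}}(\hat{\theta}) = \sum_{h=1}^{5} N_h\,\frac{N_h-n_h}{n_h}\,\frac{1}{n_h-1}\sum_{k \in S_h}\left(\hat{e}_k - \bar{\hat{e}}_h\right)^2,$$

where $N_h$ and $n_h$ are the population and sample sizes in stratum $h$, $S_h$ is the sample drawn in stratum $h$, and $\bar{\hat{e}}_h = n_h^{-1}\sum_{k \in S_h}\hat{e}_k$.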
The quality of the variance estimator using
linearization is assessed analogously to the procedure for simple random
sampling, see (4.1).
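The residual-based variance estimator described above can be sketched as follows. This is a simplified illustration that follows the verbal description in the text (residuals of a regression weighted by the sampling weights, then the stratified variance of their total); the function name `calibrated_variance`, its arguments and the toy data are assumptions made for the example.

```python
import numpy as np

def calibrated_variance(z, X, d, strata, stratum_sizes):
    """Variance of the total of regression residuals under stratified
    random sampling without replacement.

    z: linearized variable; X: calibration variables (one row per unit);
    d: sampling weights; strata: stratum label of each sampled unit;
    stratum_sizes: dict mapping each stratum label to its population size N_h."""
    # Regression of z on X, weighted by the sampling weights.
    XtD = X.T * d
    beta = np.linalg.solve(XtD @ X, XtD @ z)
    e = z - X @ beta
    variance = 0.0
    for h, N_h in stratum_sizes.items():
        e_h = e[strata == h]
        n_h = len(e_h)
        variance += N_h * (N_h - n_h) / n_h * np.var(e_h, ddof=1)
    return variance

# Toy check: one stratum of size 10, intercept-only calibration.
z = np.array([1.0, 2.0, 3.0])
X = np.ones((3, 1))
d = np.full(3, 10 / 3)
strata = np.array([1, 1, 1])
v = calibrated_variance(z, X, d, strata, {1: 10})
print(v)  # residuals are (-1, 0, 1), so v = 10 * 7 / 3, about 23.33
```

When the linearized variable is an exact linear combination of the calibration variables, the residuals vanish and the estimated variance is zero, which reflects the variance reduction obtained through calibration.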
Tables 4.3, 4.4 and 4.5 show the relative bias of the
variance for the three data sets used and described above, using simple random
sampling. Table 4.6 shows the relative bias of the variance using stratified
random sampling with calibrated weights. The upper portions of the tables give
the values for the Gini coefficient and QSR, which do not require estimating
the income density function. The estimation of their variance works well. Note
that there is a problem involving the underestimation of the variance of the
Gini coefficient in the case of stratification with calibration
(Table 4.6).
For the first data set, Table 4.3 does not reveal
any major differences, except that estimating the income density using the
nearest neighbour method with minimum bandwidth gives more conservative results:
the relative bias remains of the same order of magnitude but is positive for
most indicators, whereas it is negative for the other two methods of estimating
density. For the second data set, Table 4.4 shows that it is essential to use
the logarithm or the nearest neighbour method with minimum bandwidth. With the
latter, all relative biases fall below 10% in absolute value when the sample
size is sufficiently large (see the last column of the table). Simulations on
the same data with a stratified sampling design and calibration confirm and
strengthen these results (see Table 4.6). For the third data set,
Table 4.5 shows the same trends, although the results are less stable because
of the small sample and population sizes. This is not surprising, since the
minimum number of neighbours to consider is fixed at 30. For the Ilocos data
set, fixing this minimum at a smaller value of 10 ultimately makes no
difference, because the minimum bandwidth condition automatically increases the
number of neighbours above 30.
Furthermore, generally speaking, the more an indicator
relies on Gaussian kernel density estimation, the greater the error. In fact,
the relative biases of the variance for the median income of individuals below
the ARPT and for the RMPG are almost systematically greater in absolute value
than those for the other indicators. For the RMPG, the error may be offset (as
in Table 4.3) if there are enough observations, since the density estimation
appears in both the numerator and the denominator.
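To illustrate why an error in the density estimate carries over to the variance estimate, the sketch below uses the median under simple random sampling without replacement. It relies on the standard influence-function form of the linearized variable of a quantile, in which the density enters as $1/f(\text{median})$; the paper's own expressions are given in an earlier section and are not reproduced here. The synthetic lognormal sample, the population size `N`, the rule-of-thumb bandwidth and the function names are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_density(x, y, h):
    """Gaussian kernel estimate of the density of y at the point x."""
    u = (x - y) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2.0 * np.pi))

def median_variance(y, N, density):
    """Linearization variance of the sample median under simple random
    sampling without replacement, for a given density value at the median."""
    n = len(y)
    med = np.median(y)
    # Linearized variable of the median: the density enters as 1 / f(median).
    z = -((y <= med).astype(float) - 0.5) / (N * density)
    return N * (N - n) / n * np.var(z, ddof=1)

rng = np.random.default_rng(7)
N = 10000                                           # assumed population size
y = rng.lognormal(mean=10.0, sigma=0.8, size=400)   # synthetic sample of incomes

# Rule-of-thumb bandwidth (an assumption; not necessarily the one used in (3.1)).
h = 1.06 * np.std(y, ddof=1) * len(y) ** (-0.2)
f_med = gaussian_kernel_density(np.median(y), y, h)

v = median_variance(y, N, f_med)
v_density_10pct_low = median_variance(y, N, 0.9 * f_med)
ratio = v_density_10pct_low / v
print(ratio)  # 1 / 0.9**2, about 1.23
```

Because the variance is proportional to $1/f^2$, underestimating the density at the median by 10% inflates the estimated variance by about 23%. Indicators whose linearized variables involve more than one density value are correspondingly more exposed to such errors.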
Table 4.3
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from the 2009 EU-SILC data (equivalent household income, N = 17,534)
For each sample size, GINI and QSR have a single value because they do not require density estimation; the other indicators have one value per density estimation method.
Indicator | First sample size (sampling rate) | | | Second sample size (sampling rate) | | | Third sample size (sampling rate) | |
Density estimation method | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour
GINI | -0.02 | | | -0.02 | | | -0.02 | |
QSR | 0.01 | | | 0.00 | | | 0.00 | |
ARPT | -0.08 | -0.06 | 0.04 | -0.09 | -0.07 | 0.03 | -0.09 | -0.07 | 0.04
ARPR | -0.05 | -0.01 | -0.00 | -0.09 | -0.06 | -0.05 | -0.08 | -0.05 | -0.03
RMPG | -0.09 | -0.07 | 0.15 | -0.10 | -0.07 | 0.12 | -0.09 | -0.06 | 0.14
MEDP | -0.16 | -0.12 | 0.09 | -0.19 | -0.13 | 0.05 | -0.18 | -0.11 | 0.07
MED | -0.08 | -0.06 | 0.05 | -0.08 | -0.06 | 0.04 | -0.08 | -0.06 | 0.04
Table 4.4
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from the 2009 EU-SILC data (income of salaried individuals, N = 7,922)
For each sample size, GINI and QSR have a single value because they do not require density estimation; the other indicators have one value per density estimation method.
Indicator | First sample size (sampling rate) | | | Second sample size (sampling rate) | | | Third sample size (sampling rate) | |
Density estimation method | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour
GINI | -0.03 | | | -0.03 | | | -0.02 | |
QSR | -0.00 | | | 0.00 | | | 0.00 | |
ARPT | 0.07 | 0.05 | 0.13 | 0.06 | 0.04 | 0.10 | 0.06 | 0.03 | 0.08
ARPR | -0.05 | -0.04 | -0.02 | -0.05 | -0.04 | -0.01 | -0.06 | -0.05 | -0.02
RMPG | 0.61 | 0.12 | 0.15 | 0.60 | 0.11 | 0.08 | 0.59 | 0.09 | 0.05
MEDP | 0.73 | 0.17 | 0.18 | 0.72 | 0.16 | 0.10 | 0.72 | 0.15 | 0.07
MED | 0.07 | 0.04 | 0.13 | 0.06 | 0.04 | 0.10 | 0.05 | 0.03 | 0.07
Table 4.5
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from Ilocos data (household income, N = 632)
For each sample size, GINI and QSR have a single value because they do not require density estimation; the other indicators have one value per density estimation method.
Indicator | First sample size (sampling rate) | | | Second sample size (sampling rate) | |
Density estimation method | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour
GINI | -0.16 | | | -0.13 | |
QSR | 0.00 | | | 0.00 | |
ARPT | -0.05 | -0.06 | -0.01 | -0.03 | -0.03 | -0.01
ARPR | -0.31 | -0.01 | -0.12 | -0.33 | -0.03 | -0.18
RMPG | 1.55 | 0.83 | 0.26 | 1.54 | 0.16 | 0.39
MEDP | 1.02 | 0.28 | -0.26 | 1.05 | 0.07 | -0.11
MED | 0.04 | 0.03 | 0.08 | 0.07 | 0.07 | 0.09
Table 4.6
Relative bias (4.1) of the variance obtained with 10,000 stratified random samples without replacement, with weights calibrated to eight sociodemographic margins, from the 2009 EU-SILC data (income of salaried individuals, N = 7,922)
For each sample size, GINI and QSR have a single value because they do not require density estimation; the other indicators have one value per density estimation method.
Indicator | First sample size (sampling rate) | | | Second sample size (sampling rate) | | | Third sample size (sampling rate) | |
Density estimation method | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour | Gaussian kernel | Logarithm | Nearest neighbour
GINI | -0.21 | | | -0.20 | | | -0.20 | |
QSR | -0.06 | | | -0.06 | | | -0.07 | |
ARPT | -0.07 | -0.09 | -0.01 | -0.08 | -0.10 | -0.04 | -0.09 | -0.11 | -0.06
ARPR | -0.10 | -0.10 | -0.08 | -0.07 | -0.06 | -0.05 | -0.06 | -0.06 | -0.05
RMPG | 0.63 | 0.13 | 0.13 | 0.61 | 0.11 | 0.08 | 0.59 | 0.10 | 0.04
MEDP | 0.71 | 0.16 | 0.15 | 0.68 | 0.13 | 0.09 | 0.66 | 0.12 | 0.04
MED | -0.07 | -0.09 | -0.01 | -0.08 | -0.10 | -0.04 | -0.08 | -0.11 | -0.06
In short, we see that the variance can be overestimated
or underestimated depending on the indicator and the data set. The use of the
logarithm provides a significant improvement. The nearest neighbour method
eliminates all problems if there are enough data (as in Tables 4.3, 4.4 and
4.6); slight problems arise with this method when the samples are small (as in
Table 4.5). The unexplained variations and bias that persist in the tables may
also result from a lack of robustness in the linearized variables for certain
samples, as stated in Section 3.3.