4. Results

Eric Graf and Yves Tillé

Simulations were conducted on three sets of real data to compare and assess the different density function estimation methods: ${\hat{f}}_{1} (x)$, see (3.1); ${\hat{f}}_{2} (x)$, see (3.2); and ${\hat{f}}_{3} (x)$, see (3.4). These methods are required to estimate the variance of certain poverty and inequality indicators.

The first dataset contains equivalent household incomes from the EU-SILC survey conducted by the Swiss Federal Statistical Office in 2009. It includes 17,534 individuals with a non-zero income.
The second dataset also comes from the 2009 EU-SILC survey, but is limited to salaried individuals. It contains salaries from the register of the Central Compensation Office that have been linked with the survey respondents. We therefore have no non-response issues, and there are 7,922 individuals with a non-zero income.
The third test file, named Ilocos, comes with the R package ineq (Zeileis 2012). It contains 632 observations, which are household incomes in Ilocos, one of the 16 regions of the Philippines. The data come from two surveys by the National Statistics Office of the Philippines, in 1997 and in 1998.

The three datasets have a positive skewness coefficient, which is typical of income distributions. Each data set is considered to be one population, and we initially selected 10,000 simple random samples without replacement of various sizes. The values of the various indicators were calculated for each sample, giving us a Monte Carlo estimate of their variance, ${var}_{sim} (\hat{θ}),$ for a poverty or inequality indicator $θ .$ The variance estimator using linearization is denoted ${\hat{var}}_{lin} (\hat{θ})$ and is calculated using the linearization variable ${\hat{z}}^{\hat{θ}}$ estimated for each sample:

${\hat{var}}_{lin} (\hat{θ}) = \frac{N (N - n)}{n} var ({\hat{z}}_{S}^{\hat{θ}}),$

where $n$ is the size of the sample used for the simulations and

$var ({\hat{z}}_{S}^{\hat{θ}}) = \frac{1}{n - 1} \sum_{k \in S} {({\hat{z}}_{S, k}^{\hat{θ}} - {\bar{z}}_{S}^{\hat{θ}})}^{2},$

where ${\bar{z}}_{S}^{\hat{θ}} = n^{- 1} \sum_{k \in S} {\hat{z}}_{S, k}^{\hat{θ}},$ see (2.1).
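The estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `var_lin_srs` and the toy values are hypothetical, and the linearized values are taken as given rather than computed for any particular indicator.

```python
from statistics import variance


def var_lin_srs(z_sample, N):
    """Linearization variance estimate under simple random sampling
    without replacement: N (N - n) / n * var(z_S).

    z_sample: estimated linearized values for the n sampled units
    N: population size
    """
    n = len(z_sample)
    # statistics.variance divides by n - 1, as in the definition of var(z_S)
    return N * (N - n) / n * variance(z_sample)


# Toy linearized values (hypothetical, not taken from the paper's data)
z = [0.1, -0.2, 0.05, 0.3, -0.25]
v = var_lin_srs(z, N=100)
print(v)
```

The factor $N (N - n) / n$ vanishes when $n = N$, so a census yields a variance estimate of zero, as expected.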

The quality of the variance estimator using linearization is assessed by comparing the expected Monte Carlo value of the variance estimated using linearization, denoted $E_{sim} [{\hat{var}}_{lin} (\hat{θ})],$ with the “true” Monte Carlo variance ${var}_{sim} (\hat{θ})$ in terms of relative bias:

$RB [{\hat{var}}_{lin} (\hat{θ})] = \frac{E_{sim} [{\hat{var}}_{lin} (\hat{θ})] - {var}_{sim} (\hat{θ})}{{var}_{sim} (\hat{θ})}. \qquad (4.1)$
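The Monte Carlo procedure behind (4.1) can be sketched as follows. Several elements are assumptions made for illustration only: the population is a synthetic lognormal one rather than any of the three data sets, the indicator is the sample mean (whose linearized variable $(y_{k} - \bar{y}) / N$ is simple) rather than a poverty or inequality indicator, and 2,000 replicates are drawn instead of 10,000. The name `relative_bias` is hypothetical.

```python
import random
from statistics import mean, variance


def relative_bias(population, n, replicates, seed=0):
    """Monte Carlo relative bias (4.1) of the linearization variance
    estimator, using the sample mean as a stand-in indicator."""
    rng = random.Random(seed)
    N = len(population)
    estimates, var_estimates = [], []
    for _ in range(replicates):
        sample = rng.sample(population, n)  # SRS without replacement
        ybar = mean(sample)
        # Linearized variable of the mean (assumption of this sketch)
        z = [(y - ybar) / N for y in sample]
        estimates.append(ybar)
        var_estimates.append(N * (N - n) / n * variance(z))
    var_sim = variance(estimates)  # "true" Monte Carlo variance
    return (mean(var_estimates) - var_sim) / var_sim


# Synthetic, positively skewed population (not the paper's data)
pop_rng = random.Random(1)
population = [pop_rng.lognormvariate(0.0, 0.5) for _ in range(500)]
rb = relative_bias(population, n=50, replicates=2000)
print(round(rb, 3))
```

For the mean, the variance estimator is unbiased, so the relative bias should be close to zero up to Monte Carlo noise; the indicators studied here depart from this because their linearized variables depend on an estimated density.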

For the second data set (EU-SILC 2009, income of salaried individuals) we also, in a second step, selected 10,000 random samples without replacement under a stratified sampling design, and then calibrated the sampling weights to agree with the eight known sociodemographic marginal totals for the population of 7,922 individuals. The five strata used correspond to the age groups of the salaried individuals (see Table 4.1).

The eight calibration cells were obtained by crossing the three following dichotomous variables (auxiliary calibration variables):

MARIÉ, which indicates whether or not the individual is married;
CHEF, which indicates whether or not the individual’s job is a management position; and
HOMME, which indicates the individual’s sex.

The totals for the population of 7,922 individuals for these calibration cells are shown in Table 4.2.

Table 4.1
Strata used in simulations with 2009 EU-SILC data and three sample sizes (income of salaried individuals, $N = 7{,}922$)
Stratum $h$	Description	$N_{h}$	%	$n_{h}$ ($n = 500$)	$n_{h}$ ($n = 750$)	$n_{h}$ ($n = 1{,}000$)
1	individuals under 25	1,187	15.0	75	112	150
2	26- to 35-year-olds	1,359	17.2	86	129	171
3	36- to 45-year-olds	2,137	27.0	135	202	270
4	46- to 55-year-olds	1,864	23.5	117	177	235
5	individuals over 55	1,375	17.4	87	130	174
	TOTAL	7,922	100.0	500	750	1,000

Table 4.2
Calibration margins in simulations with 2009 EU-SILC data (income of salaried individuals, $N = 7{,}922$)
Margin	Marié	Chef	Homme	Population total	%
1	0	0	0	1,487	18.8
2	0	0	1	1,208	15.2
3	0	1	0	323	4.1
4	0	1	1	457	5.8
5	1	0	0	1,759	22.2
6	1	0	1	1,278	16.1
7	1	1	0	328	4.1
8	1	1	1	1,082	13.7
			TOTAL	7,922	100.0

For each stratified sample, a calibration (linear method) was performed to make the sums of the weights agree with the eight margins shown above. Point estimates of the indicators and their linearized variable were computed for each sample using the calibrated weights.
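When the auxiliary variables are indicators of disjoint cells, as with the eight crossed cells here, linear calibration reduces to a ratio adjustment within each cell (post-stratification). The sketch below illustrates that special case only. The function name `calibrate_to_cells`, the toy sample and its design weights are hypothetical; the three cell totals are taken from Table 4.2, with cells keyed as (MARIÉ, CHEF, HOMME).

```python
from collections import defaultdict


def calibrate_to_cells(design_weights, cells, cell_totals):
    """Calibrate weights so that they sum to the known total in each cell.

    With indicators of disjoint cells as auxiliaries, the linear method
    gives w_k = d_k * N_c / (sum of d_j over sampled units in cell c).
    Every cell in cell_totals is assumed to contain sampled units.
    """
    weighted_counts = defaultdict(float)
    for d, c in zip(design_weights, cells):
        weighted_counts[c] += d
    return [d * cell_totals[c] / weighted_counts[c]
            for d, c in zip(design_weights, cells)]


# Toy sample (hypothetical) covering three of the eight cells of Table 4.2
cells = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)]
d = [700.0, 800.0, 1200.0, 300.0, 400.0, 300.0]
totals = {(0, 0, 0): 1487, (0, 0, 1): 1208, (1, 1, 1): 1082}

w = calibrate_to_cells(d, cells, totals)

# Check that the calibrated weights reproduce the margins
sums = defaultdict(float)
for wk, c in zip(w, cells):
    sums[c] += wk
print(dict(sums))
```

A general linear calibration with non-disjoint auxiliary variables would instead require solving a small linear system for the calibration multipliers.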

Variance was estimated using the method developed by Deville (2000), which also linearizes with respect to the calibration: the residuals $e^{\hat{θ}}$ are computed from the regression (weighted by the sampling weights) of the linearized variable of each indicator on the auxiliary calibration variables. The variance of the total of these residuals under a stratified random sampling design without replacement is then an estimator of the variance of the estimated indicator, which is the quantity of interest:

${\hat{var}}_{lin} (\hat{θ}) = \sum_{h = 1}^{H} \frac{N_{h}}{n_{h}} (N_{h} - n_{h}) s_{e_{h}^{\hat{θ}}}^{2}, \qquad (4.2)$

where

$s_{e_{h}^{\hat{θ}}}^{2} = \frac{1}{n_{h} - 1} \sum_{k \in S_{h}} {(e_{k}^{\hat{θ}} - {\bar{e}}^{\hat{θ}})}^{2}$

The quality of the variance estimator using linearization is assessed analogously to the procedure for simple random sampling, see (4.1).
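The residual-based estimator (4.2) can be sketched as follows for the case where the calibration variables are indicators of disjoint cells, so that the weighted regression fit is simply the weighted cell mean. The function names and toy values are hypothetical, and the within-stratum variance is centred on the stratum mean of the residuals, which is the usual stratified formula and an assumption of this sketch.

```python
from collections import defaultdict
from statistics import variance


def residuals_on_cells(z, weights, cells):
    """Residuals of the weighted regression of z on cell indicators:
    each z_k minus the weighted mean of z in its cell."""
    num, den = defaultdict(float), defaultdict(float)
    for zk, wk, c in zip(z, weights, cells):
        num[c] += wk * zk
        den[c] += wk
    return [zk - num[c] / den[c] for zk, c in zip(z, cells)]


def var_lin_stratified(e, strata, N_h):
    """Estimator (4.2): sum over strata of N_h / n_h * (N_h - n_h) * s2_h.
    Each stratum needs at least two sampled units."""
    by_stratum = defaultdict(list)
    for ek, h in zip(e, strata):
        by_stratum[h].append(ek)
    total = 0.0
    for h, values in by_stratum.items():
        n_h = len(values)
        # variance() centres on the stratum mean and divides by n_h - 1
        total += N_h[h] / n_h * (N_h[h] - n_h) * variance(values)
    return total


# Toy data (hypothetical): four units, two cells, two strata
z = [1.0, 3.0, 2.0, 6.0]
weights = [5.0, 10.0, 5.0, 10.0]
cells = ["a", "a", "b", "b"]
strata = [1, 2, 1, 2]
N_h = {1: 10, 2: 20}

e = residuals_on_cells(z, weights, cells)
v = var_lin_stratified(e, strata, N_h)
print(v)
```

Because the residuals are taken with respect to the cell means, any part of the linearized variable that is explained by the calibration cells no longer contributes to the estimated variance.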

Tables 4.3, 4.4 and 4.5 show the relative bias of the variance for the three data sets used and described above, using simple random sampling. Table 4.6 shows the relative bias of the variance using stratified random sampling with calibrated weights. The upper portions of the tables give the values for the Gini coefficient and QSR, which do not require estimating the income density function. The estimation of their variance works well. Note, however, that the variance of the Gini coefficient is underestimated in the case of stratification with calibration (Table 4.6).

For the first data set, Table 4.3 does not reveal any major differences, except that estimating the income density with ${\hat{f}}_{3} (x)$ gives more conservative results: the relative bias remains of the same order of magnitude but is positive, whereas it is negative for the other two density estimation methods. For the second data set, Table 4.4 shows that it is essential to use either the logarithm or the nearest neighbour method with minimum bandwidth. With the latter, all relative biases fall under 10% when the sample size is sufficiently large (see the last column of the table). Simulations on the same data with a stratified sampling design and calibration confirm these results (see Table 4.6). For the third data set, Table 4.5 shows the same trends, although the results are less stable because of the small sample and population sizes. This is not surprising, since the minimum number of neighbours to consider is fixed at 30. For the Ilocos data set, simulations with a smaller value of $p$, fixed at 10, ultimately make no difference, because the condition $h (p_{j}) \geq h_{opt}$ automatically increases the number of neighbours above 30.

Furthermore, generally speaking, the more an indicator relies on Gaussian kernel density estimation, ${\hat{f}}_{1} (x)$, the greater the error. In fact, the relative biases of the variance for the median income of individuals below the ARPT and for the RMPG are almost systematically greater in absolute value than those for the other indicators. For the RMPG, the error may be offset (as in Table 4.3) if there are enough observations, since the density estimation appears in both the numerator and the denominator.

Table 4.3
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from the 2009 EU-SILC data (equivalent household income, $N = 17{,}534$)
Indicator	Sample size (sampling rate)
	$n = 500$ (2.9%)			$n = 750$ (4.3%)			$n = 1{,}000$ (5.7%)
GINI	-0.02			-0.02			-0.02
QSR	0.01			0.00			0.00
	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$
ARPT	-0.08	-0.06	0.04	-0.09	-0.07	0.03	-0.09	-0.07	0.04
ARPR	-0.05	-0.01	-0.00	-0.09	-0.06	-0.05	-0.08	-0.05	-0.03
RMPG	-0.09	-0.07	0.15	-0.10	-0.07	0.12	-0.09	-0.06	0.14
MEDP	-0.16	-0.12	0.09	-0.19	-0.13	0.05	-0.18	-0.11	0.07
MED	-0.08	-0.06	0.05	-0.08	-0.06	0.04	-0.08	-0.06	0.04

Table 4.4
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from the 2009 EU-SILC data (income of salaried individuals, $N = 7{,}922$)
Indicator	Sample size (sampling rate)
	$n = 500$ (6.3%)			$n = 750$ (9.5%)			$n = 1{,}000$ (12.6%)
GINI	-0.03			-0.03			-0.02
QSR	-0.00			0.00			0.00
	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$
ARPT	0.07	0.05	0.13	0.06	0.04	0.10	0.06	0.03	0.08
ARPR	-0.05	-0.04	-0.02	-0.05	-0.04	-0.01	-0.06	-0.05	-0.02
RMPG	0.61	0.12	0.15	0.60	0.11	0.08	0.59	0.09	0.05
MEDP	0.73	0.17	0.18	0.72	0.16	0.10	0.72	0.15	0.07
MED	0.07	0.04	0.13	0.06	0.04	0.10	0.05	0.03	0.07

Table 4.5
Relative bias (4.1) of the variance obtained with 10,000 simple random samples without replacement from Ilocos data (household income, $N = 632$)
Indicator	Sample size (sampling rate)
	$n = 50$ (7.9%)			$n = 63$ (10.0%)
GINI	-0.16			-0.13
QSR	0.00			0.00
	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$
ARPT	-0.05	-0.06	-0.01	-0.03	-0.03	-0.01
ARPR	-0.31	-0.01	-0.12	-0.33	-0.03	-0.18
RMPG	1.55	0.83	0.26	1.54	0.16	0.39
MEDP	1.02	0.28	-0.26	1.05	0.07	-0.11
MED	0.04	0.03	0.08	0.07	0.07	0.09

Table 4.6
Relative bias (4.1) of the variance obtained with 10,000 stratified random samples without replacement, with weights calibrated to eight sociodemographic margins, from the 2009 EU-SILC data (income of salaried individuals, $N = 7{,}922$)
Indicator	Sample size (sampling rate)
	$n = 500$ (6.3%)			$n = 750$ (9.5%)			$n = 1{,}000$ (12.6%)
GINI	-0.21			-0.20			-0.20
QSR	-0.06			-0.06			-0.07
	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$	${\hat{f}}_{1}$	${\hat{f}}_{2}$	${\hat{f}}_{3}$
ARPT	-0.07	-0.09	-0.01	-0.08	-0.10	-0.04	-0.09	-0.11	-0.06
ARPR	-0.10	-0.10	-0.08	-0.07	-0.06	-0.05	-0.06	-0.06	-0.05
RMPG	0.63	0.13	0.13	0.61	0.11	0.08	0.59	0.10	0.04
MEDP	0.71	0.16	0.15	0.68	0.13	0.09	0.66	0.12	0.04
MED	-0.07	-0.09	-0.01	-0.08	-0.10	-0.04	-0.08	-0.11	-0.06

In short, we see that the variance can be overestimated $(RB [{\hat{var}}_{lin} (\hat{θ})] > 0)$ or underestimated $(RB [{\hat{var}}_{lin} (\hat{θ})] < 0)$ depending on the indicator and the data set. The use of the logarithm $({\hat{f}}_{2} (x))$ provides significant improvement. The nearest neighbour method $({\hat{f}}_{3} (x))$ eliminates all problems if there is enough data (as in Tables 4.3, 4.4 and 4.6). Slight problems arise with this method when the samples are small (as in Table 4.5). Illogical variations and bias that persist in the tables may also be the result of a lack of robustness in the linearized variables for certain samples, as stated in Section 3.3.

Date modified: 2017-09-20
