Linearization versus bootstrap for variance estimation of the change between Gini indexes
Section 2. One sample case
2.1 Notation
Let $U$ denote some finite population of size $N$ whose units may be identified by the labels $k = 1, \ldots, N$. Suppose that the variable $y$ is measured on the population $U$, and let $y_1, \ldots, y_N$ denote the values taken by $y$ on the units in the population. Let $M = \sum_{k \in U} \delta_{y_k}$ denote the discrete measure taking unit mass on any point $y_k$ in the population and 0 elsewhere, with $\delta_{y_k}$ the Dirac mass at $y_k$. Most of the parameters of interest $\theta$ studied in surveys can be written as a functional $T$ of $M$, namely $\theta = T(M)$. For instance, the total $t_y = \sum_{k \in U} y_k$ equals $\int y \, dM(y)$.
In practice, a sample $s$ (with or without repetitions) is selected by means of a sampling design $p(\cdot)$, and we observe the values $y_k$ for $k \in s$ only. A substitution principle is used for estimation (see Deville, 1999, and Goga, Deville and Ruiz-Gazen, 2009). Let $\pi_k$ denote the expected number of draws for unit $k$ in the sample; in case of without-replacement sampling, this is the probability that unit $k$ is selected in the sample. Let $\hat{M} = \sum_{k \in s} d_k \delta_{y_k}$ denote the discrete measure taking mass $d_k$ on any point in the sample and 0 elsewhere, where $d_k = 1/\pi_k$ is the sampling weight. Substituting $\hat{M}$ into $T$ yields the estimator $\hat{\theta} = T(\hat{M})$.
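For a without-replacement sampling design, the substitution estimator for a total is the so-called Horvitz-Thompson (HT) estimator $\hat{t}_{y\pi} = \sum_{k \in s} y_k / \pi_k$. The HT variance estimator is
$$v_{HT}(\hat{t}_{y\pi}) = \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}, \tag{2.1}$$
where $\pi_{kl}$ denotes the probability that units $k$ and $l$ are selected jointly in the sample (with $\pi_{kk} = \pi_k$), and $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$. In the particular case of simple random sampling without replacement (SI) of size $n$, we have $\hat{t}_{y\pi} = N \bar{y}$ with $\bar{y} = n^{-1} \sum_{k \in s} y_k$, and formula (2.1) yields
$$v_{SI}(\hat{t}_{y\pi}) = N^2 \, \frac{1 - f}{n} \, s_y^2, \quad \text{with} \quad f = \frac{n}{N} \quad \text{and} \quad s_y^2 = \frac{1}{n - 1} \sum_{k \in s} (y_k - \bar{y})^2. \tag{2.2}$$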
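As an illustration, the following minimal sketch checks numerically that the general double-sum HT variance estimator reduces to the SI closed form. The function names, the sample values and the population size are hypothetical and only serve the example; the joint inclusion probabilities $n(n-1)/\{N(N-1)\}$ are the standard ones for SI sampling.

```python
import numpy as np

def ht_variance(y, pi, pi_joint):
    """HT variance estimator (double sum); pi_joint[k, l] = pi_kl, with pi_kk = pi_k."""
    z = np.asarray(y, dtype=float) / np.asarray(pi, dtype=float)
    pi_joint = np.asarray(pi_joint, dtype=float)
    delta = pi_joint - np.outer(pi, pi)      # Delta_kl = pi_kl - pi_k * pi_l
    return float(z @ (delta / pi_joint) @ z)

def si_total_and_variance(y, N):
    """Closed form under SI sampling of size n = len(y) in a population of size N."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t_hat = N * y.mean()                                # N * ybar
    v_hat = N**2 * (1.0 - n / N) / n * y.var(ddof=1)    # N^2 (1 - f) s_y^2 / n
    return float(t_hat), float(v_hat)

# Hypothetical sample of n = 4 values drawn by SI from N = 20 units.
y = np.array([2.0, 4.0, 6.0, 8.0])
N, n = 20, 4
pi = np.full(n, n / N)
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_joint, n / N)

t_hat, v_closed = si_total_and_variance(y, N)
v_double_sum = ht_variance(y, pi, pi_joint)
print(t_hat, v_closed, v_double_sum)
```

Both variance computations agree on this example, which is the algebraic identity behind the SI formula.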
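For a with-replacement sampling design, the substitution estimator for a total is the so-called Hansen-Hurwitz (HH) estimator $\hat{t}_{yHH} = \sum_{k \in s} y_k / \pi_k$. We consider the important case of multistage sampling, where the $N$ units are grouped inside $N_I$ non-overlapping Primary Sampling Units (PSU) $u_1, \ldots, u_{N_I}$, and where a with-replacement first-stage sample $s_I$ of size $n_I$ is selected. Let $\pi_{Ii}$ denote the expected number of draws for the PSU $u_i$ in $s_I$. A second-stage sample $s_i$ is then selected inside any $u_i \in s_I$ by means of some sampling design $p_i(\cdot)$. Let $\pi_{k|i}$ denote the expected number of draws for unit $k$ in $s_i$. The estimated measure is then
$$\hat{M} = \sum_{u_i \in s_I} \frac{1}{\pi_{Ii}} \sum_{k \in s_i} \frac{1}{\pi_{k|i}} \, \delta_{y_k}.$$
We have
$$\hat{t}_{yHH} = \sum_{u_i \in s_I} \frac{\hat{Y}_i}{\pi_{Ii}}, \quad \text{where} \quad \hat{Y}_i = \sum_{k \in s_i} \frac{y_k}{\pi_{k|i}},$$
and an unbiased variance estimator for $\hat{t}_{yHH}$ is
$$v_{HH}(\hat{t}_{yHH}) = \frac{n_I}{n_I - 1} \sum_{u_i \in s_I} \left( \frac{\hat{Y}_i}{\pi_{Ii}} - \frac{\hat{t}_{yHH}}{n_I} \right)^2. \tag{2.3}$$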
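The HH estimator and its variance estimator under with-replacement sampling of PSUs can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the argument layout (one entry per sampled unit, plus one entry per PSU draw) and the toy data are assumptions made for the example.

```python
import numpy as np

def hh_multistage(y, pi_cond, psu_draw, pi_psu):
    """HH total and variance estimate when PSUs are drawn with replacement.

    y, pi_cond, psu_draw: one entry per sampled unit (value, expected number
        of draws inside its PSU, index of the PSU draw it belongs to).
    pi_psu: one entry per PSU draw (expected number of draws of that PSU).
    """
    y = np.asarray(y, dtype=float)
    pi_cond = np.asarray(pi_cond, dtype=float)
    pi_psu = np.asarray(pi_psu, dtype=float)
    n_I = pi_psu.size
    # Estimated PSU totals: sum of y_k / pi_{k|i} inside each PSU draw.
    Y_hat = np.bincount(np.asarray(psu_draw), weights=y / pi_cond, minlength=n_I)
    z = Y_hat / pi_psu
    t_hat = z.sum()
    v_hat = n_I / (n_I - 1) * np.sum((z - t_hat / n_I) ** 2)
    return float(t_hat), float(v_hat)

# Hypothetical data: 3 PSU draws with 2 sampled units each.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
psu_draw = [0, 0, 1, 1, 2, 2]
pi_cond = [0.5] * 6
pi_psu = [0.5, 0.5, 0.5]
t_hat, v_hat = hh_multistage(y, pi_cond, psu_draw, pi_psu)
print(t_hat, v_hat)
```

Only the estimated PSU totals enter the variance estimator, so the second-stage designs need not be known beyond the conditional weights.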
2.2 Estimating the Gini coefficient
If the variable $y$ is measured on the population $U$, the Gini coefficient is
$$G = \frac{\sum_{k \in U} \sum_{l \in U} |y_k - y_l|}{2 N t_y};$$
see for example Nygård and Sandström (1985). It follows that $G$ is zero if $y$ is constant on the population, which occurs when the total of $y$ is equally distributed among all the population individuals. In the opposite case, when only one individual owns the whole amount of $y$, $G$ is maximized and equal to $1 - 1/N$: the total of $y$ is then concentrated in one point only, which means maximum inequality among members of the population.
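The two extreme cases can be checked numerically with a short sketch of the pairwise-difference definition. The function name and the toy populations are hypothetical.

```python
import numpy as np

def gini_population(y):
    """Gini coefficient as the sum of absolute pairwise differences over 2 * N * total."""
    y = np.asarray(y, dtype=float)
    N = y.size
    abs_diff = np.abs(y[:, None] - y[None, :]).sum()
    return float(abs_diff / (2.0 * N * y.sum()))

g_equal = gini_population([5.0, 5.0, 5.0, 5.0])          # equal distribution
g_concentrated = gini_population([0.0, 0.0, 0.0, 12.0])  # one unit owns everything
print(g_equal, g_concentrated)
```

With $N = 4$ the second value is $1 - 1/4 = 0.75$, the maximum for this population size.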
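If all individuals have different values for the variable $y$, the Gini coefficient $G$ is
$$G = \frac{2}{t_y} \sum_{k \in U} y_{(k)} F_N(y_{(k)}) - 1 - \frac{1}{N}, \tag{2.4}$$
with $y_{(1)} < \cdots < y_{(N)}$ the ordered values and $F_N$ the finite population distribution function; see Sandström, Wretman and Waldèn (1988) and Deville (1997) for further details on the derivation of (2.4). Nygård and Sandström (1985) called the term $1/N$ the Gini finite population correction and gave several reasons to make this correction, such as the non-negativity of the lower bound of $G$. As is frequently done in the literature (see for example Glasser, 1962), this correction is ignored in the sequel. We redefine the Gini coefficient as
$$G = \frac{1}{t_y} \int \{ 2 F_N(y) - 1 \} \, y \, dM(y), \tag{2.5}$$
where the finite population distribution function
$$F_N(y) = \frac{1}{N} \int 1(\xi \le y) \, dM(\xi) \tag{2.6}$$
is a functional family indexed by $y$. Substituting $\hat{M}$ into (2.5) and (2.6) yields the estimator
$$\hat{G} = \frac{1}{\hat{t}_y} \sum_{k \in s} d_k \{ 2 \hat{F}_N(y_k) - 1 \} \, y_k,$$
where $\hat{F}_N(y) = \hat{N}^{-1} \sum_{k \in s} d_k 1(y_k \le y)$, with $\hat{N} = \sum_{k \in s} d_k$ and $\hat{t}_y = \sum_{k \in s} d_k y_k$, is the substitution estimator of the distribution function.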
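The substitution estimator of the Gini coefficient amounts to a weighted cumulative sum over the sorted sample. The sketch below is a minimal illustration under the assumption made in the text that all values are distinct; the function name and the data are hypothetical.

```python
import numpy as np

def gini_hat(y, d):
    """Substitution estimator of the Gini coefficient with sampling weights d."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    order = np.argsort(y, kind="stable")
    y_s, d_s = y[order], d[order]
    # Index of the last sorted unit with value <= y_k, so that the cumulative
    # weights below reproduce the indicator 1(y_l <= y_k).
    last = np.searchsorted(y_s, y_s, side="right") - 1
    F_hat = np.cumsum(d_s)[last] / d_s.sum()
    return float(np.sum(d_s * (2.0 * F_hat - 1.0) * y_s) / np.sum(d_s * y_s))

y = [1.0, 2.0, 3.0, 4.0]
g_unit = gini_hat(y, [1.0, 1.0, 1.0, 1.0])    # unit weights: the sample is the population
g_scaled = gini_hat(y, [5.0, 5.0, 5.0, 5.0])  # rescaled weights give the same value
print(g_unit, g_scaled)
```

With unit weights on the values 1, 2, 3, 4 the result is 0.5, whereas the pairwise-difference definition gives 0.25 for the same values: the gap is exactly the ignored finite population correction $1/N = 0.25$.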
2.3 Linearization variance estimation
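We give below some brief details about the influence function linearization (IFL) (Deville, 1999), which consists in giving a first-order expansion of the substitution estimator $T(\hat{M})$ around the true value $T(M)$ to approximate the error by a linear estimator of some artificial linearized variable. More precisely, the first derivatives of $T$ with respect to $M$ are the influence functions
$$IT(M; \xi) = \lim_{h \to 0} \frac{T(M + h \delta_{\xi}) - T(M)}{h},$$
and $u_k = IT(M; y_k)$ is the linearized variable for all $k \in U$. Suppose that $T$ is homogeneous, namely there exists some non-negative number $\alpha$ dependent on $T$ such that $T(rM) = r^{\alpha} T(M)$ for any real $r > 0$. Assume also that $\lim_{N \to \infty} N^{-\alpha} T(M) < \infty$. Under some additional regularity assumptions upon $T$ and the sampling design (e.g., Goga and Ruiz-Gazen, 2014), Deville (1999) establishes that
$$N^{-\alpha} \{ T(\hat{M}) - T(M) \} = N^{-\alpha} \left( \sum_{k \in s} \frac{u_k}{\pi_k} - \sum_{k \in U} u_k \right) + o_p(n^{-1/2}),$$
so that the error $T(\hat{M}) - T(M)$ can be approximated by the error of the HT estimator for the total of the linearized variable $u_k$.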
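For a without-replacement sampling design, using a sample-based estimator $\hat{u}_k$ of the linearized variable $u_k$ in the HT variance estimator yields the variance estimator
$$v_{LIN}(\hat{\theta}) = \sum_{k \in s} \sum_{l \in s} \frac{\Delta_{kl}}{\pi_{kl}} \frac{\hat{u}_k}{\pi_k} \frac{\hat{u}_l}{\pi_l},$$
where $\pi_{kl}$ denotes the probability that units $k$ and $l$ are selected jointly in the sample, and $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$.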
Several results of asymptotic normality have been proved for specific sampling designs; see Hájek (1960, 1961, 1964), Rosén (1972), Sen (1980), Krewski and Rao (1981), Gordon (1983), Ohlsson (1986, 1989), Chen and Rao (2007), Brändén and Jonasson (2012), Saegusa and Wellner (2013) and Chauvet (2015), among others. If the sampling design is such that the substitution estimator $\hat{\theta}$ satisfies a central-limit theorem, an approximately $100(1 - 2\gamma)\%$ confidence interval is
$$\left[ \hat{\theta} \pm z_{\gamma} \sqrt{v_{LIN}(\hat{\theta})} \right],$$
where $v_{LIN}(\hat{\theta})$ is the linearization variance estimator and $z_{\gamma}$ is the upper $\gamma$ cutoff for the standard normal distribution.
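In case of the Gini coefficient, we have $\alpha = 0$ and the linearized variable is
$$u_k = 2 F_N(y_k) \, \frac{y_k - \bar{y}_{k,<}}{t_y} - y_k \, \frac{1 + G}{t_y} + \frac{1 - G}{N},$$
where $\bar{y}_{k,<}$ denotes the mean of the $y_l$ lower than $y_k$; see Deville (1999). Kovačević and Binder (1997) derived the same expression by means of the estimating equations linearization method; using the Demnati and Rao (2004) linearization approach also leads to the same result. The estimated linearized variable is
$$\hat{u}_k = 2 \hat{F}_N(y_k) \, \frac{y_k - \hat{\bar{y}}_{k,<}}{\hat{t}_y} - y_k \, \frac{1 + \hat{G}}{\hat{t}_y} + \frac{1 - \hat{G}}{\hat{N}},$$
where $\hat{\bar{y}}_{k,<} = \sum_{l \in s} d_l y_l 1(y_l \le y_k) \big/ \sum_{l \in s} d_l 1(y_l \le y_k)$.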
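In the particular SI case, the linearization variance estimator for the Gini coefficient is
$$v_{LIN}(\hat{G}) = N^2 \, \frac{1 - f}{n} \, s_{\hat{u}}^2,$$
where $f = n/N$ and where $s_{\hat{u}}^2 = (n - 1)^{-1} \sum_{k \in s} (\hat{u}_k - \bar{\hat{u}})^2$, with $\bar{\hat{u}} = n^{-1} \sum_{k \in s} \hat{u}_k$. In the particular case of multistage sampling and with-replacement sampling of PSUs, the linearization variance estimator for the Gini coefficient is
$$v_{LIN}(\hat{G}) = \frac{n_I}{n_I - 1} \sum_{u_i \in s_I} \left( \frac{\hat{U}_i}{\pi_{Ii}} - \frac{1}{n_I} \sum_{u_j \in s_I} \frac{\hat{U}_j}{\pi_{Ij}} \right)^2, \quad \text{with} \quad \hat{U}_i = \sum_{k \in s_i} \frac{\hat{u}_k}{\pi_{k|i}}.$$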
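For the SI case, the whole linearization pipeline (estimated Gini coefficient, estimated linearized variable, variance estimate, normal confidence interval) fits in a few lines. The sketch below is illustrative only: function names and data are hypothetical, values are assumed distinct, and the mean of the values "lower than $y_k$" is computed with the unit $k$ itself included, which is an assumption made here for consistency with the indicator $1(y_l \le y_k)$ in the distribution function.

```python
import numpy as np
from statistics import NormalDist

def gini_linearized(y, d):
    """Return (G_hat, u_hat): estimated Gini coefficient and linearized variable."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    order = np.argsort(y, kind="stable")
    y_s, d_s = y[order], d[order]
    last = np.searchsorted(y_s, y_s, side="right") - 1
    N_hat, t_hat = d_s.sum(), np.sum(d_s * y_s)
    cum_d = np.cumsum(d_s)[last]          # sum of d_l over y_l <= y_k
    cum_dy = np.cumsum(d_s * y_s)[last]   # sum of d_l * y_l over y_l <= y_k
    F_hat = cum_d / N_hat
    ybar_low = cum_dy / cum_d             # weighted mean of the y_l <= y_k
    G_hat = np.sum(d_s * (2.0 * F_hat - 1.0) * y_s) / t_hat
    u_s = (2.0 * F_hat * (y_s - ybar_low) - (1.0 + G_hat) * y_s) / t_hat \
        + (1.0 - G_hat) / N_hat
    u_hat = np.empty_like(u_s)
    u_hat[order] = u_s                    # back to the input order
    return float(G_hat), u_hat

def gini_var_lin_si(y, N):
    """Linearization variance estimate of G_hat under SI sampling."""
    y = np.asarray(y, dtype=float)
    n = y.size
    G_hat, u_hat = gini_linearized(y, np.full(n, N / n))
    return G_hat, float(N**2 * (1.0 - n / N) / n * u_hat.var(ddof=1))

# Hypothetical SI sample of n = 6 incomes from N = 60 units.
G_hat, v_lin = gini_var_lin_si([12.0, 35.0, 20.0, 8.0, 50.0, 27.0], N=60)
z = NormalDist().inv_cdf(0.975)           # normal cutoff for a 95% interval
ci = (G_hat - z * v_lin**0.5, G_hat + z * v_lin**0.5)
print(G_hat, v_lin, ci)
```

The linearized variable is computed on the sorted sample and then mapped back, so the result does not depend on the order in which the sample is stored.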
2.4 Bootstrap variance estimation
The use of bootstrap techniques in survey sampling has been extensively studied in the literature. The main bootstrap techniques may be thought of as particular cases of the weighted bootstrap (Bertail and Combris, 1997; Antal and Tillé, 2011; Beaumont and Patak, 2012); see also Shao and Tu (1995, Chapter 6), Davison and Hinkley (1997, Section 3.7) and Davison and Sardy (2007) for detailed reviews. Under a weighted bootstrap procedure, the measure $M$ is estimated, conditionally on the sample $s$, by the bootstrap measure
$$\hat{M}^* = \sum_{k \in s} d_k W_k \delta_{y_k}, \tag{2.14}$$
where $W = (W_k)_{k \in s}$ denotes a (random) vector of resampling weights. We write $E_*$ and $V_*$ for the expectation and variance with respect to the resampling scheme, and $\hat{t}_y^* = \int y \, d\hat{M}^*(y) = \sum_{k \in s} d_k W_k y_k$ for the bootstrap estimator of the total. In case of without-replacement sampling, the vector $W$ is generated in such a way that
$$E_*(\hat{t}_y^*) \simeq \hat{t}_{y\pi} \quad \text{and} \quad V_*(\hat{t}_y^*) \simeq v_{HT}(\hat{t}_{y\pi}), \tag{2.15}$$
so that the first two moments of the HT estimator are approximately matched. In case of with-replacement sampling, the vector $W$ is generated in such a way that
$$E_*(\hat{t}_y^*) \simeq \hat{t}_{yHH} \quad \text{and} \quad V_*(\hat{t}_y^*) \simeq v_{HH}(\hat{t}_{yHH}), \tag{2.16}$$
so that the first two moments of the HH estimator are approximately matched.
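Under any weighted bootstrap technique, the plug-in estimator of $\theta$ is $\hat{\theta}^* = T(\hat{M}^*)$, and the variance of $\hat{\theta}$ is estimated by
$$v_{boot}(\hat{\theta}) = V_*(\hat{\theta}^*). \tag{2.17}$$
Since the variance estimator (2.17) may be difficult to compute exactly, a simulation-based variance estimator may be used instead. More precisely, $B$ independent realizations $W^{(1)}, \ldots, W^{(B)}$ of the vector $W$ are generated, and we denote $\hat{\theta}_b^* = T(\hat{M}_b^*)$, with $\hat{M}_b^*$ the bootstrap measure associated to the vector $W^{(b)}$. Then $V_*(\hat{\theta}^*)$ is estimated by
$$v_B(\hat{\theta}) = \frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}_b^* - \frac{1}{B} \sum_{c=1}^{B} \hat{\theta}_c^* \right)^2.$$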
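The simulation-based estimator is a plain Monte Carlo loop over resampling weights. The sketch below keeps the statistic and the weight generator as interchangeable arguments. All names are hypothetical, and the resampling scheme plugged in ($n - 1$ draws with replacement, rescaled by $n/(n-1)$) is chosen only to make the example runnable, not because the text prescribes it for this setting.

```python
import numpy as np

def weighted_total(y, w):
    """Plug-in statistic used in the example: a weighted total."""
    return float(np.sum(w * y))

def bootstrap_variance(stat, y, d, draw_weights, B, rng):
    """Variance of B bootstrap replicates stat(y, d * W), with divisor B - 1."""
    reps = np.array([stat(y, d * draw_weights(rng)) for _ in range(B)])
    return float(reps.var(ddof=1)), reps

def illustrative_weights(n):
    """Hypothetical scheme: n - 1 draws with replacement, rescaled by n / (n - 1)."""
    def draw(rng):
        counts = rng.multinomial(n - 1, np.full(n, 1.0 / n))
        return counts * n / (n - 1)
    return draw

rng = np.random.default_rng(12345)
y = np.array([3.0, 7.0, 1.0, 9.0, 4.0, 6.0])
d = np.full(y.size, 10.0)                  # made-up sampling weights
v_boot, reps = bootstrap_variance(
    weighted_total, y, d, illustrative_weights(y.size), B=4000, rng=rng
)
print(v_boot)
```

For this linear statistic the exact resampling variance is $\{n/(n-1)\} \sum_{k \in s} (d_k y_k - \bar{z})^2 = 5040$ with $\bar{z}$ the mean of the $d_k y_k$, so the Monte Carlo value can be compared with a known target.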
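Two types of confidence intervals are usually computed. The percentile method makes use of the ordered bootstrap estimates $\hat{\theta}_{(1)}^* \le \cdots \le \hat{\theta}_{(B)}^*$ to form a $100(1 - 2\gamma)\%$ confidence interval $[\hat{\theta}_{(L)}^*, \hat{\theta}_{(U)}^*]$, with $L = \gamma B$ and $U = (1 - \gamma) B$. The bootstrap-$t$ method involves the estimation of the pivotal statistic $(\hat{\theta} - \theta) / \sqrt{v(\hat{\theta})}$ by its bootstrap counterpart $(\hat{\theta}^* - \hat{\theta}) / \sqrt{v(\hat{\theta}^*)}$, where $v(\hat{\theta}^*)$ is obtained by applying the bootstrap procedure to the resample $s^*$. The bootstrap-$t$ method is computationally very intensive since a double bootstrap is required, and is thus less attractive for a data user. Therefore, we do not pursue this approach further and we focus on the percentile method.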
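The percentile interval only requires sorting the bootstrap estimates. In the minimal sketch below, the function name and the rounding of the two ranks to the nearest integer are assumptions made for the example; `gamma` is the tail probability on each side, so the nominal level is $1 - 2\gamma$.

```python
import numpy as np

def percentile_ci(boot_estimates, gamma):
    """Percentile interval from the ordered bootstrap estimates (1-based ranks L, U)."""
    reps = np.sort(np.asarray(boot_estimates, dtype=float))
    B = reps.size
    L = max(int(round(gamma * B)), 1)           # lower rank, at least 1
    U = min(int(round((1.0 - gamma) * B)), B)   # upper rank, at most B
    return float(reps[L - 1]), float(reps[U - 1])

# Hypothetical replicates: the integers 1..1000 in scrambled order.
rng = np.random.default_rng(0)
fake_reps = rng.permutation(np.arange(1, 1001))
lower, upper = percentile_ci(fake_reps, gamma=0.025)
print(lower, upper)
```

With these 1000 artificial replicates the interval endpoints are the 25th and 975th ordered values.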
Linearization methods provide variance formulas applicable to general sampling designs, but involve possibly intricate computation of derivatives for complex parameters of interest such as the Gini coefficient. Unlike linearization, the bootstrap avoids theoretical work by repeatedly re-calculating the existing estimation system. Replicate weights are supplied with the data set, and may be easily used to produce variance estimates for a wide range of statistics. However, a given bootstrap technique is usually not suitable for general sampling designs: a particular sampling design usually requires a tailor-made resampling scheme. In this paper, we focus on two particular bootstrap techniques, which will be generalized in Section 3 to the two-sample context.
2.4.1 Without-replacement bootstrap for SI sampling
When the sample $s$ is selected by means of SI, we consider the without-replacement bootstrap (BWO) introduced by Gross (1980). The approach is readily extended to stratified simple random sampling (STSI) with a finite number of strata. Suppose that $N/n$ is an integer. Then the vector $W$ is obtained by first creating a pseudo-population $U^*$ of size $N$ by duplicating $N/n$ times each unit $k$ in the original sample $s$, and then selecting an SI resample $s^*$ of size $n$ in $U^*$. The bootstrap measure is given by (2.14), where the resampling weight $W_k$ is the number of times unit $k$ is selected in $s^*$. The building of $U^*$ may be avoided by noting that, under the BWO procedure, the vector $W$ follows a multivariate hypergeometric distribution. Therefore, the resampling weights may be directly generated. It can be shown that the BWO procedure leads to
$$E_*(\hat{t}_y^*) = \hat{t}_{y\pi} \quad \text{and} \quad V_*(\hat{t}_y^*) = \frac{N (n - 1)}{n (N - 1)} \, v_{SI}(\hat{t}_{y\pi}),$$
where $\hat{t}_y^* = \sum_{k \in s} (N/n) W_k y_k$ is the bootstrap estimator of the total and $v_{SI}(\hat{t}_{y\pi})$ is given in (2.2), so that equation (2.15) is approximately matched for a large sample size.
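Direct generation of the BWO resampling weights can be sketched with NumPy's multivariate hypergeometric sampler, without building the pseudo-population. The function name, the seed and the data are hypothetical, and the sketch assumes that $N/n$ is an integer, as in the text.

```python
import numpy as np

def bwo_weights(n, N, B, rng):
    """B vectors of BWO resampling weights: counts of each sampled unit in an
    SI resample of size n from a pseudo-population with N / n copies per unit."""
    if N % n != 0:
        raise ValueError("this sketch assumes that N / n is an integer")
    copies = np.full(n, N // n, dtype=np.int64)
    return rng.multivariate_hypergeometric(copies, n, size=B)

rng = np.random.default_rng(2024)
y = np.array([3.0, 7.0, 1.0, 9.0, 4.0, 6.0])   # made-up SI sample
n, N = y.size, 30
W = bwo_weights(n, N, B=5000, rng=rng)

t_star = (N / n) * W @ y                        # bootstrap totals
v_boot = t_star.var(ddof=1)
v_si = N**2 * (1 - n / N) / n * y.var(ddof=1)   # SI variance estimator
target = N * (n - 1) / (n * (N - 1)) * v_si     # exact BWO resampling variance
print(v_boot, target, v_si)
```

The Monte Carlo variance fluctuates around `target`, which is smaller than `v_si` by the factor $N(n-1)/\{n(N-1)\}$; the factor tends to 1 as $n$ grows, which is why the match is only approximate in small samples.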
Several solutions have been proposed to handle the case when $N/n$ is not an integer; see Chao and Lo (1985), Bickel and Freedman (1984), Sitter (1992b), Booth, Butler and Hall (1994), Presnell and Booth (1994), among others. The generalization of BWO variance estimation for unequal probability sampling designs is considered in Särndal, Swensson and Wretman (1992) and Chauvet (2007).
2.4.2 With-replacement bootstrap for multistage sampling
When the sample $s$ is selected by means of multistage sampling and with-replacement unequal probability sampling of PSUs, we consider the bootstrap of PSUs (BWR) introduced by Rao and Wu (1988). A with-replacement resample $s_I^*$ of size $n_I - 1$ is selected by means of simple random sampling with replacement (SIR) in the original first-stage sample $s_I$. The bootstrap measure is
$$\hat{M}^* = \sum_{u_i \in s_I} \frac{1}{\pi_{Ii}} \sum_{k \in s_i} \frac{W_k}{\pi_{k|i}} \, \delta_{y_k},$$
where the resampling weight $W_k$ equals $n_I / (n_I - 1)$ multiplied by the number of times the PSU $u_i$ containing $k$ is selected in $s_I^*$. The resampling size $n_I - 1$ is used to reproduce the usual unbiased variance estimator in the linear case (see Rao and Wu, 1988). It can be shown that the BWR procedure leads to
$$E_*(\hat{t}_y^*) = \hat{t}_{yHH} \quad \text{and} \quad V_*(\hat{t}_y^*) = v_{HH}(\hat{t}_{yHH}),$$
where $\hat{t}_y^* = \int y \, d\hat{M}^*(y)$ is the bootstrap estimator of the total and $v_{HH}(\hat{t}_{yHH})$ is given in (2.3), so that equation (2.16) is exactly matched. The BWR procedure is particularly simple, since it involves resampling at the first stage of sampling only, the sub-samples of Secondary Sampling Units (SSUs) being left unchanged inside the resampled PSUs.
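The PSU-level resampling weights of the BWR procedure can be generated with a single multinomial draw per replicate. The sketch below is illustrative: the function name, the seed and the per-PSU values `z` (standing for the weighted PSU totals, that is the estimated PSU totals divided by their expected numbers of draws) are hypothetical.

```python
import numpy as np

def bwr_psu_weights(n_I, B, rng):
    """B vectors of PSU resampling weights: n_I - 1 draws with replacement among
    the n_I sampled PSUs, counts rescaled by n_I / (n_I - 1)."""
    m = n_I - 1
    counts = rng.multinomial(m, np.full(n_I, 1.0 / n_I), size=B)
    return counts * (n_I / m)

rng = np.random.default_rng(7)
z = np.array([12.0, 28.0, 44.0, 20.0, 36.0])   # made-up weighted PSU totals
n_I = z.size
W_psu = bwr_psu_weights(n_I, B=5000, rng=rng)

# Every unit inherits the weight of its PSU draw; psu_draw is a made-up unit-to-PSU map.
psu_draw = np.array([0, 0, 1, 2, 2, 3, 4, 4])
W_units = W_psu[:, psu_draw]

t_star = W_psu @ z                              # bootstrap totals
v_boot = t_star.var(ddof=1)
v_hh = n_I / (n_I - 1) * np.sum((z - z.mean()) ** 2)
print(v_boot, v_hh)
```

Up to Monte Carlo error, `v_boot` reproduces `v_hh`, in line with the exact matching of the variance for totals; second-stage samples are never redrawn.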