Linearization versus bootstrap for variance estimation of the change between Gini indexes
Section 3. Two-sample case
3.1 Notation and composite estimation
Suppose now that two variables
and
are measured on the population
and let
denote the values taken by
on the units in the population. The variables
and
may typically refer to some characteristic of
interest collected at two different times
and
We consider the estimation of parameters
that can be written as a functional
where
For instance, the linear case
corresponds to the difference between the
totals
and
Let
and
be two samples of sizes
and
respectively, selected from the same
population
according to some two-dimensional sampling
design
(see Goga, 2003). The variable
is measured on
while the variable
is measured on
Plugging sample-based estimators
in
yields the substitution estimator
Unlike the one-sample case, several estimators
are possible. In what follows, we focus on the
general class of composite estimators introduced by Goga, Deville and
Ruiz-Gazen (2009). We note
and
For
we note
the expected number of draws for unit
in
and
where
The composite estimators of
and
are
where
and
are some known constants. The choice
leads to the intersection estimator with
and
where the overlapping sample
only is used.
When
estimating the parameter
the composite estimator is
where
and
It may be rewritten as
where
The variance of the composite estimator is
Finding the vector
which minimizes the variance in (3.4) leads to
the optimal composite estimator (Goga, Deville and Ruiz-Gazen, 2009,
Section 3.6). Note that this is not an estimator per se, since it depends
on unknown quantities which need to be estimated in practice. However, this is
a useful benchmark which we will use for the appraisal of simpler composite
estimators.
A variance estimator is obtained by substituting an estimator of the
variance-covariance matrix into (3.4). The derivation of variance estimators
is detailed in Sections 3.1.1 and 3.1.2 for two examples of two-dimensional
sampling designs.
3.1.1 Two-dimensional SI design
The two-dimensional SI design (SI2) of fixed size
assigns equal probabilities to all
for which the associated subsamples
and
have the required sizes
and
see Goga (2003) and Qualité and Tillé (2008).
The SI2 design has the attractive property that the marginal samples
and
are SI samples from the population
Similarly,
is an SI sample of size
and
is an SI sample of size
For the SI2 sampling design, the composite
estimator in (3.3) yields
and the variance of the composite estimator is
with
see the Appendix for a proof.
We consider two examples. The choice
leads to the intersection estimator
and the variance simplifies as
The choice
and
leads to the union estimator
where the complete samples are used, and the variance may be written as
The variances of the union estimator and of the intersection estimator
were derived by Qualité and Tillé (2008), see also Tam (1984).
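The two examples above can be sketched numerically for the linear case, the change between two totals. This is a minimal illustration, not the authors' code. The function names (`draw_si2`, `intersection_estimator`, `union_estimator`) and the symbols `N`, `n1`, `n2`, `n12` for the population size, the two sample sizes and the overlap size are assumptions introduced here. The SI2 draw is implemented by cutting a random permutation into three blocks, which gives the same probability to every pair of samples with the required sizes.

```python
import numpy as np

def draw_si2(N, n1, n2, n12, rng):
    """Draw (s1, s2) with fixed sizes n1, n2 and overlap n12 from units 0..N-1."""
    if n12 > min(n1, n2) or n1 + n2 - n12 > N:
        raise ValueError("incompatible sizes")
    perm = rng.permutation(N)
    s12 = perm[:n12]                      # units observed at both waves
    s1_only = perm[n12:n1]                # n1 - n12 units observed at wave 1 only
    s2_only = perm[n1:n1 + n2 - n12]      # n2 - n12 units observed at wave 2 only
    s1 = np.concatenate([s12, s1_only])
    s2 = np.concatenate([s12, s2_only])
    return s1, s2, s12

def intersection_estimator(y1, y2, s12, N):
    """Estimate the change in totals from the overlapping sample only."""
    return N * np.mean(y2[s12] - y1[s12])

def union_estimator(y1, y2, s1, s2, N):
    """Estimate the change in totals from the two complete samples."""
    return N * np.mean(y2[s2]) - N * np.mean(y1[s1])

if __name__ == "__main__":
    rng = np.random.default_rng(2024)
    N, n1, n2, n12 = 1000, 100, 100, 60
    y1 = rng.lognormal(mean=3.0, sigma=0.5, size=N)
    y2 = y1 * rng.normal(loc=1.02, scale=0.02, size=N)   # strongly correlated waves
    inter, union = [], []
    for _ in range(2000):
        s1, s2, s12 = draw_si2(N, n1, n2, n12, rng)
        inter.append(intersection_estimator(y1, y2, s12, N))
        union.append(union_estimator(y1, y2, s1, s2, N))
    print("true change     :", y2.sum() - y1.sum())
    print("intersection var:", np.var(inter))
    print("union var       :", np.var(union))
```

With two strongly correlated waves, as in this simulated example, the Monte Carlo variance of the intersection estimator is expected to be much smaller than that of the union estimator, because the wave-to-wave differences vary far less than the levels.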
The choice of
and
is of practical importance to obtain an
efficient composite estimator. After some algebra, the vector
which minimizes the variance of
is given by
with
For two variables
and
related to the same characteristic collected at
two different times,
is expected to be close to
and
The vector
in (3.12) is in turn close to the null vector,
and if the size of the overlapping sample
is comparable to that of
and
we obtain
and
Therefore, using the intersection estimator
where
seems reasonable in practice. By contrast,
the union estimator can be very inefficient; see Section 4.2 for an
illustration. These conclusions are consistent with those of Qualité and Tillé
(2008, Section 2.2.2).
Several variance estimators may be used for the composite estimator.
Estimating the dispersions on the overlapping sample alone yields the unbiased
variance estimator
while an estimation on the whole samples yields
Berger (2004) considered variance estimation for the union estimator under
a maximum entropy rotating sampling scheme, by separately estimating the three
components in (3.6).
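For the intersection estimator alone, estimation on the overlapping sample reduces to a textbook case. The overlapping sample is an SI sample, and the estimator is the expansion estimator of the total of the wave-to-wave differences. The sketch below covers only that special case, not the general composite estimator. The function name and the symbols `N` and `n12` are assumptions introduced for illustration.

```python
import numpy as np

def intersection_variance_estimate(y1_s12, y2_s12, N):
    """SI variance estimator for N * mean(y2 - y1) computed on the overlap.

    y1_s12, y2_s12: values of the two variables on the overlapping sample.
    N: population size (assumed known).
    """
    d = np.asarray(y2_s12, dtype=float) - np.asarray(y1_s12, dtype=float)
    n12 = d.size
    s2_d = d.var(ddof=1)                  # sample dispersion of the differences
    return N**2 * (1.0 / n12 - 1.0 / N) * s2_d

if __name__ == "__main__":
    v = intersection_variance_estimate([0, 0, 0, 0], [1, 2, 3, 4], N=8)
    print(v)
```

The finite population correction makes the estimate vanish when the overlapping sample is the whole population.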
3.1.2 Two-dimensional multistage design
We now consider a two-dimensional two-stage sampling design (MULT2). We assume
that a with-replacement first-stage sample
of size
is first selected among the PSUs
Inside each PSU
an SI2 sample of size
is then selected. This type of sampling design
arises in particular with a self-weighted two-stage design in two waves,
where part of the SSUs selected at the first wave are replaced at the second
wave. The composite estimator in (3.3) yields
where
where
where
and where
denotes the number of SSUs inside the PSU
For example, using only the overlapping samples inside the PSUs yields the
intersection estimator
Using the complete samples inside the PSUs yields the union estimator
We note that for any vector of values
the variance due to the first stage of
sampling for
is the same. The possible composite estimators
thus differ with respect to the second-stage variance only. In view of the
discussion in Section 3.1.1, we therefore expect the intersection
estimator to be close to the optimal composite estimator; see Section 4.2
for an illustration. An unbiased variance estimator for
is given by
3.2 Estimation of the change between Gini indexes
The change between Gini indexes
may be written as
where
Using composite estimation leads to
where
Usually, in a temporal sampling framework, the samples
and
are not independent. Consequently, our set-up
differs from the usual estimation of functionals depending on distribution
functions estimated with independent samples; see for example Pires and Branco
(2002) and Reid (1981), who give the first-order expansion of a two-sample
functional using the partial influence functions. Davison and Hinkley (1997, page 71)
give bootstrap methods under a similar framework. Using a general
two-dimensional sampling design
Goga, Deville and Ruiz-Gazen (2009) give a
two-sample linearization technique of bivariate functionals that will be used
in what follows.
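As an illustration of plug-in estimation of the change between Gini indexes, the sketch below computes a weighted Gini index at each wave and takes the difference, using the overlapping sample only with equal weights. It relies on the mean-difference form of the Gini index, which is one common finite-population definition and may differ in detail from the definition used in Section 2. The function names and the symbol `N` are assumptions introduced here.

```python
import numpy as np

def weighted_gini(y, w):
    """Weighted Gini index in mean-difference form.

    G = sum_k sum_l w_k w_l |y_k - y_l| / (2 * sum_k w_k * sum_k w_k y_k)
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    abs_diff = np.abs(y[:, None] - y[None, :])           # all pairwise |y_k - y_l|
    numerator = np.sum(w[:, None] * w[None, :] * abs_diff)
    return numerator / (2.0 * w.sum() * np.sum(w * y))

def gini_change_intersection(y1_s12, y2_s12, N):
    """Plug-in estimate of the Gini change using the overlapping sample only."""
    n12 = len(y1_s12)
    w = np.full(n12, N / n12)                            # equal SI weights
    return weighted_gini(y2_s12, w) - weighted_gini(y1_s12, w)

if __name__ == "__main__":
    print(gini_change_intersection([1.0, 1.0], [0.0, 1.0], N=10))
```

Because the weights are equal, they cancel in each ratio. They are kept explicit so that the same function can be reused with unequal weights. The pairwise matrix costs memory proportional to the square of the sample size, so a sort-based formula is preferable for large samples.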
3.3 Linearization variance estimation
To obtain the asymptotic variance of
we adopt the asymptotic framework introduced
by Goga, Deville and Ruiz-Gazen (2009), which is an extension to the two-sample
case of the asymptotic framework of Isaki and Fuller (1982). Define, when they
exist, the partial influence functions of a functional
at point
as
We define the linearized variables
for
as the partial influence functions of
at
and
For the change between Gini indexes
the linearized variables
may be computed using (2.10), namely
where
The estimated linearized variable is
3.3.1 Two-dimensional SI design
In the case of the SI2 design presented in Section 3.1.1, plugging the variables
derived in (3.22) into the variance formula in
(3.6) yields the variance approximation
see Theorem 1 in Goga, Deville and Ruiz-Gazen (2009). To obtain a
variance estimator, the linearized variables may be estimated in several ways.
If the overlapping sample
only is used, the estimated linearized
variables
are obtained from (3.23) by taking
and
A variance estimator is then obtained by
plugging these linearized variables into (3.13). This leads to
If the whole samples
and
are used, the estimated linearized variables
are obtained from (3.23) by taking
and
A variance estimator is then obtained by
plugging these linearized variables into (3.14). This leads to
3.3.2 Two-dimensional multistage design
In the case of the MULT2 design presented in Section 3.1.2, the linearized
variables may also be estimated in several ways. For simplicity, we
consider using the overlapping sample
only so that the estimated linearized
variables
are obtained from (3.23) by taking
and
A variance estimator is then obtained by
plugging these linearized variables into (3.19). This leads to
where
and
are obtained from (3.15) and (3.16),
respectively, by replacing
with
3.4 Bootstrap variance estimation
Bootstrap methods have not yet been studied for the change between Gini
indexes. The principles of the weighted bootstrap technique can be extended to
the two-sample context, i.e. each measure
with
and
is estimated, conditionally on the samples
originally selected, by some weighted bootstrap measure
which makes it possible to match, at least
approximately, the first two moments of an unbiased estimator in the linear
case. In Section 3.4.1, we consider a generalization of the BWO to the SI2
design. In Section 3.4.2, we propose a generalization of the BWR to the
MULT2 design.
3.4.1 A generalization of the BWO to the SI2 design
We first consider the SI2 design. Building a pseudo-population
is more intricate in the two-sample case,
since the variables of interest measured at waves
and
need to be available for each unit in
We therefore describe a bootstrap algorithm
where the overlapping sample
only is used to build the pseudo-population
in the spirit of the intersection variance
estimator in (3.24).
Suppose that
is an integer. The vectors
are obtained by first creating a
pseudo-population
of size
by duplicating
times each unit
in the original sample
An SI2 resample
of size
is then selected in
The bootstrap measures are then
with
the number of times that unit
is selected in the resample
In the linear case, the bootstrap estimator of
the parameter
is then
where
After some algebra, we obtain
where
is given in (3.7), and
is given in (3.13). The proposed
generalization of the BWO therefore matches the intersection estimator of the
first moment exactly, and approximately matches the intersection
estimator of the second moment for a large
The construction of
may be avoided by noting that under the BWO
procedure, each vector
follows a multivariate hypergeometric
distribution. Therefore, the resampling weights may be directly generated. The
algorithm may be adapted to the general case when
is not an integer by means of any of the
techniques mentioned in Section 2.4.
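Direct generation of the resampling weights can be sketched as follows for the simplest situation: the intersection estimator in the linear case, with an integer number of copies per unit. This is an illustration under stated assumptions, not the full two-sample algorithm. It resamples only an overlap of size `n12` from the pseudo-population, and the function name and the symbols `N`, `n12` and `B` (number of bootstrap replicates) are introduced here. The counts are drawn with NumPy's `Generator.multivariate_hypergeometric`.

```python
import numpy as np

def bwo_intersection_replicates(d_s12, N, B, rng):
    """Bootstrap replicates of N * mean(d) without building the pseudo-population.

    d_s12: wave-to-wave differences y2 - y1 on the overlapping sample.
    Returns (estimates, weights), with weights of shape (B, n12).
    """
    d = np.asarray(d_s12, dtype=float)
    n12 = d.size
    if N % n12 != 0:
        raise ValueError("this sketch assumes N / n12 is an integer")
    copies = N // n12
    colors = np.full(n12, copies)         # each unit appears `copies` times
    # Number of times each original unit falls in a without-replacement
    # resample of size n12 drawn from the pseudo-population of size N.
    counts = rng.multivariate_hypergeometric(colors, n12, size=B)
    weights = (N / n12) * counts
    return weights @ d, weights

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    est, w = bwo_intersection_replicates(d, N=50, B=5000, rng=rng)
    print("intersection estimate:", 50 * d.mean())
    print("bootstrap mean       :", est.mean())
    print("bootstrap variance   :", est.var())
```

In this simplified setting the bootstrap expectation equals the intersection estimate. The bootstrap variance equals the SI variance estimator computed on the overlap multiplied by N(n12 - 1) / ((N - 1) n12), a factor that tends to one as `n12` grows.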
3.4.2 A generalization of the BWR for the two-dimensional multistage design
We now consider the two-dimensional two-stage sampling design with a common
first-stage sample
presented in Section 3.1.2. The proposed
bootstrap procedure is similar to that described in Rao and Wu (1988). A
with-replacement resample
of size
is selected by means of simple random sampling
with replacement (SIR) in the original first-stage sample
The bootstrap measures are then
It may be rewritten as
with
the union of the samples
for
and where the resampling weight
equals
multiplied by the number of times the PSU
containing
is selected in
In the linear case, the bootstrap estimator of the parameter
is then
where
is defined in (3.16). After some algebra, we
obtain
where
is given in (3.15), and
is given in (3.19). The proposed
generalization of the BWR therefore matches exactly both the composite
estimator of the first moment and the associated estimator of the second
moment.
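The first-stage resampling can be sketched as below for the linear case. Two assumptions are made that the text does not confirm. First, the resample contains `m - 1` PSUs, the usual Rao and Wu (1988) choice, so that each resampling weight is `m / (m - 1)` times the number of times the PSU is drawn. Second, the composite estimator can be written as the mean of PSU-level quantities `z`, for instance an estimated PSU change divided by the PSU's drawing probability. The names `z`, `m` and `B` and the function names are introduced for illustration.

```python
import numpy as np

def bwr_psu_weights(m, B, rng):
    """Rescaled PSU resampling weights for B with-replacement resamples of m - 1 PSUs."""
    counts = rng.multinomial(m - 1, np.full(m, 1.0 / m), size=B)   # shape (B, m)
    return counts * m / (m - 1)

def bwr_replicates(z, B, rng):
    """Bootstrap replicates of the PSU-level mean estimator."""
    z = np.asarray(z, dtype=float)
    w = bwr_psu_weights(z.size, B, rng)
    return (w @ z) / z.size

def wr_variance_estimate(z):
    """Usual variance estimator under with-replacement first-stage sampling."""
    z = np.asarray(z, dtype=float)
    return z.var(ddof=1) / z.size

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    z = np.array([3.0, 7.0, 1.0, 9.0, 5.0, 4.0])
    reps = bwr_replicates(z, B=20000, rng=rng)
    print("estimate, bootstrap mean    :", z.mean(), reps.mean())
    print("variance estimate, bootstrap:", wr_variance_estimate(z), reps.var())
```

Under these assumptions the resampling expectation of a replicate equals the original estimate, and its resampling variance equals the with-replacement variance estimator, so the two printed pairs should agree up to Monte Carlo error.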