2 Functional data in a finite population
Hervé Cardot, Alain Dessertaine, Camelia Goga,
Étienne Josserand and Pauline Lardin
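Consider a finite population $U = \{1, \ldots, N\}$ of size $N$ and assume that for each unit $k$ in the population $U$ we can observe the deterministic curve $Y_k = (Y_k(t))_{t \in [0,T]}$.
The objective is to estimate the mean curve of the population, which is defined
for any instant $t \in [0,T]$ by
$$\mu(t) = \frac{1}{N} \sum_{k \in U} Y_k(t).$$
Let $s$ be a sample of fixed size $n$ selected randomly in $U$ according to a sampling design $p(\cdot)$. Let $\pi_k = \Pr(k \in s)$ and $\pi_{kl} = \Pr(k \in s \text{ and } l \in s)$ be the first- and second-order inclusion
probabilities respectively. Assume that $\pi_k > 0$ for any unit $k$ in population $U$.
The mean curve $\mu$ is estimated using the Horvitz-Thompson
estimator (Cardot et al. 2010)
as follows:
$$\hat{\mu}(t) = \frac{1}{N} \sum_{k \in s} \frac{Y_k(t)}{\pi_k} = \frac{1}{N} \sum_{k \in U} \frac{Y_k(t)}{\pi_k} 1_k, \quad t \in [0,T],$$
where $1_k$ is the indicator that unit $k$ belongs to the sample $s$. For each instant $t$, the estimator $\hat{\mu}(t)$ is unbiased for $\mu(t)$, meaning that $E(\hat{\mu}(t)) = \mu(t)$, where the expectation is taken with
respect to the sampling design.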
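The Horvitz-Thompson estimator of the mean curve can be sketched numerically once the curves are stored on a common grid of instants. The following is a minimal illustration, not the authors' code: the function name `ht_mean_curve`, the toy population `Y_pop` and the use of simple random sampling without replacement (so that every inclusion probability equals $n/N$) are assumptions made for the example. Averaging the estimate over all possible samples checks the unbiasedness property stated above.

```python
from itertools import combinations

import numpy as np


def ht_mean_curve(Y_sample, pi_sample, N):
    """Horvitz-Thompson estimate of the mean curve on a grid of instants.

    Y_sample  : (n, D) array, row k holds the sampled curve Y_k at D instants
    pi_sample : (n,) first-order inclusion probabilities of the sampled units
    N         : population size
    """
    Y_sample = np.asarray(Y_sample, dtype=float)
    pi_sample = np.asarray(pi_sample, dtype=float)
    if np.any(pi_sample <= 0):
        raise ValueError("inclusion probabilities must be strictly positive")
    # Weight each curve by 1 / pi_k, sum over the sample, divide by N.
    return (Y_sample / pi_sample[:, None]).sum(axis=0) / N


# Toy population (hypothetical values): N = 4 curves observed at 3 instants.
Y_pop = np.array([[1.0, 2.0, 3.0],
                  [2.0, 3.0, 4.0],
                  [3.0, 5.0, 5.0],
                  [4.0, 5.0, 9.0]])
N, n = Y_pop.shape[0], 2
pi = np.full(n, n / N)  # SRSWOR: pi_k = n / N for every unit

# Enumerate every sample of size n and average the estimates: under SRSWOR
# all samples are equally likely, so this average is the design expectation.
estimates = [ht_mean_curve(Y_pop[list(s)], pi, N)
             for s in combinations(range(N), n)]
mean_of_estimates = np.mean(estimates, axis=0)
```

Here `mean_of_estimates` coincides with the population mean curve `Y_pop.mean(axis=0)`, as unbiasedness requires.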
Generally, the trajectories $Y_k$ are not observed continuously for $t \in [0,T]$ but only for a set of measurement instants $0 = t_1 < t_2 < \cdots < t_D = T$.
In functional data analysis, a classical
strategy is to interpolate or smooth discretized trajectories to obtain objects
that are truly functions (Ramsay and Silverman 2005). This also makes it
possible to deal with curves whose measurement instants are not identical. In
the context of surveys, Cardot and Josserand (2011) studied linear
interpolation where there is no measurement error at the discretized points,
while Cardot et al. (2013) examined
smoothing procedures. If there are enough discretization points and the
trajectories are fairly regular (but not necessarily differentiable), the
approximation error due to smoothing or interpolation is negligible in relation
to the sampling error. In what follows, we assume that the trajectories are
observed at every point $t$ of the interval $[0,T]$.
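The interpolation step described above can be illustrated as follows. This is a minimal sketch that assumes measurements without error and uses NumPy's `np.interp` for linear interpolation; the helper name `interpolate_to_grid`, the common grid and the two toy trajectories are hypothetical.

```python
import numpy as np


def interpolate_to_grid(times, values, grid):
    """Linearly interpolate one discretized trajectory onto a common grid.

    times, values : measurement instants and observed values for one unit
    grid          : common instants at which every curve is evaluated
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(times)  # np.interp expects increasing abscissae
    # Outside [times.min(), times.max()] np.interp holds the end values.
    return np.interp(grid, times[order], values[order])


# Two units measured at different instants, mapped onto the same grid.
grid = np.linspace(0.0, 1.0, 5)
curve_a = interpolate_to_grid([0.0, 0.5, 1.0], [0.0, 1.0, 0.0], grid)
curve_b = interpolate_to_grid([0.0, 0.25, 1.0], [2.0, 2.0, 5.0], grid)
curves = np.vstack([curve_a, curve_b])  # one row per unit, one column per instant
```

After this step, all curves share the same instants and can be stored as rows of a single array, which is the layout assumed when estimators are evaluated on a grid.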
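The covariance function of the Horvitz-Thompson estimator is given by
$$\gamma(r,t) = \operatorname{Cov}(\hat{\mu}(r), \hat{\mu}(t)) = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{Y_k(r)}{\pi_k} \frac{Y_l(t)}{\pi_l}$$
for any $r$ and $t$ in $[0,T]$, with the convention $\pi_{kk} = \pi_k$. If we assume that the second-order
probabilities of inclusion satisfy $\pi_{kl} > 0$ for all $k, l \in U$, an unbiased estimator of $\gamma(r,t)$ is given by the Horvitz-Thompson unbiased
estimator of the variance,
$$\hat{\gamma}(r,t) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \frac{Y_k(r)}{\pi_k} \frac{Y_l(t)}{\pi_l} \qquad (2.2)$$
for any $r, t \in [0,T]$.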
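As an illustration, the Horvitz-Thompson covariance estimator can be evaluated on a grid of instants with a single matrix product. The sketch below assumes that the joint inclusion probabilities of the sampled units are available as an $n \times n$ matrix whose diagonal holds the first-order probabilities; the names `ht_covariance` and `srswor_joint_probabilities` are hypothetical, and the data are simulated only to exercise the code. For simple random sampling without replacement, the result should agree with the classical closed form $(1/n - 1/N)$ times the sample covariance.

```python
import numpy as np


def ht_covariance(Y_sample, pi_joint, N):
    """Horvitz-Thompson estimator of the covariance function on a grid.

    Y_sample : (n, D) array of sampled curves
    pi_joint : (n, n) array of pi_kl, with pi_kk = pi_k on the diagonal
    N        : population size
    Returns a (D, D) array whose entry (r, t) estimates gamma(r, t).
    """
    Y = np.asarray(Y_sample, dtype=float)
    P = np.asarray(pi_joint, dtype=float)
    pi = np.diag(P)
    Z = Y / pi[:, None]                  # Y_k(t) / pi_k
    W = (P - np.outer(pi, pi)) / P       # (pi_kl - pi_k pi_l) / pi_kl
    return Z.T @ W @ Z / N**2


def srswor_joint_probabilities(n, N):
    """Joint inclusion probabilities under SRSWOR (requires n >= 2)."""
    P = np.full((n, n), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(P, n / N)
    return P


# Simulated curves, used only to compare the two expressions numerically.
rng = np.random.default_rng(0)
N, n, D = 50, 10, 4
Y_s = rng.normal(size=(n, D))
gamma_hat = ht_covariance(Y_s, srswor_joint_probabilities(n, N), N)
closed_form = (1 / n - 1 / N) * np.cov(Y_s, rowvar=False)
```

The double sum over the sample costs $O(n^2 D + n D^2)$ operations in this form, which is practical only when the matrix of joint probabilities is known and fits in memory.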
2.1 Using auxiliary information for estimating the mean trajectory
It is well known that using auxiliary information that
effectively explains the variable of interest can greatly improve the precision
of the Horvitz-Thompson estimator. In the case of the EDF data, the outside
temperature or the type of contract could probably be useful auxiliary
variables. A stratification based on geographic position would also yield
estimates for different regions. In this study, we have as an auxiliary
variable the total electricity consumption for the previous week. We assume
that this real-valued variable is observed for all units in the population.
In this section, we present the Horvitz-Thompson
estimator for the mean curve as well as an estimate of the covariance function
of this estimator, both for a stratified design using simple random sampling
without replacement (SRSWOR) in each stratum, denoted hereafter as STRAT, and
for PPS sampling without replacement, which will be denoted as $\pi$ps. We also consider an estimator of
the mean curve, assisted by a functional linear model.
2.1.1 Stratified sampling with SRSWOR in each stratum (STRAT)
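The population $U$ is assumed to be stratified into a fixed
number $H$ of strata $U_1, \ldots, U_H$ of sizes $N_1, \ldots, N_H$. Within each stratum $U_h$, a sample $s_h$ of size $n_h$ is drawn according to an SRSWOR design.
We denote by $\mu_h(t) = \frac{1}{N_h} \sum_{k \in U_h} Y_k(t)$, for $t \in [0,T]$, the
mean curve in each stratum, and by $\hat{\mu}_h(t) = \frac{1}{n_h} \sum_{k \in s_h} Y_k(t)$ its estimate. The estimator of
the mean curve is then defined by
$$\hat{\mu}_{strat}(t) = \sum_{h=1}^{H} \frac{N_h}{N} \hat{\mu}_h(t) = \frac{1}{N} \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in s_h} Y_k(t), \quad t \in [0,T].$$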
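The Horvitz-Thompson estimator of the covariance
function is then
$$\hat{\gamma}_{strat}(r,t) = \frac{1}{N^2} \sum_{h=1}^{H} N_h \frac{N_h - n_h}{n_h} \hat{S}_h(r,t), \quad r, t \in [0,T],$$
where
$$\hat{S}_h(r,t) = \frac{1}{n_h - 1} \sum_{k \in s_h} (Y_k(r) - \hat{\mu}_h(r)) (Y_k(t) - \hat{\mu}_h(t))$$
is the estimator of the covariance function in stratum $h$. For $r = t$, we obtain the estimator of the variance
function as follows:
$$\hat{\gamma}_{strat}(t,t) = \frac{1}{N^2} \sum_{h=1}^{H} N_h \frac{N_h - n_h}{n_h} \hat{S}_h^2(t),$$
where
$$\hat{S}_h^2(t) = \frac{1}{n_h - 1} \sum_{k \in s_h} (Y_k(t) - \hat{\mu}_h(t))^2$$
is the estimator of the variance in stratum $h$. Cardot and Josserand (2011) propose an
extension, in the functional framework, of Neyman's optimal allocation. When
the sizes of the samples satisfy
$$n_h = n \frac{N_h \sqrt{\int_0^T S_h^2(t) \, dt}}{\sum_{i=1}^{H} N_i \sqrt{\int_0^T S_i^2(t) \, dt}}, \quad h = 1, \ldots, H,$$
where $S_h^2(t) = \frac{1}{N_h - 1} \sum_{k \in U_h} (Y_k(t) - \mu_h(t))^2$ is the population variance in stratum $h$ at instant $t$,
the integrated variance, $\int_0^T \operatorname{Var}(\hat{\mu}_{strat}(t)) \, dt$, of the stratified estimator is minimized.
This allocation is similar to the one obtained in a multivariate context by
Cochran (1977). By replacing the variable $Y$ by another variable $X$ that is known for the entire population and is
highly correlated with the variable of interest, we obtain an allocation that
can be described as optimal.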
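The allocation rule can be sketched numerically when a proxy curve is known for every unit of the population. In the example below, the function name `functional_neyman_allocation` and the toy strata are hypothetical, the time integral is approximated by the trapezoidal rule on the observation grid, and the returned sizes are real-valued: rounding to integers and enforcing $n_h \le N_h$ are deliberately left out.

```python
import numpy as np


def functional_neyman_allocation(curves_by_stratum, n, grid):
    """Real-valued stratum sample sizes minimizing the integrated variance.

    curves_by_stratum : list of (N_h, D) arrays, one per stratum, holding a
                        curve (or proxy curve) for every unit of the stratum
    n                 : total sample size
    grid              : (D,) instants at which the curves are observed
    """
    grid = np.asarray(grid, dtype=float)
    sizes = np.array([len(X) for X in curves_by_stratum], dtype=float)
    integrated = []
    for X in curves_by_stratum:
        s2 = np.asarray(X, dtype=float).var(axis=0, ddof=1)  # S_h^2(t)
        # Trapezoidal approximation of the integral of S_h^2 over the grid.
        integrated.append(np.sum(0.5 * (s2[1:] + s2[:-1]) * np.diff(grid)))
    weights = sizes * np.sqrt(np.array(integrated))
    return n * weights / weights.sum()


# Two strata of equal size; the second is twice as dispersed as the first.
grid = np.array([0.0, 0.5, 1.0])
stratum_1 = np.repeat([[0.0], [0.0], [2.0], [2.0]], 3, axis=1)
stratum_2 = np.repeat([[0.0], [0.0], [4.0], [4.0]], 3, axis=1)
n_h = functional_neyman_allocation([stratum_1, stratum_2], n=3, grid=grid)
```

With equal stratum sizes, the allocation is proportional to the square root of the integrated variance, so the more dispersed stratum receives twice as many units here.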
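Note 2.1 For $H = 1$, we obtain the simple random design without
replacement (SRSWOR), and the mean curve is estimated by
$$\hat{\mu}_{srs}(t) = \frac{1}{n} \sum_{k \in s} Y_k(t), \quad t \in [0,T].$$
The estimator
of the covariance function defined in (2.2) is then
$$\hat{\gamma}_{srs}(r,t) = \left( \frac{1}{n} - \frac{1}{N} \right) \frac{1}{n-1} \sum_{k \in s} (Y_k(r) - \hat{\mu}_{srs}(r)) (Y_k(t) - \hat{\mu}_{srs}(t)), \quad r, t \in [0,T].$$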
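The stratified estimator and its covariance estimator can be sketched as follows, with the single-stratum case of Note 2.1 used as a check. The function name `stratified_mean_and_covariance` and the toy samples are assumptions made for the illustration; each stratum needs at least two sampled curves for the covariance to be defined.

```python
import numpy as np


def stratified_mean_and_covariance(samples, stratum_sizes):
    """Stratified estimator of the mean curve and of its covariance function.

    samples       : list of (n_h, D) arrays, the curves sampled by SRSWOR
                    in each stratum (n_h >= 2)
    stratum_sizes : list of the stratum sizes N_h
    """
    N = float(sum(stratum_sizes))
    D = np.asarray(samples[0]).shape[1]
    mu_hat = np.zeros(D)
    gamma_hat = np.zeros((D, D))
    for Y_h, N_h in zip(samples, stratum_sizes):
        Y_h = np.asarray(Y_h, dtype=float)
        n_h = Y_h.shape[0]
        mu_hat += (N_h / N) * Y_h.mean(axis=0)
        # Sample covariance in the stratum, divisor n_h - 1.
        S_h = np.atleast_2d(np.cov(Y_h, rowvar=False))
        gamma_hat += N_h * (N_h - n_h) / n_h * S_h / N**2
    return mu_hat, gamma_hat


# Two strata (hypothetical data): N_1 = 10, N_2 = 30, curves at 2 instants.
sample_1 = np.array([[1.0, 2.0], [3.0, 4.0]])
sample_2 = np.array([[10.0, 10.0], [12.0, 14.0], [14.0, 12.0]])
mu_strat, gamma_strat = stratified_mean_and_covariance(
    [sample_1, sample_2], [10, 30])

# Single stratum: the formulas reduce to the SRSWOR expressions.
mu_srs, gamma_srs = stratified_mean_and_covariance([sample_2], [30])
srs_closed_form = (1 / 3 - 1 / 30) * np.cov(sample_2, rowvar=False)
```

In the single-stratum call, `mu_srs` is the plain sample mean curve and `gamma_srs` equals `srs_closed_form`.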
2.1.2 PPS sampling without replacement ($\pi$ps)
PPS sampling designs with or without replacement are
often used in practice because they are more efficient than equal probability
designs when the variable of interest is approximately proportional to an auxiliary
variable that has strictly positive values.
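In the case of samples of fixed size drawn without replacement, it is possible to
give the equivalent of the formula of Yates and Grundy (1953) and Sen (1953).
The covariance function of $\hat{\mu}$ satisfies
$$\gamma(r,t) = -\frac{1}{2 N^2} \sum_{k \in U} \sum_{l \in U, l \neq k} (\pi_{kl} - \pi_k \pi_l) \left( \frac{Y_k(r)}{\pi_k} - \frac{Y_l(r)}{\pi_l} \right) \left( \frac{Y_k(t)}{\pi_k} - \frac{Y_l(t)}{\pi_l} \right), \quad r, t \in [0,T].$$
Assume that the values $x_k > 0$ of the variable $X$ are known for all units $k$ in the population $U$. It is then possible to
define the inclusion probabilities as follows:
$$\pi_k = n \frac{x_k}{\sum_{l \in U} x_l}, \quad k \in U.$$
Methods have been proposed in the literature for
the case where this expression exceeds 1 for some units (Särndal et
al. 1992).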
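One common way of handling units whose size-proportional probability would exceed 1 is to include them with certainty and to recompute the probabilities of the remaining units. The sketch below follows that rule, which is an assumption of this illustration and is not necessarily the procedure applied to the data of this study; the function name `pips_inclusion_probabilities` and the size values are hypothetical.

```python
import numpy as np


def pips_inclusion_probabilities(x, n):
    """First-order inclusion probabilities proportional to a size variable.

    x : (N,) strictly positive auxiliary values
    n : sample size, with 0 < n < N
    Units whose probability would exceed 1 are selected with certainty and
    the probabilities of the other units are recomputed.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("the size variable must be strictly positive")
    if not 0 < n < x.size:
        raise ValueError("n must satisfy 0 < n < N")
    pi = np.zeros_like(x)
    certain = np.zeros(x.size, dtype=bool)
    while True:
        remaining = n - certain.sum()          # units still to be allocated
        pi[~certain] = remaining * x[~certain] / x[~certain].sum()
        too_large = (pi > 1.0) & ~certain
        if not too_large.any():
            break
        certain |= too_large
    pi[certain] = 1.0
    return pi


# One very large unit: its raw probability would be 2 * 16 / 20 = 1.6.
pi = pips_inclusion_probabilities([1.0, 1.0, 1.0, 1.0, 16.0], n=2)
```

In this example, the large unit is taken with certainty and the four other units share the remaining sample size equally, so the probabilities still sum to $n$.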
Second-order inclusion probabilities are generally very
difficult to calculate for $\pi$ps designs, and therefore Formula (2.2) cannot be
used. However, there is a simple asymptotic approximation of the variance,
which was proposed by Hájek (1964) and which entails only first-order inclusion
probabilities. This approximation proves to be very effective when the sample
is large and the entropy of the sampling design is close to maximum entropy. To
select a sample $s$ with inclusion probabilities $\pi_k$, the cube algorithm (Deville and Tillé 2004)
balanced on the variable $\pi = (\pi_1, \ldots, \pi_N)'$ can be used. Deville and Tillé (2005) show
that for this particular sampling design, the Hájek formula is highly effective
for estimating the variance of a total or a mean. This formula for
approximating the variance can also be used for the covariance, which is then
estimated by
$$\hat{\gamma}_H(r,t) = \frac{1}{N^2} \frac{n}{n-1} \sum_{k \in s} (1 - \pi_k) \left( \frac{Y_k(r)}{\pi_k} - \hat{R}(r) \right) \left( \frac{Y_k(t)}{\pi_k} - \hat{R}(t) \right), \quad r, t \in [0,T],$$
where
$$\hat{R}(t) = \frac{\sum_{k \in s} (1 - \pi_k) Y_k(t) / \pi_k}{\sum_{k \in s} (1 - \pi_k)}.$$
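A Hájek-type covariance estimator needs only the first-order inclusion probabilities of the sampled units, which makes it easy to sketch. The version below uses the weights $(1 - \pi_k)$ together with the finite-sample factor $n/(n-1)$; the presence of that factor, the function name `hajek_covariance` and the simulated data are assumptions of this illustration. With equal probabilities $\pi_k = n/N$, this form reduces exactly to the SRSWOR expression, which serves as a sanity check.

```python
import numpy as np


def hajek_covariance(Y_sample, pi_sample, N):
    """Hajek-type estimator of the covariance function on a grid.

    Y_sample  : (n, D) array of sampled curves, n >= 2
    pi_sample : (n,) first-order inclusion probabilities, not all equal to 1
    N         : population size
    """
    Y = np.asarray(Y_sample, dtype=float)
    pi = np.asarray(pi_sample, dtype=float)
    n = Y.shape[0]
    w = 1.0 - pi                                    # weights 1 - pi_k
    Z = Y / pi[:, None]                             # Y_k(t) / pi_k
    R_hat = (w[:, None] * Z).sum(axis=0) / w.sum()  # weighted mean of Z
    C = Z - R_hat                                   # centred weighted curves
    return (n / (n - 1)) * ((C.T * w) @ C) / N**2


# Sanity check with equal probabilities (simulated curves).
rng = np.random.default_rng(1)
N, n, D = 40, 8, 3
Y_s = rng.normal(size=(n, D))
gamma_H = hajek_covariance(Y_s, np.full(n, n / N), N)
srs_closed_form = (1 / n - 1 / N) * np.cov(Y_s, rowvar=False)
```

Unlike the estimator in (2.2), the cost here is linear in the sample size for each pair of instants, since no double sum over the sample is involved.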
We also used systematic sampling with unequal
probabilities proposed by Madow (1949), since it is simple to use. Unfortunately,
it is difficult to estimate the variance for this type of design, and we will
therefore not use it to construct confidence bands.
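The simplicity of systematic sampling with unequal probabilities can be seen in a short sketch: the inclusion probabilities are cumulated along the list of units, a single uniform start $u$ is drawn, and the units whose cumulated interval contains one of the points $u, u+1, \ldots, u+n-1$ are selected. The function name `systematic_unequal_sample` and the probability values below are hypothetical, and the start is passed as an argument so that the example is deterministic.

```python
import numpy as np


def systematic_unequal_sample(pi, u):
    """Systematic sampling with unequal probabilities.

    pi : (N,) inclusion probabilities in (0, 1] summing to an integer n
    u  : random start in [0, 1), e.g. np.random.default_rng().uniform()
    Returns the indices of the n selected units.
    """
    pi = np.asarray(pi, dtype=float)
    n = int(round(pi.sum()))
    cumulated = np.cumsum(pi)
    points = u + np.arange(n)          # u, u + 1, ..., u + n - 1
    # Unit k is selected when cumulated[k - 1] <= point < cumulated[k].
    idx = np.searchsorted(cumulated, points, side="right")
    # Guard against rounding error at the upper end of the cumulated sums.
    return np.minimum(idx, pi.size - 1)


pi = np.array([0.2, 0.5, 0.3, 0.6, 0.4])   # sums to n = 2
selected = systematic_unequal_sample(pi, u=0.35)
```

With the start `u = 0.35`, the points 0.35 and 1.35 fall in the cumulated intervals of the second and fourth units. Because a single random number determines the whole sample, many pairs of units can never be selected together, which is why variance estimation is difficult for this design.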
2.2 The model-assisted estimator
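Consider $p$ real auxiliary variables $X_1, \ldots, X_p$ and let $x_{kj}$ be the value of the variable $X_j$ for the $k$th individual. Let $\mathbf{x}_k = (x_{k1}, \ldots, x_{kp})'$ denote the vector containing the values of the $p$ auxiliary variables measured on the $k$th individual. We consider that the relationship
between the variable of interest and the auxiliary variables is modeled by the
following superpopulation model
$$\xi: \quad Y_k(t) = \mathbf{x}_k' \boldsymbol{\beta}(t) + \varepsilon_k(t), \quad t \in [0,T],$$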
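with $E_\xi(\varepsilon_k(t)) = 0$, $E_\xi(\varepsilon_k(t) \varepsilon_l(t)) = 0$ for $k \neq l$, and $E_\xi(\varepsilon_k^2(t)) = \sigma_t^2$.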
This model is an immediate generalization of the
functional linear model proposed by Faraway (1997) to several auxiliary
variables.
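The estimate of $\boldsymbol{\beta}(t)$ based on the model and the sampling design is given by
$$\hat{\boldsymbol{\beta}}(t) = \left( \sum_{k \in s} \frac{\mathbf{x}_k \mathbf{x}_k'}{\pi_k} \right)^{-1} \sum_{k \in s} \frac{\mathbf{x}_k Y_k(t)}{\pi_k}, \quad t \in [0,T].$$
Note that the sampling weights do not depend on the
time $t$. Let $\hat{Y}_k(t) = \mathbf{x}_k' \hat{\boldsymbol{\beta}}(t)$ be the estimator based on the sampling design
for the prediction of $Y_k(t)$ under the model $\xi$. By direct analogy with the
univariate case (Särndal et al. 1992),
we finally obtain the following estimator for the mean, for $t \in [0,T]$:
$$\hat{\mu}_{MA}(t) = \frac{1}{N} \sum_{k \in U} \hat{Y}_k(t) - \frac{1}{N} \sum_{k \in s} \frac{\hat{Y}_k(t) - Y_k(t)}{\pi_k}.$$
If the vector $\mathbf{x}_k$ contains the constant variable 1, then the weighted residuals sum to zero over the sample and the
estimator becomes
$$\hat{\mu}_{MA}(t) = \frac{1}{N} \sum_{k \in U} \hat{Y}_k(t) = \frac{1}{N} \sum_{k \in U} \mathbf{x}_k' \hat{\boldsymbol{\beta}}(t).$$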
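The model-assisted estimator amounts to a weighted least-squares fit carried out simultaneously at every instant, followed by a prediction over the whole population. The sketch below is a minimal illustration under that reading: the function name `model_assisted_mean`, the toy design matrix and the coefficient values are hypothetical. When the curves are exactly linear in the auxiliary variable, the estimator reproduces the population mean curve whatever the sample, which gives a simple check.

```python
import numpy as np


def model_assisted_mean(Y_sample, X_sample, pi_sample, X_pop):
    """Model-assisted estimator of the mean curve on a grid.

    Y_sample  : (n, D) sampled curves
    X_sample  : (n, p) auxiliary vectors of the sampled units
    pi_sample : (n,) first-order inclusion probabilities
    X_pop     : (N, p) auxiliary vectors of all population units
    Returns the estimated mean curve (D,) and the coefficients (p, D).
    """
    Y = np.asarray(Y_sample, dtype=float)
    Xs = np.asarray(X_sample, dtype=float)
    Xp = np.asarray(X_pop, dtype=float)
    w = 1.0 / np.asarray(pi_sample, dtype=float)
    XtW = (Xs * w[:, None]).T                       # x_k / pi_k, as columns
    # One linear system per instant, solved jointly: columns of beta_hat
    # are the coefficient vectors beta_hat(t) on the grid.
    beta_hat = np.linalg.solve(XtW @ Xs, XtW @ Y)
    residuals = Y - Xs @ beta_hat                   # Y_k(t) - Yhat_k(t)
    N = Xp.shape[0]
    mu_hat = (Xp @ beta_hat).mean(axis=0) \
        + (residuals * w[:, None]).sum(axis=0) / N
    return mu_hat, beta_hat


# Toy population where Y_k(t) is exactly linear in x_k at each of 3 instants.
x = np.arange(1.0, 7.0)
X_pop = np.column_stack([np.ones_like(x), x])       # constant and x_k
beta_true = np.array([[1.0, 2.0, 3.0],               # intercepts beta_0(t)
                      [0.5, 1.0, -1.0]])             # slopes beta_1(t)
Y_pop = X_pop @ beta_true
idx = np.array([0, 2, 5])                            # an arbitrary sample
pi_s = np.array([0.3, 0.5, 0.8])
mu_ma, beta_hat = model_assisted_mean(Y_pop[idx], X_pop[idx], pi_s, X_pop)
```

Since the sampling weights do not depend on time, the matrix to invert is the same for every instant, so the fit over the whole grid costs a single linear solve with several right-hand sides.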
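For fixed $r$ and $t$, the asymptotic covariance of $\hat{\mu}_{MA}(r)$ and $\hat{\mu}_{MA}(t)$ can be calculated according to the classical
residual technique (Särndal et al.
1992),
$$\gamma_{MA}(r,t) = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{Y_k(r) - \tilde{Y}_k(r)}{\pi_k} \frac{Y_l(t) - \tilde{Y}_l(t)}{\pi_l}, \qquad (2.14)$$
where $\tilde{Y}_k(t) = \mathbf{x}_k' \tilde{\boldsymbol{\beta}}(t)$ is the prediction of $Y_k(t)$ under the superpopulation model and $\tilde{\boldsymbol{\beta}}(t)$ is the estimate of $\boldsymbol{\beta}(t)$ at the level of the population,
$$\tilde{\boldsymbol{\beta}}(t) = \left( \sum_{k \in U} \mathbf{x}_k \mathbf{x}_k' \right)^{-1} \sum_{k \in U} \mathbf{x}_k Y_k(t).$$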
Cardot, Goga and Lardin (2013) show that this
result remains valid uniformly in $r$ and $t$.
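As an estimator of the covariance function $\gamma_{MA}(r,t)$, we propose the Horvitz-Thompson
estimator of asymptotic covariance given by (2.14) where $\tilde{Y}_k(t)$ is replaced by its estimator based on the sampling design, $\hat{Y}_k(t)$:
$$\hat{\gamma}_{MA}(r,t) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \frac{Y_k(r) - \hat{Y}_k(r)}{\pi_k} \frac{Y_l(t) - \hat{Y}_l(t)}{\pi_l}.$$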
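The residual technique can be sketched by applying the Horvitz-Thompson covariance formula to the fitted residuals instead of the raw curves. The example below is an illustration under stated assumptions: joint inclusion probabilities are supplied as a matrix (here those of SRSWOR), the function name `model_assisted_covariance` is hypothetical, and the data are simulated only to exercise the code.

```python
import numpy as np


def model_assisted_covariance(Y_sample, X_sample, pi_joint, N):
    """Residual-based Horvitz-Thompson covariance estimator on a grid.

    Y_sample : (n, D) sampled curves
    X_sample : (n, p) auxiliary vectors of the sampled units
    pi_joint : (n, n) joint inclusion probabilities, pi_k on the diagonal
    N        : population size
    """
    Y = np.asarray(Y_sample, dtype=float)
    Xs = np.asarray(X_sample, dtype=float)
    P = np.asarray(pi_joint, dtype=float)
    pi = np.diag(P)
    XtW = (Xs / pi[:, None]).T
    beta_hat = np.linalg.solve(XtW @ Xs, XtW @ Y)   # design-weighted fit
    E = (Y - Xs @ beta_hat) / pi[:, None]           # residuals over pi_k
    W = (P - np.outer(pi, pi)) / P
    return E.T @ W @ E / N**2


# Simulated sample under SRSWOR with one auxiliary variable plus a constant.
rng = np.random.default_rng(3)
N, n, D = 60, 12, 3
x_s = rng.uniform(1.0, 5.0, size=n)
X_s = np.column_stack([np.ones(n), x_s])
Y_s = np.outer(x_s, [1.0, 2.0, 0.5]) + rng.normal(scale=0.3, size=(n, D))
P = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(P, n / N)
gamma_ma = model_assisted_covariance(Y_s, X_s, P, N)

# For comparison: the same formula applied to the raw curves (no model).
Z = Y_s / (n / N)
gamma_ht = Z.T @ ((P - (n / N) ** 2) / P) @ Z / N**2
```

When the auxiliary variable explains the curves well, the residuals are much smaller than the curves themselves, so the diagonal of `gamma_ma` is expected to be well below that of `gamma_ht`.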
Note 2.2 It is entirely possible to consider a superpopulation model that is
more general than the linear model proposed here. Estimation techniques based
on smoothing by B-splines (Goga and Ruiz-Gazen 2012) can then also be
considered. In our study, the relationship between consumption at instant $t$
and the mean consumption for the previous week is almost linear (cf. Figure
4.1), which justifies not using these non-parametric approaches.