Publications

Survey Methodology

2 Functional data in a finite population

Hervé Cardot, Alain Dessertaine, Camelia Goga, Étienne Josserand and Pauline Lardin

Consider a finite population $U = \{1, \dots, N\}$ of size $N$ and assume that, for each unit $k$ in $U$, we can observe the deterministic curve $Y_{k} = {(Y_{k} (t))}_{t \in [0, T]}$. The objective is to estimate the mean curve of the population, which is defined for any instant $t \in [0, T]$ by

$μ (t) = \frac{1}{N} \sum_{k \in U} Y_{k} (t) .$

Let $s$ be a sample of fixed size $n$, selected randomly in $U$ according to a sampling design $p (.)$. Let $π_{k} = \Pr (k \in s)$ and $π_{k l} = \Pr (k \in s \text{ and } l \in s)$ be the first- and second-order inclusion probabilities, respectively. Assume that $π_{k} > 0$ for every unit $k$ in the population $U$.

The mean curve $μ$ is estimated using the Horvitz-Thompson estimator (Cardot et al. 2010) as follows:

$\hat{μ} (t) = \frac{1}{N} \sum_{k \in s} \frac{Y_{k} (t)}{π_{k}} = \frac{1}{N} \sum_{k \in U} \frac{Y_{k} (t)}{π_{k}} 1_{k \in s}, t \in [0, T], (2.1)$

where $1_{k \in s}$ is the indicator that unit $k$ belongs to the sample $s$. For each instant $t \in [0, T]$, the estimator $\hat{μ} (t)$ is unbiased for $μ (t)$, that is, $E (\hat{μ} (t)) = μ (t)$, where the expectation is taken with respect to the sampling design.
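
As a minimal illustration of (2.1), the sketch below evaluates the Horvitz-Thompson estimator on curves discretized at a common grid of instants and checks design-unbiasedness by enumerating every sample of a tiny population. The function name `ht_mean_curve`, the array layout and the toy data are assumptions made for this example only.

```python
from itertools import combinations

import numpy as np

def ht_mean_curve(Y_s, pi_s, N):
    """Horvitz-Thompson estimate of the mean curve on a time grid.

    Y_s  : (n, D) array, sampled curves evaluated at D instants
    pi_s : (n,) array, first-order inclusion probabilities of the sampled units
    N    : population size
    """
    Y_s = np.asarray(Y_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    return (Y_s / pi_s[:, None]).sum(axis=0) / N

# Toy population of N = 4 curves; SRSWOR of size n = 2, so pi_k = n / N
rng = np.random.default_rng(0)
N, n, D = 4, 2, 5
Y = rng.normal(size=(N, D))
pi = np.full(N, n / N)

# Enumerate all equally likely samples and average the estimates
samples = list(combinations(range(N), n))
estimates = [ht_mean_curve(Y[list(s)], pi[list(s)], N) for s in samples]
design_expectation = np.mean(estimates, axis=0)
```

Averaging over all possible samples reproduces the population mean curve `Y.mean(axis=0)` up to rounding, which is the unbiasedness property stated above.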

Generally, the trajectories $Y_{k} (t)$ are not observed continuously for $t \in [0, T]$ but only at a set of $D$ measurement instants $0 = t_{1} < t_{2} < \dots < t_{D} = T$. In functional data analysis, a classical strategy is to interpolate or smooth the discretized trajectories to obtain objects that are truly functions (Ramsay and Silverman 2005). This also makes it possible to deal with curves whose measurement instants are not identical. In the context of surveys, Cardot and Josserand (2011) studied linear interpolation when there is no measurement error at the discretization points, while Cardot et al. (2013) examined smoothing procedures. If there are enough discretization points and the trajectories are fairly regular (but not necessarily differentiable), the approximation error due to smoothing or interpolation is negligible compared with the sampling error. In what follows, we therefore assume that the trajectories are observed at every point $t$ of the interval $[0, T]$.
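
The interpolation step can be sketched as follows, assuming noise-free measurements and using NumPy's `np.interp` for piecewise-linear interpolation onto a common grid. The helper name `to_common_grid` and the toy measurement instants are hypothetical.

```python
import numpy as np

def to_common_grid(times, values, grid):
    """Linearly interpolate one discretized trajectory onto a common grid.

    times  : increasing measurement instants of this unit
    values : observed values of the trajectory at those instants
    grid   : instants at which every curve should be evaluated
    """
    return np.interp(grid, times, values)

# Two units measured at different instants on [0, T] with T = 1
grid = np.linspace(0.0, 1.0, 5)
t_a = np.array([0.0, 0.5, 1.0])
t_b = np.array([0.0, 0.25, 0.75, 1.0])
curve_a = to_common_grid(t_a, 2.0 * t_a, grid)   # samples of Y(t) = 2t
curve_b = to_common_grid(t_b, 1.0 - t_b, grid)   # samples of Y(t) = 1 - t
curves = np.vstack([curve_a, curve_b])           # one row per unit, shape (2, 5)
```

After this step all curves share the same instants, so they can be stacked in a single array and passed to the estimators of this section.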

The covariance function of the Horvitz-Thompson estimator, $γ (r, t) = \operatorname{cov} (\hat{μ} (r), \hat{μ} (t))$, is given by

$γ (r, t) = \frac{1}{N^{2}} \sum_{k \in U} \sum_{l \in U} Δ_{k l} \frac{Y_{k} (r)}{π_{k}} \frac{Y_{l} (t)}{π_{l}}$

for any $(r, t) \in [0, T] \times [0, T]$, where $Δ_{k l} = π_{k l} - π_{k} π_{l}$ and $π_{k k} = π_{k}$. If the second-order inclusion probabilities satisfy $π_{k l} > 0$ for all $k, l \in U$, an unbiased estimator of $γ (r, t)$ is given by the Horvitz-Thompson variance estimator,

$\hat{γ} (r, t) = \frac{1}{N^{2}} \sum_{k \in s} \sum_{l \in s} \frac{Δ_{k l}}{π_{k l}} \frac{Y_{k} (r)}{π_{k}} \frac{Y_{l} (t)}{π_{l}} (2.2)$

for any $(r, t) \in [0, T] \times [0, T] .$
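
On a grid of $D$ instants, estimator (2.2) is a quadratic form in the weighted curves. The sketch below is one possible vectorized version; the name `ht_cov` and the convention that the diagonal of the joint-probability matrix holds $π_{k}$ are assumptions of this example. It is checked on SRSWOR, where the joint inclusion probabilities are known in closed form.

```python
import numpy as np

def ht_cov(Y_s, pi_s, pi_kl_s, N):
    """Estimator (2.2) on a grid: returns the (D, D) matrix of gamma_hat(r, t).

    Y_s     : (n, D) sampled curves
    pi_s    : (n,) first-order inclusion probabilities
    pi_kl_s : (n, n) joint inclusion probabilities, diagonal equal to pi_s
    N       : population size
    """
    Y_s = np.asarray(Y_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    pi_kl_s = np.asarray(pi_kl_s, dtype=float)
    Z = Y_s / pi_s[:, None]                            # Y_k(t) / pi_k
    W = (pi_kl_s - np.outer(pi_s, pi_s)) / pi_kl_s     # Delta_kl / pi_kl
    return Z.T @ W @ Z / N**2

# SRSWOR example: pi_k = n/N and pi_kl = n(n-1) / (N(N-1)) for k != l
rng = np.random.default_rng(1)
N, n, D = 50, 10, 4
Y_s = rng.normal(size=(n, D))
pi_s = np.full(n, n / N)
pi_kl_s = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_kl_s, n / N)

gamma_hat = ht_cov(Y_s, pi_s, pi_kl_s, N)
srswor_formula = (1 / n - 1 / N) * np.cov(Y_s, rowvar=False)
```

For this design the general formula and the SRSWOR closed form (sample covariance scaled by $1/n - 1/N$) agree up to rounding.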

2.1 Using auxiliary information for estimating the mean trajectory

It is well known that using auxiliary information that explains the variable of interest well can greatly improve the precision of the Horvitz-Thompson estimator. In the case of the EDF data, the outside temperature or the type of contract could probably be useful auxiliary variables. A stratification based on geographic position would also yield estimates for different regions. In this study, the auxiliary variable is the total electricity consumption for the previous week. We assume that this real-valued variable is observed for all units in the population.

In this section, we present the Horvitz-Thompson estimator for the mean curve as well as an estimate of the covariance function of this estimator, both for a stratified design using simple random sampling without replacement (SRSWOR) in each stratum, denoted hereafter as STRAT, and for probability-proportional-to-size (PPS) sampling without replacement, denoted hereafter as $πps$. We also consider an estimator of the mean curve assisted by a functional linear model.

2.1.1 Stratified sampling with SRSWOR in each stratum (STRAT)

The population $U$ is assumed to be stratified into a fixed number $H$ of strata $U_{1}, \dots, U_{H}$ of sizes $N_{1}, \dots, N_{H} .$ Within each stratum $U_{h},$ a sample $s_{h}$ of size $n_{h}$ is drawn according to an SRSWOR design.

We denote by $μ_{h} (t) = \sum_{k \in U_{h}} Y_{k} (t) / N_{h}$, for $t \in [0, T]$, the mean curve in stratum $h$, and by ${\hat{μ}}_{h} (t) = \sum_{k \in s_{h}} Y_{k} (t) / n_{h}$ its estimate. The estimator of the mean curve $μ$ is then defined by

${\hat{μ}}_{strat} (t) = \frac{1}{N} \sum_{h = 1}^{H} N_{h} {\hat{μ}}_{h} (t) = \sum_{h = 1}^{H} \frac{N_{h}}{N} (\frac{1}{n_{h}} \sum_{k \in s_{h}} Y_{k} (t)), t \in [0, T] . (2.3)$

The Horvitz-Thompson estimator of the covariance function $γ$ is then

${\hat{γ}}_{strat} (r, t) = \frac{1}{N^{2}} \sum_{h = 1}^{H} N_{h}^{2} \left(\frac{1}{n_{h}} - \frac{1}{N_{h}}\right) S_{Y (r) Y (t), s_{h}}, \quad r, t \in [0, T], \qquad (2.4)$

where

$S_{Y (r) Y (t), s_{h}} = \frac{1}{n_{h} - 1} \sum_{k \in s_{h}} (Y_{k} (r) - {\hat{μ}}_{h} (r)) (Y_{k} (t) - {\hat{μ}}_{h} (t))$

is the estimator of the covariance function $S_{Y (r) Y (t), U_{h}}$ in stratum $h .$ For $r = t \in [0, T],$ we obtain the estimator of the variance function as follows:

${\hat{γ}}_{strat} (r) = \frac{1}{N^{2}} \sum_{h = 1}^{H} N_{h}^{2} \left(\frac{1}{n_{h}} - \frac{1}{N_{h}}\right) S_{Y (r), s_{h}}^{2},$

where

$S_{Y (r), s_{h}}^{2} = \frac{1}{n_{h} - 1} \sum_{k \in s_{h}} {(Y_{k} (r) - {\hat{μ}}_{h} (r))}^{2}$

is the estimator of the variance $S_{Y (r), U_{h}}^{2}$ in stratum $h$. Cardot and Josserand (2011) propose an extension of Neyman's optimal allocation to the functional framework. When the sizes $n_{h}$ of the samples $s_{h}$ satisfy

$n_{h} = n \frac{N_{h} \sqrt{\int_{0}^{T} S_{Y (r), U_{h}}^{2} d r}}{\sum_{h = 1}^{H} N_{h} \sqrt{\int_{0}^{T} S_{Y (r), U_{h}}^{2} d r}}, h = 1,..., H, (2.5)$

the integrated variance of the stratified estimator, $\int_{0}^{T} \operatorname{var} ({\hat{μ}}_{strat} (t)) d t$, is minimized. This allocation is similar to the one obtained in a multivariate context by Cochran (1977). By replacing the variable $Y$ with another variable $X$ that is known for the entire population and is highly correlated with the variable of interest, we obtain an allocation that can be described as $x$-optimal.
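
Allocation (2.5) can be sketched numerically once the stratum variance functions are available on a grid, for instance those of the auxiliary variable $X$ in the $x$-optimal case. The function name `functional_neyman_allocation`, the trapezoidal approximation of the integrals and the toy values are assumptions of this example; rounding to integers and capping at $N_{h}$ are deliberately left out.

```python
import numpy as np

def functional_neyman_allocation(n, N_h, S2_h, t):
    """Real-valued stratum sample sizes following (2.5).

    n    : total sample size
    N_h  : (H,) stratum population sizes
    S2_h : (H, D) stratum variance functions evaluated on the grid t
    t    : (D,) increasing grid of instants on [0, T]
    """
    N_h = np.asarray(N_h, dtype=float)
    S2_h = np.asarray(S2_h, dtype=float)
    t = np.asarray(t, dtype=float)
    # Trapezoidal approximation of the integral of each variance function
    integrated = np.sum(0.5 * (S2_h[:, 1:] + S2_h[:, :-1]) * np.diff(t), axis=1)
    weights = N_h * np.sqrt(integrated)
    return n * weights / weights.sum()

# Three strata with constant variance functions 4, 1 and 0.25 on [0, 1]
t = np.linspace(0.0, 1.0, 11)
N_h = np.array([100, 300, 600])
S2_h = np.vstack([np.full(11, 4.0), np.full(11, 1.0), np.full(11, 0.25)])
n_h = functional_neyman_allocation(60, N_h, S2_h, t)
```

Here the weights $N_{h} \sqrt{\int S^{2}}$ are 200, 300 and 300, so the 60 units are split as 15, 22.5 and 22.5 before rounding: the small but highly variable first stratum is sampled at a much higher rate than the others.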

Note 2.1 For $H = 1$, we obtain the simple random sampling without replacement (SRSWOR) design, and the mean curve $μ (t)$ is estimated by

${\hat{μ}}_{srswor} (t) = \frac{1}{n} \sum_{k \in s} Y_{k} (t), t \in [0, T] . (2.6)$

The estimator of the covariance function defined in (2.2) is then

${\hat{γ}}_{srswor} (r, t) = (\frac{1}{n} - \frac{1}{N}) S_{Y (r) Y (t), s} . (2.7)$
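
The stratified formulas (2.3) and (2.4) can be sketched on a grid as below, and the single-stratum case then reproduces (2.6) and (2.7). The function name `strat_mean_and_cov`, the dictionary of stratum sizes and the simulated curves are assumptions of this example; each stratum needs at least two sampled units.

```python
import numpy as np

def strat_mean_and_cov(Y_s, strata_s, N_h):
    """Stratified estimator (2.3) and covariance estimator (2.4) on a grid.

    Y_s      : (n, D) sampled curves
    strata_s : (n,) stratum label of each sampled unit
    N_h      : dict mapping stratum label -> stratum population size
    """
    Y_s = np.asarray(Y_s, dtype=float)
    strata_s = np.asarray(strata_s)
    N = sum(N_h.values())
    D = Y_s.shape[1]
    mu = np.zeros(D)
    gamma = np.zeros((D, D))
    for h, Nh in N_h.items():
        Yh = Y_s[strata_s == h]
        nh = Yh.shape[0]
        mu += Nh * Yh.mean(axis=0)
        S_h = np.atleast_2d(np.cov(Yh, rowvar=False))   # divides by n_h - 1
        gamma += Nh**2 * (1 / nh - 1 / Nh) * S_h
    return mu / N, gamma / N**2

rng = np.random.default_rng(2)
Y_s = rng.normal(size=(12, 3))
strata_s = np.array([0] * 5 + [1] * 7)
mu_hat, gamma_hat = strat_mean_and_cov(Y_s, strata_s, {0: 40, 1: 60})

# With a single stratum of size N = 100 the formulas reduce to SRSWOR
mu_one, gamma_one = strat_mean_and_cov(Y_s, np.zeros(12, dtype=int), {0: 100})
```

In the single-stratum call, `mu_one` is the plain sample mean curve and `gamma_one` is the sample covariance scaled by $1/n - 1/N$.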

2.1.2 PPS sampling without replacement ($πps$)

PPS sampling designs, with or without replacement, are often used in practice because they are more efficient than equal-probability designs when the variable of interest is approximately proportional to an auxiliary variable $X$ that takes strictly positive values.

For samples of fixed size $n$ drawn without replacement, a functional analogue of the formula of Yates and Grundy (1953) and Sen (1953) can be given. The covariance function of $\hat{μ}$ satisfies

$γ (r, t) = - \frac{1}{2} \frac{1}{N^{2}} \sum_{k \in U} \sum_{l \in U, l \neq k} (π_{k l} - π_{k} π_{l}) \left(\frac{Y_{k} (r)}{π_{k}} - \frac{Y_{l} (r)}{π_{l}}\right) \left(\frac{Y_{k} (t)}{π_{k}} - \frac{Y_{l} (t)}{π_{l}}\right), \quad r, t \in [0, T] . \qquad (2.8)$

Assume that the values $x_{k}$ of variable $X$ are known for all units $k$ in the population. It is then possible to define the inclusion probabilities as follows:

$π_{k} = n \frac{x_{k}}{\sum_{l \in U} x_{l}} .$

This formula can yield $π_{k} > 1$ for units with very large values $x_{k}$; methods for handling this case have been proposed in the literature (Särndal et al. 1992).
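
A minimal sketch of the computation of the $π_{k}$ is given below. The treatment of values above 1 is an assumption of this example and is not detailed in the source: one common convention is to include such units with certainty and to recompute the probabilities of the remaining units so that they still sum to the sample size. The function name `pps_inclusion_probs` is hypothetical.

```python
import numpy as np

def pps_inclusion_probs(x, n):
    """Inclusion probabilities proportional to x, summing to n.

    Assumed convention: units whose probability would exceed 1 are taken
    with certainty (pi_k = 1) and the others are recomputed iteratively.
    Requires x > 0 and n smaller than the population size.
    """
    x = np.asarray(x, dtype=float)
    pi = np.empty_like(x)
    certain = np.zeros(x.shape, dtype=bool)   # units selected with probability 1
    while True:
        rest = ~certain
        pi[certain] = 1.0
        pi[rest] = (n - certain.sum()) * x[rest] / x[rest].sum()
        over = rest & (pi > 1.0)
        if not over.any():
            return pi
        certain |= over

pi_plain = pps_inclusion_probs([1, 2, 3, 4], n=2)        # no value above 1
pi_capped = pps_inclusion_probs([1, 1, 1, 1, 10], n=2)   # last unit is capped
```

In the second call the raw value for the last unit would be $2 \times 10 / 14 > 1$, so it is set to 1 and the remaining unit of sample size is shared equally among the four other units.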

Second-order inclusion probabilities are generally very difficult to calculate for $πps$ designs, so Formula (2.2) cannot be used directly. However, Hájek (1964) proposed a simple asymptotic approximation of the variance that involves only first-order inclusion probabilities. This approximation proves to be very effective when the sample is large and the entropy of the sampling design is close to the maximum entropy. To select the sample $s$ with inclusion probabilities $π_{k}$, the cube algorithm (Deville and Tillé 2004) balanced on the variable $π = {(π_{k})}_{k \in U}$ can be used. Deville and Tillé (2005) show that for this particular sampling design, the Hájek formula is highly effective for estimating the variance of a total or a mean. The same approximation can also be used for the covariance, which is then estimated by

${\hat{γ}}_{π ps} (r, t) = \frac{1}{N^{2}} \sum_{k \in s} (1 - π_{k}) (\frac{Y_{k} (r)}{π_{k}} - \hat{R} (r)) (\frac{Y_{k} (t)}{π_{k}} - \hat{R} (t)), r, t \in [0, T], (2.9)$

where

$\hat{R} (t) = \frac{\sum_{k \in s} \frac{Y_{k} (t)}{π_{k}} (1 - π_{k})}{\sum_{k \in s} (1 - π_{k})} .$
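
Formula (2.9) only needs the sampled curves and their first-order inclusion probabilities, which makes it easy to sketch on a grid. The function name `hajek_cov` and the simulated inputs are assumptions of this example; the code does not handle the degenerate case where every sampled unit has $π_{k} = 1$.

```python
import numpy as np

def hajek_cov(Y_s, pi_s, N):
    """Hajek-type covariance estimator (2.9) on a grid, as a (D, D) matrix.

    Y_s  : (n, D) sampled curves
    pi_s : (n,) first-order inclusion probabilities of the sampled units
    N    : population size
    """
    Y_s = np.asarray(Y_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    Z = Y_s / pi_s[:, None]                            # Y_k(t) / pi_k
    w = 1.0 - pi_s
    R_hat = (w[:, None] * Z).sum(axis=0) / w.sum()     # weighted mean of Z
    E = Z - R_hat
    return (w[:, None] * E).T @ E / N**2

rng = np.random.default_rng(3)
N, n, D = 200, 20, 6
pi_s = rng.uniform(0.05, 0.3, size=n)
Y_s = rng.normal(size=(n, D))
gamma_hat = hajek_cov(Y_s, pi_s, N)

# If Y_k(t) is exactly proportional to pi_k, every Y_k(t)/pi_k equals R_hat(t)
gamma_zero = hajek_cov(np.outer(pi_s, np.arange(1.0, D + 1)), pi_s, N)
```

The second call illustrates why a $πps$ design is attractive: when the curves are exactly proportional to the inclusion probabilities, the estimated covariance vanishes.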

We also used the systematic sampling with unequal probabilities proposed by Madow (1949), since it is simple to use. Unfortunately, it is difficult to estimate the variance for this type of design, and we will therefore not use it to construct confidence bands.
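
For reference, systematic sampling with unequal probabilities can be sketched in a few lines: the units are laid end to end on $[0, n]$ as intervals of length $π_{k}$, and the units whose intervals contain the points $u, u + 1, \dots, u + n - 1$ are selected, with $u$ uniform on $[0, 1)$. The function name `madow_systematic` and its optional `u` argument (used here to make the example reproducible) are assumptions of this sketch.

```python
import numpy as np

def madow_systematic(pi, u=None, rng=None):
    """Systematic unequal-probability sampling in the given unit order.

    pi  : (N,) inclusion probabilities in (0, 1], summing to an integer n
    u   : optional starting point in [0, 1); drawn uniformly when omitted
    rng : optional numpy Generator used to draw u
    """
    pi = np.asarray(pi, dtype=float)
    n = int(round(pi.sum()))
    if u is None:
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform()
    V = np.cumsum(pi)                        # right ends of the unit intervals
    points = u + np.arange(n)                # u, u + 1, ..., u + n - 1
    idx = np.searchsorted(V, points, side="right")
    return np.minimum(idx, pi.size - 1)      # guard against rounding in V[-1]

pi = np.array([0.2, 0.4, 0.6, 0.8])          # sums to n = 2
sample = madow_systematic(pi, u=0.5)         # selects units 1 and 3 (0-based)
```

With `u=0.5`, the points 0.5 and 1.5 fall in the intervals of the second and fourth units. Because only $u$ is random, many pairs of units can never be selected together, which is one reason variance estimation is difficult for this design.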

2.2 The model-assisted estimator

Consider $p$ real auxiliary variables $X_{1}, \dots, X_{p}$ and let $x_{k j}$ be the value of the variable $X_{j}$ for the $k$th individual. Let $x_{k} = (x_{k 1}, \dots, x_{k p})'$ denote the vector containing the values of the $p$ auxiliary variables measured on the $k$th individual. We assume that the relationship between the variable of interest and the auxiliary variables is described by the following superpopulation model:

$ξ : Y_{k} (t) = x_{k}' β (t) + ε_{k t}, \quad t \in [0, T], \qquad (2.10)$

with

$E_{ξ} (ε_{k t}) = 0, \quad E_{ξ} (ε_{k t} ε_{l t'}) = 0 \text{ for } k \neq l, \quad \text{and} \quad E_{ξ} (ε_{k t} ε_{k t'}) = σ_{t t'}^{2} .$

This model is an immediate generalization of the functional linear model proposed by Faraway (1997) to several auxiliary variables.

The estimate of $β$ based on the model $ξ$ and the sampling design $p (.)$ is given by

$\hat{β} (t) = \left(\sum_{k \in s} \frac{x_{k} x_{k}'}{π_{k}}\right)^{- 1} \sum_{k \in s} \frac{x_{k} Y_{k} (t)}{π_{k}}, \quad t \in [0, T] . \qquad (2.11)$

Note that the sampling weights do not depend on the time $t \in [0, T]$. Let ${\hat{Y}}_{k} (t) = x_{k}' \hat{β} (t)$ be the design-based prediction of $Y_{k} (t)$ under the model $ξ$. By direct analogy with the univariate case (Särndal et al. 1992), we finally obtain the following estimator for the mean, for $t \in [0, T]$,

$\begin{aligned} {\hat{μ}}_{MA} (t) &= \frac{1}{N} \sum_{k \in U} {\hat{Y}}_{k} (t) - \frac{1}{N} \sum_{k \in s} \frac{{\hat{Y}}_{k} (t) - Y_{k} (t)}{π_{k}} \\ &= \frac{1}{N} \sum_{k \in s} \frac{Y_{k} (t) - x_{k}' \hat{β} (t)}{π_{k}} + \frac{1}{N} \left(\sum_{k \in U} x_{k}'\right) \hat{β} (t) . \qquad (2.12) \end{aligned}$

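If the model $ξ$ contains the constant variable 1 (an intercept), the weighted residuals $\sum_{k \in s} (Y_{k} (t) - {\hat{Y}}_{k} (t)) / π_{k}$ sum to zero by the normal equations that define $\hat{β} (t)$, and the estimator becomes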

${\hat{μ}}_{M A} (t) = \frac{1}{N} \sum_{k \in U} {\hat{Y}}_{k} (t), t \in [0, T] . (2.13)$
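
The model-assisted estimator can be sketched on a grid as follows, combining the weighted least-squares fit (2.11) with the predictions summed over the population and the weighted residuals summed over the sample. The function name `model_assisted_mean`, the simulated population and the SRSWOR sample are assumptions of this example.

```python
import numpy as np

def model_assisted_mean(Y_s, X_s, pi_s, X_U):
    """Model-assisted estimate of the mean curve on a grid.

    Y_s  : (n, D) sampled curves
    X_s  : (n, p) auxiliary values of the sampled units
    pi_s : (n,) first-order inclusion probabilities of the sampled units
    X_U  : (N, p) auxiliary values of every unit in the population
    """
    Y_s, X_s, X_U = (np.asarray(a, dtype=float) for a in (Y_s, X_s, X_U))
    w = 1.0 / np.asarray(pi_s, dtype=float)
    XtW = (X_s * w[:, None]).T
    beta_hat = np.linalg.solve(XtW @ X_s, XtW @ Y_s)   # (p, D), equation (2.11)
    N = X_U.shape[0]
    resid = Y_s - X_s @ beta_hat
    # Weighted sample residuals plus population sum of the predictions
    mu_hat = (w[:, None] * resid).sum(axis=0) / N + X_U.sum(axis=0) @ beta_hat / N
    return mu_hat, beta_hat

rng = np.random.default_rng(4)
N, n, D = 100, 15, 5
x = rng.uniform(1.0, 3.0, size=N)
X_U = np.column_stack([np.ones(N), x])                 # intercept plus one covariate
beta_true = np.vstack([np.linspace(0, 1, D), np.linspace(2, 3, D)])
Y_U = X_U @ beta_true + rng.normal(scale=0.1, size=(N, D))
s = rng.choice(N, size=n, replace=False)               # SRSWOR sample
pi_s = np.full(n, n / N)
mu_hat, beta_hat = model_assisted_mean(Y_U[s], X_U[s], pi_s, X_U)
```

Because the design matrix contains an intercept, the residual term vanishes and `mu_hat` coincides with the mean of the predictions over the population, as in (2.13). Solving the linear system once for all $D$ instants reflects the fact that the sampling weights do not depend on $t$.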

For fixed $r$ and $t$ , the asymptotic covariance of ${\hat{μ}}_{M A} (r)$ and ${\hat{μ}}_{M A} (t)$ can be calculated according to the classical residual technique (Särndal et al. 1992),

$γ_{M A} (r, t) ≃ \frac{1}{N^{2}} \sum_{k \in U} \sum_{l \in U} (π_{k l} - π_{k} π_{l}) \frac{(Y_{k} (r) - {\tilde{Y}}_{k} (r))}{π_{k}} \frac{(Y_{l} (t) - {\tilde{Y}}_{l} (t))}{π_{l}}, (2.14)$

where ${\tilde{Y}}_{k} (t) = x_{k}' \tilde{β} (t)$ is the prediction of $Y_{k} (t)$ under the superpopulation model and $\tilde{β} (t) = \left(\sum_{k \in U} x_{k} x_{k}'\right)^{- 1} \sum_{k \in U} x_{k} Y_{k} (t)$ is the estimate of $β (t)$ at the level of the population, for $r, t \in [0, T]$. Cardot, Goga and Lardin (2013) show that this result remains valid uniformly in $r, t \in [0, T]$.

As an estimator of the covariance function $γ_{M A} (r, t)$, we propose the Horvitz-Thompson estimator of the asymptotic covariance given by (2.14), in which $\tilde{β} (t)$ is replaced by its estimator $\hat{β} (t)$ based on the sampling design,

${\hat{γ}}_{M A} (r, t) = \frac{1}{N^{2}} \sum_{k, l \in s} \frac{π_{k l} - π_{k} π_{l}}{π_{k l}} \frac{(Y_{k} (r) - {\hat{Y}}_{k} (r))}{π_{k}} \frac{(Y_{l} (t) - {\hat{Y}}_{l} (t))}{π_{l}}, r, t \in [0, T] . (2.15)$
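
Estimator (2.15) has the same quadratic-form structure as (2.2), applied to the residuals of the fitted model instead of the curves themselves. The sketch below assumes that the joint inclusion probabilities of the sampled units are available as a matrix whose diagonal holds $π_{k}$; the function name `model_assisted_cov` and the simulated data are hypothetical.

```python
import numpy as np

def model_assisted_cov(Y_s, X_s, pi_s, pi_kl_s, N):
    """Covariance estimator (2.15) on a grid, as a (D, D) matrix.

    Y_s     : (n, D) sampled curves
    X_s     : (n, p) auxiliary values of the sampled units
    pi_s    : (n,) first-order inclusion probabilities
    pi_kl_s : (n, n) joint inclusion probabilities, diagonal equal to pi_s
    N       : population size
    """
    Y_s = np.asarray(Y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    pi_kl_s = np.asarray(pi_kl_s, dtype=float)
    XtW = (X_s / pi_s[:, None]).T
    beta_hat = np.linalg.solve(XtW @ X_s, XtW @ Y_s)   # equation (2.11)
    E = (Y_s - X_s @ beta_hat) / pi_s[:, None]         # residuals over pi_k
    W = (pi_kl_s - np.outer(pi_s, pi_s)) / pi_kl_s     # Delta_kl / pi_kl
    return E.T @ W @ E / N**2

# SRSWOR sample with an intercept-only model: residuals are centred curves
rng = np.random.default_rng(5)
N, n, D = 80, 12, 4
Y_s = rng.normal(size=(n, D))
pi_s = np.full(n, n / N)
pi_kl_s = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_kl_s, n / N)
gamma_ma = model_assisted_cov(Y_s, np.ones((n, 1)), pi_s, pi_kl_s, N)
```

With an intercept-only model under SRSWOR, the result coincides with the SRSWOR covariance estimator (sample covariance scaled by $1/n - 1/N$); informative auxiliary variables shrink the residuals and hence the estimated covariance.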

Note 2.2 It is entirely possible to consider a superpopulation model $ξ$ that is more general than the linear model proposed here. Estimation techniques based on smoothing by B-splines (Goga and Ruiz-Gazen 2012) can then also be considered. In our study, the relationship between consumption at instant $t$ and the mean consumption for the previous week is almost linear (cf. Figure 4.1), which justifies not using these non-parametric approaches.

Date modified: 2017-09-20
