4. Application to winsorized estimators

Cyril Favre Martinoz, David Haziza and Jean-François Beaumont

Estimator (3.5) can be written in alternative forms, which can make it easier to implement in some cases. We consider the winsorized form, which has been widely studied in the literature. As mentioned in Section 1, we distinguish between standard winsorization and Dalén-Tambay winsorization.

Standard winsorization involves decreasing the value of units that are above a particular threshold, taking their weight into account. Let ${\tilde{y}}_{i}$ be the value of variable $y$ for unit $i$ after winsorization. We have

${\tilde{y}}_{i} = \begin{cases} y_{i} & \text{if } d_{i} y_{i} \leq K \\ \frac{K}{d_{i}} & \text{if } d_{i} y_{i} > K \end{cases} \qquad (4.1)$

where $K > 0$ is the winsorization threshold. The standard winsorized estimator of the total $t$ is given by

$\begin{array}{rcl} {\hat{t}}_{s} & = & \sum_{i \in S} d_{i} {\tilde{y}}_{i} \\ & = & \hat{t} + Δ (K), \end{array} \qquad (4.2)$

where

$Δ (K) = - \sum_{i \in S} \max (0, d_{i} y_{i} - K) .$

Hence, the estimator (4.2) can be written in the form (3.1). An alternative is to express ${\hat{t}}_{s}$ as a weighted sum of the initial values using modified weights:

${\hat{t}}_{s} = \sum_{i \in S} {\tilde{d}}_{i} y_{i},$

where

${\tilde{d}}_{i} = d_{i} \frac{\min (y_{i}, \frac{K}{d_{i}})}{y_{i}} . \qquad (4.3)$

If $\min (y_{i}, K / d_{i}) = y_{i}$ (that is, if unit $i$ is not influential), then ${\tilde{d}}_{i} = d_{i}.$ Thus, the weight of a non-influential unit is not modified. In contrast, the modified weight of an influential unit is less than $d_{i}$ and may even be less than 1. It is worth noting that a unit with a value of $y_{i} = 0$ presents no particular problems, since its contribution to the estimated total, ${\hat{t}}_{s},$ is zero. In this case, an arbitrary value can be assigned to the modified weight ${\tilde{d}}_{i}.$
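As a rough illustration, the following Python sketch applies (4.1) to (4.3) to a small made-up sample. The function name `standard_winsorize`, the data and the threshold are assumptions chosen for the example and are not taken from the study.

```python
def standard_winsorize(y, d, K):
    """Standard winsorization, following (4.1)-(4.3).

    y: sample values, d: design weights (d_i > 0), K: threshold (K > 0).
    Returns (y_tilde, d_tilde, t_s).
    """
    y_tilde, d_tilde = [], []
    for yi, di in zip(y, d):
        # (4.1): cap the weighted value d_i * y_i at K
        yt = yi if di * yi <= K else K / di
        y_tilde.append(yt)
        # (4.3): modified weight; for y_i = 0 any value works, so keep d_i
        d_tilde.append(di if yi == 0 else di * yt / yi)
    # (4.2): winsorized estimator of the total
    t_s = sum(di * yt for di, yt in zip(d, y_tilde))
    return y_tilde, d_tilde, t_s


# Made-up data: three units with equal weights, one of them influential.
y = [10.0, 50.0, 200.0]
d = [10.0, 10.0, 10.0]
y_tilde, d_tilde, t_s = standard_winsorize(y, d, K=1000.0)
print(y_tilde)  # [10.0, 50.0, 100.0]
print(d_tilde)  # [10.0, 10.0, 5.0]
print(t_s)      # 1600.0
```

In this example only the third unit exceeds the threshold. Its weight drops from 10 to 5, and the estimate falls from the expansion estimate of 2,600 to 1,600, a reduction equal to $\max (0, d_{i} y_{i} - K) = 1,000.$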

In the case of Dalén-Tambay winsorization, the values of the variable of interest after winsorization are defined by

${\tilde{y}}_{i} = \begin{cases} y_{i} & \text{if } d_{i} y_{i} \leq K \\ \frac{K}{d_{i}} + \frac{1}{d_{i}} (y_{i} - \frac{K}{d_{i}}) & \text{if } d_{i} y_{i} > K \end{cases} . \qquad (4.4)$

This leads to the winsorized estimator of the total $t_{y}$:

$\begin{array}{rcl} {\hat{t}}_{DT} & = & \sum_{i \in S} d_{i} {\tilde{y}}_{i} \\ & = & \hat{t} + Δ (K), \end{array} \qquad (4.5)$

where

$Δ (K) = - \sum_{i \in S} \frac{(d_{i} - 1)}{d_{i}} \max (0, d_{i} y_{i} - K) .$
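
To see where this expression comes from, note that a unit with $d_{i} y_{i} \leq K$ is left unchanged, whereas for a unit with $d_{i} y_{i} > K,$ definition (4.4) gives

$d_{i} {\tilde{y}}_{i} - d_{i} y_{i} = K + y_{i} - \frac{K}{d_{i}} - d_{i} y_{i} = - (d_{i} y_{i} - K) + \frac{1}{d_{i}} (d_{i} y_{i} - K) = - \frac{(d_{i} - 1)}{d_{i}} (d_{i} y_{i} - K) .$

Summing these differences over $i \in S$ yields the expression for $Δ (K)$ above.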

Estimator (4.5) can also be written in the form (3.1). As in the case of ${\hat{t}}_{s},$ an alternative is to express ${\hat{t}}_{DT}$ as a weighted sum of the initial values using modified weights:

${\hat{t}}_{DT} = \sum_{i \in S} {\tilde{d}}_{i} y_{i},$

where

${\tilde{d}}_{i} = 1 + (d_{i} - 1) \frac{\min (y_{i}, \frac{K}{d_{i}})}{y_{i}} . \qquad (4.6)$

As in the case of the standard winsorized estimator, the weight of a non-influential unit is not modified. Unlike standard winsorization, Dalén-Tambay winsorization guarantees that the modified weights will not be less than 1. Once again, a unit with a value of $y_{i} = 0$ presents no particular problems, since its contribution to the estimated total, ${\hat{t}}_{DT},$ is zero. In this case, an arbitrary value can be assigned to the modified weight ${\tilde{d}}_{i} .$
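The Dalén-Tambay definitions (4.4) to (4.6) can be sketched in Python along the same lines. The function name `dalen_tambay_winsorize`, the data and the threshold are again assumptions made only for illustration.

```python
def dalen_tambay_winsorize(y, d, K):
    """Dalen-Tambay winsorization, following (4.4)-(4.6).

    y: sample values, d: design weights (d_i >= 1), K: threshold (K > 0).
    Returns (y_tilde, d_tilde, t_dt).
    """
    y_tilde, d_tilde = [], []
    for yi, di in zip(y, d):
        if di * yi <= K:
            yt = yi
        else:
            # (4.4): the excess of y_i over K / d_i is scaled by 1 / d_i
            yt = K / di + (yi - K / di) / di
        y_tilde.append(yt)
        # (4.6): modified weight; for y_i = 0 any value works, so keep d_i
        if yi == 0:
            d_tilde.append(di)
        else:
            d_tilde.append(1.0 + (di - 1.0) * min(yi, K / di) / yi)
    # (4.5): winsorized estimator of the total
    t_dt = sum(di * yt for di, yt in zip(d, y_tilde))
    return y_tilde, d_tilde, t_dt


# Made-up data: three units with equal weights, one of them influential.
y = [10.0, 50.0, 200.0]
d = [10.0, 10.0, 10.0]
y_tilde, d_tilde, t_dt = dalen_tambay_winsorize(y, d, K=1000.0)
print(y_tilde)  # [10.0, 50.0, 110.0]
print(d_tilde)  # [10.0, 10.0, 5.5]
print(t_dt)     # 1700.0
```

Here the influential unit keeps a modified weight of 5.5, which stays above 1, and the estimate is reduced by $\frac{(d_{i} - 1)}{d_{i}} \max (0, d_{i} y_{i} - K) = 900.$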

Since the standard and Dalén-Tambay winsorized estimators are of the form (3.1), the optimal constant $K_{opt}$ that minimizes (3.2) is obtained by solving

$Δ (K) = - \frac{1}{2} ({\hat{B}}_{min} + {\hat{B}}_{max}),$

that is,

$\sum_{j \in S} a_{j} \max (0, d_{j} y_{j} - K) = \frac{{\hat{B}}_{min} + {\hat{B}}_{max}}{2}, \qquad (4.7)$

where $a_{j} = 1$ in the case of ${\hat{t}}_{s}$ and $a_{j} = (d_{j} - 1) / d_{j}$ in the case of ${\hat{t}}_{DT} .$ It is shown in the Appendix that a solution to equation (4.7) exists under the following conditions:

1. $π_{i j} - π_{i} π_{j} \leq 0$; and
2. $\frac{1}{2} ({\hat{B}}_{min} + {\hat{B}}_{max}) \geq 0.$

Condition 1 is satisfied for most one-stage designs used in practice, such as stratified simple random sampling and Poisson sampling. Condition 2 implies that ${\hat{t}}_{R}$ must be less than or equal to $\hat{t},$ since by construction, a winsorized estimator cannot be greater than the Horvitz-Thompson estimator. It is generally expected that Condition 2 will be satisfied in most skewed populations encountered in business surveys and social surveys. It is also shown in the Appendix that the solution to equation (4.7) is unique if the above conditions are met and if $y_{i} \geq 0$ for $i \in S .$ The Appendix contains a brief description of an algorithm for finding the solution to equation (4.7).
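The left-hand side of (4.7) is continuous and non-increasing in $K$ when $a_{j} \geq 0,$ so one simple way to solve the equation numerically is bisection. The Python sketch below takes that route. It is not necessarily the algorithm described in the Appendix, and the function name `solve_threshold`, its arguments and the target value 900 (a stand-in for $({\hat{B}}_{min} + {\hat{B}}_{max}) / 2$) are assumptions made for illustration.

```python
def solve_threshold(y, d, target, a=None, tol=1e-10, max_iter=200):
    """Solve sum_j a_j * max(0, d_j * y_j - K) = target for K by bisection.

    Assumes y_j >= 0, a_j >= 0 and target >= 0 (Condition 2).
    a defaults to 1 for every unit (standard winsorization).
    """
    z = [di * yi for yi, di in zip(y, d)]
    if a is None:
        a = [1.0] * len(z)

    def g(K):
        # Left-hand side of (4.7); non-increasing in K
        return sum(aj * max(0.0, zj - K) for aj, zj in zip(a, z))

    lo, hi = 0.0, max(max(z), 0.0)
    if target < 0 or g(lo) < target:
        raise ValueError("target is not attainable for any K >= 0")
    # Invariant: g(lo) >= target >= g(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol:
            break
    return hi


# Made-up data and a hypothetical target of 900.
y = [10.0, 50.0, 200.0]
d = [10.0, 10.0, 10.0]
K_s = solve_threshold(y, d, target=900.0)
a_dt = [(di - 1.0) / di for di in d]
K_dt = solve_threshold(y, d, target=900.0, a=a_dt)
print(round(K_s, 6))   # 1100.0
print(round(K_dt, 6))  # 1000.0
```

The two thresholds differ, but in both cases the winsorized estimate equals the expansion estimate of 2,600 minus the target of 900.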

It should be noted that while the value $K_{opt}$ is different for each type of winsorized estimator used, the resulting robust estimators are identical. In other words, we have

${\hat{t}}_{s} (K_{opt}) = {\hat{t}}_{DT} (K_{opt}) = {\hat{t}}_{R} = \hat{t} - \frac{{\hat{B}}_{min} + {\hat{B}}_{max}}{2} . \qquad (4.8)$

To compare the influence of each population unit with respect to the (non-robust) expansion estimator, $\hat{t},$ and its robust version (4.8), we carried out a simulation study. For that purpose, we generated two populations, each of size $N = 100.$ One population was generated according to a normal distribution with mean 4,108 and standard deviation 1,500, and the other was generated according to a lognormal distribution with mean 4,108 and standard deviation 7,373. From each population we selected $M = 500,000$ samples according to two sampling designs: (i) a simple random sampling without-replacement design of size $n = 10,$ and (ii) a Bernoulli design of expected size $n = 10.$ First, we calculated the conditional bias of the Horvitz-Thompson estimator for a simple random sampling without-replacement design, given in (2.3) and for a Bernoulli design, given in (2.4). Note that the conditional bias of the Horvitz-Thompson estimator does not have to be approximated by simulation since all the population parameters are known. The conditional bias associated with unit $i$ of the robust estimator given in (3.3) was approximated as follows: Out of the 500,000 selected samples, we identified those which contained unit $i .$ In each of these samples, we calculated the error, ${\hat{t}}_{R} - t .$ Finally, we calculated the average value of ${\hat{t}}_{R} - t$ over all the samples containing unit $i .$
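The Monte Carlo approximation just described can be sketched in Python as follows, for simple random sampling without replacement. The function name `mc_conditional_bias` and its arguments are assumptions made for illustration. The estimator is passed in as a function of the sampled indices, so the example uses the expansion estimator, and a robust estimator would be plugged in the same way. The population and the number of samples are deliberately tiny and are not those of the study.

```python
import random


def mc_conditional_bias(y, n, estimator, n_samples, seed=0):
    """Approximate the conditional bias of `estimator` for each unit.

    For each unit i, average the error (estimate - true total) over the
    simulated SRSWOR samples of size n that contain unit i.
    """
    rng = random.Random(seed)
    N = len(y)
    t = sum(y)  # true population total
    err_sum = [0.0] * N
    count = [0] * N
    for _ in range(n_samples):
        sample = rng.sample(range(N), n)  # indices drawn without replacement
        err = estimator(sample) - t
        for i in sample:
            err_sum[i] += err
            count[i] += 1
    # Units never sampled get NaN rather than a misleading zero
    return [err_sum[i] / count[i] if count[i] else float("nan")
            for i in range(N)]


# Made-up population: 19 small values and one large value.
y = [1.0] * 19 + [100.0]
N, n = len(y), 4


def expansion_estimator(sample):
    # Expansion estimator under SRSWOR: every weight equals N / n
    return (N / n) * sum(y[i] for i in sample)


bias = mc_conditional_bias(y, n, expansion_estimator, n_samples=2000)
print(bias[19])  # 396.0
```

Every sample containing the large unit yields the same estimate, $5 \times 103 = 515,$ against a true total of 119, which is why the approximation for that unit is exactly 396 in this toy example.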

The results for simple random sampling without replacement are shown in Figures 4.1 (a) and 4.1 (b) for the normal and lognormal populations respectively; the results for Bernoulli sampling are shown in Figures 4.1 (c) and 4.1 (d) for the same two populations. In each figure, the absolute value of the conditional bias of ${\hat{t}}_{R}$ is plotted against the absolute value of the conditional bias of $\hat{t}$ for each population unit. Units above the first bisector (the 45-degree line) have a conditional bias associated with ${\hat{t}}_{R}$ whose absolute value is greater than that of the conditional bias associated with $\hat{t}.$

Looking first at the results for simple random sampling without replacement, we see that for the normal population the absolute value of the conditional bias of ${\hat{t}}_{R}$ behaves much like that of $\hat{t},$ which indicates that the influence of the units is not altered substantially by the robustification of the expansion estimator. This result is not surprising, since the population does not contain any highly influential units. In the case of the lognormal distribution, the influence of the values that have a high conditional bias associated with $\hat{t}$ has been reduced substantially. On the other hand, for the majority of the data, the conditional bias of ${\hat{t}}_{R}$ is slightly higher than that of $\hat{t}.$

Turning to the results for Bernoulli sampling, we see that in the case of the normal population the influence of most units has been reduced, since the absolute value of the conditional bias of ${\hat{t}}_{R}$ is substantially lower than that of $\hat{t}.$ In the case of the lognormal distribution, the results are similar to those obtained with simple random sampling without replacement for the same distribution.

Figure 4.1 Absolute value of the conditional biases of the robust and non-robust estimators


Date modified: 2015-11-27


Survey Methodology
