5. Robust estimation of domain totals

Cyril Favre Martinoz, David Haziza and Jean-François Beaumont

In practice, we usually want to produce estimates for population domains as well as an estimate at the global level. Let $t_{g} = \sum_{i \in U_{g}} y_{i}$ be the total of the $y$-variable in domain $g$. We assume that the domains form a partition of the population, so that $t = \sum_{i \in U} y_{i} = \sum_{g = 1}^{G} t_{g}$, where $G$ is the number of domains. Let $S_{g}$ be the set of sampled units in domain $g$. The expansion estimator of $t_{g}$ is given by ${\hat{t}}_{g} = \sum_{i \in S_{g}} d_{i} y_{i}$. We have the consistency relation $\sum_{g = 1}^{G} {\hat{t}}_{g} = \hat{t}$.
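As a small illustration of the consistency relation, the following Python sketch computes the expansion estimator in each domain and at the population level. The sample records, domain labels and the function name `expansion_estimates` are made up for this example and are not part of the original text.

```python
from collections import defaultdict


def expansion_estimates(sample):
    """Return ({domain: t_hat_g}, t_hat) from records (domain, d_i, y_i)."""
    by_domain = defaultdict(float)
    for domain, d, y in sample:
        by_domain[domain] += d * y          # t_hat_g = sum of d_i * y_i over S_g
    total = sum(d * y for _, d, y in sample)  # t_hat  = sum of d_i * y_i over S
    return dict(by_domain), total


# Hypothetical sample: (domain label, design weight d_i, observed value y_i)
sample = [
    ("A", 10.0, 3.0),
    ("A", 10.0, 5.0),
    ("B", 20.0, 1.5),
    ("B", 20.0, 2.5),
    ("C", 5.0, 40.0),
]

t_hat_g, t_hat = expansion_estimates(sample)

# Because the domains partition the sample, the domain estimates add up
# to the population-level estimate.
assert abs(sum(t_hat_g.values()) - t_hat) < 1e-9
```

The relation holds by construction here because every sampled unit belongs to exactly one domain and keeps the same weight $d_{i}$ in both computations.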

In the presence of influential values, we can apply a robust procedure separately for each domain using the method described in Section 3, which leads to $G$ robust estimators, ${\hat{t}}_{R, g}$. A robust estimator of the total at the population level, ${\hat{t}}_{R (agg)}$, is easily obtained by aggregating the robust estimators ${\hat{t}}_{R, g}$. Thus, we have ${\hat{t}}_{R (agg)} = \sum_{g = 1}^{G} {\hat{t}}_{R, g}$. The consistency relation between the domain-level estimates and the population-level estimate is therefore satisfied. However, aggregating $G$ robust estimators, each suffering from a potential bias, may produce a highly biased aggregate robust estimator, ${\hat{t}}_{R (agg)}$. In most cases, the bias of ${\hat{t}}_{R (agg)}$ will be negative, since each of the ${\hat{t}}_{R, g}$ estimators typically has a negative bias.
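The accumulation of bias can be made explicit. Writing $B(\cdot)$ for the bias of an estimator (a notation introduced here only for illustration), linearity of expectation and $t = \sum_{g = 1}^{G} t_{g}$ give

$B({\hat{t}}_{R (agg)}) = E({\hat{t}}_{R (agg)}) - t = \sum_{g = 1}^{G} \{ E({\hat{t}}_{R, g}) - t_{g} \} = \sum_{g = 1}^{G} B({\hat{t}}_{R, g}),$

so domain-level biases of the same sign add up rather than cancel.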

To avoid having an estimator with an unacceptable bias, we first compute the robust estimator (4.8), ${\hat{t}}_{R, g}$, for each domain. Then, we independently compute a robust estimator of the total $t$ in the population, ${\hat{t}}_{R,0}$, also given by (4.8). In this case, however, the consistency relation is no longer necessarily satisfied. In other words, we have ${\hat{t}}_{R,0} \neq \sum_{g = 1}^{G} {\hat{t}}_{R, g}$ in general. It is therefore necessary to force consistency between the robust domain estimates and the aggregate robust estimate using a method similar to calibration. To do so, we compute final robust estimates ${\hat{t}}_{R, g}^{*}$, $g = 0, 1, \dots, G$, that are as close as possible to the initial robust estimates ${\hat{t}}_{R, g}$, based on a particular distance function, and that satisfy the calibration equation

$\sum_{g = 1}^{G} {\hat{t}}_{R, g}^{*} = {\hat{t}}_{R,0}^{*}. \qquad (5.1)$

In the case of the generalized chi-square distance function, we are seeking final robust estimates, ${\hat{t}}_{R, g}^{*},$ such that

$\sum_{g = 0}^{G} \frac{\left( {\hat{t}}_{R, g}^{*} - {\hat{t}}_{R, g} \right)^{2}}{2 q_{g} {\hat{t}}_{R, g}} \qquad (5.2)$

is minimized subject to (5.1). The coefficient $q_{g}$ in the above expression is a weight assigned to the initial estimate in domain $g$, ${\hat{t}}_{R, g}$, and is interpreted as its importance in the minimization problem. Using the method of Lagrange multipliers, we can easily obtain a solution to this minimization problem. The solution is given by

${\hat{t}}_{R, g}^{*} = {\hat{t}}_{R, g} - \frac{\sum_{h = 0}^{G} \delta_{h} {\hat{t}}_{R, h}}{\sum_{h = 0}^{G} q_{h} {\hat{t}}_{R, h}} \delta_{g} q_{g} {\hat{t}}_{R, g}, \qquad (5.3)$

where $\delta_{0} = -1$ and $\delta_{g} = 1$ for $g = 1, \dots, G$.
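A minimal Python sketch of the closed-form solution (5.3) may help to see how the adjustment works. The function name `calibrate_linear`, the numerical values and the choice of $q_{g}$ below are assumptions made for illustration only. The sketch also assumes that all initial robust estimates are strictly positive, so that (5.2) is well defined.

```python
def calibrate_linear(t_r, q):
    """Apply (5.3) to initial robust estimates.

    t_r[0] is the population-level estimate t_R0, t_r[1:] are the domain-level
    estimates t_R1, ..., t_RG; q holds the matching coefficients q_0, ..., q_G.
    """
    if len(t_r) != len(q):
        raise ValueError("t_r and q must have the same length")
    # delta_0 = -1 and delta_g = 1 for g = 1, ..., G
    delta = [-1.0] + [1.0] * (len(t_r) - 1)
    numerator = sum(d * t for d, t in zip(delta, t_r))   # sum_h delta_h * t_Rh
    denominator = sum(w * t for w, t in zip(q, t_r))     # sum_h q_h * t_Rh
    if denominator == 0:
        raise ValueError("at least one q_g must be non-zero")
    ratio = numerator / denominator
    return [t - ratio * d * w * t for t, d, w in zip(t_r, delta, q)]


# Hypothetical inputs: t_R0 = 950, while the domain estimates sum to 900.
t_r = [950.0, 300.0, 400.0, 200.0]
q = [0.5, 1.0, 1.0, 2.0]

t_star = calibrate_linear(t_r, q)

# The final estimates satisfy the calibration equation (5.1).
assert abs(sum(t_star[1:]) - t_star[0]) < 1e-9
```

In this made-up example, every estimate moves: the population-level estimate is pulled down and the domain estimates are pulled up until they agree. An estimate with a larger $q_{g}$ absorbs a larger relative share of the adjustment, and an estimate with $q_{g} = 0$ is returned unchanged.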

We make the following remarks:

(i) If $q_{g} = 0$, then the final robust estimate ${\hat{t}}_{R, g}^{*}$ is identical to the initial robust estimate ${\hat{t}}_{R, g}$. Thus, if we want to ensure that the initial estimate in domain $g$ is not modified excessively, we simply associate it with a small value of $q_{g}$. This point is also illustrated empirically in Section 6.2.

(ii) Like the initial robust estimates at the domain level, ${\hat{t}}_{R, g}$ for $g = 1, \dots, G$, the initial robust estimate at the population level, ${\hat{t}}_{R,0}$, can also be modified.

(iii) If $q_{0} = 0$ (in other words, the initial robust estimate for the population level is not modified) and $q_{g} = q$ for $g = 1, \dots, G$, where $q$ is a strictly positive constant, expression (5.3) simplifies to

${\hat{t}}_{R, g}^{*} = {\hat{t}}_{R, g} \left( \frac{{\hat{t}}_{R,0}}{{\hat{t}}_{R (agg)}} \right). \qquad (5.4)$

In this case, the initial estimates ${\hat{t}}_{R, g}$ are all modified by the same factor, ${\hat{t}}_{R,0} / {\hat{t}}_{R (agg)}$.

(iv) How can we set the values of $q_{g}$ in practice? It seems natural to adopt the following choice:

$q_{g} = \hat{CV} ({\hat{t}}_{g}) / \sum_{h = 1}^{G} \hat{CV} ({\hat{t}}_{h}),$

where $\hat{CV} ({\hat{t}}_{g})$ is the estimated coefficient of variation (CV) associated with domain $g$. For example, in a repeated survey, the estimated CV observed in a previous iteration can be used. This choice of $q_{g}$ is based on the fact that we will not want to make a large change in the initial estimate associated with a domain that has a small estimated CV. In such a domain, the problem of influential values is clearly less serious, and the initial robust estimate ${\hat{t}}_{R, g}$ is expected to be relatively close to the actual total $t_{g}$. In other words, the robust estimator ${\hat{t}}_{R, g}$ should have low bias and be relatively stable. It therefore makes sense not to attempt to change the initial robust estimate substantially.

(v) In (5.2), we used the generalized chi-square distance, which leads to the linear method. In the literature on calibration (e.g., Deville and Särndal 1992), there are a number of other calibration methods. In particular, there is the Kullback-Leibler distance, which leads to the exponential method, as well as the logit and truncated linear methods. Using the last two methods, we can specify positive bounds $C_{1}$ and $C_{2}$ such that $C_{1} \leq {\hat{t}}_{R, g}^{*} / {\hat{t}}_{R, g} \leq C_{2}$. In other words, we ensure that the ratio ${\hat{t}}_{R, g}^{*} / {\hat{t}}_{R, g}$ falls within the interval between $C_{1}$ and $C_{2}$. Note that the calibration procedure may lead to ${\hat{t}}_{R, g}^{*} - {\hat{t}}_{g} > 0$ for a certain $g$, which is counterintuitive. In this case, we simply include the constraint ${\hat{t}}_{R, g}^{*} \leq {\hat{t}}_{g}$ for $g = 1, \dots, G$ in the calibration procedure.

(vi) An alternative is to express ${\hat{t}}_{R, g}^{*}$ as a weighted sum of the initial values using modified weights:

${\hat{t}}_{R, g}^{*} = \sum_{i \in S_{g}} {\tilde{d}}_{i}^{*} y_{i},$

where

${\tilde{d}}_{i}^{*} = {\tilde{d}}_{i} \left( 1 - \delta_{g} q_{g} \frac{\sum_{h = 0}^{G} \delta_{h} {\hat{t}}_{R, h}}{\sum_{h = 0}^{G} q_{h} {\hat{t}}_{R, h}} \right)$

and ${\tilde{d}}_{i}$ is given by either (4.3) or (4.6). We can also write the estimator ${\hat{t}}_{R, g}^{*}$ as a weighted sum with the initial weights using modified values:

${\hat{t}}_{R, g}^{*} = \sum_{i \in S_{g}} d_{i} {\tilde{y}}_{i}^{*},$

where

${\tilde{y}}_{i}^{*} = {\tilde{y}}_{i} \left( 1 - \delta_{g} q_{g} \frac{\sum_{h = 0}^{G} \delta_{h} {\hat{t}}_{R, h}}{\sum_{h = 0}^{G} q_{h} {\hat{t}}_{R, h}} \right), \quad i \in S_{g},$

and ${\tilde{y}}_{i}$ is given by either (4.1) or (4.4).

(vii) We may want to find the winsorization thresholds $K_{g}$, $g = 1, \dots, G$, such that the standard winsorized estimator or the Dalén-Tambay winsorized estimator is equal to ${\hat{t}}_{R, g}^{*}$. We can follow a procedure similar to the one in Section 4, and we can use an algorithm similar to the one in the Appendix. A necessary condition for the existence of a solution is that ${\hat{t}}_{g} - {\hat{t}}_{R, g}^{*} \geq 0$.

(viii) With the proposed calibration procedure, more than one partition of the population can be dealt with jointly. For example, we may be interested in publishing both provincial estimates and industry estimates. If so, we simply insert the following calibration equations into the calibration procedure:

$\sum_{g = 1}^{G} {\hat{t}}_{R, g}^{*} = {\hat{t}}_{R,0}^{*},$

$\sum_{l = 1}^{L} {\hat{t}}_{R, l}^{*} = {\hat{t}}_{R,0}^{*},$

where $G$ and $L$ denote the number of provinces and the number of industries respectively. The method can also be applied to more than two partitions of the population.
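To make remark (viii) concrete, here is a hedged Python sketch for two partitions under the same generalized chi-square distance as in (5.2). It is not taken from the original text: it applies the Lagrange multipliers method with two constraints, which gives final estimates of the form $x_{k}^{*} = x_{k} - q_{k} x_{k} (A^{\top} \lambda)_{k}$, where $A$ encodes the two calibration equations and $\lambda$ solves a $2 \times 2$ linear system. The function name `calibrate_two_partitions` and all numerical values are hypothetical.

```python
def calibrate_two_partitions(t0, prov, ind, q0, q_prov, q_ind):
    """Chi-square calibration with two constraints:
    sum(prov*) = t0* and sum(ind*) = t0*.
    All initial estimates are assumed to be strictly positive."""
    c = q0 * t0                                   # q_0 * t_R0
    a = sum(w * t for w, t in zip(q_prov, prov))  # sum of q_g * t_Rg (provinces)
    b = sum(w * t for w, t in zip(q_ind, ind))    # sum of q_l * t_Rl (industries)
    r1 = sum(prov) - t0                           # initial gap, first partition
    r2 = sum(ind) - t0                            # initial gap, second partition
    # Solve [[c + a, c], [c, c + b]] @ (lam1, lam2) = (r1, r2) by Cramer's rule.
    det = (c + a) * (c + b) - c * c
    if det == 0:
        raise ValueError("the q coefficients leave the system unsolvable")
    lam1 = (r1 * (c + b) - c * r2) / det
    lam2 = ((c + a) * r2 - c * r1) / det
    # The population-level estimate enters both constraints with coefficient -1.
    t0_star = t0 + c * (lam1 + lam2)
    prov_star = [t - w * t * lam1 for w, t in zip(q_prov, prov)]
    ind_star = [t - w * t * lam2 for w, t in zip(q_ind, ind)]
    return t0_star, prov_star, ind_star


# Hypothetical initial robust estimates: provinces sum to 950, industries to 1020.
t0_star, prov_star, ind_star = calibrate_two_partitions(
    t0=1000.0,
    prov=[300.0, 400.0, 250.0],
    ind=[500.0, 520.0],
    q0=0.5,
    q_prov=[1.0, 1.0, 1.0],
    q_ind=[1.0, 1.0],
)

# Both calibration equations hold for the final estimates.
assert abs(sum(prov_star) - t0_star) < 1e-9
assert abs(sum(ind_star) - t0_star) < 1e-9
```

With a single partition, the same construction reduces to (5.3). For more than two partitions, the $2 \times 2$ system would be replaced by a larger linear system with one multiplier per calibration equation.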

Date modified: 2015-11-27

Survey Methodology
