# 3 The problem with skewed populations

Pierre Lavallée and Sébastien Labelle-Blanchet

As mentioned in the introduction, applying the GWSM to business surveys can produce estimates with large variances. This lack of precision is due to the skewness of the population. We illustrate the problem with the small example given in Figure 3.1.

We want to study the revenue $y$ of the population ${U}^{B}$ of Figure 3.1 containing ${N}^{B}=$ 3 enterprises, where enterprise 1 contains ${M}_{1}^{B}=$ 4 establishments, enterprise 2 contains ${M}_{2}^{B}=$ 4 establishments, and enterprise 3 contains ${M}_{3}^{B}=$ 3 establishments. As can be observed from Figure 3.1, the revenues $y$ of the ${M}^{B}=$ 11 establishments form a skewed population.

For the survey, we build a frame ${U}^{A}$ containing the ${M}^{A}=$ 11 establishments, and we decide to stratify the establishments according to three size strata: stratum $h=$ 1 contains the establishments with $y\ge 750$; $h=$ 2 contains those with $100\le y<750$; and $h=$ 3 those with $y<100$ (in practice, such a stratification is not possible since the stratification variable $y$ is the same as the variable of interest, and instead, we would use some size variable $x$ highly correlated with the variable of interest $y$). In stratum $h=$ 1, we use a sampling fraction of 1 (*i.e.*, ${f}_{1}={m}_{1}^{A}/{M}_{1}^{A}=1$); for $h=$ 2, the sample size is 1 (*i.e.*, ${f}_{2}={m}_{2}^{A}/{M}_{2}^{A}=1/3$); and for $h=$ 3, the sample size is 2 (*i.e.*, ${f}_{3}={m}_{3}^{A}/{M}_{3}^{A}=2/6=1/3$).

There are $1\times 3\times 15=45$ possible samples ${s}^{A}$ that can be selected from ${U}^{A},$ for estimating the true total $Y=$ 3,800. For each of these 45 samples, we computed $\widehat{Y}$ using (2.1). The estimates are presented in the left box plot of Figure 3.2.
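The count of 45 can be checked with a short sketch. The stratum sizes below are read off the sampling fractions in the text; the size of stratum 1 (2 establishments) is inferred as 11 - 3 - 6 and is an assumption of this sketch, as are the names `stratum_sizes`, `sample_sizes` and `count_samples`.

```python
from math import comb

# Stratum sizes M_h^A and sample sizes m_h^A.
# Strata 2 and 3 come from f_2 = 1/3 and f_3 = 2/6 in the text;
# stratum 1 is inferred as 11 - 3 - 6 = 2 (take-all, f_1 = 1).
stratum_sizes = {1: 2, 2: 3, 3: 6}
sample_sizes = {1: 2, 2: 1, 3: 2}


def count_samples(sizes, samples):
    """Number of distinct stratified SRSWoR samples: product of C(M_h, m_h)."""
    total = 1
    for h in sizes:
        total *= comb(sizes[h], samples[h])
    return total


n_samples = count_samples(stratum_sizes, sample_sizes)
print(n_samples)  # 1 * 3 * 15 = 45
```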

We also computed estimates of $Y$ assuming the use of stratified SRSWoR *without* Indirect Sampling. That is, in each stratum $h,$ we select a sample ${s}_{h}^{A}$ of ${m}_{h}^{A}$ establishments using SRSWoR, and we measure the variable of interest ${y}_{ik}$ only for the establishments $ik$ of ${U}^{B}$ directly linked to the sampled establishments $j$ of ${U}^{A}.$ In other words, we measure the variable of interest ${y}_{j}$ only for the sampled establishments $j$ of ${U}^{A}.$ Unlike Indirect Sampling, we do not measure the variables of interest of the other establishments of the enterprises containing the initially sampled establishments. This corresponds to classical sampling theory. Thus, we estimated $Y$ using

${\widehat{Y}}_{\text{classic}}={\displaystyle \sum _{h=1}^{H}\frac{{M}_{h}^{A}}{{m}_{h}^{A}}\sum _{j=1}^{{m}_{h}^{A}}{y}_{hj}}.$ (3.1)

It can be proved that estimator (3.1) is unbiased, and its variance is given by

$\text{Var}({\widehat{Y}}_{\text{classic}})={\displaystyle \sum _{h=1}^{H}{M}_{h}^{A}\left(\frac{{M}_{h}^{A}-{m}_{h}^{A}}{{m}_{h}^{A}}\right){S}_{y,h}^{2}}$ (3.2)

where ${S}_{y,h}^{2}={\displaystyle \sum _{j=1}^{{M}_{h}^{A}}{\left({y}_{hj}-{\overline{Y}}_{h}\right)}^{2}}/({M}_{h}^{A}-1)$ and ${\overline{Y}}_{h}={\displaystyle \sum _{j=1}^{{M}_{h}^{A}}{y}_{hj}}/{M}_{h}^{A}.$ The estimates are presented in the right box plot of Figure 3.2.
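As a minimal sketch of estimator (3.1) and variance formula (3.2), the following code enumerates every possible stratified sample and compares the empirical mean and variance of the estimates with the true total and with (3.2). The revenue values are hypothetical and are NOT the values of Figure 3.1; they only respect the stratum boundaries and stratum sizes described above. The function names are likewise illustrative.

```python
from itertools import combinations, product
from math import isclose
from statistics import mean, pvariance, variance

# Hypothetical revenues by stratum (illustrative only, not Figure 3.1).
strata = {
    1: [1500.0, 900.0],                       # y >= 750, take-all
    2: [400.0, 250.0, 150.0],                 # 100 <= y < 750
    3: [90.0, 60.0, 40.0, 30.0, 20.0, 10.0],  # y < 100
}
sample_sizes = {1: 2, 2: 1, 3: 2}


def y_hat_classic(sample):
    """Estimator (3.1): sum over strata of (M_h / m_h) * sum of sampled y."""
    return sum(len(strata[h]) / len(ys) * sum(ys) for h, ys in sample.items())


def var_classic():
    """Variance (3.2): sum over strata of M_h * (M_h - m_h) / m_h * S^2_{y,h}."""
    total = 0.0
    for h, ys in strata.items():
        M, m = len(ys), sample_sizes[h]
        s2 = variance(ys) if M > 1 else 0.0  # S^2 uses the (M_h - 1) divisor
        total += M * (M - m) / m * s2
    return total


def all_samples():
    """Yield every stratified SRSWoR sample as {stratum: tuple of y values}."""
    per_stratum = [list(combinations(strata[h], sample_sizes[h])) for h in strata]
    for combo in product(*per_stratum):
        yield dict(zip(strata, combo))


estimates = [y_hat_classic(s) for s in all_samples()]
true_total = sum(sum(ys) for ys in strata.values())

# All samples are equally likely, so the design expectation and variance are
# the plain mean and population variance over the enumerated estimates.
print(isclose(mean(estimates), true_total))        # unbiasedness check
print(isclose(pvariance(estimates), var_classic()))  # agreement with (3.2)
```

Because the take-all stratum has ${m}_{h}^{A}={M}_{h}^{A},$ its term in (3.2) vanishes, so all the variance in this sketch comes from strata 2 and 3.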

As we can see from Figure 3.2, the estimates obtained from Indirect Sampling (and the GWSM) vary considerably from one sample to the next. If we do not use Indirect Sampling (*i.e.*, we use the classical approach), the variability is much lower. This result can be seen directly from the variances of $\widehat{Y}$ and ${\widehat{Y}}_{\text{classic}}.$ Using formulas (2.7) and (3.2), we obtain the variance $\text{Var}({\widehat{Y}}_{\text{classic}})=$ 80,480, while $\text{Var}(\widehat{Y})=$ 1,115,111, almost 14 times larger.
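To put the two variances on a more interpretable scale, the sketch below computes their ratio and the corresponding coefficients of variation relative to the true total $Y=$ 3,800. The coefficient of variation is not used in the text above; it is introduced here only as a convenient summary, and the variable names are illustrative.

```python
from math import sqrt

true_total = 3800.0       # Y, from the text
var_classic = 80480.0     # variance of the classical estimator, from the text
var_indirect = 1115111.0  # variance of the GWSM estimator, from the text


def cv(var, total):
    """Coefficient of variation: standard error divided by the total."""
    return sqrt(var) / total


variance_ratio = var_indirect / var_classic
cv_classic = cv(var_classic, true_total)
cv_indirect = cv(var_indirect, true_total)

print(f"variance ratio: {variance_ratio:.2f}")
print(f"CV classic:     {cv_classic:.1%}")
print(f"CV indirect:    {cv_indirect:.1%}")
```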

The next section presents methods designed to reduce the variability of the estimates produced using Indirect Sampling.
