3 The problem with skewed populations

Pierre Lavallée and Sébastien Labelle-Blanchet

As mentioned in the introduction, applying the GWSM to business surveys can produce estimates with large variances. This lack of precision is due to the skewness of the population. We illustrate the problem with a small example, shown in Figure 3.1.

We want to study the revenue $y$ of the population ${U}^{B}$ of Figure 3.1 containing ${N}^{B}=$ 3 enterprises, where enterprise 1 contains ${M}_{1}^{B}=$ 4 establishments, enterprise 2 contains ${M}_{2}^{B}=$ 4 establishments, and enterprise 3 contains ${M}_{3}^{B}=$ 3 establishments. As can be seen from Figure 3.1, the revenues $y$ of the ${M}^{B}=$ 11 establishments form a skewed population.

For the survey, we build a frame ${U}^{A}$ containing the ${M}^{A}=$ 11 establishments, and we decide to stratify the establishments according to three size strata: stratum $h=$ 1 contains the establishments with $y\ge 750;$ $h=$ 2 contains those with $100\le y<750;$ and $h=$ 3 those with $y<100$ (in practice, such a stratification is not possible since the stratification variable $y$ is the same as the variable of interest, and instead, we would use some size variable $x$ highly correlated with the variable of interest $y$). In stratum $h=$ 1, we use a sampling fraction of 1 (i.e., ${f}_{1}={m}_{1}^{A}/{M}_{1}^{A}=1$); for $h=$ 2, the sample size is 1 (i.e., ${f}_{2}={m}_{2}^{A}/{M}_{2}^{A}=1/3$); and for $h=$ 3, the sample size is 2 (i.e., ${f}_{3}={m}_{3}^{A}/{M}_{3}^{A}=2/6=1/3$).

There are $1\times 3\times 15=45$ possible samples ${s}^{A}$ that can be selected from ${U}^{A},$ for estimating the true total $Y=$ 3,800. For each of these 45 samples, we computed $\stackrel{^}{Y}$ using (2.1). The estimates are presented in the left box plot of Figure 3.2.
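The count of 45 samples can be checked with a short sketch. The stratum sizes below are read off the sampling fractions given above; the size of stratum 1 (2 establishments) is not stated explicitly and is inferred here as $11-3-6$, so treat it as an assumption. The variable names are ours.

```python
from fractions import Fraction
from math import comb

# Stratum sizes M_h^A and sample sizes m_h^A.
# M_1 = 2 is inferred as 11 - 3 - 6 (assumption, not stated in the text).
stratum_sizes = {1: 2, 2: 3, 3: 6}
sample_sizes = {1: 2, 2: 1, 3: 2}

# Sampling fractions f_h = m_h / M_h, kept exact with Fraction.
sampling_fractions = {
    h: Fraction(sample_sizes[h], stratum_sizes[h]) for h in stratum_sizes
}

# Under stratified SRSWoR, the number of possible samples is the product
# over strata of "M_h choose m_h".
samples_per_stratum = {
    h: comb(stratum_sizes[h], sample_sizes[h]) for h in stratum_sizes
}
n_samples = 1
for count in samples_per_stratum.values():
    n_samples *= count

print(sampling_fractions)
print(samples_per_stratum, n_samples)
```

The per-stratum counts are the three factors of the product quoted in the text.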

Data table for Figure 3.1

We also computed estimates of $Y$ assuming the use of stratified SRSWoR without Indirect Sampling. That is, in each stratum $h,$ we select a sample ${s}_{h}^{A}$ of ${m}_{h}^{A}$ establishments using SRSWoR and we measure the variable of interest ${y}_{ik}$ only for the establishments $ik$ of ${U}^{B}$ directly linked to the sampled establishments $j$ of ${U}^{A}.$ In other words, we measure the variable of interest ${y}_{j}$ for the sampled establishments $j$ of ${U}^{A}.$ Unlike Indirect Sampling, we do not measure the variables of interest of the other establishments of the enterprises containing the initially sampled establishments. This corresponds to classical sampling theory. Thus, we estimated $Y$ using

$\stackrel{^}{Y}{}_{\text{classic}}=\sum _{h=1}^{H}\frac{{M}_{h}^{A}}{{m}_{h}^{A}}\sum _{j=1}^{{m}_{h}^{A}}{y}_{hj}.$ (3.1)

It can be proved that estimator (3.1) is unbiased, and its variance is given by

$\text{Var}\left({\stackrel{^}{Y}}_{\text{classic}}\right)=\sum _{h=1}^{H}{M}_{h}^{A}\left(\frac{{M}_{h}^{A}-{m}_{h}^{A}}{{m}_{h}^{A}}\right){S}_{y,h}^{2}$ (3.2)

where ${S}_{y,h}^{2}=\sum _{j=1}^{{M}_{h}^{A}}{\left({y}_{hj}-{\overline{Y}}_{h}\right)}^{2}/\left({M}_{h}^{A}-1\right)$ and ${\overline{Y}}_{h}=\sum _{j=1}^{{M}_{h}^{A}}{y}_{hj}/{M}_{h}^{A}.$ The estimates are presented in the right box plot of Figure 3.2.
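Estimator (3.1) and variance formula (3.2) can be sketched as follows. The revenues used here are hypothetical values chosen only to respect the stratum boundaries and sizes of the example; they are not the data of Figure 3.1, and the function names are ours. The sketch enumerates every possible stratified sample and compares the mean and variance of the estimates with the true total and with (3.2).

```python
import math
from itertools import combinations, product
from statistics import mean, variance

# HYPOTHETICAL revenues per stratum (NOT the data of Figure 3.1).
# "m" is the sample size m_h^A; the list length is the stratum size M_h^A.
strata = {
    1: {"y": [900.0, 1200.0], "m": 2},                        # y >= 750, take-all
    2: {"y": [150.0, 300.0, 600.0], "m": 1},                  # 100 <= y < 750
    3: {"y": [10.0, 20.0, 40.0, 50.0, 60.0, 80.0], "m": 2},   # y < 100
}

def classic_estimate(sample_by_stratum, stratum_sizes):
    """Estimator (3.1): sum over strata of (M_h / m_h) times the sample sum."""
    return sum(
        stratum_sizes[h] / len(ys) * sum(ys)
        for h, ys in sample_by_stratum.items()
    )

def classic_variance(strata):
    """Formula (3.2): sum over strata of M_h * (M_h - m_h) / m_h * S^2_{y,h}."""
    total = 0.0
    for s in strata.values():
        M, m = len(s["y"]), s["m"]
        S2 = variance(s["y"])  # divisor M_h - 1, as in the definition of S^2
        total += M * (M - m) / m * S2
    return total

# Enumerate every possible stratified SRSWoR sample (all equally likely).
keys = list(strata)
sizes = {h: len(strata[h]["y"]) for h in keys}
per_stratum = [list(combinations(strata[h]["y"], strata[h]["m"])) for h in keys]
estimates = [
    classic_estimate(dict(zip(keys, combo)), sizes)
    for combo in product(*per_stratum)
]

true_total = sum(sum(s["y"]) for s in strata.values())
design_mean = mean(estimates)
design_var = mean([(e - design_mean) ** 2 for e in estimates])

print(len(estimates), true_total, design_mean)
print(design_var, classic_variance(strata))
```

Because all samples are equally likely under this design, the mean of the enumerated estimates is the design expectation and their (population) variance is the design variance, so both printed pairs should agree up to rounding.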

Data table for Figure 3.2

As we can see from Figure 3.2, the estimates obtained from Indirect Sampling (and the GWSM) vary considerably from one sample to the next. If we do not use Indirect Sampling (i.e., we use the classical approach), the variability is much lower. This result can be seen directly from the variances of $\stackrel{^}{Y}$ and ${\stackrel{^}{Y}}_{\text{classic}}.$ Using formulas (2.7) and (3.2), we obtain the variance $\text{Var}\left({\stackrel{^}{Y}}_{\text{classic}}\right)=$ 80,480, while $\text{Var}\left(\stackrel{^}{Y}\right)=$ 1,115,111!
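To put the two variances on a common scale, one can express them as coefficients of variation relative to the true total $Y=$ 3,800. This summary is our own addition, not one used in the text; the three input figures are those quoted above.

```python
import math

# Figures quoted in the text.
true_total = 3800.0
var_classic = 80_480.0        # Var of the classical estimator, from (3.2)
var_indirect = 1_115_111.0    # Var of the GWSM estimator, from (2.7)

# Coefficient of variation: standard error divided by the true total.
cv_classic = math.sqrt(var_classic) / true_total
cv_indirect = math.sqrt(var_indirect) / true_total

# How many times larger the Indirect Sampling variance is.
variance_ratio = var_indirect / var_classic

print(f"CV classic:     {cv_classic:.1%}")
print(f"CV indirect:    {cv_indirect:.1%}")
print(f"Variance ratio: {variance_ratio:.1f}")
```

The coefficients of variation come out at about 7.5% for the classical approach and about 27.8% for Indirect Sampling, a variance ratio close to 14.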

The next section presents methods designed to reduce the variability of the estimates produced using Indirect Sampling.
