# 2 Indirect sampling and the GWSM

Pierre Lavallée and Sébastien Labelle-Blanchet

In this section, we provide an overview of Indirect Sampling and the GWSM. Although Indirect Sampling has been developed for any type of sample design, we will focus on stratified Simple Random Sampling Without Replacement (SRSWoR), since this sampling design is the most commonly used for business surveys.

Let the population ${U}^{A}$ of ${M}^{A}$ establishments be stratified into $H$ strata, where stratum $h$ contains ${M}_{h}^{A}$ establishments. In each stratum $h,$ we select a sample ${s}_{h}^{A}$ of ${m}_{h}^{A}$ establishments using SRSWoR. Let ${s}^{A}={\cup }_{h=1}^{H}{s}_{h}^{A}$ and ${m}^{A}={\sum }_{h=1}^{H}{m}_{h}^{A}.$ The target population ${U}^{B}$ contains ${N}^{B}$ enterprises, where enterprise $i$ contains ${M}_{i}^{B}$ of the establishments of ${U}^{A}.$ This population can also be viewed as a population of ${M}^{B}$ establishments, where each establishment $k$ belongs to an enterprise $i,$ with ${M}^{B}={\sum }_{i=1}^{{N}^{B}}{M}_{i}^{B}.$

We wish to produce an estimate for the target population ${U}^{B},$ using the sampling frame ${U}^{A}$ along with the existing links between the two populations. The links between population ${U}^{A}$ and population ${U}^{B}$ are identified by the indicator variable ${l}_{j,i},$ where ${l}_{j,i}=1$ if there exists a link between establishment $j\in {U}^{A}$ and enterprise $i\in {U}^{B},$ and 0 otherwise. In the present case, ${l}_{j,i}=1$ if establishment $j$ of ${U}^{A}$ belongs to enterprise $i$ of ${U}^{B},$ and 0 otherwise. Because each establishment can belong to only one enterprise, the links between ${U}^{A}$ and ${U}^{B}$ are many-to-one or one-to-one. Therefore, we have ${L}_{j}^{A}={\sum }_{i=1}^{{N}^{B}}{l}_{j,i}=1$ and ${L}_{i}^{B}={\sum }_{j=1}^{{M}^{A}}{l}_{j,i}={M}_{i}^{B},$ for all establishments $j\in {U}^{A}$ and for all enterprises $i\in {U}^{B}.$

Steps for Indirect Sampling:

1. For each establishment $j$ selected in ${s}^{A},$ we identify the corresponding enterprise $i$ of ${U}^{B}.$
2. For each enterprise $i$ identified, we assume that we can set up the list ${U}_{i}^{B}$ of all ${M}_{i}^{B}$ establishments of this enterprise.
3. For each enterprise $i$ identified, we survey all ${M}_{i}^{B}$ establishments of the enterprise.
4. At the end, we obtain a sample ${s}^{B}$ of ${n}^{B}$ enterprises, and this sample contains ${m}^{B}={\sum }_{i=1}^{{n}^{B}}{M}_{i}^{B}$ establishments.
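
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six-establishment frame, the stratum labels, the sample sizes and the function name `indirect_sample` are hypothetical and do not come from any actual business register.

```python
import random

# Hypothetical toy frame U^A: establishment id -> (stratum, enterprise id).
frame = {
    "e1": ("h1", "A"), "e2": ("h1", "A"), "e3": ("h1", "B"),
    "e4": ("h2", "B"), "e5": ("h2", "C"), "e6": ("h2", "C"),
}
# Hypothetical sample sizes m_h^A per stratum.
sample_sizes = {"h1": 2, "h2": 1}


def indirect_sample(frame, sample_sizes, seed=0):
    """Select s^A by stratified SRSWoR, then derive s^B and the surveyed establishments."""
    rng = random.Random(seed)
    s_A = []
    for h, m_h in sorted(sample_sizes.items()):
        units = sorted(j for j, (stratum, _) in frame.items() if stratum == h)
        s_A.extend(rng.sample(units, m_h))  # SRSWoR within stratum h
    # Step 1: identify the enterprise of each sampled establishment.
    s_B = sorted({frame[j][1] for j in s_A})
    # Steps 2 and 3: list, then survey, all establishments of each identified enterprise.
    surveyed = {i: sorted(j for j, (_, ent) in frame.items() if ent == i) for i in s_B}
    return s_A, s_B, surveyed


s_A, s_B, surveyed = indirect_sample(frame, sample_sizes)
# Step 4: n^B enterprises, containing m^B establishments in total.
n_B = len(s_B)
m_B = sum(len(establishments) for establishments in surveyed.values())
```

Because every sampled establishment is necessarily among the surveyed ones, `m_B` can never be smaller than `len(s_A)`.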

For all the establishments $k$ linked to enterprises $i\in {s}^{B},$ we measure a variable of interest ${y}_{ik}.$ We want to estimate the total $Y={\sum }_{i=1}^{{N}^{B}}{\sum }_{k=1}^{{M}_{i}^{B}}{y}_{ik}={\sum }_{i=1}^{{N}^{B}}{Y}_{i}$ for the target population ${U}^{B}.$ Note that the collection process of Indirect Sampling results in a number of surveyed establishments that can be much larger than the number of establishments in the initial sample ${s}^{A}.$ We initially sample ${m}^{A}$ establishments in ${s}^{A},$ and end up surveying ${m}^{B}={\sum }_{i=1}^{{n}^{B}}{M}_{i}^{B}$ establishments, where ${m}^{B}\ge {m}^{A}.$

In practice, some enterprises may provide their data only at the enterprise level. That is, we obtain the values ${Y}_{i}={\sum }_{k=1}^{{M}_{i}^{B}}{y}_{ik}$ for $i\in {s}^{B},$ but not the values ${y}_{ik}$ measured at the establishment level. As we will see, this does not create problems for global estimates, but it might create difficulties for some detailed estimates. When this occurs, a disaggregation (also called allocation) of the enterprise values to the establishment level is performed, based mainly on subject matter expertise (see, for example, Delorme 2000).

With Indirect Sampling, nonresponse can be present within the sample ${s}^{A}$ selected from ${U}^{A},$ or within the units (enterprises or establishments) identified to be surveyed within ${U}^{B}.$ Since the units in population ${U}^{B}$ are in fact surveyed by cluster (recall that enterprises are clusters of establishments), there are two types of nonresponse from ${U}^{B}$: cluster nonresponse and unit nonresponse. Cluster nonresponse refers to the case where the variable of interest $y$ is not measured for any of the establishments of an enterprise selected in the survey. Unit nonresponse occurs when one or more establishments of the enterprise, but not all, do not respond. With Indirect Sampling, there is also another form of nonresponse that comes from the problem of identifying some of the links. This type of nonresponse is associated with the situation where it is impossible to determine whether an establishment $k$ of an enterprise $i$ of ${U}^{B}$ is linked or not to an establishment $j$ of ${U}^{A}.$ This is referred to as the problem of links identification. Lavallée (2002, 2007) proposed solutions to correct these types of nonresponse based on weight adjustments. To restrict the scope of the present paper, we will assume that nonresponse does not occur at any level.

According to the GWSM, to estimate the total $Y,$ we use the estimator

$\stackrel{^}{Y}=\sum _{i=1}^{{n}^{B}}{w}_{i}{Y}_{i}$(2.1)

where ${n}^{B}$ is the number of surveyed enterprises. The weights obtained from the GWSM are given by

${w}_{i}=\sum _{j=1}^{{M}^{A}}\frac{{t}_{j}^{A}}{{\pi }_{j}^{A}}\frac{{l}_{j,i}}{{L}_{i}^{B}}$(2.2)

where ${t}_{j}^{A}=1$ if $j\in {s}^{A},$ 0 otherwise, and ${\pi }_{j}^{A}$ is the selection probability of establishment $j.$ In the present case, we have ${\pi }_{j}^{A}={m}_{h}^{A}/{M}_{h}^{A}$ for $j\in h.$ It should be noted that the weights (2.2) do not correspond, in general, to the selection probabilities ${\pi }_{i}^{B}$ of the enterprises $i.$ Using (2.2), we can rewrite estimator (2.1) as

$\stackrel{^}{Y}=\sum _{j=1}^{{M}^{A}}\frac{{t}_{j}^{A}}{{\pi }_{j}^{A}}{Z}_{j}$(2.3)

where

${Z}_{j}=\sum _{i=1}^{{N}^{B}}\frac{{Y}_{i}}{{L}_{i}^{B}}{l}_{j,i}.$(2.4)

Because of the many-to-one correspondence between ${U}^{A}$ and ${U}^{B},$ we have

${w}_{i}=\frac{1}{{M}_{i}^{B}}\sum _{j=1}^{{M}_{i}^{B}}\frac{{t}_{j}^{A}}{{\pi }_{j}^{A}}.$(2.5)

In addition, the variable ${Z}_{j}$ of (2.4) can be written as ${Z}_{j}={Y}_{i}/{M}_{i}^{B}={\overline{Y}}_{i},$ for $j\in i,$ which is the average of the variable of interest over the ${M}_{i}^{B}$ establishments belonging to enterprise $i.$ We thus have

$\stackrel{^}{Y}=\sum _{h=1}^{H}\frac{{M}_{h}^{A}}{{m}_{h}^{A}}\sum _{j=1}^{{m}_{h}^{A}}{Z}_{hj}$(2.6)

where ${Z}_{hj}={Y}_{i}/{M}_{i}^{B}={\overline{Y}}_{i},$ for $j\in i.$
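
As a hedged numerical illustration, the sketch below computes the weights (2.5) and checks that the enterprise-level form (2.1) and the establishment-level form (2.6) give the same estimate. The toy population, the enterprise totals and the realised sample `s_A` are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical toy population: stratum and owning enterprise of each establishment.
stratum = {"e1": "h1", "e2": "h1", "e3": "h1", "e4": "h2", "e5": "h2", "e6": "h2"}
enterprise = {"e1": "A", "e2": "A", "e3": "B", "e4": "B", "e5": "C", "e6": "C"}
Y = {"A": 10.0, "B": 40.0, "C": 6.0}   # enterprise totals Y_i
M_h = {"h1": 3, "h2": 3}               # stratum sizes M_h^A
m_h = {"h1": 2, "h2": 1}               # stratum sample sizes m_h^A
s_A = ["e1", "e3", "e5"]               # one possible realised sample of establishments

# M_i^B: number of establishments per enterprise (equal to L_i^B here).
M_i = Counter(enterprise.values())


def design_weight(j):
    """Inverse selection probability 1 / pi_j^A = M_h^A / m_h^A for j in stratum h."""
    h = stratum[j]
    return M_h[h] / m_h[h]


def gwsm_weights(s_A):
    """Weights (2.5): average of t_j^A / pi_j^A over the establishments of enterprise i."""
    w = {}
    for j in s_A:
        i = enterprise[j]
        w[i] = w.get(i, 0.0) + design_weight(j) / M_i[i]
    return w


w = gwsm_weights(s_A)
# Estimator (2.1): weighted sum of enterprise totals over s^B.
Y_hat_21 = sum(w[i] * Y[i] for i in w)
# Estimator (2.6): expanded sum of Z_hj = Y_i / M_i^B over the sampled establishments.
Y_hat_26 = sum(design_weight(j) * Y[enterprise[j]] / M_i[enterprise[j]] for j in s_A)
```

With this sample, both forms give 46.5, while the true total of the toy population is 56; the estimate from a single sample differs from $Y$ even though the estimator is unbiased over all possible samples.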

One can prove that estimator (2.1) (and therefore (2.3) and (2.6)) is unbiased for $Y$ (see Lavallée 2002, 2007). Note that estimator $\stackrel{^}{Y}$ is in fact only a Horvitz-Thompson estimator where the variable of interest is the variable ${Z}_{hj}.$ In the case of stratified SRSWoR, its variance is given by

$\text{Var}\left(\stackrel{^}{Y}\right)=\sum _{h=1}^{H}{M}_{h}^{A}\left(\frac{{M}_{h}^{A}-{m}_{h}^{A}}{{m}_{h}^{A}}\right){S}_{Z,h}^{2}$(2.7)

where ${S}_{Z,h}^{2}=\sum _{j=1}^{{M}_{h}^{A}}{\left({Z}_{hj}-{\overline{Z}}_{h}\right)}^{2}/\left({M}_{h}^{A}-1\right)$ and ${\overline{Z}}_{h}=\sum _{j=1}^{{M}_{h}^{A}}{Z}_{hj}/{M}_{h}^{A}.$ The variance $\text{Var}\left(\stackrel{^}{Y}\right)$ can be estimated using the classical estimator for stratified SRSWoR, or by other variance estimators proposed in the scientific literature, such as Jackknife and Bootstrap estimators. See Wolter (2007) or Särndal, Swensson and Wretman (1992).
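
The unbiasedness of the estimator and the variance formula (2.7) can be checked by brute force on a small population. The sketch below is an illustration only: the values of ${Z}_{hj}$ and the sample sizes are hypothetical, and the enumeration of all possible samples is feasible only because the toy population is tiny.

```python
import itertools
import statistics

# Hypothetical toy values Z_hj = Y_i / M_i^B, grouped by stratum.
strata = {
    "h1": {"e1": 5.0, "e2": 5.0, "e3": 20.0},
    "h2": {"e4": 20.0, "e5": 3.0, "e6": 3.0},
}
m_h = {"h1": 2, "h2": 1}  # hypothetical stratum sample sizes m_h^A


def variance_27(strata, m_h):
    """Variance (2.7): sum over strata of M_h (M_h - m_h) / m_h * S^2_{Z,h}."""
    total = 0.0
    for h, z in strata.items():
        M, m = len(z), m_h[h]
        # statistics.variance divides by (M - 1), as in the definition of S^2_{Z,h}.
        total += M * (M - m) / m * statistics.variance(list(z.values()))
    return total


def estimate(sample):
    """Estimator (2.6) for one stratified sample given as {stratum: tuple of ids}."""
    return sum(
        len(strata[h]) / m_h[h] * sum(strata[h][j] for j in ids)
        for h, ids in sample.items()
    )


# Enumerate every possible stratified SRSWoR sample; all are equally likely.
labels = sorted(strata)
per_stratum = [list(itertools.combinations(sorted(strata[h]), m_h[h])) for h in labels]
estimates = [estimate(dict(zip(labels, combo))) for combo in itertools.product(*per_stratum)]

true_total = sum(sum(z.values()) for z in strata.values())
expected_value = statistics.mean(estimates)    # design expectation of the estimator
design_variance = statistics.pvariance(estimates)  # design variance over all samples
formula_variance = variance_27(strata, m_h)
```

In this toy case the mean of the nine possible estimates equals the true total, and the variance computed over all samples agrees with the value given by (2.7).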

The precision of the estimates produced using the GWSM depends solely on the variance because the estimator (2.1) (and therefore (2.3) and (2.6)) is unbiased. Looking at equation (2.7), we find that the precision depends, as in the classical case, on the sample sizes and sampling fractions used to select ${s}^{A},$ but also on the variability of the derived variables $Z.$ Since ${Z}_{hj}={Y}_{i}/{M}_{i}^{B}={\overline{Y}}_{i},$ for $j\in i,$ the value of ${Z}_{hj}$ is the same for all establishments $j$ of a given enterprise $i.$ That is, the enterprise total ${Y}_{i}$ is shared equally among its establishments. If all the establishments of an enterprise belong to the same stratum, the variability of the variables $Z$ within a stratum will only depend on the difference between the average values of a limited number of enterprises, which might make the variability relatively small. On the other hand, if the establishments of an enterprise belong to different strata, the variability of the variables $Z$ within a stratum will depend on the difference between up to as many enterprises as there are establishments, which might result in quite a large variability. Because of the skewness of the population of establishments and the stratification applied to ${U}^{A},$ the latter case is the one that is most likely to occur.

It is interesting to see that the present version of Indirect Sampling (together with the GWSM) corresponds mathematically to Adaptive Cluster Sampling presented by Thompson (1990, 1991, 1992, 2002) and Thompson and Seber (1996). With Adaptive Cluster Sampling, a sample of establishments would first be selected, and a collection strategy would then be performed to survey all establishments of the enterprises identified by the initial sample of selected establishments. Typically, the collection strategy would be to expand the sample of establishments by visiting them sequentially, until all establishments of the same enterprises are covered. With Indirect Sampling, the collection strategy is not specified, but at the end of the collection process, the complete set of establishments of the selected enterprises is assumed to be surveyed. The estimator related to Adaptive Cluster Sampling can be proved to be the same as estimator (2.1) obtained through the GWSM (see Lavallée 2002, 2007). Note that the two sampling designs are mathematically equivalent only in some particular cases. This is the case in the present paper when estimator (2.1) is used. When weighted links (see next section) are used, the GWSM produces a different estimator from the one related to Adaptive Cluster Sampling. As well, when the links between populations ${U}^{A}$ and ${U}^{B}$ are many-to-many, Indirect Sampling and Adaptive Cluster Sampling are no longer equivalent.

## 2.1  Use of weighted links

The indicator variable ${l}_{j,i}$ simply indicates whether there is a link between establishment $j$ and enterprise $i$ from populations ${U}^{A}$ and ${U}^{B},$ respectively. It is however possible to replace the indicator variable ${l}_{j,i}$ with any quantitative variable ${\theta }_{j,i}$ representing the importance that we want to give to the link ${l}_{j,i}.$ That is, there is no problem with generalising the indicator variable $l$ defined on $\{0,1\}$ to a quantitative variable $\theta$ defined on $\left[0,+\infty \right),$ the set of non-negative real numbers. In this case, a value of ${\theta }_{j,i}=0$ amounts to a link ${l}_{j,i}=0.$ The theory developed around the GWSM remains valid. For instance, the resulting estimator is still unbiased. As will be seen later, choosing appropriate values for the weighted links ${\theta }_{j,i}$ will be the basis for methods that aim to reduce the variance of the estimates obtained through the GWSM.

Let ${\stackrel{˜}{\theta }}_{j,i}={\theta }_{j,i}/{\theta }_{i}^{B}$ where ${\theta }_{i}^{B}={\sum }_{j=1}^{{M}^{A}}{\theta }_{j,i}.$ From (2.2), we define

${w}_{i}^{\theta }=\sum _{j=1}^{{M}_{i}^{B}}\frac{{t}_{j}^{A}}{{\pi }_{j}^{A}}{\stackrel{˜}{\theta }}_{j,i}.$(2.8)

Using (2.8), we can modify estimator (2.6) as

${\stackrel{^}{Y}}_{\theta }=\sum _{h=1}^{H}\frac{{M}_{h}^{A}}{{m}_{h}^{A}}\sum _{j=1}^{{m}_{h}^{A}}{Z}_{hj}^{\theta }$(2.9)

where

${Z}_{hj}^{\theta }=\sum _{i=1}^{{N}^{B}}{\stackrel{˜}{\theta }}_{j,i}{Y}_{i}$(2.10)

for $j\in h.$ Because of the many-to-one correspondence between ${U}^{A}$ and ${U}^{B},$ the variable ${Z}_{hj}^{\theta }$ in (2.10) is a weighted portion of the total ${Y}_{i}$ of the ${M}_{i}^{B}$ establishments belonging to enterprise $i.$ The variance of (2.9) is obtained by replacing ${Z}_{hj}$ by ${Z}_{hj}^{\theta }$ in (2.7):

$\text{Var}\left({\stackrel{^}{Y}}_{\theta }\right)=\sum _{h=1}^{H}{M}_{h}^{A}\left(\frac{{M}_{h}^{A}-{m}_{h}^{A}}{{m}_{h}^{A}}\right){S}_{\theta Zh}^{2}$(2.11)

where ${S}_{\theta Zh}^{2}=\sum _{j=1}^{{M}_{h}^{A}}{\left({Z}_{hj}^{\theta }-{\overline{Z}}_{h}^{\theta }\right)}^{2}/\left({M}_{h}^{A}-1\right)$ and ${\overline{Z}}_{h}^{\theta }=\sum _{j=1}^{{M}_{h}^{A}}{Z}_{hj}^{\theta }/{M}_{h}^{A}.$
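
To make the effect of weighted links concrete, the following sketch compares the variance (2.11) under equal links with the variance under one arbitrary set of unequal links. Everything in it is hypothetical: the toy population, the enterprise totals and the values in `theta_unequal`, which are not the optimal links of Deville and Lavallée (2006) but simply an illustrative choice.

```python
import statistics

# Hypothetical toy population: stratum and owning enterprise of each establishment.
stratum = {"e1": "h1", "e2": "h1", "e3": "h1", "e4": "h2", "e5": "h2", "e6": "h2"}
enterprise = {"e1": "A", "e2": "A", "e3": "B", "e4": "B", "e5": "C", "e6": "C"}
Y = {"A": 10.0, "B": 40.0, "C": 6.0}   # enterprise totals Y_i
m_h = {"h1": 2, "h2": 1}               # stratum sample sizes m_h^A

# theta_{j,i} for the single enterprise i owning establishment j.
theta_equal = {j: 1.0 for j in enterprise}
theta_unequal = dict(theta_equal, e3=7.0)  # illustrative choice only, not an optimum


def z_theta(theta):
    """Z_hj^theta of (2.10): standardised link times the total of the owning enterprise."""
    theta_B = {}
    for j, t in theta.items():
        theta_B[enterprise[j]] = theta_B.get(enterprise[j], 0.0) + t
    return {j: t / theta_B[enterprise[j]] * Y[enterprise[j]] for j, t in theta.items()}


def variance_211(z):
    """Variance (2.11) under stratified SRSWoR for the derived values z."""
    total = 0.0
    for h, m in m_h.items():
        z_h = [z[j] for j in z if stratum[j] == h]
        M = len(z_h)
        total += M * (M - m) / m * statistics.variance(z_h)
    return total


z_equal = z_theta(theta_equal)
z_unequal = z_theta(theta_unequal)
var_equal = variance_211(z_equal)      # equal sharing, as in (2.7)
var_unequal = variance_211(z_unequal)  # unequal sharing of Y_B between e3 and e4
```

In both cases the values ${Z}_{hj}^{\theta }$ sum to the population total, which is why the estimator remains unbiased. In this toy population the unequal links happen to give a smaller variance (458 against 690.5); other choices of links can just as well increase it.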

## 2.2  Using optimal weighted links

The GWSM offers a simple solution for obtaining an estimation weight ${w}_{i}$ for each surveyed enterprise $i.$ However, the estimator $\stackrel{^}{Y}$ given by (2.1) and (2.3), which results from the default use of the GWSM, does not always have the smallest variance. It is possible to improve it by determining optimal values for the weighted links ${\theta }_{j,i}.$ This problem has been solved by Deville and Lavallée (2006).

We pointed out earlier that the variance (2.7) depends on the variability of the derived variables $Z.$ Without weighted links, i.e., with ${Z}_{hj}={Y}_{i}/{M}_{i}^{B}={\overline{Y}}_{i},$ for $j\in i,$ the value of ${Z}_{hj}$ is the same for all establishments $j$ of a given enterprise $i.$ Because it is likely that the establishments of an enterprise belong to different strata, the variability of the variables $Z$ within a stratum will depend on the difference between up to as many enterprises as there are establishments. Moreover, whether an establishment belongs to a stratum of "large" or "small" units (with respect to some size measure), it receives the average value ${\overline{Y}}_{i}$ of its owning enterprise. This contributes to increasing the variability within strata, and thus the variance (2.7). The idea behind the use of weighted links is to share the enterprise total ${Y}_{i}$ unequally among its establishments. Searching for optimal weighted links amounts to sharing the enterprise total ${Y}_{i}$ in such a way that the variance (2.11) is minimal.

Deville and Lavallée (2006) obtained an estimator that has a variance less than or equal to that of the original estimator $\stackrel{^}{Y}.$ As mentioned earlier, estimator ${\stackrel{^}{Y}}_{\theta }$ given by (2.9) will still provide unbiased estimates. Now, the variance (2.11) of this estimator depends on the weighted links ${\theta }_{j,i}.$ The problem is then to find at least one set of values ${\theta }_{j,i}$ such that the variance of the estimator ${\stackrel{^}{Y}}_{\theta }$ is minimal. That is, for the ${\theta }_{j,i}$ that are greater than 0, we want to determine the values such that we obtain the most precise estimator ${\stackrel{^}{Y}}_{\theta }.$ The solution to this problem is obtained by minimising the variance (2.11) with respect to the weighted links ${\theta }_{j,i},$ which is a relatively standard and simple problem to solve. However, the solution is not trivial to write, and it often depends on the variable of interest $y.$

If the optimal weighted links ${\theta }_{j,i}^{\text{opt}}$ depend on the variable of interest $y,$ then the weights ${w}_{i}^{\theta }$ will also depend on $y.$ This means that a different set of weights will need to be computed for each variable of interest. To overcome this problem, Deville and Lavallée (2006) defined weak optimality, which corresponds to minimising the variance (2.11) for a very specific choice of a variable of interest: ${Y}_{i}=1$ for an enterprise $i$ of ${U}^{B}$ and ${Y}_{{i}^{\prime }}=0$ for all other enterprises ${i}^{\prime }$ of ${U}^{B}$ $\left({i}^{\prime }\ne i\right).$ The resulting weak-optimal weighted links do not involve, per se, the variable $y$ and they turn out to be relatively easy to compute, i.e., they can be obtained as a closed-form solution, without the need for numerical computations. In addition, if some conditions given by Deville and Lavallée (2006) are satisfied, then weak optimality corresponds to strong optimality independent of $y.$ That is, the weighted links ${\theta }_{j,i}^{w-\text{opt}}$ obtained through weak optimality correspond to the optimal weighted links ${\theta }_{j,i}^{\text{opt}}$ obtained by minimising (2.11), and they do not depend on the variable of interest $y.$ Unfortunately, these conditions are rarely satisfied in practice, even for simple sampling designs such as SRSWoR.

Assuming SRSWoR without stratification, it can be shown that the weak-optimal weighted links are given by ${\stackrel{˜}{\theta }}_{j,i}^{w-\text{opt}}={\theta }_{j,i}^{w-\text{opt}}/{\sum }_{j=1}^{{M}^{A}}{\theta }_{j,i}^{w-\text{opt}}=1/{M}_{i}^{B}$ for establishment $j\in {U}^{A}$ belonging to enterprise $i\in {U}^{B},$ and 0 otherwise. This solution agrees with the solution conjectured by Kalton and Brick (1995). They obtained this result based on the simplified situation where ${M}^{A}=2$ and with ${s}^{A}$ obtained through equal probability sampling. Their conclusions suggested the use of optimal values ${\theta }_{j,i}^{\text{opt}}=1$ when ${\theta }_{j,i}>0,$ and ${\theta }_{j,i}^{\text{opt}}=0$ when ${\theta }_{j,i}=0.$ Lavallée (2002) and Lavallée and Caron (2001) obtained results along the same lines by the use of simulations. As mentioned earlier, unfortunately, the weak-optimal weights ${\stackrel{˜}{\theta }}_{j,i}^{w-\text{opt}}=1/{M}_{i}^{B}$ do not correspond to strong-optimal weights that are independent of $y.$
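
The weak-optimal solution $1/{M}_{i}^{B}$ under unstratified SRSWoR can be checked numerically in a simple case. The sketch below is only an illustration under hypothetical settings: a frame of six establishments, a sample of three, and an enterprise $i$ owning two establishments with ${Y}_{i}=1$ and all other enterprise totals equal to 0. It evaluates the variance (2.11), with a single stratum, over a grid of possible splits of the standardised links; it does not reproduce the closed-form derivation of Deville and Lavallée (2006).

```python
import statistics


def srswor_variance(z, m):
    """Variance (2.11) with a single stratum: M (M - m) / m * S^2_Z."""
    M = len(z)
    return M * (M - m) / m * statistics.variance(z)


def variance_for_split(a, M_A=6, m=3):
    """Variance when enterprise i (two establishments, Y_i = 1) uses standardised links (a, 1 - a).

    All other enterprises have Y = 0, so their establishments have Z = 0.
    """
    z = [a, 1.0 - a] + [0.0] * (M_A - 2)
    return srswor_variance(z, m)


# Grid search over the share a given to the first establishment of enterprise i.
grid = [k / 20 for k in range(21)]
best_split = min(grid, key=variance_for_split)
```

On this grid the minimum is reached at `best_split = 0.5`, that is, at $1/{M}_{i}^{B}$ with ${M}_{i}^{B}=2.$ The reason is that the links of an enterprise sum to one, so the mean of $Z$ is fixed and the variance is driven by the sum of squared links, which is smallest when the links are equal.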
