# 5 Simulations using real data

Pierre Lavallée and Sébastien Labelle-Blanchet

The simulations reflect a typical business survey at Statistics Canada. Three populations commonly surveyed by Statistics Canada were chosen: the establishments of the Manufacturing, Retail Trade and Restaurants industries, extracted from the Business Register (BR). These populations are known to have skewed distributions for economic variables such as revenue, especially the first two. Stratified simple random sampling without replacement (SRSWoR) was used for the simulations, with stratification by industry, region and class of revenue. The algorithm of Lavallée-Hidiroglou (1988) was used to create the classes of revenue, determine the sample size and perform the allocation. The establishments were divided into three strata based on their size: one take-all and two take-some strata. A coefficient of variation (CV) of 5% was targeted in each of the strata by industry and region. The following table contains some statistics on the populations.

Table 5.1
Simulation populations, sample counts and statistics

| Industry | ${N}^{B}$ | ${M}^{A}$ | ${m}^{A}$ | Average revenue | Variance | Skewness |
|---|---|---|---|---|---|---|
| Manufacturing | 96,955 | 100,109 | 2,223 | 4,364,808 | $1.08\times {10}^{16}$ | 164 |
| Retail Trade | 142,020 | 159,247 | 3,627 | 2,034,111 | $3.29\times {10}^{14}$ | 133 |
| Restaurants | 107,358 | 113,425 | 2,439 | 561,764 | $4.43\times {10}^{12}$ | 106 |
| Total | 346,333 | 372,781 | 8,289 | -- | -- | -- |
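The stratified SRSWoR design with take-all and take-some strata can be illustrated with a minimal sketch. The toy frame, the stratum labels and the sample sizes below are hypothetical; in the simulations, the stratum boundaries and the allocation come from the Lavallée-Hidiroglou algorithm, which is not reproduced here.

```python
import random

def stratified_srswor(strata, sample_sizes, rng):
    """Select a simple random sample without replacement in each stratum.

    strata: dict mapping a stratum label to the list of its unit ids.
    sample_sizes: dict mapping a stratum label to n_h (n_h == N_h is take-all).
    Returns a dict mapping each selected unit to its design weight N_h / n_h.
    """
    weights = {}
    for h, units in strata.items():
        n_h = sample_sizes[h]
        for unit in rng.sample(units, n_h):
            weights[unit] = len(units) / n_h
    return weights

# Hypothetical toy frame: one take-all and two take-some strata.
strata = {
    "take_all": ["e1", "e2"],
    "take_some_1": [f"m{k}" for k in range(10)],
    "take_some_2": [f"s{k}" for k in range(40)],
}
sample_sizes = {"take_all": 2, "take_some_1": 5, "take_some_2": 4}

sample_weights = stratified_srswor(strata, sample_sizes, random.Random(2013))
```

Units in the take-all stratum always receive a weight of 1, and in every stratum the weights of the selected units sum to the stratum size.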

The revenue variable available on the BR was used as the variable of interest $y$. For these simulations, since the values of this variable are known for all units, no sample selection was needed for most of the methods. It should be noted that for Method 2 (${\theta }_{j,i}$ proportional to some size measure ${x}_{j}$), the number of employees was used.

Except for Methods 7 and 8, the true variances were calculated from the data using formula (2.11). For Method 7, we used formula (4.23). For Method 8, we needed to calculate the true probabilities of selection ${\pi }_{i}^{B}$ of all enterprises. Although it would have been possible to compute these probabilities with exact formulas, we chose to compute them using a two-step Monte-Carlo simulation. One reason is that variance formula (4.27) uses the joint selection probabilities ${\pi }_{i,{i}^{\prime }}^{B},$ which involve too many pairs $\left(j,{j}^{\prime }\right).$ For the first step of the Monte-Carlo simulation, we selected 20,000 samples of establishments using the stratified SRSWoR design described above. For each sample, we identified which enterprises ended up being selected. Over those 20,000 samples, we were able to estimate the probability of selection ${\pi }_{i}^{B}$ of each enterprise $i$ under that design of unequal probabilities. Once these probabilities were derived, we conducted another Monte-Carlo simulation to compute the variance. We again selected $R=$ 20,000 samples ${s}^{A}$ of establishments using the stratified SRSWoR design described above. We then obtained the corresponding samples ${s}^{B}$ of enterprises. For each replicate $r,r=1,\dots ,R,$ we produced an estimate ${\hat{Y}}_{\text{HT},r}$ of $Y$ using estimator (4.26). The variance was computed using

$${V}_{\text{MC}}\left({\hat{Y}}_{\text{HT}}\right)=\frac{1}{R}\sum _{r=1}^{R}{\left({\hat{Y}}_{\text{HT},r}-{\hat{\overline{Y}}}_{\text{HT}}^{\left(R\right)}\right)}^{2}\tag{5.1}$$

where ${\hat{\overline{Y}}}_{\text{HT}}^{\left(R\right)}={\sum }_{r=1}^{R}{\hat{Y}}_{\text{HT},r}/R.$ Note that because estimator (4.26) is unbiased, we have ${\hat{\overline{Y}}}_{\text{HT}}^{\left(R\right)}\approx Y.$

For each estimator $\hat{Y},$ the coefficient of variation was computed using

$$\text{CV}\left(\hat{Y}\right)=\frac{\sqrt{V\left(\hat{Y}\right)}}{Y}.\tag{5.2}$$
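The two-step Monte-Carlo procedure used for Method 8, together with formulas (5.1) and (5.2), can be sketched as follows. This is only an illustration under stated assumptions: the toy frame and revenues are hypothetical, the number of replicates is reduced to 2,000 to keep the example fast, and estimator (4.26) is assumed to have the usual Horvitz-Thompson form, that is, the sum of ${y}_{i}/{\pi }_{i}^{B}$ over the selected enterprises.

```python
import random

# Hypothetical toy frame: establishment -> (stratum, enterprise).
establishments = {
    "a1": ("take_all", "E1"),
    "b1": ("take_some", "E1"),
    "b2": ("take_some", "E2"),
    "b3": ("take_some", "E2"),
    "b4": ("take_some", "E3"),
    "b5": ("take_some", "E4"),
}
sample_sizes = {"take_all": 1, "take_some": 2}
# Enterprise-level variable of interest y_i (hypothetical values).
revenue = {"E1": 1000.0, "E2": 120.0, "E3": 60.0, "E4": 70.0}
Y = sum(revenue.values())  # true total, known because y is on the frame

def draw_enterprises(rng):
    """Draw one stratified SRSWoR sample s^A and return the linked set s^B."""
    selected = set()
    for stratum, n_h in sample_sizes.items():
        units = [j for j, (h, _) in establishments.items() if h == stratum]
        for j in rng.sample(units, n_h):
            selected.add(establishments[j][1])
    return selected

rng = random.Random(12345)
R = 2000  # the simulations in the text use 20,000 replicates

# Step 1: estimate pi_i^B as the share of samples that reach enterprise i.
counts = dict.fromkeys(revenue, 0)
for _ in range(R):
    for i in draw_enterprises(rng):
        counts[i] += 1
pi_B = {i: c / R for i, c in counts.items()}

# Step 2: replicate the assumed Horvitz-Thompson estimator, then apply (5.1).
estimates = []
for _ in range(R):
    s_B = draw_enterprises(rng)
    estimates.append(sum(revenue[i] / pi_B[i] for i in s_B))
mean_est = sum(estimates) / R
v_mc = sum((e - mean_est) ** 2 for e in estimates) / R

# Coefficient of variation, formula (5.2): relative to the true total Y.
cv = v_mc ** 0.5 / Y
```

In this toy frame the exact selection probability of enterprise E2 is $1-\binom{3}{2}/\binom{5}{2}=0.7$, so the quality of the first step can be checked by comparing `pi_B["E2"]` with that value.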

## 5.1  Results of the simulation

For the classical GWSM and all the methods presented, the estimates, variances and coefficients of variation were computed. The graph below presents the CVs obtained at the national level.

### Graph 5.1 Coefficients of variation by method


Except for Method 7, all methods show a decrease in the variance, and the reduction is often substantial. As described earlier, Method 7 (using designated establishments) determines a single establishment within an enterprise based on the auxiliary variable, and the whole enterprise is assigned to that establishment. In other words, a given establishment inherits all the revenue of the enterprise. This is beneficial when the designated establishment is in a take-all stratum. However, if the designated establishment is in a take-some stratum, the distribution within that stratum becomes more skewed: one hundred percent of the revenue of the enterprise, multiplied by its sampling weight, is assigned to that stratum, and the resulting variance increases significantly. All the other methods provided reasonable results, and we analyze them in detail via the graphs that follow.

The following graphs show the CV for each take-some stratum, by industry.

### Graph 5.2 CV for Manufacturing by stratum


### Graph 5.3 CV for Retail Trade by stratum


### Graph 5.4 CV for Restaurants by stratum


Note 1: The scale of the CV is different for Manufacturing than for the other two industries.

Note 2: The CVs per stratum of Methods 7 and 8 are not shown here because they are not pertinent to the present comparison. The reason is that for these two methods, the notion of stratum is not the same as for the other methods. The stratification defined by the original sample design is at the establishment level. For Methods 7 and 8, sampling was done at the enterprise level, and therefore a typical stratum for Methods 1 to 6 became a domain for Methods 7 and 8. Of course, the variances associated with Methods 7 and 8 are much higher, making any comparison with the other methods irrelevant.

The CVs are particularly high for the classical GWSM in some strata, especially in the Manufacturing industry. This was expected because the skewness of the distribution of the variable of interest is the highest for this industry among the three. Furthermore, in this industry, establishments of the same enterprise can have widely varying revenues, and these establishments can be spread among several strata. It is for these reasons that the variance of the classical GWSM ends up being very high.

All graphs show that there is a reduction in the CV, and thus in the variance, when any of the suggested methods is used. The CVs are generally lower for the other methods than for the classical GWSM (represented by the dark blue line with diamonds).

## 5.2 Comparison of the proposed methods

Method 1 yields very promising results, given its simplicity. This method results in some of the lowest CVs among all methods. It targets the source of the problem of the GWSM: the need for an unequal allocation of the weights, somewhat proportional to the size of the variable of interest. Since most business surveys use stratification by size, this method works well. It also has the advantage of not depending directly on the variable of interest.

Method 2 uses an auxiliary variable (here, the number of employees) to distribute the weights. This variable is not that well correlated with the variable of interest, which explains why this method shows the smallest decrease in variance. In fact, this method is a weaker version of Method 3.

Method 3 shares the weight proportionally to the variable of interest $y,$ which is the establishment revenue within the enterprise. The method performs very well at both the national and provincial levels. Its CV is generally slightly higher than those of Methods 1, 4, 5 and 6 because of the high skewness of the distribution of the revenue.
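The proportional sharing behind Methods 2 and 3 can be illustrated with a small sketch. It assumes that the shares within an enterprise are the size measures normalized to sum to one; the function name, the establishment labels, the values and the equal-share fallback for an all-zero size measure are hypothetical choices made for this example.

```python
def share_links(sizes):
    """Return theta_j = x_j / sum of x over the establishments of one enterprise.

    sizes: dict mapping an establishment id to its size measure x_j.
    """
    total = sum(sizes.values())
    if total <= 0:
        # Assumed fallback for this sketch: equal shares when no size is available.
        return {j: 1 / len(sizes) for j in sizes}
    return {j: x / total for j, x in sizes.items()}

# Hypothetical enterprise with two establishments.
employees = {"est_1": 40, "est_2": 10}      # size measure x_j (Method 2)
revenue = {"est_1": 900.0, "est_2": 100.0}  # variable of interest y (Method 3)

theta_method2 = share_links(employees)
theta_method3 = share_links(revenue)
```

Here the employee-based shares (0.8 and 0.2) differ from the revenue-based shares (0.9 and 0.1), which mirrors the remark that the number of employees is an imperfect proxy for revenue.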

Methods 4, 5 and 6 give very similar results, offering a CV between 6% and 10% for Retail Trade and Restaurants, and between 10% and 25% for Manufacturing. The similarity in the results for all three methods is reasonable because they all aim to produce the lowest possible variance. Whenever one establishment of an enterprise is in a take-all stratum, these methods concentrate all values on this establishment, and assign links of zero to all other establishments of the enterprise. This is a natural choice to minimize the variance since the contribution of that enterprise to the variance becomes null. Since an establishment of a large enterprise can be part of a take-all stratum, the variance turns out to be lower than for any other method. It is for these reasons that these three methods are the best way to share the weight. However, it was not possible to determine which of the three was the best.
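The effect of concentrating the links on a take-all establishment can be checked by exact enumeration on a toy example. The sketch below assumes a weight-share estimator in which each selected establishment $j$ contributes ${\sum }_{i}{\theta }_{j,i}{y}_{i}$ divided by its selection probability, with shares summing to one within each enterprise; the frame, the revenues and the two sets of shares are hypothetical and do not reproduce Methods 4, 5 or 6 themselves.

```python
from itertools import combinations

# Hypothetical enterprise revenues y_i.
y = {"E1": 1000.0, "E2": 50.0, "E3": 60.0, "E4": 70.0}

take_all = ["a1"]                      # selected with probability 1
take_some = ["b1", "b2", "b3", "b4"]   # SRSWoR of 2 units out of 4
n_take_some = 2
pi = {"a1": 1.0, **{j: n_take_some / len(take_some) for j in take_some}}

def exact_mean_and_variance(theta):
    """Enumerate every possible sample; theta maps (establishment, enterprise) to a share."""
    z = {j: 0.0 for j in pi}
    for (j, i), share in theta.items():
        z[j] += share * y[i]
    # All SRSWoR samples are equally likely, so plain averages are exact.
    estimates = [
        sum(z[j] / pi[j] for j in take_all + list(s))
        for s in combinations(take_some, n_take_some)
    ]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, var

# Enterprises E2 to E4 each have a single establishment.
single = {("b2", "E2"): 1.0, ("b3", "E3"): 1.0, ("b4", "E4"): 1.0}
# Enterprise E1 has one take-all (a1) and one take-some (b1) establishment.
equal_share = {("a1", "E1"): 0.5, ("b1", "E1"): 0.5, **single}
concentrated = {("a1", "E1"): 1.0, ("b1", "E1"): 0.0, **single}

mean_eq, var_eq = exact_mean_and_variance(equal_share)
mean_conc, var_conc = exact_mean_and_variance(concentrated)
```

Both sets of shares give an unbiased estimate of the total of 1,180, but the variance drops from roughly 193,867 with equal shares to roughly 3,867 once the whole of enterprise E1 is carried by its take-all establishment. What remains comes only from the single-establishment enterprises.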

Method 7 does not provide good results. Recall that for this method, we let a single designated establishment represent the whole enterprise. All the establishment values of the variable of interest are summed to the enterprise level and assigned to that designated establishment. The distribution of the variable of interest by stratum becomes even more skewed, and this leads to a larger variance. From a sampling viewpoint, an enterprise ends up in a single stratum (because it is represented by a single establishment), which might not be take-all. In addition, this is not efficient for producing estimates at the provincial or industry level. The estimates cannot benefit from the stratification of establishments, while this is possible for the other methods. This also contributes to a higher variance.

Method 8 uses the selection probabilities of the enterprises obtained via simulations. This method does well, with a CV of 1.3% at the national level, which matches Methods 4, 5 and 6. However, this method can prove difficult to apply in practice. We must either calculate explicitly the second-order selection probabilities ${\pi }_{i,{i}^{\prime }}^{B},$ which can be very difficult to obtain, or estimate them by simulation, which is very time-consuming.
