5 Simulations using real data
Pierre Lavallée and Sébastien Labelle-Blanchet
The simulations reflect a typical business survey at
Statistics Canada. Three populations commonly surveyed by Statistics Canada
were chosen: the establishments of the Manufacturing, Retail Trade and
Restaurants industries, extracted from the Business
Register (BR). These populations are known to have skewed distributions for
economic variables such as revenue, especially the first two. Stratified SRSWoR
was used for the simulations, with stratification by industry, region and
revenue class. The algorithm of Lavallée-Hidiroglou (1988) was used to create the
revenue classes, determine the sample size and perform the allocation. The
establishments were divided into three strata based on their size: one take-all
and two take-some strata. A coefficient of variation (CV) of 5% was targeted
for each industry and region. The following table contains some
statistics on the populations.
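As a rough illustration of the design described above, the following Python sketch draws a stratified SRSWoR sample with one take-all and two take-some strata and attaches the usual sampling weight N_h / n_h to each selected unit. The toy population, the stratum labels and the sample sizes are hypothetical, and the sketch does not implement the Lavallée-Hidiroglou algorithm; it only assumes that stratum boundaries and sample sizes are already given.

```python
import random

def stratified_srswor(strata, sample_sizes, rng):
    """Draw an SRSWoR sample independently within each stratum.

    strata: dict mapping a stratum label to the list of its unit ids
    sample_sizes: dict mapping a stratum label to n_h
                  (n_h equal to the stratum size means take-all)
    Returns a dict mapping each sampled unit id to its weight N_h / n_h.
    """
    weights = {}
    for label, units in strata.items():
        n_h = sample_sizes[label]
        for unit in rng.sample(units, n_h):
            weights[unit] = len(units) / n_h
    return weights

# Hypothetical toy population of 12 establishments in three size strata.
strata = {
    "take-all": ["e1", "e2"],
    "take-some-1": ["e3", "e4", "e5", "e6"],
    "take-some-2": ["e7", "e8", "e9", "e10", "e11", "e12"],
}
sample_sizes = {"take-all": 2, "take-some-1": 2, "take-some-2": 2}

sample_weights = stratified_srswor(strata, sample_sizes, random.Random(1))
```

Units in the take-all stratum always enter the sample with a weight of 1, and the weights of any sample drawn this way sum to the population size.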
Table 5.1
Simulation populations, sample counts and statistics
Industry | | | | Average revenue | Variance | Skewness
Manufacturing | 96,955 | 100,109 | 2,223 | 4,364,808 | 1.08 × 10^16 | 164
Retail Trade | 142,020 | 159,247 | 3,627 | 2,034,111 | 3.29 × 10^14 | 133
Restaurants | 107,358 | 113,425 | 2,439 | 561,764 | 4.43 × 10^12 | 106
Total | 346,333 | 372,781 | 8,289 | -- | -- | --
The revenue variable available on the BR was used as the
variable of interest. For these simulations, since the values of
this variable are known for all units, no sample selection was needed for most
of the methods. Note that for Method 2, which shares the weights proportionally to a size measure, the number of employees was used as that measure.
Except for Methods 7 and 8, the true variances were
calculated from the data using formula (2.11). For Method 7, we used formula
(4.23). For Method 8, we needed to calculate the true probabilities of
selection of all enterprises. Although it would have
been possible to compute these probabilities with exact formulas, we chose to
compute them using a two-step Monte-Carlo simulation. One reason for this is
that variance formula (4.27) uses the joint selection probabilities, which involve too many pairs of enterprises. For the first step of the Monte-Carlo
simulation, we selected 20,000 samples of establishments using the stratified
SRSWoR design described above. For each sample, we identified which enterprises
ended up being selected. Over those 20,000 samples, we were able to estimate
the probability of selection of each enterprise under the resulting unequal-probability design.
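The first Monte-Carlo step can be sketched as follows in Python. The sketch assumes that an enterprise counts as selected as soon as at least one of its establishments is in the establishment sample, and it estimates each enterprise's selection probability as the share of replicates in which that happens. The toy frame, the stratum labels and the function names are hypothetical.

```python
import random
from collections import Counter

def draw_establishments(strata, sample_sizes, rng):
    """One stratified SRSWoR draw; returns the set of sampled establishments."""
    sampled = set()
    for label, units in strata.items():
        sampled.update(rng.sample(units, sample_sizes[label]))
    return sampled

def estimate_enterprise_probs(strata, sample_sizes, enterprise_of, n_reps, seed=0):
    """Estimate, for each enterprise, the probability that at least one
    of its establishments is sampled, as a share of n_reps replicates."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_reps):
        sampled = draw_establishments(strata, sample_sizes, rng)
        # Count each enterprise at most once per replicate.
        hits.update({enterprise_of[est] for est in sampled})
    return {ent: hits[ent] / n_reps for ent in set(enterprise_of.values())}

# Hypothetical frame: two strata of four establishments, half sampled in each.
strata = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2", "b3", "b4"]}
sample_sizes = {"A": 2, "B": 2}
enterprise_of = {
    "a1": "E1", "b1": "E1",   # E1 has one establishment in each stratum
    "a2": "E2",               # E2 has a single establishment
    "a3": "E3", "a4": "E3",   # E3 has two establishments in stratum A
    "b2": "E4", "b3": "E5", "b4": "E6",
}

probs = estimate_enterprise_probs(strata, sample_sizes, enterprise_of, n_reps=20000)
```

In this toy frame the exact values are easy to derive (3/4 for E1, 1/2 for E2 and 5/6 for E3), so the simulated probabilities can be checked against them. With real enterprise structures spread over many strata, the exact computation becomes far more tedious.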
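Once these probabilities were derived, we conducted another Monte-Carlo
simulation for computing the variance. We again selected 20,000
samples of establishments using the stratified
SRSWoR design described above, and obtained the corresponding samples of enterprises. For each replicate $r$, we produced an estimate $\hat{Y}_r$ of the total $Y$ using estimator (4.26). The variance was
computed using
$$V(\hat{Y}) = \frac{1}{20{,}000} \sum_{r=1}^{20{,}000} \left( \hat{Y}_r - \bar{Y} \right)^2 \qquad (5.1)$$
where $\bar{Y} = \frac{1}{20{,}000} \sum_{r=1}^{20{,}000} \hat{Y}_r$. Note that because estimator (4.26) is
unbiased, we have $\bar{Y} \approx Y$.
For each estimator, the coefficient of variation was computed
using
$$CV(\hat{Y}) = \frac{\sqrt{V(\hat{Y})}}{Y} \qquad (5.2)$$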
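The computations behind (5.1) and (5.2) can be sketched in Python as below. This is only an illustration under stated assumptions: the variance is taken as the mean squared deviation of the replicate estimates around their average, with the number of replicates as divisor, and the CV as the square root of that variance divided by the total being estimated. The function names and the three replicate values are hypothetical.

```python
import math

def monte_carlo_variance(replicate_estimates):
    """Average of the replicate estimates and their mean squared deviation
    around that average. Dividing by the number of replicates (rather than
    by that number minus one) is an assumption of this sketch."""
    n_reps = len(replicate_estimates)
    mean = sum(replicate_estimates) / n_reps
    variance = sum((y - mean) ** 2 for y in replicate_estimates) / n_reps
    return mean, variance

def coefficient_of_variation(variance, total):
    """Standard error expressed relative to the total being estimated."""
    return math.sqrt(variance) / total

# Hypothetical replicate estimates (three instead of 20,000, for readability).
replicates = [8.0, 10.0, 12.0]
mc_mean, mc_var = monte_carlo_variance(replicates)
cv = coefficient_of_variation(mc_var, total=10.0)
```

With 20,000 replicates, the choice between the two divisors has a negligible effect on the result.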
5.1 Results of the simulation
For the classical GWSM
and all the methods presented, the estimates, variances and coefficients of
variation were computed. The graph below presents the CVs obtained at the
national level.
Graph 5.1 Coefficients of variation by method
Except for Method 7, all methods show a decrease in the
variance relative to the classical GWSM, and the reduction is often substantial. As described earlier, Method
7 (using designated establishments) designates a single establishment within an
enterprise based on the auxiliary variable, and the whole enterprise is
assigned to that establishment. In other words, the designated establishment inherits
all the revenue of the enterprise. This is beneficial when the designated
establishment is in a take-all stratum. However, if the designated
establishment is in a take-some stratum, the distribution within that stratum
becomes more skewed: one hundred percent of the revenue of the enterprise,
times its sampling weight, is assigned to that stratum, and the resulting
variance increases significantly. All the other methods provided reasonable
results, and we analyze them in detail via the graphs that follow.
The following graphs show the CVs for each take-some
stratum, by industry.
Graph 5.2 CV for Manufacturing by strata
Graph 5.3 CV for Retail Trade by strata
Graph 5.4 CV for Restaurants by strata
Note 1: The scale of the CV is different for Manufacturing than for the two other industries.
Note 2: The CVs per stratum of Methods 7 and 8 are not shown here because they are not pertinent for the present comparison. The reason is that for these two methods, the notion of stratum is not the same as for the other methods. The stratification defined by the original sample design is at the establishment level. For Methods 7 and 8, sampling was done at the enterprise level, and therefore a typical stratum for Methods 1 to 6 became a domain for Methods 7 and 8. Of course, the variances associated with Methods 7 and 8 are much higher, making any comparison with the other methods irrelevant.
The CVs are particularly high for the classical GWSM in
some strata, especially in the industry of Manufacturing. This was expected
because the skewness of the distribution of the variable of interest was the
highest among the three industries. Furthermore, in this industry, we have
establishments with revenues that can vary a lot within the same enterprise,
and these establishments can be spread amongst several strata. It is for these
reasons that the variance of the classical GWSM ends up being very high.
All graphs show a reduction in the CV, and
thus in the variance, when any of the suggested methods is used. The CVs are
generally lower for these methods than for the classical GWSM
(represented by the dark blue line with diamonds).
5.2 Comparison of the proposed methods
Method 1 yields very
promising results, given its simplicity. This method results in some of the
lowest CVs among all methods. It targets the source of the problem
of the GWSM directly: the need for an unequal allocation of the weights, somewhat
proportional to the size of the variable of interest. Since most business
surveys use stratification by size, this method works well. It also has
the advantage of not depending directly on the variable of interest.
Method 2 uses an auxiliary variable (here, the number of
employees) to distribute the weights. This variable is not that well correlated
with the variable of interest, which explains why this method yields the smallest decrease
in variance. In fact, this method is a weaker version of Method 3.
Method 3 shares the weight proportionally to the
variable of interest, which is the establishment revenue within the
enterprise. The method performs very well at both the national and provincial
levels. Its CV is generally slightly higher than those of Methods 1, 4, 5 and 6 because of the
high skewness of the distribution of the revenue.
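To make the difference between equal sharing and the proportional sharing of Methods 2 and 3 concrete, the Python sketch below computes, for one hypothetical enterprise, the share of the weight attached to each establishment under three rules: equal shares, shares proportional to the number of employees, and shares proportional to revenue. It assumes that the enterprise weight is the sum, over its sampled establishments, of the share times the establishment sampling weight. The exact definitions of the methods are those of the earlier sections, and the data and function names here are illustrative only.

```python
def shares(sizes):
    """Normalize non-negative size measures so that they sum to 1."""
    total = sum(sizes.values())
    return {est: size / total for est, size in sizes.items()}

def enterprise_weight(link_shares, sampled_weights):
    """Sum, over the sampled establishments, of share times sampling weight
    (an assumed form of the shared weight, for illustration)."""
    return sum(link_shares[est] * w for est, w in sampled_weights.items())

# Hypothetical enterprise with three establishments.
employees = {"est1": 50, "est2": 30, "est3": 20}
revenue = {"est1": 900.0, "est2": 60.0, "est3": 40.0}

equal = shares({est: 1 for est in revenue})
by_employees = shares(employees)
by_revenue = shares(revenue)

# Suppose only the small establishment est3 is sampled, with weight 10.
sampled = {"est3": 10.0}
w_equal = enterprise_weight(equal, sampled)
w_employees = enterprise_weight(by_employees, sampled)
w_revenue = enterprise_weight(by_revenue, sampled)
```

In this toy case, selecting only the smallest establishment gives the enterprise a weight of about 3.3 under equal sharing, 2 under sharing by employees and 0.4 under sharing by revenue. The large enterprise revenue is therefore inflated much less when the shares follow the revenue.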
Methods 4, 5 and 6 give very similar results, offering a
CV between 6% and 10% for Retail Trade and Restaurants, and between 10% and 25%
for Manufacturing. The similarity in the results for all three methods is
reasonable because they all aim to produce the lowest possible variance.
Whenever there is one establishment of an enterprise that is in a take-all
stratum, these methods concentrate all values on this establishment, and assign
links of zero to all other establishments of the enterprise. This is a natural
choice to minimize the variance since the contribution to the variance of that
enterprise becomes null. Since an establishment of a large enterprise can be
part of a take-all stratum, the variance turns out to be lower than for any
other method. For these reasons, these three methods are the best
way to share the weight. However, it was not possible to determine which
of the three was the best.
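The argument that concentrating the links on a take-all establishment removes the enterprise's contribution to the variance can be checked on a small example. The Python sketch below assumes that the enterprise weight has the form of a sum of share times selection indicator divided by selection probability, and that the two establishments are selected independently (for instance because they lie in different strata). Both assumptions, and all names and values, are hypothetical and meant only as an illustration.

```python
def weight_moments(link_shares, incl_probs):
    """Exact mean and variance of sum_j share_j * I_j / pi_j, where I_j is
    the selection indicator of establishment j with probability pi_j,
    assuming the establishments are selected independently."""
    # Each term has expectation share_j, so the mean is the sum of shares.
    mean = sum(link_shares.values())
    # Var(share_j * I_j / pi_j) = share_j^2 * (1 - pi_j) / pi_j.
    variance = sum(
        link_shares[j] ** 2 * (1 - incl_probs[j]) / incl_probs[j]
        for j in link_shares
    )
    return mean, variance

# Hypothetical enterprise: est1 is take-all, est2 is sampled with probability 1/4.
incl_probs = {"est1": 1.0, "est2": 0.25}

equal_shares = {"est1": 0.5, "est2": 0.5}
concentrated = {"est1": 1.0, "est2": 0.0}

mean_eq, var_eq = weight_moments(equal_shares, incl_probs)
mean_conc, var_conc = weight_moments(concentrated, incl_probs)
```

Both sharing rules give an expected weight of 1. The weight has a variance of 0.75 under equal sharing and a variance of zero when all the weight is placed on the take-all establishment.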
Method 7 does not provide good results. Recall that for
this method, we let a single designated establishment represent the whole
enterprise. All the establishment values of the variable of interest are then
summed at the enterprise level and assigned to that designated establishment. The
distribution of the variable of interest by stratum becomes even more skewed
and this leads to a larger variance. From a sampling viewpoint, an enterprise
ends up in a single stratum (because it is represented by a single
establishment), which might not be take-all. In addition, this is not efficient
for producing estimates at the provincial or industry level. The estimates
cannot benefit from the stratification of establishments, while this is
possible for the other methods. This also contributes to a higher variance.
Method 8 uses the selection probabilities of the
enterprises obtained via simulations. This method does well, with a CV of
1.3% at the national level, matching Methods 4, 5 and 6. However, this method
can prove difficult to apply in practice. We must either calculate
the second-order selection probabilities explicitly, which can be very difficult, or
estimate them by simulation, which is very time-consuming.