Comparison of the conditional bias and Kokic and Bell methods for Poisson and stratified sampling
Section 4. Comparison of winsorization and conditional bias

In the previous section, we presented two types of methods for processing influential units applied to survey data:

the Kokic and Bell winsorization, which aims to determine the winsorization thresholds that minimize the mean square error of the winsorized estimator under the sampling design and the distribution of the variable of interest. It was initially conceived for a stratified simple random sampling design, and we have extended it to the case of Poisson sampling. The Kokic and Bell method, like its extension, is thus valid under assumptions made about the distribution of the winsorized variable;
the conditional bias method proposed by Beaumont, Haziza and Ruiz-Gazen, which potentially applies to all sampling designs and does not rely on any assumption about the distribution of the variable of interest; it aims to obtain the estimator for which the most influential unit has the least possible influence.

To compare the efficiency of these two methods, we performed two exercises:

simulations under Poisson sampling;
a comparison on real data, applied to the data from the French labour cost and structure of earnings survey (ECMOSS).

4.1 Simulations in the case of a Poisson sampling

We performed a simulation study to examine the properties of the two robust estimators proposed in the context of Poisson sampling. We ran four scenarios to compare the efficiency of the two estimators, but also to study, in the case of the Kokic and Bell estimator, the robustness of the method to misspecification, i.e., to a difference between the learning model and the model that generated the sample data.

The simulation proceeds as follows:

We consider $L = 1,000$ realizations of a given model, each of which generates a learning population of $N = 5,000$ units;
For each of these realizations, we calculate the optimal threshold $K_{l}$ according to the method proposed in Section 2.2;
Then we create $M = 10,000$ test sampling frames generated according to a (possibly different) model. From each frame we select a sample of expected size $n =500$ by Poisson sampling and calculate the robust estimator ${\hat{θ}}_{(m)}$ with the calculated threshold $K_{l}.$ As a comparison, we also calculate the robust estimator resulting from the method based on the conditional bias.
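The sampling step of this procedure can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function names, the seed and the toy population (equal inclusion probabilities, a constant variable) are assumptions made only to show Poisson sampling and the expansion estimator.

```python
import numpy as np

def poisson_sample(pi, rng):
    """Poisson sampling: unit i is selected independently with probability pi[i]."""
    pi = np.asarray(pi, dtype=float)
    return rng.random(pi.shape) < pi

def expansion_estimator(x, pi, selected):
    """Expansion (Horvitz-Thompson) estimator of the total of x."""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(x[selected] / pi[selected]))

# Hypothetical toy frame: N = 5,000 units, expected sample size n = 500.
rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
pi = np.full(5000, 0.1)          # assumed equal inclusion probabilities
x = np.ones(5000)                # assumed variable of interest
s = poisson_sample(pi, rng)      # boolean sample membership indicators
t_hat = expansion_estimator(x, pi, s)
```

Under Poisson sampling the realized sample size `s.sum()` is random; only its expectation equals 500.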

The inclusion probabilities, as well as the values of the variable $X,$ were generated according to the following model:

$U_{i} \sim \operatorname{Log-N} (1; 1.1),$

$π_{i} = n \times \frac{U_{i}}{\sum_{i =1}^{N} U_{i}},$

$X_{i} = 2,000 \times π_{i} + π_{i} ϵ_{i} + δ_{i} V_{i},$

$ϵ_{i} \sim N (0; 100), \quad V_{i} \sim \operatorname{Log-N} (\log (500); 1.2), \quad δ_{i} \sim B (ω),$

where $ω$ is the Bernoulli parameter, which reflects the proportion of influential values; its values are given in Table 4.1. The notation $\operatorname{Log-N}$ denotes a log-normal distribution.

Table 4.1
Values of parameter $ω$ used to generate populations
Scenario	$ω$ (learning model)	$ω$ (test model)
1	0	0
2	0.01	0.01
3	0.01	0.1
4	0.1	0.01
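As an illustration, the population model can be simulated as follows. This is a sketch under stated assumptions rather than the code used for the study. The second parameter of each distribution is read here as a standard deviation (it could equally be a variance). Inclusion probabilities above 1 are capped at 1, a case the text does not address. The function name `generate_population` is hypothetical.

```python
import numpy as np

def generate_population(omega, N=5000, n=500, rng=None):
    """Generate inclusion probabilities and values of X for one population."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: 1.1 is the standard deviation of log(U).
    u = rng.lognormal(mean=1.0, sigma=1.1, size=N)
    pi = n * u / u.sum()
    # Assumption: probabilities above 1 are capped (not specified in the text).
    pi = np.minimum(pi, 1.0)
    # Assumption: 100 is the standard deviation of epsilon.
    eps = rng.normal(loc=0.0, scale=100.0, size=N)
    # Assumption: 1.2 is the standard deviation of log(V).
    v = rng.lognormal(mean=np.log(500.0), sigma=1.2, size=N)
    delta = rng.binomial(1, omega, size=N)   # 1 flags an influential unit
    x = 2000.0 * pi + pi * eps + delta * v
    return pi, x, delta

# Test model of scenario 2: omega = 0.01.
pi, x, delta = generate_population(0.01, rng=np.random.default_rng(42))
```

Calling the function with `omega=0.0` and then `omega=0.01` reproduces the learning and test models of scenarios 1 and 2; scenarios 3 and 4 use different values of `omega` for the two calls.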

Scenario 1 corresponds to the population model for which the extension of the Kokic and Bell method was developed in the Poisson case with $H =1,$ but in which no or very few units are influential (the value of the parameter $ω$ being fixed at 0). Scenario 2 corresponds to a situation in which this model applies, but in which a small proportion (1%) of units are influential. In scenarios 1 and 2, the same model generates the population used to calculate the threshold and the population from which the sample is drawn.

In scenarios 3 and 4, the basic model is the same between the learning population and the sample, but the number of influential units varies between the two. In scenario 3, the learning population contains 10 times fewer influential units than the sample. Scenario 4 corresponds to the opposite scenario.

As a measure of the bias of an estimator $\hat{θ}$ of a total $T,$ we calculated the relative Monte Carlo bias (in percent)

${BR}_{MC} (\hat{θ}) = \frac{\frac{1}{M} \sum_{m =1}^{M} ({\hat{θ}}_{(m)} - T)}{T} \times 100,$

where ${\hat{θ}}_{(m)}$ is the estimator $\hat{θ}$ in the sample $m,$ $m =1, \dots, M .$

We also calculated the relative efficiency (RE) of the robust estimators relative to the expansion estimator, $\hat{t}$ :

${RE}_{MC} (\hat{θ}) = \frac{\frac{1}{M} \sum_{m =1}^{M} {({\hat{θ}}_{(m)} - T)}^{2}}{\frac{1}{M} \sum_{m =1}^{M} {({\hat{t}}_{(m)} - T)}^{2}} \times 100.$
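The two Monte Carlo measures translate directly into code. The sketch below uses hypothetical function names and made-up replicate values, chosen only so that the arithmetic can be followed by hand.

```python
import numpy as np

def relative_bias(theta_hat, total):
    """Relative Monte Carlo bias, in percent, of the replicates theta_hat."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    return 100.0 * np.mean(theta_hat - total) / total

def relative_efficiency(theta_hat, t_hat, total):
    """Ratio (in percent) of the Monte Carlo MSE of the robust estimator
    to that of the expansion estimator; below 100 means a gain in precision."""
    mse_robust = np.mean((np.asarray(theta_hat, dtype=float) - total) ** 2)
    mse_expansion = np.mean((np.asarray(t_hat, dtype=float) - total) ** 2)
    return 100.0 * mse_robust / mse_expansion

# Made-up example with M = 3 replicates and a true total T = 100.
T = 100.0
theta_hat = [95.0, 99.0, 100.0]   # robust estimator
t_hat = [90.0, 100.0, 110.0]      # expansion estimator
br = relative_bias(theta_hat, T)               # mean error -2, so BR = -2.0
re = relative_efficiency(theta_hat, t_hat, T)  # (26/3) / (200/3) * 100 = 13.0
```

In this toy example the robust estimator is slightly biased downward but has a much smaller mean square error, which is the trade-off the two measures are meant to expose.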

Tables 4.2 and 4.3 present descriptive statistics of the $L = 1,000$ Monte Carlo values, one for each learning population considered.

Table 4.2
Descriptive statistics for scenarios 1 and 2 of the 1,000 simulations for ${BR}_{MC}$ and ${RE}_{MC}$ (in %)
Scenario	1	1	1	1	2	2	2	2
Method	K&B	K&B	BHR	BHR	K&B	K&B	BHR	BHR
Statistic	BR	RE	BR	RE	BR	RE	BR	RE
Min.	-0.2	100	-0.43	100	-9.0	1	-4.3	26
$Q_{1}$	-0.1	100	-0.32	100	-2.9	35	-1.9	51
Median	0.0	100	-0.27	100	-1.8	50	-1.5	62
Mean	0.0	100	-0.27	100	-2.0	50	-1.6	62
$Q_{3}$	0.0	100	-0.23	100	-1.0	64	-1.3	73
Max.	0.0	100	-0.14	100	-0.1	109	-0.6	91

Scenario 1 corresponds to a situation in which no or very few influential units are present in the population: the performance of the robust estimators is therefore identical to that of the usual Horvitz-Thompson estimator, with a relative bias very close to 0. Scenario 2 corresponds to the situation for which the extension of the Kokic and Bell method to the Poisson case was developed, with the introduction of influential units. The two robust estimators are more efficient than the usual estimator, but the Kokic and Bell estimator achieves the larger reduction in mean square error, with a median relative efficiency over the 1,000 simulations of 50%, compared to 62% for the conditional bias method. This result is expected given that the threshold of the Kokic and Bell method is explicitly determined to obtain the estimator with the smallest mean square error.

Table 4.3
Descriptive statistics for scenarios 3 and 4 of the 1,000 simulations for ${BR}_{MC}$ and ${RE}_{MC}$ (in %)
Scenario	3	3	3	3	4	4	4	4
Method	K&B	K&B	BHR	BHR	K&B	K&B	BHR	BHR
Statistic	BR	RE	BR	RE	BR	RE	BR	RE
Min.	-32.2	2	-7.8	27	-4.5	1	-4.3	26
$Q_{1}$	-18.9	50	-5.1	59	-1.8	48	-1.9	51
Median	-13.9	82	-4.6	66	-1.5	70	-1.5	62
Mean	-14.2	89	-4.7	65	-1.5	68	-1.6	62
$Q_{3}$	-9.3	138	-4.2	72	-1.2	91	-1.3	73
Max.	-0.01	537	-2.7	89	-0.6	100	-0.6	91

The performances of the two methods in scenario 3 differ more sharply. Over the whole set of simulations, the conditional bias method succeeds in reducing the mean square error of the estimators, with a relative efficiency ranging from 27% to 89%, i.e., a reduction in mean square error of at least 11%. The Kokic and Bell method, by contrast, deteriorates precision in more than a quarter of cases. In this scenario, the population on which the threshold was calculated contains too few influential units compared to the sample for the calculated threshold to be effective.

In scenario 4, where the learning population contains more influential units than the sample, the performances of the two methods are of the same order of magnitude.

Therefore, these simulations show:

that in the absence of influential units, the two robust estimation methods do not lead to a loss of estimation efficiency;
that when applied under its own assumptions, the Kokic and Bell method leads to more accurate estimators than the conditional bias method;
that the Kokic and Bell method is, however, sensitive to the data used to calculate thresholds; if these data are not generated according to the same model as the data to which the thresholds are applied, the method may lead to a loss of precision;
that the conditional bias method always allows a gain in precision on these simulations, even if this gain is not optimal.

4.2 Application to the Survey on labour cost and structure of earnings (ECMOSS)

4.2.1 Presentation of the survey

The Survey on labour cost and structure of earnings (ECMOSS) is conducted by INSEE every year and harmonized at the European level. It is used to respond to European regulations on the production of statistics on both the cost of labour and structure of earnings which contribute to comparisons between European countries in terms of work time and costs.

ECMOSS is a survey of local business units (or establishments). It covers all sectors, both market and non-market, with the exception of agriculture, state administrations and certain activities (extraterritorial activities, embassies, consulates, activities of individuals acting as employers), and is restricted to businesses with 10 or more employees. It covers establishments located in the metropolitan territory and in the overseas departments. Each sampled business answers two questionnaires: in the first, it must provide aggregated information on its workforce, its payroll and the breakdown of the payroll into its main elements (basic wages, bonuses, social contributions paid by the employer and by employees, etc.), as well as on the number of hours worked by its employees; in the second, it details these elements for a randomly selected sample of its employees.

Given this survey method, the ECMOSS sample design has two stages:

First stage: A sample of approximately 17,000 establishments is selected according to a stratified sampling design by sector, size of business, size of the establishment and geographical location;
Second stage: In each establishment, a sample of employees is selected from the lists of employees reported by the establishment to social security organizations. The sample is drawn independently in each establishment and stratified by social category of the employees, distinguishing between managers and non-managers. The number of employees surveyed in each establishment varies according to its size, but is limited to 24 to prevent the survey from placing too much burden on businesses. In the end, around 150,000 employees are surveyed each year.

Each year, a certain number of establishments do not respond to the survey, and responding establishments do not systematically provide information for all their employees. Therefore, there is total non-response at each stage, which is handled by reweighting according to the homogeneous response group method. Next, the final sample of respondent employees, on which most operations are performed, is calibrated on the population of employees from the files of social security organizations.

Ultimately, the sample of employees is obtained through a complex sampling design, comprising two sampling stages (establishments and employees), with two phases at each stage.

Given the very great variability of the establishments and their wage policy (both in terms of differences in the average levels of wages between establishments and differences in the dispersion of wages in the establishments), the sampling weights of the sampled employees are widely dispersed, and the accuracy of the estimators is sensitive to the influential values of the sample: for example, a very high level executive in a large business, or the athletes employed by a high-level sports club.

4.2.2 Parameter of interest

The main parameter of interest in the survey is the average hourly wage, calculated in different dissemination domains: sectors, sectors crossed with the employment size ranges of the businesses, and sectors crossed with the region in which the establishment is located. The estimators used later in our simulations are obtained as the ratio of the expansion estimator of total remuneration to the expansion estimator of the total number of hours:

$\hat{R} (D) = \frac{\sum_{i \in S \cap D} w_{i} e_{i}}{\sum_{i \in S \cap D} w_{i} h_{i}} \qquad (4.1)$

with $S$ the sample of employees, $D$ the domain of interest, $e_{i}$ the annual remuneration of employee $i,$ $h_{i}$ their annual number of hours worked and $w_{i}$ the employee’s estimation weight, obtained as the inverse of the product of the selection probabilities and the response probabilities associated with each stage and phase of the sample design. Estimator (4.1) does not correspond to the estimator used in practice, because it involves the initial weights corrected for non-response, whereas the estimator used in practice uses the calibrated weights. In the context of this article, the calibration phase was not taken into account. It could have been, using the classical residual technique, but at the cost of an additional degree of complexity that we deemed unnecessary for comparing the two robust estimation methods.
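Estimator (4.1) is a weighted ratio restricted to the domain. A minimal sketch follows, with a hypothetical function name and made-up values for three employees:

```python
import numpy as np

def hourly_wage_ratio(w, e, h, in_domain):
    """Ratio estimator (4.1): weighted remuneration over weighted hours in D."""
    w, e, h = (np.asarray(a, dtype=float) for a in (w, e, h))
    d = np.asarray(in_domain, dtype=bool)
    return float(np.sum(w[d] * e[d]) / np.sum(w[d] * h[d]))

# Made-up sample of three employees; the third is outside the domain D.
w = [10.0, 20.0, 5.0]              # estimation weights
e = [30000.0, 20000.0, 90000.0]    # annual remuneration
h = [1600.0, 1600.0, 1800.0]       # annual hours worked
in_domain = [True, True, False]
r_hat = hourly_wage_ratio(w, e, h, in_domain)   # 700000 / 48000
```

The third employee, although highly paid, does not enter the estimate because the sums run over $S \cap D$ only.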

4.2.3 How to adapt the processing methods for influential units to the ECMOSS sampling design

Estimator (4.1) is not the expansion estimator of a total, for which the previously described methods were designed. The problem can, however, be adapted to the framework of these two methods.

Indeed, an unbiased estimator of the variance of $\sum_{i \in S} w_{i} {\hat{L}}_{i} [\hat{R} (D)],$ with ${\hat{L}}_{i} [\hat{R} (D)] = \frac{e_{i} - \hat{R} (D) h_{i}}{\sum_{i \in S \cap D} w_{i} h_{i}} I (i \in D)$ the estimated linearized variable of $\hat{R} (D),$ is also an asymptotically unbiased estimator of $V (\hat{R} (D)) .$ Thus, making the estimator of the total of the linearized variable ${\hat{L}}_{i} [\hat{R} (D)]$ robust to influential units also makes $\hat{R} (D)$ robust to them. Each method, applied to the estimated linearized variable, generates a winsorized value of this variable, denoted ${\hat{L}}_{i}^{w} [\hat{R} (D)] .$ The effects of the processing of the influential units are then transferred to all other variables of interest of the survey through the estimation weight, by calculating a winsorized estimation weight:

$w_{i}^{w} = w_{i} \frac{{\hat{L}}_{i}^{w} [\hat{R} (D)]}{{\hat{L}}_{i} [\hat{R} (D)]} .$

We thus test the two methods of Kokic and Bell and Beaumont, Haziza and Ruiz-Gazen to estimate the total of ${\hat{L}}_{i} [\hat{R} (D)] .$ However, each of the two methods requires adaptations to be applied to the sampling design and variables of interest of ECMOSS.
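The linearization and the transfer of the winsorization to the weights can be sketched as follows. The function names are hypothetical. Leaving the weight unchanged when the linearized variable is exactly zero is an assumption made here to avoid a division by zero; the text does not discuss that case.

```python
import numpy as np

def linearized_variable(w, e, h, in_domain):
    """Estimated linearized variable of the ratio estimator (zero outside D)."""
    w, e, h = (np.asarray(a, dtype=float) for a in (w, e, h))
    d = np.asarray(in_domain, dtype=bool)
    total_hours = np.sum(w[d] * h[d])
    r_hat = np.sum(w[d] * e[d]) / total_hours
    return np.where(d, (e - r_hat * h) / total_hours, 0.0)

def winsorized_weight(w, l, l_w):
    """Weight w * l_w / l; assumption: weight unchanged where l == 0."""
    w, l, l_w = (np.asarray(a, dtype=float) for a in (w, l, l_w))
    ratio = np.ones_like(l)
    nonzero = l != 0
    ratio[nonzero] = l_w[nonzero] / l[nonzero]
    return w * ratio

# Made-up sample of three employees; the third is outside the domain D.
w = np.array([10.0, 20.0, 5.0])
e = np.array([30000.0, 20000.0, 90000.0])
h = np.array([1600.0, 1600.0, 1800.0])
l = linearized_variable(w, e, h, [True, True, False])
weighted_sum = float(np.sum(w * l))   # zero up to rounding, by construction
```

The weighted sum of the linearized variable vanishes on the sample, so the variable takes both signs. This is why the Kokic and Bell method, designed for positive variables, cannot be applied to it directly.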

4.2.4 Adaptation of winsorization according to the Kokic and Bell method and its extension

The survey and its parameter of interest, even after linearization, do not fit the framework of the Kokic and Bell method, whether the original method or the extension presented previously. First, the ECMOSS sample is not selected by stratified simple random sampling or by Poisson sampling. Moreover, the variable to winsorize, the estimated linearized variable ${\hat{L}}_{i} [\hat{R} (D)],$ is not always positive. To apply the Kokic and Bell method to the ECMOSS case, we have made the following adaptations.

We apply the processing of the influential units as though the employees were directly selected by stratified simple random sampling (Poisson sampling for the extension). The pseudo-strata are defined by the sector, the number of employees of the business, the location of the employing establishment and the social category of the employee (distinguishing managers and non-managers). Certain categories of the location variable are grouped to avoid generating pseudo-strata containing too few observations (we distinguish Île de France, the overseas departments and the rest of the country). As the classical method acts as though the sample in each pseudo-stratum was selected by simple random sampling, so that all employees of the same pseudo-stratum have the same sampling weight, it ignores the dispersion of the estimation weights within the pseudo-strata that results from the actual sampling design of the survey, and thus risks missing influential units. In the case of the extension of the method, this dispersion of the weights is properly taken into account.
In each of these pseudo-strata, winsorization is not applied directly to the estimated linearized variable, but to a translated version of it.

More precisely, we define for each sampled employee:

${\hat{T}}_{i} [\hat{R} (D)] = {\hat{L}}_{i} [\hat{R} (D)] + \left| \min_{j \in S} {\hat{L}}_{j} [\hat{R} (D)] \right|$

for which we calculate winsorization thresholds in the pseudo-strata according to the method initially proposed by Kokic and Bell and for its extension. We then deduce two sets of estimation weights used to estimate the average hourly wage in each domain of interest of the form:

$w_{i}^{w} = w_{i} \frac{{\hat{T}}_{i}^{w} [\hat{R} (D)]}{{\hat{T}}_{i} [\hat{R} (D)]} .$

We can thus only identify and process influential units with high values of the estimated linearized variable, i.e., employees whose hourly wage is higher than the average hourly wage in the domain of interest $D .$ Units with low hourly wages cannot be identified by this method, but pose fewer problems for the accuracy of estimates, since the distribution of hourly wages is particularly skewed, with a very long tail on the right.
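The sketch below illustrates why only the right tail is affected. It assumes that the translation shifts the linearized variable so that its smallest value becomes zero. It also replaces the actual Kokic and Bell winsorization by a simple cut at a threshold supplied by the caller, a deliberately simplified stand-in, since the optimal thresholds come from the method of Section 2 and are not recomputed here. The function name and the numerical values are hypothetical.

```python
import numpy as np

def translated_winsorized_weights(w, l, threshold):
    """Winsorized weights w * T_w / T on the translated variable T.

    Simplified stand-in: T_w = min(T, threshold), with a user-supplied
    threshold (not the optimal Kokic and Bell threshold).
    Assumption: the weight is left unchanged where T == 0.
    """
    w = np.asarray(w, dtype=float)
    l = np.asarray(l, dtype=float)
    t = l - l.min()                    # translated variable, non-negative
    t_w = np.minimum(t, threshold)     # only large values are cut
    ratio = np.ones_like(t)
    positive = t > 0
    ratio[positive] = t_w[positive] / t[positive]
    return w * ratio

w = np.array([10.0, 10.0, 10.0, 10.0])
l = np.array([-0.5, 0.0, 0.2, 3.0])    # made-up linearized values
w_win = translated_winsorized_weights(w, l, threshold=2.0)
```

Only the last unit, whose translated value 3.5 exceeds the threshold, sees its weight reduced (to 10 × 2 / 3.5); the unit with the lowest value is never modified, however extreme it is.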

A final adaptation is necessary to apply the method to ECMOSS. The method can only be used if observations of the variable of interest in each pseudo-stratum are available. Previous editions of the survey can be used. However, the tests performed to evaluate the efficiency of the Kokic and Bell method applied to the Annual Sectoral Surveys (see Deroyon, 2015) have shown that using responses to previous editions of the survey to calculate winsorization thresholds does not lead to the largest gains in accuracy. This is because, with the small number of observations available per stratum, these thresholds are determined with too little precision, so that too many units may be winsorized or, conversely, influential units may escape winsorization. We have therefore chosen to use the auxiliary information available in the social security files on total remuneration paid annually to employees and their number of hours worked. These data are not those measured in the survey (in particular, the wages declared in the social security files form the tax base on which social contributions and tax contributions on wages are calculated, and not the labour income paid to employees), but are strongly correlated with them.

4.2.5 Adaptation of Beaumont, Haziza and Ruiz-Gazen estimator

Because of its generality, the conditional bias method requires fewer adaptations to be applied to ECMOSS. It can thus be applied directly to the variables of interest of the survey without the need to mobilize external data. However, calculating conditional biases while considering the whole sampling design is complex; therefore, for our simulations, we chose to apply the conditional bias method as though the employees had been selected directly by Poisson sampling, with selection probabilities $1 / w_{i},$ where $w_{i}$ designates the estimation weight after correction for non-response of employee $i .$ The conditional bias used to identify influential units is therefore equal to:

$B_{1 i} \{ {\hat{L}}_{i} [\hat{R} (D)] \} = (w_{i} - 1) {\hat{L}}_{i} [\hat{R} (D)] .$

With formula (3.8), the Beaumont, Haziza and Ruiz-Gazen estimator processes only a limited number of units, i.e., the observations with the lowest and highest conditional biases, whose indices define the sets $A_{min}$ and $A_{max} :$

$A_{min} = \operatorname{argmin}_{j \in S} B_{1 j} \{ {\hat{L}}_{j} [\hat{R} (D)] \}$

$A_{max} = \operatorname{argmax}_{j \in S} B_{1 j} \{ {\hat{L}}_{j} [\hat{R} (D)] \} .$

Thus, the processed estimation weight of the influential units is equal to:

$w_{i}^{BHR} = \begin{cases} \frac{(2 | A_{min} | - 1) w_{i} + 1}{2 | A_{min} |} & \text{if } i \in A_{min} \\ \frac{(2 | A_{max} | - 1) w_{i} + 1}{2 | A_{max} |} & \text{if } i \in A_{max} \\ w_{i} & \text{otherwise,} \end{cases}$

where $| A_{min} |$ and $| A_{max} |$ respectively designate the cardinality of $A_{min}$ and $A_{max} .$

Compared to the Kokic and Bell method, the robust estimator based on conditional biases does not focus on influential units located in the right-hand part of the distribution of the estimated linearized variable, but identifies the influential units with very low and very high values for this variable. It also focuses on an a priori limited number of units, since only observations with the minimum and maximum conditional bias are modified.
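The weight adjustment of this subsection can be sketched as follows. The function name and the data are hypothetical. The closing check assumes that formula (3.8), which is not reproduced in this section, has the usual form in which half the sum of the smallest and largest conditional biases is subtracted from the expansion estimator.

```python
import numpy as np

def bhr_weights(w, l):
    """Processed weights under the Poisson approximation.

    Conditional bias: b_i = (w_i - 1) * l_i. Only the units attaining the
    minimum or the maximum conditional bias have their weight modified.
    """
    w = np.asarray(w, dtype=float)
    l = np.asarray(l, dtype=float)
    b = (w - 1.0) * l
    a_min = np.flatnonzero(b == b.min())   # indices attaining the minimum
    a_max = np.flatnonzero(b == b.max())   # indices attaining the maximum
    w_bhr = w.copy()
    for idx in (a_min, a_max):
        k = len(idx)                        # cardinality of the set
        w_bhr[idx] = ((2 * k - 1) * w[idx] + 1.0) / (2 * k)
    return w_bhr, b

# Made-up weights and linearized values for four employees.
w = np.array([10.0, 4.0, 2.0, 5.0])
l = np.array([-1.0, 0.5, 3.0, 0.1])
w_bhr, b = bhr_weights(w, l)               # b = [-9, 1.5, 3, 0.4]

# Assumed form of (3.8): robust total = expansion total - (b_min + b_max) / 2.
robust_total = float(np.sum(w_bhr * l))
expected = float(np.sum(w * l) - 0.5 * (b.min() + b.max()))
```

Here the first and third employees are processed, with new weights (10 + 1)/2 = 5.5 and (2 + 1)/2 = 1.5, while the two others keep their weights. Units at both ends of the distribution are treated, unlike in the translated Kokic and Bell approach.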

4.2.6 Robust estimation on several domains of interest

As previously described, the domains of interest for the dissemination of the ECMOSS results are numerous. For the sake of simplicity of dissemination and to comply with the requirements of European regulations, each employee in the individual sample must have only one estimation weight, so adaptations are necessary:

Robust estimators for several sets of domains of interest
European regulations require the dissemination of results in sets of domains that intersect and are not included in one another, such as intersections of sectors and ranges of numbers of employees and crossings of sectors and regions. Sampled units may belong to more than one dissemination domain.
Ideally, the processing of influential units should be done in each domain of interest separately, so that a single observation may be associated with a different estimation weight for each dissemination domain to which it belongs. However, this solution is not possible for the reasons mentioned above.
Another solution is to apply both of the methods to the crossings of all the dissemination domains. The risk is then that the processing is applied to estimators calculated on very small populations, for which many units are influential. Thus, for the estimation on the real dissemination domains, too many units would be winsorized. The resulting estimators will be less precise than the robust estimators adapted to each domain, and potentially also less precise than the estimators without any processing of influential units, because they are too biased.
Robust estimators for all modalities of a domain of interest
For a given set of domains (e.g., industry sectors), an observation can be identified as influential and processed for estimation on more than one dissemination domain, and thus have more than one final estimation weight. This is the case if the selection of an observation belonging to a dissemination domain has an influence on the selection of other units belonging to other dissemination domains (e.g., in the case of stratified sampling, if the dissemination domains cut across the sampling strata).
This situation is impossible for the Beaumont, Haziza and Ruiz-Gazen estimator, for which we assume a Poisson sampling design. However, it can happen for the Kokic and Bell method and its extension as we implement them, because some dissemination domains do not consist of groupings of the pseudo-strata that we have formed. The situation is then the same as that described in the case of several sets of dissemination domains: the only way to maintain a unique estimation weight for each sampled unit is to apply the methods to pseudo-dissemination domains close to the real dissemination domains but made up of groupings of winsorization pseudo-strata. These pseudo-domains are in fact formed by intersections of sectors, a range of the number of employees of the businesses and the geographic location of the establishments (distinguishing only the three categories specified above).

To evaluate the performance in terms of precision gains or losses of the methods defined above, we carried out a set of simulations based on the ECMOSS sampling design and data on wages and hours worked from the social security files, available for all employees and for which we are therefore able to compare the average hourly wages observed in the population with their various estimators. In these simulations, we compared the efficiency of the methods applied directly to each dissemination domain, which lead to the optimal results, and to the pseudo-dissemination domains defined above.

4.2.7 Simulations

The simulations are conducted on the social security files, from which the sample of employees is selected and which are available for all French employees. They are implemented as follows:

the ECMOSS sampling design (including the selection of responding establishments and employees) is applied 5,000 times to produce 5,000 samples of employees, denoted $S_{m}, m =1, \dots, 5,000;$
for each sample and each dissemination domain, we calculate the usual expansion estimator of ${\hat{R}}_{m} (D);$
the Kokic and Bell winsorization and conditional bias are applied to each sample according to different specifications:
- the Kokic and Bell winsorization, classical or adapted to Poisson sampling, is applied only as though the real dissemination domains were the pseudo-dissemination domains defined above.
- The Beaumont, Haziza and Ruiz-Gazen estimator is applied in each activity sector taken separately on the one hand, and on the other hand as though the pseudo-dissemination domains defined above were the real dissemination domains. For each dissemination domain, we can thus compare the performances of the conditional bias estimator applied in its optimal specification for this domain (producing an estimator ${\hat{R}}_{m}^{BHR *} (D)$ of the average hourly wages in the domain) to the conditional bias method and the Kokic and Bell method (producing estimators ${\hat{R}}_{m}^{BHR} (D),$ ${\hat{R}}_{m}^{KB} (D)$ and ${\hat{R}}_{m}^{{KB}_{poiss}} (D)$ for the extension) applied according to specifications that are sub-optimal for this domain but simpler to implement.

For each robust estimator and each domain, we calculate the mean relative bias (RB) and the relative mean square error (RMSE) for all simulations by:

$\begin{matrix} AB [{\hat{R}}^{KB} (D)] & = & \frac{\sum_{m =1}^{5,000} [{\hat{R}}_{m}^{KB} (D) - R (D)]}{5,000} \\ AMSE [{\hat{R}}^{KB} (D)] & = & \frac{\sum_{m =1}^{5,000} {[{\hat{R}}_{m}^{KB} (D) - R (D)]}^{2}}{5,000} \\ RB [{\hat{R}}^{KB} (D)] & = & 100 \frac{AB [{\hat{R}}^{KB} (D)]}{R (D)} \\ RMSE [{\hat{R}}^{KB} (D)] & = & 100 \frac{AMSE [{\hat{R}}^{KB} (D)]}{AMSE [\hat{R} (D)]} \end{matrix}$

where $R (D)$ designates the average hourly wage observed in the social security files in the domain $D$ and $\hat{R} (D)$ designates the usual expansion estimator of this parameter. The formulas are written here for the classical Kokic and Bell estimator, by way of example, and are defined in the same way for the other robust estimators. Relative bias compares the bias of the robust estimator to the real value of the parameter. The relative mean square error measures the gain or loss of precision provided by the robust estimators relative to the usual estimator.

4.2.8 Simulation results

Among the different estimators tested in our simulations, the estimator obtained by applying the adaptation of the Kokic and Bell method to Poisson sampling is distinguished by extremely poor performances, summarized in Table 4.4. Application of the Kokic and Bell method extended to Poisson sampling for the ECMOSS results in a significant or even dramatic deterioration in the precision of the estimates.

Table 4.4
Statistics on the MSE ratio of the robust Kokic and Bell estimators applied to the Poisson sampling in the different domains of interest
Values of $RMSE [{\hat{R}}^{{KB}_{poiss}} (D)]$ (in %), by set of domains
Statistic	NACE*Workforce	NACE	NACE*NUTS
Min.	18	128	33
Mean	490	1,858	324
Max.	4,437	8,606	2,466

Figures 4.1, 4.2 and 4.3 focus on presenting the results of the conditional bias and classical Kokic and Bell methods, applied under the hypothesis of a stratified simple random sampling.

Figure 4.1

Description for Figure 4.1

Bar chart presenting the relative mean square errors for the estimators of average hourly wage by sector. Relative mean square errors are on the y-axis, going from 0 to 150%. NACE sections are on the x-axis, going from B to S. There are three bars for each section, representing the following estimators: conditional bias in the pseudo-domains, conditional bias in the NACE sections and Kokic and Bell in the pseudo-domains. No estimator shows a systematically higher or lower relative mean square error; it depends on the section. Sections C, N and O have the highest relative MSE.

Figure 4.2

Description for Figure 4.2

Box-plots presenting the distribution of relative mean square errors in each domain. Relative mean square errors are on the y-axis, going from 40 to 160%. Three domains are shown: NACE, NACE*No. and NACE*NUTS. There are three box-plots for each domain: one for the conditional bias, one for the optimal conditional bias and one for Kokic-Bell. Box-plots for the optimal conditional bias show a smaller dispersion than those of the other estimators. The median of the conditional bias box-plots is lower than that of the other estimators.

Figure 4.1 shows the relative mean square errors of the robust average hourly wage estimators in each section of the Statistical classification of economic activities in the European Community (NACE, a grouping of business sectors into 21 categories, of which 18 are in the ECMOSS field) and Figure 4.2 shows the distribution of relative mean square errors in each set of domains (NACE sections, crossings of sections and number of business employees, and crossings of sections and location of the establishment).

For almost all domains of interest, the robust estimators considered provide gains in precision over the usual expansion estimator. The domains in which the robust estimators have a higher error than the usual estimator are also those where the estimation variance is the lowest originally. The processes for influential units considered in these figures (conditional bias and classical Kokic and Bell method) are thus able to reduce estimation errors when necessary without causing too much loss of precision when the estimators are not affected by influential units.

The biases of the average hourly wage estimators in the sectors are low (see Figure 4.3), except in some domains where the sample size is small (A: Agriculture, forestry and fishing; K: Financial and insurance activities; R: Arts, entertainment and recreation). The same results are also observed for the other domains.

Figure 4.3

Description for Figure 4.3

Bar chart presenting the relative biases for estimators of average hourly wage by sector. Relative bias is on the y-axis, going from -3 to 1%. NACE sections are on the x-axis, going from A to S. There are three bars for each section, representing the following estimators: conditional bias in the pseudo-domains, conditional bias in the NACE sections and Kokic and Bell in the pseudo-domains. The biases of the average hourly wage estimators in the sectors are low, except in some domains where the sample size is small (A: Agriculture, forestry and fishing; K: Financial and insurance activities; R: Arts, entertainment and recreation).

The application of conditional bias methods adapted to each domain gives the best results for the estimation in the NACE sections, but not necessarily in the other dissemination domains. The NACE sections are much more aggregated than the pseudo-domains used for the identification of influential units, so the bias introduced by the processing of influential units is more significant in the cases where the application is made on pseudo-domains, compared to the optimal version applied directly to the NACE sections. In the other domains, the identification of influential units at a finer level than the real dissemination domain makes it possible to identify more influential units and thus substantially reduce the estimation variance, without introducing too much additional bias, when the domain used to identify the influential units and the real dissemination domains are close. Differences between the sampling design assumed when applying each of the two methods and the actual sampling design may explain why the use of the Beaumont, Haziza and Ruiz-Gazen robust estimator in each dissemination domain does not necessarily translate into greater precision gains.

The differences between the results obtained with the conditional bias and Kokic and Bell methods under the hypothesis of the stratified simple random sampling design are, however, small. Note that, to implement these simulations, we use the population data as additional, out-of-sample observations of the variables of interest to calculate the winsorization thresholds in the Kokic and Bell method. Since we also evaluate the performance of the different estimators on these same data, the Kokic and Bell method is favoured a priori.

The extension of the Kokic and Bell method to Poisson sampling results in a significant deterioration in the precision of the estimators.

The discrepancies between the performances of the two implementations of the Kokic and Bell method are thus very large. However, both implementations rely on two hypotheses:

a hypothesis on the sampling design used to select the sample;
a hypothesis on the distribution of the variable of interest in subpopulations $U_{h} .$

In both applications of the Kokic and Bell method, the first hypothesis is not respected. The violation of this hypothesis is, however, a priori more serious when we apply the Kokic and Bell method as though the sample had been selected by stratified simple random sampling in pseudo-strata constructed ad hoc, because in so doing we assume that the selection probabilities are identical within these pseudo-strata, which does not hold at all. The Kokic and Bell method applied as though the employees had been selected by Poisson sampling, for its part, uses the true first-order inclusion probabilities, but neglects the dependence between the sample membership indicators of different employees.

However, the population model postulated for the Kokic and Bell method extended to the Poisson case is not valid, since the first-order inclusion probabilities are not proportional to the variable of interest considered. It is more complex to assess the validity of the population model used for the classical Kokic and Bell method; up to a point, it is still possible to consider that the values of the variable of interest in a pseudo-stratum are drawn from the same law, whose expectation and variance can be estimated by the empirical mean and variance of the values of the variable of interest in the stratum.

The performance differences between the two implementations of the Kokic and Bell method are therefore complex to analyze. A first possible explanation is that the performance of the method is more sensitive to violations of the hypothesis on the law of the observations than to violations of the hypothesis on the form of the sampling design. Fizzala (2017) reached the same conclusion when applying winsorization in the context of corporate profiling. In our ECMOSS simulations, we observe that the classical Kokic and Bell method, based on the hypothesis of stratified simple random sampling, gives sound results even though this hypothesis is only partially respected. Future extensions of this work could consist of validating this explanation on the basis of simulations. Another explanation for these differences in performance may lie in the relationship between the two hypotheses in the case of the extension of the Kokic and Bell method to Poisson sampling. Indeed, in the classical Kokic and Bell method, the hypotheses on the sampling design and on the law of the variable of interest in each stratum are unrelated, whereas in the case of Poisson sampling, the population model involves the selection probabilities and therefore implies additional constraints on the sampling design. Therefore, the fact that the selection probabilities are not proportional to the variable of interest implies that, for the extension of the Kokic and Bell method to Poisson sampling, the hypotheses on the sampling design and on the population are violated simultaneously, which could explain the sharp increase in the errors of the estimator.

However, the conditional bias and classical Kokic and Bell methods, whatever the configuration, seem to be able to identify influential units for the estimation of the parameters affected, and thus guarantee significant gains in precision even when they are applied in a setting that is remote from their original hypotheses and the actual sampling design of the survey.

Appendix

A Demonstrations of the formulas for the extension of the Kokic and Bell method in the case of a Poisson sampling

A.1 Calculation of the mean square error of the winsorized estimator

First, we will calculate

$\begin{array}{ll} E_{P} {[\hat{T} (\tilde{X}) - T (X)]}^{2} & = E_{P} \{ {[\hat{T} (\tilde{X}) - T (\tilde{X})]}^{2} + {[T (\tilde{X}) - T (X)]}^{2} \\ & \quad + 2 [\hat{T} (\tilde{X}) - T (\tilde{X})] [T (\tilde{X}) - T (X)] \} \\ & = E_{P} {[\hat{T} (\tilde{X}) - T (\tilde{X})]}^{2} + {[T (\tilde{X}) - T (X)]}^{2} \end{array}$

with $T (\tilde{X}) = E_{P} [\hat{T} (\tilde{X})] = \sum_{h =1}^{H} \sum_{i \in U_{h}} {\tilde{X}}_{h i} .$

Furthermore,

$E_{P} {{[\hat{T} (\tilde{X}) - T (\tilde{X})]}^{2}} = \sum_{h =1}^{H} \sum_{i \in U_{h}} d_{h i} (1 - \frac{1}{d_{h i}}) {\tilde{X}}_{h i}^{2}$

so that, finally:

$E_{P} {[\hat{T} (\tilde{X}) - T (X)]}^{2} = \sum_{h =1}^{H} \sum_{i \in U_{h}} d_{h i} (1 - \frac{1}{d_{h i}}) {\tilde{X}}_{h i}^{2} + {[\sum_{h =1}^{H} \sum_{i \in U_{h}} ({\tilde{X}}_{h i} - X_{h i})]}^{2} . \qquad (A.1)$
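As an illustration, expression (A.1) can be checked numerically on a very small population by listing every possible Poisson sample. The sketch below is not part of the original derivation: the function names and the toy values of $d_{h i},$ $X_{h i}$ and ${\tilde{X}}_{h i}$ are hypothetical, and the stratum index is dropped for brevity.

```python
from itertools import product


def design_mse_by_enumeration(d, x, x_tilde):
    """Exact E_P[(T_hat(x_tilde) - T(x))^2] under Poisson sampling.

    Every subset of the population is a possible sample; unit i is
    selected independently with probability 1 / d[i].
    """
    pi = [1.0 / w for w in d]
    total_x = sum(x)
    mse = 0.0
    for sample in product([0, 1], repeat=len(d)):
        prob = 1.0
        estimate = 0.0
        for s, p, w, xt in zip(sample, pi, d, x_tilde):
            prob *= p if s else (1.0 - p)
            estimate += s * w * xt  # weighted sum of winsorized values
        mse += prob * (estimate - total_x) ** 2
    return mse


def design_mse_by_formula(d, x, x_tilde):
    """Right-hand side of (A.1): Poisson variance plus squared bias."""
    variance = sum(w * (1.0 - 1.0 / w) * xt ** 2 for w, xt in zip(d, x_tilde))
    bias = sum(xt - xi for xt, xi in zip(x_tilde, x))
    return variance + bias ** 2


# Hypothetical toy population (weights, true values, winsorized values).
d = [2.0, 4.0, 5.0, 10.0]
x = [3.0, 1.0, 8.0, 2.5]
x_tilde = [3.0, 1.0, 5.0, 2.0]

print(design_mse_by_enumeration(d, x, x_tilde))
print(design_mse_by_formula(d, x, x_tilde))
```

The two printed values should agree up to rounding error, since the enumeration and the closed form compute the same design expectation.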

Assuming in each stratum that:

$E_{m} (d_{h i} X_{h i}) = μ_{h};$
${Var}_{m} (d_{h i} X_{h i}) = σ_{h}^{2} < + \infty;$
and that $d_{h i} X_{h i}$ are independent and of density $g_{h} (x) >0.$

and noting that:

$\begin{array}{ll} {\tilde{X}}_{h i} & = X_{h i} (1 - J_{h i}) + J_{h i} [\frac{X_{h i}}{d_{h i}} + (1 - \frac{1}{d_{h i}}) \frac{K_{h}}{d_{h i}}] \\ & = \frac{1}{d_{h i}} [d_{h i} X_{h i} + J_{h i} (1 - \frac{1}{d_{h i}}) (K_{h} - d_{h i} X_{h i})] \end{array}$

and so that:

${\tilde{X}}_{h i} - X_{h i} = \frac{1}{d_{h i}} (1 - \frac{1}{d_{h i}}) J_{h i} (K_{h} - d_{h i} X_{h i}), \qquad (A.2)$

we obtain:

$\begin{array}{ll} {\tilde{X}}_{h i}^{2} & = \frac{1}{d_{h i}^{2}} [d_{h i}^{2} X_{h i}^{2} + J_{h i} {(1 - \frac{1}{d_{h i}})}^{2} (K_{h}^{2} + d_{h i}^{2} X_{h i}^{2} - 2 d_{h i} X_{h i} K_{h}) \\ & \quad + 2 (1 - \frac{1}{d_{h i}}) (d_{h i} X_{h i} J_{h i} K_{h} - J_{h i} d_{h i}^{2} X_{h i}^{2})], \qquad (A.3) \end{array}$

and that:

$\begin{array}{ll} E_{m} {[\sum_{h =1}^{H} \sum_{i \in U_{h}} ({\tilde{X}}_{h i} - X_{h i})]}^{2} & = \sum_{h =1}^{H} \sum_{i \in U_{h}} V_{m} ({\tilde{X}}_{h i} - X_{h i}) + {[\sum_{h =1}^{H} \sum_{i \in U_{h}} E_{m} ({\tilde{X}}_{h i} - X_{h i})]}^{2} \\ & = \sum_{h =1}^{H} \sum_{i \in U_{h}} \{ E_{m} [{({\tilde{X}}_{h i} - X_{h i})}^{2}] - {[E_{m} ({\tilde{X}}_{h i} - X_{h i})]}^{2} \} \\ & \quad + {[\sum_{h =1}^{H} \sum_{i \in U_{h}} E_{m} ({\tilde{X}}_{h i} - X_{h i})]}^{2} . \qquad (A.4) \end{array}$
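The winsorized value and identity (A.2) above can be sketched in a few lines. This is a minimal illustration, assuming (as in the rest of this appendix) that a unit is winsorized when $d_{h i} X_{h i} > K_{h};$ the function names and numerical values are hypothetical.

```python
def winsorize(x, d, k):
    """Winsorized value of x for a unit with weight d and threshold k.

    The unit is left unchanged when d * x <= k; otherwise it is replaced
    by x / d + (1 - 1 / d) * k / d.
    """
    if d * x <= k:
        return x
    return x / d + (1.0 - 1.0 / d) * k / d


def winsorization_shift(x, d, k):
    """Right-hand side of (A.2): the change x_tilde - x."""
    j = 1.0 if d * x > k else 0.0  # winsorization indicator
    return (1.0 / d) * (1.0 - 1.0 / d) * j * (k - d * x)


# Hypothetical unit above the threshold: d * x = 40 > k = 20.
x, d, k = 8.0, 5.0, 20.0
print(winsorize(x, d, k))                 # winsorized value
print(x + winsorization_shift(x, d, k))   # same value obtained through (A.2)
```

Both calls return the same number, and a unit below the threshold is returned unchanged with a shift of zero.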

In the end, taking the expectation under the model of expression (A.1) and applying simplifications (A.2), (A.3), (A.4), we obtain, after some additional simplifications:

$\begin{array}{ll} E_{m} E_{P} {[\hat{\tilde{T}} (\tilde{X}) - T (X)]}^{2} & = \sum_{h =1}^{H} \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}) \{ μ_{h}^{2} + σ_{h}^{2} + {(1 - \frac{1}{d_{h i}})}^{2} [K_{h}^{2} E_{m} (J_{h i}) + E_{m} (J_{h i} d_{h i}^{2} X_{h i}^{2}) - 2 K_{h} E_{m} (J_{h i} d_{h i} X_{h i})] \\ & \quad + 2 (1 - \frac{1}{d_{h i}}) [K_{h} E_{m} (J_{h i} d_{h i} X_{h i}) - E_{m} (J_{h i} d_{h i}^{2} X_{h i}^{2})] \} \\ & \quad + \sum_{h =1}^{H} \sum_{i \in U_{h}} {(\frac{1}{d_{h i}})}^{2} {(1 - \frac{1}{d_{h i}})}^{2} \{ K_{h}^{2} E_{m} (J_{h i}) + E_{m} (J_{h i} d_{h i}^{2} X_{h i}^{2}) - 2 K_{h} E_{m} (J_{h i} d_{h i} X_{h i}) - {[K_{h} E_{m} (J_{h i}) - E_{m} (J_{h i} d_{h i} X_{h i})]}^{2} \} \\ & \quad + { \{ \sum_{h =1}^{H} \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}) [K_{h} E_{m} (J_{h i}) - E_{m} (J_{h i} d_{h i} X_{h i})] \} }^{2} . \end{array}$

Given that the $d_{h i} X_{h i}$ are assumed to be independent and to follow the same law within each stratum, it is sufficient to consider a random variable $Z_{h}$ that has the same law as any one of the $d_{h i} X_{h i},$ i.e., verifying:

$E_{m} (Z_{h}) = μ_{h};$
${Var}_{m} (Z_{h}) = σ_{h}^{2} < + \infty;$
and that $Z_{h}$ are independent and of density $g_{h} (x) >0.$

Thus, we can also consider a random variable $J_{h} = I_{Z_{h} > K_{h}}$ to calculate the expectation, with respect to the model, of the winsorization indicator. The previous expression is rewritten:

$\begin{array}{ll} E_{m} E_{P} [{(\hat{\tilde{T}} (\tilde{X}) - T (X))}^{2}] & = \sum_{h =1}^{H} \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}) \{ μ_{h}^{2} + σ_{h}^{2} + {(1 - \frac{1}{d_{h i}})}^{2} [K_{h}^{2} E_{m} (J_{h}) + E_{m} (J_{h} Z_{h}^{2}) - 2 K_{h} E_{m} (J_{h} Z_{h})] \\ & \quad + 2 (1 - \frac{1}{d_{h i}}) [K_{h} E_{m} (J_{h} Z_{h}) - E_{m} (J_{h} Z_{h}^{2})] \} \\ & \quad + \sum_{h =1}^{H} \sum_{i \in U_{h}} {(\frac{1}{d_{h i}})}^{2} {(1 - \frac{1}{d_{h i}})}^{2} \{ K_{h}^{2} E_{m} (J_{h}) + E_{m} (J_{h} Z_{h}^{2}) - 2 K_{h} E_{m} (J_{h} Z_{h}) - {[K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})]}^{2} \} \\ & \quad + { \{ \sum_{h =1}^{H} \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}) [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] \} }^{2} . \end{array}$
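To make the last expression concrete, it can be evaluated numerically once a law is chosen for $Z_{h}.$ The sketch below assumes, purely for illustration, an exponential law (so that $σ_{h}^{2} = μ_{h}^{2}$ and the tail moments have closed forms); this choice, the function names and the stratum used in the example are hypothetical and are not taken from the article.

```python
import math


def exponential_tail_moments(k, mean):
    """E_m(J), E_m(J Z) and E_m(J Z^2) for Z exponential, J = 1{Z > k}."""
    tail = math.exp(-k / mean)
    return (tail,
            (k + mean) * tail,
            (k * k + 2.0 * k * mean + 2.0 * mean * mean) * tail)


def model_design_mse(strata):
    """Model-design MSE of the winsorized estimator under Poisson sampling.

    strata: list of dicts with keys 'd' (weights of the units in the
    stratum), 'mean' (mu_h) and 'k' (threshold K_h).
    """
    total = 0.0
    bias_sum = 0.0
    for s in strata:
        mu, k = s["mean"], s["k"]
        sigma2 = mu * mu  # variance of an exponential law (assumption)
        e_j, e_jz, e_jz2 = exponential_tail_moments(k, mu)
        quad = k * k * e_j + e_jz2 - 2.0 * k * e_jz  # E_m[J (K - Z)^2]
        lin = k * e_j - e_jz                         # E_m[J (K - Z)]
        for d in s["d"]:
            a = (1.0 / d) * (1.0 - 1.0 / d)
            total += a * (mu * mu + sigma2
                          + (1.0 - 1.0 / d) ** 2 * quad
                          + 2.0 * (1.0 - 1.0 / d) * (k * e_jz - e_jz2))
            total += a * a * (quad - lin * lin)
            bias_sum += a * lin
    return total + bias_sum ** 2


def mse_for_threshold(k):
    """MSE for one hypothetical stratum: 100 units, weight 10, mean 1."""
    return model_design_mse([{"d": [10.0] * 100, "mean": 1.0, "k": k}])


# Grid search over thresholds between 0.10 and 10.00.
grid = [0.01 * j for j in range(10, 1001)]
best_k = min(grid, key=mse_for_threshold)
print(best_k, mse_for_threshold(best_k), mse_for_threshold(50.0))
```

With a very large threshold no unit is winsorized and the function returns the variance of the unwinsorized estimator; the grid search illustrates that an intermediate threshold gives a smaller MSE under this assumed model.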

A.2 Search for thresholds to minimize the MSE

To determine the value of the thresholds $K_{h}$ leading to the optimum of $E_{m} E_{P} {{[\hat{T} (\tilde{X}) - T (X)]}^{2}},$ we use the same property as Kokic and Bell in their demonstration, i.e., that:

$E_{m} (Z_{h}^{p} J_{h}) = \int_{K_{h}}^{+ \infty} t^{p} g_{h} (t) d t,$

and so that

$\frac{\partial}{\partial K_{h}} E_{m} (Z_{h}^{p} J_{h}) = - K_{h}^{p} g_{h} (K_{h}) .$
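The derivative property above can be checked numerically. The sketch below assumes an exponential density for $g_{h}$ (an illustrative choice, not one made in the article), approximates the tail integral with a midpoint rule and compares a central finite difference with $- K_{h}^{p} g_{h} (K_{h}).$ All names and values are hypothetical.

```python
import math


def exp_density(t, mean):
    """Density of an exponential law with the given mean (assumed model)."""
    return math.exp(-t / mean) / mean


def tail_moment(p, k, mean, width=50.0, steps=50_000):
    """Midpoint-rule approximation of the integral of t^p g(t) over (k, +inf).

    The integral is truncated at k + width * mean, where the exponential
    tail is negligible.
    """
    h = width * mean / steps
    total = 0.0
    for i in range(steps):
        t = k + (i + 0.5) * h
        total += (t ** p) * exp_density(t, mean) * h
    return total


def tail_moment_derivative(p, k, mean, eps=1e-4):
    """Central finite difference of the tail moment with respect to k."""
    upper = tail_moment(p, k + eps, mean)
    lower = tail_moment(p, k - eps, mean)
    return (upper - lower) / (2.0 * eps)


k, mean = 1.5, 2.0
for p in (0, 1, 2):
    numeric = tail_moment_derivative(p, k, mean)
    closed_form = -(k ** p) * exp_density(k, mean)
    print(p, numeric, closed_form)
```

For each power $p,$ the two printed values should agree to several decimal places.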

Differentiating with respect to $K_{h}$ and simplifying, we obtain:

$\begin{array}{ll} \frac{\partial}{\partial K_{h}} E_{m} E_{P} {[\hat{\tilde{T}} (\tilde{X}) - T (X)]}^{2} & = 2 B \times A_{h} E_{m} (J_{h}) \\ & \quad + 2 C_{h} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] [1 - E_{m} (J_{h})] \\ & \quad + 2 D_{h} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] + 2 F_{h} E_{m} (J_{h} Z_{h}) \qquad (A.5) \end{array}$

where

$B = \sum_{h =1}^{H} \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}) [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})],$
$A_{h} = \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}}),$
$C_{h} = \sum_{i \in U_{h}} {(\frac{1}{d_{h i}})}^{2} {(1 - \frac{1}{d_{h i}})}^{2} ,$
$D_{h} = \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) {(1 - \frac{1}{d_{h i}})}^{3} ,$
$F_{h} = \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) {(1 - \frac{1}{d_{h i}})}^{2} .$
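The sums $C_{h},$ $D_{h}$ and $F_{h}$ defined above are linked by a simple identity, obtained by factoring the common term in each summand. It is spelled out here as a worked step, using only the definitions above:

$C_{h} + D_{h} = \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) {(1 - \frac{1}{d_{h i}})}^{2} [\frac{1}{d_{h i}} + (1 - \frac{1}{d_{h i}})] = \sum_{i \in U_{h}} (\frac{1}{d_{h i}}) {(1 - \frac{1}{d_{h i}})}^{2} = F_{h} .$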

Equation (A.5) is reduced to:

$\begin{array}{l} \frac{\partial}{\partial K_{h}} E_{m} E_{P} [{(\hat{\tilde{T}} (\tilde{X}) - T (X))}^{2}] =0 \\ ⇔ \\ A_{h} \times B \times E_{m} (J_{h}) + (C_{h} + D_{h}) K_{h} E_{m} (J_{h}) \\ - C_{h} E_{m} (J_{h}) [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] + (F_{h} - C_{h} - D_{h}) E_{m} (J_{h} Z_{h}) =0. \end{array}$

Finally, by noting that $(F_{h} - C_{h} - D_{h}) =0$ and assuming that $E_{m} (J_{h}) >0,$ we obtain that the threshold $K_{h}$ minimizing the MSE verifies the equation:

$A_{h} \times B + (C_{h} + D_{h}) K_{h} - C_{h} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] =0$

which is reduced further to

$B + \frac{(C_{h} + D_{h})}{A_{h}} K_{h} = \frac{C_{h}}{A_{h}} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})] .$

It remains to be shown that $\frac{C_{h} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})]}{A_{h} B}$ tends toward zero when $n \to \infty .$ However,

$\frac{C_{h} | K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h}) |}{| A_{h} B |} = \frac{C_{h} | K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h}) |}{| A_{h} | \sum_{l =1}^{H} | A_{l} | | K_{l} E_{m} (J_{l}) - E_{m} (J_{l} Z_{l}) |}$

and, according to hypothesis (2.8) on the inclusion probabilities, $d_{h i} >1$ for all $h =1, \dots, H$ and all $i \in U_{h},$ which implies $A_{h} >0,$ and thus:

$\begin{array}{ll} \frac{| C_{h} | | K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h}) |}{| A_{h} B |} & \leq \frac{C_{h}}{A_{h}^{2}} \\ & \leq \frac{\sum_{i \in U_{h}} {(\frac{1}{d_{h i}})}^{2} {(1 - \frac{1}{d_{h i}})}^{2}}{{[\sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}})]}^{2}} \\ & \leq \frac{1}{[\sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}})]} . \end{array}$

However, it is possible to demonstrate from hypothesis (2.8) that ${[\sum_{i \in U_{h}} (\frac{1}{d_{h i}}) (1 - \frac{1}{d_{h i}})]}^{- 1} = O (\frac{1}{N_{h}}) .$ Thus: $\frac{C_{h} [K_{h} E_{m} (J_{h}) - E_{m} (J_{h} Z_{h})]}{A_{h} B}$ tends toward zero when $n \to \infty .$

$K_{h}$ is thus equivalent in each stratum to $- \frac{A_{h}}{(C_{h} + D_{h})} B,$ when the size of the population and the sample tend toward infinity.
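Because $B$ itself depends on all the thresholds, the asymptotic relation above defines the $K_{h}$ only implicitly. One way to solve it, sketched below, is to note that every $K_{h}$ is a known multiple of the single scalar $- B,$ and to find that scalar by bisection. The sketch assumes an exponential law for $Z_{h},$ for which $E_{m} (J_{h} Z_{h}) - K_{h} E_{m} (J_{h}) = μ_{h} e^{- K_{h} / μ_{h}};$ this model, the bisection approach, the function names and the example strata are illustrative assumptions, not the procedure used in the article.

```python
import math


def stratum_sums(d):
    """A_h, C_h and D_h computed from the weights d of one stratum."""
    a = sum((1.0 / w) * (1.0 - 1.0 / w) for w in d)
    c = sum((1.0 / w) ** 2 * (1.0 - 1.0 / w) ** 2 for w in d)
    dd = sum((1.0 / w) * (1.0 - 1.0 / w) ** 3 for w in d)
    return a, c, dd


def asymptotic_thresholds(strata, tol=1e-10):
    """Solve K_h = -A_h / (C_h + D_h) * B for all strata at once.

    strata: list of (weights, mean) pairs; Z_h is assumed exponential
    with the given mean. Writing L = -B, each K_h equals ratio_h * L and
    L solves L = sum_h A_h * mu_h * exp(-K_h / mu_h).
    """
    sums = [stratum_sums(d) for d, _ in strata]
    ratios = [a / (c + dd) for a, c, dd in sums]

    def excess(l_val):
        # Value of -B implied by the thresholds ratio_h * l_val, minus l_val.
        implied = 0.0
        for (a, _, _), r, (_, mu) in zip(sums, ratios, strata):
            implied += a * mu * math.exp(-r * l_val / mu)
        return implied - l_val

    # excess is decreasing, positive at 0 and negative at the upper bound.
    lo = 0.0
    hi = sum(a * mu for (a, _, _), (_, mu) in zip(sums, strata))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    l_star = 0.5 * (lo + hi)
    return [r * l_star for r in ratios]


# Two hypothetical strata: (weights, mean of d * X).
strata = [([10.0] * 100, 1.0), ([4.0] * 50, 3.0)]
print(asymptotic_thresholds(strata))
```

Bisection is used rather than a plain fixed-point iteration because the latter can oscillate when the thresholds are large relative to the stratum means.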

References

Chambers, R. (1986). Outlier robust finite population estimation. Journal of the American Statistical Association, 81, 1063-1069.

Beaumont, J.-F., Haziza, D. and Ruiz-Gazen, A. (2013). A unified approach to robust estimation in finite population sampling. Biometrika, 100, 555-569.

Clark, R.G. (1995). Winsorization methods in sample surveys. Master’s thesis, Department of Statistics, Australian National University.

Dalén, J. (1987). Practical estimators of a population total which reduce the impact of large observations. R & D Report, Statistics Sweden.

Demoly, E., Fizzala, A. and Gros, E. (2014). Méthodes et pratiques des enquêtes entreprises à l’Insee. Journal de la Société Française de Statistique, 155-4.

Deroyon, T. (2015). Traitement des observations atypiques d’une enquête par winsorisation : application aux Enquêtes Sectorielles Annuelles. Actes des Journées de Méthodologie Statistique.

Fizzala, A. (2017). Adaptations of Winsorization Caused by Profiling - An Example Based on the French SBS Survey. European Establishment Survey Workshop, Southampton.

Favre-Martinoz, C., Haziza, D. and Beaumont, J.-F. (2015). A method of determining the winsorization threshold, with an application to domain estimation. Survey Methodology, 41, 1, 57-77. Paper available at https://www150.statcan.gc.ca/n1/fr/pub/12-001-x/2015001/article/14199-eng.pdf.

Favre-Martinoz, C., Haziza, D. and Beaumont, J.-F. (2016). Robust inference in two-phase sampling designs with application to unit nonresponse. Scandinavian Journal of Statistics, 43, 1019-1034.

Kokic, P.N., and Bell, P.A. (1994). Optimal winsorizing cut-offs for a stratified finite population estimation. Journal of Official Statistics, 10-4, 419-435.

Moreno-Rebollo, J.-L., Muñoz-Reyez, A.M. and Muñoz-Pichardo, J.M. (1999). Influence diagnostics in survey sampling: Conditional bias. Biometrika, 86, 923-968.

Moreno-Rebollo, J.-L., Muñoz-Reyez, A.M., Jimenez-Gamero, J.-L. and Muñoz-Pichardo, J.M. (2002). Influence diagnostics in survey sampling: Estimating the conditional bias. Metrika, 55, 209-214.

Rivest, L.-P., and Hurtubise, D. (1995). On Searls’ winsorized mean for skewed populations. Survey Methodology, 21, 2, 107-116. Paper available at https://www150.statcan.gc.ca/n1/fr/pub/12-001-x/1995002/article/14399-eng.pdf.

Tambay, J.-L. (1988). An integrated approach for the treatment of outliers in sub-annual surveys. Proceedings of the Survey Research Methods Section, American Statistical Association, 229-234.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2018-12-20
