Comparison of the conditional bias and Kokic and Bell methods for Poisson and stratified sampling
Section 4. Comparison of winsorization and conditional bias
In the previous section, we presented two types of methods for processing influential units in survey data:
- the Kokic and Bell winsorization, which determines the winsorization thresholds that minimize the mean square error of the winsorized estimator under the sampling design and the distribution of the variable of interest. It was initially conceived for a stratified simple random sampling design, and we have extended it to the case of Poisson sampling. The Kokic and Bell method, like its extension, is thus valid under hypotheses made about the distribution of the winsorized variable;
- the conditional bias method proposed by Beaumont, Haziza and Ruiz-Gazen, which potentially applies to all sampling designs and does not rely on any hypothesis about the distribution of the variable of interest; it aims to obtain the estimator for which the most influential unit has the least possible influence.
To compare the efficiency of these two methods, we performed two exercises:
- simulations under Poisson sampling;
- a comparison on real data, using the French labour cost and structure of earnings survey (ECMOSS).
4.1 Simulations in the case of a Poisson sampling
We performed a simulation study to examine the properties of the two robust estimators proposed in the context of Poisson sampling. We ran four scenarios to compare the efficiency of the two estimators, but also to study, in the case of the Kokic and Bell estimator, the robustness of the method to model misspecification, i.e., to a difference between the learning model and the model that generated the sample data.
The simulation proceeds as follows:
- We consider realizations of a certain model, which makes it possible to generate our learning set of units;
- For each of these realizations, we calculate the optimal threshold according to the method proposed in Section 2.2;
- Then we create test sampling frames generated according to a (different) model, on which we select a sample of expected size following a Poisson sampling design and calculate the robust estimator with the threshold calculated. As a comparison, we also calculate the robust estimator resulting from the method based on the conditional bias.
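The sample selection step of this procedure can be sketched as follows. This is only an illustration of Poisson sampling and of the expansion estimator to which the robust estimators are compared; the function names and the toy frame are hypothetical, and the population model of the article is not reproduced here.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Return the indices of the units selected by Poisson sampling.

    Each unit k enters the sample independently with probability
    inclusion_probs[k], so the realized sample size is random.
    """
    return [k for k, p in enumerate(inclusion_probs) if rng.random() < p]


def expansion_estimator(values, inclusion_probs, sample):
    """Expansion (Horvitz-Thompson) estimator of the population total."""
    return sum(values[k] / inclusion_probs[k] for k in sample)


# Hypothetical toy frame: one large unit among ordinary ones.
rng = random.Random(12345)
y = [10.0, 12.0, 8.0, 300.0, 11.0]
pi = [0.2, 0.2, 0.2, 0.9, 0.2]
s = poisson_sample(pi, rng)
print(s, expansion_estimator(y, pi, s))
```

Because the draws are independent, the expected sample size is the sum of the inclusion probabilities, which is what "a sample of expected size" refers to.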
The inclusion probabilities, as well as the values of the variable, were generated according to a model whose Bernoulli parameter reflects the proportion of influential values; the values of this parameter are given in Table 4.1. The model also involves a log-normal distribution.
Table 4.1
Values of the Bernoulli parameter used to generate populations
| Scenario | Learning model | Test model |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 0.01 | 0.01 |
| 3 | 0.01 | 0.1 |
| 4 | 0.1 | 0.01 |
Scenario 1 corresponds to the population model for which the extension of the Kokic and Bell method to the Poisson case was developed, but in which no or very few units are influential (the value of the parameter being fixed at 0). Scenario 2 corresponds to a situation in which this model applies, but in which a small proportion (1%) of units are influential. In scenarios 1 and 2, the same model generates the population used to calculate the threshold and the sample to which the threshold is applied.
In scenarios 3 and 4, the basic model is the same for the learning population and the sample, but the proportion of influential units differs between the two. In scenario 3, the learning population contains 10 times fewer influential units than the sample. Scenario 4 is the opposite case.
As a measure of the bias of an estimator of a total, we calculated the relative Monte Carlo bias (in percent), computed from the values taken by the estimator in each simulated sample. We also calculated the relative efficiency (RE) of the robust estimators relative to the expansion estimator.
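Both measures take the usual Monte Carlo form. As a sketch under assumed notation (the symbols $R$, $t$, $\hat{t}^{(r)}$ and $\hat{t}_{\mathrm{exp}}^{(r)}$ are introduced here for illustration and are not taken from the article), with $R$ simulated samples, $t$ the true total, $\hat{t}^{(r)}$ the robust estimator in sample $r$ and $\hat{t}_{\mathrm{exp}}^{(r)}$ the expansion estimator in the same sample:

```latex
\mathrm{RB}_{MC}(\hat{t}) = 100 \times \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{t}^{(r)} - t}{t},
\qquad
\mathrm{RE}(\hat{t}) = 100 \times
\frac{\sum_{r=1}^{R} \bigl(\hat{t}^{(r)} - t\bigr)^{2}}
     {\sum_{r=1}^{R} \bigl(\hat{t}_{\mathrm{exp}}^{(r)} - t\bigr)^{2}}.
```

Under this convention, a relative efficiency below 100 means that the robust estimator has a smaller Monte Carlo mean square error than the expansion estimator, which is how the RE columns of the tables are read in the text.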
Tables 4.2 and 4.3 present the descriptive statistics associated with the Monte Carlo values calculated according to the learning population considered.
Table 4.2
Descriptive statistics for scenarios 1 and 2 over the 1,000 simulations for (equation)
K&B: Kokic and Bell; BHR: Beaumont, Haziza and Ruiz-Gazen; RB: relative bias (%); RE: relative efficiency (%).
| Statistic | Scenario 1, K&B, RB | Scenario 1, K&B, RE | Scenario 1, BHR, RB | Scenario 1, BHR, RE | Scenario 2, K&B, RB | Scenario 2, K&B, RE | Scenario 2, BHR, RB | Scenario 2, BHR, RE |
|---|---|---|---|---|---|---|---|---|
| Min. | -0.2 | 100 | -0.43 | 100 | -9.0 | 1 | -4.3 | 26 |
| 1st quartile | -0.1 | 100 | -0.32 | 100 | -2.9 | 35 | -1.9 | 51 |
| Median | 0.0 | 100 | -0.27 | 100 | -1.8 | 50 | -1.5 | 62 |
| Mean | 0.0 | 100 | -0.27 | 100 | -2.0 | 50 | -1.6 | 62 |
| 3rd quartile | 0.0 | 100 | -0.23 | 100 | -1.0 | 64 | -1.3 | 73 |
| Max. | 0.0 | 100 | -0.14 | 100 | -0.1 | 109 | -0.6 | 91 |
Scenario 1 corresponds to a situation in which no or very few influential units are present in the population: the performance of the robust estimators is therefore identical to that of the usual Horvitz-Thompson estimator, with a relative bias very close to 0. Scenario 2 corresponds to the situation for which the extension of the Kokic and Bell method to the Poisson case was developed, with the introduction of influential units. The two robust estimators are more efficient than the usual estimator, but the gain in mean square error is greater for the Kokic and Bell estimator, with a median relative efficiency over the 1,000 simulations of 50%, compared to 62% for the conditional bias method. This result is expected, given that the threshold of the Kokic and Bell method is explicitly determined to obtain the estimator with the smallest mean square error.
Table 4.3
Descriptive statistics for scenarios 3 and 4 over the 1,000 simulations for (equation)
K&B: Kokic and Bell; BHR: Beaumont, Haziza and Ruiz-Gazen; RB: relative bias (%); RE: relative efficiency (%).
| Statistic | Scenario 3, K&B, RB | Scenario 3, K&B, RE | Scenario 3, BHR, RB | Scenario 3, BHR, RE | Scenario 4, K&B, RB | Scenario 4, K&B, RE | Scenario 4, BHR, RB | Scenario 4, BHR, RE |
|---|---|---|---|---|---|---|---|---|
| Min. | -32.2 | 2 | -7.8 | 27 | -4.5 | 1 | -4.3 | 26 |
| 1st quartile | -18.9 | 50 | -5.1 | 59 | -1.8 | 48 | -1.9 | 51 |
| Median | -13.9 | 82 | -4.6 | 66 | -1.5 | 70 | -1.5 | 62 |
| Mean | -14.2 | 89 | -4.7 | 65 | -1.5 | 68 | -1.6 | 62 |
| 3rd quartile | -9.3 | 138 | -4.2 | 72 | -1.2 | 91 | -1.3 | 73 |
| Max. | -0.01 | 537 | -2.7 | 89 | -0.6 | 100 | -0.6 | 91 |
The performances of the two methods differ more markedly in scenario 3. Over the whole set of simulations, the conditional bias method succeeds in reducing the mean square error of the estimators, with a relative efficiency between 27% and 89%, whereas the Kokic and Bell method degrades precision in more than a quarter of cases. In this scenario, the population on which the threshold was calculated contains too few influential units compared to the sample for the calculated threshold to be effective.
In scenario 4, where the learning population contains more influential units than the sample, the performances of the two methods are of the same order of magnitude.
These simulations therefore show:
- that in the absence of influential units, the two robust estimation methods do not lead to a loss of estimation efficiency;
- that when its hypotheses hold, the Kokic and Bell method leads to more accurate estimators than the conditional bias method;
- that the Kokic and Bell method is, however, sensitive to the data used to calculate thresholds; if these data are not generated according to the same model as the data to which the thresholds are applied, the method may lead to a loss of precision;
- that the conditional bias method never leads to a loss of precision in these simulations, even if the gain is not optimal.
4.2 Application to the Survey on labour cost and structure of earnings
4.2.1 Presentation of the survey
The Survey on labour cost and structure of earnings (ECMOSS) is conducted by INSEE every year and harmonized at the European level. It is used to meet European regulations on the production of statistics on both the cost of labour and the structure of earnings, which feed comparisons between European countries in terms of working time and labour costs.
ECMOSS is a survey of local business units (or establishments). It covers businesses with 10 or more employees in all sectors, both market and non-market, with the exception of agriculture, state administrations and certain activities (extraterritorial activities, embassies, consulates, activities of individuals acting as employers). It covers establishments located in metropolitan France and in the overseas departments. Each sampled business answers two questionnaires: in the first, it must provide aggregated information on its workforce, its payroll and the breakdown of the payroll into its main elements (basic wages, bonuses, social contributions paid by the employer and by employees, etc.), and on the number of hours worked by its employees; in the second, it details these elements for a randomly selected sample of its employees.
Given this survey method, the ECMOSS sample design has two stages:
- First stage: a sample of approximately 17,000 establishments is selected according to a sampling design stratified by sector, size of the business, size of the establishment and geographical location;
- Second stage: in each establishment, a sample of employees is selected from the lists of employees reported by the establishment to social security organizations. The sample is drawn independently in each establishment and stratified by social category of the employees, distinguishing between managers and non-managers. The number of employees surveyed in each establishment varies according to its size, but is limited to 24 to prevent the survey from placing too much burden on businesses. In the end, around 150,000 employees are surveyed each year.
Each year, a certain number of establishments do not respond to the survey, and responding establishments do not systematically provide information for all their sampled employees. Therefore, there is total non-response at each stage, which is handled by reweighting according to the homogeneous response group method. Next, the final sample of respondent employees, on which most operations are performed, is calibrated on the population of employees from the files of social security organizations.
In sum, the sample of employees is obtained through a complex sampling design comprising two stages (establishments and employees), with two phases at each stage.
Given the very great variability of the establishments and their wage policies (both in terms of differences in average wage levels between establishments and differences in the dispersion of wages within establishments), the sampling weights of the sampled employees are widely dispersed, and the accuracy of the estimators is sensitive to influential values in the sample: for example, a very senior executive in a large business, or the athletes employed by a high-level sports club.
4.2.2 Parameter of interest
The main parameter of interest in the survey is the average hourly wage, calculated in different dissemination domains: sectors, sectors crossed with the employment size ranges of the businesses, and sectors crossed with the region in which the establishment is located. The estimators used later in our simulations are obtained as the ratio of the expansion estimators of total remuneration and of the total number of hours. This ratio, referred to as estimator (4.1), is computed over the sampled employees belonging to the domain of interest, from each employee's annual remuneration, annual volume of hours worked and estimation weight. The estimation weight is the inverse of the product of the selection probabilities and the response probabilities associated with each stage and phase of the sample design. Estimator (4.1) does not correspond to the estimator used in practice, because it involves the initial weights corrected for non-response, while the estimator used in practice uses the calibrated weights. In this article, the calibration phase was not taken into account. It could have been, using the classical residual technique, but at the cost of an additional degree of complexity that we deemed unnecessary for comparing the two robust estimation methods.
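As an illustration, the ratio described above can be computed as follows. This is a minimal sketch: the function name, argument names and toy values are hypothetical, and the weights are assumed to be the non-response-corrected weights described in the text.

```python
def domain_hourly_wage(weights, pay, hours, in_domain):
    """Ratio of two expansion estimators within a domain.

    Numerator: weighted total of annual remuneration.
    Denominator: weighted total of annual hours worked.
    Only units flagged in `in_domain` contribute to either total.
    """
    total_pay = sum(w * y for w, y, d in zip(weights, pay, in_domain) if d)
    total_hours = sum(w * h for w, h, d in zip(weights, hours, in_domain) if d)
    return total_pay / total_hours


# Hypothetical toy sample of three employees; the third is outside the domain.
weights = [10.0, 20.0, 5.0]
pay = [30000.0, 20000.0, 90000.0]
hours = [1600.0, 1600.0, 1800.0]
in_domain = [True, True, False]
print(domain_hourly_wage(weights, pay, hours, in_domain))
```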
4.2.3 How to adapt the processing methods for influential units to the ECMOSS sampling design
Estimator (4.1) is not the expansion estimator of a total, for which the previously described methods were designed. The problem can, however, be adapted to the framework of these two methods.
Indeed, an unbiased estimator of the variance of the expansion estimator of the total of the estimated linearized variable of estimator (4.1) is also an asymptotically unbiased estimator of the variance of estimator (4.1). Thus, a robust estimator of the total of the linearized variable will also be robust to the units that are influential for estimator (4.1).
Each method, applied to the estimated linearized variable, generates a winsorized value of this variable. The effects of the processing of the influential units are then transferred to all other variables of interest of the survey through the estimation weight, by calculating a winsorized estimation weight.
We thus test the two methods, Kokic and Bell and Beaumont, Haziza and Ruiz-Gazen, to estimate the total of the estimated linearized variable. However, each of the two methods requires adaptations to be applied to the sampling design and variables of interest of ECMOSS.
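The linearization step can be sketched as follows, using the standard linearized variable of a ratio. The function names are hypothetical, and the weight-transfer rule shown (scaling the weight by the ratio of the winsorized to the original linearized value) is an assumption about the form of the winsorized weight, not a formula recovered from the article.

```python
def linearized_ratio(weights, pay, hours, in_domain):
    """Estimated linearized variable of the ratio (pay total / hours total).

    For a unit in the domain: (pay - ratio * hours) / estimated hours total.
    Units outside the domain get 0, since they do not affect the ratio.
    """
    total_pay = sum(w * y for w, y, d in zip(weights, pay, in_domain) if d)
    total_hours = sum(w * h for w, h, d in zip(weights, hours, in_domain) if d)
    ratio = total_pay / total_hours
    return [(y - ratio * h) / total_hours if d else 0.0
            for y, h, d in zip(pay, hours, in_domain)]


def transfer_to_weight(weight, z, z_winsorized):
    """Assumed rule: choose the new weight so that
    new_weight * z equals weight * z_winsorized."""
    if z == 0.0:
        return weight  # nothing to transfer for a unit with no influence
    return weight * z_winsorized / z


z = linearized_ratio([2.0, 2.0], [10.0, 30.0], [1.0, 1.0], [True, True])
print(z)
```

By construction, the weighted sum of the linearized variable over the sample is zero, so the variable takes both negative and positive values; this is why the Kokic and Bell method cannot be applied to it without further adaptation.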
4.2.4 Adaptation of winsorization according to the Kokic and Bell method and its extension
The survey and its parameter of interest, even after linearization, do not fit the framework of the Kokic and Bell method, whether in its original form or in the extension presented previously. First, the ECMOSS sample is not selected by stratified simple random sampling or by Poisson sampling. Moreover, the variable to winsorize, the estimated linearized variable, is not always positive. To apply the Kokic and Bell method to the ECMOSS case, we made the following adaptations.
- We process influential units as though the employees had been selected directly by stratified simple random sampling (Poisson sampling for the extension), in pseudo-strata defined by the sector, the number of employees of the business, the location of the employing establishment and the social category of the employee (distinguishing managers and non-managers). Some categories of the location variable are grouped to avoid generating pseudo-strata containing too few observations: we distinguish Île-de-France, the overseas departments and the rest of the country. As the classical method acts as though the sample in each pseudo-stratum had been selected by simple random sampling, so that all employees of the same pseudo-stratum have the same sampling weight, it ignores the dispersion of the estimation weights within the pseudo-strata that results from the actual sampling design of the survey, and thus risks missing influential units. In the extension of the method, this dispersion of the weights is properly taken into account.
- In each of these pseudo-strata, winsorization is not applied directly to the estimated linearized variable, but to a translated version of it.
More precisely, we define for each sampled employee a translated version of the estimated linearized variable, for which we calculate winsorization thresholds in the pseudo-strata according to the method initially proposed by Kokic and Bell and according to its extension. We then deduce two sets of estimation weights, which are used to estimate the average hourly wage in each domain of interest.
We can thus only identify and process influential units with high values of the estimated linearized variable, i.e., employees whose hourly wage is higher than the average hourly wage in the domain of interest. Units with low hourly wages cannot be identified by this method, but they pose fewer problems for the accuracy of estimates, since the distribution of hourly wages is particularly skewed, with a very long tail on the right.
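The one-sided nature of this treatment can be seen in a minimal sketch of winsorization above a threshold. The form below is the type II winsorization commonly associated with Kokic and Bell; that the article uses exactly this form, and the function name, are assumptions. The threshold is taken as given: its optimal calculation (Section 2.2) is not shown.

```python
def winsorize_above(value, weight, threshold):
    """One-sided winsorization of a (translated, positive) value.

    Values at or below the threshold are left untouched. Above it, the
    excess over the threshold is divided by the weight, so that the excess
    counts once in the weighted total instead of `weight` times.
    """
    if value <= threshold:
        return value
    return threshold + (value - threshold) / weight


# Hypothetical values: only the second observation exceeds the threshold.
print(winsorize_above(5.0, 10.0, 8.0), winsorize_above(18.0, 10.0, 8.0))
```

Since only values above the threshold are modified, small values of the translated variable are never processed, whatever their weight.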
A final adaptation is necessary to apply the method to ECMOSS. The method can only be used if observations of the variable of interest are available in each pseudo-stratum. Previous editions of the survey can be used for this purpose. However, the tests performed to evaluate the efficiency of the Kokic and Bell method applied to the Annual Sectoral Surveys (see Deroyon, 2015) have shown that using responses to previous editions of the survey to calculate winsorization thresholds does not lead to the largest gains in accuracy. This is because, with the small number of observations available per stratum, the thresholds are determined with too little precision, so that too many units may be winsorized or, conversely, influential units may escape winsorization. We have therefore chosen to use the auxiliary information available in the social security files on total remuneration paid annually to employees and on their number of hours worked. These data are not those measured in the survey (in particular, the wages declared in the social security files form the base on which social contributions and taxes on wages are calculated, and not the labour income paid to employees), but they are strongly correlated with them.
4.2.5 Adaptation of the Beaumont, Haziza and Ruiz-Gazen estimator
Because of its generality, the conditional bias method requires fewer adaptations to be applied to ECMOSS. It can thus be applied directly to the variables of interest of the survey, without the need to draw on external data. However, calculating conditional biases while considering the whole sampling design is complex; therefore, for our simulations, we chose to apply the conditional bias method as though the employees had been selected directly by Poisson sampling, with selection probabilities equal to the inverse of the employee's estimation weight after correction for non-response. The conditional bias used to identify influential units is computed under this assumed design.
With formula (3.8), the Beaumont, Haziza and Ruiz-Gazen estimator processes only a limited number of units, i.e., the observations with the lowest and highest conditional biases, whose indices define two sets of units. The processed estimation weight of the influential units then depends on the cardinalities of these two sets.
Compared to the Kokic and Bell method, the robust estimator based on conditional biases does not focus on influential units located in the right-hand part of the distribution of the estimated linearized variable, but identifies the influential units with very low and very high values of this variable. It also focuses on an a priori limited number of units, since only the observations with the minimum and maximum conditional bias are modified.
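Under the Poisson working assumption, the conditional bias of a sampled unit for the expansion estimator is its value multiplied by its weight minus one, and the robust estimator of Beaumont, Haziza and Ruiz-Gazen subtracts half the sum of the smallest and largest conditional biases from the expansion estimator. A minimal sketch at the level of the estimated total follows; the function names are hypothetical, and the corresponding adjustment of individual weights described in the text is not reproduced.

```python
def conditional_biases_poisson(weights, values):
    """Conditional bias of each sampled unit under Poisson sampling:
    (weight - 1) * value, where weight = 1 / inclusion probability."""
    return [(w - 1.0) * v for w, v in zip(weights, values)]


def robust_total(weights, values):
    """Expansion estimator corrected by the midpoint of the extreme
    conditional biases (minimum and maximum over the sample)."""
    expansion = sum(w * v for w, v in zip(weights, values))
    biases = conditional_biases_poisson(weights, values)
    return expansion - 0.5 * (min(biases) + max(biases))


# Hypothetical sample: the third unit has a large value and a large weight.
print(robust_total([2.0, 2.0, 11.0], [1.0, -1.0, 10.0]))
```

Only the two extreme conditional biases enter the correction, which is why the method modifies an a priori limited number of observations and treats both tails of the distribution.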
4.2.6 Robust estimation on several domains of interest
As previously described, the domains of interest for the dissemination of the ECMOSS results are numerous. For the sake of simplicity of dissemination and to comply with the requirements of European regulations, each employee in the individual sample must have only one estimation weight, so adaptations are necessary:
- Robust estimators for several sets of domains of interest
European regulations require the dissemination of results in sets of domains that intersect and are not nested in one another, such as crossings of sectors and employment size ranges, and crossings of sectors and regions. Sampled units may belong to more than one dissemination domain.
Ideally, the processing of influential units should be done in each domain of interest separately, so that a single observation may be associated with a different estimation weight for each dissemination domain to which it belongs. However, this solution is not possible for the reasons mentioned above.
Another solution is to apply both methods to the crossings of all the dissemination domains. The risk is then that the methods are applied to estimators calculated on very small populations, for which many units are influential. Thus, for estimation on the real dissemination domains, too many units would be winsorized. The resulting estimators will be less precise than the robust estimators adapted to each domain, and potentially also less precise than the estimators without any treatment of influential units, because they are too biased.
- Robust estimators for all categories of a domain of interest
For a given set of domains (e.g., industry sectors), an observation can be identified as influential and processed for estimation in more than one dissemination domain, and thus have more than one final estimation weight. This is the case if the selection of an observation belonging to a dissemination domain has an influence on the selection of other units belonging to other dissemination domains (e.g., in the case of stratified sampling, if the dissemination domains cut across the sampling strata).
This situation is impossible for the Beaumont, Haziza and Ruiz-Gazen estimator, for which we assume a Poisson sampling design. However, it can happen for the Kokic and Bell method and its extension as we implement them, because some dissemination domains do not consist of groupings of the pseudo-strata that we have formed. The situation is then the same as that described for several sets of dissemination domains: the only way to maintain a unique estimation weight for each sampled unit is to apply the methods to pseudo-dissemination domains that are close to the real dissemination domains but made up of groupings of winsorization pseudo-strata. These pseudo-domains are formed by crossing sectors, employment size ranges of the businesses and the geographic location of the establishments (distinguishing only the three categories specified above).
To evaluate the performance of the methods defined above in terms of precision gains or losses, we carried out a set of simulations based on the ECMOSS sampling design and on the data on wages and hours worked from the social security files. These data are available for all employees, so we are able to compare the average hourly wages observed in the population with their various estimators. In these simulations, we compared the efficiency of the methods applied directly to each dissemination domain, which leads to the optimal results, and applied to the pseudo-dissemination domains defined above.
4.2.7 Simulations
The simulations are conducted on the social security files, from which the sample of employees is selected and which are available for all French employees. They are implemented as follows:
- the ECMOSS sampling design (including the selection of responding establishments and employees) is applied 5,000 times to produce 5,000 samples of employees;
- for each sample and each dissemination domain, we calculate the usual expansion estimator of the average hourly wage;
- the Kokic and Bell winsorization and the conditional bias method are applied to each sample according to different specifications:
- the Kokic and Bell winsorization, classical or adapted to Poisson sampling, is applied only as though the pseudo-dissemination domains defined above were the real dissemination domains;
- the Beaumont, Haziza and Ruiz-Gazen estimator is applied, on the one hand, in each activity sector taken separately and, on the other hand, as though the pseudo-dissemination domains defined above were the real dissemination domains.
For each dissemination domain, we can thus compare the performance of the conditional bias estimator applied in its optimal specification for this domain with that of the conditional bias method and the Kokic and Bell method (classical and extended) applied according to specifications that are sub-optimal for this domain but simpler to implement.
For each robust estimator and each domain, we calculate the mean relative bias (RB) and the relative mean square error (RMSE) over all simulations. In these measures, the reference value is the average hourly wage observed in the social security files in the domain, and the benchmark is the usual expansion estimator of this parameter. Relative bias compares the bias of the robust estimator to the real value of the parameter. The relative mean square error measures the gain or loss of precision provided by the robust estimators relative to the usual estimator.
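These two measures can be sketched as follows for one domain, given the values of an estimator over the simulated samples. The function names are hypothetical, and expressing both measures as percentages is an assumption consistent with the figures reported later.

```python
def relative_bias(estimates, true_value):
    """Mean relative bias, in percent, over the simulated samples."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value


def relative_mse(robust_estimates, usual_estimates, true_value):
    """Monte Carlo MSE of the robust estimator as a percentage of the
    Monte Carlo MSE of the usual expansion estimator (below 100 = gain)."""
    mse_robust = sum((e - true_value) ** 2 for e in robust_estimates) / len(robust_estimates)
    mse_usual = sum((e - true_value) ** 2 for e in usual_estimates) / len(usual_estimates)
    return 100.0 * mse_robust / mse_usual


# Hypothetical values over two simulated samples, true parameter 10.
print(relative_bias([9.0, 11.0], 10.0), relative_mse([9.0, 11.0], [8.0, 12.0], 10.0))
```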
4.2.8 Simulation results
Among the different estimators tested in our simulations, the estimator obtained by applying the adaptation of the Kokic and Bell method to Poisson sampling stands out for its extremely poor performance, summarized in Table 4.4. Applying the Kokic and Bell method extended to Poisson sampling to ECMOSS results in a significant or even dramatic deterioration in the precision of the estimates.
Table 4.4
Statistics on the MSE ratio of the robust Kokic and Bell estimators applied to the Poisson sampling in the different domains of interest
| Statistic | NACE*Workforce | NACE | NACE*NUTS |
|---|---|---|---|
| Min. | 18 | 128 | 33 |
| Mean | 490 | 1,858 | 324 |
| Max. | 4,437 | 8,606 | 2,466 |
Figures 4.1, 4.2 and 4.3 focus on the results of the conditional bias method and of the classical Kokic and Bell method, applied under the hypothesis of stratified simple random sampling.

Description for Figure 4.1
Bar charts presenting the relative mean square errors for the estimators of average hourly wage by sector. Relative mean square errors are on the y-axis, going from 0 to 150%. NACE sections are on the x-axis, going from B to S. There are three bars for each section, representing the following estimators: conditional bias in the pseudo-domains, conditional bias in the NACE sections, and Kokic and Bell in the pseudo-domains. No estimator shows a systematically higher or lower relative mean square error; the ranking depends on the section. Sections C, N and O have the highest relative MSE.

Description for Figure 4.2
Box-plots presenting the distribution of relative mean square errors in each domain. Relative mean square errors are on the y-axis, going from 40 to 160%. Three domains are shown: NACE, NACE*No. and NACE*NUTS. There are three box-plots for each domain: one for the conditional bias, one for the optimal conditional bias and one for Kokic and Bell. Box-plots for the optimal conditional bias show a smaller dispersion than those of the other estimators. The median of the conditional bias box-plots is lower than that of the other estimators.
Figure 4.1 shows the relative mean square errors of the robust average hourly wage estimators in each section of the Statistical classification of economic activities in the European Community (NACE, a grouping of business sectors into 21 categories, of which 18 are in the ECMOSS field), and Figure 4.2 shows the distribution of relative mean square errors in each set of domains (all sections, crossings of section and business employment size, or crossings of section and location of the establishment).
For almost all domains of interest, the robust estimators considered provide gains in precision over the usual expansion estimator. The domains in which the robust estimators have a higher error than the usual estimator are also those where the estimation variance is lowest to begin with. The treatments of influential units considered in these figures (conditional bias and classical Kokic and Bell method) are thus able to reduce estimation errors when necessary, without causing too much loss of precision when the estimators are not affected by influential units.
The biases of the average hourly wage estimators in the sectors are low (see Figure 4.3), except in some domains where the sample size is small (A: Agriculture, forestry and fishing; K: Financial and insurance activities; R: Arts, entertainment and recreation). The same results are also observed for the other domains.

Description for Figure 4.3
Bar charts presenting the relative biases for estimators of average hourly wage by sector. Relative bias is on the y-axis, going from -3 to 1%. NACE sections are on the x-axis, going from A to S. There are three bars for each section, representing the following estimators: conditional bias in the pseudo-domains, conditional bias in the NACE sections, and Kokic and Bell in the pseudo-domains. The biases of the average hourly wage estimators in the sectors are low, except in some domains where the sample size is small (A: Agriculture, forestry and fishing; K: Financial and insurance activities; R: Arts, entertainment and recreation).
The application of conditional bias methods adapted to each domain gives the best results for estimation in the NACE sections, but not necessarily in the other dissemination domains. The NACE sections are much more aggregated than the pseudo-domains used for the identification of influential units, so the bias introduced by the processing of influential units is larger when the method is applied to pseudo-domains than when the optimal version is applied directly to the NACE sections. In the other domains, identifying influential units at a finer level than the real dissemination domain makes it possible to identify more influential units and thus to substantially reduce the estimation variance, without introducing too much additional bias, provided that the domain used to identify the influential units and the real dissemination domain are close. Differences between the sampling design assumed when applying each of the two methods and the actual sampling design may explain why the use of the Beaumont, Haziza and Ruiz-Gazen robust estimator in each dissemination domain does not necessarily translate into greater precision gains.
The differences between the results obtained with the conditional bias method and with the Kokic and Bell method under the hypothesis of a stratified simple random sampling design are, however, small. Note that, in these simulations, the auxiliary observations of the variables of interest used to calculate the winsorization thresholds in the Kokic and Bell method are the population data, not data external to the evaluation. Since we also evaluate the performance of the different estimators on these data, the Kokic and Bell method is favoured a priori.
The extension of
the Kokic and Bell method to Poisson sampling results in a significant
deterioration in the precision of the estimators.
The discrepancies between the performances of the two implementations of the Kokic and Bell method are thus very large. However, both implementations are based on two hypotheses:
- a hypothesis on the sampling design used to select the sample;
- a hypothesis on the distribution of the variable of interest in subpopulations.
In both
applications of the Kokic and Bell method, the first hypothesis does not hold.
The violation is, however, a priori more serious when we apply the Kokic and
Bell method as though the sample had been selected by stratified simple random
sampling in pseudo-strata constructed ad hoc, because in so doing we assume
that the selection probabilities are identical within each pseudo-stratum,
which is far from being the case. The Kokic and Bell method applied as though
the employees had been selected by Poisson sampling, for its part, uses the
actual first-order inclusion probabilities, but ignores the dependence between
the sample membership indicators of different employees.
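The contrast between the two descriptions of the design can be made concrete with a minimal sketch. It is an illustration only: the inclusion probabilities, pseudo-strata and population sizes below are hypothetical values, and the function names are not taken from the paper. Treated as a Poisson sample, each unit keeps its own weight, the inverse of its inclusion probability; treated as a stratified simple random sample, every unit of a pseudo-stratum receives the common weight given by the stratum population size divided by the stratum sample size.

```python
from collections import defaultdict


def poisson_weights(pi):
    """Design weights when the sample is treated as a Poisson sample: 1 / pi_i."""
    return [1.0 / p for p in pi]


def pseudo_stratum_weights(strata, pop_sizes):
    """Design weights when the sample is treated as stratified simple random
    sampling: every sampled unit of pseudo-stratum h receives N_h / n_h."""
    n_h = defaultdict(int)
    for h in strata:
        n_h[h] += 1  # sample size observed in each pseudo-stratum
    return [pop_sizes[h] / n_h[h] for h in strata]


# Hypothetical sample of four employees in two ad hoc pseudo-strata.
pi = [0.5, 0.25, 0.1, 0.05]      # assumed actual inclusion probabilities
strata = ["A", "A", "B", "B"]    # assumed pseudo-stratum of each unit
pop_sizes = {"A": 6, "B": 30}    # assumed pseudo-stratum population sizes

w_poisson = poisson_weights(pi)
w_srs = pseudo_stratum_weights(strata, pop_sizes)
```

In this toy example the two sets of weights sum to the same total within each pseudo-stratum, but the stratified description replaces the unequal individual weights with a single common value, which is the approximation discussed above.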
As for the second hypothesis, the
population model postulated for the Kokic and Bell method extended to the
Poisson case is not valid, since the first-order inclusion probabilities are
not proportional to the variable of interest considered. It is more difficult
to assess the validity of the population model used for the classical Kokic and
Bell method; up to a point, it remains reasonable to consider that the values
of the variable of interest in a pseudo-stratum are drawn from the same
distribution, whose expectation and variance can be estimated by the empirical
mean and variance of the values of the variable of interest in the stratum.
Consequently, the
performance differences between the two implementations of the Kokic and Bell
method are difficult to analyze. A first possible explanation is that the
performance of the method is more sensitive to violations of the hypothesis on
the distribution of the observations than to violations of the hypothesis on
the form of the sampling design. Fizzala (2017) reached the same finding when
applying winsorization in the context of corporate profiling. In our ECMOSS
simulations, we observe that the classical Kokic and Bell method, based on the
hypothesis of stratified simple random sampling, gives good results even though
this hypothesis is only partially satisfied. Future extensions of this work
could validate this explanation through simulations. Another explanation for
these differences in performance may lie in the relationship between the two
hypotheses in the case of the extension of the Kokic and Bell method to Poisson
sampling. Indeed, while in the classical Kokic and Bell method the hypotheses
on the sampling design and on the distribution of the variable of interest in
each stratum are unrelated, in the Poisson case the population model involves
the selection probabilities and therefore imposes additional constraints on the
sampling design. As a result, the fact that the selection probabilities are not
proportional to the variable of interest means that, for the extension of the
Kokic and Bell method to Poisson sampling, the hypotheses on the sampling
design and on the population are violated simultaneously, which could explain
the sharp increase in the errors of the estimator.
However, the
conditional bias and classical Kokic and Bell methods, whatever the
configuration, seem able to identify the units that are influential for the
estimation of the parameters concerned, and thus deliver significant gains in
precision even when they are applied in a setting that is remote from their
original hypotheses and from the actual sampling design of the survey.
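As an illustration of how the conditional bias approach flags an influential unit, the following minimal sketch assumes Poisson sampling and the expansion (Horvitz-Thompson) estimator of a total. In that setting the conditional bias of a sampled unit is (1/pi_i - 1) y_i, and the robust estimator of Beaumont, Haziza and Ruiz-Gazen (2013) subtracts half the sum of the smallest and largest conditional biases from the expansion estimator. The data and function names are hypothetical and serve only to show the mechanics; this is not the implementation used for the ECMOSS results.

```python
def conditional_biases(y, pi):
    """Conditional bias of each sampled unit for the expansion estimator
    under Poisson sampling: (1 / pi_i - 1) * y_i."""
    return [(1.0 / p - 1.0) * v for v, p in zip(y, pi)]


def robust_total(y, pi):
    """Expansion estimator corrected by half the sum of the smallest and
    largest conditional biases observed in the sample."""
    expansion = sum(v / p for v, p in zip(y, pi))
    b = conditional_biases(y, pi)
    return expansion - 0.5 * (min(b) + max(b))


# Hypothetical sample: the last unit combines a large value and a large weight.
y = [10.0, 12.0, 11.0, 400.0]
pi = [0.5, 0.5, 0.5, 0.1]

biases = conditional_biases(y, pi)
most_influential = max(range(len(y)), key=lambda i: abs(biases[i]))
estimate = robust_total(y, pi)
```

Here the last unit has by far the largest conditional bias, so it is the one whose contribution is reduced in the robust estimate.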
Appendix
A Demonstrations of the formulas for the extension of the Kokic and Bell method in the case of a Poisson sampling
A.1 Calculation of the mean square error of the winsorized estimator
First, we will calculate
with
Furthermore,
finally:
Assuming in each
stratum that:
-
-
- and that
are independent and of density
and noting that:
and so that:
we obtain:
and that:
In the end, taking
the expectation under the model of expression (A.1) and applying
simplifications (A.2), (A.3), (A.4), we obtain, after some additional
simplifications:
Given that the
are
assumed to be independent and follow the same law within the strata, it is
sufficient to consider a random variable
that has
the same law as one of the
, i.e., verifying:
-
-
- and that
are independent and of density
Thus, we can also
consider that a random variable
to
calculate the expectation with respect to the model of the winsorized indicator.
The previous expression is rewritten:
A.2 Search for thresholds to minimize the MSE
To determine the
value of the thresholds
leading
to the optimum of
we use
the same property as Kokic and Bell in their demonstration, i.e., that:
and so that
By differentiating
with respect to
and
after simplification, we obtain that:
where
-
-
-
-
-
Equation (A.5) is
reduced to:
Finally, by noting
that
and
assuming that
we
obtain that the threshold
minimizing the MSE verifies the equation:
which is reduced further to
It remains to be shown that
tends
toward zero when
However,
and according
to hypothesis (2.8) relating to inclusion probabilities, we have that,
which
implies
and
thus:
However, it is
possible to demonstrate from hypothesis (2.8) that
Thus:
tends
toward zero when
is thus
equivalent in each stratum to
when the
size of the population and the sample tend toward infinity.
References
Chambers, R. (1986). Outlier robust finite population estimation. Journal of the American Statistical Association, 81, 1063-1069.
Beaumont, J.-F., Haziza, D. and Ruiz-Gazen, A. (2013). A unified approach to robust estimation in finite population sampling. Biometrika, 100, 555-569.
Clark, R.G. (1995). Winsorization methods in sample surveys. Master’s thesis, Department of Statistics, Australian National University.
Dalén, J. (1987). Practical estimators of a population total which reduce the impact of large observations. R & D Report, Statistics Sweden.
Demoly, E., Fizzala, A. and Gros, E. (2014). Méthodes et pratiques des enquêtes entreprises à l’Insee. Journal de la Société Française de Statistique, 155-4.
Deroyon, T. (2015). Traitement des observations atypiques d’une enquête par winsorisation : application aux Enquêtes Sectorielles Annuelles. Actes des Journées de Méthodologie Statistique.
Fizzala, A. (2017). Adaptations of Winsorization Caused by Profiling - An Example Based on the French SBS Survey. European Establishment Survey Workshop, Southampton.
Favre-Martinoz, C., Haziza, D. and Beaumont, J.-F. (2015). A method of determining the winsorization threshold, with an application to domain estimation. Survey Methodology, 41, 1, 57-77. Paper available at https://www150.statcan.gc.ca/n1/fr/pub/12-001-x/2015001/article/14199-eng.pdf.
Favre-Martinoz, C., Haziza, D. and Beaumont, J.-F. (2016). Robust inference in two-phase sampling designs with application to unit nonresponse. Scandinavian Journal of Statistics, 43, 1019-1034.
Kokic, P.N., and Bell, P.A. (1994). Optimal winsorizing cut-offs for a stratified finite population estimation. Journal of Official Statistics, 10-4, 419-435.
Moreno-Rebollo, J.-L., Muñoz-Reyez, A.M. and Muñoz-Pichardo, J.M. (1999). Influence diagnostics in survey sampling: Conditional bias. Biometrika, 86, 923-968.
Moreno-Rebollo, J.-L., Muñoz-Reyez, A.M., Jimenez-Gamero, J.-L. and Muñoz-Pichardo, J.M. (2002). Influence diagnostics in survey sampling: Estimating the conditional bias. Metrika, 55, 209-214.
Rivest, L.-P., and Hurtubise, D. (1995). On Searls’ winsorized mean for skewed populations. Survey Methodology, 21, 2, 107-116. Paper available at https://www150.statcan.gc.ca/n1/fr/pub/12-001-x/1995002/article/14399-eng.pdf.
Tambay, J.-L. (1988). An integrated approach for the treatment of outliers in sub-annual surveys. Proceedings of the Survey Research Methods Section, American Statistical Association, 229-234.