Two local diagnostics to evaluate the efficiency of the empirical best predictor under the Fay-Herriot model
Section 6. Simulation study
A simulation study was conducted to evaluate the effectiveness of Diagnostics 1 and 2
in detecting which of the direct and EB
estimators is preferable. We considered
140 domains representing Canadian cities. In
this simulation study, the vector of auxiliary variables is:
The auxiliary variable is obtained from administrative files and is
defined, for each city, as the ratio of the number of employment insurance
beneficiaries to the number of people over 15 years of age in that city.
The sample size in each city was obtained from the Canadian Labour Force
Survey (LFS). Of the 140 cities, 2 have a sample size smaller than 10, 10 have
a sample size smaller than 30, 40 have a sample size smaller than 60, and 68
have a sample size smaller than 100, representing almost 50% of the cities. For
these 68 cities, the estimated coefficients of variation of the LFS unemployment
rates are in most cases too large to publish direct estimates of the
unemployment rate; as a result, small area estimation techniques are required
for these domains. In contrast, there are also 17 of the 140 cities with a
sample size larger than 1,000 for which the direct estimate of the unemployment
rate is reliable.
The
population parameter
was simulated for the
domains using the actual values of
and
It can be interpreted as the proportion of
unemployed people in the corresponding city.
The parameter
was generated using the beta distribution with
mean
and variance
where
and
(0.0484, 0.95). These values of
and
were chosen from real data. We set
Then, we manually changed the values of
for four domains (cities) in order to have a
local effect
equal to
Cities with different sample sizes were
chosen: 10, 100, 501 and 3,773. In the rest of this section, the smallest of
these four cities is identified as City 1 (sample size 10),
the second smallest as City 2 (sample size 100),
the second largest as City 3 (sample size 501)
and the largest as City 4 (sample size 3,773).
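The beta generation step described above can be sketched in Python. This is only an illustration of the standard moment-matching conversion from a target mean and variance to beta shape parameters; the function name `beta_shapes` and the numerical values of `mu` and `v` are hypothetical and are not the values used in the study.

```python
import numpy as np

def beta_shapes(mean, var):
    """Return the shape parameters (a, b) of the beta distribution
    that has the requested mean and variance (moment matching)."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    # A beta distribution on (0, 1) requires 0 < var < mean * (1 - mean).
    if np.any(var <= 0) or np.any(var >= mean * (1.0 - mean)):
        raise ValueError("need 0 < var < mean * (1 - mean)")
    k = mean * (1.0 - mean) / var - 1.0  # k equals a + b
    return mean * k, (1.0 - mean) * k

rng = np.random.default_rng(2022)

# Hypothetical domain means and common variance (illustration only).
mu = np.array([0.05, 0.07, 0.10])
v = 0.0004

a, b = beta_shapes(mu, v)
theta = rng.beta(a, b)  # one simulated population proportion per domain
```

Because the population parameters are generated only once in a design-based study, a draw such as `theta` would be held fixed across all replicates.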
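A stratified simple random sampling with replacement design was considered, in which
the strata coincide with the domains. The direct estimator
of the population parameter
is simply the proportion of sampled people in the area
who have the characteristic of interest (e.g.,
being unemployed). Under such a simple design, it is easy to see that the
direct estimator can be generated as follows: the number of sampled people with
the characteristic is drawn from a binomial distribution, with the domain sample
size as the number of trials and the population parameter as the success
probability, and this count is divided by the sample size.
It is therefore not necessary to create the
population of people in each domain
to generate the direct estimates.
We proceeded in this way in the simulation.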
The design variance of
is given by
and its estimator by
The smoothed variance
is estimated using the smoothing model in
Section 5 with
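As an illustration of the sampling step in this paragraph, the following Python sketch draws direct estimates as binomial proportions. The sample sizes are those of Cities 1 to 4; the values in `theta` are hypothetical. The design variance shown is the usual variance of a sample proportion under simple random sampling with replacement. The variance estimator with divisor n - 1 is an assumption, since the expression used in the study is not reproduced here, and the smoothing model of Section 5 is not implemented.

```python
import numpy as np

rng = np.random.default_rng(1)

n = np.array([10, 100, 501, 3773])          # sample sizes of Cities 1 to 4
theta = np.array([0.06, 0.07, 0.08, 0.09])  # hypothetical population proportions

def draw_direct(theta, n, rng):
    """Draw one direct estimate per domain: a binomial count divided by n."""
    return rng.binomial(n, theta) / n

theta_hat = draw_direct(theta, n, rng)

# Design variance of a sample proportion under sampling with replacement.
psi = theta * (1.0 - theta) / n

# Assumed variance estimator (unbiased for psi under this design).
psi_hat = theta_hat * (1.0 - theta_hat) / (n - 1)
```

No population of individuals is created at any point: one binomial draw per domain is enough to produce a replicate of the direct estimates.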
In order to simulate a realistic scenario, the underlying assumptions of the
Fay-Herriot model are not entirely satisfied in our simulation. For example,
the sampling errors and the model errors
do not exactly follow normal distributions. We
used a beta distribution to generate the population parameters;
the normality assumption of the model errors
is therefore not satisfied, although the
deviation from the normal distribution is not severe in our simulation. The
direct estimates
were generated from a binomial distribution, which
can be approximated by a normal distribution for domains with a large
sample size.
The relationship between the simulated estimates
and the auxiliary vectors
is similar to the one observed with the real
LFS estimates. Moreover, our simulation scenario is such that the assumption
is not satisfied since, for this simple
design,
(see remark in Section 2). However,
we note that the correlation coefficient between
and
is 0.98,
which indicates that the deviation from the assumption
is
modest. As mentioned in the previous paragraph, the smoothing model in Section 5
is used to estimate
This
allows us to remain in a realistic framework where the postulated smoothing
model is different from the true model used to generate the estimates
We conducted a design-based simulation study, i.e., the population parameters
were generated only once. We repeated sample
selection
10,000 times. For each replicate,
a direct estimate
was generated and a smoothed variance estimate
was calculated as described above. The EB
estimate was then calculated as:
where
and
and
are
calculated as described in Section 5. The generalized least squares method
was used to obtain the estimated regression coefficients
and the
restricted maximum likelihood method was used to obtain the estimated model variance.
Calculations were performed using Statistics
Canada’s small area estimation system (Hidiroglou, Beaumont and Yung, 2019).
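For readers who want to see the computation, here is a minimal Python sketch of the textbook form of the EB estimator under the Fay-Herriot model: a weighted combination of the direct estimate and a regression-synthetic estimate. It is not the Statistics Canada system, all names (`fay_herriot_eb`, `sigma2_v`, `psi`) are hypothetical, and the model variance is taken as given rather than estimated by restricted maximum likelihood.

```python
import numpy as np

def fay_herriot_eb(theta_hat, Z, psi, sigma2_v):
    """Textbook Fay-Herriot EB estimates for a given model variance.

    theta_hat : (m,) direct estimates
    Z         : (m, p) auxiliary vectors, one row per domain
    psi       : (m,) sampling variances (e.g., smoothed variance estimates)
    sigma2_v  : model variance, assumed already estimated
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    psi = np.asarray(psi, dtype=float)

    w = 1.0 / (sigma2_v + psi)   # generalized least squares weights
    ZtW = Z.T * w                # (p, m): each column scaled by its weight
    beta = np.linalg.solve(ZtW @ Z, ZtW @ theta_hat)

    gamma = sigma2_v / (sigma2_v + psi)  # weight given to the direct estimate
    eb = gamma * theta_hat + (1.0 - gamma) * (Z @ beta)
    return eb, gamma, beta

# Hypothetical inputs for four domains (illustration only).
Z = np.column_stack([np.ones(4), [0.02, 0.03, 0.04, 0.05]])
theta_hat = np.array([0.06, 0.09, 0.07, 0.10])
psi = np.array([0.005, 0.0006, 0.0001, 0.00002])
eb, gamma, beta = fay_herriot_eb(theta_hat, Z, psi, sigma2_v=0.0004)
```

Domains with a large sampling variance receive a small weight `gamma`, so their EB estimate is pulled toward the regression prediction; domains with a small sampling variance stay close to their direct estimate.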
For each replicate, standardized residuals
and Diagnostics 1 and 2
were also calculated for the 140
domains. We recorded whether the direct
estimator was preferred over the EB estimator for each of the two diagnostics.
Decision thresholds were used for this purpose. When a diagnostic falls below
its threshold, the direct estimator is used. For Diagnostic 1, thresholds of
50% and 75% were used and for Diagnostic 2, thresholds of 5% and 25% were used.
From the previous quantities, calculated for each of the
10,000 replicates, the Monte Carlo averages of Diagnostics 1 and 2 were
calculated for each of the 140
domains, i.e., the average of each diagnostic over the replicates.
The selection rate of the direct estimator was
also calculated for each of the two diagnostics, i.e., the percentage of replicates
in which a given diagnostic led to the selection of the direct estimator.
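The two summaries just described can be sketched as follows in Python. The sketch assumes that the diagnostic values are stored on a 0 to 1 scale in an array with one row per replicate and one column per domain, and that the direct estimator is selected when the diagnostic is strictly below the threshold; the function name and the toy values are hypothetical.

```python
import numpy as np

def summarize_diagnostic(diag, threshold):
    """Monte Carlo average and direct-estimator selection rate per domain.

    diag      : (R, m) diagnostic values, R replicates by m domains
    threshold : decision threshold on the same scale as diag
    """
    diag = np.asarray(diag, dtype=float)
    mc_average = diag.mean(axis=0)
    # Percentage of replicates in which the diagnostic falls below the threshold.
    selection_rate = 100.0 * (diag < threshold).mean(axis=0)
    return mc_average, selection_rate

# Toy values: 4 replicates, 2 domains (illustration only).
diag = np.array([[0.90, 0.2],
                 [0.80, 0.6],
                 [0.70, 0.4],
                 [0.95, 0.3]])
avg, rate_50 = summarize_diagnostic(diag, 0.50)
_, rate_75 = summarize_diagnostic(diag, 0.75)
```

Raising the threshold can only increase the selection rate of the direct estimator, which is the trade-off examined with the two thresholds used for each diagnostic.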
The
Monte Carlo approximation of
was calculated as:
From this Monte Carlo MSE, the relative efficiency of
the EB estimator was calculated as:
This
ratio is positive when the EB estimator is less efficient than the direct
estimator under the design. A diagnostic is potentially useful if it is
negatively correlated with this ratio.
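A minimal Python sketch of these two computations is given below. The Monte Carlo MSE is the average squared error over the replicates. The relative efficiency is written here as a relative difference of MSEs, which is an assumption chosen only because it reproduces the sign convention stated above (positive when the EB estimator is less efficient); the exact form of equation (6.1), in particular its denominator, may differ. Function names and toy values are hypothetical.

```python
import numpy as np

def mc_mse(estimates, theta):
    """Monte Carlo MSE per domain.

    estimates : (R, m) estimates, R replicates by m domains
    theta     : (m,) fixed population parameters
    """
    estimates = np.asarray(estimates, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return ((estimates - theta) ** 2).mean(axis=0)

def relative_efficiency(mse_eb, mse_direct):
    """Assumed form: relative difference, positive when EB is less efficient."""
    return (mse_eb - mse_direct) / mse_direct

# Toy values: 2 replicates, 2 domains (illustration only).
theta = np.array([0.1, 0.2])
eb = np.array([[0.11, 0.26], [0.09, 0.24]])
direct = np.array([[0.12, 0.21], [0.08, 0.19]])

mse_eb = mc_mse(eb, theta)
mse_direct = mc_mse(direct, theta)
re = relative_efficiency(mse_eb, mse_direct)
```

In this toy example the first domain has a negative value (the EB estimator is more efficient) and the second a positive value (the direct estimator is more efficient).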
Figures 6.1
and 6.2 present the Monte Carlo averages of Diagnostics 1 and 2, respectively, as
a function of the relative efficiency of the EB estimator defined in equation
(6.1). The four cities whose values of
have been changed, Cities 1 to 4, are shown in purple, orange, green and red. In the legend, the
sample size of these cities has been indicated. The values of the parameter
for Cities 1 to 4 are 0.01, 0.08, 0.35 and
0.81 respectively. All other cities are shown in blue.
First,
we can see in Figures 6.1 and 6.2 that the EB estimator is more efficient than
the direct estimator for City 1 (in purple), since this city is to the left of
the vertical line (negative relative efficiency) despite the strong local
effect. This phenomenon is explained by Figure 3.1. It
shows that the range of values of
for which the B estimator is more efficient than
the direct estimator increases as
decreases. Since
is small for City 1
it is not surprising to observe a negative
relative efficiency despite a pronounced local effect. For City 2 (in orange),
the direct estimator is slightly more efficient than the EB estimator. On the
other hand, for Cities 3 (in green) and 4 (in red), the direct estimator is
much more efficient than the EB estimator. Note also that there are five cities
for which the direct estimator is more efficient than the EB estimator: Cities
2 to 4 as well as two other cities whose values of
were randomly generated and not manually
modified. One of these cities has the smallest value of
and the other has the largest value of
after excluding the four cities that had their
value manually modified. These two cities have large values of
(0.62 and 0.49).
Figures 6.1
and 6.2 indicate that our two diagnostics seem to be quite effective in
detecting cases where the direct estimator is more efficient than the EB
estimator except for City 2
where the Monte Carlo average of Diagnostic 1
is very high at 0.97. However, this is a domain where choosing the less
efficient estimator is not really problematic since there is very little
difference between the efficiencies of the two estimators. Apart from this
specific case, Diagnostic 1 seems to have better properties than Diagnostic 2.
The Monte Carlo average of Diagnostic 1 is very close to 1 when the EB
estimator is significantly more efficient than the direct estimator, decreases
slowly when the efficiencies of the two estimators approach each other and
becomes small when the direct estimator is significantly more efficient than
the EB estimator. Diagnostic 2 does not behave in quite the same way.
The Monte Carlo average of Diagnostic 2 is small when the direct estimator is
significantly more efficient than the EB estimator but it is not close to 1
when the EB estimator is significantly more efficient than the direct estimator.
Furthermore, it seems to increase when the efficiencies of the two estimators
come closer, which is counterintuitive.

Description of figure 6.1
Figure representing the Monte Carlo average of Diagnostic 1 estimates for the 140 domains (representing Canadian cities) versus the relative efficiency of the EB estimator defined in equation (6.1). The values of
for four domains (cities) were manually changed in order to have a local effect
equal to 5
The four domains, colored in purple, orange, green and red respectively, were selected with different sample sizes. All other cities are shown in blue. First, we can see that the EB estimator is more efficient than the direct estimator for City 1 since this city is to the left of the vertical line (negative relative efficiency) despite the strong local effect. Since
is small for City 1, we observe a negative relative efficiency despite a pronounced local effect. For City 2, the direct estimator is slightly more efficient than the EB estimator. On the other hand, for Cities 3 and 4, the direct estimator is much more efficient than the EB estimator. Note also that there are five cities for which the direct estimator is more efficient than the EB estimator: Cities 2 to 4 as well as two other cities whose values of
were randomly generated and not manually modified. One of these cities has the smallest value of
and the other has the largest value of
after excluding the four cities that had their value manually modified. These two cities have large values of
Diagnostic 1 seems to be quite effective in detecting cases where the direct estimator is more efficient than the EB estimator except for City 2 where the Monte Carlo average of Diagnostic 1 is very high at 0.97. The Monte Carlo average of Diagnostic 1 is very close to 1 when the EB estimator is significantly more efficient than the direct estimator, decreases slowly when the efficiencies of the two estimators approach each other and becomes small when the direct estimator is significantly more efficient than the EB estimator.

Description of figure 6.2
Figure representing the Monte Carlo average of Diagnostic 2 estimates with the same context as Figure 6.1. Just as Diagnostic 1 in Figure 6.1, we first see that the EB estimator is more efficient than the direct estimator for City 1 since this city is to the left of the vertical line (negative relative efficiency) despite the strong local effect. Since
is small for City 1, we observe a negative relative efficiency despite a pronounced local effect. For City 2, the direct estimator is slightly more efficient than the EB estimator. On the other hand, for Cities 3 and 4, the direct estimator is much more efficient than the EB estimator. Note also that there are five cities for which the direct estimator is more efficient than the EB estimator: Cities 2 to 4 as well as two other cities whose values of
were randomly generated and not manually modified. One of these cities has the smallest value of
and the other has the largest value of
after excluding the four cities that had their value manually modified. These two cities have large values of
Diagnostic 1 seems to have better properties than Diagnostic 2. The Monte Carlo average of Diagnostic 2 is small when the direct estimator is significantly more efficient than the EB estimator, but it is not close to 1 when the EB estimator is significantly more efficient than the direct estimator. Furthermore, it seems to increase when the efficiencies of the two estimators come closer, which is counterintuitive.

Figures 6.3
and 6.4 show the selection rate of the direct estimator over the 10,000
replicates for Diagnostics 1 and 2. Similar conclusions can be drawn as those
obtained by analyzing Figures 6.1 and 6.2. As expected, the thresholds of
75% for Diagnostic 1 and 25% for Diagnostic 2 allow better detection of cases
where the direct estimator is more efficient than the EB estimator, but these
thresholds also lead to the direct estimator being chosen a little too often when
it is less efficient than the EB estimator. This is particularly notable for
Diagnostic 2. This error can be reduced by lowering the thresholds, but this
also reduces the selection rate of the direct estimator when it is more efficient
than the EB estimator. As noted earlier, Diagnostic 1 appears to have better
properties than Diagnostic 2, regardless of the thresholds chosen, with very
small selection rates of the direct estimator when it is significantly less efficient
than the EB estimator. This seems to show the limitations of a fully
design-based approach, such as the one presented in Section 4.2, to address
the challenge of small domain sample sizes.

Description of figure 6.3
Figure representing the selection rate of the direct estimator over the 10,000 replicates for Diagnostic 1. As in Figure 6.1, we see the 140 domains, including the four that were manually modified. For each replicate, we recorded whether the direct estimator was preferred over the EB estimator for Diagnostic 1. Decision thresholds were used for this purpose. Below the thresholds, the direct estimator is used. For Diagnostic 1, thresholds of 50% and 75% were used, shown in the upper and lower graphs of Figure 6.3, respectively. Similar conclusions can be drawn as those obtained by analyzing Figure 6.1. The threshold of 75% for Diagnostic 1 allows better detection of cases where the direct estimator is more efficient than the EB estimator, but it also leads to the direct estimator being chosen a little too often when it is less efficient than the EB estimator. As noted earlier, Diagnostic 1 appears to have better properties than Diagnostic 2, regardless of the thresholds chosen, with very small selection rates of the direct estimator when it is significantly less efficient than the EB estimator. This seems to show the limitations of a fully design-based approach, such as the one presented in Section 4.2, to address the challenge of small domain sample sizes.

Description of figure 6.4
Figure representing the selection rate of the direct estimator over the 10,000 replicates for Diagnostic 2. As in Figure 6.2, we see the 140 domains, including the four that were manually modified. For each replicate, we recorded whether the direct estimator was preferred over the EB estimator for Diagnostic 2. Decision thresholds were used for this purpose. Below the thresholds, the direct estimator is used. For Diagnostic 2, thresholds of 5% and 25% were used, shown in the upper and lower graphs of Figure 6.4, respectively. Similar conclusions can be drawn as those obtained by analyzing Figure 6.2. The threshold of 25% for Diagnostic 2 allows better detection of cases where the direct estimator is more efficient than the EB estimator, but it also leads to the direct estimator being chosen a little too often when it is less efficient than the EB estimator. This is particularly notable for Diagnostic 2. As noted earlier, Diagnostic 1 appears to have better properties than Diagnostic 2, regardless of the thresholds chosen, with very small selection rates of the direct estimator when it is significantly less efficient than the EB estimator. This seems to show the limitations of a fully design-based approach, such as the one presented in Section 4.2, to address the challenge of small domain sample sizes.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa