Development of a small area estimation system at Statistics Canada
Section 6. Application to Labour Force Survey (LFS) data
Statistics
Canada’s LFS is a monthly survey
with a stratified two-stage design. It is designed to produce reliable
unemployment rate estimates for the 55 Employment Insurance Economic Regions
(EIER) in
Canada. The unemployment rate in any given area
is defined as the ratio
where
is a
binary variable indicating whether person
is
unemployed
or not
and
is a
binary variable indicating whether person
is in
the labour force
or not
The
direct estimator of
is the
calibration composite estimator described in Fuller and Rao (2001). See also
Singh, Kennedy and Wu (2001) and Gambino, Kennedy and Singh (2001). It can be written in the
weighted form
where
is a
calibration composite weight for person
As
mentioned above, the calibration composite estimator is reliable for the
estimation of the unemployment rate for the 55 EIERs. There is also interest in
obtaining reliable estimates for 149 areas (cities) in
Canada
. Among them, there are 34
Census Metropolitan Areas (CMA) and 115 Census Areas (CA). The CMAs are the largest cities in terms of
population size and they usually have a large sample size as well. Some of the
CAs have a very small sample size, sometimes even 0. For those CAs and other
larger CAs, the sample size is not large enough to produce sufficiently
reliable direct estimates of the monthly unemployment rate. Our objective was
to investigate whether the Fay-Herriot model could be used to obtain monthly
estimates that would be reliable enough to be published.
We
constructed an auxiliary variable
for area
given by
where
is the number of employment insurance
beneficiaries in area
and
is the number of persons aged 15 years or
older in area
The numerator is obtained from an
administrative source, whereas the denominator is a Census projection computed
by Statistics Canada. We used the vector
along with
to obtain SAE estimates. We used May 2016 data
in this investigation to allow the comparison of
direct and SAE estimates with 2016 Census estimates.
Some
of the 149 areas of interest had a very small sample size in the LFS: they were
not used in the Fay-Herriot and smoothing models. As a rule of thumb, we
excluded from the models, areas where the number of sampled persons in the labour force was smaller than 10. There were 9
such areas; among them, six had no sampled person in the labour force. Also, there were 9 other areas
where the direct unemployment rate estimate,
and its direct variance estimate,
were both equal to 0. As these direct
estimates were not deemed to be reliable enough, their associated areas were
excluded from the models. This resulted in using only 131 areas in the models.
For those areas, the small area estimates are EBLUP estimates, with the
remaining 18 being synthetic estimates.
The
estimator
of the direct variance
was obtained via the Rao-Wu bootstrap. The
estimates of the smooth design variances were then obtained by using
A graph of the residuals of the smoothing
model,
versus the predicted values,
did not reveal any obvious model
misspecification. Figure 6.1 shows a graph of direct variances estimates,
versus smooth variance estimates,
The red line is the identity line. If the
smoothing model is appropriate, for any value of
the average of direct variance estimates for
areas around area
should be roughly equal to
This means that the red line should pass
roughly through the middle of the points everywhere. From a quick inspection of
Figure 6.1, we observe that the red line is close to the middle of the
points although probably slightly above the middle due to some extreme values
of
This may result in a slight overestimation of
the true smooth variance
A slight overestimation is not a major issue.
What has to be avoided is an underestimation of
as it typically leads to underestimating the
MSE of the SAE estimate. This would provide the user with a false impression of
precision.
Overall,
we were satisfied with our smoothed variance estimates. However, for areas with
large sample sizes, we set
as our estimate of
We assumed that direct variance estimates were
stable enough when the sample size is large. As a rule of thumb, we set
when the number of sampled persons in the labour force was greater than 400. This
replacement occurred for 35 areas. The strategy was used to avoid possible
small model biases in
for the largest areas, which could result in
EBLUP estimates that become significantly different from the direct estimates.
This is not a desirable property for areas with a
large sample size.
The
smooth variance estimates were then used to obtain small area estimates for the
149 areas of interest. Figure 6.2 shows a graph of small area and direct
estimates as a function of sample size (number of sampled persons in the labour force). The small area estimates are
much less volatile than direct estimates, especially for the areas with the
smallest sample sizes. For the largest areas, as expected, both estimates are
similar.
We
first evaluated the quality of the underlying Fay-Herriot model before looking
at the MSE estimates. Figure 6.3 shows the graph of direct estimates,
versus predicted values,
The red line is the identity line and the blue
line is a nonparametric smoothing spline curve. If the linearity assumption
holds, the blue line should be close to the red line and the latter should pass
roughly through the middle of the points everywhere. Figure 6.3 does not
give any indication that the linearity assumption of the Fay-Herriot model is
questionable.

Description for Figure 6.1
Scatter plot presenting the direct variances estimates (ranging from 0 to 300 on the y-axis) versus GVF smooth variance estimates (ranging from 0 to 60 on the x-axis). Composite estimates are presented for 131 areas and the parameter is the unemployment rate. A red line representing the identity line is added to the graph. The red line is close to the middle of the points although probably slightly above the middle due to some extreme values of direct variance estimates. This may result in a slight overestimation of the true smooth variance.

Description for Figure 6.2
Graph of small area and direct estimates (ranging from 0 to 40 on the y-axis) as a function of sample size (ranging from 0 to 1,239 on the x-axis). Composite estimates for 131 areas, synthetic estimates for 18 areas and direct estimates for 143 areas are represented and the parameter is the unemployment rate. The small area estimates are much less volatile than direct estimates, especially for the areas with the smallest sample sizes. For the largest areas, as expected, both estimates are similar.

Description for Figure 6.3
Scatter plot of direct estimates (ranging from 0 to 30 on the y-axis) versus model predicted values (ranging from 5 to 12 on the x-axis). The parameter is the unemployment rate, 131 areas are represented and the R-squared is 0.63. The red line is the identity line and the blue line is a nonparametric smoothing spline curve. If the linearity assumption holds, the blue line should be close to the red line and the latter should pass roughly through the middle of the points everywhere. This graph does not give any indication that the linearity assumption of the Fay-Herriot model is questionable.
It is also informative to compute a measure that
indicates the strength of
for the prediction of
To this end, we developed and implemented a
coefficient of determination, or
value, associated with the linking model
Note that the coefficient of determination
associated with the combined model,
is not of interest as the objective is not the
prediction of
but the prediction of
Our coefficient of determination is given by
where
is the
dimension of
and
is the
sample variance of
(see
equation (A.6) for the exact definition of the function
The
details of the derivation of the above coefficient of determination are
provided in the Appendix. Figure 6.3 indicates that the
value is
0.63. The linking model is thus neither weak nor extremely strong but,
hopefully, strong enough to achieve efficiency gains over the direct estimator.
The system also produces estimates of the parameters of the Fay-Herriot model
along with their standard errors. From this output, we found out that estimates
of both the intercept and slope parameters of the Fay-Herriot model were
significantly different from 0 using a standard Wald test at the 0.05 significance level.
Figure 6.4 shows a graph of standardized residuals,
versus standardized predicted values,
The red line is a horizontal line at zero and
the blue line is a nonparametric smoothing spline curve. Similarly to
Figure 6.3, the blue line should be close to the red line under linearity
and the latter should pass roughly through the middle of the points everywhere.
Again, Figure 6.4 does not indicate any obvious failure of the linearity
assumption underlying the Fay-Herriot model.

Description for Figure 6.4
Scatter plot of model standardized residuals (from -3 to 8 on the y-axis) versus model standardized predicted values (from 0 to 8 on the x-axis). The parameter is the unemployment rate and 131 areas are represented. The red line is a horizontal line at zero and the blue line is a nonparametric smoothing spline curve. The blue line should be close to the red line under linearity and the latter should pass roughly through the middle of the points everywhere. Again, this graph does not indicate any obvious failure of the linearity assumption underlying the Fay-Herriot model.
Figure 6.5 shows a graph of squared standardized
residuals versus standardized predicted values. The red line is a horizontal
line at one and the blue line is again a nonparametric smoothing spline curve.
This graph is used to check the homoscedasticity assumption; i.e., the
assumption that the model variance
is constant. Under homoscedasticity, the blue
line should be close to the red line everywhere. The graph does not reveal any
obvious presence of heteroscedasticity.

Description for Figure 6.5
Scatter plot of the square of model standardized residuals (ranging from 0 to 30 on the y-axis) versus model standardized predicted values (ranging from 0 to 8 on the x-axis). The parameter is the unemployment rate and 131 areas are represented. The red line is a horizontal line at one and the blue line is again a nonparametric smoothing spline curve. This graph is used to check the homoscedasticity assumption under which the blue line should be close to the red line everywhere. The graph does not reveal any obvious presence of heteroscedasticity.
Figure 6.6 shows a QQ-plot of standardized residual
quantiles versus standard normal quantiles. It is used to verify the normality
assumption of the errors
and
The graph does indicate a modest departure
from normality. However, Rao and Molina (2015, page 138) argued that EBLUP
estimates and their corresponding MSE estimates are generally robust to
deviations from normality.
The system also computes Cook’s distances to identify
areas that could have a significant influence on the estimate
The Cook distance for area
is given by
where
is the
estimate of
obtained
after deleting area
A plot
of the influences
is
provided in Figure 6.7. One area seems to have a relatively large
influence compared with other areas
This
area has the largest standardized predicted value and the second largest
predicted value. Its standardized residual is -1.88, which is not extreme,
although not very small either. Its sample size is large (number of sampled
persons in the labour force close to 500)
and its smooth variance estimate,
is
relatively small compared with other
areas. All these reasons explain why this area was detected as being
influential. In this application, we decided to keep this area in the model as
its influence was not large enough to make a big difference in the SAE
estimates and their corresponding MSE estimates.

Description for Figure 6.6
QQ-plot of REML standardized residual quantiles (ranging from -3 to 5 on the y-axis) versus standard normal quantiles (ranging from -3 to 3 on the x-axis). The parameter is the unemployment rate. There are 131 areas, the sample mean is 0.048, the sample variance is 0.994, the normal test -value is 0.000 and the sign test -value is 0.294. This graph is used to verify the normality assumption of the errors. The graph indicates a modest departure from normality.

Description for Figure 6.7
Scatter plot of Cook’s distances. The REML influence measure is on the y-axis, ranging from 0.0 to 1.3. The area number from 1 to 151 is on the x-axis. The parameter is the unemployment rate. One area seems to have a relatively large influence compared with other areas (influence measure = 1.2851). This area has the largest standardized predicted value and the second largest predicted value. Its standardized residual is -1.88. Its sample size is large and its smooth variance estimate is relatively small compared with other areas. All these reasons explain why this area was detected as being influential.
Since the Fay-Herriot model and smoothing model were
both reasonable, we computed MSE estimates to evaluate the magnitude of the
efficiency gains, if any, obtained by using the Fay-Herriot model.
Figure 6.8 shows the estimated direct Coefficient of Variation (CV),
defined as
and the estimated SAE Relative Root Mean
Square Error (RRMSE), defined as
where
is an estimate of the MSE,
and
is the small area estimate (EBLUP or synthetic
estimate) of
The sample size (number of sampled persons in
the labour force) is given on the
horizontal axis. The estimated direct CVs are in general much larger than the
estimated SAE RRMSEs, especially for the areas with the smallest sample sizes.
The estimated SAE RRMSEs are never above 20% whereas the estimated direct CV is
over 300% for one area. The estimated SAE RRMSEs are also very stable as a
function of the sample size unlike the erratic behavior of the estimated direct
CVs. For the areas with the largest sample sizes, both estimates are very
similar, as expected. This indicates that SAE methods can lead to a substantial
increase of precision over direct estimation methods, particularly for the
smallest areas.

Description for Figure 6.8
Graph of estimated direct CVs and SAE RRMSEs (ranging from 0 to 4 on the y-axis) as a function of sample size (ranging from 0 to 1,239 on the x-axis). Composite estimates for 131 areas, synthetic estimates for 18 areas and direct estimates for 131 areas are represented and the parameter is the unemployment rate. The estimated direct CVs are in general much larger than the estimated SAE RRMSEs, especially for the areas with the smallest sample sizes. The estimated SAE RRMSEs are never above 20% whereas the estimated direct CV is over 300% for one area. The estimated SAE RRMSEs are also very stable as a function of the sample size unlike the erratic behavior of the estimated direct CVs. For the areas with the largest sample sizes, both estimates are very similar. This indicates that SAE methods can lead to a substantial increase of precision over direct estimation methods, particularly for the smallest areas.
For the month of May 2016, we had the luxury of having a
very reliable source for the estimation of the unemployment rates: the 2016
long form Census administered to roughly one-fourth
of the households throughout
Canada
.
The Census sample size is much larger than the LFS sample size in all the areas
of interest. Therefore, we used the 2016 Census direct estimates, denoted by
as a gold standard for evaluating the accuracy
of both the LFS direct estimates and SAE estimates. We computed Absolute
Relative Differences (ARD) between LFS direct estimates and Census estimates,
as well as ARDs between SAE estimates and
Census estimates,
These ARDs were then averaged within 5
different homogeneous subgroups with respect to sample size. Table 6.1
summarizes the results.
Table 6.1
Average ARD of SAE estimates and LFS direct estimates expressed in percentage
Table summary
This table displays the results of Average ARD of SAE estimates and LFS direct estimates expressed in percentage. The information is grouped by Sample size (appearing as row headers), Average ARD between LFS direct estimates and Census estimates, Average ARD between SAE estimates and Census estimates and Average ARD between HB estimates and Census estimates (appearing as column headers).
Sample size |
Average ARD between LFS direct estimates and Census estimates |
Average ARD between SAE estimates and Census estimates |
Average ARD between HB estimates and Census estimates |
28 smallest areas |
70.4% |
17.7% |
18.3% |
Next 28 smallest areas |
38.7% |
18.9% |
19.0% |
Next 28 smallest areas |
26.2% |
13.8% |
14.1% |
Next 28 smallest areas |
20.9% |
12.7% |
13.0% |
28 largest areas |
13.2% |
10.2% |
10.3% |
Overall |
33.9% |
14.7% |
14.9% |
As expected, the ARD between the LFS and Census direct
estimates decreases as the sample size increases. This may suggest that the
conceptual differences between these two surveys and nonsampling errors are
reasonably small compared with the sampling error, especially for the smallest
areas where the sampling error may be the main contributor to the ARD. The SAE
estimates are much closer to the Census estimates than the LFS direct estimates, particularly for the smallest areas
where improvement is most needed. This confirms that our underlying models are
reasonable in this application.
For comparison purposes, we also computed HB estimates,
based on the matched Fay-Herriot model with
the noninformative priors for
and
provided in Section 5. We then computed
ARDs between HB estimates and Census estimates,
Results are given in the last column of
Table 6.1. The averaged ARDs of the HB estimates are close to those of the
EBLUP estimates.