# Development of a small area estimation system at Statistics Canada

Section 6. Application to Labour Force Survey (LFS) data

Statistics Canada’s LFS is a monthly survey with a stratified two-stage design. It is designed to produce reliable unemployment rate estimates for the 55 Employment Insurance Economic Regions (EIER) in Canada. The unemployment rate in any given area $i$ is defined as the ratio

$${\theta}_{i}=\frac{{\displaystyle {\sum}_{j\in {U}_{i}}{y}_{1j}}}{{\displaystyle {\sum}_{j\in {U}_{i}}{y}_{2j}}},$$

where ${y}_{1j}$ is a binary variable indicating whether person $j$ is unemployed $\left({y}_{1j}=1\right)$ or not $\left({y}_{1j}=0\right),$ and ${y}_{2j}$ is a binary variable indicating whether person $j$ is in the labour force $\left({y}_{2j}=1\right)$ or not $\left({y}_{2j}=0\right).$ The direct estimator of ${\theta}_{i}$ is the calibration composite estimator described in Fuller and Rao (2001). See also Singh, Kennedy and Wu (2001) and Gambino, Kennedy and Singh (2001). It can be written in the weighted form

$${\widehat{\theta}}_{i}=\frac{{\displaystyle {\sum}_{j\in {s}_{i}}{w}_{j}{y}_{1j}}}{{\displaystyle {\sum}_{j\in {s}_{i}}{w}_{j}{y}_{2j}}},$$

where ${w}_{j}$ is a calibration composite weight for person $j.$

As mentioned above, the calibration composite estimator is reliable for the estimation of the unemployment rate for the 55 EIERs. There is also interest in obtaining reliable estimates for 149 areas (cities) in Canada . Among them, there are 34 Census Metropolitan Areas (CMA) and 115 Census Areas (CA). The CMAs are the largest cities in terms of population size and they usually have a large sample size as well. Some of the CAs have a very small sample size, sometimes even 0. For those CAs and other larger CAs, the sample size is not large enough to produce sufficiently reliable direct estimates of the monthly unemployment rate. Our objective was to investigate whether the Fay-Herriot model could be used to obtain monthly estimates that would be reliable enough to be published.

We constructed an auxiliary variable ${z}_{1i},$ for area $i,$ given by ${z}_{1i}={N}_{i}^{\text{EIB}}/{N}_{i}^{15+},$ where ${N}_{i}^{\text{EIB}}$ is the number of employment insurance beneficiaries in area $i$ and ${N}_{i}^{15+}$ is the number of persons aged 15 years or older in area $i.$ The numerator is obtained from an administrative source, whereas the denominator is a Census projection computed by Statistics Canada. We used the vector ${z}_{i}={\left(1,\text{\hspace{0.17em}}{z}_{1i}\right)}^{T},$ along with ${b}_{i}=1,$ $i=1,\text{\hspace{0.17em}}\dots ,\text{\hspace{0.17em}}m,$ to obtain SAE estimates. We used May 2016 data in this investigation to allow the comparison of direct and SAE estimates with 2016 Census estimates.

Some of the 149 areas of interest had a very small sample size in the LFS: they were not used in the Fay-Herriot and smoothing models. As a rule of thumb, we excluded from the models, areas where the number of sampled persons in the labour force was smaller than 10. There were 9 such areas; among them, six had no sampled person in the labour force. Also, there were 9 other areas where the direct unemployment rate estimate, ${\widehat{\theta}}_{i},$ and its direct variance estimate, ${\widehat{\psi}}_{i},$ were both equal to 0. As these direct estimates were not deemed to be reliable enough, their associated areas were excluded from the models. This resulted in using only 131 areas in the models. For those areas, the small area estimates are EBLUP estimates, with the remaining 18 being synthetic estimates.

The estimator ${\widehat{\psi}}_{i}$ of the direct variance ${\psi}_{i}$ was obtained via the Rao-Wu bootstrap. The estimates of the smooth design variances were then obtained by using ${x}_{i}={\left(1,\text{\hspace{0.17em}}\mathrm{log}\left({z}_{1i}\right),\text{\hspace{0.17em}}\mathrm{log}\left(1-{z}_{1i}\right),\text{\hspace{0.17em}}\mathrm{log}\left({N}_{i}^{15+}\right)\right)}^{T}.$ A graph of the residuals of the smoothing model, $\mathrm{log}\left({\widehat{\psi}}_{i}\right)-{x}_{i}^{\prime}\widehat{\alpha},$ versus the predicted values, ${x}_{i}^{\prime}\widehat{\alpha},$ did not reveal any obvious model misspecification. Figure 6.1 shows a graph of direct variances estimates, ${\widehat{\psi}}_{i},$ versus smooth variance estimates, ${\widehat{\tilde{\psi}}}_{i}.$ The red line is the identity line. If the smoothing model is appropriate, for any value of ${\widehat{\tilde{\psi}}}_{i},$ the average of direct variance estimates for areas around area $i$ should be roughly equal to ${\widehat{\tilde{\psi}}}_{i}.$ This means that the red line should pass roughly through the middle of the points everywhere. From a quick inspection of Figure 6.1, we observe that the red line is close to the middle of the points although probably slightly above the middle due to some extreme values of ${\widehat{\psi}}_{i}.$ This may result in a slight overestimation of the true smooth variance ${\tilde{\psi}}_{i}={E}_{mp}\left({\widehat{\psi}}_{i}\right).$ A slight overestimation is not a major issue. What has to be avoided is an underestimation of ${\tilde{\psi}}_{i},$ as it typically leads to underestimating the MSE of the SAE estimate. This would provide the user with a false impression of precision.

Overall, we were satisfied with our smoothed variance estimates. However, for areas with large sample sizes, we set ${\widehat{\tilde{\psi}}}_{i}={\widehat{\psi}}_{i}$ as our estimate of ${\tilde{\psi}}_{i}.$ We assumed that direct variance estimates were stable enough when the sample size is large. As a rule of thumb, we set ${\widehat{\tilde{\psi}}}_{i}={\widehat{\psi}}_{i}$ when the number of sampled persons in the labour force was greater than 400. This replacement occurred for 35 areas. The strategy was used to avoid possible small model biases in ${\widehat{\tilde{\psi}}}_{i}$ for the largest areas, which could result in EBLUP estimates that become significantly different from the direct estimates. This is not a desirable property for areas with a large sample size.

The smooth variance estimates were then used to obtain small area estimates for the 149 areas of interest. Figure 6.2 shows a graph of small area and direct estimates as a function of sample size (number of sampled persons in the labour force). The small area estimates are much less volatile than direct estimates, especially for the areas with the smallest sample sizes. For the largest areas, as expected, both estimates are similar.

We first evaluated the quality of the underlying Fay-Herriot model before looking at the MSE estimates. Figure 6.3 shows the graph of direct estimates, ${\widehat{\theta}}_{i},$ versus predicted values, ${z}_{i}^{T}\widehat{\beta}.$ The red line is the identity line and the blue line is a nonparametric smoothing spline curve. If the linearity assumption holds, the blue line should be close to the red line and the latter should pass roughly through the middle of the points everywhere. Figure 6.3 does not give any indication that the linearity assumption of the Fay-Herriot model is questionable.

## Description for Figure 6.1

Scatter plot presenting the direct variances estimates (ranging from 0 to 300 on the y-axis) versus GVF smooth variance estimates (ranging from 0 to 60 on the x-axis). Composite estimates are presented for 131 areas and the parameter is the unemployment rate. A red line representing the identity line is added to the graph. The red line is close to the middle of the points although probably slightly above the middle due to some extreme values of direct variance estimates. This may result in a slight overestimation of the true smooth variance.

## Description for Figure 6.2

Graph of small area and direct estimates (ranging from 0 to 40 on the y-axis) as a function of sample size (ranging from 0 to 1,239 on the x-axis). Composite estimates for 131 areas, synthetic estimates for 18 areas and direct estimates for 143 areas are represented and the parameter is the unemployment rate. The small area estimates are much less volatile than direct estimates, especially for the areas with the smallest sample sizes. For the largest areas, as expected, both estimates are similar.

## Description for Figure 6.3

Scatter plot of direct estimates (ranging from 0 to 30 on the y-axis) versus model predicted values (ranging from 5 to 12 on the x-axis). The parameter is the unemployment rate, 131 areas are represented and the R-squared is 0.63. The red line is the identity line and the blue line is a nonparametric smoothing spline curve. If the linearity assumption holds, the blue line should be close to the red line and the latter should pass roughly through the middle of the points everywhere. This graph does not give any indication that the linearity assumption of the Fay-Herriot model is questionable.

It is also informative to compute a measure that indicates the strength of ${z}_{i}$ for the prediction of ${\theta}_{i}.$ To this end, we developed and implemented a coefficient of determination, or ${R}^{2}$ value, associated with the linking model ${\theta}_{i}={z}_{i}^{T}\beta +{b}_{i}{v}_{i}.$ Note that the coefficient of determination associated with the combined model, ${\widehat{\theta}}_{i}={z}_{i}^{T}\beta +{b}_{i}{v}_{i}+{e}_{i},$ is not of interest as the objective is not the prediction of ${\widehat{\theta}}_{i}^{}$ but the prediction of ${\theta}_{i}.$ Our coefficient of determination is given by

$${R}^{2}=1-\frac{{\widehat{\sigma}}_{v}^{2}}{\frac{\left(m-q\right)}{\left(m-1\right)}{\widehat{\sigma}}_{v}^{2}+{S}^{2}\left(\widehat{\beta}\right)},$$

where $q$ is the dimension of ${z}_{i}$ and ${S}^{2}\left(\widehat{\beta}\right)$ is the sample variance of ${z}_{i}^{T}\widehat{\beta}/{b}_{i}$ (see equation (A.6) for the exact definition of the function ${S}^{2}(\cdot )).$ The details of the derivation of the above coefficient of determination are provided in the Appendix. Figure 6.3 indicates that the ${R}^{2}$ value is 0.63. The linking model is thus neither weak nor extremely strong but, hopefully, strong enough to achieve efficiency gains over the direct estimator. The system also produces estimates of the parameters of the Fay-Herriot model along with their standard errors. From this output, we found out that estimates of both the intercept and slope parameters of the Fay-Herriot model were significantly different from 0 using a standard Wald test at the 0.05 significance level.

Figure 6.4 shows a graph of standardized residuals, $\left({\widehat{\theta}}_{i}-{z}_{i}^{T}\widehat{\beta}\right)/\sqrt{{b}_{i}^{2}{\widehat{\sigma}}_{v}^{2}+{\widehat{\tilde{\psi}}}_{i}},$ versus standardized predicted values, ${z}_{i}^{T}\widehat{\beta}/\sqrt{{b}_{i}^{2}{\widehat{\sigma}}_{v}^{2}+{\widehat{\tilde{\psi}}}_{i}}.$ The red line is a horizontal line at zero and the blue line is a nonparametric smoothing spline curve. Similarly to Figure 6.3, the blue line should be close to the red line under linearity and the latter should pass roughly through the middle of the points everywhere. Again, Figure 6.4 does not indicate any obvious failure of the linearity assumption underlying the Fay-Herriot model.

## Description for Figure 6.4

Scatter plot of model standardized residuals (from -3 to 8 on the y-axis) versus model standardized predicted values (from 0 to 8 on the x-axis). The parameter is the unemployment rate and 131 areas are represented. The red line is a horizontal line at zero and the blue line is a nonparametric smoothing spline curve. The blue line should be close to the red line under linearity and the latter should pass roughly through the middle of the points everywhere. Again, this graph does not indicate any obvious failure of the linearity assumption underlying the Fay-Herriot model.

Figure 6.5 shows a graph of squared standardized residuals versus standardized predicted values. The red line is a horizontal line at one and the blue line is again a nonparametric smoothing spline curve. This graph is used to check the homoscedasticity assumption; i.e., the assumption that the model variance ${\sigma}_{v}^{2}$ is constant. Under homoscedasticity, the blue line should be close to the red line everywhere. The graph does not reveal any obvious presence of heteroscedasticity.

## Description for Figure 6.5

Scatter plot of the square of model standardized residuals (ranging from 0 to 30 on the y-axis) versus model standardized predicted values (ranging from 0 to 8 on the x-axis). The parameter is the unemployment rate and 131 areas are represented. The red line is a horizontal line at one and the blue line is again a nonparametric smoothing spline curve. This graph is used to check the homoscedasticity assumption under which the blue line should be close to the red line everywhere. The graph does not reveal any obvious presence of heteroscedasticity.

Figure 6.6 shows a QQ-plot of standardized residual quantiles versus standard normal quantiles. It is used to verify the normality assumption of the errors ${b}_{i}{v}_{i}$ and ${e}_{i}.$ The graph does indicate a modest departure from normality. However, Rao and Molina (2015, page 138) argued that EBLUP estimates and their corresponding MSE estimates are generally robust to deviations from normality.

The system also computes Cook’s distances to identify areas that could have a significant influence on the estimate $\widehat{\beta}.$ The Cook distance for area $i$ is given by

$${D}_{i}=\frac{1}{q}{\left(\widehat{\beta}-{\widehat{\beta}}^{\left(-i\right)}\right)}^{T}{\displaystyle \sum _{j=1}^{m}\frac{{z}_{j}{z}_{j}^{T}}{{b}_{j}^{2}{\widehat{\sigma}}_{v}^{2}+{\widehat{\tilde{\psi}}}_{j}}}\left(\widehat{\beta}-{\widehat{\beta}}^{\left(-i\right)}\right),$$

where ${\widehat{\beta}}^{\left(-i\right)}$ is the estimate of $\beta $ obtained after deleting area $i.$ A plot of the influences ${D}_{i}$ is provided in Figure 6.7. One area seems to have a relatively large influence compared with other areas $\left({D}_{i}=\text{1}\text{.2851}\right).$ This area has the largest standardized predicted value and the second largest predicted value. Its standardized residual is -1.88, which is not extreme, although not very small either. Its sample size is large (number of sampled persons in the labour force close to 500) and its smooth variance estimate, ${\widehat{\tilde{\psi}}}_{i},$ is relatively small compared with other areas. All these reasons explain why this area was detected as being influential. In this application, we decided to keep this area in the model as its influence was not large enough to make a big difference in the SAE estimates and their corresponding MSE estimates.

## Description for Figure 6.6

QQ-plot of REML standardized residual quantiles (ranging from -3 to 5 on the y-axis) versus standard normal quantiles (ranging from -3 to 3 on the x-axis). The parameter is the unemployment rate. There are 131 areas, the sample mean is 0.048, the sample variance is 0.994, the normal test $p$-value is 0.000 and the sign test $p$-value is 0.294. This graph is used to verify the normality assumption of the errors. The graph indicates a modest departure from normality.

## Description for Figure 6.7

Scatter plot of Cook’s distances. The REML influence measure is on the y-axis, ranging from 0.0 to 1.3. The area number from 1 to 151 is on the x-axis. The parameter is the unemployment rate. One area seems to have a relatively large influence compared with other areas (influence measure = 1.2851). This area has the largest standardized predicted value and the second largest predicted value. Its standardized residual is -1.88. Its sample size is large and its smooth variance estimate is relatively small compared with other areas. All these reasons explain why this area was detected as being influential.

Since the Fay-Herriot model and smoothing model were both reasonable, we computed MSE estimates to evaluate the magnitude of the efficiency gains, if any, obtained by using the Fay-Herriot model. Figure 6.8 shows the estimated direct Coefficient of Variation (CV), defined as $\sqrt{{\widehat{\tilde{\psi}}}_{i}}/{\widehat{\theta}}_{i},$ and the estimated SAE Relative Root Mean Square Error (RRMSE), defined as $\sqrt{{\widehat{\varphi}}_{i}}/{\widehat{\theta}}_{i}^{\text{SAE}},$ where ${\widehat{\varphi}}_{i}$ is an estimate of the MSE, ${E}_{mp}{\left({\widehat{\theta}}_{i}^{\text{SAE}}-{\theta}_{i}\right)}^{2},$ and ${\widehat{\theta}}_{i}^{\text{SAE}}$ is the small area estimate (EBLUP or synthetic estimate) of ${\theta}_{i}^{}.$ The sample size (number of sampled persons in the labour force) is given on the horizontal axis. The estimated direct CVs are in general much larger than the estimated SAE RRMSEs, especially for the areas with the smallest sample sizes. The estimated SAE RRMSEs are never above 20% whereas the estimated direct CV is over 300% for one area. The estimated SAE RRMSEs are also very stable as a function of the sample size unlike the erratic behavior of the estimated direct CVs. For the areas with the largest sample sizes, both estimates are very similar, as expected. This indicates that SAE methods can lead to a substantial increase of precision over direct estimation methods, particularly for the smallest areas.

## Description for Figure 6.8

Graph of estimated direct CVs and SAE RRMSEs (ranging from 0 to 4 on the y-axis) as a function of sample size (ranging from 0 to 1,239 on the x-axis). Composite estimates for 131 areas, synthetic estimates for 18 areas and direct estimates for 131 areas are represented and the parameter is the unemployment rate. The estimated direct CVs are in general much larger than the estimated SAE RRMSEs, especially for the areas with the smallest sample sizes. The estimated SAE RRMSEs are never above 20% whereas the estimated direct CV is over 300% for one area. The estimated SAE RRMSEs are also very stable as a function of the sample size unlike the erratic behavior of the estimated direct CVs. For the areas with the largest sample sizes, both estimates are very similar. This indicates that SAE methods can lead to a substantial increase of precision over direct estimation methods, particularly for the smallest areas.

For the month of May 2016, we had the luxury of having a very reliable source for the estimation of the unemployment rates: the 2016 long form Census administered to roughly one-fourth of the households throughout Canada . The Census sample size is much larger than the LFS sample size in all the areas of interest. Therefore, we used the 2016 Census direct estimates, denoted by ${\widehat{\theta}}_{i}^{\text{Census}},$ as a gold standard for evaluating the accuracy of both the LFS direct estimates and SAE estimates. We computed Absolute Relative Differences (ARD) between LFS direct estimates and Census estimates, $\left|\text{\hspace{0.17em}}{\widehat{\theta}}_{i}-{\widehat{\theta}}_{i}^{\text{Census}}\text{\hspace{0.17em}}\right|\text{\hspace{0.17em}}/{\widehat{\theta}}_{i}^{\text{Census}},$ as well as ARDs between SAE estimates and Census estimates, $\left|\text{\hspace{0.17em}}{\widehat{\theta}}_{i}^{\text{SAE}}-{\widehat{\theta}}_{i}^{\text{Census}}\text{\hspace{0.17em}}\right|\text{\hspace{0.17em}}/{\widehat{\theta}}_{i}^{\text{Census}}.$ These ARDs were then averaged within 5 different homogeneous subgroups with respect to sample size. Table 6.1 summarizes the results.

Sample size | Average ARD between LFS direct estimates and Census estimates | Average ARD between SAE estimates and Census estimates | Average ARD between HB estimates and Census estimates |
---|---|---|---|

28 smallest areas | 70.4% | 17.7% | 18.3% |

Next 28 smallest areas | 38.7% | 18.9% | 19.0% |

Next 28 smallest areas | 26.2% | 13.8% | 14.1% |

Next 28 smallest areas | 20.9% | 12.7% | 13.0% |

28 largest areas | 13.2% | 10.2% | 10.3% |

Overall | 33.9% | 14.7% | 14.9% |

Note: Out of the 149 areas of interest in the LFS, 9 were excluded from this table: six where the LFS number of sampled persons in the labour force was 0 and three that were no longer in the list of CMAs /CAs after the 2016 Census. |

As expected, the ARD between the LFS and Census direct estimates decreases as the sample size increases. This may suggest that the conceptual differences between these two surveys and nonsampling errors are reasonably small compared with the sampling error, especially for the smallest areas where the sampling error may be the main contributor to the ARD. The SAE estimates are much closer to the Census estimates than the LFS direct estimates, particularly for the smallest areas where improvement is most needed. This confirms that our underlying models are reasonable in this application.

For comparison purposes, we also computed HB estimates, ${\widehat{\theta}}_{i}^{\text{HB}},$ based on the matched Fay-Herriot model with the noninformative priors for $\beta ,$ ${\sigma}_{v}^{2}$ and ${\tilde{\psi}}_{i}$ provided in Section 5. We then computed ARDs between HB estimates and Census estimates, $\left|\text{\hspace{0.17em}}{\widehat{\theta}}_{i}^{\text{HB}}-{\widehat{\theta}}_{i}^{\text{Census}}\text{\hspace{0.17em}}\right|\text{\hspace{0.17em}}/{\widehat{\theta}}_{i}^{\text{Census}}.$ Results are given in the last column of Table 6.1. The averaged ARDs of the HB estimates are close to those of the EBLUP estimates.

## Report a problem on this page

Is something not working? Is there information outdated? Can't find what you're looking for?

Please contact us and let us know how we can help you.

- Date modified: