# Development of a small area estimation system at Statistics Canada

Section 7. Conclusion

A frequent demand from users of data from NSOs is for more granularity for use in planning and policy research purposes. NSOs can no longer simply increase the sample sizes of their surveys to obtain reliable estimates at the requested level of detail. Reasons for this include the high costs of doing so, response burden concerns, as well as the difficult task of obtaining responses from sampled units. An alternative being investigated by many NSOs is the use of small area estimation techniques that provide a way to address the demand for more granular data. With this in mind, Statistics Canada began the development of an SAE production system in the early 2000s and now have such a system available to their statistical programs. The production system handles area and unit level models, with multiple options such as different methods to estimate the variance components, different linking models and both the EBLUP and HB estimation methods. It is currently being used to produce experimental estimates for several Statistics Canada statistical programs and it is expected that the first published small area estimates will be available in 2019.

As it was
mentioned in the introduction, the only existing software in 2006 that would
produce small area estimates and their associated mean squared estimates was
sponsored by the EURAREA (2004) project.
The current production system developed at Statistics Canada is written in SAS,
with its methodology closely following Rao (2003) and includes some recent
advances. As it stands, it satisfies the existing requirements for small area
estimation at Statistics Canada. However, as
the use of small area estimation becomes more common within Statistics Canada,
there will be a need to add functionality to the system to meet this demand.
The recent book authored by Rao and Molina (2015) provides an idea of how much
development has taken place in small area estimation during recent years. The
incorporation of all this development into the production system would be
extremely time consuming, expensive, and may not be directly applicable to the
needs of Statistics Canada. It, therefore, follows that options other than programming these new functionalities in
the current SAS production system should be considered. One option would be to
investigate how packages developed elsewhere, such as those written in
$R,$
can be integrated
into it. Notable packages written in R include *sae *(Molina and Marhuenda, 2015), *mme *(Lopez-Vizcaino, Lombardia and Morales, 2014), *saery* (Esteban, Morales and Perez, 2014) and *sae2* (Fay and Diallo, 2015). These packages include small area
procedures that are not in the present system such as multinomial linear mixed
models, area level models with time effects and time series area level models
supporting univariate and multivariate applications. The existing SAS production system meets the needs of
Statistics Canada at this point in time, and there are no concrete plans to add
functionality to it.

## Acknowledgements

We would like to thank the reviewers for their comments and suggestions that led to improvements in this paper.

## Appendix

### Justification of the coefficient of determination

In order to determine a coefficient of determination associated with the linking model, ${\theta}_{i}={z}_{i}^{T}\beta +{b}_{i}{v}_{i},$ we first rewrite it as

$${\tilde{\theta}}_{i}={\tilde{z}}_{i}^{T}\beta +{v}_{i},$$

where ${\tilde{\theta}}_{i}={\theta}_{i}/{b}_{i}$ and ${\tilde{z}}_{i}={z}_{i}/{b}_{i}.$ We assume that an intercept is implicitly or explicitly included in ${\tilde{z}}_{i};$ i.e., there exists a vector $\lambda $ such that ${\lambda}^{T}{\tilde{z}}_{i}=1.$ In other words, we assume that there exists a vector $\lambda $ such that ${b}_{i}={\lambda}^{T}{z}_{i}.$ If ${\tilde{\theta}}_{i},$ $i=1,\text{\hspace{0.17em}}\dots ,\text{\hspace{0.17em}}m,$ were known, we could estimate the unknown vector of model parameters $\beta $ by the least squares estimator

$${\widehat{\beta}}_{*}={\left({\displaystyle \sum _{i=1}^{m}{\tilde{z}}_{i}{\tilde{z}}_{i}^{T}}\right)}^{-1}{\displaystyle \sum _{i=1}^{m}{\tilde{z}}_{i}{\tilde{\theta}}_{i}}$$

and the unknown model variance ${\sigma}_{v}^{2}$ by the unbiased estimator

$${\widehat{\sigma}}_{v*}^{2}=\frac{{\displaystyle {\sum}_{i=1}^{m}{\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right)}^{2}}}{m-q}.$$

The well-known adjusted coefficient of determination is

$${R}_{\text{ideal}}^{2}=1-\frac{{\widehat{\sigma}}_{v*}^{2}}{{\left(m-1\right)}^{-1}{\displaystyle {\sum}_{i=1}^{m}{\left({\tilde{\theta}}_{i}-\overline{\tilde{\theta}}\right)}^{2}}},\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.1)$$

where $\overline{\tilde{\theta}}={m}^{-1}{\displaystyle {\sum}_{i=1}^{m}{\tilde{\theta}}_{i}}.$ It is an ideal coefficient of determination because it cannot be computed (since ${\tilde{\theta}}_{i}$ is unknown) but this is the target we would like to estimate. Simply replacing ${\theta}_{i}$ with ${\widehat{\theta}}_{i}^{}$ does not solve the problem as ${\widehat{\theta}}_{i}^{}$ reflects the combined model and not just the linking model. The resulting coefficient of determination would typically be too small. To obtain a better estimate of ${R}_{\text{ideal}}^{2},$ we first decompose $\sum}_{i=1}^{m}{\left({\tilde{\theta}}_{i}-\overline{\tilde{\theta}}\right)}^{2$ as

$$\sum _{i=1}^{m}{\left({\tilde{\theta}}_{i}-\overline{\tilde{\theta}}\right)}^{2}}={\displaystyle \sum _{i=1}^{m}{\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right)}^{2}}+{\displaystyle \sum _{i=1}^{m}{\left({\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}-\overline{\tilde{\theta}}\right)}^{2}}+2{\displaystyle \sum _{i=1}^{m}\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right)\left({\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}-\overline{\tilde{\theta}}\right)}.\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.2)$$

Assuming that an intercept is implicitly or explicitly included in ${\tilde{z}}_{i}$ and from the expression for ${\widehat{\beta}}_{*},$ we have that

$$\sum _{i=1}^{m}\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right){\tilde{z}}_{i}}=0\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.3)$$

and

$$\sum _{i=1}^{m}\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right)}=0.\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.4)$$

From (A.4), we can rewrite $\overline{\tilde{\theta}}$ as $\overline{\tilde{\theta}}={\overline{\tilde{z}}}^{T}{\widehat{\beta}}_{*},$ where $\overline{\tilde{z}}={m}^{-1}{\displaystyle {\sum}_{i=1}^{m}{\tilde{z}}_{i}}.$ As a result, the cross product term in (A.2) vanishes and equation (A.2) reduces to

$$\begin{array}{lll}{\displaystyle \sum _{i=1}^{m}{\left({\tilde{\theta}}_{i}-\overline{\tilde{\theta}}\right)}^{2}}\hfill & =\hfill & {\displaystyle \sum _{i=1}^{m}{\left({\tilde{\theta}}_{i}-{\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}\right)}^{2}}+{\displaystyle \sum _{i=1}^{m}{\left({\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}-{\overline{\tilde{z}}}^{T}{\widehat{\beta}}_{*}\right)}^{2}}\hfill \\ \hfill & =\hfill & \left(m-q\right){\widehat{\sigma}}_{v*}^{2}+\left(m-1\right){S}^{2}\left({\widehat{\beta}}_{*}\right),\hfill \end{array}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.5)$$

where

$${S}^{2}\left({\widehat{\beta}}_{*}\right)=\frac{{\displaystyle {\sum}_{i=1}^{m}{\left({\tilde{z}}_{i}^{T}{\widehat{\beta}}_{*}-{\overline{\tilde{z}}}^{T}{\widehat{\beta}}_{*}\right)}^{2}}}{m-1}.\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.6)$$

From (A.5), it follows that the ideal coefficient of determination (A.1) can be rewritten as

$${R}_{\text{ideal}}^{2}=1-\frac{{\widehat{\sigma}}_{v*}^{2}}{\frac{\left(m-q\right)}{\left(m-1\right)}{\widehat{\sigma}}_{v*}^{2}+{S}^{2}\left({\widehat{\beta}}_{*}\right)}\equiv f({\widehat{\beta}}_{*},\text{\hspace{0.17em}}{\widehat{\sigma}}_{v*}^{2}).\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(\text{A}.7)$$

The only unknown quantities in (A.7) are ${\widehat{\beta}}_{*}$ and ${\widehat{\sigma}}_{v*}^{2}.$ A computable coefficient of determination can thus be obtained by replacing ${\widehat{\beta}}_{*}$ and ${\widehat{\sigma}}_{v*}^{2}$ in (A.7) with $\widehat{\beta}$ and ${\widehat{\sigma}}_{v}^{2},$ the consistent estimators of $\beta $ and ${\sigma}_{v}^{2}$ implemented in the SAE system and described in Section 3. The resulting coefficient of determination can be expressed as ${R}^{2}=f\left(\widehat{\beta},\text{\hspace{0.17em}}{\widehat{\sigma}}_{v}^{2}\right),$ with the function $f\left(\cdot ,\cdot \right)$ defined in (A.7), and is a consistent estimator of the ideal coefficient of determination ${R}_{\text{ideal}}^{2}.$

## References

Battese, G.E.,
Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction
of crop areas using survey and satellite data. *Journal of the American
Statistical Association*, 83, 28-36.

Brackstone, G.J. (1987). Small area data: Policy issues
and technical challenges. In *Small
Area Statistics*, (Eds., R. Platek, J.N.K. Rao, C.-E. Särndal and
M.P. Singh),

Chib, S., and Greenberg, E. (1995). Understanding the
Metropolis-Hastings algorithm. *American Statistician*, 49, 327-335.

Dick, P. (1995).
Modelling net undercoverage in the 1991 Canadian Census. *Survey Methodology*,
21, 1, 45-54. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1995001/article/14411-eng.pdf.

Drew, D., Singh,
M.P. and Choudhry, G.H. (1982). Evaluation of small area estimation techniques
for the Canadian Labour Force Survey. *Survey Methodology*, 8, 1, 17-47.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1982001/article/14328-eng.pdf.

Esteban, M.D., Morales, D. and Perez, A. (2014). saery: Small Area Estimation for Rao and Yu Model. URL http://CRAN.R-project.org/package=saery. R package version 1.0.

Estevao, V.,
Hidiroglou, M.A. and You, Y. (2015). *Area
Level Model, Unit Level, and Hierarchical Bayes Methodology Specifications*.
Internal document, Statistics Canada.

EURAREA (2004). *Enhancing Small Area Estimation Techniques
to meet European Needs*. https://cordis.europa.eu/project/rcn/58374_en.html.

Fay, R.E., and Diallo, M. (2015). sae2: Small Area Estimation: Time-Series Models. URL http://CRAN.Rproject.org/package=sae2. R package version 0.1-1.

Fay, R.E., and
Herriot, R.A. (1979). Estimation of income for small places: An application of
James-Stein procedures to Census data. *Journal of the American Statistical
Association*, 74, 269-277.

Fuller, W.A.,
and Rao, J.N.K. (2001). A regression composite estimator with application to
the Canadian Labour Force Survey. *Survey Methodology*, 27, 1, 45-51.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2001001/article/5853-eng.pdf.

Gambino, J.,
Kennedy, B. and Singh, M.P. (2001). Regression composite estimation for the
Canadian Labour Force Survey: Evaluation and implementation. *Survey
Methodology*, 27, 1, 65-74. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2001001/article/5855-eng.pdf.

Gelfand, A.E., and Smith, A.F.M. (1990). Sample-based approaches to calculating
marginal densities. *Journal of the American Statistical Association*, 85,
972-985.

Ghangurde, P.D.,
and Singh, M.P. (1977). Synthetic estimation in periodic household surveys. *Survey Methodology*, 3, 2, 152-181.

Gonzalez, M.E.,
and Hoza, C. (1978). Small-area estimation
with application to unemployment and housing estimates. *Journal of the American Statistical
Association*, 73, 7-15.

Kott, P. (1989).
Robust small domain estimation using random effects modeling. *Survey
Methodology*, 15, 1, 3-12. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/1989001/article/14581-eng.pdf.

Li, H., and
Lahiri, P. (2010). Adjusted maximum
method in the small area estimation problem. *Journal of Multivariate
Analysis*, 101, 882-892.

Lopez-Vizcaino, E., Lombardia, M.J. and Morales, D. (2014). mme: Multinomial Mixed Effects Models, 2014. URL http://CRAN.R-project.org/package=mme. R package version 0.1-5.

Molina,
*The R Journal*, 7, 1, 81-98.

Prasad, N.G.N.,
and Rao, J.N.K. (1990). The estimation of the mean
squared error of small-area estimators. *Journal of the American Statistical
Association*, 85, 163-171.

Prasad, N.G.N.,
and Rao, J.N.K. (1999). On robust small area estimation using a simple random
effects model. *Survey Methodology*, 25, 1, 67-72. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/1999001/article/4713-eng.pdf.

Rao, J.N.K.
(2003). *Small Area Estimation*.

Rao, J.N.K., and
Molina,
*Small Area Estimation*.

Rivest, L.-P.,
and Belmonte, E. (2000). A conditional mean squared error of small area
estimators. *Survey Methodology*, 26, 1, 67-78. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2000001/article/5179-eng.pdf.

Rubin-Bleuer, S.
(2014). *Specifications for EBLUP and Pseudo-EBLUP Estimators with
Nonnegligible Sampling Fractions*.
Statistics

Rubin-Bleuer,
S., Jang, L. and Godbout, S. (2016). The Pseudo-EBLUP estimator for a weighted
average with an application to the Canadian Survey of Employment, Payrolls and
Hours. *Journal of Survey Statistics and Methodology*, 4, 417-435.

Singh, M.P., and
Tessier, R. (1976). Some estimators for domain totals. *Journal of the
American Statistical Association*,
71, 322-325.

Singh, A.C.,
Kennedy, B. and Wu, S. (2001). Regression composite estimation for the Canadian
Labour Force Survey with a rotating panel design. *Survey Methodology*,
27, 1, 33-44. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2001001/article/5852-eng.pdf.

Stukel, D., and
Rao, J.N.K. (1997). Small-area estimation under two-fold nested error regression
model. *Journal of Statistical Planning
and Inference*, 78, 131-147.

Wang, J., and
Fuller, W.A. (2003). The mean square error of small area estimators constructed
with estimated area variances. *Journal of American Statistical Association*,
98, 716-723.

Wang, J.,
Fuller, W.A. and Qu, Y. (2008). Small area estimation under
a restriction. *Survey Methodology*, 34, 1, 29-36. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2008001/article/10619-eng.pdf.

You, Y., and
Rao, J.N.K. (2002). A pseudo empirical best linear unbiased prediction approach
to small area estimation using survey weights. *The Canadian Journal of
Statistics*, 30, 431-439.

You, Y., Rao,
J.N.K. and Dick, P. (2004). Benchmarking hierarchical Bayes small area
estimators in the Canadian census undercoverage
estimation. *Statistics in Transition*,
6, 631-640.

You, Y., Rao,
J.N.K. and Hidiroglou, M. (2013). On the performance of self benchmarked small
area estimators under the Fay-Herriot area level model. *Survey Methodology*,
39, 1, 217-229. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2013001/article/11830-eng.pdf.

- Date modified: