# Development of a small area estimation system at Statistics Canada

Section 4. Unit level model

The original unit level model was proposed by Battese et al. (1988). They assumed the following nested error model

$${y}_{ij}={z}_{ij}^{T}\beta +{v}_{i}+{e}_{ij}\text{\hspace{1em}}\text{for}\text{\hspace{1em}}i=1,\text{\hspace{0.17em}}\dots ,\text{\hspace{0.17em}}m\text{\hspace{1em}}\text{and}\text{\hspace{1em}}j\in {U}_{i}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(4.1)$$

where
${v}_{i}\text{\hspace{0.17em}}\stackrel{\text{ind}}{{\displaystyle \sim}}\text{\hspace{0.17em}}\left(0,\text{\hspace{0.17em}}{\sigma}_{v}^{2}\right)$
are the
random effects and are independent of the random errors,
${e}_{i\text{}j},$
with
${e}_{i\text{}j}\text{\hspace{0.17em}}\stackrel{\text{ind}}{{\displaystyle \sim}}\text{\hspace{0.17em}}\left(0,\text{\hspace{0.17em}}{\sigma}_{e}^{2}\right).$
The
production system includes a slight modification to the error structure of the
random errors. That is,
${e}_{ij}\text{\hspace{0.17em}}\stackrel{\text{ind}}{{\displaystyle \sim}}\text{\hspace{0.17em}}\left(0,\text{\hspace{0.17em}}{\sigma}_{e}^{2}/{a}_{ij}\right),$
where
${a}_{i\text{}j}>0$
are
positive constants that account for heteroscedasticity.

The production system computes small area estimates for means $\left({\overline{Y}}_{ic}={\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}{y}_{ij}}/{\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}}\right)$ and totals $\left({Y}_{ic}={\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}{\overline{Y}}_{i}}\right).$ The ${c}_{ij}$ values are fixed positive constants known for all population units. The addition of ${c}_{ij}$ was necessary to allow the use of the system by some business surveys conducted at Statistics Canada (see Rubin-Bleuer, Jang and Godbout, 2016). The available auxiliary data are either totals ${Z}_{ic}={\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}{z}_{ij}},$ or means ${\overline{Z}}_{ic}={\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}{z}_{ij}}/{\displaystyle {\sum}_{j\in {U}_{i}}{c}_{ij}}.$

In what follows, we provide the estimators of the population means ${\overline{Y}}_{ic},$ say ${\widehat{\theta}}_{i}^{\text{SAE}},$ where $i=1,\text{\hspace{0.17em}}\dots ,\text{\hspace{0.17em}}M.$ Estimates of the corresponding totals ${Y}_{ic},$ are obtained by multiplying ${\widehat{\theta}}_{i}^{\text{SAE}}$ by ${\sum}_{j=1}^{{N}_{i}}{c}_{ij}}.$

The design weighted sample mean of the $y\u2019s$ and $z\u2019s$ are respectively

$${\overline{y}}_{iwc}={\left({\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{c}_{ij}}\right)}^{-1}{\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{c}_{ij}{y}_{ij}}$$

and

$${\overline{z}}_{iwc}={\left({\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{c}_{ij}}\right)}^{-1}{\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{c}_{ij}{z}_{ij}}.$$

The model based weighted means are

$${\overline{y}}_{ia}={\left({\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}}\right)}^{-1}\left({\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}{y}_{ij}}\right)$$

and

$${\overline{z}}_{ia}={\left({\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}}\right)}^{-1}\left({\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}{z}_{ij}}\right).$$

Battese et al. (1988) did not include survey design weights in their procedure, thereby forsaking design consistency unless the design was self-weighting. We refer to this estimator as EBLUP $\left({\widehat{\theta}}_{i}^{\text{EBLUP}}\right).$ However, EBLUP is the most efficient estimator under model (4.1), with error structure ${e}_{ij}\text{\hspace{0.17em}}\stackrel{\text{ind}}{{\displaystyle \sim}}\text{\hspace{0.17em}}\left(0,\text{\hspace{0.17em}}{\sigma}_{e}^{2}/{a}_{i\text{}j}\right),$ and this is the reason that it is included in the production system.

Kott (1989), Prasad and Rao (1999), and You
and Rao (2002) proposed the use of design-consistent model based estimators for the area means by including the survey
weight. The You and Rao (2002) procedure was suitably modified to reflect the
heteroscedastic residuals and the
${c}_{ij}\u2019s.$
The resulting Pseudo-EBLUP estimator, denoted
as PEBLUP
$\left({\widehat{\theta}}_{i}^{\text{PEBLUP}}\right),$
was included in the production system as it is
design consistent.

The EBLUP estimator is defined as

$${\widehat{\theta}}_{i}^{\text{EBLUP}}=\{\begin{array}{ll}{\widehat{\gamma}}_{ia}{\overline{y}}_{ia}+{\left({\overline{Z}}_{ic}-{\widehat{\gamma}}_{ia}{\overline{z}}_{ia}\right)}^{T}{\widehat{\beta}}^{\text{EBLUP}}\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in A\hfill \\ {\overline{Z}}_{ic}^{T}{\widehat{\beta}}^{\text{EBLUP}}\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in \overline{A}\hfill \end{array}$$

where ${\widehat{\gamma}}_{ia}={\left({\widehat{\sigma}}_{v}^{2}+{\widehat{\sigma}}_{e}^{2}/{\displaystyle {\sum}_{j\in {s}_{i}}{a}_{ij}}\right)}^{-1}{\widehat{\sigma}}_{v}^{2}.$ The terms ${\overline{y}}_{ia}$ and ${\overline{z}}_{ia},$ are the previously defined model based weighted means for $y$ and $z$ respectively. The regression vector $\beta $ is estimated as

$${\widehat{\beta}}^{\text{EBLUP}}={\left({\displaystyle \sum _{i=1}^{m}{\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}{c}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{iac}{\overline{z}}_{iac}\right){z}_{ij}^{T}}}\right)}^{-1}{\displaystyle \sum _{i=1}^{m}{\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}{c}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{iac}{\overline{z}}_{iac}\right){y}_{ij}}}.$$

The PEBLUP estimator, ${\widehat{\theta}}_{i}^{\text{PEBLUP}},$ is given by

$${\widehat{\theta}}_{i}^{\text{PEBLUP}}=\{\begin{array}{ll}{\widehat{\gamma}}_{iwc}{\overline{y}}_{iwc}+{\left({\overline{Z}}_{ic}-{\widehat{\gamma}}_{iwc}{\overline{z}}_{iwc}\right)}^{T}{\widehat{\beta}}^{\text{PEBLUP}}\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in A\hfill \\ {\overline{Z}}_{ic}^{T}{\widehat{\beta}}^{\text{PEBLUP}}\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in \overline{A}\hfill \end{array}$$

where ${\widehat{\gamma}}_{iwc}={\left({\widehat{\sigma}}_{v}^{2}+{\widehat{\sigma}}_{e}^{2}{\delta}_{iwc}^{2}\right)}^{-1}\left({\widehat{\sigma}}_{v}^{2}\right),$ and ${\delta}_{iwc}^{2}={\left({\displaystyle {\sum}_{j\in {s}_{i}}{w}_{ij}{c}_{ij}}\right)}^{-2}\left({\displaystyle {\sum}_{j\in {s}_{i}}{\left({w}_{ij}{c}_{ij}\right)}^{2}}/{a}_{ij}\right).$ The terms ${\overline{y}}_{iwc}$ and ${\overline{z}}_{iwc},$ are the previously defined design based weighted means for $y$ and $z$ respectively. The regression vector $\beta $ is estimated as

$${\widehat{\beta}}^{\text{PEBLUP}}={\left({\displaystyle \sum _{i=1}^{m}{\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{a}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{iwa}{\overline{z}}_{iwa}\right){z}_{ij}^{T}}}\right)}^{-1}{\displaystyle \sum _{i=1}^{m}{\displaystyle \sum _{j\in {s}_{i}}{w}_{ij}{a}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{iwa}{\overline{z}}_{iwa}\right){y}_{ij}}}$$

where ${\overline{z}}_{iwa}={\left({\displaystyle {\sum}_{j\in {s}_{i}}{w}_{ij}{a}_{ij}}\right)}^{-1}{\displaystyle {\sum}_{j\in {s}_{i}}{w}_{ij}{a}_{ij}{z}_{ij}},$ ${\widehat{\gamma}}_{iwa}={\left({\widehat{\sigma}}_{v}^{2}+{\widehat{\sigma}}_{e}^{2}{\delta}_{iwa}^{2}\right)}^{-1}{\widehat{\sigma}}_{v}^{2}$ and with ${\delta}_{iwa}^{2}$ computed as ${\delta}_{iwa}^{2}={\left({\displaystyle {\sum}_{j\in {s}_{i}}{w}_{ij}{a}_{ij}}\right)}^{-2}\left({\displaystyle {\sum}_{j\in {s}_{i}}{\left({w}_{ij}{a}_{ij}\right)}^{2}}/{a}_{ij}\right).$

The components of variance, ${\sigma}_{e}^{2}$ and ${\sigma}_{v}^{2},$ are estimated using the fitting-of-constants (not weighted by the survey weights) method, as given by Battese et al. (1988) or Rao (2003). The resulting estimators of ${\sigma}_{e}^{2}$ are always greater than or equal to zero, but the estimator of ${\sigma}_{v}^{2}$ may be negative. If ${\stackrel{\u2322}{\sigma}}_{v}^{2}<0,$ it is set to zero, implying that there are no area effects. The associated estimated MSEs are obtained by extending You and Rao (2002) and Stukel and Rao (1997).

Note that if the sample $s$ is selected from the universe $U,$ the realized sampling fraction, ${f}_{i}={n}_{i}/{N}_{i},$ could be non-negligible. For estimating a population mean, ${\overline{Y}}_{i},$ Rao and Molina (2015), accounted for non-negligible sampling fractions by expressing it as

$${\overline{Y}}_{i}={f}_{i}{\overline{y}}_{is}+\left(1-{f}_{i}\right){\overline{y}}_{\overline{\mathit{is}}}$$

where ${\overline{y}}_{is}$ is the sample mean of the ${i}^{\text{th}}$ sampled area and ${\overline{y}}_{\overline{\mathit{is}}}$ is the sample mean of the non-sampled units within that area. They predicted ${\overline{y}}_{\overline{\mathit{is}}}$ using the unit level model given by equation (4.1). Their expressions correspond to the case when ${c}_{ij}=1.$ This estimator was extended by Rubin-Bleuer (2014) to include the EBLUP and PEBLUP estimators for the case that ${c}_{ij}$ is arbitrary. Specific details that also account for MSE estimation can be found in Estevao et al. (2015).

## 4.1 Benchmarking

The current production system does not have a procedure to benchmark the estimates obtained via the unit level model. However, the difference adjustment approach can be suitably modified to allow this. The EBLUP and PEBLUP estimators are of the form

$${\widehat{\theta}}_{i}^{\text{SAE}}=\{\begin{array}{ll}{\widehat{\gamma}}_{i}^{*}{\overline{y}}_{i}^{*}+{({\overline{Z}}_{ic}-{\widehat{\gamma}}_{i}^{*}{\overline{z}}_{i}^{*})}^{T}{\widehat{\beta}}^{\text{SAE}}\hfill & \text{if}i\in A\hfill \\ {\overline{Z}}_{ic}^{T}{\widehat{\beta}}^{\text{SAE}}\hfill & \text{if}i\in \overline{A}\hfill \end{array}$$

where ${\widehat{\gamma}}_{i}^{*},$ ${\overline{y}}_{i}^{*},$ ${\overline{z}}_{i}^{*},$ and ${\widehat{\beta}}^{\text{SAE}}$ correspond to the terms defined previously: ${\widehat{\gamma}}_{i}^{*}$ is equal to ${\widehat{\gamma}}_{ia}$ for EBLUP, and to ${\widehat{\gamma}}_{iwc}$ for PEBLUP; ${\overline{y}}_{i}^{*}$ is equal to ${\overline{y}}_{ia}$ for EBLUP, and to ${\overline{y}}_{iwc}$ for PEBLUP; ${\overline{z}}_{i}^{*}$ is equal to ${\overline{z}}_{ia}$ for EBLUP, and to ${\overline{z}}_{iwc}$ for PEBLUP; and, ${\widehat{\beta}}^{\mathrm{SAE}}$ is equal to ${\widehat{\beta}}^{\text{EBLUP}}$ for EBLUP, and to ${\widehat{\beta}}^{\text{PEBLUP}}$ for PEBLUP.

Suppose that ${\widehat{\theta}}_{i}^{\text{SAE}}$ needs to be benchmarked to ${\widehat{\theta}}^{*}.$ The corresponding benchmarked estimator is

$${\widehat{\theta}}_{i}^{\text{SAE},\text{\hspace{0.17em}}b}=\{\begin{array}{ll}{\widehat{\theta}}_{i}^{\text{SAE}}+{\alpha}_{i}\left({\theta}^{*}-{\displaystyle \sum _{d\in A}{\omega}_{d}}{\widehat{\theta}}_{d}^{\text{SAE}}\right)\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in A\hfill \\ {\overline{Z}}_{ic}^{T}{\widehat{\beta}}^{\text{SAE}}\hfill & \text{if}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in \overline{A}\hfill \end{array}$$

where ${\alpha}_{i}={\left({\displaystyle {\sum}_{d\in A}{\omega}_{d}^{2}{\tau}_{d}}\right)}^{-1}\left({\omega}_{i}{\tau}_{i}\right).$ The ${\omega}_{i}$ term is defined as follows: ${\omega}_{i}=1$ if the benchmarking is to a total and ${\omega}_{i}={N}_{i}/N$ if the benchmarking is for the mean. Possible choices of the ${\tau}_{i}\u2019s$ are ${\widehat{\sigma}}_{v}^{2}+{\widehat{\sigma}}_{e}^{2}{\delta}_{ia}^{2},$ ${\delta}_{ia}^{2}={\left({\displaystyle {\sum}_{j=1}^{{n}_{i}}{a}_{ij}}\right)}^{-1},$ for EBLUP, and ${\widehat{\sigma}}_{v}^{2}+{\widehat{\sigma}}_{e}^{2}{\delta}_{iwc}^{2}$ for PEBLUP.

## 4.2 Mean squared error estimation

The mean squared error estimates of the unit level estimators are based on estimating its mean squared error, given model (4.1) and error structure ${e}_{ij}\text{\hspace{0.17em}}\stackrel{\text{ind}}{{\displaystyle \sim}}\text{\hspace{0.17em}}\left(0,\text{\hspace{0.17em}}{\sigma}_{e}^{2}/{a}_{ij}\right).$ Table 4.1 displays these estimated MSE’s.

Estimator | mse |
---|---|

EBLUP | $\text{mse}\left({\widehat{\theta}}_{i}^{\text{EBLUP}}\right)=\{\begin{array}{ll}{g}_{1ia}+{g}_{2ia}+2{g}_{3ia}\hfill & \text{for}i\in A\hfill \\ {\overline{Z}}_{i}^{T}\mathrm{var}\left({\widehat{\beta}}^{\text{EBLUP}}\right){\overline{Z}}_{i}+{\widehat{\sigma}}_{v}^{2}\hfill & \text{for}i\in \overline{A}\hfill \end{array}$ |

PEBLUP | $\text{mse}\left({\widehat{\theta}}_{i}^{\text{PEBLUP}}\right)=\{\begin{array}{ll}{g}_{1iw}+{g}_{2iw}+2{g}_{3iw}\hfill & \text{for}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in A\hfill \\ {\overline{Z}}_{i}^{T}\mathrm{var}\left({\widehat{\beta}}^{\text{PEBLUP}}\right){\overline{Z}}_{i}+{\widehat{\sigma}}_{v}^{2}\hfill & \text{for}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i\in \overline{A}\hfill \end{array}$ |

The various $g$ terms in Table 4.1 can be interpreted in a similar way to those associated with the area level MSE’s. The ${g}_{1i}\u2019s$ are denoted as ${g}_{1ia}$ for EBLUP, and ${g}_{1iw}$ for PEBLUP account for most of the MSE if the number of areas is large. The ${g}_{2i}\u2019s$ account for the estimation of $\beta ,$ and the $2{g}_{3i}\u2019s$ account for the estimation of ${\sigma}_{v}^{2}$ and ${\sigma}_{e}^{2}.$

The estimated variances of ${\widehat{\beta}}^{\text{EBLUP}}$ and ${\widehat{\beta}}^{\text{PEBLUP}}$ are respectively given by

$$\mathrm{var}\left({\widehat{\beta}}^{\text{EBLUP}}\right)={\widehat{\sigma}}_{e}^{2}{\left({\displaystyle \sum _{i\in A}{\displaystyle \sum _{j\in {s}_{i}}{a}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{ia}{\overline{x}}_{ia}\right){z}_{ij}^{T}}}\right)}^{-1}$$

and

$$\mathrm{var}\left({\widehat{\beta}}^{\text{PEBLUP}}\right)={\widehat{\sigma}}_{e}^{2}{\left({\displaystyle \sum _{i\in A}{\displaystyle \sum _{j\in {s}_{i}}{z}_{ij}^{*}{z}_{ij}^{*T}}}\right)}^{-1}\left({\displaystyle \sum _{i\in A}{\displaystyle \sum _{j\in {s}_{i}}{z}_{ij}^{*}{z}_{ij}^{*T}}}/{a}_{ij}\right){\left({\displaystyle \sum _{i\in A}{\displaystyle \sum _{j\in {s}_{i}}{z}_{ij}^{*}{z}_{ij}^{*T}}}\right)}^{-1}$$

where ${z}_{ij}^{*}={w}_{ij}{a}_{ij}\left({z}_{ij}-{\widehat{\gamma}}_{iwa}{\overline{z}}_{iwa}\right).$

The specific form of the $g$ terms and the estimated variances can be found in Estevao et al. (2015).

- Date modified: