Sample allocation for efficient model-based small area estimation
Section 2. Allocations which utilize the model

2.1 Choosing the model

Pfeffermann (2013) presents a wide variety of models and methods for small area estimation. Our model is one of these, a unit-level mixed model:

$y_{dk} = x_{dk}^{'} β + v_{d} + e_{dk}; \quad k = 1, \dots, N_{d}; \quad d = 1, \dots, D, \qquad (2.1)$

where the $v_{d}$ are random area effects with mean zero and variance $σ_{v}^{2}$ and the $e_{dk}$ are random effects with mean zero and variance $σ_{e}^{2}$. Furthermore, $E(y_{dk}) = x_{dk}^{'} β$ and $V(y_{dk}) = σ_{v}^{2} + σ_{e}^{2}$ (total variance). Matrix $V$ is the variance-covariance matrix of the study variable $y$. This model can be used when unit-level values are available for the auxiliary variables $x$. We use one auxiliary variable in our study.

Two important measures are needed in developing one of these types of allocations. The first one is a common intra-area correlation $ρ$ and the second one is the ratio $δ$ between variance components. They are defined as follows:

$ρ = σ_{v}^{2} / (σ_{v}^{2} + σ_{e}^{2}) \quad \text{and} \quad δ = σ_{e}^{2} / σ_{v}^{2} = 1 / ρ - 1. \qquad (2.2)$

Before estimating area parameters, the variance components, regression coefficients and area effects must be estimated from the sample data. The BLUE (Best Linear Unbiased Estimator) of $β$, denoted $\tilde{β}$, is obtained according to the theory of the general linear model, and it is replaced with its EBLUP estimate $\hat{β}$, in which the unknown variance components are replaced by their estimates.

The EBLUP estimate (predicted value) for the area total $Y_{d}$ of the study variable is the sum of the observed $y$-values and the predicted $y$-values for units outside the sample:

$\hat{Y}_{d, Eblup} = \sum_{k \in s_{d}} y_{dk} + \sum_{k \in \bar{s}_{d}} \hat{y}_{dk} = \sum_{k \in s_{d}} y_{dk} + \sum_{k \in \bar{s}_{d}} x_{dk}^{'} \hat{β} + (N_{d} - n_{d}) \hat{v}_{d}. \qquad (2.3)$
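As a minimal sketch, (2.3) can be evaluated for one area once $\hat{β}$ and $\hat{v}_{d}$ are available. The function name, argument names and toy numbers below are hypothetical and not from the paper; the estimation of $\hat{β}$ and $\hat{v}_{d}$ itself is assumed to have been done elsewhere.

```python
from typing import Sequence


def eblup_area_total(
    y_sample: Sequence[float],
    x_nonsample: Sequence[Sequence[float]],
    beta_hat: Sequence[float],
    v_hat_d: float,
) -> float:
    """Evaluate (2.3) for one area d, given already-estimated beta_hat and v_hat_d."""
    # Observed part: sum of y over the sampled units s_d.
    observed = sum(y_sample)
    # Synthetic part: x'beta_hat for every unit outside the sample.
    synthetic = sum(
        sum(x_j * b_j for x_j, b_j in zip(x_row, beta_hat))
        for x_row in x_nonsample
    )
    # Area effect: added once per non-sampled unit, i.e. (N_d - n_d) times.
    area_effect = len(x_nonsample) * v_hat_d
    return observed + synthetic + area_effect


# Hypothetical toy numbers (assumption: an intercept plus one auxiliary variable).
total = eblup_area_total(
    y_sample=[3.0, 5.0],
    x_nonsample=[[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]],
    beta_hat=[1.0, 0.5],
    v_hat_d=0.2,
)
```

Here the non-sampled part is passed as rows of $x$-values, so $N_{d} - n_{d}$ is simply the number of rows.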

We use the Prasad-Rao approximation (see Rao 2003) of the MSE (Mean Squared Error) for finite populations:

$mse(\hat{Y}_{d, Eblup}) = g_{1d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) + g_{2d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) + 2 g_{3d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) + g_{4d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}), \qquad (2.4)$

where the four components $g_{1 d},$ $g_{2 d},$ $g_{3 d}$ and $g_{4 d}$ are defined as follows:

$\begin{aligned} g_{1d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) &= (N_{d} - n_{d}^{*})^{2} (1 - \hat{γ}_{d}) \hat{σ}_{v}^{2}, \\ g_{2d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) &= (N_{d} - n_{d}^{*})^{2} (\bar{x}_{d}^{*} - \hat{γ}_{d} \bar{x}_{d})^{'} (X^{'} V^{-1} X)^{-1} (\bar{x}_{d}^{*} - \hat{γ}_{d} \bar{x}_{d}), \\ g_{3d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) &= (N_{d} - n_{d}^{*})^{2} (n_{d}^{*})^{-2} (\hat{σ}_{v}^{2} + \hat{σ}_{e}^{2} (n_{d}^{*})^{-1})^{-3} [\hat{σ}_{e}^{4} V(\hat{σ}_{v}^{2}) \\ &\quad + \hat{σ}_{v}^{4} V(\hat{σ}_{e}^{2}) - 2 \hat{σ}_{e}^{2} \hat{σ}_{v}^{2} Cov(\hat{σ}_{e}^{2}, \hat{σ}_{v}^{2})], \\ g_{4d}(\hat{σ}_{v}^{2}, \hat{σ}_{e}^{2}) &= (N_{d} - n_{d}^{*}) \hat{σ}_{e}^{2}. \qquad (2.5) \end{aligned}$

The area sample sizes $n_{d}^{*}$ depend on the sample and are not fixed. The component $g_{1d}$ contains the area-specific ratio $\hat{γ}_{d} = \hat{σ}_{v}^{2} / (\hat{σ}_{v}^{2} + \hat{σ}_{e}^{2} / n_{d}^{*})$. According to Nissinen (2009, page 53), the $g_{1d}$ component (later simply $g1$) generally contributes over 90% of the estimated MSE. This component represents the uncertainty arising from the variation between the areas. Such a high proportion for $g1$ of course requires that this variation is strong enough.
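To make the role of $\hat{γ}_{d}$ concrete, the following sketch evaluates the leading component $g_{1d}$ of (2.5) for given variance components. The function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def gamma_d(n_d: float, sigma_v2: float, sigma_e2: float) -> float:
    """Area-specific ratio: sigma_v^2 / (sigma_v^2 + sigma_e^2 / n_d)."""
    if n_d <= 0:
        raise ValueError("n_d must be positive")
    return sigma_v2 / (sigma_v2 + sigma_e2 / n_d)


def g1(N_d: float, n_d: float, sigma_v2: float, sigma_e2: float) -> float:
    """Leading MSE component: (N_d - n_d)^2 * (1 - gamma_d) * sigma_v^2."""
    return (N_d - n_d) ** 2 * (1.0 - gamma_d(n_d, sigma_v2, sigma_e2)) * sigma_v2


# Hypothetical values: an area of 100 units, variance components 1 and 4.
g1_small_sample = g1(N_d=100, n_d=10, sigma_v2=1.0, sigma_e2=4.0)
g1_large_sample = g1(N_d=100, n_d=20, sigma_v2=1.0, sigma_e2=4.0)
```

With these inputs the component shrinks as the area sample size grows, both because fewer units remain to be predicted and because $\hat{γ}_{d}$ moves closer to one.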

Unfortunately, an analytical solution, that is, minimizing the sum of the MSEs over areas subject to $n = \sum_{d=1}^{D} n_{d}$, is difficult and laborious to obtain, because the components of the MSE approximation (2.5) include sample information which is unknown, and some components contain complex matrix and variance-covariance operations. We first examined this allocation problem in an experimental study (Keto and Pahkinen 2009). Now we have developed an allocation based only on the component $g1$ and the auxiliary variable $x$. The reasoning for this solution is that because $x$ and $y$ are correlated, the between-area variation in $x$ is transferred to $y$.

2.2 Model-based g1-allocation

The $g1$-allocation utilizes the auxiliary variable $x$ and the adjusted homogeneity coefficient (Keto and Pahkinen 2014). This coefficient is an approximation of the intra-class correlation (ICC) known from cluster sampling. We regard one area as one cluster. First, a simple ANOVA between areas is carried out, and then the adjusted homogeneity measure of variation between the areas can be computed:

$R_{a x}^{2} = 1 - R^{2} (x) = 1 - MSW / S_{x}^{2}, (2.6)$

where $R^{2} (x)$ is the coefficient of determination from regression analysis, MSW (Mean Square Within) is the mean SS (Sum of Squares) of areas and $S_{x}^{2}$ is the variance of the auxiliary variable $x .$
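The last form in (2.6), $1 - MSW / S_{x}^{2}$, can be computed directly from the unit-level values of $x$. The sketch below is one way to do so; it assumes that MSW is the within-area sum of squares divided by $N - D$ and that $S_{x}^{2}$ uses the divisor $N - 1$, neither of which is stated explicitly in the text. The function name and data are hypothetical.

```python
from typing import Sequence


def adjusted_homogeneity(x_by_area: Sequence[Sequence[float]]) -> float:
    """Return 1 - MSW / S_x^2 for x-values grouped by area (one inner list per area)."""
    all_x = [x for area in x_by_area for x in area]
    N = len(all_x)
    D = len(x_by_area)
    if N - D <= 0:
        raise ValueError("need more units than areas")

    # Overall variance of x (assumed divisor: N - 1).
    grand_mean = sum(all_x) / N
    s2_x = sum((x - grand_mean) ** 2 for x in all_x) / (N - 1)

    # Within-area sum of squares, pooled over areas.
    ssw = 0.0
    for area in x_by_area:
        area_mean = sum(area) / len(area)
        ssw += sum((x - area_mean) ** 2 for x in area)

    # Mean square within (assumed divisor: N - D).
    msw = ssw / (N - D)
    return 1.0 - msw / s2_x


# Hypothetical data: two well-separated areas, so the coefficient is close to one.
r2_ax = adjusted_homogeneity([[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]])
```

When the areas differ strongly in their mean of $x$, MSW is small relative to $S_{x}^{2}$ and the coefficient approaches one.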

Because the MSE of the area total is complex, we use only the component $g1$, which appears in (2.4) and (2.5), for the reason we have given in Section 2.1. We search for the minimum of the sum of the $g1$ components over areas:

$\sum_{d = 1}^{D} g_{1 d} (σ_{v}^{2}, σ_{e}^{2}) = \sum_{d = 1}^{D} {(N_{d} - n_{d})}^{2} {(n_{d} / σ_{e}^{2} + 1 / σ_{v}^{2})}^{- 1} (2.7)$

subject to $n = \sum_{d = 1}^{D} n_{d} .$

We use Lagrange’s multiplier method to find the solution. Therefore, we define the function $F$ of the sample sizes $n^{'} = (n_{1}, n_{2}, \dots, n_{D})$ and $λ$:

$F(n, λ) = \sum_{d=1}^{D} g_{1d}(σ_{v}^{2}, σ_{e}^{2}) + λ \left( \sum_{d=1}^{D} n_{d} - n \right) = \sum_{d=1}^{D} (N_{d} - n_{d})^{2} (n_{d} / σ_{e}^{2} + 1 / σ_{v}^{2})^{-1} + λ \left( \sum_{d=1}^{D} n_{d} - n \right). \qquad (2.8)$

We set the derivative of $F$ with respect to the area sample size $n_{d}$ to zero and solve for $n_{d} .$ The expression for area sample size $n_{d}^{g 1}$ is as follows:

$n_{d}^{g 1} = \frac{(N_{d} + δ) (n + δ D)}{N + δ D} - δ = \frac{N_{d} n - (N - N_{d} D - n) (1 / ρ - 1)}{N + D (1 / ρ - 1)}, (2.9)$

where the ratio $δ$ and the intra-area correlation $ρ$ are defined in (2.2). The only unknown member in (2.9) is the intra-area correlation $ρ .$ Therefore we substitute the known homogeneity measure (2.6) of the auxiliary variable $x$ for $ρ .$ Thus the final expression for computing area sample sizes is
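The step from (2.8) to (2.9) is not written out in the text. A sketch of it, using only the definitions in (2.2) and (2.7), might run as follows; the constant $c$ is a shorthand introduced here for illustration only.

```latex
\begin{aligned}
g_{1d} &= \frac{(N_d - n_d)^2}{n_d/\sigma_e^2 + 1/\sigma_v^2}
        = \frac{\sigma_e^2 (N_d - n_d)^2}{n_d + \delta},
        \qquad \delta = \sigma_e^2/\sigma_v^2, \\
\frac{\partial F}{\partial n_d}
       &= -\sigma_e^2 \, \frac{(N_d + \delta)^2 - (n_d + \delta)^2}{(n_d + \delta)^2} + \lambda = 0
        \;\Longrightarrow\;
        \frac{N_d + \delta}{n_d + \delta} = \sqrt{1 + \lambda/\sigma_e^2} =: c, \\
\sum_{d=1}^{D} (n_d + \delta) &= n + \delta D = \frac{N + \delta D}{c}
        \;\Longrightarrow\;
        \frac{1}{c} = \frac{n + \delta D}{N + \delta D}, \\
n_d &= \frac{N_d + \delta}{c} - \delta
     = \frac{(N_d + \delta)(n + \delta D)}{N + \delta D} - \delta .
\end{aligned}
```

Because $c$ is the same for every area, $n_{d} + δ$ is proportional to $N_{d} + δ$, which is why the allocation grows with $N_{d}$ but not proportionally.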

$n_{d}^{g 1} = \frac{N_{d} n - (N - N_{d} D - n) (1 / R_{a x}^{2} - 1)}{N + D (1 / R_{a x}^{2} - 1)} . (2.10)$

It is easy to prove that $\sum_{d=1}^{D} n_{d}^{g1} = n$. The computed sample sizes are rounded to the nearest integer. Sometimes compromises must be made. Examination of (2.10) shows that the sample size increases when the size of area $N_{d}$ increases, but not proportionally. Under certain circumstances, such as a low homogeneity coefficient, a low overall sample size $n$ or a small area size $N_{d}$, (2.10) can lead to a negative area sample size $n_{d}^{g1}$. In this situation the negative value is changed to zero. A special case occurs if all of the variation is between areas, so that the measure of homogeneity (2.6) equals one; (2.10) then reduces to proportional allocation.
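A minimal sketch of (2.10), together with the rounding and the replacement of negative values by zero described above, could look like this. The function names and the area sizes are hypothetical. Note that the sketch does not rebalance the other areas after a negative value is set to zero, since the text does not specify how that should be done.

```python
from typing import List, Sequence


def g1_allocation(N_d: Sequence[int], n: int, r2_ax: float) -> List[float]:
    """Unrounded area sample sizes from (2.10); r2_ax stands in for rho."""
    if not 0.0 < r2_ax <= 1.0:
        raise ValueError("r2_ax must lie in (0, 1]")
    D = len(N_d)
    N = sum(N_d)
    delta = 1.0 / r2_ax - 1.0  # ratio of variance components, see (2.2)
    return [(N_i * n - (N - N_i * D - n) * delta) / (N + D * delta) for N_i in N_d]


def round_and_clip(sizes: Sequence[float]) -> List[int]:
    """Round to the nearest integer and change negative values to zero.

    Python's round() resolves exact .5 ties to the even integer.
    """
    return [max(0, round(s)) for s in sizes]


# Hypothetical population of three areas.
areas = [1000, 500, 100]
raw = g1_allocation(areas, n=100, r2_ax=0.5)
proportional = g1_allocation(areas, n=100, r2_ax=1.0)  # special case of (2.10)
low_homogeneity = round_and_clip(g1_allocation(areas, n=20, r2_ax=0.05))
```

In the last call the smallest area receives a negative value from (2.10), which is set to zero, so the clipped sizes add up to more than $n$.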

2.3 Model-assisted MC-allocation

Molefe and Clark (2015) have used the following composite estimator for estimating the mean of the study variable $y$ for area $d$:

${\tilde{y}}_{d}^{C} = (1 - φ_{d}) {\bar{y}}_{d r} + φ_{d} {\hat{β}}^{'} {\bar{X}}_{d} . (2.11)$

This estimator is a combination of two estimators: the synthetic estimator ${\hat{\bar{Y}}}_{d (syn)} = {\hat{β}}^{'} {\bar{X}}_{d},$ where $\hat{β}$ is the estimated regression coefficient and ${\bar{X}}_{d}$ is the area population means of auxiliary variables $x,$ and a direct estimator ${\bar{y}}_{d r} = {\bar{y}}_{d} + {\hat{β}}^{'} ({\bar{x}}_{d} - {\bar{X}}_{d}),$ where ${\bar{y}}_{d}$ and ${\bar{x}}_{d}$ are the area $d$ sample means of $y$ and $x .$ We use one auxiliary variable in our study. The coefficients $φ_{d}$ are set with the intent to minimize the MSE of the estimator (2.11). The approximated design-based MSE of the estimator under certain conditions and assumptions is given by the expression

${MSE}_{p} ({\tilde{y}}_{d}^{C}; {\bar{Y}}_{d}) \approx {(1 - φ_{d})}^{2} v_{d (syn)} + φ_{d}^{2} B_{d}^{2}, (2.12)$

where $v_{d(syn)}$ is the sampling variance of the synthetic estimator $\hat{\bar{Y}}_{d(syn)}$ and $B_{d} = β_{U}^{'} \bar{X}_{d} - \bar{Y}_{d}$ is the bias when $\hat{\bar{Y}}_{d(syn)}$ is used to estimate $\bar{Y}_{d}$, with $β_{U}$ denoting the approximate design-based expectation of $\hat{β}$.

The population contains $N$ units and $D$ strata defined by areas, and stratified sampling is used. A simple random sample without replacement (SRSWOR) of $n_{d}$ units is selected from stratum $d$ $(d = 1, \dots, D)$ containing $N_{d}$ units. The relative size of area $d$ is $P_{d} = N_{d} / N$.

A two-level linear model $ξ$ conditional on the values of $x$ is assumed, with uncorrelated stratum random effects $u_{d}$ and random effects $ε_{i}$:

$\left. \begin{aligned} y_{i} &= β^{'} x_{i} + u_{d} + ε_{i} \\ E_{ξ}(u_{d}) &= E_{ξ}(ε_{i}) = 0 \\ V_{ξ}(u_{d}) &= σ_{ud}^{2} \\ V_{ξ}(ε_{i}) &= σ_{ed}^{2} \end{aligned} \right\}, \qquad (2.13)$

where $i$ runs over all units in stratum $d$. This model implies that $V_{ξ}(y_{i}) = σ_{d}^{2} = σ_{ud}^{2} + σ_{ed}^{2}$ for all population units in stratum $d$, and that ${cov}_{ξ}(y_{i}, y_{j})$ equals $ρ_{d} σ_{d}^{2}$ for units $i \neq j$ in the same stratum and zero for units from different strata, where $ρ_{d} = σ_{ud}^{2} / (σ_{ud}^{2} + σ_{ed}^{2})$. As a simplifying assumption, $ρ_{d} = ρ$ is taken to be equal for all strata.

After making some other simplifying assumptions and solving for the optimal weight $φ_{d}$ in (2.12), the final approximate optimum anticipated MSE, or approximate model-assisted mean squared error, is obtained from (2.12):

${AMSE}_{d} = E_{ξ} {MSE}_{p} ({\tilde{y}}_{d}^{C} [φ_{d (opt)}]; {\bar{Y}}_{d}) \approx σ_{d}^{2} ρ (1 - ρ) {[1 + (n_{d} - 1) ρ]}^{- 1} . (2.14)$

Next, the criterion $F$ for model-assisted allocation is defined using the anticipated MSEs of the small area mean estimators and the overall mean estimator, and developed into its final approximate form:

$\begin{aligned} F &= \sum_{d=1}^{D} N_{d}^{q} {AMSE}_{d} + G N_{+}^{(q)} E_{ξ} {var}_{p}(\hat{\bar{Y}}_{r}) \\ &\approx \sum_{d=1}^{D} N_{d}^{q} σ_{d}^{2} ρ (1 - ρ) [1 + (n_{d} - 1) ρ]^{-1} + G N_{+}^{(q)} \sum_{d=1}^{D} σ_{d}^{2} P_{d}^{2} n_{d}^{-1} (1 - ρ). \qquad (2.15) \end{aligned}$

Optimal sample sizes for the areas are obtained by minimizing (2.15) subject to $\sum_{d} n_{d} = n .$ Expression (2.15) follows the idea of Longford (2006). The weight $N_{d}^{q}$ reflects the inferential priority (importance) for area $d,$ with $0 \leq q \leq 2,$ and $N_{+}^{(q)} = \sum_{d = 1}^{D} N_{d}^{q} .$ The quantity $G$ is a relative priority coefficient on the population level. Ignoring the goal of estimating the population mean corresponds to $G = 0,$ and the attention is then only focused on area level estimation. On the other hand, the larger the value of $G,$ the more the second component in (2.15) dominates and the more the area level estimation is ignored.
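The approximate form of (2.15) is straightforward to evaluate for any candidate allocation, which makes the trade-off governed by $G$ easy to inspect. The sketch below is an illustration only; the function name, argument names and input values are hypothetical.

```python
from typing import Sequence


def criterion_F(
    n_d: Sequence[float],
    N_d: Sequence[float],
    sigma2_d: Sequence[float],
    rho: float,
    q: float,
    G: float,
) -> float:
    """Approximate criterion (2.15): area-level term plus G-weighted population term."""
    N = sum(N_d)
    N_plus_q = sum(N_i ** q for N_i in N_d)  # N_+^(q)

    # First component: priority-weighted anticipated MSEs of the area means.
    area_term = sum(
        N_i ** q * s2 * rho * (1.0 - rho) / (1.0 + (n_i - 1.0) * rho)
        for n_i, N_i, s2 in zip(n_d, N_d, sigma2_d)
    )
    # Second component: anticipated variance of the overall mean estimator,
    # with P_d = N_d / N.
    population_term = G * N_plus_q * sum(
        s2 * (N_i / N) ** 2 * (1.0 - rho) / n_i
        for n_i, N_i, s2 in zip(n_d, N_d, sigma2_d)
    )
    return area_term + population_term


# Hypothetical inputs: two equal areas with ten sampled units each.
F_area_only = criterion_F([10, 10], [100, 100], [1.0, 1.0], rho=0.5, q=1, G=0)
F_with_pop = criterion_F([10, 10], [100, 100], [1.0, 1.0], rho=0.5, q=1, G=1)
```

With $G = 0$ only the area-level term remains, and increasing $G$ adds weight to the population-level term, in line with the description above.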

We assume first that the population estimation has no priority $(G = 0)$ and that the unit survey costs are fixed. In this case the minimization of (2.15) with respect to $n_{d}$ has a unique solution

$n_{d, opt} = \frac{n \sqrt{σ_{d}^{2} N_{d}^{q}}}{\sum_{d=1}^{D} \sqrt{σ_{d}^{2} N_{d}^{q}}} + \frac{1 - ρ}{ρ} \left( \frac{\sqrt{σ_{d}^{2} N_{d}^{q}}}{D^{-1} \sum_{d=1}^{D} \sqrt{σ_{d}^{2} N_{d}^{q}}} - 1 \right). \qquad (2.16)$

The formula (2.16) contains two unknown parameters, the intra-class correlation $ρ$ and the area-specific variance $σ_{d}^{2}$. We replace $ρ$ with the adjusted homogeneity coefficient of the auxiliary variable $x$. This coefficient is an approximation of the ICC (Intra-Class Correlation) (Section 2.2). Parameter $σ_{d}^{2}$ is replaced with the variance of $x$ in area $d$. The reason for both replacements is that $y$ is correlated with $x$. If the population estimation also has priority $(G > 0)$, then (2.16) does not apply and $F$ must be minimized numerically, for example with an NLP method, as we have done (Excel Solver, NLP option).
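The two cases can be sketched in code as follows. The first function evaluates the closed form (2.16). The second is not the Excel Solver procedure used by the authors: it is a swapped-in technique, bisection on the Lagrange multiplier, which applies because the approximate criterion (2.15) is convex in each $n_{d}$. The function names, the lower bound `n_min` and the inputs are assumptions made for illustration, and the sketch does not cap $n_{d}$ at $N_{d}$.

```python
import math
from typing import List, Sequence


def mc_allocation_g0(n: float, N_d: Sequence[float], sigma2_d: Sequence[float],
                     rho: float, q: float) -> List[float]:
    """Closed-form allocation (2.16), valid for G = 0; values are not rounded."""
    D = len(N_d)
    s = [math.sqrt(s2 * N_i ** q) for N_i, s2 in zip(N_d, sigma2_d)]
    s_sum = sum(s)
    k = (1.0 - rho) / rho
    return [n * s_i / s_sum + k * (D * s_i / s_sum - 1.0) for s_i in s]


def mc_allocation_numeric(n: float, N_d: Sequence[float], sigma2_d: Sequence[float],
                          rho: float, q: float, G: float,
                          n_min: float = 1.0) -> List[float]:
    """Minimise approximate (2.15) subject to sum(n_d) = n and n_min <= n_d <= n."""
    if len(N_d) * n_min > n:
        raise ValueError("n is too small for the chosen n_min")
    N = sum(N_d)
    N_plus_q = sum(N_i ** q for N_i in N_d)
    a = [N_i ** q * s2 * rho * (1.0 - rho) for N_i, s2 in zip(N_d, sigma2_d)]
    b = [G * N_plus_q * s2 * (N_i / N) ** 2 * (1.0 - rho)
         for N_i, s2 in zip(N_d, sigma2_d)]

    def neg_slope(a_i: float, b_i: float, n_i: float) -> float:
        # Minus the derivative of one area's contribution; decreasing in n_i.
        return a_i * rho / (1.0 + (n_i - 1.0) * rho) ** 2 + b_i / n_i ** 2

    def size_for(a_i: float, b_i: float, lam: float) -> float:
        # Solve neg_slope(n_i) = lam by bisection, respecting the bounds.
        if neg_slope(a_i, b_i, n_min) <= lam:
            return n_min
        if neg_slope(a_i, b_i, n) >= lam:
            return float(n)
        lo, hi = n_min, float(n)
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if neg_slope(a_i, b_i, mid) > lam:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Outer bisection: the total sample size decreases as the multiplier grows.
    lam_lo = 0.0
    lam_hi = max(neg_slope(a_i, b_i, n_min) for a_i, b_i in zip(a, b))
    for _ in range(200):
        lam = 0.5 * (lam_lo + lam_hi)
        total = sum(size_for(a_i, b_i, lam) for a_i, b_i in zip(a, b))
        if total > n:
            lam_lo = lam
        else:
            lam_hi = lam
    lam = 0.5 * (lam_lo + lam_hi)
    return [size_for(a_i, b_i, lam) for a_i, b_i in zip(a, b)]


# Hypothetical inputs; in the paper rho and sigma2_d are replaced by x-based values.
N_areas, s2_areas = [1000, 500, 100], [4.0, 1.0, 1.0]
closed = mc_allocation_g0(100, N_areas, s2_areas, rho=0.5, q=1)
numeric_g0 = mc_allocation_numeric(100, N_areas, s2_areas, rho=0.5, q=1, G=0)
numeric_g50 = mc_allocation_numeric(100, N_areas, s2_areas, rho=0.5, q=1, G=50)
```

For $G = 0$ and an interior solution the numerical result should agree with (2.16), which gives a simple check on the solver before it is used with $G > 0$.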

Table 2.1
Summary of model-based and model-assisted allocations
Method	Computing sample size $n_{d}$ for area $d$	Optimality level
Model-based $g1$	$n_{d}^{g1} = \frac{N_{d} n - (N - N_{d} D - n) (1 / R_{ax}^{2} - 1)}{N + D (1 / R_{ax}^{2} - 1)},$ where $R_{ax}^{2}$ is the adjusted homogeneity measure of the auxiliary variable $x$.	Area
Model-assisted MCG0	$n_{d, opt} = \frac{n \sqrt{σ_{d}^{2} N_{d}^{q}}}{\sum_{d=1}^{D} \sqrt{σ_{d}^{2} N_{d}^{q}}} + \frac{1 - ρ}{ρ} \left( \frac{\sqrt{σ_{d}^{2} N_{d}^{q}}}{D^{-1} \sum_{d=1}^{D} \sqrt{σ_{d}^{2} N_{d}^{q}}} - 1 \right)$	Jointly area and population
Model-assisted MCG50	Minimization of $F = \sum_{d=1}^{D} N_{d}^{q} σ_{d}^{2} ρ (1 - ρ) [1 + (n_{d} - 1) ρ]^{-1} + G N_{+}^{(q)} \sum_{d=1}^{D} σ_{d}^{2} P_{d}^{2} n_{d}^{-1} (1 - ρ)$ with respect to $n_{d}$. Parameter $ρ$ is replaced with $R_{ax}^{2}$ and $σ_{d}^{2}$ with $S_{d}^{2}(x)$.	Jointly area and population

ISSN: 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2017-06-22
