Sample allocation for efficient model-based small area estimation
Section 2. Allocations which utilize the model
2.1 Choosing the model
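Pfeffermann (2013) presents a wide variety of models and methods for small area estimation. Our model is one of this assortment, a unit-level mixed model
$$y_{dk} = \mathbf{x}_{dk}'\boldsymbol{\beta} + v_d + e_{dk}; \quad k = 1, \ldots, N_d; \; d = 1, \ldots, D, \qquad (2.1)$$
where $v_d$ are random area effects with mean zero and variance $\sigma_v^2$, and $e_{dk}$ are random effects with mean zero and variance $\sigma_e^2$. Furthermore, $E(y_{dk}) = \mathbf{x}_{dk}'\boldsymbol{\beta}$ and $V(y_{dk}) = \sigma_v^2 + \sigma_e^2$ (total variance). Matrix $\mathbf{V}$ is the variance-covariance matrix of the study variable $y$. This model can be used when unit-level values are available for the auxiliary variables $\mathbf{x}$. We use one auxiliary variable in our study.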
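Two important measures are needed in developing one of these types of allocations. The first one is a common intra-area correlation $\rho$ and the second one is the ratio $\delta$ between variance components. They are defined as follows:
$$\rho = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_e^2} \quad \text{and} \quad \delta = \frac{\sigma_e^2}{\sigma_v^2} = \frac{1}{\rho} - 1. \qquad (2.2)$$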
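Before estimating area parameters, the variance components, regression coefficients and area effects must be estimated from the sample data. The BLUE (Best Linear Unbiased Estimator) of $\boldsymbol{\beta}$, denoted $\tilde{\boldsymbol{\beta}}$, is obtained according to the theory of the general linear model, and it is replaced with its EBLUP estimate $\hat{\boldsymbol{\beta}}$. The EBLUP estimate (predicted value) for the area total $Y_d$ of the study variable is the sum of the observed $y$ values and the predicted $y$ values for units outside the sample:
$$\hat{Y}_{d,\text{Eblup}} = \sum_{k \in s_d} y_{dk} + \sum_{k \in \bar{s}_d} \hat{y}_{dk} = \sum_{k \in s_d} y_{dk} + \sum_{k \in \bar{s}_d} \mathbf{x}_{dk}'\hat{\boldsymbol{\beta}} + (N_d - n_d)\hat{v}_d, \qquad (2.3)$$
where $s_d$ is the sample from area $d$ and $\bar{s}_d$ is the set of non-sampled units in area $d$.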
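We use the Prasad-Rao approximation (see Rao 2003) of the MSE (Mean Squared Error) for finite populations:
$$\text{mse}(\hat{Y}_{d,\text{Eblup}}) \approx g_{1d}(\hat{\sigma}_v^2, \hat{\sigma}_e^2) + g_{2d}(\hat{\sigma}_v^2, \hat{\sigma}_e^2) + 2g_{3d}(\hat{\sigma}_v^2, \hat{\sigma}_e^2) + g_{4d}(\hat{\sigma}_e^2), \qquad (2.4)$$
where the four components $g_{1d}$, $g_{2d}$, $g_{3d}$ and $g_{4d}$ are defined as follows:
$$\begin{aligned}
g_{1d} &= (N_d - n_d)^2 (1 - \hat{\gamma}_d)\hat{\sigma}_v^2, \\
g_{2d} &= (N_d - n_d)^2 (\bar{\mathbf{x}}_d^{*} - \hat{\gamma}_d \bar{\mathbf{x}}_d)' (\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1} (\bar{\mathbf{x}}_d^{*} - \hat{\gamma}_d \bar{\mathbf{x}}_d), \\
g_{3d} &= (N_d - n_d)^2 n_d^{-2} (\hat{\sigma}_v^2 + \hat{\sigma}_e^2 / n_d)^{-3} \left[ \hat{\sigma}_e^4 V(\hat{\sigma}_v^2) + \hat{\sigma}_v^4 V(\hat{\sigma}_e^2) - 2\hat{\sigma}_e^2 \hat{\sigma}_v^2 \text{Cov}(\hat{\sigma}_e^2, \hat{\sigma}_v^2) \right], \\
g_{4d} &= (N_d - n_d)\hat{\sigma}_e^2.
\end{aligned} \qquad (2.5)$$
Here $\bar{\mathbf{x}}_d$ is the sample mean and $\bar{\mathbf{x}}_d^{*}$ the mean over the non-sampled units of the auxiliary variables in area $d$.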
The area sample sizes $n_d$ depend on the sample and are not fixed. The component $g_{1d}$ contains the area-specific ratio $\hat{\gamma}_d = \hat{\sigma}_v^2 / (\hat{\sigma}_v^2 + \hat{\sigma}_e^2 / n_d)$. According to Nissinen (2009, page 53), the component $g_{1d}$ (later simply $g_1$) generally contributes over 90% of the estimated MSE. This component represents the uncertainty arising from the variation between the areas. Of course, this variation must be strong enough for $g_1$ to account for such a high proportion.
Unfortunately, an analytical solution, which means minimizing the sum of the MSEs over areas subject to $\sum_{d=1}^{D} n_d = n$, is difficult and laborious to accomplish, because components of the MSE approximation (2.5) include sample information which is unknown, and some components contain complex matrix and variance-covariance operations. We examined this allocation problem for the first time in an experimental study (Keto and Pahkinen 2009). Now we have developed an allocation based only on the component $g_1$ and the auxiliary variable $x$. The reasoning for this solution is that because $y$ and $x$ are correlated, the between-area variation in $y$ is transferred to $x$.
2.2 Model-based g1-allocation
The $g_1$-allocation utilizes the auxiliary variable $x$ and the adjusted homogeneity coefficient (Keto and Pahkinen 2014). This coefficient is an approximation of the intra-class correlation (ICC) known from cluster sampling. We regard one area as one cluster. First, a simple ANOVA between areas is carried out, and then the adjusted homogeneity measure of variation between the areas can be computed:
$$R_{ax}^2 = 1 - \frac{N - 1}{N - D}(1 - R^2) = 1 - \frac{MSW}{S_x^2}, \qquad (2.6)$$
where $R^2$ is the coefficient of determination from regression analysis, MSW (Mean Square Within) is the within-area mean SS (Sum of Squares) and $S_x^2$ is the variance of the auxiliary variable $x$.
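As a rough illustration of the ANOVA step, the following sketch computes the adjusted homogeneity measure as $1 - MSW/S_x^2$ from unit-level values of $x$ grouped by area. It assumes that MSW uses the divisor $N - D$ and $S_x^2$ the divisor $N - 1$; the function name and the toy data are hypothetical and not taken from the study.

```python
from statistics import mean

def adjusted_homogeneity(areas):
    """Return 1 - MSW / S_x^2 for a list of per-area lists of x values."""
    all_x = [x for area in areas for x in area]
    N, D = len(all_x), len(areas)
    grand_mean = mean(all_x)
    # Total sum of squares and overall variance S_x^2 (assumed divisor N - 1).
    sst = sum((x - grand_mean) ** 2 for x in all_x)
    s2_x = sst / (N - 1)
    # Within-area sum of squares and mean square (assumed divisor N - D).
    ssw = sum((x - mean(area)) ** 2 for area in areas for x in area)
    msw = ssw / (N - D)
    return 1.0 - msw / s2_x

# Hypothetical toy population: three areas with clearly separated x levels.
toy_areas = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
r2_ax = adjusted_homogeneity(toy_areas)
```

With strongly separated areas the measure is close to one; when the areas do not differ at all it can become slightly negative, which is a property of the adjustment rather than of the data.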
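Because the MSE of the area total is complex, we use only the component $g_1$, which appears in (2.4) and (2.5), for the reason given in Section 2.1. We search for the minimum of the sum of $g_{1d}$ over areas:
$$\sum_{d=1}^{D} g_{1d} = \sum_{d=1}^{D} (N_d - n_d)^2 (1 - \gamma_d)\sigma_v^2 = \sum_{d=1}^{D} (N_d - n_d)^2 \frac{\sigma_e^2}{n_d + \delta}, \quad \gamma_d = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_e^2 / n_d} = \frac{n_d}{n_d + \delta}, \qquad (2.7)$$
subject to $\sum_{d=1}^{D} n_d = n$. We use Lagrange's multiplier method to find the solution. Therefore, we define the function $F$ of the sample sizes $n_d$ and $\lambda$:
$$F(n_1, \ldots, n_D, \lambda) = \sum_{d=1}^{D} g_{1d} + \lambda \left( \sum_{d=1}^{D} n_d - n \right). \qquad (2.8)$$
We set the derivative of $F$ with respect to the area sample size $n_d$ to zero and solve for $n_d$. The expression for the area sample size $n_d$ is as follows:
$$n_d = \frac{N_d n + (N_d D + n - N)\delta}{N + D\delta} = \frac{N_d n + (N_d D + n - N)(1/\rho - 1)}{N + D(1/\rho - 1)}, \qquad (2.9)$$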
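The intermediate steps of the Lagrange solution can be sketched as follows. The sketch assumes that $g_{1d} = (N_d - n_d)^2 (1 - \gamma_d)\sigma_v^2$ with $\gamma_d = n_d / (n_d + \delta)$ and $\delta = \sigma_e^2 / \sigma_v^2$, and that $F$ is the sum of the $g_{1d}$ plus $\lambda(\sum_d n_d - n)$; the constant $c$ is introduced here only for the derivation.

```latex
\begin{align*}
g_{1d} &= (N_d-n_d)^2(1-\gamma_d)\sigma_v^2
        = \sigma_e^2\,\frac{(N_d-n_d)^2}{n_d+\delta},
  \qquad \gamma_d=\frac{n_d}{n_d+\delta},\\
\frac{\partial F}{\partial n_d}
       &= -\sigma_e^2\left[\frac{(N_d+\delta)^2}{(n_d+\delta)^2}-1\right]+\lambda = 0
  \;\Longrightarrow\; n_d+\delta = c\,(N_d+\delta),\\
\sum_{d=1}^{D} n_d = n
       &\;\Longrightarrow\; c=\frac{n+D\delta}{N+D\delta},
  \qquad n_d=\frac{(N_d+\delta)(n+D\delta)}{N+D\delta}-\delta
        =\frac{N_d n+(N_d D+n-N)\,\delta}{N+D\delta}.
\end{align*}
```

Under these assumptions the optimal $n_d + \delta$ is proportional to $N_d + \delta$, which is why the allocation grows with the area size but not proportionally.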
where the ratio $\delta$ and the intra-area correlation $\rho$ are defined in (2.2). The only unknown quantity in (2.9) is the intra-area correlation $\rho$. Therefore, we substitute the known homogeneity measure (2.6) of the auxiliary variable $x$ for $\rho$. Thus the final expression for computing area sample sizes is
$$n_d^{g1} = \frac{N_d n + (N_d D + n - N)(1/R_{ax}^2 - 1)}{N + D(1/R_{ax}^2 - 1)}. \qquad (2.10)$$
It is easy to prove that $\sum_{d=1}^{D} n_d^{g1} = n$. The computed sample sizes are rounded to the nearest integer. Sometimes compromises must be made, for example when the rounded sizes do not sum to $n$. Examination of (2.10) shows that the sample size increases when the size of area $N_d$ increases, but not proportionally. Under certain circumstances, such as a low homogeneity coefficient, a low overall sample size $n$ or a small area size $N_d$, (2.10) can lead to a negative area sample size $n_d^{g1}$. In this situation the negative value is changed to zero. A special case occurs if the total variation is only between areas, so that the measure of homogeneity (2.6) equals one, and (2.10) reduces to proportional allocation.
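The allocation rule can be tried out with a short sketch. It assumes that the final formula has the form $n_d = [N_d n + (N_d D + n - N)\delta] / (N + D\delta)$ with $\delta = 1/R_{ax}^2 - 1$, which is what the Lagrange solution gives when the homogeneity measure is substituted for $\rho$; the function name and the area sizes are hypothetical.

```python
def g1_allocation(area_sizes, n, r2_ax):
    """Return (raw, rounded) area sample sizes under the assumed g1 formula."""
    D, N = len(area_sizes), sum(area_sizes)
    # Ratio of variance components, with rho replaced by the homogeneity measure.
    delta = 1.0 / r2_ax - 1.0
    raw = [(N_d * n + delta * (N_d * D + n - N)) / (N + D * delta)
           for N_d in area_sizes]
    # Round to the nearest integer and replace negative sizes with zero.
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    rounded = [max(0, round(value)) for value in raw]
    return raw, rounded

# Hypothetical example: four areas, overall sample size n = 200.
raw, rounded = g1_allocation([500, 1500, 3000, 5000], n=200, r2_ax=0.4)

# With a homogeneity measure of one the rule reduces to proportional allocation.
proportional, _ = g1_allocation([500, 1500, 3000, 5000], n=200, r2_ax=1.0)

# A small area combined with low homogeneity gives a negative raw size.
raw_low, rounded_low = g1_allocation([50, 4950, 5000], n=30, r2_ax=0.05)
```

In the first example the raw sizes sum to 200 but the rounded ones sum to 199, which illustrates the kind of compromise rounding can require; after negative values are set to zero the total can likewise differ from $n$.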
2.3 Model-assisted MC-allocation
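Molefe and Clark (2015) used the following composite estimator for estimating the mean $\bar{Y}_d$ of the study variable for area $d$:
$$\tilde{y}_d^{C} = (1 - \phi_d)\bar{y}_{dr} + \phi_d \hat{\boldsymbol{\beta}}'\bar{\mathbf{X}}_d. \qquad (2.11)$$
This estimator is a combination of two estimators: the synthetic estimator $\hat{\boldsymbol{\beta}}'\bar{\mathbf{X}}_d$, where $\hat{\boldsymbol{\beta}}$ is the estimated regression coefficient and $\bar{\mathbf{X}}_d$ is the vector of area population means of the auxiliary variables $\mathbf{x}$, and a direct estimator $\bar{y}_{dr} = \bar{y}_d + \hat{\boldsymbol{\beta}}'(\bar{\mathbf{X}}_d - \bar{\mathbf{x}}_d)$, where $\bar{y}_d$ and $\bar{\mathbf{x}}_d$ are the area sample means of $y$ and $\mathbf{x}$. We use one auxiliary variable in our study. The coefficients $\phi_d$ are set with the intent to minimize the MSE of the estimator (2.11). The approximate design-based MSE of the estimator under certain conditions and assumptions is given by the expression
$$\text{MSE}_p(\tilde{y}_d^{C}; \bar{Y}_d) \approx (1 - \phi_d)^2 v_{d(r)} + \phi_d^2 v_{d(\text{syn})} + \phi_d^2 B_d^2, \qquad (2.12)$$
where $v_{d(r)}$ is the sampling variance of the direct estimator, $v_{d(\text{syn})}$ is the sampling variance of the synthetic estimator and $B_d = \boldsymbol{\beta}_U'\bar{\mathbf{X}}_d - \bar{Y}_d$ is the bias when $\hat{\boldsymbol{\beta}}'\bar{\mathbf{X}}_d$ is used to estimate $\bar{Y}_d$, with $\boldsymbol{\beta}_U$ denoting the approximate design-based expectation of $\hat{\boldsymbol{\beta}}$.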
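The population contains $N$ units and $D$ strata defined by areas, and stratified sampling is used. A simple random sample without replacement (SRSWOR) of $n_d$ units is selected from stratum $d$ containing $N_d$ units. The relative size of area $d$ is $P_d = N_d / N$. A two-level linear model $\xi$ conditional on the values of $\mathbf{x}$ is assumed, with uncorrelated stratum random effects $u_d$ and random effects $\varepsilon_j$:
$$y_j = \boldsymbol{\beta}'\mathbf{x}_j + u_d + \varepsilon_j, \quad V_\xi(u_d) = \sigma_{ud}^2, \quad V_\xi(\varepsilon_j) = \sigma_{ed}^2, \quad j \in U_d, \qquad (2.13)$$
where $U_d$ refers to all units in stratum $d$. This model implies that $V_\xi(y_j) = \sigma_d^2$ for all population units and $\text{Cov}_\xi(y_i, y_j)$ equals $\rho_d \sigma_d^2$ for units $i \neq j$ in the same stratum and zero for units from different strata, where $\sigma_d^2 = \sigma_{ud}^2 + \sigma_{ed}^2$ and $\rho_d = \sigma_{ud}^2 / \sigma_d^2$. A simplifying assumption is made that the $\rho_d$ are equal for all strata, $\rho_d = \rho$.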
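After making some other simplifying assumptions and solving for the optimal weight $\phi_d$ in (2.12), the final approximate optimum anticipated MSE, or approximate model-assisted mean squared error, is obtained from (2.12):
$$\text{AMSE}_d \approx \sigma_d^2 \rho (1 - \rho) \left[ 1 + (n_d - 1)\rho \right]^{-1}. \qquad (2.14)$$
Next, the criterion $F$, which uses the anticipated MSEs of the small area mean and overall mean estimators for model-assisted allocation, is defined and developed into the final approximate form:
$$F \approx \sum_{d=1}^{D} N_d^{q} \sigma_d^2 \rho (1 - \rho) \left[ 1 + (n_d - 1)\rho \right]^{-1} + G N_{+}^{(q)} \sum_{d=1}^{D} \sigma_d^2 P_d^2 n_d^{-1} (1 - \rho). \qquad (2.15)$$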
Optimal sample sizes $n_d$ for the areas are obtained by minimizing (2.15) subject to $\sum_{d=1}^{D} n_d = n$. Expression (2.15) follows the idea of Longford (2006). The weight $N_d^{q}$ reflects the inferential priority (importance) for area $d$, with $0 \leq q \leq 2$ and $N_{+}^{(q)} = \sum_{d=1}^{D} N_d^{q}$. The quantity $G$ is a relative priority coefficient on the population level. Ignoring the goal of estimating the population mean corresponds to $G = 0$, and attention is then focused only on area-level estimation. On the other hand, the larger the value of $G$, the more the second component in (2.15) dominates and the more the area-level estimation is ignored.
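We assume first that the population estimation has no priority ($G = 0$) and that the unit survey costs are fixed. In this case the minimization of (2.15) with respect to $n_d$ has a unique solution:
$$n_d = \frac{n \, \sigma_d N_d^{q/2}}{\sum_{i=1}^{D} \sigma_i N_i^{q/2}} + \frac{1 - \rho}{\rho} \left( \frac{D \, \sigma_d N_d^{q/2}}{\sum_{i=1}^{D} \sigma_i N_i^{q/2}} - 1 \right). \qquad (2.16)$$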
The formula (2.16) contains two unknown parameters, the intra-class correlation $\rho$ and the area-specific variance $\sigma_d^2$. We replace $\rho$ with the adjusted homogeneity coefficient $R_{ax}^2$ of the auxiliary variable $x$. This coefficient is an approximation of the ICC (Intra-Class Correlation) (Section 2.2). Parameter $\sigma_d^2$ is replaced with the variance $S_{xd}^2$ of $x$ in area $d$. The reason for both replacements is that $x$ is correlated with $y$.
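A minimal sketch of the closed-form allocation for $G = 0$ follows. It assumes that (2.16) has the form $n_d = (n + D\delta)\,\sigma_d N_d^{q/2} / \sum_i \sigma_i N_i^{q/2} - \delta$ with $\delta = (1 - \rho)/\rho$, which is what minimizing the area-level term of the criterion under $\sum_d n_d = n$ yields. The function name, the default $q = 1$ and the input values are hypothetical; in practice `rho` would be the homogeneity coefficient of $x$ and `area_sds` the area-level standard deviations of $x$.

```python
def mc_allocation_g0(area_sizes, area_sds, n, rho, q=1.0):
    """Closed-form area sample sizes for G = 0 (assumed form of (2.16))."""
    D = len(area_sizes)
    delta = (1.0 - rho) / rho
    # Weight of each area: sigma_d * N_d^(q/2).
    weights = [sd * N_d ** (q / 2.0) for sd, N_d in zip(area_sds, area_sizes)]
    total = sum(weights)
    return [(n + D * delta) * w / total - delta for w in weights]

# Hypothetical inputs: area sizes and area-level standard deviations of x.
sizes = [500, 1500, 3000, 5000]
sds_x = [2.0, 3.0, 4.0, 5.0]
alloc = mc_allocation_g0(sizes, sds_x, n=200, rho=0.4)

# With q = 0 and equal standard deviations every area gets the same size.
equal = mc_allocation_g0(sizes, [1.0, 1.0, 1.0, 1.0], n=200, rho=0.4, q=0.0)
```

The raw sizes always sum to $n$ but, as with the $g_1$-allocation, they are not integers and can be negative for small or homogeneous areas, so some post-processing is still needed.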
If the population estimation also has priority ($G > 0$), then (2.16) does not apply and the criterion $F$ in (2.15) must be minimized numerically by using, for example, a nonlinear programming (NLP) method, as we have done (Excel Solver, NLP option).
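The numerical minimization can also be sketched without a spreadsheet solver. The code below is not the Excel Solver procedure used in the study; it swaps in a nested bisection on the Lagrange multiplier, which works because the assumed criterion is a sum of convex terms, one per area. The assumed form is $F = \sum_d N_d^q \sigma_d^2 \rho(1-\rho)/[1 + (n_d - 1)\rho] + G N_+^{(q)} \sum_d \sigma_d^2 P_d^2 (1-\rho)/n_d$; the function name, the default $q = 1$, the inputs and the reading of the label MCG50 as $G = 50$ are all assumptions.

```python
def mc_allocation_numeric(area_sizes, area_vars, n, rho, G, q=1.0):
    """Minimize the assumed criterion F subject to sum(n_d) = n by bisection."""
    D, N = len(area_sizes), sum(area_sizes)
    Nq = sum(N_d ** q for N_d in area_sizes)
    # F = sum_d a_d / (1 + (n_d - 1) * rho) + sum_d b_d / n_d
    a = [N_d ** q * v * rho * (1 - rho) for N_d, v in zip(area_sizes, area_vars)]
    b = [G * Nq * v * (N_d / N) ** 2 * (1 - rho)
         for N_d, v in zip(area_sizes, area_vars)]
    lo_n, hi_n = 1e-9, float(n)

    def gain(d, x):
        # Minus the partial derivative of F with respect to n_d; decreasing in x.
        return a[d] * rho / (1 + (x - 1) * rho) ** 2 + b[d] / x ** 2

    def size_for(d, lam):
        # Solve gain(d, x) = lam for x by bisection, clipped to [lo_n, hi_n].
        if gain(d, lo_n) <= lam:
            return lo_n
        if gain(d, hi_n) >= lam:
            return hi_n
        lo, hi = lo_n, hi_n
        for _ in range(100):
            mid = (lo + hi) / 2
            if gain(d, mid) > lam:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Outer bisection on the multiplier: a larger lam gives smaller sizes.
    lam_lo = min(gain(d, hi_n) for d in range(D))
    lam_hi = max(gain(d, lo_n) for d in range(D))
    for _ in range(200):
        lam = (lam_lo + lam_hi) / 2
        if sum(size_for(d, lam) for d in range(D)) > n:
            lam_lo = lam
        else:
            lam_hi = lam
    return [size_for(d, (lam_lo + lam_hi) / 2) for d in range(D)]

# Hypothetical inputs: area sizes and area-level variances of x.
sizes = [500, 1500, 3000, 5000]
var_x = [4.0, 9.0, 16.0, 25.0]
alloc0 = mc_allocation_numeric(sizes, var_x, n=200, rho=0.4, G=0.0)
alloc50 = mc_allocation_numeric(sizes, var_x, n=200, rho=0.4, G=50.0)
```

For $G = 0$ the result should agree with the closed-form solution whenever that solution is positive in every area, which gives a simple check of the solver. The sketch keeps each $n_d$ between a tiny positive number and $n$ but does not enforce $n_d \leq N_d$ or integer sizes.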
Table 2.1
Summary of model-based and model-assisted allocations
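| Method | Computing sample size $n_d$ for area $d$ | Optimality level |
| --- | --- | --- |
| Model-based g1 | $n_d^{g1} = \dfrac{N_d n + (N_d D + n - N)(1/R_{ax}^2 - 1)}{N + D(1/R_{ax}^2 - 1)}$, where $R_{ax}^2$ is the adjusted homogeneity measure of auxiliary variable $x$ | Area |
| Model-assisted MCG0 | Closed-form solution (2.16). Parameter $\rho$ is replaced with $R_{ax}^2$ and $\sigma_d^2$ with $S_{xd}^2$ | Jointly area and population |
| Model-assisted MCG50 | Minimization of $F$ in (2.15) with respect to $n_d$. Parameter $\rho$ is replaced with $R_{ax}^2$ and $\sigma_d^2$ with $S_{xd}^2$ | Jointly area and population |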