Are probability surveys bound to disappear for the production of official statistics?
Section 4. Model-based approaches
Model-based
approaches can eliminate the selection bias of the non-probability source and
enable valid statistical inferences, provided that their underlying assumptions
hold. The objective of the methods in Sections 4.1, 4.2 and 4.3 is to
reduce respondent burden and costs by eliminating data collection for some
variables of interest in a probability sample. The greater the number of
variables of interest for which the values are not collected, the greater the
reduction in data collection costs and respondent burden. However, these
methods assume that the variables of interest are measured without error in the
non-probability sample.

Let $t_y = \sum_{i \in U} y_i$ denote the total of a variable of interest $y$ over the population $U$. From the non-probability sample $S_{NP}$, we can obtain the naive estimator of the total $t_y$,

$\hat{t}_y^{\,\mathrm{naive}} = \frac{N}{n_{NP}} \sum_{i \in S_{NP}} y_i,$

where $n_{NP}$ is the number of units in $S_{NP}$ and $N$ is the size of the population $U$. It is well known that the selection bias of the naive estimator may be significant (see, for example, Bethlehem, 2016). The objective of the methods in Sections 4.1, 4.2 and 4.3 is to reduce the bias of the naive estimator by using a vector of auxiliary variables, $\mathbf{x}$. We use $\mathbf{X}$ to denote the matrix that contains the values of the vector $\mathbf{x}_i$ for the $N$ units of the population $U$. We assume that $\mathbf{x}$ is measured without error in both samples, $S_{NP}$ and the probability sample $S_P$.

Section 4.4 briefly discusses small area estimation and the area-level model of Fay and Herriot (1979). Small area estimation methods are generally used to improve the precision of estimates for population sub-groups (domains) that have a small probability sample size. They require collecting the variable of interest $y$ in the probability sample, but not in the non-probability sample. Therefore, they do not require the condition that $y$ be measured without error in the non-probability sample. Ideally, the non-probability sample contains variables correlated to $y$.
4.1 Calibration of the non-probability sample
The most natural
approach to correcting the selection bias of a non-probability source is to
model the relationship between the variable of interest $y$ and the auxiliary variables $\mathbf{x}$, and then predict the total $t_y$ by predicting the variable $y$ for each unit outside the non-probability sample. This prediction approach is described in Royall (1970) and generalized in Royall (1976); see also Elliott and Valliant (2017). Readers are referred to Valliant, Dorfman and Royall (2000) for more details. With this approach, inferences are conditional on $\mathbf{X}$ and $\boldsymbol{\delta}$, where $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_N)'$ and $\delta_i = 1$ if unit $i$ is in $S_{NP}$ and $\delta_i = 0$ otherwise. As a result, the total $t_y$ is considered random, as is any predictor of $t_y$ (unless $S_{NP} = U$). If a probability sample is used, its random selection is an additional source of randomness. It is usually assumed that the non-probability sample selection mechanism is not informative:

Assumption 3: $\boldsymbol{\delta}$ and $\mathbf{y}_U$ are independent after conditioning on $\mathbf{X}$, where $\mathbf{y}_U$ is the vector of the $N$ population values of the variable $y$.

Assumption 3 is the key to eliminating the selection bias. The more access we have to auxiliary variables that are strongly related to both $y$ and $\boldsymbol{\delta}$, the more plausible assumption 3 becomes. In other words, the richer $\mathbf{x}$ is, the more realistic the assumption of conditional independence between $\boldsymbol{\delta}$ and $\mathbf{y}_U$ becomes. This assumption, called the exchangeability assumption, is discussed in Mercer, Kreuter, Keeter and Stuart (2017). Schonlau and Couper (2017) also discuss the selection of auxiliary variables and emphasize their key role in reducing selection bias.
Often, a linear model is considered where it is assumed that the observations $y_i$, $i \in U$, are mutually independent with $E_m(y_i \mid \mathbf{x}_i) = \mathbf{x}_i' \boldsymbol{\beta}$ and $V_m(y_i \mid \mathbf{x}_i) = \sigma^2 \nu_i$, where $\boldsymbol{\beta}$ is a vector of unknown model parameters and $\nu_i$ is a known function of $\mathbf{x}_i$. The best linear unbiased predictor of $t_y$ (see, for example, Valliant, Dorfman and Royall, 2000) is given by

$\hat{t}_y^{\,\mathrm{cal}} = \sum_{i \in S_{NP}} y_i + \Bigl(\mathbf{t}_{\mathbf{x}} - \sum_{i \in S_{NP}} \mathbf{x}_i\Bigr)' \hat{\boldsymbol{\beta}}, \qquad (4.1)$

where $\mathbf{t}_{\mathbf{x}} = \sum_{i \in U} \mathbf{x}_i$ and $\hat{\boldsymbol{\beta}} = \bigl(\sum_{i \in S_{NP}} \mathbf{x}_i \mathbf{x}_i' / \nu_i\bigr)^{-1} \sum_{i \in S_{NP}} \mathbf{x}_i y_i / \nu_i$. The predictor $\hat{t}_y^{\,\mathrm{cal}}$ can also be re-written in the weighted form

$\hat{t}_y^{\,\mathrm{cal}} = \sum_{i \in S_{NP}} w_i\, y_i, \qquad (4.2)$

where $w_i = 1 + \bigl(\mathbf{t}_{\mathbf{x}} - \sum_{j \in S_{NP}} \mathbf{x}_j\bigr)' \bigl(\sum_{j \in S_{NP}} \mathbf{x}_j \mathbf{x}_j' / \nu_j\bigr)^{-1} \mathbf{x}_i / \nu_i$. It can easily be shown that $w_i$ is a calibrated weight that satisfies the calibration equation $\sum_{i \in S_{NP}} w_i \mathbf{x}_i = \mathbf{t}_{\mathbf{x}}$. Therefore, the prediction approach is equivalent to calibration when a linear model is used to describe the relationship between $y$ and $\mathbf{x}$. The calibration equation satisfies what Mercer et al. (2017) call the composition assumption. This approach requires knowing the vector of control totals $\mathbf{t}_{\mathbf{x}}$. If it is unknown, an alternative is to replace it in (4.1) or (4.2) with an estimate, $\hat{\mathbf{t}}_{\mathbf{x}} = \sum_{i \in S_P} d_i \mathbf{x}_i$, from a probability survey (Elliott and Valliant, 2017), where $d_i$ is the design weight of unit $i$ in $S_P$. If assumptions 1 to 3 are satisfied, it can be shown that the predictor $\hat{t}_y^{\,\mathrm{cal}}$ is unbiased, i.e., $E(\hat{t}_y^{\,\mathrm{cal}} - t_y) = 0$, whether $\mathbf{t}_{\mathbf{x}}$ or $\hat{\mathbf{t}}_{\mathbf{x}}$ is used, provided that the latter is design-unbiased, i.e., $E_p(\hat{\mathbf{t}}_{\mathbf{x}}) = \mathbf{t}_{\mathbf{x}}$. Of course, the unbiasedness property of the predictor $\hat{t}_y^{\,\mathrm{cal}}$ requires the linear model to be valid.
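To illustrate the weighted form (4.2), the following sketch computes calibrated weights for a simulated non-probability sample with an intercept and a single auxiliary variable, assuming $\nu_i = 1$; the data-generating choices, function name and variable names are illustrative only.

```python
import numpy as np

def calibration_weights(X_np, t_x, nu=None):
    """Weights of (4.2): w_i = 1 + (t_x - sum_j x_j)' (sum_j x_j x_j'/nu_j)^{-1} x_i/nu_i."""
    nu = np.ones(X_np.shape[0]) if nu is None else np.asarray(nu, dtype=float)
    T = (X_np / nu[:, None]).T @ X_np          # sum_j x_j x_j' / nu_j
    gap = t_x - X_np.sum(axis=0)               # t_x - sum_{j in S_NP} x_j
    return 1.0 + (X_np / nu[:, None]) @ np.linalg.solve(T, gap)

# toy population: selection favours large x, so the naive estimator is biased
rng = np.random.default_rng(1)
N = 10_000
x_U = rng.gamma(2.0, 2.0, N)
y_U = 3.0 + 1.5 * x_U + rng.normal(0.0, 1.0, N)
take = rng.random(N) < 1.0 / (1.0 + np.exp(2.5 - 0.4 * x_U))   # S_NP membership
X_np = np.column_stack([np.ones(take.sum()), x_U[take]])
t_x = np.array([N, x_U.sum()])                 # known control totals

w = calibration_weights(X_np, t_x)
print(np.allclose(X_np.T @ w, t_x))            # calibration equation holds
print(w @ y_U[take], y_U.sum())                # predictor (4.2) versus the true total
```

In this simulation, the calibrated predictor removes most of the selection bias of the naive estimator because the auxiliary variable drives both the participation mechanism and the variable of interest, which is exactly what assumption 3 requires.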
Remark: In practice, auxiliary variables for which the population total is known are usually few in number and not sufficiently predictive of the variable of interest $y$ to eliminate the selection bias. They may be supplemented with other auxiliary variables for which the total can be estimated using an existing probability survey. Therefore, the vector of population totals may be a blend of known and estimated totals. If the probability survey itself is calibrated to known population totals, then the estimated totals $\hat{\mathbf{t}}_{\mathbf{x}}$ from the probability survey alone can be used, since they reproduce the known totals exactly for the calibration variables.
A linear model is not always appropriate. This is the case when the variable of interest $y$ is categorical. Another typical example occurs when it is desired to estimate the total of a quantitative variable in a domain of interest. The variable $y$ is then defined as the product of that quantitative variable and a binary variable indicating domain membership. To model such a variable, it is natural to consider a mixture of a degenerate distribution at 0 and a continuous distribution. When the relationship between $y$ and $\mathbf{x}$ is not linear, the model-assisted calibration of Wu and Sitter (2001) can be used to preserve the weighted form of the predictor (4.2) while taking into account the non-linearity of the relationship. Suppose that we replace the above linear model with a non-linear (or non-parametric) model such that $E_m(y_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$, where $m(\cdot)$ is some function. The Wu and Sitter (2001) calibration first involves predicting $y_i$ by $\hat{y}_i = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$, where $\hat{\boldsymbol{\beta}}$ is a model-based estimate of $\boldsymbol{\beta}$. Then, the total $\sum_{i \in U} \hat{y}_i$ is calculated, and weights, $w_i$, are found that satisfy the calibration equation:

$\sum_{i \in S_{NP}} w_i = N \quad \text{and} \quad \sum_{i \in S_{NP}} w_i\, \hat{y}_i = \sum_{i \in U} \hat{y}_i.$

In other words, the equation (4.2) can be used, where the vector $\mathbf{x}_i$ is replaced with $(1, \hat{y}_i)'$. This method requires knowing the population size $N$ as well as the vector $\mathbf{x}_i$ for all units in the population $U$. If $N$ and $\sum_{i \in U} \hat{y}_i$ are unknown, they can be replaced with estimates from a probability survey. For example, we can replace $N$ with $\hat{N} = \sum_{i \in S_P} d_i$ and $\sum_{i \in U} \hat{y}_i$ with $\sum_{i \in S_P} d_i\, \hat{y}_i$. The approach can also be extended to the case of multiple variables of interest.
We mentioned that the selection bias may be considerably reduced if $\mathbf{x}$ is rich and contains variables that are related to both $y$ and $\boldsymbol{\delta}$, which makes assumption 3 more realistic. It can therefore be useful in practice to
consider a large number of potential auxiliary variables and select the most
relevant ones using a variable selection technique. Chen, Valliant and Elliott
(2018) suggest the LASSO technique for selecting auxiliary variables and show
its good properties.
It should be noted that the predictor (4.1) reduces to the naive estimator, $\hat{t}_y^{\,\mathrm{naive}}$, in the simplest case possible where only one constant auxiliary variable is used: $\mathbf{x}_i = 1$ for all $i \in U$. The naive estimator is usually highly biased. Its bias can be significantly reduced if the population $U$ can be subdivided into $H$ disjoint and exhaustive post-strata, $U_h$, $h = 1, \ldots, H$, of size $N_h$. The post-stratification model, $E_m(y_i \mid i \in U_h) = \beta_h$, is then postulated, which is an important special case of the above linear model. Assuming that the variance $V_m(y_i \mid i \in U_h)$ is constant for $i \in U_h$, the predictor (4.1) is written:

$\hat{t}_y^{\,\mathrm{PS}} = \sum_{h=1}^{H} N_h\, \bar{y}_{S_{NP,h}}, \quad \text{with } \bar{y}_{S_{NP,h}} = \frac{1}{n_{NP,h}} \sum_{i \in S_{NP,h}} y_i,$

where $S_{NP,h}$ is the set of units in $U_h$ that are part of the sample $S_{NP}$ and $n_{NP,h}$ is the size of $S_{NP,h}$. If the population sizes $N_h$ are unknown, they can be replaced with estimates, $\hat{N}_h = \sum_{i \in S_{P,h}} d_i$, from a probability survey, where $S_{P,h}$ is the set of units in $U_h$ that are part of the sample $S_P$. Regression trees could prove to be an interesting approach for forming post-strata, especially when the auxiliary variables are categorical.
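As an illustration, the post-stratified predictor with estimated post-stratum sizes $\hat{N}_h$ could be computed as in the following sketch; the function name and the inputs (post-stratum labels for both samples and the design weights of the probability sample) are assumptions made for the example.

```python
import numpy as np

def post_stratified_total(y_np, strata_np, d_p, strata_p):
    """Post-stratified predictor: sum over h of N_h-hat * mean(y in S_NP,h),
    where N_h-hat is the sum of the design weights d_i in stratum h of S_P."""
    total = 0.0
    for h in np.unique(strata_p):
        in_np = strata_np == h
        if not in_np.any():
            raise ValueError(f"post-stratum {h} is empty in the non-probability sample")
        N_h_hat = d_p[strata_p == h].sum()      # estimated post-stratum size
        total += N_h_hat * y_np[in_np].mean()   # N_h-hat * y-bar_{S_NP,h}
    return total
```

If the post-stratum sizes $N_h$ are known, the estimated size in the loop can simply be replaced with the known count.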
If multiple categorical auxiliary variables are available, it can be useful to form a large number of post-strata to reduce the selection bias. If many auxiliary variables are crossed, the sample sizes $n_{NP,h}$ could become very small, thereby making the estimators $\bar{y}_{S_{NP,h}}$ very unstable. Gelman and Little (1997) suggest using a multi-level regression model to obtain estimators $\hat{\mu}_h$ of the post-stratum means that are more stable than $\bar{y}_{S_{NP,h}}$. They then consider the post-stratified predictor:

$\hat{t}_y^{\,\mathrm{MRP}} = \sum_{h=1}^{H} N_h\, \hat{\mu}_h.$

Nowadays, this method is known as Mr. P or MRP (Multilevel Regression and Poststratification); see, for example, Mercer et al. (2017). A similar approach would use small area estimation methods (Rao and Molina, 2015) to stabilize the estimators $\bar{y}_{S_{NP,h}}$. Although such methods are likely to produce much more precise estimates of the average of the variable $y$ over the post-strata $U_h$, it remains to be determined whether such methods can produce significant efficiency gains for estimating the overall total $t_y$ compared to the simple post-stratified predictor $\hat{t}_y^{\,\mathrm{PS}}$. It seems that regression trees provide another way to control the instability of the estimators $\bar{y}_{S_{NP,h}}$, since a criterion is generally used to prevent an overly narrow subdivision of the population. These various methods warrant further investigation in future research. Precise estimation of the population sizes $N_h$, if not known, is also a problem not to be overlooked when the population is divided into a large number of post-strata.
4.2 Statistical matching
Statistical matching, or data fusion, is an approach developed for combining data from two different sources that contain both source-specific variables and common variables. Readers are referred to D’Orazio, Di Zio and Scanu (2006) or Rässler (2012) for a review of statistical matching methods. In the context of this article, statistical matching involves modelling the relationship between $y$ and the auxiliary variables $\mathbf{x}$, which are common to both sources, using data from the non-probability sample. As with calibration, the non-probability sample selection mechanism is assumed to be non-informative, and the auxiliary variables must be chosen carefully in order to make assumption 3 as plausible as possible. Once a model has been determined, it is used to predict the $y$ values in a probability sample. Statistical matching can be viewed as an imputation problem with an imputation rate of 100%. The predictor of $t_y$ obtained from the probability sample takes the form:

$\hat{t}_y^{\,\mathrm{SM}} = \sum_{i \in S_P} d_i\, \hat{y}_i,$

where $\hat{y}_i$ is the imputed value for the unit $i \in S_P$. As in calibration, inferences are conditional on $\mathbf{X}$ and $\boldsymbol{\delta}$. Assumption 3, in a statistical matching context, can be viewed as analogous to the Population Missing At Random (PMAR) assumption introduced by Berg, Kim and Skinner (2016) in a non-response context.
If the linear regression model $E_m(y_i \mid \mathbf{x}_i) = \mathbf{x}_i' \boldsymbol{\beta}$ is used, the imputed value for the unit $i \in S_P$ is $\hat{y}_i = \mathbf{x}_i' \hat{\boldsymbol{\beta}}$, and the resulting predictor is given by $\hat{t}_y^{\,\mathrm{SM}} = \hat{\mathbf{t}}_{\mathbf{x}}' \hat{\boldsymbol{\beta}}$. If assumptions 1 to 3 are satisfied and $\hat{\mathbf{t}}_{\mathbf{x}}$ is design-unbiased for $\mathbf{t}_{\mathbf{x}}$, statistical matching produces an unbiased predictor, i.e., $E(\hat{t}_y^{\,\mathrm{SM}} - t_y) = 0$. Also, if $\nu_i = \boldsymbol{\lambda}' \mathbf{x}_i$ for a certain known vector $\boldsymbol{\lambda}$, it can be shown that $\sum_{i \in S_{NP}} (y_i - \mathbf{x}_i' \hat{\boldsymbol{\beta}}) = 0$, and the predictor $\hat{t}_y^{\,\mathrm{SM}}$ is equivalent to the predictor (4.1) if we replace $\mathbf{t}_{\mathbf{x}}$ in (4.1) with $\hat{\mathbf{t}}_{\mathbf{x}}$. It can also be shown that, for a post-stratification model where we impute $\hat{y}_i$, for a unit $i \in S_P$ belonging to post-stratum $U_h$, with $\bar{y}_{S_{NP,h}}$, the predictor $\hat{t}_y^{\,\mathrm{SM}}$ reduces to $\sum_{h=1}^{H} \hat{N}_h\, \bar{y}_{S_{NP,h}}$. Therefore, statistical matching and calibration produce similar predictors, even identical in some cases, when a linear model is postulated and the totals $\mathbf{t}_{\mathbf{x}}$ are estimated.
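A minimal sketch of this mass-imputation predictor under the linear working model is given below; the function and argument names are illustrative. It returns $\sum_{i \in S_P} d_i \hat{y}_i$, which equals $\hat{\mathbf{t}}_{\mathbf{x}}' \hat{\boldsymbol{\beta}}$ with $\hat{\boldsymbol{\beta}}$ fitted on the non-probability sample.

```python
import numpy as np

def matching_predictor(y_np, X_np, X_p, d_p):
    """Statistical matching (imputation rate of 100%): fit the linear working
    model on the non-probability sample, impute y for every unit of the
    probability sample and weight the imputations by the design weights."""
    beta_hat, *_ = np.linalg.lstsq(X_np, y_np, rcond=None)   # beta-hat from S_NP
    y_imp = X_p @ beta_hat                                   # imputed values for S_P
    return float(np.sum(d_p * y_imp))                        # sum_{i in S_P} d_i * y-hat_i
```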
Choosing between statistical matching and calibration can depend on the user’s perspective. For
example, if it is the content of the non-probability source, in terms of
variables of interest, that is relevant to the user, then it seems natural to
weight the non-probability sample in the hopes of reducing the selection bias
for all variables of interest. The calibration technique or the methods in
Section 4.3 are obvious choices for such weighting. Conversely, if instead it
is the content of the probability survey that is relevant, then statistical
matching is the appropriate choice. This method enriches the probability survey
by imputing the missing variables of interest.
Statistical matching is easily generalized to non-linear or non-parametric models such that $E_m(y_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$. The imputed values $\hat{y}_i = m(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$ are simply obtained by predicting the missing values $y_i$, $i \in S_P$, using the chosen model. The predictor $\hat{t}_y^{\,\mathrm{SM}}$ remains unbiased if assumptions 1 to 3 are satisfied and if $E_m(\hat{y}_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$. Donor or nearest neighbour imputation is a non-parametric imputation method commonly used for handling non-response (see, for example, Beaumont and Bocci, 2009) that does not require a linear relationship between $y$ and $\mathbf{x}$. In the context of matching non-probability and probability samples, donor imputation was popularized by Rivers (2007). For a given unit $i \in S_P$, the method involves finding the nearest donor, with respect to the auxiliary variables $\mathbf{x}$, among the units of the non-probability sample and replacing the missing value $y_i$ with the value $y_j$ from this donor. For donor imputation, the condition $E_m(\hat{y}_i \mid \mathbf{x}_i) = m(\mathbf{x}_i, \boldsymbol{\beta})$ is satisfied if, for each recipient $i \in S_P$, the donor has exactly the same values of $\mathbf{x}$ as the recipient. When one or more auxiliary variables are continuous, this condition is satisfied only asymptotically in general. A very large non-probability sample provides a large pool of donors, which should help to approximately satisfy this condition.
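The following brute-force sketch illustrates donor imputation with a single nearest neighbour on standardized auxiliary variables; the Euclidean distance and the standardization are illustrative choices rather than requirements of the method.

```python
import numpy as np

def donor_imputation_total(y_np, X_np, X_p, d_p):
    """Nearest-neighbour (donor) imputation: each probability-sample unit
    receives the y value of its closest non-probability unit in the space
    of the standardized auxiliary variables."""
    scale = X_np.std(axis=0)
    scale[scale == 0] = 1.0                       # guard against constant columns
    Z_np, Z_p = X_np / scale, X_p / scale
    # index of the nearest donor for every recipient (O(n_P * n_NP) search)
    donor = np.array([np.argmin(((Z_np - z) ** 2).sum(axis=1)) for z in Z_p])
    y_imp = y_np[donor]
    return float(np.sum(d_p * y_imp))
```

A k-d tree or another indexing structure would be preferable for large samples; the brute-force search is kept here only to make the logic explicit.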
Remark: In some applications, a very large non-probability panel of volunteers, $S_{NP}$, is available, which contains a few auxiliary variables for matching, but no variable of interest. Ideally, the variables of interest would be collected for all units of the panel $S_{NP}$, but that is impossible due to the cost and the burden on the panel members. Therefore, in practice, a sub-sample $S_{NP}^{\,\mathrm{sub}}$ of $S_{NP}$ is selected using random or non-random sampling methods. Quota sampling (e.g., Deville, 1991) is often considered in this context. In addition to collecting the variables of interest for all units of $S_{NP}^{\,\mathrm{sub}}$, there may also be interest in collecting other auxiliary variables for matching in order to enhance the vector $\mathbf{x}$. The matching can then be done to the probability sample, often much smaller in size, as long as the latter contains the same auxiliary variables as those of the non-probability sub-sample $S_{NP}^{\,\mathrm{sub}}$. By carefully choosing the auxiliary variables for the matching, the potential for bias reduction is increased (Schonlau and Couper, 2017). The implementation proposed by Rivers (2007) is slightly different. Rivers (2007) suggests conducting the matching between the probability sample and the panel $S_{NP}$ using the auxiliary variables available in both sources. The variables of interest are collected only for the set of donors in $S_{NP}$ who have been matched to a unit in the probability sample, which allows for a significant reduction of data collection costs and burden. The implicit assumption is that the panel members, initially volunteers, are more likely to respond than individuals chosen at random in the population. Obviously, non-response is unavoidable, and this problem must be dealt with, potentially through imputation. The advantage of this method is that the matching is carried out using the panel $S_{NP}$ rather than a sub-sample of this panel; the pool of donors is larger. However, the matching cannot be done using the enhanced vector of auxiliary variables, because it is not available for the units of the panel $S_{NP}$, which limits the potential for bias reduction.
Lavallée and Brisbane (2016) point out the connection between statistical matching and indirect sampling (Lavallée, 2007; Deville and Lavallée, 2006). They propose an estimator obtained by imputing each missing value $y_i$, $i \in S_P$, by a weighted average of the $y$ values of nearest donors. In reality, their estimator can also be obtained equivalently by imputing the missing values using fractional donor imputation (for example, Kim and Fuller, 2004). The use of more than one donor to impute the missing values yields a typically modest variance reduction.
Several imputation methods used in practice can be considered linear (Beaumont and Bissonnette, 2011). This is the case for linear regression imputation, donor imputation and fractional donor imputation. An imputation method is said to be linear if the imputed value $\hat{y}_i$ can be written as

$\hat{y}_i = \sum_{j \in S_{NP}} \phi_{ij}\, y_j,$

where $\phi_{ij}$ is a function of $\mathbf{X}$ or $\boldsymbol{\delta}$, but not of $\mathbf{y}_U$. For example, for donor or nearest-neighbour imputation, $\phi_{ij} = 1$ if the unit $j \in S_{NP}$ is the donor for the recipient $i \in S_P$, and $\phi_{ij} = 0$ otherwise. For a linear imputation method, the estimator $\hat{t}_y^{\,\mathrm{SM}}$ can be rewritten as a weighted sum over the non-probability sample:

$\hat{t}_y^{\,\mathrm{SM}} = \sum_{j \in S_{NP}} w_j\, y_j,$

where $w_j = \sum_{i \in S_P} d_i\, \phi_{ij}$. Therefore, for linear imputation methods, statistical matching is an alternative to calibration and to the methods in Section 4.3 if the objective is to properly weight the non-probability sample.
So far, we have considered only the estimation of the total $t_y$. However, the probability sample contains other variables, and there may be interest in the relationship between two or more variables, some from the probability survey and others imputed from the non-probability sample. As an example, suppose that the estimation of the total $t_{yz} = \sum_{i \in U} z_i\, y_i$ is of interest, where $z$ is a variable collected in the probability survey, but not available in the non-probability sample. It could, for example, define membership in a domain of interest. Statistical matching can be used to estimate this parameter by

$\hat{t}_{yz}^{\,\mathrm{SM}} = \sum_{i \in S_P} d_i\, z_i\, \hat{y}_i.$

We use $\mathbf{z}_U$ to denote the vector that contains the values of the variable $z$ for the $N$ units of the population $U$. It can be shown that $\hat{t}_{yz}^{\,\mathrm{SM}}$ is unbiased, i.e., $E(\hat{t}_{yz}^{\,\mathrm{SM}} - t_{yz}) = 0$, if assumptions 1 to 3 are satisfied in addition to the following assumption:

Assumption 4: $\boldsymbol{\delta}$ and $\mathbf{z}_U$ are independent after conditioning on $\mathbf{X}$ and $\mathbf{y}_U$.

Assumption 4 is known as the conditional independence assumption in the statistical matching literature.
4.3 Inverse propensity score weighting
Instead of modelling the relationship between $y$ and $\mathbf{x}$, the relationship between $\boldsymbol{\delta}$ and $\mathbf{x}$ could be modelled. The main advantage of this approach is to simplify the modelling effort when there are multiple variables of interest since there is always only one variable $\delta$. With this approach, inferences are conditional on $\mathbf{X}$ and $\mathbf{y}_U$. Also, it is usually assumed that assumption 3 is valid and thus $p_i \equiv P(\delta_i = 1 \mid \mathbf{X}, \mathbf{y}_U) = P(\delta_i = 1 \mid \mathbf{X})$. The probability of participation $p_i$ is then estimated by $\hat{p}_i$ and the estimate

$\hat{t}_y^{\,\mathrm{IPW}} = \sum_{i \in S_{NP}} w_i\, y_i$

is calculated, where $w_i = 1 / \hat{p}_i$. The assumption that $p_i > 0$ for all $i \in U$ must be made. It is called the positivity assumption by Mercer et al. (2017). It may also be required in the calibration and statistical matching approaches. For example, empty post-strata may occur if it is not satisfied. To fix this issue, these empty post-strata are usually collapsed with other non-empty post-strata. This collapsing may jeopardize the validity of assumption 3 if the collapsed post-strata are different.
The estimation of $p_i$ can be achieved by postulating a parametric model $p_i = f(\mathbf{x}_i' \boldsymbol{\alpha})$, where $f(\cdot)$ is some function, normally bounded by 0 and 1, and $\boldsymbol{\alpha}$ is a vector of unknown model parameters. To lighten the notation, we write $p_i(\boldsymbol{\alpha}) = f(\mathbf{x}_i' \boldsymbol{\alpha})$. The logistic function, $f(\mathbf{x}_i' \boldsymbol{\alpha}) = \exp(\mathbf{x}_i' \boldsymbol{\alpha}) / \{1 + \exp(\mathbf{x}_i' \boldsymbol{\alpha})\}$, predominates in the applications (see Kott, 2019, for a recent application). The estimator of $\boldsymbol{\alpha}$ is denoted by $\hat{\boldsymbol{\alpha}}$ and the estimated probability by $\hat{p}_i = f(\mathbf{x}_i' \hat{\boldsymbol{\alpha}})$. Ideally, $\boldsymbol{\alpha}$ would be estimated using $\mathbf{x}_i$ for all the units in the population $U$, similar to what would be done in a non-response context. For example, assuming the logistic function is used, $\boldsymbol{\alpha}$ could be estimated by solving the maximum likelihood equation:

$\sum_{i \in U} \{\delta_i - p_i(\boldsymbol{\alpha})\}\, \mathbf{x}_i = \mathbf{0}. \qquad (4.3)$

This is impossible when $\mathbf{x}_i$ is not known for all units $i \in U$, which is almost always the case in practice. Iannacchione, Milne and Folsom (1991) proposed another unbiased estimating equation for $\boldsymbol{\alpha}$ (see also Deville and Dupont, 1993):

$\sum_{i \in S_{NP}} \frac{\mathbf{x}_i}{p_i(\boldsymbol{\alpha})} - \sum_{i \in U} \mathbf{x}_i = \mathbf{0}. \qquad (4.4)$

The main advantage of equation (4.4) is that it does not require knowing $\mathbf{x}_i$ for each unit $i \in U$. However, it is necessary to have access to the vector of totals $\mathbf{t}_{\mathbf{x}}$ from an external source. An interesting property of equation (4.4) is that the resulting weights $w_i = 1/\hat{p}_i$ satisfy the calibration equation $\sum_{i \in S_{NP}} w_i \mathbf{x}_i = \mathbf{t}_{\mathbf{x}}$, just like the weights $w_i$ given in (4.2). Indeed, it can be shown that solving (4.4) yields linear calibration weights of the form $w_i = \mathbf{x}_i' \hat{\boldsymbol{\alpha}}$ if the model $p_i(\boldsymbol{\alpha}) = 1 / (\mathbf{x}_i' \boldsymbol{\alpha})$ is used. However, this is a less natural model than the above logistic model for modelling a probability.
To get around the problem of the missing values $\mathbf{x}_i$ for the units outside the non-probability sample, Chen et al. (2019) suggest estimating the population term $\sum_{i \in U} p_i(\boldsymbol{\alpha})\, \mathbf{x}_i$ in (4.3) using a probability survey. The equation to be solved becomes:

$\sum_{i \in S_{NP}} \mathbf{x}_i - \sum_{i \in S_P} d_i\, p_i(\boldsymbol{\alpha})\, \mathbf{x}_i = \mathbf{0}. \qquad (4.5)$

Equation (4.5) is unbiased conditionally on $\mathbf{X}$ and $\mathbf{y}_U$, provided that the probability survey allows for unbiased estimation, conditionally on $\mathbf{X}$ and $\mathbf{y}_U$, of any population total that is not a function of $\boldsymbol{\delta}$, such as $\sum_{i \in U} p_i(\boldsymbol{\alpha})\, \mathbf{x}_i$. Assumptions 1 and 3 are required, but not assumption 2. Using the idea of Iannacchione et al. (1991), an alternative to (4.5) is obtained by solving:

$\sum_{i \in S_{NP}} \frac{\mathbf{x}_i}{p_i(\boldsymbol{\alpha})} - \sum_{i \in S_P} d_i\, \mathbf{x}_i = \mathbf{0}. \qquad (4.6)$

Equation (4.6) produces weights $w_i = 1/\hat{p}_i$ that satisfy the calibration equation $\sum_{i \in S_{NP}} w_i \mathbf{x}_i = \hat{\mathbf{t}}_{\mathbf{x}}$ (see also Lesage, 2017; Rao, 2020). The estimators of $\boldsymbol{\alpha}$ obtained using (4.5) or (4.6) are likely less efficient than those obtained using (4.3) or (4.4). If $\mathbf{x}_i$ is known for all units in $U$ or the vector $\mathbf{t}_{\mathbf{x}}$ is known, then using (4.3) or (4.4) is preferable. Otherwise, the estimating equations (4.5) or (4.6) can be used provided that $\mathbf{x}$ is collected in a probability survey. Note that the indicators $\delta_i$ do not need to be observed in the probability sample.
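For illustration, equation (4.5) with a logistic participation model can be solved by Newton-Raphson as in the sketch below; the function and variable names are assumptions made for the example.

```python
import numpy as np

def propensity_alpha(X_np, X_p, d_p, n_iter=50, tol=1e-10):
    """Solve estimating equation (4.5) for a logistic participation model by
    Newton-Raphson: sum_{S_NP} x_i - sum_{S_P} d_i p_i(alpha) x_i = 0."""
    alpha = np.zeros(X_np.shape[1])
    t_np = X_np.sum(axis=0)                        # sum of x over the non-probability sample
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X_p @ alpha))     # p_i(alpha) on the probability sample
        U = t_np - X_p.T @ (d_p * p)               # left-hand side of (4.5)
        J = (X_p * (d_p * p * (1.0 - p))[:, None]).T @ X_p
        step = np.linalg.solve(J, U)               # Newton step
        alpha += step
        if np.max(np.abs(step)) < tol:
            break
    return alpha

# illustration of the resulting inverse propensity score weighted estimator:
# alpha_hat = propensity_alpha(X_np, X_p, d_p)
# p_np = 1.0 / (1.0 + np.exp(-X_np @ alpha_hat))
# t_hat_ipw = np.sum(y_np / p_np)
```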
Equations (4.5) and (4.6) may be more difficult to solve than equations (4.3) and (4.4) and may not have a solution. Consider, for example, the case where there is only one auxiliary variable: $x_i = 1$ for all $i \in U$. Using (4.5) or (4.6), it can be seen that the estimated probability reduces to:

$\hat{p}_i = \frac{n_{NP}}{\hat{N}},$

where $\hat{N} = \sum_{i \in S_P} d_i$. If the size of the probability sample is sufficiently large, it is expected that $n_{NP} < \hat{N}$. For small sample sizes, it may happen that $n_{NP} > \hat{N}$ due to the variability of $\hat{N}$. In that case, equations (4.5) and (4.6) would not have a solution if the logistic function is used since it requires that $\hat{p}_i < 1$. To avoid this issue, it may be helpful to consider other functions not bounded by 1, such as the exponential function $f(\mathbf{x}_i' \boldsymbol{\alpha}) = \exp(\mathbf{x}_i' \boldsymbol{\alpha})$.
Kim and Wang (2019) suggest using the probability sample to estimate the participation probability. Assuming the logistic function is used, the equation to be solved is:

$\sum_{i \in S_P} d_i\, \{\delta_i - p_i(\boldsymbol{\alpha})\}\, \mathbf{x}_i = \mathbf{0}.$

The method requires knowing the indicators $\delta_i$ in the probability sample and the validity of assumptions 1, 2 and 3 to ensure the estimating equation is unbiased. Also, the probability sample size is usually small relative to the non-probability sample size, and it can be numerically difficult to estimate $\boldsymbol{\alpha}$, especially when $\mathbf{x}$ contains a large number of variables and the overlap between the two samples is small.
Lee (2006), see also Rivers (2007), Valliant and Dever (2011) and Elliott and Valliant (2017), proposes to combine the two samples and then estimate $p_i$ using logistic regression. It seems that the author implicitly assumes that the two samples do not overlap, i.e., that $\delta_i = 0$ for all units in $S_P$. Using again the logistic function, the resulting estimating equation is:

$\sum_{i \in S_{NP}} \tilde{d}_i\, \{1 - p_i(\boldsymbol{\alpha})\}\, \mathbf{x}_i - \sum_{i \in S_P} d_i\, p_i(\boldsymbol{\alpha})\, \mathbf{x}_i = \mathbf{0}, \qquad (4.7)$

where $\tilde{d}_i$ is a certain weight for the units in the non-probability sample. The method is somewhat similar to the one proposed by Chen et al. (2019), but the estimating equation (4.7) is not unbiased, conditionally on $\mathbf{X}$ and $\mathbf{y}_U$, unlike equations (4.5) and (4.6). However, if we assume $\tilde{d}_i = 1$ and if $p_i(\boldsymbol{\alpha})$ is small, equation (4.7) becomes approximately equivalent to equation (4.5). Yet Lee (2006) does not directly use the estimated probabilities resulting from (4.7). The author uses them only to order the union of the two samples and then create homogeneous classes. Using homogeneous classes brings some robustness to model misspecification and can help prevent very small estimated probabilities and thus very large weights. In the context of non-response, forming homogeneous imputation or reweighting classes was studied by Little (1986), Eltinge and Yansaneh (1997), and Haziza and Beaumont (2007), among others. Haziza and Lesage (2016) illustrate the robustness of the method when the function $f(\cdot)$ is misspecified. The method is used regularly in Statistics Canada surveys for dealing with non-response.
Rather than using (4.7), homogeneous classes could be formed by starting with the unbiased equations (4.5) or (4.6). These initial estimated probabilities are denoted by $\hat{p}_i^{\,0}$. The combined sample $S_{NP} \cup S_P$ can then be sorted by $\hat{p}_i^{\,0}$ and divided into $C$ homogeneous classes of equal or unequal sizes. The set of units in $S_{NP}$ that are part of class $c$ is denoted by $S_{NP,c}$, whereas the set of units in $S_P$ that are part of class $c$ is denoted by $S_{P,c}$. The weight $w_i$ for a unit $i \in S_{NP,c}$ is equal to the inverse of the estimated participation rate in class $c$ and is given by

$w_i = \frac{\hat{N}_c}{n_{NP,c}},$

where $\hat{N}_c = \sum_{i \in S_{P,c}} d_i$ and $n_{NP,c}$ is the number of units in $S_{NP,c}$. This weight ensures the calibration property: $\sum_{i \in S_{NP,c}} w_i = \hat{N}_c$. The number of classes must be large enough to capture a high percentage of the variability of the initial probabilities $\hat{p}_i^{\,0}$, thereby reducing the bias. On the other hand, it must not be too large to prevent the occurrence of empty classes since the weights $w_i$ cannot be calculated if $n_{NP,c} = 0$. Regression trees can prove to be an effective alternative for forming classes. In a non-response context, they have been studied by Phipps and Toth (2012). The estimator $\sum_{i \in S_{NP}} w_i y_i$ obtained after forming homogeneous classes has exactly the same form as the post-stratified estimator described in the calibration approach in Section 4.1; the only difference is that the classes are built by modelling $\boldsymbol{\delta}$ rather than $y$.
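The sketch below illustrates one way of forming homogeneous classes from the initial probabilities $\hat{p}_i^{\,0}$ (classes of roughly equal size based on quantiles) and of computing the weights $w_i = \hat{N}_c / n_{NP,c}$; the number of classes and the quantile rule are illustrative choices.

```python
import numpy as np

def class_weights(p0_np, p0_p, d_p, n_classes=20):
    """Form homogeneous classes from the initial estimated probabilities and
    return the weight w_i = N_c-hat / n_NP,c for every non-probability unit.
    Class boundaries are quantiles of the combined initial probabilities."""
    p0_np, p0_p, d_p = map(np.asarray, (p0_np, p0_p, d_p))
    cuts = np.quantile(np.concatenate([p0_np, p0_p]),
                       np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    c_np = np.searchsorted(cuts, p0_np)            # class label of each S_NP unit
    c_p = np.searchsorted(cuts, p0_p)              # class label of each S_P unit
    w = np.empty(len(p0_np))
    for c in range(n_classes):
        n_np_c = np.sum(c_np == c)
        if n_np_c == 0:
            raise ValueError(f"class {c} has no non-probability units; use fewer classes")
        w[c_np == c] = d_p[c_p == c].sum() / n_np_c   # N_c-hat / n_NP,c
    return w
```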
Assumption 3 may not be realistic in some contexts so that $P(\delta_i = 1 \mid \mathbf{X}, \mathbf{y}_U) \ne P(\delta_i = 1 \mid \mathbf{X})$. In this case, the participation probability $p_i = P(\delta_i = 1 \mid \mathbf{X}, \mathbf{y}_U)$ might be modelled using a vector of explanatory variables $\mathbf{g}_i$ defined using the variable of interest $y_i$ (or the variables of interest if there are several) and potentially other auxiliary variables $\mathbf{x}_i$. A parametric model, $p_i = f(\mathbf{g}_i' \boldsymbol{\alpha})$, can be considered for modelling the participation probability. Equations (4.5) and (4.6) cannot be used to estimate $\boldsymbol{\alpha}$ because $y_i$ (and therefore $\mathbf{g}_i$) is not available in the probability sample. However, an equation similar to (4.6) can be used:

$\sum_{i \in S_{NP}} \frac{\mathbf{z}_i}{p_i(\boldsymbol{\alpha})} - \sum_{i \in S_P} d_i\, \mathbf{z}_i = \mathbf{0}, \qquad (4.8)$

where $p_i(\boldsymbol{\alpha}) = f(\mathbf{g}_i' \boldsymbol{\alpha})$. The vector $\mathbf{z}_i$, of the same size as $\mathbf{g}_i$, contains calibration variables, also called instrumental variables in the econometric literature. We use $\mathbf{Z}$ and $\mathbf{G}$ to denote the matrices that contain the values of the vectors $\mathbf{z}_i$ and $\mathbf{g}_i$, respectively. Equation (4.8) requires knowing the calibration variables $\mathbf{z}_i$ for both samples. However, the explanatory variables $\mathbf{g}_i$ can be observed only for the units in the non-probability sample. Equation (4.8) produces weights $w_i = 1/\hat{p}_i$ that satisfy the calibration equation $\sum_{i \in S_{NP}} w_i \mathbf{z}_i = \sum_{i \in S_P} d_i \mathbf{z}_i$. An equation similar to (4.8) was originally proposed by Deville (1998) to deal with non-response (see also Kott, 2006; Haziza and Beaumont, 2017). Equation (4.8) is unbiased, conditionally on $\mathbf{X}$ and $\mathbf{y}_U$, if the instrumental variables $\mathbf{z}$ can be selected such that the following assumption is satisfied:
Assumption 5: $\boldsymbol{\delta}$ and $\mathbf{Z}$ are independent after conditioning on $\mathbf{G}$ and $\mathbf{y}_U$.
Assumption 3 is no longer required, but is replaced with another assumption. The choice of instrumental variables $\mathbf{z}$ that satisfy assumption 5 is not always obvious in practice. They must not be predictive of $\boldsymbol{\delta}$ after conditioning on $\mathbf{g}$. Ideally, for efficiency reasons, the instrumental variables are selected so as to be predictive of $\mathbf{g}$ without compromising assumption 5. Unlike equations (4.5) and (4.6), equation (4.8) cannot be used to form homogeneous classes because the participation probabilities $\hat{p}_i$ cannot be calculated for the units in the probability sample. As such, the property of robustness that comes with homogeneous classes is lost. Because of these drawbacks, equation (4.8) should be considered only when there are strong reasons to believe that assumption 3 is not appropriate.
Once weights $w_i = 1/\hat{p}_i$ have been calculated using one of the methods in this section, they can still be adjusted through calibration. The objective of this calibration is to improve the precision of the estimator $\hat{t}_y^{\,\mathrm{IPW}}$ and also to obtain a double robustness property (see Chen et al., 2019).
In general, the variable of interest $y$ is observed for the entire non-probability sample, and the inverse propensity-score weighted estimator, $\hat{t}_y^{\,\mathrm{IPW}}$, or a weighted estimator obtained by calibration or statistical matching can be used. Sometimes, the non-probability sample is too large and the variable $y$ can only be collected for a sub-sample of $S_{NP}$. Quota sampling (e.g., Deville, 1991) is a commonly used method for drawing the sub-sample if auxiliary variables are available for the units of $S_{NP}$. An alternative to quota sampling is to calculate the weights $w_i$ for the entire non-probability sample and use them to select a random sub-sample with probabilities proportional to the weights. The variable $y$ is then collected only for the sub-sample, and the estimates are obtained as if the sub-sample was drawn from the population using an equal probability design. This approach is called inverse sampling in the literature on probability surveys (see, for example, Hinkins, Oh and Scheuren, 1997; or Rao, Scott and Benhin, 2003) and was proposed by Kim and Wang (2019) for non-probability samples.
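A simple sketch of this sub-sampling strategy is given below; the sequential weighted draws are only an approximation to a strict probability-proportional-to-weight design without replacement, and the `y_collect` callable stands in for the actual data collection step.

```python
import numpy as np

def weighted_subsample_total(y_collect, w, m, rng=None):
    """Select a sub-sample of size m with probabilities proportional to the
    weights w (sequential weighted draws without replacement), then estimate
    the total as if the sub-sample were an equal-probability sample:
    (sum of w / m) * sum of the collected y values."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(w), size=m, replace=False, p=w / w.sum())
    y_sub = np.asarray([y_collect(i) for i in idx])   # collect y only for the sub-sample
    return w.sum() / m * y_sub.sum()
```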
4.4 Small area estimation
In most surveys, it is desired to estimate the total of the variable of interest $y$ not just for the entire population $U$, but also for different subgroups of the population, called domains. Probability surveys
conducted by national statistical agencies generally produce reliable estimates
for domains with a sufficient number of sample units. Their bias is controlled
through the various sampling and data collection procedures, and their variance
is typically small enough to draw accurate conclusions. When the domain of
interest contains few sample units, the survey estimates may become unstable to
the point of being unusable even when their bias stays under control. To remedy
a lack of data in a domain of interest, small area estimation methods may be
considered. These methods offset the lack of observed data in a domain through
model assumptions that link auxiliary data to survey data. Two types of models
are commonly used: unit-level models and area-level models. The area-level
model of Fay and Herriot (1979) is undoubtedly the most popular. It requires
auxiliary data to be available at the domain level only, unlike unit-level
models, which require auxiliary variables for each unit of the population $U$. Readers
are referred to Rao and Molina (2015) for an excellent coverage of the various
approaches. Below, we focus on the Fay-Herriot model.
Suppose it is desired to estimate $D$ totals, $t_{y,j} = \sum_{i \in U_j} y_i$, $j = 1, \ldots, D$, where $U_1, \ldots, U_D$ are disjoint subsets of the population. Using a probability survey, $t_{y,j}$ can be estimated by $\hat{t}_{y,j} = \sum_{i \in S_{P,j}} d_i\, y_i$, where $S_{P,j}$ is the set of sample units that fall within domain $j$. The estimator $\hat{t}_{y,j}$ is called the direct estimator of $t_{y,j}$ because it only uses $y$ values of units belonging to domain $j$. Small area estimation techniques generally lead to indirect estimators that combine the sample $y$ values of domain $j$ with $y$ values of units outside domain $j$. We assume that a vector of auxiliary variables is available at the area level, and these variables come from sources independent of the probability sample. This vector for domain $j$ is denoted by $\mathbf{x}_j$. For example, the vector $\mathbf{x}_j = (N_j, \bar{y}_{NP,j})'$ could be considered, where $N_j$ is the population size in domain $j$, $\bar{y}_{NP,j} = n_{NP,j}^{-1} \sum_{i \in S_{NP,j}} y_i$ is the average of the variable $y$ in the non-probability sample, $S_{NP,j}$ is the set of units in the non-probability sample that are in domain $j$, and $n_{NP,j}$ is the size of the non-probability sample in domain $j$. If the population size $N_j$ is unknown, it can be replaced with an estimate independent of the probability survey. We use $\mathbf{X}$ to denote the matrix that contains the values of the vector $\mathbf{x}_j$, $j = 1, \ldots, D$. Note that the vector of $y$ values observed in the non-probability sample is hidden in the matrix $\mathbf{X}$ in this section.
The Fay-Herriot model has two components: the sampling model and the linking model. The sampling model is based on the assumption that, conditionally on $t_{y,j}$, the direct estimators $\hat{t}_{y,j}$ are independent and unbiased, i.e., $E_p(\hat{t}_{y,j} \mid t_{y,j}) = t_{y,j}$. Their design variance is denoted by $\psi_j = V_p(\hat{t}_{y,j} \mid t_{y,j})$. The sampling model is usually written in the form:

$\hat{t}_{y,j} = t_{y,j} + e_j, \qquad (4.9)$

where $e_j$ is the sampling error such that $E_p(e_j \mid t_{y,j}) = 0$ and $V_p(e_j \mid t_{y,j}) = \psi_j$. The independence assumption of the estimators $\hat{t}_{y,j}$ (and therefore of the sampling errors $e_j$) can be questioned when the strata do not coincide with the domains of interest. Section 8.2 of Rao and Molina (2015) discusses methods that take into account correlated sampling errors. In practice, it is often assumed that these correlations are weak, and they are ignored.
The linking model assumes that, conditionally on $\mathbf{X}$, the totals $t_{y,j}$ are independent, $E_m(t_{y,j} \mid \mathbf{x}_j) = \mathbf{x}_j' \boldsymbol{\beta}$ and $V_m(t_{y,j} \mid \mathbf{x}_j) = b_j\, \sigma_v^2$, where $b_j$ are known constants used for controlling heteroscedasticity and $\boldsymbol{\beta}$ and $\sigma_v^2$ are unknown model parameters. The linking model is usually written in the form:

$t_{y,j} = \mathbf{x}_j' \boldsymbol{\beta} + v_j, \qquad (4.10)$

where $v_j$ is the model error such that $E_m(v_j \mid \mathbf{x}_j) = 0$ and $V_m(v_j \mid \mathbf{x}_j) = b_j\, \sigma_v^2$. When the parameters of interest, $t_{y,j}$, are totals, it is often appropriate to let $b_j = N_j^2$. From (4.9) and (4.10), we obtain the combined model:

$\hat{t}_{y,j} = \mathbf{x}_j' \boldsymbol{\beta} + v_j + e_j, \qquad (4.11)$

where $v_j + e_j$ is the combined error. When using the Fay-Herriot model (4.11), inferences are usually made conditionally on $\mathbf{X}$. It can easily be shown that $E(\hat{t}_{y,j} \mid \mathbf{x}_j) = \mathbf{x}_j' \boldsymbol{\beta}$ and $V(\hat{t}_{y,j} \mid \mathbf{x}_j) = b_j\, \sigma_v^2 + \tilde{\psi}_j$, where $\tilde{\psi}_j = E_m(\psi_j \mid \mathbf{x}_j)$ is called the smooth design variance (Beaumont and Bocci, 2016; and Hidiroglou, Beaumont and Yung, 2019).
Now suppose that it is desired to predict the total $t_{y,j}$ using a linear predictor $\hat{t}_{y,j}^{\,\mathrm{LIN}} = \sum_{l=1}^{D} a_{jl}\, \hat{t}_{y,l}$, where the $a_{jl}$ are constants to be determined. A linear predictor uses all the data from the probability sample for predicting $t_{y,j}$, not just the data from domain $j$. This explains how it derives its efficiency. However, not all linear predictors are appropriate for predicting $t_{y,j}$. A strategy often used for determining the constants $a_{jl}$ is to minimize the variance of the prediction error, $V(\hat{t}_{y,j}^{\,\mathrm{LIN}} - t_{y,j} \mid \mathbf{X})$, subject to the constraint that the predictor must be unbiased, $E(\hat{t}_{y,j}^{\,\mathrm{LIN}} - t_{y,j} \mid \mathbf{X}) = 0$. The resulting predictor, called the Best Linear Unbiased Predictor (BLUP), is denoted by $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ and can be written in the form (see, for example, Rao and Molina, 2015):

$\hat{t}_{y,j}^{\,\mathrm{BLUP}} = \gamma_j\, \hat{t}_{y,j} + (1 - \gamma_j)\, \mathbf{x}_j' \tilde{\boldsymbol{\beta}}, \qquad (4.12)$

where $\gamma_j = b_j \sigma_v^2 / (b_j \sigma_v^2 + \tilde{\psi}_j)$ is bounded by 0 and 1, and $\tilde{\boldsymbol{\beta}} = \bigl\{\sum_{l=1}^{D} \mathbf{x}_l \mathbf{x}_l' / (b_l \sigma_v^2 + \tilde{\psi}_l)\bigr\}^{-1} \sum_{l=1}^{D} \mathbf{x}_l\, \hat{t}_{y,l} / (b_l \sigma_v^2 + \tilde{\psi}_l)$.
The predictor (4.12) is a weighted average of the direct estimator $\hat{t}_{y,j}$ and a prediction, $\mathbf{x}_j' \tilde{\boldsymbol{\beta}}$, often called the synthetic estimator. More weight is given to the direct estimator when the smooth design variance, $\tilde{\psi}_j$, is small relative to the variance of the linking model, $b_j \sigma_v^2$. The predictor $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ is then similar to the direct estimator. This situation normally occurs when the sample size in the domain is large. Conversely, if the direct estimator is unstable and has a large smooth design variance, more weight is given to the synthetic estimator. If the number of domains is large, the prediction variance of $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ is approximately equal to $\gamma_j\, \tilde{\psi}_j$. Since $V(\hat{t}_{y,j} - t_{y,j} \mid \mathbf{X}) = \tilde{\psi}_j$, the constant $\gamma_j$ can be interpreted as being a variance reduction factor resulting from using $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ instead of $\hat{t}_{y,j}$. Therefore, the variance reduction is greater when $\gamma_j$ is small, i.e., when the direct estimator is not precise. On the other hand, if the linking model is not properly specified, there is a greater risk of significant bias when $\gamma_j$ is small. To better understand this point, suppose that the real linking model is such that $E_m(t_{y,j} \mid \mathbf{x}_j) = \mu(\mathbf{x}_j)$ for some function $\mu(\cdot)$. Under this model, it can be shown that the bias of the predictor $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ is given by

$E(\hat{t}_{y,j}^{\,\mathrm{BLUP}} - t_{y,j} \mid \mathbf{X}) = -(1 - \gamma_j)\, \bigl\{\mu(\mathbf{x}_j) - \mathbf{x}_j' \boldsymbol{\beta}^{*}\bigr\}, \qquad (4.13)$

where $\boldsymbol{\beta}^{*} = \bigl\{\sum_{l=1}^{D} \mathbf{x}_l \mathbf{x}_l' / (b_l \sigma_v^2 + \tilde{\psi}_l)\bigr\}^{-1} \sum_{l=1}^{D} \mathbf{x}_l\, \mu(\mathbf{x}_l) / (b_l \sigma_v^2 + \tilde{\psi}_l)$ is the value towards which $\tilde{\boldsymbol{\beta}}$ converges. If the linear model $\mu(\mathbf{x}_j) = \mathbf{x}_j' \boldsymbol{\beta}$ is valid, the bias disappears. Otherwise, the bias is not zero and increases as $\gamma_j$ decreases or as the specification error of the linking model, $\mu(\mathbf{x}_j) - \mathbf{x}_j' \boldsymbol{\beta}^{*}$, increases. When $\gamma_j$ is close to 1, the bias is usually negligible, but so is the variance reduction.
Remark: Note that the predictor $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ and the bias (4.13) depend on the variance $\sigma_v^2$. If the linear model (4.10) is not valid, the parameters $\boldsymbol{\beta}$ and $\sigma_v^2$ no longer exist. Yet, the linking model (4.10) can still be postulated and its parameters can be estimated from the observed data as if the model were valid. The model variance $\sigma_v^2$, which enters in the calculation of the predictor $\hat{t}_{y,j}^{\,\mathrm{BLUP}}$ and the bias (4.13), can be viewed as being the value towards which an estimator of $\sigma_v^2$ converges.
The predictor (4.12) cannot be calculated because it depends on the unknown variances $\sigma_v^2$ and $\tilde{\psi}_j$. When $\sigma_v^2$ and $\tilde{\psi}_j$ in (4.12) are replaced with estimators, $\hat{\sigma}_v^2$ and $\hat{\tilde{\psi}}_j$, the BLUP (4.12) becomes the empirical best linear unbiased predictor, denoted as $\hat{t}_{y,j}^{\,\mathrm{EBLUP}}$. There are a number of methods for estimating $\sigma_v^2$ (see Rao and Molina, 2015). One of the most commonly used methods is restricted maximum likelihood. To estimate $\tilde{\psi}_j$, we assume that a design-unbiased estimator of $\psi_j$ is available, denoted by $\hat{\psi}_j$. This assumption is formally written: $E_p(\hat{\psi}_j \mid t_{y,j}) = \psi_j$. It follows that $E(\hat{\psi}_j \mid \mathbf{x}_j) = E_m(\psi_j \mid \mathbf{x}_j) = \tilde{\psi}_j$. Therefore, the estimator $\hat{\psi}_j$ is unbiased for $\tilde{\psi}_j$, but can be very unstable when the domain sample size is small. A more efficient approach for estimating $\tilde{\psi}_j$ involves modelling $\hat{\psi}_j$ given the auxiliary variables $\mathbf{x}_j$. In practice, a linear model is often used for $\log \hat{\psi}_j$, and it is assumed that the model errors follow a normal distribution (for example, Rivest and Belmonte, 2000). Beaumont and Bocci (2016), see also Hidiroglou et al. (2019), provide a method of moments for estimating $\tilde{\psi}_j$ that does not require the normality assumption.
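For illustration, the sketch below computes the EBLUP (4.12) from direct estimates, domain-level auxiliary data and smoothed design variances, estimating $\sigma_v^2$ with the Fay-Herriot moment equation; this is only one of several possible estimation methods and the implementation details (bracketing, truncation at zero, default $b_j = 1$) are illustrative.

```python
import numpy as np

def fay_herriot_eblup(t_dir, X, psi, b=None, tol=1e-8, max_iter=200):
    """EBLUP under the Fay-Herriot model (4.11)-(4.12), assuming the smoothed
    design variances psi are supplied. sigma_v^2 is estimated by solving the
    Fay-Herriot moment equation with bisection."""
    D, p = X.shape
    b = np.ones(D) if b is None else np.asarray(b, dtype=float)

    def gls_beta(sigma2):
        V = b * sigma2 + psi                       # total variance per domain
        XtVinv = X.T / V
        return np.linalg.solve(XtVinv @ X, XtVinv @ t_dir)

    def moment(sigma2):
        # sum_j (t_dir_j - x_j'beta)^2 / (b_j*sigma2 + psi_j) - (D - p)
        r = t_dir - X @ gls_beta(sigma2)
        return np.sum(r**2 / (b * sigma2 + psi)) - (D - p)

    lo, hi = 0.0, max(1.0, float(np.var(t_dir)))
    while moment(hi) > 0 and hi < 1e12:            # expand until the root is bracketed
        hi *= 10.0
    if moment(lo) <= 0:
        sigma2 = 0.0                               # truncate the estimate at zero
    else:
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if moment(mid) > 0:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol * (1.0 + hi):
                break
        sigma2 = 0.5 * (lo + hi)

    beta = gls_beta(sigma2)
    gamma = b * sigma2 / (b * sigma2 + psi)        # shrinkage factor of (4.12)
    return gamma * t_dir + (1 - gamma) * (X @ beta), sigma2
```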
The Fay-Herriot
model requires the availability of auxiliary data only at the domain level. The
variable of interest $y$ must be
measured without error in the probability survey, but it is not essential for
the auxiliary source to be perfect. This leaves the door open to all kinds of
files external to the probability survey such as big data files. Kim, Wang, Zhu
and Cruze (2018) is a recent example where an extension of the Fay-Herriot
model was used with auxiliary data from satellite images. Small area estimation
methods often achieve significant and sometimes impressive variance reductions
(see, for example, Hidiroglou et al., 2019). The trade-off for obtaining
these gains is the introduction of model assumptions and the risk that these
assumptions do not hold. Therefore, model validation is a critical step in
producing small area estimates, as in any model-based approach.
Small area
estimation methods are generally used to improve the efficiency of estimators
for domains with a small sample size. They could also be used to reduce the
data collection costs and respondent burden by reducing the overall sample size
of a probability survey for a few, if not all, survey variables. The estimates
obtained from the reduced sample and the Fay-Herriot model, for example, could
thus have a precision similar to the direct estimates from the probability
survey obtained from the full sample. In this context, small area estimation
methods would not be used to improve the precision for domains containing few
units, but instead to reduce the overall data collection effort while
preserving the quality of the estimates.