Cost optimal sampling for the integrated observation of different populations
Section 4. Informative contexts and optimization problem
Optimization problems as presented in (3.1) are quite theoretical, since one needs to know
the values of the variables of interest in both populations and the
values of the actual links among the units of the two populations. We now present
three more concrete contexts involving varying amounts of information. We start
with two contexts in which the information is very rich, whereas the third
context considers a case in which the information is very poor. The latter
context is the most common, although the growing
availability of administrative registers and statistical software tools for
data integration increases the plausibility of the first two contexts.
Context 1. The sampling frames for
and
are available. All the values
and
are known and the values of
are unknown but can be predicted by suitable
superpopulation models.
This context may
be realistic in countries, such as the Nordic ones, that have well-established
register-based systems (Wallgren and Wallgren, 2014) in which the units of a
given statistical register have unique identifiers of good quality, which
allows the same unit to be identified across the whole system of registers. For
the agricultural example, this means that one can link each farm to one or more
rural households, and each rural household to one or more farms.
The working models that we study can be
expressed in the following forms:
where, omitting the subscripts for the sake of brevity,
are vectors of predictors (available in the
two sampling frames),
are the vectors of regression coefficients and
are known functions,
are the error terms,
are the predicted values and
denote the expectations under the models. The
predictors
in the unit and cluster level models can be
different. We assume that the parameters of the models are known, although in
practice they are usually estimated.
Even if the model
is not
known, the model expectations at cluster level for the population
can
be derived from a model defined at elementary unit level, indicated with
The
elementary unit level model can be stated as
where
is the
intra-cluster correlation.
The model
expectations at cluster level on the right-hand side of (4.1) can be easily
derived as:
for
Note that the working models (4.1) are variable
specific. They are introduced as useful tools for developing the sampling
design, but they do not necessarily represent exactly the real models
generating the data.
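As an illustration of how cluster-level moments follow from a unit-level model with intra-cluster correlation, the sketch below sums the unit expectations and inflates the variance of the cluster total by the usual factor 1 + (m - 1) * rho. It assumes, for simplicity, a common unit-level variance and a common correlation within the cluster; these simplifications and the function name are assumptions of the example, not the paper's exact specification.

```python
def cluster_moments(unit_expectations, sigma2, rho):
    """Model mean and variance of a cluster total.

    Illustrative assumptions: every unit in the cluster has the same model
    variance sigma2, and any two units of the cluster share the same
    correlation rho.
    """
    m = len(unit_expectations)
    # The expectation of the cluster total is the sum of the unit expectations.
    mean_total = sum(unit_expectations)
    # Var(sum) = m*sigma2 + m*(m-1)*rho*sigma2 = sigma2 * m * (1 + (m-1)*rho).
    var_total = sigma2 * m * (1 + (m - 1) * rho)
    return mean_total, var_total


if __name__ == "__main__":
    print(cluster_moments([1.0, 2.0, 3.0], sigma2=4.0, rho=0.5))
```

With rho equal to zero the variance reduces to m times the unit variance, so the correlation term is what separates the cluster-level model from a simple sum of independent units.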
According to
(4.1), the model predictions and the variances of the
variables are given by
and
Thus, in the optimization problem (3.1), the
variance terms,
and
are replaced by the anticipated variances.
Denoting by
the expectation under the sampling design, the
anticipated variance (AV) of
may be reformulated as follows:
We have
and
The same result may be derived for the estimate
Thus, we obtain the following expressions:
where
and
are given by expressions (A.2) and (B.2) of
Appendices A and B.
The problem (3.1)
for finding the optimal
vector is
then reformulated as follows:
Remark 4.1.
The anticipated variances in (4.5) have cumbersome formulae. A conservative
simplified expression of
is given in Remark 4.1 of Falorsi and
Righi (2015). Simpler conservative approximations of both
and
are obtained by approximating the sampling
design variance with the Poisson sampling variance. We then have
replacing
and
by
and
respectively, where
and
(see Appendix B). Conservative
approximations are a safe choice in this setting, since they eliminate the risk
of defining an insufficient sample size for the expected accuracies.
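The Poisson approximation mentioned in this remark can be sketched numerically. The example uses the textbook variance of a Horvitz-Thompson total under Poisson sampling, with the squared value of each unit replaced by its model expectation (squared prediction plus model variance). It is a generic illustration with hypothetical argument names; the paper's own expressions, including those for the indirect estimator, are the ones given in Appendix B.

```python
def poisson_anticipated_variance(incl_probs, predictions, model_variances):
    """Anticipated variance of a Horvitz-Thompson total under Poisson sampling.

    Uses sum_k (1/pi_k - 1) * (pred_k**2 + sigma2_k), which follows from the
    Poisson design variance sum_k (1/pi_k - 1) * y_k**2 after taking the model
    expectation of y_k**2. Assumes the design is non-informative.
    """
    total = 0.0
    for pi, y_pred, s2 in zip(incl_probs, predictions, model_variances):
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        total += (1.0 / pi - 1.0) * (y_pred ** 2 + s2)
    return total


if __name__ == "__main__":
    print(poisson_anticipated_variance([0.5, 0.25], [2.0, 1.0], [0.0, 1.0]))
```

Because the expression is a sum of terms that each depend on a single inclusion probability, it is convenient inside an optimization over the inclusion probabilities.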
Remark 4.2.
Lavallée and Labelle-Blanchet (2013) deal with the problem of indirect sampling
applied to skewed populations by suggesting eight alternative methods for
modifying the links,
to reduce the variance of the estimates in the
presence of skewed populations, while keeping estimation unbiased. Using
methods 2 and 3 proposed by these authors, the algorithm can be run by simply
replacing the links
by weighted links,
in
Context 2. The links
are not known with certainty but the
probabilities that the links exist,
are available.
To include the
linkage uncertainty in the optimization, we assume the links follow a Bernoulli
model
where
and
We
assume the parameters
to be
known, although in practice they are usually estimated with probabilistic
record linkage procedures (Lavallée and Caron, 2001). For the agricultural
example, such a situation would occur when, for instance, the population of
farms is linked to the population of rural households using probabilistic
record linkage because no common identifier exists. In this framework, the
anticipated variance must take into account both models
and
Since
and
the problem (4.5) can be reformulated as
follows:
where
with
and
The main results
for the derivation of the expression of
are
given in Appendix C. These are derived using a Taylor series approximation
and postulating independence between the process which generates the links
and the
one that generates the variables of interest
Under
these approximations, the predicted values
are
obtained as
where
with
The uncertainty in
total survey costs, which depends on both the selected sample and the model
uncertainty about costs, requires us to consider the expected costs
in the
optimization problem. Steel and Clark (2014) show how the uncertainty in the
expected costs can affect the accuracy of the sample design.
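To make the role of the link probabilities concrete, the following sketch computes the expected number of links of each unit under a Bernoulli link model and plugs it into an expected cost. The linear cost function (a fixed cost per sampled unit plus a cost per link), the independence of the links across pairs and all names are assumptions made for the example; the cost function actually used in problem (3.1) may differ.

```python
def expected_links(link_probs):
    """Expected number of links per unit: row sums of the probability matrix."""
    return [sum(row) for row in link_probs]


def link_count_variances(link_probs):
    """Variance of the number of links per unit.

    Assumes (illustration only) that links are independent Bernoulli
    variables, so the variances p * (1 - p) simply add up.
    """
    return [sum(p * (1.0 - p) for p in row) for row in link_probs]


def expected_cost(incl_probs, link_probs, cost_per_unit, cost_per_link):
    """Expected survey cost under a hypothetical linear cost model.

    Each sampled unit costs cost_per_unit plus cost_per_link for every link;
    the expectation is taken over both the sampling design and the link model,
    assuming sample selection does not depend on the realized links.
    """
    links = expected_links(link_probs)
    return sum(pi * (cost_per_unit + cost_per_link * l)
               for pi, l in zip(incl_probs, links))


if __name__ == "__main__":
    probs = [[0.5, 0.5], [1.0, 0.0]]  # rows: sampled-frame units, columns: linked units
    print(expected_links(probs), link_count_variances(probs))
    print(expected_cost([0.5, 1.0], probs, cost_per_unit=2.0, cost_per_link=3.0))
```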
Context 3. Data integration is not possible because the record linkage process does not
provide good linkages, or simply because the frame of population
does not exist.
This
is the most common context in developing countries. It may also characterize
specific survey contexts in developed countries, for instance in the case of
hard-to-reach populations. Returning to the agricultural example, this would mean
that one might have a list of farms, but not a list of rural households. In
this case, the problem of optimal integrated sampling can be solved by using
all the available information, even if of poor quality. In the following, three
options for dealing with the optimization problem are illustrated, starting with
the option that requires the least information and moving to those that need more
information, which may be expensive to obtain.
Option 3.1. Building the predictions of the
variables and decreasing the variance
thresholds
by a
scale factor. Suppose that from the frame of
population
it is possible
to know the values of a size variable
related
to the total links
of the units
For instance,
if the population
is a population
of farms and the population
is a population
of households, then the number of workers in the farms
can represent a
good approximation of the total number of links,
of the farm.
Suppose further that the totals or the
estimated totals,
are available at
a certain domain level,
defined at
geographic level, with
and
for
Then the
predicted
variables can
be defined as:
for
where
denotes the geographic
domain
for the
population
In practice,
the ratio approach in (4.10) assumes that unit
can be given a
share of the total
proportional to
the size of the unit itself. Other examples of building the predictions of the
values are
illustrated in Section 5.3.2 of Guidelines
on Integrated Survey Framework (FAO, 2015).
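The ratio approach described above can be sketched as follows: each unit receives a share of its domain total proportional to its size variable. The function and argument names are hypothetical, and the data in the usage example are invented purely to show the arithmetic.

```python
from collections import defaultdict


def ratio_predictions(sizes, domains, domain_totals):
    """Predict unit values as size-proportional shares of known domain totals.

    sizes[k] is the size measure of unit k (e.g. number of workers),
    domains[k] its geographic domain, and domain_totals[d] the known or
    estimated total of the variable of interest in domain d.
    """
    size_by_domain = defaultdict(float)
    for x, d in zip(sizes, domains):
        size_by_domain[d] += x
    predictions = []
    for x, d in zip(sizes, domains):
        if size_by_domain[d] <= 0:
            raise ValueError(f"domain {d!r} has no positive size to share out")
        predictions.append(domain_totals[d] * x / size_by_domain[d])
    return predictions


if __name__ == "__main__":
    print(ratio_predictions([1, 3, 2, 2], ["a", "a", "b", "b"],
                            {"a": 8.0, "b": 10.0}))
```

By construction the predictions add up to the domain totals, which is the property the ratio approach relies on.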
Having
determined the predictions,
it
may be reasonable to assume that the following relationship holds:
where
Under (4.11), it is straightforward to show
that
where
The sampling variance
may be computed using expressions (2.2),
(2.3), (2.4) and (2.5) by substituting the variable
the prediction
The optimization problem for finding the
optimal
vector can then be reformulated as:
The sample
designer may find the solution by running the optimization problem (4.12) with
alternative reasonable choices of the
value
and
studying the sensitivity of the different solutions. Note that
where
Therefore (4.11) holds if the
values
are approximately constant.
Option 3.2. Extremal case of Context 2, with uniformity
of links in specific domains. If the
number or estimated number of clusters and of
elementary units
and
of the domains
are available,
then in the absence of information on the links
it might be
reasonable to assume that these are homogeneous over the domains; that is,
where
Furthermore,
suppose that, in this context, the predictions
and the sampling variances
could be assumed to be homogeneous within the domains
i.e.,
and
for
Then, the optimization problem may be dealt
with as an extremal case of Context 2, with uniformity of links in
specific domains.
Remark 4.3.
Note that with this option, the predictions
are equivalent to those expressed in (4.10).
Indeed, it is reasonable to consider that, in the absence of information, the
size in terms of elementary unit of the cluster
can be set as equal to its mean defined at the
domain level:
for
Then, the following approximations hold
Therefore, setting
and postulating
the independence of the process which generates the links
with the one that creates the variables of
interest
we can obtain
Option 3.3. Modeling the
values. Another alternative is to model directly the
-values and the
total number of links
with models of
the type:
where
and
are vectors of
auxiliary variables. The predictions
need to be
positive. A useful model is the log-linear one (Xu and Lavallée, 2009):
The model on
the right-hand side of (4.13) allows the
prediction of the total number of links
of the unit
thus defining the expected survey cost
attached to it. The optimization problem could
be solved using the variances of the predictions of the models (4.13).
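The reason a log-linear form guarantees positive predictions can be seen in a few lines: the prediction is the exponential of a linear predictor. The sketch assumes known coefficients and uses hypothetical names; it shows only the functional form, not the estimation procedure of Xu and Lavallée (2009).

```python
import math


def loglinear_prediction(x, beta):
    """Prediction exp(x'beta) from a log-linear model; always strictly positive.

    x is the vector of auxiliary variables of a unit and beta the vector of
    coefficients, assumed known here for illustration.
    """
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(linear_predictor)


if __name__ == "__main__":
    # Intercept plus one auxiliary variable (invented values).
    print(loglinear_prediction([1.0, 1.0], [0.0, math.log(2.0)]))
```

Even a strongly negative linear predictor yields a small positive prediction, which is what is needed when the predicted quantity is a number of links.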
Remark 4.4. Option 3.1 requires the minimum of
information for the construction of the predictions
and requires
defining plausible values for the constants
Option 3.2
involves the same information as Option 3.1 for the construction of the
predictions
(see Remark 4.3)
but requires an estimate of the parameters
These estimates
can be obtained from either pilot or previous surveys conducted directly on the
population
Option 3.3
is the most complex and expensive, since it involves carrying out indirect
pilot surveys on the population
for building
plausible predictions of the parameters
and
Remark 4.5.
A good strategy that should be robust against model failure is to select a
balanced sample with respect to the auxiliary variables
In this case, the
auxiliary variables
of the balancing equations are replaced by the
augmented variables
For the calculation of the variances, the
residuals
are substituted by the modified residuals
where
with
For the modified residuals
similar expressions are used.
Remark 4.6.
A proportional-to-population-size allocation may be a reasonable strategy for
stratified sampling designs in which the total sample size
is fixed. In this case the stratum sample
size,
may be defined as
where
is the measure of the size.
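A proportional allocation of a fixed total sample size can be sketched as below. The first function gives the real-valued allocation proportional to the stratum size measures; the second rounds it to integers with a largest-remainder rule so that the total is preserved. The rounding rule and the function names are choices made for the example, not part of the original remark.

```python
import math


def proportional_allocation(n, sizes):
    """Real-valued stratum sample sizes proportional to the size measures."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("the size measures must sum to a positive value")
    return [n * s / total for s in sizes]


def round_preserving_total(n, allocation):
    """Round an allocation to integers that still sum to n (largest remainder)."""
    rounded = [math.floor(a) for a in allocation]
    shortfall = n - sum(rounded)
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(range(len(allocation)),
                          key=lambda h: allocation[h] - rounded[h],
                          reverse=True)
    for h in by_remainder[:shortfall]:
        rounded[h] += 1
    return rounded


if __name__ == "__main__":
    alloc = proportional_allocation(10, [1, 1, 1])
    print(alloc, round_preserving_total(10, alloc))
```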