Cost optimal sampling for the integrated observation of different populations
Section 5. Empirical results
The results illustrated here are obtained using real data from Districts 7, 8 and 9 of the Gaza Province, Mozambique. They summarize the empirical results of an evaluation study described in FAO (2014). Further empirical evaluations of the proposed strategies (FAO, 2015) have been conducted on the database of agricultural households from Burkina Faso's General Census of Agriculture and confirm the general results illustrated below.
In the analysis using the Mozambique data, the first population consists of farms. The database used for the experiment includes environmental and economic variables and gathers information from the 2007 census of large and medium farms and from a sample survey of small farms (for the same year). The overall number of records is about 36,890, of which 890 are large and medium farms.
The second population is taken from the 2007 household census. The database's records are the individuals involved in agricultural, fishing, or forestry activities. The database contains approximately 54,000 records and includes several socio-demographic, environmental and economic variables. The databases of the two populations were merged, creating a Master Sampling Frame (MSF) with artificial links between individuals and farms. The merging procedure exploited the following variables: for individuals, the type of job and the district of residence; for farms, the sector, the district and the number of employed persons by type of job. Before merging, a cleaning step was carried out on the household database, discarding the records that lacked the job type variable (approximately 9,000 records). Of the remaining records, approximately 36,000 individuals declared themselves to be farmholders without any employed persons. For these cases, a one-to-one farm-individual link was defined. The remaining individuals were linked with the 890 farms, according to the following hierarchical rules:
- Each farm was linked to a number of individuals equal to the
number of workers, depending upon job type;
- Individuals and farms in the same district were linked;
- Individuals were linked with private/public governmental
farms when the type of employment and the farm sector agree.
The links were generated randomly, within the categories defined by the hierarchical rules. The exercise did not seek to predict the links that actually exist between the two populations, but rather to create a realistic dataset for the evaluation.
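The random linking step can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure actually used in the study: the record layout (`id`, `district`, `job`, `workers`), the job labels and the function name `generate_links` are hypothetical, and the sector/employment agreement rule is omitted for brevity.

```python
import random
from collections import defaultdict


def generate_links(farms, individuals, seed=0):
    """Randomly link individuals to farms within (district, job type) cells.

    farms: list of dicts with keys 'id', 'district' and 'workers'
           (a dict mapping job type -> number of workers).
    individuals: list of dicts with keys 'id', 'district' and 'job'.
    Returns a list of (farm_id, individual_id) pairs.
    """
    rng = random.Random(seed)

    # Group the individuals by district and job type, then shuffle each group
    # so that the assignment within a cell is random.
    pool = defaultdict(list)
    for person in individuals:
        pool[(person["district"], person["job"])].append(person["id"])
    for ids in pool.values():
        rng.shuffle(ids)

    links = []
    for farm in farms:
        for job, count in farm["workers"].items():
            candidates = pool[(farm["district"], job)]
            # Link as many individuals as the farm has workers of this type,
            # limited by the number of individuals still available.
            for _ in range(min(count, len(candidates))):
                links.append((farm["id"], candidates.pop()))
    return links


# Hypothetical toy data.
farms = [
    {"id": "F1", "district": 7, "workers": {"manager": 1, "labourer": 2}},
    {"id": "F2", "district": 8, "workers": {"labourer": 1}},
]
individuals = [
    {"id": "P1", "district": 7, "job": "manager"},
    {"id": "P2", "district": 7, "job": "labourer"},
    {"id": "P3", "district": 7, "job": "labourer"},
    {"id": "P4", "district": 8, "job": "labourer"},
    {"id": "P5", "district": 8, "job": "manager"},
]
links = generate_links(farms, individuals)
```

In this toy example each individual is linked to at most one farm, and individual P5 remains unlinked because no farm in district 8 employs a manager.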
Although the datasets of the two populations include several variables, in this study we focus on two of them. For the farm population we consider the number of animals, while for the household population we consider the number of trees. This choice better highlights the impact (in terms of both accuracy and sample size) that the different contexts described in Section 4 have on the sample of the household population. Summary statistics on these variables are shown in Table 5.1.
Table 5.1
The variables used in the simulation with data from Mozambique
| Population* | Number of records* | Variable | Mean value | %CV** |
| --- | --- | --- | --- | --- |
| Farms | 36,890*** | Number of animals | 11.1 | 681.6 |
| Households | 45,000 | Number of trees | 4.5 | 107.5 |
For both populations, we consider as domains of interest the districts (3 domains) and the province (1 domain). Therefore, in total we consider 8 target totals of interest (2 variables × 4 domains).
5.1 Optimal designs for the different contexts
In the following, we address four contexts:
Context 0. No control on the sample of the household population. The sample is planned by controlling only the accuracy of the estimates of the farm variables. Once the farm sample is selected, the household units linked to the selected farms are included in the sample via the indirect sampling mechanism. The expected percent CVs (%CV) of the estimates obtained from the indirect sample of households are then computed.
Context 1. Sampling frames exist for both populations. All links are known and an
integrated sample design is used, finding an optimal solution considering both
populations. Therefore, the multivariate allocation is carried out, controlling
the accuracy of estimates from both the direct sample of farms and the indirect
sample of individuals.
Context 2. Sampling frames exist for both populations, but the links are known only as estimated probabilities, and an integrated sample design is used.
Context 3. A frame exists only for the farm population. An integrated sample design is studied considering Options 3.1 and 3.2, which represent the most feasible solutions in real contexts.
Contexts 1, 2
and 3 are those defined in Section 4. Context 0 is introduced because
it represents a useful tool for the evaluation of the integrated strategy.
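The indirect sampling mechanism used in Context 0 can be illustrated with a very small sketch: once the farms are selected, every individual linked to at least one sampled farm enters the household sample. The function name `indirect_sample` and the toy identifiers are hypothetical, and estimation weights are not covered here.

```python
def indirect_sample(sampled_farms, links):
    """Return the individuals linked to at least one sampled farm.

    sampled_farms: iterable of farm identifiers selected in the direct sample.
    links: iterable of (farm_id, individual_id) pairs.
    """
    sampled = set(sampled_farms)
    return {person for farm, person in links if farm in sampled}


# Hypothetical links: individual P3 is linked to two farms.
links = [("F1", "P1"), ("F1", "P2"), ("F2", "P3"), ("F3", "P3"), ("F3", "P4")]
households_sample = indirect_sample(["F1", "F3"], links)
```

Because the household sample size depends on which farms are drawn and on how many links they carry, it is a random quantity, which is why the text refers to expected sample sizes of households.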
A stratified sampling mechanism is assumed for the farm population, where the strata are defined as districts (7, 8 and 9) by size class (1, 2, 3-4, 5-9, 10-19, 20-49, 50-99, 100+) based on the number of farm workers, thus obtaining 21 strata. As regards models (4.1), we considered stratum mean models. These specifications lead to a standard SSRSWOR design for the farms where the strata coincide with the planned domains (see Falorsi and Righi, 2015, Remark 4.2). For the evaluation we used the exact formula of the variance for a SSRSWOR design instead of the approximation given in Section 2; however, the two expressions are substantially equivalent. For the household population we also consider a mean model, defined at district level.
The evaluation studies use software that implements the optimal sampling for standard SSRSWOR designs as well as for more general sampling designs (e.g., balanced designs and incomplete stratification designs). It is available at http://www.istat.it/en/tools/methods-and-it-tools/design-tools/multiwaysampleallocation. Once installed, the software features a comprehensive user guide in English. Another tool, MAUSS-R, which considers only SSRSWOR designs, is available at http://www.istat.it/it/strumenti/metodi-e-software/software/mauss-rdownload.
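As an illustration of the exact SSRSWOR variance mentioned above, the following sketch computes the %CV of an estimated total from the textbook formula V = Σ_h N_h² (1 − n_h/N_h) S_h² / n_h. It is independent of the software tools cited in the text; the function name `ssrswor_cv_percent` and the stratum figures are hypothetical.

```python
import math


def ssrswor_cv_percent(strata):
    """Percent CV of the estimated total under stratified simple random
    sampling without replacement.

    strata: list of tuples (N_h, n_h, mean_h, S2_h), i.e. stratum size,
            stratum sample size, stratum mean and stratum variance.
    """
    total = sum(N * mean for N, n, mean, S2 in strata)
    # Exact variance, including the finite population correction (1 - n/N).
    variance = sum(N**2 * (1 - n / N) * S2 / n for N, n, mean, S2 in strata)
    return 100 * math.sqrt(variance) / total


# Hypothetical strata: (N_h, n_h, mean_h, S2_h).
strata = [(1000, 100, 5.0, 16.0), (200, 50, 20.0, 100.0)]
cv = ssrswor_cv_percent(strata)  # about 5.02
```

With a census in every stratum (n_h = N_h) the finite population correction vanishes and the function returns 0.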
For each context, the variance constraints are expressed in terms of %CVs. The analyses presented in this section focus on the contexts, and we use a simplified version of the cost functions. The cost of observing a unit of the farm population together with its linked units in the household population is fixed equal to 1. More detailed analyses of costs are presented in Section 5.2.
Some further specifications for each context are illustrated below (see Table 5.2).
Context 0. The variance constraints are fixed only for the farm estimates (number of animals), at 6.5% at the province level and at 10% at the district level, resulting in a sample of 2,122 farms.
Context 1. The constraints for the farm and household estimates have been fixed so as to obtain a sample of roughly 2,100 farms. The variance constraints for the farm estimates (number of animals) are fixed at 10% at the province level and at 15% at the district level. Those for the household estimates are fixed at 2.5% at the province level and 5% at the district level. This choice of constraints makes it possible to compare the two contexts with roughly the same sample size, even if in Context 1 the variance constraints on the estimates of the farm population are larger than those fixed in Context 0.
Context 2. The CV constraints for the household and farm estimates are equal to those adopted in Context 1. The integrated observation is planned in the sample design phase by taking into account the uncertainty in the links. This has been carried out by considering a simplified model which assumes that, for each worker in a given farm, there is only one strong link with an individual in the population of households and a number of weak links with other individuals in the same district. Strong and weak links each have their own probability, and each link between a worker of a farm and an individual of a household is assumed to follow a Bernoulli model with the corresponding probability. In the simulation we considered different combinations of values of the probability of strong links, of the probability of weak links, and of the number of individuals with a weak link. These combinations are illustrated in Table 5.3.
Context 3. The CV constraints for the household and farm estimates are equal to those adopted in Context 1. In Table 5.3, the allocation is derived under Option 3.2, proposed for Context 3. The results of Option 3.1 are presented at the end of this section.
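The Bernoulli link model used for Context 2 can be sketched as follows. This is a hedged illustration only: the names `p_strong`, `p_weak` and `n_weak` are hypothetical labels for the three quantities varied in Table 5.3, and the draws are assumed to be independent, which the text does not state explicitly.

```python
import random


def expected_links(p_strong, p_weak, n_weak):
    """Expected number of realized links per worker: one strong candidate
    plus n_weak weak candidates, each an independent Bernoulli draw."""
    return p_strong + n_weak * p_weak


def simulate_links(p_strong, p_weak, n_weak, n_workers, seed=0):
    """Simulate the number of realized links for each of n_workers workers."""
    rng = random.Random(seed)
    realized = []
    for _ in range(n_workers):
        strong = 1 if rng.random() < p_strong else 0
        weak = sum(1 for _ in range(n_weak) if rng.random() < p_weak)
        realized.append(strong + weak)
    return realized


# The four combinations reported for Context 2 in Table 5.3.
combos = [(0.90, 0.10, 1), (0.50, 0.10, 5), (0.30, 0.08, 9), (0.10, 0.09, 9)]
expected = [round(expected_links(*c), 2) for c in combos]
sim = simulate_links(0.90, 0.10, 1, n_workers=1000)
```

Under this reading, the expected number of links per worker stays close to one in all four combinations, while the links become progressively less concentrated on a single individual.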
Finally, note that for all three contexts, with a SSRSWOR design, the optimization problem may be seen as a problem of allocation for stratified sampling.
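To make the allocation problem concrete, the sketch below finds stratum sample sizes that satisfy %CV constraints in several overlapping domains. It uses a simple greedy heuristic, not the optimization algorithm of the paper or of the cited software, and it handles a single variable with unit costs. All names (`greedy_allocation`, `cv_percent`) and all stratum figures are hypothetical.

```python
import math


def cv_percent(strata, alloc, members):
    """%CV of the estimated total of a domain made of the strata in members."""
    total = sum(strata[h]["N"] * strata[h]["mean"] for h in members)
    var = sum(
        strata[h]["N"] ** 2
        * (1 - alloc[h] / strata[h]["N"])
        * strata[h]["S2"]
        / alloc[h]
        for h in members
    )
    return 100 * math.sqrt(var) / total


def greedy_allocation(strata, domains, targets, n_min=2):
    """Add one unit at a time to the stratum that most reduces the variance
    of the most violated domain, until every %CV constraint is met."""
    alloc = {h: min(n_min, s["N"]) for h, s in strata.items()}
    while True:
        ratios = {
            d: cv_percent(strata, alloc, m) / targets[d] for d, m in domains.items()
        }
        worst = max(ratios, key=ratios.get)
        if ratios[worst] <= 1:
            return alloc

        def gain(h):
            # Variance reduction obtained by moving from n to n + 1 units.
            s, n = strata[h], alloc[h]
            return s["N"] ** 2 * s["S2"] * (1 / n - 1 / (n + 1))

        candidates = [h for h in domains[worst] if alloc[h] < strata[h]["N"]]
        alloc[max(candidates, key=gain)] += 1


# Hypothetical strata (district by size class) and domains.
strata = {
    "d7_small": {"N": 500, "mean": 4.0, "S2": 25.0},
    "d7_large": {"N": 50, "mean": 60.0, "S2": 900.0},
    "d8_small": {"N": 800, "mean": 3.0, "S2": 16.0},
    "d8_large": {"N": 30, "mean": 80.0, "S2": 1600.0},
}
domains = {
    "district 7": ["d7_small", "d7_large"],
    "district 8": ["d8_small", "d8_large"],
    "province": list(strata),
}
targets = {"district 7": 15.0, "district 8": 15.0, "province": 10.0}
alloc = greedy_allocation(strata, domains, targets)
```

The heuristic always terminates, because a domain whose strata are all fully enumerated has zero variance, but it gives no guarantee of minimum cost; a proper solver would be needed for that.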
Table 5.2
Variance constraints* in the different contexts
| Contexts | Animals, Province | Animals, District | Trees, Province | Trees, District |
| --- | --- | --- | --- | --- |
| Context 0 | 6.5% | 10% | No constraints | No constraints |
| Context 1 | 10% | 15% | 2.5% | 5% |
| Context 2 | 10% | 15% | 2.5% | 5% |
| Context 3 | 10% | 15% | 2.5% | 5% |
Table 5.3
Main results of the evaluation
Realized coefficients of variation (%) by variable and estimation domain. For Context 2, the values in parentheses are the probability of strong links, the probability of weak links and the number of individuals with a weak link.
| Contexts | Sample size | Animals, Province | Animals, District 7 | Animals, District 8 | Animals, District 9 | Trees, Province | Trees, District 7 | Trees, District 8 | Trees, District 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Context 0 | 2,122 | 6.5 | 10.0 | 10.0 | 10.0 | 1.5 | 6.8 | 12.7 | 1.4 |
| Context 1 | 2,106 | 8.8 | 7.5 | 4.1 | 15.0 | 1.8 | 5.0 | 5.0 | 2.0 |
| Context 2 (0.90, 0.10, 1) | 2,146 | 8.8 | 7.2 | 4.1 | 15.0 | 2.2 | 5.0 | 5.0 | 2.4 |
| Context 2 (0.50, 0.10, 5) | 2,573 | 7.5 | 6.5 | 4.0 | 12.7 | 2.5 | 5.0 | 5.0 | 2.8 |
| Context 2 (0.30, 0.08, 9) | 2,767 | 7.0 | 6.4 | 4.0 | 11.9 | 2.5 | 5.0 | 5.0 | 2.8 |
| Context 2 (0.10, 0.09, 9) | 2,826 | 6.9 | 6.2 | 4.0 | 11.6 | 2.5 | 5.0 | 5.0 | 2.8 |
| Context 3, Option 3.2 | 2,936 | 6.6 | 6.2 | 3.9 | 11.2 | 2.5 | 5.0 | 5.0 | 2.8 |
Looking at the main results of the evaluation in Table 5.3, the following findings emerge:
Context 0 vs Context 1. In both contexts, the farm sample size is about 2,100 farms.
- For Context 0, the expected %CVs of the farm estimates at the district level are exactly at the constraint level of 10% defined for this context.
- In Context 1, all the %CVs of the farm estimates at district level respect the constraint of 15% defined for this context. They are, however, considerably lower than 10% for Districts 7 and 8, showing that these districts are somewhat oversampled with respect to the target precisions. This is because, in the second allocation, part of the farm sample is needed to achieve the required indirect sample of households (FAO, 2014, studies this inefficiency issue in great detail).
- Considering now the precision of the estimates for the household population, we found that the expected sample sizes of households were approximately 5,300 records in both contexts. In Context 0, the %CVs at district level exceed the desired level of 5% in Districts 7 and 8, and are even larger than 12% in District 8. With the sampling allocation resulting from Context 1, the desired precisions of the estimates for the household population are always respected, as are those for the farm population, even though the constraints for the farm estimates have been set larger than those adopted in Context 0.
- Thus, the integrated approach to the sampling allocation carried out in Context 1 enables control of the precision of the estimates for both populations of interest, at the price of some loss in precision for the estimates of the farm population.
Context 1 vs Context 2. For the comparison between Contexts 1 and 2, the analysis focuses on the overall sample sizes, since the %CVs are within the constraint levels in both contexts.
- In the presence of strong links in Context 2, there is only a small increase in the sample size (40 farms), while the CVs remain within the desired level of precision, although slightly increased for the household estimates.
- As the links become weaker, the sample sizes increase significantly. This increase is needed to achieve the expected %CVs for the household estimates.
- Conversely, in Context 2 the expected CVs for the farm estimates are lower than the targeted levels, suggesting that the farms are somewhat oversampled with respect to the target levels of precision.
Context 3 vs other contexts. Having considered Option 3.2 in Table 5.3, Context 3 may be considered as an extreme case of Context 2. In this case too, the analysis focuses on the overall sample sizes, since all the %CVs are within the constraint levels:
- The maximization of the link uncertainty, represented by Option 3.2, causes an increase in the sample size of about 40%: from 2,106 to 2,936.
- Examining Context 2, we note that we obtain results similar to those of Context 3 when the probability of the strong link is around 10%.
- In this case too, the farms are somewhat oversampled with respect to the target levels of precision.
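The sample size increases relative to Context 1 can be checked directly from the figures in Table 5.3. The short sketch below only performs this arithmetic; the dictionary labels are informal descriptions of the table rows.

```python
# Farm sample sizes taken from Table 5.3.
context1_n = 2106
sample_sizes = {
    "Context 2 (0.90, 0.10, 1)": 2146,
    "Context 2 (0.50, 0.10, 5)": 2573,
    "Context 2 (0.30, 0.08, 9)": 2767,
    "Context 2 (0.10, 0.09, 9)": 2826,
    "Context 3, Option 3.2": 2936,
}

# Percentage increase of each sample size with respect to Context 1.
increase_pct = {
    label: round(100 * (n - context1_n) / context1_n, 1)
    for label, n in sample_sizes.items()
}

for label, pct in increase_pct.items():
    print(f"{label}: +{pct}%")
```

The increase ranges from about 1.9% with strong links to about 39.4% under Option 3.2.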
More detailed analysis of Context 3. Below, some more detailed analyses are illustrated, aimed at clarifying some aspects of the problem of sampling allocation for the integrated observation of two related populations. We explore Option 3.1 and the proportional allocation proposed in Remark 4.6 because of their practical importance. For the proportional allocation, we considered as measure of size (see Remark 4.6) the total number of employed people. The required quantities are obtained by expression (4.10). In this context, a parameter value has to be defined. In order to identify a single value, we exploited the data of Context 1: we first computed a coefficient of variation for each stratum, then computed specific parameter values at stratum level, and finally obtained the value considered in this evaluation as a weighted mean of the stratum values. We computed the weights with two different alternatives, resulting in the two values 2.75 and 2.16 (see Table 5.4). With the first alternative, the weights of the mean were defined proportional to a sum of weights at stratum level; with the second alternative, they were defined proportional to a quantity depending on the mean value of the variable and the number of units in the stratum. For each alternative, we ran problem (4.12), with the constraints defined in Table 5.2 for Context 1, obtaining an overall sample size equal to 1,639 and 1,517, respectively. The main results of the experiment are illustrated in Table 5.4, in which for both parameter values we show: (i) the expected %CVs, obtained as solution of problem (4.12) under the hypothesis that relation (4.11) holds; (ii) the true expected %CVs, that is, those obtained under Context 1 on the basis of the stratum sample sizes defined by the solution of problem (4.12); and (iii) the true %CVs obtained, under Context 1, with the proportional allocation proposed in Remark 4.6.
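The step that collapses stratum-level values into a single parameter value is a weighted mean, and the sketch below shows how the choice of weights changes the result. The stratum values and both weight vectors are purely hypothetical; they are not the figures of the study and do not reproduce the values 2.75 and 2.16.

```python
def weighted_mean(values, weights):
    """Weighted mean of stratum-level values."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical stratum-level parameter values.
stratum_values = [1.5, 2.0, 4.0]

# Two hypothetical weighting alternatives, each defined up to a constant.
weights_first = [1.0, 1.0, 2.0]
weights_second = [3.0, 2.0, 1.0]

value_first = weighted_mean(stratum_values, weights_first)
value_second = weighted_mean(stratum_values, weights_second)
```

Here the first alternative puts more weight on the stratum with the largest value and therefore yields the larger overall value, which mirrors the fact that the two alternatives in the text lead to different parameter values and different sample sizes.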
Table 5.4
Expected and realized %CVs of the domain estimates of total number of trees with the sampling allocation obtained as solution of problem (4.12) and proportional allocation
Columns: (a) expected %CV, obtained as solution of problem (4.12), assuming that (4.11) holds; (b) true expected %CV, under Context 1, with allocation defined by (4.12); (c) true expected %CV under Context 1, with proportional allocation. Each is reported for the parameter value 2.75 (sample size 1,639) and for the parameter value 2.16 (sample size 1,517).
| Estimation domains | 2.75, 1,639: (a) | 2.75, 1,639: (b) | 2.75, 1,639: (c) | 2.16, 1,517: (a) | 2.16, 1,517: (b) | 2.16, 1,517: (c) |
| --- | --- | --- | --- | --- | --- | --- |
| Province | 2.11 | 1.94 | 1.76 | 2.11 | 2.04 | 1.83 |
| District 7 | 4.95 | 6.80 | 6.10 | 4.95 | 8.20 | 6.34 |
| District 8 | 4.99 | 6.45 | 13.23 | 4.99 | 6.45 | 13.79 |
| District 9 | 2.36 | 2.0 | 1.81 | 2.36 | 2.0 | 1.88 |
The main findings of this evaluation are the following:
- The strategy proposed by Option 3.1 seems to be effective, since it allows control of the sampling errors, avoiding situations where they greatly exceed the desired accuracy for the different estimation domains.
- With the use of a single parameter value, the true expected %CVs under the allocation defined by (4.12) (see Table 5.4) are larger than the defined benchmarks for some estimation domains, while in others the estimates are much more accurate than required.
- The choice of a larger value of the parameter seems to be a safe choice, if the main objective of the sampling allocation is to avoid overly large sampling errors in specific estimation domains.
- Even if it seems effective for the accuracy of the overall estimate at province level, the proportional allocation (see Table 5.4) does not allow control of extreme discrepancies from the expected accuracy in some estimation domains (see District 8).
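These findings can be verified by comparing the true expected %CVs of Table 5.4 with the Context 1 constraints for the number of trees (2.5% at province level, 5% at district level). The sketch below only restates the table figures; the dictionary keys are informal labels.

```python
# Context 1 constraints on the %CV of the number of trees (Table 5.2).
constraints = {"Province": 2.5, "District 7": 5.0, "District 8": 5.0, "District 9": 5.0}

# True expected %CVs from Table 5.4, by allocation and parameter value.
true_cv = {
    "allocation (4.12), 2.75": {"Province": 1.94, "District 7": 6.80, "District 8": 6.45, "District 9": 2.0},
    "proportional, 2.75": {"Province": 1.76, "District 7": 6.10, "District 8": 13.23, "District 9": 1.81},
    "allocation (4.12), 2.16": {"Province": 2.04, "District 7": 8.20, "District 8": 6.45, "District 9": 2.0},
    "proportional, 2.16": {"Province": 1.83, "District 7": 6.34, "District 8": 13.79, "District 9": 1.88},
}

# Excess over the constraint, in percentage points, for the violated domains.
excess = {
    name: {d: round(cv - constraints[d], 2) for d, cv in cvs.items() if cv > constraints[d]}
    for name, cvs in true_cv.items()
}
worst_excess = {name: max(e.values()) for name, e in excess.items()}
```

In every case the constraints are exceeded only in Districts 7 and 8. The largest excess is 1.8 and 3.2 percentage points for the allocation defined by (4.12), against more than 8 percentage points for the proportional allocation in District 8.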
5.2 Evaluation on costs
This evaluation considers Context 1, in which the sampling frames for both populations are available and it is possible to build an integrated observation of the two populations. We focus on two observational strategies. The first considers two independent samples, one for farms and one for individuals; therefore, a truly integrated analysis cannot be performed. The second applies an integrated sampling design that selects a direct sample of farms and an indirect sample of the households of the workers of the sampled farms.
We adopted the variance constraints established for Context 1 (see Table 5.5).
Table 5.5
Variance constraints* in the evaluation on costs
| Animals, Province | Animals, District | Trees, Province | Trees, District |
| --- | --- | --- | --- |
| 10% | 15% | 2.5% | 5% |
For the direct sampling designs, we adopted a SSRSWOR design, where the farm population was stratified by cross-classifying the districts and the size classes of the farms, and the household population was stratified by district. The cost of interviewing a farm takes the values 1, 2, 5 and 10, which leads to four different evaluations. The cost of interviewing an individual is set equal to 1.
For the indirect sampling designs, we define the overall cost of interviewing the farm and the farm workers together by two different specifications of equation (3.2), referred to as cost functions (5.2) and (5.3). As its argument increases, cost function (5.3) increases less than cost function (5.2).
We perform a precision-constrained optimal allocation for both independent sampling designs. The different values of the farm interview cost (1, 2, 5 and 10) do not affect the farm sample size, while the cost of the farm sample increases proportionally. Given the variance constraints in Table 5.5, with the independent strategy the sample sizes of farms and individuals are 1,010 and 3,388, respectively. The total cost is then 4,398 when the cost per farm interview is set to 1. In the integrated sample strategy, the costs do affect the allocation, essentially because, if the farm interview cost increases, the number of sampled farms decreases and the allocation increases the sample sizes of the strata with the largest farms.
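The total cost of the independent strategy follows directly from the two sample sizes and the unit costs stated above. The sketch below reproduces this arithmetic for the four farm interview costs; the variable names are hypothetical.

```python
# Sample sizes of the independent strategy, as reported in the text.
n_farms, n_individuals = 1010, 3388
cost_per_individual = 1

# Total cost for each value of the cost per farm interview.
independent_cost = {
    c: c * n_farms + cost_per_individual * n_individuals for c in (1, 2, 5, 10)
}
print(independent_cost)  # {1: 4398, 2: 5408, 5: 8438, 10: 13488}
```

The cost is linear in the farm interview cost because the sample sizes of the independent strategy do not depend on it.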
Table 5.6 below shows the sample sizes of farms and the expected sample sizes of individuals when cost model (5.2) is used to calculate the costs of individual interviews in the integrated allocation. We see that the farm sample is more than double the sample size obtained when considering farms alone (1,010). The increase in size is due to the precision constraints on the household estimates.
Table 5.6
Sample sizes for the integrated sample allocation, when the overall individual costs are given by (5.2)
| Cost per farm interview | 1 | 2 | 5 | 10 |
| --- | --- | --- | --- | --- |
| Farms | 2,388 | 2,289 | 2,190 | 2,137 |
| Individuals | 4,504 | 4,491 | 4,862 | 4,905 |
Table 5.7 below shows the allocation when equation (5.3) is used for the cost of individual interviews in the integrated allocation.
Table 5.7
Sample sizes for the integrated sample allocation, when the overall individual costs are given by (5.3)
| Cost per farm interview | 1 | 2 | 5 | 10 |
| --- | --- | --- | --- | --- |
| Farms | 2,135 | 2,121 | 2,111 | 2,108 |
| Individuals | 4,834 | 4,874 | 5,283 | 5,360 |
Tables 5.6 and 5.7 show that the integrated sample size of farms is roughly twice that of the independent allocation of farms. Thus, the expected variance of the farm estimates will be much lower than the desired variance constraints, suggesting that the integrated sample allocation mainly depends on the variance constraints related to the individual parameters to be estimated.
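The "roughly twice" statement can be checked by dividing the integrated farm sample sizes of Tables 5.6 and 5.7 by the independent farm sample size of 1,010 (the value consistent with the total cost of 4,398 reported earlier). The variable names below are hypothetical.

```python
independent_farms = 1010

# Farm sample sizes of the integrated allocation, by cost per farm interview.
integrated_farms = {
    "cost function (5.2)": {1: 2388, 2: 2289, 5: 2190, 10: 2137},
    "cost function (5.3)": {1: 2135, 2: 2121, 5: 2111, 10: 2108},
}

# Ratio of integrated to independent farm sample size.
ratio = {
    name: {c: round(n / independent_farms, 2) for c, n in sizes.items()}
    for name, sizes in integrated_farms.items()
}
```

All ratios lie between about 2.09 and 2.36, and they decrease as the cost per farm interview grows.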
Figures 5.1 and 5.2 show the cost of independent and integrated sampling. The integrated observational strategy is generally more expensive, except when the cost per farm interview is equal to 1 and the cost function is given by (5.3). In this evaluation, the integrated nature of the sample is not needed, as no cross-tabulation of farm population variables with household population variables is examined; hence, the independent allocation is more efficient in terms of precision. Another cost function could, however, partially rebalance the two observational strategies in terms of costs.

Description for Figure 5.1
This diagram shows the respective costs of the independent sample and the integrated sample according to formula (5.2) with the cost on the y-axis ranging from 0 to 30,000 and the values on the x-axis from 0 to 10. The costs are linear and increasing with the integrated sample higher than the independent sample.

Description for Figure 5.2
This diagram shows the respective costs of the independent sample and the integrated sample according to formula (5.3) with the cost on the y-axis ranging from 0 to 25,000 and the values on the x-axis from 0 to 10. The costs are linear and increasing with the integrated sample higher than the independent sample.