Cost optimal sampling for the integrated observation of different populations
Section 5. Empirical results

The results herein illustrated are obtained using real data from Districts 7, 8, 9 of the Gaza Province, Mozambique. They summarize the empirical results from an evaluation study illustrated in FAO (2014). Other empirical results of the proposed strategies (FAO, 2015) have been conducted on the database of agricultural households from Burkina Faso’s General Census of Agriculture and confirm the general results illustrated below.

In the analysis using Mozambique data, the population $U^{A}$ refers to farms. The database used for the experiment includes environmental and economic variables and combines information from the 2007 census of large and medium farms with a sample survey of small farms for the same year. The overall number of records is about 36,890, of which 890 are large and medium farms.

The second population, $U^{B},$ is the 2007 household census. The database’s records are the individuals involved in agricultural, fishing, or forestry activities. The database contains approximately 54,000 records and includes several socio-demographic, environmental and economic variables. The databases of the two populations were merged, creating a Master Sampling Frame (MSF) with artificial links between individuals and farms. The merging procedure exploited the following variables: for individuals, the type of job and the district of residence; for farms, the sector, the district and the number of employed persons by type of job. Before merging, $U^{B}$ was cleaned by discarding the records that lacked the job type variable (approximately 9,000 records), which leaves the roughly 45,000 records reported in Table 5.1. Of these, approximately 36,000 records report being a farm holder without any employed persons. For these cases, a one-to-one farm-individual link was defined. The remaining individuals were linked with the 890 large and medium farms, according to the following hierarchical rules:

Each farm was linked to a number of individuals equal to the number of workers, depending upon job type;
Individuals and farms in the same district were linked;
Individuals were linked with private/public governmental farms when the type of employment and the farm sector agree.

The links were generated randomly, according to the categories defined by the hierarchical rules. The exercise did not seek to predict the links that actually exist in the two populations, but rather to create a realistic dataset for the evaluation.
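The random linking step can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in the study: the record layout (`id`, `district`, `job_type`, `workers`) and the function name `generate_links` are hypothetical, each individual is assumed to be linked to at most one farm, and the sector agreement rule is assumed to be already encoded in the job type.

```python
import random
from collections import defaultdict


def generate_links(farms, individuals, seed=0):
    """Randomly link individuals to farms within (district, job type) cells.

    farms: list of dicts with keys 'id', 'district' and 'workers',
           where 'workers' maps a job type to a number of workers.
    individuals: list of dicts with keys 'id', 'district', 'job_type'.
    Returns a list of (farm_id, individual_id) pairs.
    """
    rng = random.Random(seed)

    # Group the individuals by the categories implied by the rules.
    pool = defaultdict(list)
    for person in individuals:
        pool[(person["district"], person["job_type"])].append(person["id"])
    for cell in pool.values():
        rng.shuffle(cell)  # random order inside each category

    links = []
    for farm in farms:
        for job_type, count in farm["workers"].items():
            cell = pool[(farm["district"], job_type)]
            # Link as many individuals as the farm has workers of this type,
            # or fewer if the category runs out of candidates.
            for _ in range(min(count, len(cell))):
                links.append((farm["id"], cell.pop()))
    return links


example_farms = [{"id": "F1", "district": 7,
                  "workers": {"manager": 1, "labourer": 2}}]
example_individuals = [
    {"id": "P1", "district": 7, "job_type": "labourer"},
    {"id": "P2", "district": 7, "job_type": "labourer"},
    {"id": "P3", "district": 8, "job_type": "labourer"},
    {"id": "P4", "district": 7, "job_type": "manager"},
]
example_links = generate_links(example_farms, example_individuals)
```

In the toy data, the individual living in district 8 is never linked to the district 7 farm, which mirrors the second rule.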

Although the datasets of the two populations include several variables, in this study we have decided to focus on two of these. For $U^{A},$ we consider the number of animals, while for $U^{B},$ we consider the number of trees. This is in order to better highlight the impact (in terms of both accuracy and sample size) that the different contexts, described in Section 4, have on the sample of the population $U^{B} .$ Summary statistics on these variables are shown in Table 5.1.

Table 5.1
The variables used in the simulation with data from Mozambique
Population^Note *	Number of records^Note *	Variable	Mean value	%CV^Note **
$U^{A} :$ Farms	36,890^Note ***	Number of animals	11.1	681.6
$U^{B} :$ Households	45,000	Number of trees	4.5	107.5
Note * Districts 7, 8, 9 of the Gaza Province, Mozambique.
Note ** %CV = (population standard deviation/mean) $\times$ 100.
Note *** Of these, 890 are large and medium farms.

For both populations, we have considered as domains of interest the districts (3 domains) and the province (1 domain). Therefore, in total we consider 8 target totals of interest (2 variables $\times$ 4 domains).

5.1 Optimal designs for the different contexts

In the following, we address four contexts:

Context 0. No control on the sample of the population $U^{B} .$ The sample is planned by controlling only the accuracy of the estimates of the farm variables. Once the sample for $U^{A}$ is selected, the units of $U^{B}$ linked to the sampled units of $U^{A}$ are included in the sample via the indirect sampling mechanism. The expected percent CVs (%CV) of the estimates obtained from the indirect sample of households are then computed as $\% CV = (\sqrt{AV (\hat{Y})} / Y) \times 100.$

Context 1. Sampling frames exist for both populations. All links are known and an integrated sample design is used, finding an optimal solution considering both populations. Therefore, the multivariate allocation is carried out, controlling the accuracy of estimates from both the direct sample of farms and the indirect sample of individuals.

Context 2. Sampling frames exist for both populations, but links are estimated probabilities and an integrated sample design is used.

Context 3. A frame exists only for the population $U^{A} .$ An integrated sample design is studied considering Options 3.1 and 3.2, which represent the most feasible solutions in real contexts.

Contexts 1, 2 and 3 are those defined in Section 4. Context 0 is introduced because it represents a useful tool for the evaluation of the integrated strategy.

A stratified sampling mechanism is assumed for the first population $U^{A},$ where the strata $U_{h}^{A}$ are defined as districts (7, 8 and 9) by size class (1, 2, 3-4, 5-9, 10-19, 20-49, 50-99, 100+) based on the number of farm workers, thus obtaining 21 strata. As regards models (4.1), we considered mean stratum models with ${\tilde{y}}_{j, v} = {\tilde{y}}_{v, (h)}$ and $σ_{j, v}^{2} = σ_{v, (h)}^{2}$ for $j \in U_{h}^{A} .$ These specifications lead to a standard SSRSWOR design for the farms where the strata coincide with the planned domains (see Falorsi and Righi, 2015, Remark 4.2). For the evaluation we used the exact formula of the variance for a SSRSWOR instead of using the approximation of variance for a SSRSWOR given in Section 2; however the two expressions are substantially equivalent. For $U^{B},$ we also consider a mean model, defined at district level $d,$ with ${\tilde{y}}_{i, r} = {\tilde{y}}_{r, (d)},$ and $σ_{i, r}^{2} = σ_{r, (d)}^{2}$ for $i \in U_{d}^{B} .$
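As a worked illustration of the quantities involved, the sketch below computes the textbook exact variance of the expansion estimator of a total under SSRSWOR, $\sum_{h} N_{h} (N_{h} - n_{h}) S_{h}^{2} / n_{h},$ and the corresponding %CV. The function names and the example figures are hypothetical and are not taken from the Mozambique data.

```python
import math


def ssrswor_variance(N, n, S2):
    """Exact variance of the estimated total under stratified SRSWOR.

    N:  stratum population sizes N_h
    n:  stratum sample sizes n_h (0 < n_h <= N_h)
    S2: stratum population variances S_h^2
    """
    return sum(Nh * (Nh - nh) * s2 / nh for Nh, nh, s2 in zip(N, n, S2))


def percent_cv(variance, total):
    """Percent coefficient of variation of an estimated total."""
    return 100.0 * math.sqrt(variance) / total


# Hypothetical example: the second stratum is enumerated completely,
# so it contributes nothing to the variance.
example_variance = ssrswor_variance(N=[100, 50], n=[10, 50], S2=[4.0, 9.0])
example_cv = percent_cv(example_variance, total=1000.0)
```

With these figures the variance is 3,600 and the %CV is 6.0.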

The evaluation studies use software, developed in the $R$ language, that implements the optimal sampling for the standard SSRSWOR designs as well as for more general sampling designs (e.g., balanced designs and incomplete stratification designs). It is available at http://www.istat.it/en/tools/methods-and-it-tools/design-tools/multiwaysampleallocation. Once installed, the software features a comprehensive user guide in English. Another software package, MAUSS-R, which considers only the SSRSWOR designs, is available at http://www.istat.it/it/strumenti/metodi-e-software/software/mauss-rdownload.

For each context, the variance constraints are expressed in terms of %CVs. The analyses presented in this section are focused on the contexts, and we use a simplified version of the cost functions. The cost $c_{j}$ for observing the unit $j$ in the population $U^{A}$ together with the linked units in the population $U^{B}$ is fixed as equal to 1. More detailed analyses on costs are presented in Section 5.2.

Some further specifications for each context are herein illustrated (see Table 5.2).

Context 0. The variance constraints are fixed (only for the farm estimates: number of animals) at 6.5% at the province level and at 10% at the district level, resulting in a sample of 2,122 farms.

Context 1. The constraints for the farm and household estimates have been fixed so as to yield a sample of roughly 2,100 farms. To this end, the variance constraints for the farm estimates (animals) are fixed at 10% at the province level and at 15% at the district level. Those for the household estimates are fixed at 2.5% at the province level and 5% at the district level. This choice of constraints makes it possible to compare the two contexts with roughly the same sample size, even though in Context 1 the variance constraints on the estimates of the population $U^{A}$ are looser than those fixed in Context 0.

Context 2. The CV constraints for the household and farm estimates are equal to those adopted in Context 1. The integrated observation is planned in the sample design phase by taking into account the uncertainty in the links. This has been carried out by considering a simplified model which assumes that, for each worker in a given farm, there is only one strong link (with value $ψ$) with an individual in the population of households and $α$ weak links (with value $τ$) with other individuals in the same district, where $ψ$ and $τ$ are probabilities with $ψ ≫ τ .$ Let $l_{j ω, i k}$ denote the link between the worker $ω$ of the farm $j$ and the individual $k$ of the household $i$ and suppose that these links follow a Bernoulli model $M_{l},$ where

$E_{M_{l}} (l_{j ω, i k}) = λ_{j ω, i k} = \begin{cases} ψ & \text{for only one worker } j ω \in U^{A} \text{ and one individual } i k \in U^{B} \\ τ & \text{for only one worker } j ω \in U^{A} \text{ and } α \text{ individuals } i k \in U^{B} \end{cases} \qquad (5.1)$

in which $τ = \frac{1 - ψ}{α} .$
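A small sketch of the link model may help. Since $τ = (1 - ψ) / α,$ the probabilities attached to one worker add up to $ψ + α τ = 1,$ so that each worker has one expected link in total. The function names below are hypothetical and only illustrate this arithmetic.

```python
def weak_link_probability(psi, alpha):
    """Return tau = (1 - psi) / alpha for the simplified link model."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must be a probability")
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    return (1.0 - psi) / alpha


def link_probabilities(psi, alpha):
    """Probabilities of the one strong link and the alpha weak links
    attached to a single worker."""
    tau = weak_link_probability(psi, alpha)
    return [psi] + [tau] * alpha


# One strong link with probability 0.50 and five weak links.
example_probs = link_probabilities(psi=0.50, alpha=5)
```

For $ψ = 0.50$ and $α = 5$ this gives $τ = 0.10.$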

In the simulation we have considered different combinations of values of the probabilities of strong links, $ψ,$ of weak links, $τ,$ and of the number of individuals, $α,$ with a weak link. These combinations are illustrated in Table 5.3.

Context 3. The CV constraints for the household and farm estimates are equal to those adopted in Context 1. In Table 5.3, we derived the allocation considering Option 3.2, proposed for Context 3. The results of Option 3.1 are presented at the end of this section.

Finally, note that for all three contexts, the optimization problem has been set up in terms of $π_{j}^{A} .$ With a SSRSWOR design, this may be seen as a problem of allocation for stratified sampling.
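To make the idea of a precision-constrained allocation concrete, the sketch below solves a deliberately simplified version of the problem: one variable, one domain, one %CV constraint and stratum-level unit costs, for which the classical closed-form solution applies. It is not the multivariate, multi-domain algorithm implemented in the software cited above; the function name and the input figures are hypothetical, and the sketch neither rounds the sizes nor caps them at $N_{h}.$

```python
import math


def allocate_single_constraint(N, S, cost, total, cv_target):
    """Cost-minimal stratum sample sizes under a single %CV constraint.

    Minimises sum(c_h * n_h) subject to
    sum(N_h^2 S_h^2 / n_h) - sum(N_h S_h^2) <= V*,
    where V* = (cv_target / 100 * total)^2.
    Returns real-valued n_h (no rounding, no cap at N_h).
    """
    v_star = (cv_target / 100.0 * total) ** 2
    a = sum(Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N, S, cost))
    b = v_star + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    k = a / b
    # n_h is proportional to N_h * S_h / sqrt(c_h).
    return [k * Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N, S, cost)]


# Hypothetical two-stratum population with a 5% CV target on the total.
N_h = [1000, 200]
S_h = [2.0, 10.0]
n_h = allocate_single_constraint(N_h, S_h, cost=[1.0, 1.0],
                                 total=5000.0, cv_target=5.0)

# Variance achieved by the allocation; it meets the constraint exactly.
achieved_variance = sum(N * (N - n) * S ** 2 / n
                        for N, n, S in zip(N_h, n_h, S_h))
```

With several variables and domains, as in this section, the constraints interact and no closed form is available, which is why a numerical optimizer is needed.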

Table 5.2
Variance constraints in the different contexts
Contexts	Variance Constraints^Note *
	$U^{A} :$ variable Animals		$U^{B} :$ variable Trees
	Province	District	Province	District
Context 0	6.5%	10%	No constraints	No constraints
Context 1	10%	15%	2.5%	5%
Context 2	10%	15%	2.5%	5%
Context 3	10%	15%	2.5%	5%
Note * Expressed in terms of %CV. Return to note * referrer

Table 5.3
Main results of the evaluation
Contexts		Sample size	Realized coefficients of variation (%)
			$U^{A} :$ variable Animals				$U^{B} :$ variable Trees
			Province	District 7	District 8	District 9	Province	District 7	District 8	District 9
Context 0		2,122	6.5	10.0	10.0	10.0	1.5	6.8	12.7	1.4
Context 1		2,106	8.8	7.5	4.1	15.0	1.8	5.0	5.0	2.0
Context 2	$ψ =$ 0.90, $τ =$ 0.10, $α =$ 1	2,146	8.8	7.2	4.1	15.0	2.2	5.0	5.0	2.4
	$ψ =$ 0.50, $τ =$ 0.10, $α =$ 5	2,573	7.5	6.5	4.0	12.7	2.5	5.0	5.0	2.8
	$ψ =$ 0.30, $τ =$ 0.08, $α =$ 9	2,767	7.0	6.4	4.0	11.9	2.5	5.0	5.0	2.8
	$ψ =$ 0.10, $τ =$ 0.09, $α =$ 9	2,826	6.9	6.2	4.0	11.6	2.5	5.0	5.0	2.8
Context 3	Option 3.2	2,936	6.6	6.2	3.9	11.2	2.5	5.0	5.0	2.8

Looking at the main results of the evaluation, reported in Table 5.3, the following findings emerge:

Context 0 vs Context 1. In the two contexts, the farm sample size is about 2,100 farms.

For Context 0, the expected %CVs of the farm estimates at the district level are exactly at the constraint level of 10%, defined for this context.
In Context 1, all the %CVs of the farm estimates at the district level respect the 15% constraint defined for this context. They are, however, considerably lower than 10% for Districts 7 and 8 (7.5% and 4.1%), showing that these districts are oversampled with respect to the target precision. This is because, in this allocation, part of the farm sample is needed to achieve the required indirect sample of households (FAO, 2014, studies this inefficiency issue in great detail).
Considering now the precision of the estimates for the population $U^{B},$ the expected sample sizes of households were approximately 5,300 records in both contexts. In Context 0, the %CVs for Districts 7 and 8 (6.8% and 12.7%) exceed the 5% level required in Context 1. With the sampling allocation resulting from Context 1, the desired precision of the estimates for population $U^{B}$ is always respected, as is that for population $U^{A},$ even though the constraints for the latter were set looser than those adopted in Context 0.
Thus, the integrated approach to the sampling allocation carried out in Context 1 enables control of the precision of the estimates for both populations of interest, at the price of some loss in precision for the estimates for population $U^{A} .$

Context 1 vs Context 2. For the comparison between the Contexts 1 and 2, the analysis focuses upon the overall sample sizes, since the %CVs are under the constraint levels in both contexts.

In the presence of strong links for Context 2 $(ψ = 0.90, τ = 0.10, α = 1),$ there is only a small increase in the sample size (40 farms), while the %CVs remain under the desired level of precision, although they are slightly higher for the household estimates.
As the links become weaker, the sample sizes increase significantly. The larger samples are needed to achieve the required %CVs for the household estimates.
Conversely, in Context 2 the expected %CVs for the farm estimates are lower than the target levels, suggesting that the farms are somewhat oversampled with respect to the target levels of precision.

Context 3 vs other contexts. With Option 3.2, as reported in Table 5.3, Context 3 may be considered an extreme case of Context 2. In this case too, the analysis focuses on the overall sample sizes, since all the %CVs are under the constraint levels:

The maximization of the link uncertainty, represented by Option 3.2, causes an increase in the sample size of about 40% with respect to Context 1: from 2,106 to 2,936 farms.
Examining Context 2, we note that we obtain results similar to those of Context 3 when the level, $ψ,$ of the strong link is around 10%.
Even in this case, the farms are somewhat oversampled with respect to the target levels of precision.
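The size of these increases can be checked directly from the farm sample sizes in Table 5.3. The short sketch below computes the percentage increase of each allocation with respect to Context 1; the labels and the function name are only illustrative.

```python
# Farm sample sizes reported in Table 5.3.
BASELINE = 2106  # Context 1

sample_sizes = {
    "Context 2, psi=0.90, alpha=1": 2146,
    "Context 2, psi=0.50, alpha=5": 2573,
    "Context 2, psi=0.30, alpha=9": 2767,
    "Context 2, psi=0.10, alpha=9": 2826,
    "Context 3, Option 3.2": 2936,
}


def percent_increase(size, baseline=BASELINE):
    """Percentage increase of a sample size over the Context 1 baseline."""
    return 100.0 * (size - baseline) / baseline


increases = {label: round(percent_increase(size), 1)
             for label, size in sample_sizes.items()}
```

The increase ranges from about 2% with strong links to about 39% under Option 3.2.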

More detailed analysis of Context 3. Below, some more detailed analyses are illustrated, aimed at clarifying some aspects of the problem of sampling allocation for the integrated observation of two related populations. We explore Option 3.1 and the proportional allocation proposed in Remark 4.6 because of their practical importance. For the proportional allocation, we took the total number of employed people as the measure of size (see Remark 4.6). The ${\tilde{z}}_{j, r}$ are obtained by expression (4.10). In this context, we have to define the $k_{r}$ value. In order to identify a single $k_{r}$ value, we exploited the data of Context 1 and first computed, for each stratum, the coefficient of variation of $z_{j, r},$ denoted $CV (z_{h, r}) .$ Then, specific $k_{r}$ values were computed at stratum level as $k_{h, r} = 1 + {[CV (z_{h, r})]}^{2},$ and finally the $k_{r}$ value considered in this evaluation was obtained as a weighted mean of the $k_{h, r}$ values: $k_{r} = \sum_{h} k_{h, r} w_{h} .$ We computed the weights $w_{h}$ in two alternative ways, resulting in the two values $k_{r} = 2.75$ and $k_{r} = 2.16.$ In the first alternative, the $w_{h}$ were defined proportional to the sum of the weights $L_{j}^{A}$ at stratum level; in the second, the $w_{h}$ were defined proportional to the quantity $\sqrt{CV (z_{h, r})} \, {\bar{Y}}_{r, h}^{B} N_{h}^{A},$ where ${\bar{Y}}_{r, h}^{B}$ and $N_{h}^{A}$ are the mean value of variable $y_{r}$ and the number of units in the stratum, respectively. For each alternative, we ran the problem (4.12), with the constraints defined in Table 5.2 for Context 1, obtaining overall sample sizes $n^{A}$ of 1,639 and 1,517, respectively.
The main results of the experiment are illustrated in Table 5.4, in which for both $k_{r}$ values we show: (i) the expected %CVs, obtained as solution of problem (4.12) under the hypothesis that relation (4.11) holds; (ii) the true expected %CVs, that is, those obtained under Context 1 on the basis of the stratum sample sizes defined by the solution of the problem (4.12); and (iii) the true %CVs obtained, under Context 1, with the proportional allocation proposed in Remark 4.6.
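The derivation of a single $k_{r}$ from the stratum-level values can be sketched as follows. The function name and the input figures are hypothetical; the sketch assumes that the coefficients of variation are expressed as ratios (not percentages) and that the weights $w_{h}$ are normalized to sum to one.

```python
def pooled_k(cv_by_stratum, weights):
    """Weighted mean of k_h = 1 + CV_h^2 over the strata.

    cv_by_stratum: stratum coefficients of variation, as ratios.
    weights: non-negative stratum weights; they are normalised here,
             so only their relative size matters.
    """
    k_h = [1.0 + cv ** 2 for cv in cv_by_stratum]
    total_weight = sum(weights)
    return sum(k * w for k, w in zip(k_h, weights)) / total_weight


# Hypothetical example with two strata.
example_k = pooled_k(cv_by_stratum=[1.0, 0.5], weights=[1.0, 3.0])
```

Here the stratum values are 2 and 1.25, and the weighted mean is 1.4375. Changing the weights shifts the pooled value towards the strata that receive more weight, which is how the two alternatives described above lead to different $k_{r}$ values.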

Table 5.4
Expected and realized %CVs of the domain estimates of total number of trees with the sampling allocation obtained as solution of problem (4.12) and proportional allocation
Estimation Domains	$k_{r} =$ 2.75, $n^{A} =$ 1,639			$k_{r} =$ 2.16, $n^{A} =$ 1,517
Estimation Domains	Expected %CV, obtained as solution of problem (4.12), assuming that (4.11) holds	True expected %CV, under Context 1, with allocation defined by (4.12)	True expected %CV under Context 1, with proportional allocation	Expected %CV, obtained as solution of problem (4.12), assuming that (4.11) holds	True expected %CV, under Context 1, with allocation defined by (4.12)	True expected %CV under Context 1, with proportional allocation
Province	2.11	1.94	1.76	2.11	2.04	1.83
District 7	4.95	6.80	6.10	4.95	8.20	6.34
District 8	4.99	6.45	13.23	4.99	6.45	13.79
District 9	2.36	2.0	1.81	2.36	2.0	1.88

The main findings of this evaluation are the following:

The strategy proposed by Option 3.1 seems to be effective, since it allows control of the sampling errors, avoiding the situation where these exceed by a large amount the desired accuracy for the different estimation domains.
With the use of a single $k_{r},$ the true expected %CVs (the columns of Table 5.4 headed "True expected %CV, under Context 1, with allocation defined by (4.12)") are larger than the defined benchmarks for some estimation domains, while for others the estimates are much more accurate than required.
The choice of a larger value of the $k_{r}$ parameter seems to be a safe choice if the main objective of the sampling allocation is to avoid excessively large sampling errors in specific estimation domains.
Even if it seems effective for the accuracy of the overall estimate at province level, the proportional allocation (the columns of Table 5.4 headed "True expected %CV under Context 1, with proportional allocation") does not allow control of extreme discrepancies from the expected accuracy in some estimation domains (see District 8).

5.2 Evaluation on costs

This evaluation considers Context 1 in which the sampling frames for both populations are available, and in which it is possible to build an integrated observation of the two populations. We focus on two observational strategies: the first considers two independent samples, one for farms and one for individuals. Therefore, a truly integrated analysis cannot be performed. The second observational strategy applies an integrated sampling design that selects a direct sample of farms and an indirect sample of the households of the workers of the sampled farms.

We adopted the variance constraints established for Context 1 (see Table 5.5).

Table 5.5
Variance Constraints in the evaluation on costs
Variance Constraints ^Note *
$U^{A} :$ variable Animals		$U^{B} :$ variable Trees
Province	District	Province	District
10%	15%	2.5%	5%
Note * Expressed in terms of %CV. Return to note * referrer

For the direct sampling designs, we adopted a SSRSWOR design, where the population $U^{A}$ was stratified by cross-classifying the districts and the size classes of the farms, and the population $U^{B}$ was stratified by district. The cost for interviewing the farms varies $(C^{A} = 1, 2, 5 \text{ and } 10),$ which leads to four different evaluations. The cost $C^{B}$ for interviewing an individual is set equal to 1.

For indirect sampling designs, we define the overall cost of interviewing the farm and the farm's workers together by two different specifications of equation (3.2):

$c_{j} = C^{A} + L_{j}^{A} C^{B}, \qquad (5.2)$

$c_{j} = C^{A} + \sqrt{L_{j}^{A}} C^{B} . \qquad (5.3)$

As $L_{j}^{A}$ increases, the cost function (5.3) grows more slowly than the cost function (5.2).
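The two cost specifications can be compared with a few lines of code. The function names and the example values are hypothetical; `links` stands for $L_{j}^{A},$ the number of individuals linked to farm $j.$

```python
import math


def cost_linear(c_a, c_b, links):
    """Cost (5.2): farm interview plus one full interview per linked unit."""
    return c_a + links * c_b


def cost_sqrt(c_a, c_b, links):
    """Cost (5.3): the cost of the linked interviews grows with sqrt(links)."""
    return c_a + math.sqrt(links) * c_b


# Hypothetical farm with 9 linked individuals, C^A = 2 and C^B = 1.
example_linear = cost_linear(c_a=2, c_b=1, links=9)
example_sqrt = cost_sqrt(c_a=2, c_b=1, links=9)
```

For this farm the cost is 11 under (5.2) and 5 under (5.3). The two functions coincide when the farm has zero or one linked individual and diverge as the number of links grows.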

We perform a precision-constrained optimal allocation for both independent sampling designs. The different $C^{A}$ values (1, 2, 5 and 10) do not affect the farm sample size, while the cost of the farm sample increases proportionally. Given the variance constraints in Table 5.5, with the independent strategy the sample sizes of farms and individuals are 1,010 and 3,388, respectively. The total cost is then 4,398 when setting $C^{A} = 1.$ In the integrated sample strategy, the costs do affect the allocation: as the farm interview cost increases, the number of sampled farms decreases and the allocation shifts sample towards the strata with the largest farms.
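The total cost of the independent strategy follows directly from the two sample sizes given above. The sketch below reproduces the figure of 4,398 for $C^{A} = 1$ and extends the same arithmetic to the other farm interview costs; the function name is hypothetical.

```python
# Sample sizes of the two independent designs, as stated in the text.
N_FARMS = 1010
N_INDIVIDUALS = 3388


def independent_cost(c_a, c_b=1):
    """Total cost of the two independent samples."""
    return N_FARMS * c_a + N_INDIVIDUALS * c_b


independent_costs = {c_a: independent_cost(c_a) for c_a in (1, 2, 5, 10)}
```

Because the sample sizes do not change with $C^{A},$ the total cost is a straight line in $C^{A}$ with slope 1,010 and intercept 3,388.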

Table 5.6 below shows the sample sizes of farms and the expected sample sizes of individuals when cost model (5.2) is used to calculate the costs of individual interviews in the integrated allocation. The farm sample is more than double the size obtained when considering farms alone (1,010). The increase in size is due to the precision constraints on the household estimates.

Table 5.6
Sample sizes for the integrated sample allocation, when the overall individual costs are given by (5.2)
Cost per farm interview $(C^{A})$	1	2	5	10
Farms	2,388	2,289	2,190	2,137
Individuals	4,504	4,491	4,862	4,905
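Under cost model (5.2), summing $c_{j}$ over the sampled farms gives $n^{A} C^{A}$ plus $C^{B}$ times the total number of linked individuals. The sketch below applies this to the sample sizes of Table 5.6, assuming that the "Individuals" row equals the expected total number of linked individuals in the sample and that $C^{B} = 1.$ The resulting totals are a back-of-the-envelope computation under that assumption, not figures reported in the study.

```python
# (farms, expected individuals) from Table 5.6, keyed by C^A.
TABLE_5_6 = {
    1: (2388, 4504),
    2: (2289, 4491),
    5: (2190, 4862),
    10: (2137, 4905),
}


def integrated_cost(c_a, n_farms, n_individuals, c_b=1):
    """Total cost under (5.2): sum over sampled farms of C^A + L_j * C^B,
    assuming n_individuals is the sum of the L_j over the sampled farms."""
    return n_farms * c_a + n_individuals * c_b


integrated_costs = {c_a: integrated_cost(c_a, farms, individuals)
                    for c_a, (farms, individuals) in TABLE_5_6.items()}
```

The same shortcut does not apply to cost model (5.3), since the sum of $\sqrt{L_{j}^{A}}$ over the sampled farms cannot be recovered from the total number of individuals.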

Table 5.7 below shows the allocation when equation (5.3) is used for the cost of individual interviews in the integrated allocation.

Table 5.7
Sample sizes for the integrated sample allocation, when the overall individual costs are given by (5.3)
Cost per farm interview $(C^{A})$	1	2	5	10
Farms	2,135	2,121	2,111	2,108
Individuals	4,834	4,874	5,283	5,360

Tables 5.6 and 5.7 show that the integrated sample size of farms is roughly twice that of the independent allocation of farms. Thus the expected variance of the farm estimates will be much lower than the desired variance constraints, suggesting that the integrated sample allocation mainly depends on the variance constraints related to the individual parameters to be estimated.

Figures 5.1 and 5.2 show the costs for independent and integrated sampling. The integrated observational strategy is generally more expensive, except when the cost per farm interview is equal to 1 and the cost function is given by (5.3). In this evaluation, the integrated nature of the sample is not needed, as no cross-tabulation of population $U^{A}$ variables with population $U^{B}$ variables is examined; the independent allocation is therefore more efficient in terms of precision. Another cost function could, however, partially rebalance the two observational strategies in terms of costs.

Figure 5.1 Overall costs – integrated vs two independent allocations using (5.2)

Description for Figure 5.1

This diagram shows the respective costs of the independent sample and the integrated sample according to formula (5.2) with the cost on the y-axis ranging from 0 to 30,000 and the $C^{A}$ values on the x-axis from 0 to 10. The costs are linear and increasing with the integrated sample higher than the independent sample.

Figure 5.2 Overall costs – integrated vs two independent allocations using (5.3)

Description for Figure 5.2

This diagram shows the respective costs of the independent sample and the integrated sample according to formula (5.3) with the cost on the y-axis ranging from 0 to 25,000 and the $C^{A}$ values on the x-axis from 0 to 10. The costs are linear and increasing with the integrated sample higher than the independent sample.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2019-12-17
