Cost optimal sampling for the integrated observation of different populations
Section 6. Conclusions
In this paper, we
studied the problem of the definition of optimal sampling designs for survey
strategies aiming at observing in an integrated way different statistical
populations related to each other. This is particularly relevant in the
agricultural sector where the integrated observation allows measurement of
global phenomena that affect different statistical populations such as farms
and households. The integrated observation is realized by directly sampling the
first population and indirectly observing the second population, exploiting the
links existing among the units of the two populations. We studied the problem
considering three different contexts concerning information about the links.
These range from two contexts in which the information is very rich to a
third context in which the information is very poor. The uncertainty
on variables of the two populations, on links and on the
variables (built by the indirect sampling
mechanism) is treated by introducing suitable superpopulation models for which
expected values (of first and second order) are considered as known when
launching the algorithm for the optimal sampling. Empirical studies were
performed on real data of a developing country:
Mozambique.
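The indirect observation summarized above passes the design weights of the directly sampled units to the linked units of the second population. A minimal Python sketch of the generalized weight share method (Lavallée, 2007), one standard way of doing this, is given below. The function name `gwsm_weights`, the toy link matrix and the cluster labels are illustrative assumptions and are not taken from the empirical study.

```python
import numpy as np


def gwsm_weights(weights_a, links, cluster_of_b):
    """Generalized weight share method (GWSM) weights for the units of B.

    weights_a    : length-N_A array, 1/pi_j for sampled units of A, 0 otherwise
    links        : (N_A, N_B) array, links[j, k] > 0 if unit j of A is linked
                   to unit k of B
    cluster_of_b : length-N_B integer array, cluster label (0, 1, ...) of each
                   unit of B
    """
    weights_a = np.asarray(weights_a, dtype=float)
    links = np.asarray(links, dtype=float)
    cluster_of_b = np.asarray(cluster_of_b, dtype=int)

    n_clusters = cluster_of_b.max() + 1
    # member[k, i] = 1 if unit k of B belongs to cluster i.
    member = np.eye(n_clusters)[cluster_of_b]
    # Number of links from each unit of A into each cluster of B.
    links_to_cluster = links @ member
    # Total number of links into each cluster, over the whole population A.
    total_links = links_to_cluster.sum(axis=0)
    if np.any(total_links == 0):
        raise ValueError("every cluster of B needs at least one link to A")

    # Weighted links from the sample, shared over all links of the cluster.
    cluster_weight = (weights_a @ links_to_cluster) / total_links
    # Every unit of a cluster receives the weight of its cluster.
    return cluster_weight[cluster_of_b]


# Toy example (assumed data): 3 units in A, 4 units in B, 2 clusters in B.
links = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
clusters = np.array([0, 0, 1, 1])
# Units 0 and 2 of A are sampled, each with inclusion probability 0.5.
print(gwsm_weights([2.0, 0.0, 2.0], links, clusters))
```

The sketch needs the total number of links into each cluster, so it corresponds to a setting in which the link information is rich; it does not cover the poor-information case.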
The main conclusions are summarized as follows.
Integrated vs independent observation. The integrated observation is essential for thoroughly measuring global phenomena that
affect different populations. The main advantage is that it allows the cross-tabulation of population
variables with
population
variables.
Furthermore, the integrated observation is necessary when the frame for the
population
does not exist
and an indirect sampling mechanism is needed. This is the case examined in
Context 3. However, for Contexts 1 and 2, if only aggregates are
examined independently from each other in the two populations, the independent
allocation will be more efficient.
Cost issues. The loss in efficiency of the integrated observation can be reduced if, as
assumed with cost function (5.3), the average cost of observing the elementary
unit of
decreases when
the size of the indirectly observed clusters increases. In this case, the
performance of the integrated sample allocation and of the two independent
allocations could be close or similar, as in the evaluation study.
Nevertheless, it is complex to establish which relationship between
and
leads to two
strategies with similar costs, since the allocations depend not only on the
cost of interview but also on the variability of the target parameters in the
two populations and on the set of variance constraints.
Controlling the errors in the design phase. The integrated approach to allocation enables the CVs
of the estimates for integrated populations to be controlled. If this is not
done, the CVs of the indirectly observed population might be very high.
The impact of uncertainty on the sample sizes. An increase in the model variances (on the variables
or on the links) causes a significant increase in the sample sizes. This
stresses the need for good models to predict the unknown variables or
the links.
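As a rough illustration of why larger variances inflate sample sizes, the sketch below uses the textbook case of simple random sampling without replacement (Cochran, 1977), not the allocation algorithm of this paper. The function name `srs_sample_size` and all numerical values are assumptions made for the example. For a target CV of the estimated mean, the required sample size is n = n0 / (1 + n0 / N), with n0 = S^2 / (CV * mean)^2, so n grows almost linearly with the variance S^2 when N is large.

```python
import math


def srs_sample_size(s2, mean, cv_target, pop_size):
    """Smallest n such that the CV of the sample mean is at most cv_target
    under simple random sampling without replacement.

    s2        : population variance of the variable (assumed known or modelled)
    mean      : population mean of the variable
    cv_target : target coefficient of variation, e.g. 0.05
    pop_size  : population size N
    """
    # Sample size ignoring the finite population correction.
    n0 = s2 / (cv_target * mean) ** 2
    # Apply the finite population correction and round up.
    return math.ceil(n0 / (1.0 + n0 / pop_size))


# Hypothetical values: doubling the variance roughly doubles the sample size.
for variance in (100.0, 200.0, 400.0):
    print(variance, srs_sample_size(variance, mean=50.0, cv_target=0.05, pop_size=10_000))
```

In the integrated setting the same mechanism applies to each variance constraint, with the model variances of the variables and of the links taking the place of S^2.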
Appendix A
To obtain the
model expectation
let
be the
vector
of residuals, where
where
denotes the
vector with the values of
variable of interest and
indicates the diagonal matrix with the
inclusion probabilities. According to model
(4.1), the vector
may be expressed as
where
and
denotes the
vectors of predictions and model residuals.
Adopting the above matrix notation, the specific residuals
can be expressed as
Therefore, the model expected values of the
squared terms are given by:
where
is the
column vector of model standard errors of the
variables and
is a vector in which the
element is equal to
and all other elements are zeroes. Using the
above matrix notation and according to Falorsi and Righi (2015), the
anticipated variance can be approximated by the following expression:
Letting
and
we then have
where the scalars defined as (A.1.4), (A.1.7)
and (A.1.8) in Falorsi and Righi (2015) are respectively the elements of the vectors
and
Appendix B
Adopting the
matrix notation, the residuals
can be
expressed as
where
is the
matrix of standardized links, and
and
denote respectively the
vectors with the values of the predictions and
of the residuals of the
variable of interest being
is the
row of the matrix
Therefore, the model expected values of the
squared terms are given by:
where
is the
column vector of model standard errors of the
variables. Following the above notation, we
have:
Letting
we have
Finally, we note
the following
Appendix C
Starting from (B.2),
we have:
The above expected
value can be easily derived based on the following general result. Let
be a
generic
matrix.
The generic element
in the
position
of the
squared
matrix
is given
by
We have
The first-order Taylor series
approximation of
is given
by
Therefore, we have
where
Equation (C.2)
is derived from the following result. For
and
we
obtain
where
For
and
we obtain
Let
be a
generic
vector.
The generic element
in the
position
of the
squared
matrix
is given
by
where
Finally, denote by
the
generic element in the
position
of the matrix
Its
generic expected value is given by
References
Bethel, J. (1989). Sample allocation in multivariate surveys. Survey Methodology, 15, 1, 47-57. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1989001/article/14578-eng.pdf.
Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Brewer, K.R.W., and Gregoire, T.G. (2009). Introduction to
survey sampling. In Handbook of Statistics – Sample Surveys: Design, Methods and
Applications, (Eds., D. Pfeffermann
and C.R. Rao), Elsevier B.V. 29A, 9-37.
Choudhry, G.H., Rao, J.N.K. and Hidiroglou, M.A. (2012).
On sample allocation for efficient domain estimation. Survey Methodology, 38, 1, 23-29. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2012001/article/11682-eng.pdf.
Chromy, J. (1987). Design optimization with multiple
objectives. Proceedings of the Survey
Research Methods Section, American Statistical Association, 194-199.
Cochran, W.G. (1977). Sampling Techniques. New York: John Wiley & Sons, Inc.
Deville, J.-C., and Tillé, Y. (2005). Variance
approximation under balanced sampling. Journal
of Statistical Planning and Inference, 128, 569-591.
Falorsi, P.D., and Righi, P. (2015). Generalized
framework for defining the optimal inclusion probabilities of one-stage
sampling designs for multivariate and multi-domain surveys. Survey Methodology, 41, 1, 215-236.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2015001/article/14149-eng.pdf.
FAO (2014). Technical Report on the Integrated Survey
Framework, Technical Report Series GO-02-2014. http://gsars.org/wp-content/uploads/2014/07/Technical_report_on-ISF-Final.pdf.
FAO (2015). Guidelines on Integrated Survey Framework. Guidelines & Handbooks.
http://gsars.org/en/guidelines-for-the-integrated-survey-framework/. Accessed
in August 2016.
Khan, M.G.M., Mati, T. and Ahsan, M.J. (2010). An
optimal multivariate stratified sampling design using auxiliary information: An
integer solution using goal programming approach. Journal of Official Statistics, 26, 695-708.
Kokan, A., and Khan, S. (1967). Optimum allocation in
multivariate surveys: An analytical solution. Journal of the Royal Statistical Society, Series B, 29, 115-125.
Lavallée, P. (2002). Le sondage indirect, ou la méthode du partage des poids. Éditions de
l’Université de Bruxelles (Belgique) et Éditions Ellipses (France), 215 pages.
Lavallée, P. (2007). Indirect Sampling. New York: Springer.
Lavallée, P., and Caron, P. (2001). Estimation using the
generalised weight share method: The case of record linkage. Survey Methodology, 27, 2, 155-169.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2001002/article/6092-eng.pdf.
Lavallée, P., and Labelle-Blanchet, S. (2013). Indirect sampling applied to skewed
populations. Survey Methodology, 39, 1, 183-215.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2013001/article/11829-eng.pdf.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Steel, D.G., and Clark, R.G. (2014). Potential gains from using unit level cost information in a
model-assisted framework. Survey Methodology, 40, 2, 231-242.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2014002/article/14110-eng.pdf.
Wallgren, A., and Wallgren, B. (2014). Register-Based Statistics: Administrative
Data for Statistical Purposes. New York: John Wiley & Sons, Inc.; Chichester, UK. ISBN: 978-1-119-94213-9.
Winkler, W.E. (2001). Multi-Way Survey Stratification and Sampling. Research Report
Series, Statistics #2001-01. Statistical Research Division, U.S. Bureau of the Census, Washington, D.C. 20233.
Xu, X., and Lavallée, P. (2009). Treatments for link nonresponse in indirect
sampling. Survey Methodology, 35, 2, 153-164.
Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2009002/article/11038-eng.pdf.