Cost optimal sampling for the integrated observation of different populations
Section 1. Introduction
The need to observe together different populations related to each other is often
encountered in social or economic studies. For example, in agricultural
studies, the characteristics and behavior of farms can be linked to phenomena
not only related to the farms themselves, but also to the social activities of
individuals. This requires the study of the population of rural households, in
addition to the study of the population of farms, in some integrated way. That
is, to get an insight into an underlying phenomenon, the observations must be
carried out in an integrated way, implying that the units of a given population
have to be observed jointly with the related units of the other population. In
the agricultural example, this means that a sample of rural households should
be selected that have some relationship with the farm sample to be used for the
study.
The integrated observation of two populations implies that if we observe the
variables of a unit of the first population, we need to observe the variables
of all the units in the second population which are linked with that unit. The
links among the units of the two populations are regulated by formal rules,
contingent dependencies or relationships created for these purposes. Continuing
with the agricultural example, these studies often refer to different
statistical populations such as farms, rural households and land parcels, the
units of which are linked to each other. The people of a given household may be
the workers of a specific farm and those workers represent the links between
the household and the farm. Furthermore, a given farm comprises specific land
parcels which represent the links between that farm and the population of land
parcels. The integrated observation of such populations allows the measurement
of global phenomena of the agricultural sector. Consider a given farm: the
education level of the farm holder and the farm size, which are variables
related to the population of farms, can affect the productivity of the land (a
variable related to the statistical population of land parcels) which belongs
to the farm. This productivity may have an impact on the risk of malnutrition
of the households (population of rural households) in which the workers of the
farm live. Thus, the observation of such different units in an integrated way
provides insights into the relationships which link the level of education, the
land productivity and the risk of malnutrition. If only aggregates are
examined, then the advantage of integrated sampling is that it allows sampling
from a population without having a frame available for it.
Another concrete example where the methodology may be of use is for
firm-establishment-employee
studies. For instance, the wellness of the households of people employed in
firms which have a well-defined policy of social responsibility may be
different from that of other types of households and the success in their
children’s schooling can be higher. In this case, the integrated observation
allows the study of the behavior of different sub-classes of households defined by
a variable observable in the population of firms.
Other examples can be found in socio-demographic studies. For instance, the
phenomenon of children who spend time in two households can be studied with the
integrated observation of the population of households and the population of
children.
Generally speaking, integrated observation may be of use for studying phenomena that
involve variables which are correlated but belong to different statistical
populations. Integrated observation allows the study of the relationships among
all the variables of interest for the given phenomena, even if they belong to
different populations. The independent observation of such populations would
not allow the observation of the set of all the related variables of interest
and hence it would not be possible to study the relationships among all the
variables describing the phenomenon.
Indirect sampling (Lavallée, 2002, 2007) provides a natural framework for the
estimation of the parameters of two target populations that are related to each
other. In the indirect sampling framework, the units belonging to a population
that are selected for a given survey can enable the collection of information
on another population, through the relationship between the units in the two
populations. Furthermore, indirect sampling is suitable for producing
statistics of populations for which there is no sampling frame. In such a
context, the sampling procedure assumes that a population with an available
sampling frame is related to the population of interest, for which no frame is
available. A sample is then selected from the frame population and, using the
links between the two populations, a sample of units of the population of
interest is observed.
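As a rough illustration of the indirect sampling idea, the following Python
sketch selects farms from a frame and estimates a household total through the
links. The toy data, the names `links`, `y` and `indirect_estimate`, and the
weight-sharing rule (each household value is divided by its total number of
links, in the spirit of the generalized weight share method associated with
Lavallée's work) are assumptions made for this illustration; they are not taken
from this paper.

```python
from itertools import combinations

# Hypothetical toy data: 4 farms (frame population) and 5 households
# (population of interest, no frame). links[j] lists the households
# linked to farm j, for example through the farm's workers.
links = {0: [0, 1], 1: [1, 2], 2: [3], 3: [3, 4]}
y = [10.0, 20.0, 30.0, 40.0, 50.0]  # household variable of interest

# L[i]: number of links reaching household i from the whole farm population.
# Every household needs at least one link for the estimator to cover it.
L = [0] * len(y)
for household_list in links.values():
    for i in household_list:
        L[i] += 1


def indirect_estimate(sample, pi):
    """Estimate the household total from a sample of farms.

    Each sampled farm j carries the weight 1 / pi[j]; the value of a linked
    household i is shared across its L[i] links so it is not over-counted.
    """
    total = 0.0
    for j in sample:
        for i in links[j]:
            total += (1.0 / pi[j]) * y[i] / L[i]
    return total


# Simple random sampling without replacement of 2 farms out of 4,
# so every farm has inclusion probability 0.5 (an assumed design).
pi = {j: 0.5 for j in links}
samples = list(combinations(links, 2))
estimates = [indirect_estimate(s, pi) for s in samples]

# Averaging over all equally likely samples gives the design expectation.
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate, sum(y))
```

Enumerating every possible sample makes it easy to check that, under this
assumed design, the expectation of the estimator matches the true household
total even though only the farm frame is used for selection.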
This paper studies the problem of sampling design for integrated observation of
different populations. For this, an indirect sampling design is implemented. In
particular, the focus is on the determination of the inclusion probabilities.
Since the sum of these probabilities defines the expected sample size, we
roughly define the problem as a sampling allocation problem. In fact, the two
problems (determination of the
inclusion probabilities and sampling allocation) coincide in stratified
sampling. The allocation problem for the usual (direct) sampling setting has
been dealt with in several books and papers. When one target parameter is to be
estimated for the overall population, the optimal allocation in stratified
sampling can be performed (Cochran, 1977, and Särndal, Swensson and Wretman,
1992). In particular, the optimal sample allocation minimizes the variance of
the estimated total, subject to a given budget or, reversing the problem, a
sample allocation that minimizes costs can be performed, subject to a given
sampling error constraint. In multivariate cases, where more than one
characteristic of each sampled unit must be measured, the optimal allocations
for individual characteristics are of little practical use, unless the
characteristics under study are highly correlated. This is because an
allocation that is optimal for one characteristic can be far from optimal for
others. The multidimensionality of the problem also leads to a compromise
allocation method (Khan, Mati and Ahsan, 2010), with a loss of precision
compared to the individual optimal allocations. Several authors have discussed
various criteria for obtaining a feasible compromise allocation: see, for
example, Kokan and Khan (1967), Chromy (1987), Bethel (1989) and Choudhry, Rao and
Hidiroglou (2012).
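For the single-parameter case, the cost-minimizing allocation under a variance
constraint can be sketched in Python as follows. This is the textbook
closed-form solution for stratified simple random sampling with a linear cost
function, not the method developed in this paper; the function names, stratum
sizes, standard deviations, unit costs and target variance are hypothetical,
and the sketch ignores integer rounding and the bounds 1 <= n_h <= N_h.

```python
import math


def cost_optimal_allocation(N, S, c, target_variance):
    """Stratum sample sizes minimizing sum(c_h * n_h) subject to
    Var(estimated total) = target_variance under stratified SRS.

    N: stratum population sizes, S: stratum standard deviations,
    c: per-unit costs. Returns real-valued (unrounded) sample sizes.
    """
    A = [Nh * Sh for Nh, Sh in zip(N, S)]
    # Var = sum(A_h^2 / n_h) - sum(N_h * S_h^2), so the constraint becomes
    # sum(A_h^2 / n_h) = v_star.
    v_star = target_variance + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    k = sum(Ah * math.sqrt(ch) for Ah, ch in zip(A, c)) / v_star
    # Lagrange solution: n_h proportional to N_h * S_h / sqrt(c_h).
    return [k * Ah / math.sqrt(ch) for Ah, ch in zip(A, c)]


def variance_of_total(N, S, n):
    """Variance of the stratified estimator of a total under SRS."""
    return sum(Nh ** 2 * Sh ** 2 * (1.0 / nh - 1.0 / Nh)
               for Nh, Sh, nh in zip(N, S, n))


# Hypothetical inputs for three strata.
N = [1000, 500, 200]
S = [5.0, 12.0, 30.0]
c = [1.0, 2.0, 4.0]
target = 250000.0

n = cost_optimal_allocation(N, S, c, target)
cost = sum(ch * nh for ch, nh in zip(c, n))
print([round(nh, 1) for nh in n], round(cost, 1))
print(variance_of_total(N, S, n))
```

With several target variables, one such constraint would be needed per
variable, which is why a single closed-form allocation is no longer available
and compromise or multi-constraint methods are used instead.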
Falorsi and Righi (2015) provide a general framework for sample design in
multivariate and multi-domain surveys. The present paper generalizes this
framework to the case of integrated observation of two populations. Three
scenarios related to the level of knowledge of the links are examined: the
first scenario assumes the links between the two populations are known in the
design phase; the second scenario assumes the links between the two populations
are estimated in the design phase; in the third scenario, no links between the
two populations are available, but auxiliary variables on one population can
provide useful information on the other.
Section 2 introduces the background and symbols. Section 3 and Section 4
illustrate the basic optimization problem and how it is applied in the
different scenarios. Empirical evidence is shown in Section 5.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa