Cost optimal sampling for the integrated observation of different populations
Section 1. Introduction

The need to jointly observe different but related populations is often encountered in social or economic studies. For example, in agricultural studies, the characteristics and behavior of farms can be linked to phenomena related not only to the farms themselves, but also to the social activities of individuals. This requires studying the population of rural households in addition to the population of farms. To gain insight into an underlying phenomenon, the observations must be carried out in an integrated way, meaning that the units of a given population have to be observed jointly with the related units of the other population. In the agricultural example, this means that the sample of rural households should be selected so that its units have some relationship with the farm sample used for the study.

The integrated observation of two populations implies that if we observe the variables of unit $j$ of the first population, $U^A$, we need to observe the variables of all the units of the second population, $U^B$, which are linked with the $j^{\text{th}}$ unit of $U^A$. The links among the units of the two populations are regulated by formal rules, contingent dependencies or relationships created for these purposes. Continuing with the agricultural example, these studies often refer to different statistical populations such as farms, rural households and land parcels, the units of which are linked to each other. The people of a given household may be the workers of a specific farm, and those workers represent the links between the household and the farm. Furthermore, a given farm comprises specific land parcels, which represent the links between that farm and the population of land parcels. The integrated observation of such populations allows the measurement of global phenomena of the agricultural sector. Consider a given farm: the education level of the farm holder and the farm size, which are variables related to the population of farms, can affect the productivity of the land (a variable related to the statistical population of land parcels) which belongs to the farm. This productivity may have an impact on the risk of malnutrition of the households (population of rural households) in which the workers of the farm live. Thus, the observation of such different units in an integrated way provides insights into the relationships which link the level of education, the land productivity and the risk of malnutrition.
If only aggregates are examined, then the advantage of integrated sampling is that it allows sampling from population $U^B$ without having a frame available for it.

Another concrete example where the methodology may be of use is firm-establishment-employee studies. For instance, the well-being of the households of people employed in firms which have a well-defined policy of social responsibility may differ from that of other types of households, and their children's success in schooling may be higher. In this case, integrated observation allows the study of the behavior of different sub-classes of households defined by a variable observable in the population of firms.

Other examples can be found in socio-demographic studies. For instance, the phenomenon of children who spend time in two households can be studied with the integrated observation of the population $U^A$ of households and the population $U^B$ of children.

Generally speaking, integrated observation may be of use for studying phenomena that involve variables which are correlated but belong to different statistical populations. Integrated observation allows the study of the relationships among all the variables of interest for the given phenomena, even if they belong to different populations. The independent observation of such populations would not allow the observation of the set of all the related variables of interest and hence it would not be possible to study the relationships among all the variables describing the phenomenon.

Indirect sampling (Lavallée, 2002, 2007) provides a natural framework for the estimation of the parameters of two target populations that are related to each other. In the indirect sampling framework, the units belonging to a population that are selected for a given survey can enable the collection of information on another population, through the relationship between the units in the two populations. Furthermore, indirect sampling is suitable for producing statistics of populations for which there is no sampling frame. In such a context, the sampling procedure assumes that population $U^A$ is related to the population of interest $U^B$, but only the sampling frame of $U^A$ is available. Then, a sample is selected from $U^A$, and using the links between the two populations, a sample of units of $U^B$ is observed.
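
The selection mechanism just described can be illustrated with a minimal sketch. It is not the design developed in this paper: it assumes, purely for illustration, Poisson sampling on $U^A$ (each unit drawn independently with its own inclusion probability), and the farm and household identifiers, the `links` dictionary and the function names are all hypothetical. It also shows that the expected sample size in $U^A$ is the sum of the inclusion probabilities.

```python
import random


def poisson_sample(incl_prob, rng):
    """Draw each unit of U^A independently with its own inclusion probability.

    Poisson sampling is an assumption made for this illustration only.
    """
    return {j for j, p in incl_prob.items() if rng.random() < p}


def indirect_sample(sample_a, links):
    """Return every unit of U^B linked to at least one sampled unit of U^A."""
    observed_b = set()
    for j in sample_a:
        observed_b.update(links.get(j, ()))
    return observed_b


# Hypothetical toy data: farms (U^A) and the households (U^B) linked to them.
links = {
    "farm1": {"hh1", "hh2"},
    "farm2": {"hh2"},
    "farm3": {"hh3", "hh4"},
    "farm4": set(),  # a farm with no linked household
}
incl_prob = {"farm1": 0.5, "farm2": 0.25, "farm3": 0.75, "farm4": 0.5}

# The expected sample size in U^A is the sum of the inclusion probabilities.
expected_size_a = sum(incl_prob.values())

rng = random.Random(12345)  # fixed seed so the draw is reproducible
sample_a = poisson_sample(incl_prob, rng)
sample_b = indirect_sample(sample_a, links)
```

Note that no frame of households is needed: the units of $U^B$ are reached only through the links of the sampled farms, and a household linked to several farms (such as `hh2` here) can enter the sample through any of them.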

This paper studies the problem of sampling design for integrated observation of different populations. For this, an indirect sampling design is implemented. In particular, the focus is on the determination of the inclusion probabilities. Since the sum of these probabilities defines the expected sample size, we roughly define the problem as a sampling allocation problem. In fact, the two problems (determination of the inclusion probabilities and sampling allocation) coincide in stratified sampling. The allocation problem for the usual (direct) sampling setting has been dealt with in several books and papers. When one target parameter is to be estimated for the overall population, the optimal allocation in stratified sampling can be performed (Cochran, 1977, and Särndal, Swensson and Wretman, 1992). In particular, the optimal sample allocation minimizes the variance of the estimated total, subject to a given budget or, reversing the problem, a sample allocation that minimizes costs can be performed, subject to a given sampling error constraint. In multivariate cases, where more than one characteristic of each sampled unit must be measured, the optimal allocations for individual characteristics are of little practical use, unless the characteristics under study are highly correlated. This is because an allocation that is optimal for one characteristic can be far from optimal for others. The multidimensionality of the problem therefore leads to compromise allocation methods (Khan, Mati and Ahsan, 2010), with a loss of precision compared to the individual optimal allocations. Several authors have discussed various criteria for obtaining a feasible compromise allocation: see, for example, Kokan and Khan (1967), Chromy (1987), Bethel (1989) and Choudhry, Rao and Hidiroglou (2012).
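
As an illustration of the univariate, direct-sampling case mentioned above, the following sketch computes the textbook cost-minimizing allocation for stratified simple random sampling subject to a target variance for the estimated total. It is not the method proposed in this paper. The symbols are assumptions introduced here: `N` (stratum sizes), `S` (stratum standard deviations), `c` (per-unit costs) and `V` (target variance); fixed costs are ignored and the resulting sizes are left unrounded.

```python
import math


def variance_of_total(N, S, n):
    """Variance of the estimated total under stratified simple random sampling."""
    return sum(Nh**2 * Sh**2 * (1.0 / nh - 1.0 / Nh) for Nh, Sh, nh in zip(N, S, n))


def min_cost_allocation(N, S, c, V):
    """Sizes n_h minimizing sum(c_h * n_h) subject to variance_of_total == V.

    Standard Lagrangian solution: n_h is proportional to N_h * S_h / sqrt(c_h),
    with the constant chosen so that the variance constraint holds exactly.
    The result is not rounded and is not capped at N_h.
    """
    numer = sum(Nh * Sh * math.sqrt(ch) for Nh, Sh, ch in zip(N, S, c))
    denom = V + sum(Nh * Sh**2 for Nh, Sh in zip(N, S))
    k = numer / denom
    return [k * Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N, S, c)]


# Hypothetical strata, chosen only to exercise the formula.
N = [1000, 500, 200]   # stratum population sizes
S = [10.0, 20.0, 50.0] # stratum standard deviations of the study variable
c = [1.0, 2.0, 4.0]    # cost of observing one unit in each stratum
V = 1_000_000.0        # target variance of the estimated total

n = min_cost_allocation(N, S, c, V)
total_cost = sum(ch * nh for ch, nh in zip(c, n))
```

With a single study variable, this closed form is all that is needed. With several variables, each yields a different vector `n`, which is why compromise allocations are required in the multivariate case.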

Falorsi and Righi (2015) provide a general framework for sample design in multivariate and multi-domain surveys. This paper offers a further generalization of this framework to the case of integrated observation of two populations. Different scenarios related to the level of knowledge of the links are examined: the first scenario assumes the links between the populations are known in the design phase; the second scenario assumes the links between $U^A$ and $U^B$ are estimated in the design phase; in the third scenario, no links between $U^A$ and $U^B$ are available, but auxiliary variables on $U^A$ can provide useful information on $U^B$.

Section 2 introduces the background and symbols. Section 3 and Section 4 illustrate the basic optimization problem and how it is applied in the different scenarios. Empirical evidence is shown in Section 5.

