A method to find an efficient and robust sampling strategy under model uncertainty
Section 1. Introduction

We consider the problem of choosing a sampling strategy, in particular the design, for the estimation of the total of a study variable in a finite population when a set of J auxiliary variables is available in a list sampling frame. We focus on the estimation of the total.

The decision about the sampling strategy involves parameters which are unknown at the stage when the decision needs to be taken. After data collection the parameters can be estimated, although sometimes only under some assumptions. In practice, we often use data from previous waves of a repeated survey, frame variables or data from another survey that is similar to the one at the planning stage. There is a risk that the available data do not give reliable information about the relevant parameters. The method presented here involves a risk measure, which takes into account the possibility of being misled by inaccurate or incorrect beliefs about the values of the needed parameters. The risk measure is derived for the difference and the generalized regression (GREG) estimators. Other than that, the measure is general. This measure and the discussion of its practical use are the main result of this paper.

One aim when selecting and devising the sampling strategy is efficiency in terms of small mean-squared error. The definition of “efficiency” is not unique, however, as it depends on the inference approach. Under the design-based approach, Godambe (1955), Lanke (1973) and Cassel, Särndal and Wretman (1977) show that there is no uniformly best linear estimator, in the sense of being best for all populations. There is no best design either. Therefore, a traditional approach for defining the strategy has been to assume that the finite population is a realization of some superpopulation model. The strategy is then defined in such a way that it minimizes the model expected value of the design mean-squared error, a parameter called the anticipated mean-squared error. The adjective “anticipated” was first introduced by Isaki and Fuller (1982) to emphasize the fact that this is a conceptual mean-squared error which is calculated in advance of sampling, based only on information available prior to sampling.
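
In symbols, the anticipated mean-squared error can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the paper: t denotes the population total of the study variable over the population U, \hat{t} an estimator of it, E_p the expectation with respect to the sampling design p, and E_\xi the expectation with respect to the superpopulation model \xi.

```latex
t = \sum_{k \in U} y_k,
\qquad
\mathrm{MSE}_p(\hat{t}) = E_p\!\left[(\hat{t} - t)^2\right],
\qquad
\mathrm{AMSE}(\hat{t}) = E_\xi\!\left[\mathrm{MSE}_p(\hat{t})\right]
                       = E_\xi E_p\!\left[(\hat{t} - t)^2\right].
```

The inner expectation averages over the samples the design can produce for a fixed population; the outer expectation averages over the populations the model can generate, which is why the quantity can be evaluated before any sample is drawn.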

Assuming that a superpopulation model holds and its parameters are known, several authors have shown that the optimal strategy should make use of a probability proportional-to-size sampling design (e.g., Hájek, 1959; Cassel, Särndal and Wretman, 1976; Nedyalkova and Tillé, 2008). In practice, however, there is not even a consensus about the existence of a generating model, let alone about which model to rely on. And even if there is a model, its parameters are unknown. There is evidence, mostly empirical, that probability proportional-to-size sampling is not robust to model misspecification (e.g., Holmberg and Swensson, 2001). A second result of this paper is to provide some theoretical evidence of this fact.
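
To make the term concrete, the following is a minimal sketch of how first-order inclusion probabilities proportional to a size measure are commonly computed for a fixed sample size n. It is an illustration only, not the authors' procedure: the function name `pips_inclusion_probabilities`, the size measure `x` and the iterative capping at 1 for very large units are all assumptions made here.

```python
def pips_inclusion_probabilities(x, n):
    """Inclusion probabilities proportional to the size measure x, capped at 1.

    Hypothetical helper for illustration: x is a list of strictly positive
    size measures (one per population unit) and n is the fixed sample size.
    """
    if any(v <= 0 for v in x):
        raise ValueError("size measure must be strictly positive")
    if not 0 < n < len(x):
        raise ValueError("n must satisfy 0 < n < N")

    # Units whose proportional probability would exceed 1 are taken with
    # certainty; the rest share the remaining sample size proportionally.
    take_all = [False] * len(x)
    while True:
        remaining = n - sum(take_all)
        total = sum(v for v, t in zip(x, take_all) if not t)
        pi = [1.0 if t else remaining * v / total
              for v, t in zip(x, take_all)]
        over = [p > 1.0 and not t for p, t in zip(pi, take_all)]
        if not any(over):
            return pi
        take_all = [t or o for t, o in zip(take_all, over)]


# Example: four units, sample size two; no unit needs capping here.
probabilities = pips_inclusion_probabilities([1, 2, 3, 4], 2)
```

The probabilities always sum to n, so a design that respects them has the intended expected sample size. Which size measure makes the design optimal depends on the assumed model, and that dependence is the source of the robustness concern raised above.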

Many articles discuss robustness in the survey sampling field. Beaumont, Haziza and Ruiz-Gazen (2013), for instance, propose a robust estimator that downweights influential observations; Royall and Herson (1973) consider robustness under polynomial models; Bramati (2012) and Zhai and Wiens (2015) propose robust stratification methods. We provide theoretical evidence of lack of robustness of proportional-to-size sampling and propose a method for assisting in the decision about the sampling design.

The contents of the paper are arranged as follows. The optimal strategy under the superpopulation model is defined in Section 2. The lack of robustness of this strategy when the model is misspecified is studied in Section 3. The method for assisting in the choice of the sampling design is presented in Section 4. In Section 5, the risk measure introduced in the previous section is extended to be used together with the GREG estimator. Section 6 presents numerical illustrations of the results in the paper. First, we illustrate the lack of robustness of probability proportional-to-size sampling and the flexibility of the GREG estimator with a small simulation study. Second, we illustrate the implementation of the risk measure with real survey data. Finally, Section 7 presents some conclusions.

