1 Introduction

Jan A. van den Brakel

The fields of randomized experiments and probability sampling are traditionally two separate domains of applied statistics. The two come together, however, when experiments are embedded in ongoing sample surveys. Such embedded experiments are frequently conducted to compare and test the effect of alternative survey implementations on the outcomes of a sample survey. The purpose of such empirical research is to improve the quality and efficiency of the underlying survey processes or to obtain more quantitative insight into the various sources of non-sampling error. Many experiments conducted in this context are small in scale or conducted with specific groups. The value of empirical research into survey methods is strengthened when conclusions can be generalized to populations larger than the sample included in the experiment. Selecting experimental units randomly from a larger target population is an important tool to ensure that the results of an experiment can be generalized in this way, as emphasized by Fienberg and Tanur (1987, 1988, 1989, 1996). This naturally leads to randomized experiments embedded in ongoing sample surveys. In the survey literature, such experiments are also referred to as split-ballot designs or interpenetrating subsampling, and date back to Mahalanobis (1946).

At national statistical offices, such experiments are particularly useful for quantifying discontinuities in the series of repeated surveys that are caused by adjustments to the survey process. Repeatedly conducted surveys produce series that describe the development of target parameters over time. Embedded experiments can be used to prevent one or more modifications in the survey process from resulting in unexplained differences in the series of a survey.

An important issue in the analysis of this kind of experiment is to find the right mode of inference. Statistical inference in survey sampling is traditionally design-based or model-assisted. This implies that the inference is predominantly based on the stochastic structure induced by the sampling design. A well-known design-based estimator is the Horvitz-Thompson (HT) estimator, developed by Narain (1951) and Horvitz and Thompson (1952) for unequal probability sampling from finite populations without replacement. Under the model-assisted approach developed by Särndal et al. (1992), the accuracy of the HT estimator is improved by taking advantage of available auxiliary information about the complete target population, resulting in the generalized regression (GREG) estimator. Many national statistical institutes rely on this design-based and model-assisted approach to compile official statistics.
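To make the HT estimator concrete, the following is a minimal sketch of the estimator of a population total, which weights each sampled observation by the inverse of its first-order inclusion probability. The function name, the variable names and the three-unit sample are hypothetical and serve only as an illustration; variance estimation and the GREG adjustment are not covered.

```python
from typing import Sequence


def horvitz_thompson_total(y: Sequence[float], pi: Sequence[float]) -> float:
    """Estimate a population total as the sum of y_i / pi_i over the sampled units."""
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    if any(p <= 0 or p > 1 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y_i / p_i for y_i, p_i in zip(y, pi))


# Hypothetical sample: three units drawn with unequal inclusion probabilities.
y_sample = [10.0, 20.0, 30.0]
pi_sample = [0.5, 0.25, 0.5]
ht_total = horvitz_thompson_total(y_sample, pi_sample)
```

A unit with a small inclusion probability represents many population units and therefore receives a large weight, which is what makes the estimator design-unbiased for the population total.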

The statistical inference that is traditionally employed in the theory of design and analysis of randomized experiments is predominantly model-based. The observations obtained in the experiment are assumed to be the realization of a linear model. To test hypotheses about treatment effects, F-tests are derived under the assumption of normally and independently distributed observations. An exception is Kempthorne (1955), who proposes a randomization approach that is similar to the design-based inference approach in sampling theory. In that approach, the F-test is used as an approximation of the randomization test. Model-based inference for randomized experiments is not necessarily appropriate for the analysis of embedded experiments, particularly if design-based or model-assisted inference is used in the ongoing survey to compile official statistics.

In an embedded experiment, the probability sample of the ongoing survey is randomly divided into different subsamples according to an experimental design. Each subsample can be considered a probability sample drawn from the finite target population. It can therefore be used to estimate parameters such as means, totals and ratios, as observed under the different survey implementations or treatments of the experiment, using the estimation procedure that is applied in the regular survey to compile official statistics. The purpose of such embedded experiments is to compare the effect of alternative survey implementations on the main parameter estimates of the ongoing survey and to test whether the observed differences between these parameter estimates are statistically significant. This is achieved with a design-based approach in which point and variance estimates for the population parameters are (approximately) design-unbiased with respect to both the sample design used to draw the initial probability sample from the target population and the experimental design used to randomize this sample over the different subsamples. The analysis must also reflect the specific details of the regular estimation approach used to compile official statistics, as far as this is possible with the available sample size under the different treatments.
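The idea that each subsample is itself a probability sample can be sketched as follows, assuming a completely randomized design with fixed subsample sizes. Under that assumption a unit with survey inclusion probability pi_i ends up in the subsample of size n_k with probability pi_i * n_k / n, so an HT-type total can be computed for every treatment. The function names and data are hypothetical, and the sketch uses plain HT weights rather than the regular estimation procedure of any particular survey.

```python
import random
from typing import Dict, List, Sequence


def assign_crd(n: int, sizes: Sequence[int], rng: random.Random) -> List[int]:
    """Randomly assign n sampled units to treatments 0..K-1 with fixed subsample sizes."""
    if sum(sizes) != n:
        raise ValueError("subsample sizes must add up to the sample size")
    labels = [k for k, n_k in enumerate(sizes) for _ in range(n_k)]
    rng.shuffle(labels)  # random permutation of the treatment labels
    return labels


def subsample_ht_totals(
    y: Sequence[float],
    pi: Sequence[float],
    labels: Sequence[int],
    sizes: Sequence[int],
) -> Dict[int, float]:
    """HT total per treatment, using pi_i * n_k / n as the inclusion probability."""
    n = len(y)
    totals = {k: 0.0 for k in range(len(sizes))}
    for y_i, pi_i, k in zip(y, pi, labels):
        totals[k] += y_i / (pi_i * sizes[k] / n)
    return totals


# Hypothetical survey sample of eight units, split evenly over two treatments.
y_obs = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0]
pi_obs = [0.25] * 8
treatment = assign_crd(len(y_obs), [4, 4], random.Random(2024))
totals_by_treatment = subsample_ht_totals(y_obs, pi_obs, treatment, [4, 4])
```

In a real experiment the observations in each subsample would be collected under a different survey implementation, so the difference between the two totals would estimate the treatment effect on the population total.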

Previous research has proposed such a design-based theory for the analysis of single-factor experiments that are designed as completely randomized designs (CRDs) or randomized block designs (RBDs) to test the effect of one factor at $K \geq 2$ levels (van den Brakel (2008); van den Brakel and Renssen (1998, 2005); van den Brakel and van Berkel (2002)). In this approach the GREG estimator is applied to derive design-based Wald and t-statistics to test whether the finite population parameter estimates observed under the different survey implementations differ significantly. This theory is further extended to experiments embedded in rotating panel designs by Chipperfield and Bell (2010).

From standard experimental design theory it is well known that it is efficient to test different treatment factors simultaneously in one factorial design instead of conducting separate single-factor experiments (Hinkelmann and Kempthorne (1994); Montgomery (2001)). It can be expected that different design parameters in a survey process interact with each other, e.g., when different questionnaire designs and data collection modes are compared empirically. Factorial setups are therefore appropriate if more than one factor in the survey is adjusted and tested in an embedded experiment, since fewer experimental units are required to test the main effects of the treatment factors, while interactions between the factors can also be analyzed. Another advantage of testing different treatments simultaneously in a factorial design is that the validity of the observed results is extended, since the effects are observed over a wider range of conditions (Hinkelmann and Kempthorne (1994)). Therefore the design-based theory for the analysis of embedded experiments is extended to factorial designs in this paper.
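As an illustration of what a factorial setup adds, the sketch below computes the two main effects and the interaction contrast from the four cell estimates of a 2 x 2 design. The factor labels (A for questionnaire version, B for data collection mode), the function name and the numbers are hypothetical, and the interaction is defined here, as an assumption, as the change in the effect of A between the two levels of B. Tests of these contrasts would additionally require the design-based variance estimates, which this sketch does not attempt.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (level of factor A, level of factor B), each 0 or 1


def factorial_effects_2x2(cell_means: Dict[Cell, float]) -> Dict[str, float]:
    """Main effects and interaction contrast from the four cell means of a 2 x 2 design."""
    m = cell_means
    # Main effect of A: mean at A=1 minus mean at A=0, averaged over the levels of B.
    effect_a = (m[1, 0] + m[1, 1]) / 2 - (m[0, 0] + m[0, 1]) / 2
    # Main effect of B: mean at B=1 minus mean at B=0, averaged over the levels of A.
    effect_b = (m[0, 1] + m[1, 1]) / 2 - (m[0, 0] + m[1, 0]) / 2
    # Interaction: effect of A at B=1 minus effect of A at B=0.
    interaction = (m[1, 1] - m[0, 1]) - (m[1, 0] - m[0, 0])
    return {"A": effect_a, "B": effect_b, "AB": interaction}


# Hypothetical cell estimates, e.g. an estimated percentage under each treatment combination.
cells = {(0, 0): 50.0, (1, 0): 54.0, (0, 1): 52.0, (1, 1): 60.0}
effects = factorial_effects_2x2(cells)
```

Every observation contributes to both main effects, which is why a factorial design needs fewer units than two separate single-factor experiments, and the interaction contrast is only estimable because all four combinations are observed.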

The theory for factorial designs in which the effects of two factors are tested simultaneously is developed in section 2. The methodology is subsequently extended to higher-order factorial designs in section 3. In section 4, the methodology is extended to test hypotheses about ratios of population totals and to designs where clusters of sampling units are randomized over the treatment combinations. In section 5 these methods are applied to a factorial experiment with advance letters in the Dutch Labor Force Survey (LFS). The paper concludes with a discussion in section 6.


Date modified: 2017-09-20

Survey Methodology
