Sample empirical likelihood approach under complex survey design with scrambled responses
Section 1. Introduction

Survey sampling has been shown to be one of the most effective ways to collect representative information about an underlying study population of interest; see Kish (1965) and Cochran (1977), among others. It is used frequently in practice to obtain important information related to health, socioeconomics, and public opinion. However, data collection using a complex sampling design without careful control of statistical disclosure may lead to low response rates and large measurement errors (Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer and Wolf, 2012). Statistical disclosure control (SDC) is one of the necessary steps for the release of public use files by agencies such as the US Census Bureau. For instance, Krenzke, Li, Freedman, Judkins, Hubble, Roisman and Larsen (2011) produced transportation data products from the American Community Survey that comply with disclosure rules, and Gouweleeuw, Kooiman, Willenborg and Wolf (1998) discussed statistical data protection at Statistics Netherlands.

The idea underlying SDC is to perturb the original raw data file so that the risk of identifying individuals is small while the utility of the perturbed data file remains high. Many SDC approaches are currently in use, including data coarsening, variable suppression, data swapping (Fienberg and McIntyre, 2005), parametric model-based multivariate sequential replacement (Raghunathan, Lepkowski, van Hoewyk and Solenberger, 2001), and scrambled response or randomized response methods (Horvitz, Shah and Simmons, 1967; Fox and Tracy, 1986). For more information about these approaches, see Hundepool et al. (2012).

Inference after SDC is an important and challenging problem. Statistical analysis that ignores the SDC step leads to biased variance estimation (Raghunathan, Reiter and Rubin, 2003). Raghunathan et al. (2003) proposed using the multiple imputation (MI) procedure to generate perturbed data files and Rubin's variance formula for inference. However, most agencies seek to produce only one public use file rather than many, and the validity of MI depends on the well-known congeniality condition of Meng (1994), which may not hold under an informative sampling design (Kim and Yang, 2017). Compared with other approaches, the scrambled response approach is easy to implement and offers a good compromise between risk and utility. In addition, valid statistical inference can be developed for most complex sampling designs. Warner (1965) first proposed using a randomization device, such as a deck of cards, to estimate the proportion of sensitive characteristics, such as induced abortion, drug use, and so on. Tracy and Mangat (1996) give a comprehensive review of randomized response methods. One effective randomized response method, the scrambled response technique, is the multiplicative model considered by Eichhorn and Hayre (1983). Bar-Lev, Bobovitch and Boukai (2004) proposed an improved version of this model, and Saha (2011) discussed an optional scrambled randomized response technique for practical surveys. More recently, Singh and Kim (2011) proposed a pseudo empirical likelihood estimator under this model with a simple random sampling without replacement (SRSWOR) design. However, they considered only point estimation under the SRSWOR design, and their proposed method may not work for other sampling designs, such as probability proportional to size designs.
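To fix ideas, the multiplicative scrambling of Eichhorn and Hayre (1983) can be sketched as follows: each respondent reports the product Y* = Y·S, where the scrambling variable S is drawn from a distribution with known mean E(S) > 0, so the mean of Y can be recovered as mean(Y*)/E(S). The population and the scrambling distribution below are illustrative assumptions, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative population of a sensitive variable Y (e.g., income); true
# responses are never observed by the agency.
n = 5000
y = rng.gamma(shape=2.0, scale=25.0, size=n)

# Scrambling variable S drawn by the respondent from a KNOWN distribution;
# here S ~ Uniform(0.5, 1.5), so theta = E(S) = 1.
theta = 1.0
s = rng.uniform(0.5, 1.5, size=n)

# Only the scrambled value Y* = Y * S is reported.
y_star = y * s

# De-scrambled (unbiased) estimator of the mean of Y: mean(Y*) / E(S).
mu_hat = y_star.mean() / theta
print(mu_hat, y.mean())   # the two values should be close
```

The price of the protection is an inflated variance, since Var(Y*) exceeds Var(Y); the methods in this paper account for that inflation under the sampling design.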

The empirical likelihood approach was proposed by Hartley and Rao (1968) and studied by Owen (1988, 2001) and Qin and Lawless (1994) in traditional statistical settings. Under complex survey settings, Wu and Rao (2006) considered a pseudo empirical likelihood approach, and Chen and Kim (2014) proposed population and sample empirical likelihood methods, which are more efficient than the pseudo empirical likelihood method under high-entropy designs. Berger and Torres (2016) and Berger (2018a, 2018b) extended the sample empirical likelihood approach of Chen and Kim (2014) to a more general setting. In this paper, we consider only single-stage sampling designs, including Poisson sampling and stratified probability proportional to size sampling designs. Our proposed approach can be generalized to multi-stage designs by using the method discussed in Berger (2018b); one challenge in multi-stage surveys is that we need to specify conditions on the inclusion probabilities and account for the correlation of observations within the same cluster across stages. We also consider interval estimation using the sample empirical likelihood method of Chen and Kim (2014): after the scale factor is estimated consistently, the adjusted empirical likelihood ratio converges to a standard chi-square distribution, which can be used to construct confidence intervals. External aggregated auxiliary information, such as population sizes by age, gender, and race, can be naturally incorporated into the proposed method to improve the efficiency of the estimators. The proposed method is practical and can be applied to most public-use survey data files, such as those from the National Health and Nutrition Examination Survey (NHANES), the National Health Interview Survey (NHIS), and the Behavioral Risk Factor Surveillance System (BRFSS).
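As background for the chi-square calibration, the following is a minimal sketch of the classical (iid) empirical likelihood ratio for a mean (Owen, 1988), which the survey versions generalize: one solves for the Lagrange multiplier that tilts the empirical weights to satisfy the mean constraint, and -2 log R(μ) is asymptotically chi-square with one degree of freedom at the true μ. The survey methods in this paper further adjust the ratio by a design-based scale factor; the bisection solver and the data below are illustrative, not the paper's implementation.

```python
import numpy as np

def el_log_ratio(x, mu, tol=1e-10):
    """-2 log empirical likelihood ratio for the mean (Owen, 1988)."""
    d = x - mu
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                       # mu outside the convex hull of the data
    # Admissible Lagrange multipliers keep every weight 1/(n(1 + lam*d_i)) positive:
    lo = -1.0 / d.max() + 1e-9
    hi = -1.0 / d.min() - 1e-9
    g = lambda lam: np.sum(d / (1.0 + lam * d))   # monotone decreasing in lam
    while hi - lo > tol:                    # bisection for the root of g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)    # illustrative data with true mean 2.0
print(el_log_ratio(x, x.mean()))            # essentially 0 at the sample mean
print(el_log_ratio(x, 2.0))                 # moderate value at the true mean
```

Comparing -2 log R(μ) with a chi-square(1) quantile inverts the test into a confidence interval for μ; under a complex design the same inversion is applied after the consistent scale-factor adjustment described above.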

The paper is organized as follows. Basic notation, the research questions, and the Hájek estimator are introduced in Section 2. Section 3 presents the proposed sample empirical likelihood method. A simulation study is reported in Section 4, and we apply the proposed methods to 2015-2016 NHANES data in Section 5. Section 6 concludes the paper. All technical details are given in the Appendix.

