Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 1. Introduction

Probability-based sampling has dominated survey research for the greater part of the past century (Stephan, 1948; Frankel and Frankel, 1987). Given complete measures on sampled units with known selection probabilities, randomization theory removes selection bias by generating representative samples of the target population. Non-probability samples, by contrast, are generated without known selection probabilities and are therefore inherently at risk of selection bias, as samples can differ from the target population on key statistics (Groves, 2006). Well-documented failures in the 1936 and 1948 presidential election polls highlight the potential pitfalls of making population inference from non-probability samples (Mosteller, 1949).

Although the probability-sampling-based framework provides survey practitioners precise mathematical tools to assess and correct sampling errors, declining response rates among traditional data collection methods raise concerns over the potentially high nonresponse bias of probability samples. Pew Research reported that their response rates (RRs) in typical telephone surveys dropped from 36% in 1997 to 9% in 2012 (Kohut, Keeter, Doherty, Dimock and Christian, 2012), suggesting that telephone-based probability sampling may no longer be a viable methodology for general population surveys. In addition, obtaining data without exercising much control over the set of units for which it is collected is often cheaper and quicker than probability sampling. For these reasons non-probability sampling is currently staging a kind of renascence (Baker, Brick, Bates, Battaglia, Couper, Dever, Gile and Tourangeau, 2013; Elliott and Valliant, 2017). Online data collection, a platform without a universal sampling frame to conduct probability-based sampling, was estimated to comprise nearly half of all U.S. survey research spending in 2012 (Terhanian and Bremer, 2012), and has almost certainly grown since then.

For many survey agencies, adjusting survey weights to known auxiliary information is the final and most crucial step in the weight construction process. Standard approaches include poststratification, in which weights are adjusted so that the weighted sample distribution of categorical auxiliary variables matches that of the population, and its extension to generalized regression estimation (GREG), which ensures that the weighted sum of each auxiliary variable (continuous or categorical) equals its corresponding total in the population (Deville and Särndal, 1992). Calibration plays an important role in official statistics because it can generate weights such that the weighted demographic estimates across different surveys are consistent.
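The GREG-type adjustment described above can be illustrated with a minimal sketch. It assumes the chi-square distance form of calibration, for which the weights have a closed-form solution; the function name `linear_calibration_weights` and the simulated data are hypothetical and are not taken from the article.

```python
import numpy as np

def linear_calibration_weights(d, X, totals):
    """Chi-square-distance (GREG-type) calibration sketch.

    d      : (n,) initial weights
    X      : (n, p) auxiliary values for the sampled units
    totals : (p,) known population totals of the auxiliaries
    Returns weights w whose weighted auxiliary sums equal `totals`.
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    # Weighted cross-product matrix: sum_i d_i x_i x_i^T
    M = X.T @ (d[:, None] * X)
    # Lagrange multipliers solving the calibration equations
    lam = np.linalg.solve(M, totals - X.T @ d)
    # Calibrated weights: w_i = d_i * (1 + x_i^T lam)
    return d * (1.0 + X @ lam)

# Hypothetical data: one continuous auxiliary variable plus an intercept
rng = np.random.default_rng(0)
N, n = 1000, 50
x_pop = rng.normal(10.0, 2.0, size=N)
sample_idx = rng.choice(N, size=n, replace=False)
X = np.column_stack([np.ones(n), x_pop[sample_idx]])
d = np.full(n, N / n)                    # equal initial weights N / n
totals = np.array([N, x_pop.sum()])      # population size and total of x
w = linear_calibration_weights(d, X, totals)
```

Including a column of ones forces the calibrated weights to sum to the population size, in addition to reproducing the known total of the auxiliary variable.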

In probability samples, when design weights are equal to the inverse of selection probabilities, weighted estimates of sample totals are design-unbiased for the population total. Calibration adjusts design weights by a minimal degree so that the weighted sample totals for auxiliary variables match their known population totals (Särndal, Swensson and Wretman, 1992). In the probability sampling setting, calibration is introduced to reduce variance and/or correct for bias by adjusting for undercoverage or overcoverage of sub-groups of the sample. For large samples, the final calibrated weights can be applied to all variables in the survey, because they approximately maintain the unbiased property of original design weights. In non-probability samples, however, there are no selection probabilities to construct initial design weights that can produce unbiased estimates. Thus, there is no guarantee that the traditional calibrated weights can work for all variables in the non-probability sample. To make inference from non-probability samples, one practical approach is to construct a set of weights that can lower the root-mean-square error (RMSE) of weighted estimates with respect to a specific outcome of interest. Model-assisted calibration provides the framework to construct calibrated weights targeting an outcome variable, given a model that can approximate the expected values of the outcome (Wu and Sitter, 2001). The key to successful model-assisted calibration is a model with strong predictive properties: model parameters estimated from one sample can be used to reliably predict values in a different sample of the same population. Of course, such predictors are not always available; Tourangeau, Conrad and Couper (2013) provide an example where the lack of predictive covariates prevents weighting adjustments from performing well. However, Tourangeau et al. (2013) had in mind household surveys. Predictors can be more powerful in establishment or institutional surveys or in some specialized surveys like election polls. For example, Wang, Rothschild, Goel and Gelman (2015) use party affiliation and the candidate voted for in the previous election to make accurate predictions of the outcome of the 2012 US presidential election based on a non-probability sample whose distribution differed substantially from that of all voters.
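As a rough illustration of the model-assisted calibration idea described above, one common way to write the problem is as a constrained minimization over the weights. The notation here is an assumption made for illustration: $s$ denotes the sample, $d_i$ the initial weights, and $\hat{\mu}_i = \mu(x_i, \hat{\beta})$ the model prediction for unit $i$; the chi-square distance is one possible choice of distance function, and the article's own definitions appear in Section 2.

```latex
\begin{aligned}
\min_{w_i,\; i \in s} \quad & \sum_{i \in s} \frac{(w_i - d_i)^2}{d_i} \\
\text{subject to} \quad & \sum_{i \in s} w_i = N,
\qquad
\sum_{i \in s} w_i \,\hat{\mu}_i = \sum_{i=1}^{N} \hat{\mu}_i .
\end{aligned}
```

Under this formulation the weights are calibrated to the population size and to the population total of the predicted outcomes, rather than to the totals of the auxiliary variables themselves, which is why the predictive quality of the model matters so much.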

Clearly, then, model-assisted calibration might be expected to be most effective when there is a relatively rich set of auxiliary population covariates and consequently an extremely large set of models to be considered. In these settings, obtaining a balance between structure (to minimize model misspecification and thus bias) and parsimony (to stabilize estimates and thus minimize variance) can be challenging. The Least Absolute Shrinkage and Selection Operator, LASSO, is a regularized regression that can perform both variable selection and parameter estimation (Tibshirani, 1996). A wide range of applications have demonstrated that LASSO is effective in preventing model over-fitting by automatically selecting more accurate and parsimonious models. Kamarianakis, Shen and Wynter (2012) found success with LASSO in predicting average traffic speed in the presence of severe multi-collinearity due to aggregated area-level regressors. Kohannim, Hibar, Stein, Jahanshad, Hua, Rajagopalan, Toga, Jack Jr, Weiner, de Zubicaray and McMahon (2012) applied LASSO regression to identify subsets of high-dimensional and correlated single nucleotide polymorphisms (SNPs) that are related to brain structure measures. In a review of challenges in ecological analysis with collinear covariates, Dormann, Elith, Bacher, Buchmann, Carl, Carre, Marquez, Gruber, Lafourcade, Leitao and Münkemüller (2013) found that LASSO is one of the methods to consistently produce low root-mean-square-errors. In the fields of genetics and finance, LASSO has been used effectively in prediction modeling with hundreds or thousands of predictors (Wu, Chen, Hastie, Sobel and Lange, 2009).
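To make the joint selection and estimation behaviour of LASSO concrete, the following is a minimal sketch of plain LASSO for a continuous outcome, fitted by cyclic coordinate descent with soft-thresholding. It is not the adaptive LASSO of the title and not the authors' implementation; the function names, the penalty value, and the simulated data are all hypothetical, and the sketch assumes centered outcomes and standardized predictors with no intercept.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 over b."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n    # (1/n) * x_j^T x_j for each column
    resid = y - X @ beta                 # current residual vector
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of x_j with the residual that excludes x_j's own fit
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Hypothetical data: 10 candidate predictors, only the first two matter
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize columns
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y = y - y.mean()                               # center the outcome
beta_hat = lasso_coordinate_descent(X, y, lam=0.1)
```

The soft-thresholding step is what sets small coefficients exactly to zero, so a sufficiently large penalty removes every predictor from the model while a moderate penalty retains only the stronger ones.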

There is a literature that considers stabilizing forms of traditional calibration. Park and Yang (2008) considered a ridge regression form of a generalized regression estimator that used a penalty term to stabilize the calibration estimators, proving design consistency and showing reduction in variance in simulation studies. Goga, Muhammad-Shehzad and Vanheuverzwyn (2011) and Cardot, Goga and Shehzad (2017) considered calibration to principal components of population totals rather than the population totals themselves, allowing large numbers of auxiliary variables to be collapsed into a manageable subset. Perhaps most relevant to this work, McConville (2011) and McConville, Breidt, Lee and Moisen (2017) developed, again under traditional calibration, the theoretical framework to show approximate design unbiasedness and consistency of the LASSO calibration estimator of a total, given LASSO regression parameter estimates. Although model-assisted calibration with LASSO holds great promise in constructing a set of weights that can result in small RMSE of weighted estimates for an outcome variable in a non-probability sample, there is no theoretical framework established for the bias and consistency properties of model-assisted LASSO calibration estimators for non-probability samples.

Thus the main objectives of this article are:

  1. Develop the theoretical framework for model-assisted calibration with LASSO for both continuous and binary outcome variables: derive the point estimate of the total, its asymptotic expectation, and asymptotic theoretical variance estimate.
  2. Investigate the performance, in terms of root-mean-square-error, of LASSO calibration relative to traditional calibration under different outcome types, sampling schemes, sample sizes, and calibration variable covariance structures.

While our development of the asymptotic theory assumes known design weights, a key finding is that LASSO calibration yields consistent estimators of a population total regardless of whether the design weights are correctly specified, as long as the regression model includes all superpopulation parameters as a subset of the parameters in the model. Hence, in the simulation studies we focus on estimation in the non-probability-based setting, where the initial design weights are taken to be the same as those for simple random sampling (SRS), $d_i = N/n$ for population and sample sizes $N$ and $n$, regardless of how the samples are formed (which in practice would be unknown). We also apply LASSO calibration to estimation of the total number of adults diagnosed with cancer in the US population, using data on cancer incidence from the 2013 National Health Interview Survey (NHIS) and auxiliary population data from the US Census American Community Survey, ignoring the sample design weights to approximate a non-probability sample and comparing results to the fully-weighted (representative) estimates.

This article is organized as follows. Section 2 provides the definitions and notation for calibration and LASSO regression. Section 3 develops the LASSO calibration estimator of a population total, its model expectation, and asymptotic variances. Section 4 describes the simulation and results for evaluating the root-mean-square-error and variance estimates of the LASSO-calibrated estimator. Section 5 considers the NHIS example. We conclude with Section 6, which summarizes the findings.

