Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 6. Conclusion
In this manuscript, we developed the LASSO calibration estimator of population
totals, given population auxiliary data. We also derived closed-form variance
estimates for this estimator. Simulation results show that the point
estimates are approximately unbiased under simple-random sampling and
informative sampling. For sample selections that are related to analysis
variables, LASSO was able to significantly reduce sample bias even without the
correct design weights. LASSO tends to outperform stepwise-selected working
models when covariates are highly collinear. For analyses with many categorical
variables, where there are natural correlations between the categories, the
LASSO calibration estimator can perform better than traditional calibration
estimators, even if the effect sizes are small. The improvement is modest in
the continuous variable setting, but substantial when the outcome of interest
is binary, as shown in simulations and in the NHIS data example. We have
demonstrated theoretically and through simulations that LASSO calibration holds
great promise for making unbiased inference about population totals from
non-probability samples. Although the asymptotic closed-form variance estimates
did not yield confidence intervals with accurate nominal coverage, the naive
bootstrap is a viable alternative approach. In an application estimating the
population total of individuals diagnosed with cancer, without correct design
weights, the LASSO calibration estimator produced the estimate closest to the
estimate based on the correct survey weights. The LASSO calibration estimator
also had the smallest standard error of all the estimators considered, although
the bootstrap variance estimate that was used did not fully account for the
clustering in the NHIS, which generally increases standard errors. The
application shows that LASSO calibration can generate inference to the
population for a specific outcome variable, and that this inference is both
more accurate and more precise than that from traditional calibration
estimators.
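As an illustration of the naive bootstrap mentioned above, the following is a minimal sketch rather than the procedure used in this paper. The function name `bootstrap_se`, the `estimator` callback and the toy expansion estimator are hypothetical. In an actual application each replicate would presumably refit the LASSO working model and recompute the calibrated total. Like the variance estimate discussed above, this sketch resamples individual units and therefore ignores clustering.

```python
import random
import statistics


def bootstrap_se(sample, estimator, n_boot=500, seed=0):
    """Naive bootstrap standard error of estimator(sample).

    sample    : list of sampled units (any objects)
    estimator : function mapping a list of units to a number
                (hypothetical placeholder for the full estimation procedure)
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Resample n units with replacement, ignoring any clustering
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # Standard deviation of the replicate estimates
    return statistics.stdev(replicates)


# Toy example: expansion estimator of a total for an assumed population of 1,000
def toy_total(units, population_size=1000):
    return population_size * sum(units) / len(units)


se = bootstrap_se([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], toy_total)
```

The seed is fixed only so that the sketch is reproducible; the number of replicates is an arbitrary choice.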
The
question arises of when LASSO model-assisted calibration should be used
instead of traditional calibration methods such as GREG. Both theoretical and
empirical results in this paper suggest that, although there is little to be
lost in terms of statistical efficiency by using LASSO model-assisted
calibration, it does require additional effort on the part of the analyst to
implement. While we cannot give specific cutoffs, our analysis suggests that
this effort will be worthwhile when a) there are large numbers of potential
calibration variables, b) many of these calibration variables are likely to be
highly correlated, and c) the outcome is binary rather than continuous. We
believe that conditions a) and b), at least, are increasingly likely to be
encountered in non-probability settings, where administrative datasets might
provide these types of calibration variables and subsets of data obtained
through various means will contain the core variable of interest.
While
LASSO provides particularly convenient and rapid implementation, there are, of
course, other modern regression methods that could be considered for
high-dimensional model-assisted estimation, including ridge regression,
principal components, or Bayesian additive regression trees (Chipman, George
and McCulloch, 2010). These approaches provide opportunity for further research
in this area.
Finally,
we note that this work is only a part of a larger and rapidly expanding
literature on inference from non-probability surveys. In addition to the work of
McConville et al. (2017), the “Mr. P” (multi-level regression and
poststratification or MRP) approach of Wang et al. (2015) also uses
high-dimensional covariates to adjust non-probability samples, by use of a
hierarchical model rather than penalized regression. Quasi-randomization (Elliott,
2009; Elliott, Resler, Flannagan and Rupp, 2010; Elliott and Valliant, 2017)
and sample matching (Rivers, 2007; Vavreck and Rivers, 2008) also provide
alternatives that use data from either known population quantities or
probability sampling estimates to deal with selection bias issues in
non-probability samples. Each has its strengths and weaknesses relative to
the others and to model-assisted LASSO. The MRP approach makes distributional
assumptions that might improve efficiency but might reduce robustness, and it is
non-trivial to implement in its fully Bayesian form. Quasi-randomization
forfeits the link to a particular outcome variable, making the weights it
develops general purpose but likely less effective, while sample matching
requires intervention at the design stage to sample elements from the
non-probability frame that match elements from the population, à la quota
sampling. The decision to use model-assisted LASSO calibration should be made
in the context of these tradeoffs.
Acknowledgements
The
authors thank Professor Fred M. Feinberg and Associate Research Scientist
Sunghee Lee for their helpful review and suggestions, as well as the Editor,
Associate Editor, and three referees whose suggestions greatly contributed to
improving this paper.
Appendix
Determining estimates for adaptive LASSO
In
practice, we do not observe the theoretical rate of growth of $\lambda_n$
that optimizes model fit measures such as AIC
or BIC, unless we have obtained many samples of the same population with
various sample sizes. Given a sample, the choices of $\lambda_n$
and $\gamma$ depend on the modeler. In the R glmnet
implementation (Friedman et al., 2010), a
range of $\lambda_n$
is determined by the following scheme:
- Set
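- Determine $\lambda_{\max}$ by finding the smallest $\lambda_n$ that sets all coefficients to 0.
- If the sample size $n$ is larger than the number of parameters in the
regression model, set $\lambda_{\min} = 0.0001\,\lambda_{\max}$.
If the sample size $n$ is smaller than the number of parameters, set
$\lambda_{\min} = 0.01\,\lambda_{\max}$ (to set parameters to 0 sooner).
- Generate a grid of $\lambda_n$,
typically 100 equally spaced points between $\lambda_{\min}$
and $\lambda_{\max}$.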
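The grid construction described in the scheme above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the R implementation itself: the function name `lambda_grid` is hypothetical, the objective is assumed to be scaled as (1/2n) times the residual sum of squares plus the L1 penalty with an unpenalized intercept, and the points are assumed to be equally spaced on the log scale.

```python
import math


def lambda_grid(X, y, n_lambda=100, min_ratio=None):
    """Decreasing grid of LASSO penalties from lambda_max down to lambda_min.

    X : list of rows (each a list of p covariate values)
    y : list of n responses
    """
    n, p = len(X), len(X[0])
    y_bar = sum(y) / n
    # Smallest penalty that sets every coefficient to 0 (assumed objective scaling):
    # max over j of |x_j' (y - y_bar)| / n
    lam_max = max(
        abs(sum(X[i][j] * (y[i] - y_bar) for i in range(n))) / n
        for j in range(p)
    )
    if lam_max <= 0:
        raise ValueError("no covariate is correlated with y; the grid is undefined")
    if min_ratio is None:
        # Smaller ratio when n exceeds the number of parameters, larger otherwise
        min_ratio = 1e-4 if n > p else 1e-2
    lam_min = min_ratio * lam_max
    # Equal spacing on the log scale (an assumption of this sketch)
    step = (math.log(lam_min) - math.log(lam_max)) / (n_lambda - 1)
    return [math.exp(math.log(lam_max) + k * step) for k in range(n_lambda)]


# Toy data: 5 observations, 2 covariates
X_toy = [[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
y_toy = [2.0, -1.0, 0.5, 3.0, -2.5]
grid = lambda_grid(X_toy, y_toy)
```

Starting the grid at the largest penalty and decreasing it mirrors the usual path-fitting order, in which each fit can be warm-started from the previous, sparser solution.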
The
initial range of values of $\lambda_n$
is determined independently of $\gamma$.
Choices of $\gamma$
are less data-driven. Some modelers choose $\gamma$ from a small set of
fixed values.
Here we determine $\gamma$
through cross-validation as follows:
- Obtain
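- Determine 100 equally spaced values of $\lambda_n$
based on R glmnet's implementation.
- For each pair $(\lambda_n, \gamma)$, with $\lambda_n$
from Step 2 and $\gamma$ from its candidate values,
split the data into 5 folds. Use 4 folds to obtain the coefficient estimates $\hat{\beta}$.
- Apply $\hat{\beta}$
to the last fold, which was not used to estimate $\hat{\beta}$,
and calculate a metric. For continuous $y$,
we calculate the mean absolute error (MAE),
$\frac{1}{m}\sum_{i=1}^{m} |y_i - \hat{y}_i|$ over the $m$ held-out observations.
For binary $y$,
we calculate the area under the curve (AUC)
(calculated through an R
function).
- Average the 5 metrics for each pair $(\lambda_n, \gamma)$
and choose the pair with the best average
metric: minimum MAE for continuous $y$,
maximum AUC for binary $y$.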
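The cross-validation steps above can be sketched for a continuous outcome as follows. This is a minimal illustration, not the R code referred to in this appendix. All function names are hypothetical. The initial estimates are assumed to come from an unpenalized fit on the full sample, the adaptive weights are assumed to be the inverse absolute initial estimates raised to the power gamma, and a plain coordinate-descent solver stands in for the R implementation. The binary case with AUC is not covered.

```python
import random


def soft(z, t):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if abs(z) <= t:
        return 0.0
    return z - t if z > 0 else z + t


def fit_weighted_lasso(X, y, lam, w, n_iter=100):
    """Minimise (1/2n)*sum((y - b0 - Xb)^2) + lam * sum_j w[j]*|b[j]|."""
    n, p = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    r = [yi - b0 for yi in y]                       # current residuals
    z = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + z[j] * b[j]
            new = soft(rho, lam * w[j]) / z[j]
            delta = new - b[j]
            if delta != 0.0:
                r = [r[i] - delta * X[i][j] for i in range(n)]
                b[j] = new
        shift = sum(r) / n                          # refit the unpenalised intercept
        b0 += shift
        r = [ri - shift for ri in r]
    return b0, b


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def cv_adaptive_lasso(X, y, lams, gammas, k=5, seed=0):
    """Return (best average MAE, lambda, gamma) over the grid of pairs."""
    n, p = len(X), len(X[0])
    # Assumed initial estimates: unpenalised fit (lam = 0) on the full sample
    _, b_init = fit_weighted_lasso(X, y, 0.0, [0.0] * p)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    best = None
    for gamma in gammas:
        # Adaptive weights; the floor avoids division by zero
        w = [1.0 / max(abs(bj), 1e-8) ** gamma for bj in b_init]
        for lam in lams:
            scores = []
            for hold in folds:
                held = set(hold)
                train = [i for i in idx if i not in held]
                b0, b = fit_weighted_lasso(
                    [X[i] for i in train], [y[i] for i in train], lam, w)
                pred = [b0 + sum(x * c for x, c in zip(X[i], b)) for i in hold]
                scores.append(mae([y[i] for i in hold], pred))
            avg = sum(scores) / k
            if best is None or avg < best[0]:
                best = (avg, lam, gamma)
    return best
```

For a binary outcome, the same loop would apply with a penalized logistic fit in place of `fit_weighted_lasso` and with the pair maximizing the average AUC selected instead.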
The
adaptive LASSO coefficient estimates are then obtained by solving equations (3.1)
or (3.2) in Section 3.1 given the selected $(\lambda_n, \gamma)$.
The R code used to perform cross-validation is
provided in the on-line supplemental material.
Asymptotic unbiasedness and variance of model-assisted LASSO calibration estimator of a population total
Lemma 1: Assume the superpopulation
model:
Let
be the
finite-population quasilikelihood estimate of
Under conditions
(1)-(5) in Section 3.2, the model-assisted asymptotic estimator of
population total is:
where
Proof. The proof is adapted
and expanded from the proof of Theorem 1 in Wu and Sitter (2001), with
slight modifications in notations to be consistent with this paper. We begin by
deriving the asymptotic model-assisted estimator for a population mean,
(see equation (2.7)).
By conditions (2) and (3), the second order Taylor series expansion of
around
is:
for
or
Let
Note that
is a vector of length
and
is a matrix of size
where
is the number of
parameters in
By conditions (2) and
(3),
Conditions (1) and (3) imply that
By equation (2.2) in
Section 2.1 and the boundedness conditions of (2) and (3) in Section 3.2.2
imply
By conditions (1), (4), and
equation (A.7):
Using conditions (1) and (3),
for
Then from (A.2) and (A.9) and
using conditions (1)-(3), we have
Similarly,
From (A.10) and (A.11) we
have:
Thus
and we have:
Since
we have
Thus,
Theorem 2: Suppose the parameters in a full
regression model have both zero and non-zero components. Without loss of
generality, let the first
be non-zero and the
last
be zero:
Under conditions (1)-(5), the
asymptotic LASSO calibration estimator of total is:
Proof. Under condition (5), the adaptive LASSO regression satisfies the oracle
property through Theorems 1 and 4 in Zou (2006):
where
is the covariance
matrix of
under the linear
model, and
is the inverse of
Fisher information matrix of
under generalized
linear model. By Slutsky’s theorem, the oracle property implies
By condition (1) and
Lemma 1:
Theorem 3:
is model-unbiased.
Proof. Under the assumption of our theoretical framework that the superpopulation
parameters are a subset of the full LASSO regression parameters, we can prove
the model-unbiasedness of
by taking expectations
with respect to model
First note that:
Thus
Thus,
as long as LASSO regression parameters include the superpopulation parameters,
is model-unbiased regardless of design
weights. This property is essential in non-probability samples, where there are
no initial design weights to guarantee unbiasedness.
References
Baker, R., Brick, J.M.,
Bates, N.A., Battaglia, M., Couper, M.P., Dever, J.A., Gile, K. and Tourangeau, R.
(2013). Summary report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology, 1, 90-143.
Chipman, H.A., George,
E.I. and McCulloch, R.E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4, 266-298.
Cardot, H., Goga, C. and
Shehzad, M.-A. (2017). Calibration and partial calibration on principal
components when the number of auxiliary variables is large. Statistica
Sinica, 27, 243-260.
Centers for Disease
Control and Prevention (2005). 2004 National Health Interview Survey (NHIS)
Public Use Data Release: NHIS Survey Description. National Center for
Health Statistics: Hyattsville, Maryland. www.cdc.gov/nchs/data/nhis/srvydesc.pdf.
Czanner, G., Sarma, S.V.,
Eden, U.T. and Brown, E.N. (2008). A signal-to-noise ratio estimator for
generalized linear model systems. Proceedings of the World Congress on
Engineering, vol. 2.
Deville, J.-C., and
Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of
the American Statistical Association, 87, 376-382.
Dormann, C.F.,
Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G.,
Marquéz, J.R.G., Gruber, B., Lafourcade, B., Leitão, P.J.
and Münkemüller, T. (2013). Collinearity: A review of methods to deal with
it and a simulation study evaluating their performance. Ecography, 36, 27-46.
Elliott, M.R. (2009).
Combining data from probability and nonprobability samples using
pseudo-weights. Survey Practice, 2(6).
Elliott, M.R., Resler, A.,
Flannagan, C. and Rupp, J. (2010). Combining data from probability and
non-probability samples using pseudo-weights. Accident Analysis and Prevention, 42,
530-539.
Elliott, M.R., and
Valliant, R. (2017). Inference for non-probability samples. Statistical
Science, 32, 249-264.
Fan, J., and Li, R.
(2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96, 1348-1360.
Friedman, J., Hastie, T.
and Tibshirani, R. (2010). Regularization paths for generalized linear models
via coordinate descent. Journal of Statistical Software, 33, 1-22.
Frankel, M.R., and
Frankel, L.R. (1987). Fifty years of survey sampling in the United States. Public Opinion Quarterly, S127-S138.
Fuller, W.A. (2009). Sampling Statistics. New York: John Wiley & Sons, Inc.
Goga, C.,
Muhammad-Shehzad, A. and Vanheuverzwyn, A. (2011). Principal component
regression with survey data: Application on the French media audience. Proceedings of the 58th World Statistics Congress of the
International Statistical Institute, 3847-3852.
Groves, R.M. (2006).
Nonresponse rates and nonresponse bias in household surveys. Public Opinion
Quarterly, 70, 646-675.
Kamarianakis, Y., Shen, W.
and Wynter, L. (2012). Real-time road traffic forecasting using
regime-switching space-time models and adaptive LASSO. Applied Stochastic
Models in Business and Industry, 28, 297-315.
Kohannim, O., Hibar, D.P.,
Stein, J.L., Jahanshad, N., Hua, X., Rajagopalan, P., Toga, A.,
Jack Jr, C.R., Weiner, M.W., de Zubicaray, G.I. and
McMahon, K.L. (2012). Discovery and replication of gene influences on
brain structure using LASSO regression. Frontiers in Neuroscience, 6,
115.
Kohut, A., Keeter, S.,
Doherty, C., Dimock, M. and Christian, L. (2012). Assessing the
representativeness of public opinion surveys. Pew Research Center for The
People & The Press. http://www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys/.
Mosteller, F. (1949). The
Pre-Election Polls of 1948: The Report to the Committee on Analysis of Pre-Election
Polls and Forecasts, vol. 60, Social Science Research Council.
McConville, K. (2011). Improved Estimation for Complex Surveys
Using Modern Regression Techniques. Unpublished PhD Thesis, Colorado State
University.
McConville, K.,
Breidt, F.J., Lee, T.M. and Moisen, G.G. (2017). Model-assisted
survey regression estimation with the LASSO. Journal of Survey Statistics
and Methodology, 5, 131-158.
Park, M., and Yang, M.
(2008). Ridge regression estimation for survey samples. Communications in
Statistics - Theory and Methods, 37, 532-543.
Rivers, D. (2007).
Sampling for web surveys. Proceedings of the Joint Statistical Meetings, American Statistical Association.
Särndal, C.-E., Swensson,
B. and Wretman, J. (1989). The weighted residual technique for estimating the
variance of the general regression estimator of the finite population total. Biometrika, 76, 527-537.
Särndal, C.-E., Swensson,
B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York:
Springer.
Skinner, C., and Silva,
P. (1997). Variable selection for regression estimation in the presence of
nonresponse. Proceedings of the Survey Research Methods Section, American Statistical Association, 76-81.
Stephan, F.F. (1948).
History of the uses of modern sampling procedures. Journal of the American
Statistical Association, 43, 12-39.
Terhanian, G., and
Bremer, J. (2012). A smarter way to select respondents for surveys? International Journal of Market Research, 54, 751-780.
Tibshirani, R. (1996). Regression
shrinkage and selection via the LASSO. Journal of the Royal Statistical
Society, Series B, 58, 267-288.
Tourangeau, R., Conrad,
F.G. and Couper, M.P. (2013). The Science of Web Surveys. Oxford
University Press, Oxford, UK.
Vavreck, L., and Rivers,
D. (2008). The 2006 Cooperative Congressional Election Study. Journal of
Elections, Public Opinion, and Parties, 355-366.
Wang, W., Rothschild, D.,
Goel, S. and Gelman, A. (2015). Forecasting elections with non-representative
polls. International Journal of Forecasting, 31, 980-991.
Wu, C., and Sitter, R.R.
(2001). A model-calibration approach to using complete auxiliary information
from survey data. Journal of the American Statistical Association, 96,
185-193.
Wu, T.T., Chen, Y.F.,
Hastie, T., Sobel, E. and Lange, K. (2009). Genome-wide association analysis by
LASSO penalized logistic regression. Bioinformatics, 25, 714-721.
Zou, H. (2006). The
adaptive LASSO and its oracle properties. Journal of the American
Statistical Association, 101, 1418-1429.