Model-assisted calibration of non-probability sample survey data using adaptive LASSO
Section 3. Model selection and robust calibration using adaptive LASSO
3.1 Adaptive LASSO background
3.1.1 Definition and parameters
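The adaptive LASSO regression coefficients are obtained by solving a penalized regression problem. For linear adaptive LASSO regression (Zou, 2006):
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\| \mathbf{y} - \sum_{j=1}^{p} \mathbf{x}_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^{p} \alpha_j |\beta_j|,$$
where $\alpha_j$ is an adjustable weight and $\lambda_n$ is a penalty parameter tuned to optimize a model fit measure. Similarly for logistic adaptive LASSO:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left[ -y_i \mathbf{x}_i^T \boldsymbol{\beta} + \log\left(1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}\right) \right] + \lambda_n \sum_{j=1}^{p} \alpha_j |\beta_j|.$$
Given $\alpha_j$ and $\lambda_n$, we can calculate $\hat{\boldsymbol{\beta}}$ through iterative procedures. The R package glmnet will compute both the linear and logistic adaptive LASSO (Friedman, Hastie and Tibshirani, 2010).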
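To make the "iterative procedures" concrete, the following is a minimal sketch of one such procedure, cyclic coordinate descent with soft-thresholding, for the linear criterion $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \sum_j \alpha_j |\beta_j|$. It is written in Python with NumPy rather than R, is not the glmnet implementation, and assumes centered data with no intercept. The function names, the simulated data, and the settings `lam=20.0` and $\gamma = 1$ are illustrative assumptions only.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def adaptive_lasso(X, y, lam, alpha, n_sweeps=1000, tol=1e-12):
    """Minimise ||y - X b||^2 + lam * sum_j alpha[j] * |b[j]| by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.array(y, dtype=float)        # residual y - X @ beta, with beta = 0
    col_ss = (X ** 2).sum(axis=0)           # x_j' x_j for each column
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # x_j' times the partial residual that leaves out coordinate j
            z = X[:, j] @ resid + col_ss[j] * beta[j]
            new = soft_threshold(z, lam * alpha[j] / 2.0) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta    # keep the residual in sync
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:                # stop when a full sweep changes nothing
            break
    return beta

# Hypothetical demonstration data: 3 active and 5 inactive covariates.
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # initial least-squares estimate
alpha = 1.0 / np.abs(beta_ols) ** 1.0             # adaptive weights with gamma = 1
beta_hat = adaptive_lasso(X, y, lam=20.0, alpha=alpha)
```

Setting `lam=0.0` reduces the procedure to ordinary least squares, and a sufficiently large `lam` sets every coefficient to zero; values in between trade off fit against sparsity. The logistic case would need an additional reweighting step and is not covered by this sketch.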
The role of the weight parameter, $\alpha_j$, is to prevent LASSO from selecting covariates with large effect sizes in favor of lowering prediction error when the sample size is small. Thus the weights are inversely proportional to the effect sizes of the regression parameters:
$$\alpha_j = \frac{1}{|\beta_j|^{\gamma}}.$$
A common choice of $\alpha_j$ is $\hat{\alpha}_j = 1 / |\hat{\beta}_j^{\text{MLE}}|^{\gamma}$, where $\hat{\beta}_j^{\text{MLE}}$ is the maximum likelihood estimate of $\beta_j$. The power of the weight parameter, $\gamma$, is a constant greater than 0 that interacts with $\lambda_n$ to control whether LASSO selects or excludes parameters. For example, if we still want LASSO to favor large effect covariates when the sample size is small, we should set $\gamma$ small. If we want to de-emphasize effect sizes further, we should set $\gamma$ large.
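As a small numerical illustration of how the power parameter changes the weights, the sketch below computes weights of the form $1/|\hat{\beta}_j|^{\gamma}$ for two values of $\gamma$. The initial estimates, the helper name `adaptive_weights`, and the guard `eps` against division by zero are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def adaptive_weights(beta_init, gamma, eps=1e-8):
    """Return 1 / |beta_init_j|**gamma; eps guards against division by zero."""
    return 1.0 / np.maximum(np.abs(beta_init), eps) ** gamma

# Hypothetical initial estimates: one large, one moderate, one small effect.
beta_init = np.array([2.0, 0.5, 0.1])

w_small = adaptive_weights(beta_init, gamma=0.5)   # mild differentiation
w_large = adaptive_weights(beta_init, gamma=2.0)   # strong differentiation

# Ratio of the penalty weight on the smallest effect to that on the largest.
spread_small = w_small[2] / w_small[0]
spread_large = w_large[2] / w_large[0]
```

In this example the penalty weight on the smallest effect is about 4.5 times that on the largest effect when $\gamma = 0.5$, and 400 times when $\gamma = 2$, which shows how quickly the relative penalties diverge as $\gamma$ grows.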
3.1.2 Oracle property
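An important concept in measuring the performance of a model selection and estimation method is the "oracle property". The optimal method selects the correct variables and provides unbiased estimates of the selected parameters. Suppose the parameters in a full regression model have both zero and non-zero components. Without loss of generality, let the first $p_0$ be non-zero and the last $p - p_0$ zero:
$$\boldsymbol{\beta} = \left(\boldsymbol{\beta}_A^T, \mathbf{0}^T\right)^T, \qquad \boldsymbol{\beta}_A = (\beta_1, \ldots, \beta_{p_0})^T, \quad \beta_j \neq 0 \text{ for } j \le p_0.$$
A regression model has the oracle property if it satisfies the following conditions (Fan and Li, 2001):
- The probability of estimating 0 for zero-valued parameters tends to one: $P(\hat{\beta}_j = 0) \to 1$ for $j = p_0 + 1, \ldots, p$ as $n \to \infty$.
- The estimates of non-zero parameters are as good as if the true sub-model were known: $\sqrt{n}\,(\hat{\boldsymbol{\beta}}_A - \boldsymbol{\beta}_A) \to_d N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is the covariance matrix of $\hat{\boldsymbol{\beta}}_A$ under the linear model, and is the inverse of the Fisher information matrix of $\boldsymbol{\beta}_A$ under the generalized linear model.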
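For finite-population inference, suppose $\nu$ indexes a population with size $N_\nu$, let $\mathbf{B}_\nu$ be the quasilikelihood estimate of $\boldsymbol{\beta}$ in population $\nu$, and let $\hat{\mathbf{B}}_\nu$ be the estimate of $\mathbf{B}_\nu$ based on a sample with size $n_\nu$. We assume that $N_\nu \to \infty$ and $n_\nu \to \infty$ as $\nu \to \infty$. The finite-population equivalent of the oracle property is then:
$$P(\hat{B}_{\nu, j} = 0) \to 1 \text{ for } j = p_0 + 1, \ldots, p, \qquad \sqrt{n_\nu}\,(\hat{\mathbf{B}}_{\nu, A} - \mathbf{B}_{\nu, A}) \to_d N(\mathbf{0}, \boldsymbol{\Sigma}),$$
where the subscript $A$ denotes the first $p_0$ (non-zero) components, and $\boldsymbol{\Sigma}$ is the covariance matrix of $\hat{\mathbf{B}}_{\nu, A}$ if the model is linear, and the inverse of the Fisher information matrix of $\mathbf{B}_{\nu, A}$ under the generalized linear model.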
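Zou (2006) has shown that if $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$, then the adaptive LASSO satisfies the oracle property. The conditions require that $\lambda_n$ grow faster than $n^{(1 - \gamma)/2}$ but slower than $\sqrt{n}$. The choice of $\lambda_n$ and $\gamma$, and R code for implementing it, are discussed in the Appendix.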
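As a worked illustration of these rate conditions, taking them in the form given by Zou (2006), $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma - 1)/2} \to \infty$, consider a polynomial penalty sequence $\lambda_n = c\, n^{a}$ with $c > 0$. The constants $c$ and $a$ are introduced here for illustration only and are not part of the original notation.

```latex
\begin{align*}
\frac{\lambda_n}{\sqrt{n}} &= c\, n^{a - 1/2} \to 0
  &&\Longleftrightarrow\quad a < \tfrac{1}{2}, \\
\lambda_n\, n^{(\gamma - 1)/2} &= c\, n^{a + (\gamma - 1)/2} \to \infty
  &&\Longleftrightarrow\quad a > \tfrac{1 - \gamma}{2}.
\end{align*}
% Both conditions hold if and only if (1 - \gamma)/2 < a < 1/2.
% This interval is non-empty for every \gamma > 0.
```

For example, with $\gamma = 1$ any exponent $0 < a < 1/2$ is admissible, such as $\lambda_n = n^{1/4}$. With $\gamma = 2$ the lower bound drops to $-1/2$, so even a constant penalty ($a = 0$) satisfies both conditions.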
3.2 LASSO calibration
This section derives the analytical formula for a LASSO estimator of total, its model expectation, and estimators of the asymptotic design variance. We make the following assumptions:
- The sample is drawn from a single-stage sample design allowing for unequal probabilities of selection. The selection probability for unit $i$ is denoted by $\pi_i$, and the joint selection probability of units $i$ and $j$ is denoted by $\pi_{ij}$. We denote the design weight for unit $i$ by $d_i = 1/\pi_i$, the vector of design weights by $\mathbf{d}$, and the diagonal matrix of design weights by $\mathbf{D}$.
- Population-level auxiliary data are known,
denoted by
- A superpopulation model is assumed, as described in Section 2.2 (equation (2.5)).
- The covariates with non-zero superpopulation parameters are a subset of the covariates in the full regression model supplied to LASSO.
- The full-range of
in the population has non-zero probability of
being observed in the analytical sample.
3.2.1 Point estimate:
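The LASSO calibration estimate of the total can be obtained through the following steps:
- Obtain the LASSO regression coefficients $\hat{\boldsymbol{\beta}}$ as described in the Appendix.
- Use $\hat{\boldsymbol{\beta}}$ to calculate the predicted value for every unit in the population.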
- Define
and
under chi-square distance measure with
- Determine the LASSO calibration estimator of
total:
- where
is the calibration slope to satisfy the
calibration constraints:
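The calibration step above can be sketched numerically. The code below is a hedged illustration, not the paper's implementation: it assumes the chi-square distance $\sum_s (w_i - d_i)^2 / (d_i q_i)$ with $q_i = 1$, and calibration constraints of the form used in model calibration by Wu and Sitter (2001), namely that the calibrated weights sum to the population size and reproduce the population total of the predicted values. The function name, the simulated population, and the stand-in predictions `mu_pop` (which in practice would come from the fitted LASSO model) are all hypothetical.

```python
import numpy as np

def chi2_calibration_weights(d, mu_s, mu_pop_total, N, q=None):
    """Minimise sum (w - d)^2 / (d q) subject to sum(w) = N and sum(w * mu) = mu_pop_total."""
    d = np.asarray(d, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    Z = np.column_stack([np.ones_like(d), mu_s])     # calibration variables (1, mu_hat_i)
    targets = np.array([N, mu_pop_total], dtype=float)
    dq = d * q
    M = Z.T @ (dq[:, None] * Z)                      # Z' D Q Z
    lagrange = np.linalg.solve(M, targets - Z.T @ d) # Lagrange multipliers
    return d + dq * (Z @ lagrange)                   # closed-form calibrated weights

# Hypothetical population and sample (simple random sampling for illustration).
rng = np.random.default_rng(1)
N, n = 1000, 100
x_pop = rng.normal(size=N)
mu_pop = 1.0 + 2.0 * x_pop          # stand-in for model predictions in the population
y_pop = mu_pop + rng.normal(size=N)
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)               # design weights 1 / pi_i

w = chi2_calibration_weights(d, mu_pop[s], mu_pop.sum(), N)
y_total_cal = float(w @ y_pop[s])   # calibration estimate of the total
```

Under this distance measure the calibrated total is algebraically identical to a regression-type estimator: the design-weighted total of $y$ plus the calibration shortfalls multiplied by a design-weighted regression slope of $y$ on the predicted values.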
3.2.2 Asymptotic behavior of the LASSO calibration estimator of total
Wu and Sitter (2001) established the conditions needed to derive an asymptotic model-assisted calibration estimator. We state the conditions here with slight modifications in notation to be consistent with the current research. Let $\boldsymbol{\beta}$ be the true superpopulation parameter for the model defined in equation (2.5), and $\mathbf{B}_\nu$ be the finite-population quasilikelihood estimator of $\boldsymbol{\beta}$. The following conditions are used for deriving the asymptotic LASSO calibration estimator of total:
-
is the finite-population regression slope of
- For each
is continuous in
and
for
in a neighborhood of
and
- For each
is continuous in
and
for
in a neighborhood of
and
- The Horvitz-Thompson (HT) estimators of
certain population means are asymptotically normally distributed (Fuller, 2009;
pages 47-57).
-
Lemma 1: Assume that superpopulation model (2.5) holds. Let $\mathbf{B}_\nu$ be the finite-population quasilikelihood estimate of $\boldsymbol{\beta}$. Under conditions (1)-(5), the model-assisted asymptotic estimator of the population total is:
where
Proof. See Appendix.
Given Lemma 1, we derive the asymptotic LASSO estimator of total in Theorem 1. We show in Theorem 2 that this estimator is model-unbiased for the population total. Finally, Theorem 3 determines variance estimates for the LASSO calibration estimator of a total.
Theorem 1: Suppose the parameters in a full regression model have both zero and non-zero components. Without loss of generality, let the first $p_0$ be non-zero and the last $p - p_0$ be zero, so that $\beta_j \neq 0$ for $j \le p_0$ and $\beta_j = 0$ for $j > p_0$. Under conditions (1)-(5), the asymptotic LASSO calibration estimator of total is:
Proof. See Appendix.
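Theorem 2: The LASSO calibration estimator of total is model-unbiased, that is, its expected deviation from the population total under the superpopulation model is zero.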
Proof. See Appendix.
Thus, as long as the LASSO regression parameters include the superpopulation parameters, the LASSO calibration estimator of total is model-unbiased regardless of design
weights. (Note that this is a quality that
shares with
However,
can assume models with much larger numbers of
covariates than
This property is essential in non-probability
samples, where there are no initial design weights to guarantee unbiasedness.
Theorem 3: The estimator of the asymptotic variance of the LASSO calibration estimator of total is given by
Proof. The theoretical design variance of the LASSO estimator is
which follows from equation (3.30), derived for the variance of the traditional LASSO calibration estimator of total in McConville (2011). Equation (3.6) then follows from replacing population quantities with their sample estimates.
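An alternative variance estimate, suggested by Särndal, Swensson and Wretman (1989), multiplies the residuals by the $g$-weights, which are the ratios of the calibrated weights to the original design weights: $g_i = w_i / d_i$, where $w_i$ is the calibrated weight and $d_i$ the design weight for unit $i$.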
To simplify notations, we refer to
as
and
as
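To illustrate the two kinds of variance estimate discussed above, the sketch below uses a standard Horvitz-Thompson-type form, $\sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \frac{e_i}{\pi_i} \frac{e_j}{\pi_j}$, applied to calibration residuals $e_i$, and its $g$-weighted variant with $e_i$ replaced by $g_i e_i$. This form is an assumption adopted for illustration and is not claimed to be equation (3.6) of the paper. The residuals, the $g$-weights, and the function name are hypothetical, and the design is simple random sampling without replacement so that the joint inclusion probabilities are known.

```python
import numpy as np

def ht_type_variance(e, pi, pi_joint, g=None):
    """Horvitz-Thompson-type variance estimate for residuals e; optional g-weights."""
    e = np.asarray(e, dtype=float)
    pi = np.asarray(pi, dtype=float)
    g = np.ones_like(e) if g is None else np.asarray(g, dtype=float)
    u = g * e / pi                                       # expanded (g-weighted) residuals
    delta = (pi_joint - np.outer(pi, pi)) / pi_joint     # (pi_ij - pi_i pi_j) / pi_ij
    return float(u @ delta @ u)                          # double sum over the sample

# Hypothetical simple random sample without replacement.
N, n = 500, 50
pi = np.full(n, n / N)                                   # first-order probabilities
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))  # second-order probabilities
np.fill_diagonal(pi_joint, pi)                           # pi_ii = pi_i

rng = np.random.default_rng(2)
e = rng.normal(size=n)                  # stand-in calibration residuals
g = 1.0 + 0.1 * rng.normal(size=n)      # stand-in g-weights (calibrated / design)

v_plain = ht_type_variance(e, pi, pi_joint)
v_gweighted = ht_type_variance(e, pi, pi_joint, g=g)
```

Under simple random sampling the unweighted version reduces to the familiar $N^2 (1 - n/N)\, s_e^2 / n$, where $s_e^2$ is the sample variance of the residuals, which gives a convenient check on the double-sum implementation.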
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2018
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa