Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 2. Model-assisted estimation under probability sampling
2.1 GREG estimators
Consider the estimation of a finite population total $t_y = \sum_{k \in U} y_k$, where $U = \{1, 2, \ldots, N\}$ is the set of units of the finite population and $y_k$ is the value of the survey variable of interest for the unit $k \in U$. Let $s \subseteq U$, of size $n$, be a sample selected according to a sampling design $P(\cdot)$, where $P(s)$ is the probability of selecting $s$. For $k \in U$, let $\pi_k = \Pr(k \in s)$ denote the first-order inclusion probabilities of the design. We assume $\pi_k > 0$ for all $k \in U$. Additionally, assume $p$ auxiliary variables, $\boldsymbol{x}_k = (x_{1k}, \ldots, x_{pk})^{\top}$, are known for each $k \in U$. A standard approach is to use the Horvitz-Thompson estimator
$$ \hat{t}_{y,\mathrm{HT}} = \sum_{k \in s} d_k\, y_k, $$
where $d_k = 1/\pi_k$ denotes the design weight of unit $k$. Under this strictly design-based framework, the auxiliary data do not impact the form of the estimator but can impact the design weights through the specification of the sampling design.
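For illustration, a minimal Python/NumPy sketch of the Horvitz-Thompson computation, using made-up sample data and our own variable names:

```python
import numpy as np

# Hypothetical sampled y-values and their first-order inclusion probabilities
y_s = np.array([12.0, 7.5, 20.1, 3.3, 15.8])
pi_s = np.array([0.10, 0.25, 0.10, 0.50, 0.25])

d_s = 1.0 / pi_s                 # design weights d_k = 1 / pi_k
t_y_ht = np.sum(d_s * y_s)       # Horvitz-Thompson estimate of the total t_y
print(t_y_ht)
```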
One strategy to use auxiliary data in estimation is to employ a model-assisted estimator of $t_y$ by specifying a working model for the mean of $y_k$ given $\boldsymbol{x}_k$ and using this model to predict the $y$-values. Specifying a linear regression working model leads to the generalized regression (GREG) estimator (Cassel, Särndal and Wretman, 1976). The GREG estimator typically has smaller variance than the Horvitz-Thompson estimator if the working model has some predictive power for $y_k$. Here, we consider the GREG estimator under the linear regression working model
$$ y_k = \boldsymbol{x}_k^{\top}\boldsymbol{\beta} + \varepsilon_k, \tag{2.1} $$
with the $\varepsilon_k$ independent and identically distributed with mean zero and variance $\sigma^2$, and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\top}$. The GREG estimator is given by
$$ \hat{t}_{y,\mathrm{GREG}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}\right), \tag{2.2} $$
with the regression coefficients estimated as
$$ \hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{X}_s\right)^{-1}\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{y}_s, \tag{2.3} $$
where $\boldsymbol{X}_s$ is the $n \times p$ matrix with rows $\boldsymbol{x}_k^{\top}$, $k \in s$, $\boldsymbol{y}_s = (y_k)_{k \in s}$ is an $n$-vector and $\boldsymbol{\Pi}_s = \mathrm{diag}(\pi_k,\, k \in s)$ is an $n \times n$ diagonal matrix of first-order inclusion probabilities for the sampled units.
The GREG estimator can also be written as a weighted sum of the variable of interest, yielding regression weights that are independent of $y_k$ and, therefore, can be applied to any study variable,
$$ \hat{t}_{y,\mathrm{GREG}} = \sum_{k \in s} w_k\, y_k, \qquad w_k = d_k\left\{1 + \left(\boldsymbol{t}_x - \hat{\boldsymbol{t}}_{x,\mathrm{HT}}\right)^{\top}\left(\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{X}_s\right)^{-1}\boldsymbol{x}_k\right\}, \tag{2.4} $$
where $\boldsymbol{t}_x = \sum_{k \in U} \boldsymbol{x}_k$ is the known population total vector of the covariates and $\hat{\boldsymbol{t}}_{x,\mathrm{HT}} = \sum_{k \in s} d_k\, \boldsymbol{x}_k$ is the Horvitz-Thompson estimator of the vector of covariate population totals. The regression weights, $w_k$, are termed calibration weights because they satisfy the calibration constraint $\sum_{k \in s} w_k\, \boldsymbol{x}_k = \boldsymbol{t}_x$. The calibration weight $w_k$ does not depend on the study variable $y_k$. Note that the GREG estimator (2.4) can alternatively be expressed as
$$ \hat{t}_{y,\mathrm{GREG}} = \hat{t}_{y,\mathrm{HT}} + \left(\boldsymbol{t}_x - \hat{\boldsymbol{t}}_{x,\mathrm{HT}}\right)^{\top}\hat{\boldsymbol{\beta}}, $$
which only requires the known population totals $\boldsymbol{t}_x$. For the GREG estimator, the individual population values of the auxiliary variables are not needed.
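To make the GREG computations concrete, the following is a minimal Python/NumPy sketch of (2.2)-(2.4) under simple random sampling. The population, sample size and variable names are hypothetical; the assertions simply verify the calibration constraint and the equivalence of the two forms of the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: an intercept plus two auxiliary variables
N, n = 1000, 100
X_U = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y_U = X_U @ np.array([5.0, 2.0, -1.0]) + rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)         # simple random sample, so pi_k = n/N
d_s = np.full(n, N / n)                          # design weights d_k = 1/pi_k
X_s, y_s = X_U[s], y_U[s]

t_x = X_U.sum(axis=0)                            # known covariate population totals t_x
t_x_ht = (d_s[:, None] * X_s).sum(axis=0)        # HT estimates of those totals

# Survey-weighted regression coefficients, as in (2.3)
A = X_s.T @ (X_s * d_s[:, None])                 # X_s' Pi_s^{-1} X_s
beta_hat = np.linalg.solve(A, X_s.T @ (d_s * y_s))

# GREG estimate via the "known totals only" expression
t_y_greg = np.sum(d_s * y_s) + (t_x - t_x_ht) @ beta_hat

# Equivalent weighted form (2.4): calibration weights w_k
w = d_s * (1.0 + X_s @ np.linalg.solve(A, t_x - t_x_ht))
assert np.allclose(w @ X_s, t_x)                 # calibration constraint
assert np.isclose(w @ y_s, t_y_greg)             # same estimate either way
print(t_y_greg, y_U.sum())
```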
If a variable selection procedure, such as a forward stepwise procedure, is implemented prior to fitting the linear regression model, then the calibration weights will depend on $y_k$, as the selected models may vary across study variables. This type of stepwise survey regression estimator is calibrated to the auxiliary variables selected by the variable selection procedure for a specific variable of interest, $y_k$.
Using a working linear regression model with many auxiliary variables, including interactions of categorical auxiliary variables, can produce substantially variable weights and greatly increase the variance of the GREG estimator. Furthermore, some of the regression weights, $w_k$, may be negative, thus losing the interpretation of a weight as the number of population units represented by the sampled unit.
2.2 Survey regression estimator with lasso
If the linear regression model in (2.1) is sparse, i.e., $p$ is large and, say, only a few of the regression coefficients are nonzero, then the estimation of the zero coefficients in (2.3) leads to extra variation in the GREG estimator (2.2). In this case, model selection to remove extraneous variables could reduce the overall design variance of the GREG estimator, leading to more efficient estimates of finite population totals. The least absolute shrinkage and selection operator (lasso) method, developed by Tibshirani (1996), simultaneously performs model selection and coefficient estimation by shrinking some regression coefficients to zero. The lasso approach estimates coefficients by minimizing the sum of squared residuals subject to a penalty constraint on the sum of the absolute values of the regression coefficients.
McConville et al. (2017) proposed using survey-weighted lasso estimated regression coefficients given by
$$ \hat{\boldsymbol{\beta}}_{\mathrm{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\right\}, $$
where $\lambda \ge 0$ is a penalty parameter. The lasso survey regression estimator for the total $t_y$ is then given by
$$ \hat{t}_{y,\mathrm{lasso}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}\right). $$
The value of the penalty parameter $\lambda$ must be selected prior to obtaining the estimated coefficients. In general, this process of specifying hyperparameters prior to fitting the final model is called hyperparameter tuning. There are several selection criteria that can be used to choose hyperparameter values, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) or cross-validation. In our simulation study, we used a version of cross-validation that incorporates the design weights; see McConville (2011) for a discussion of the selection of the penalty parameter for survey-weighted lasso coefficient estimates.
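The following is a minimal sketch of how the survey-weighted lasso coefficients and the resulting estimate of the total could be computed. The coordinate-descent routine, the convention of leaving the intercept unpenalized, the fixed penalty value `lam` and all data are our own illustrative choices; in practice $\lambda$ would be tuned (e.g., by design-weighted cross-validation) and a packaged solver would typically be used.

```python
import numpy as np

def survey_weighted_lasso(X, y, d, lam, n_iter=200):
    """Coordinate descent for the survey-weighted lasso criterion
    sum_k d_k (y_k - x_k' beta)^2 + lam * sum_j |beta_j|,
    leaving the intercept (assumed to be column 0) unpenalized."""
    n, p = X.shape
    beta = np.zeros(p)
    penalty = np.full(p, lam)
    penalty[0] = 0.0
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual without x_j
            z = np.sum(d * X[:, j] * r_j)
            a = np.sum(d * X[:, j] ** 2)
            beta[j] = np.sign(z) * max(abs(z) - penalty[j] / 2.0, 0.0) / a  # soft threshold
    return beta

# Hypothetical sparse population: only 3 of 11 coefficients are nonzero
rng = np.random.default_rng(1)
N, n = 2000, 150
X_U = np.column_stack([np.ones(N), rng.normal(size=(N, 10))])
beta_true = np.array([4.0, 3.0, 0.0, 0.0, -2.0] + [0.0] * 6)
y_U = X_U @ beta_true + rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)                  # SRS, so pi_k = n/N
d_s = np.full(n, N / n)
beta_lasso = survey_weighted_lasso(X_U[s], y_U[s], d_s, lam=500.0)  # lam is illustrative only

# Lasso survey regression estimate: population prediction total + HT correction of sample residuals
t_y_lasso = (X_U @ beta_lasso).sum() + np.sum(d_s * (y_U[s] - X_U[s] @ beta_lasso))
print(np.round(beta_lasso, 2), t_y_lasso, y_U.sum())
```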
2.3 Survey regression estimator with adaptive lasso
An issue with the use of the lasso criterion is that, by shrinking the regression coefficients towards zero, it yields biased estimates for regression coefficients that are far from zero. Under the adaptive lasso criterion (Zou, 2006), the coefficients in the penalty are weighted by the inverse of a root-$n$ consistent estimator of $\boldsymbol{\beta}$. Therefore, the bias for large coefficients tends to be smaller.
McConville et al. (2017) considered an adaptive lasso survey regression estimator
$$ \hat{t}_{y,\mathrm{alasso}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}}\right), $$
where
$$ \hat{\boldsymbol{\beta}}_{\mathrm{alasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_j|}\right\} $$
and $\hat{\boldsymbol{\beta}} = (\hat{\beta}_1, \ldots, \hat{\beta}_p)^{\top}$ is given by (2.3). The reliance of the adaptive lasso method on the standard weighted linear regression coefficient estimates, $\hat{\boldsymbol{\beta}}$, leads to a loss of efficiency in settings where $p$ is large, because these estimates tend to be very unstable.
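A small sketch of the adaptive reweighting, under the assumption that the survey-weighted least squares fit (2.3) is used as the initial estimator; the function name and the exact penalty scaling are our own illustrative choices.

```python
import numpy as np

def adaptive_lasso_penalties(X_s, y_s, d_s, lam):
    """Per-coefficient penalties lam / |beta_hat_j|, with beta_hat the survey-weighted
    least squares estimate of (2.3); a sketch of the adaptive reweighting only."""
    A = X_s.T @ (X_s * d_s[:, None])
    beta_hat = np.linalg.solve(A, X_s.T @ (d_s * y_s))
    return lam / np.abs(beta_hat)
```

In the coordinate-descent sketch of Section 2.2, the single penalty value would simply be replaced by these per-coefficient penalties, so that coefficients with small initial estimates are penalized more heavily while large coefficients are shrunk less.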
2.4 Lasso calibration estimators
The lasso and adaptive lasso methods do not produce regression weights directly, as the estimators cannot be expressed as weighted combinations of the $y$-values. McConville et al. (2017) developed lasso survey regression weights using a model calibration approach and a ridge regression approximation. These lasso regression weights depend on the variable of interest, $y_k$.
The lasso calibration estimator is calculated by regressing the variable of interest, $y_k$, on an intercept and the lasso-fitted mean function $\hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k) = \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}$. The lasso calibration estimator can be written in the same form as (2.4), where $\boldsymbol{x}_k$ is replaced by $\boldsymbol{z}_k = \left(1,\, \hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k)\right)^{\top}$,
$$ \hat{t}_{y,\mathrm{lasso\text{-}cal}} = \sum_{k \in s} w_{k,\mathrm{lasso}}\, y_k, \qquad w_{k,\mathrm{lasso}} = d_k\left\{1 + \left(\boldsymbol{t}_z - \hat{\boldsymbol{t}}_{z,\mathrm{HT}}\right)^{\top}\left(\boldsymbol{Z}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{Z}_s\right)^{-1}\boldsymbol{z}_k\right\}, \tag{2.5} $$
where $\boldsymbol{Z}_s$ is the $n \times 2$ matrix with rows $\boldsymbol{z}_k^{\top}$, $k \in s$, $\boldsymbol{t}_z = \sum_{k \in U} \boldsymbol{z}_k$ and $\hat{\boldsymbol{t}}_{z,\mathrm{HT}} = \sum_{k \in s} d_k\, \boldsymbol{z}_k$. Similarly, the adaptive lasso calibration estimator is given by the same expression, where the lasso-fitted mean for $k$ in (2.5) is replaced by the adaptive lasso fit, $\hat{m}_{\mathrm{alasso}}(\boldsymbol{x}_k) = \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}}$. The weights for the lasso calibration estimators are calibrated to the population size $N$ and to the population total of the lasso-fitted mean functions, $\sum_{k \in U} \hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k)$.
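A minimal sketch of the model-calibration step, assuming the fitted mean function can be evaluated for every unit in the frame; the ridge regression approximation mentioned above is omitted, and the function and argument names are our own.

```python
import numpy as np

def lasso_calibration_estimate(y_s, d_s, m_U, s_idx):
    """Model-calibration sketch: regress y on an intercept and the fitted mean m_hat(x),
    with design weights, and form GREG-type weights as in (2.4) with x_k replaced by
    z_k = (1, m_hat(x_k)). m_U holds the (adaptive) lasso-fitted means for all of U."""
    Z_U = np.column_stack([np.ones(m_U.size), m_U])
    Z_s = Z_U[s_idx]
    t_z = Z_U.sum(axis=0)                          # (N, population total of fitted means)
    t_z_ht = (d_s[:, None] * Z_s).sum(axis=0)
    A = Z_s.T @ (Z_s * d_s[:, None])
    w = d_s * (1.0 + Z_s @ np.linalg.solve(A, t_z - t_z_ht))
    assert np.allclose(w @ Z_s, t_z)               # calibrated to N and to sum of fitted means
    return np.sum(w * y_s), w

# Tiny usage example with made-up fitted means
rng = np.random.default_rng(2)
N, n = 500, 60
m_U = rng.normal(10.0, 2.0, size=N)                # pretend these are lasso-fitted means
s_idx = rng.choice(N, size=n, replace=False)
y_s = m_U[s_idx] + rng.normal(size=n)
t_hat, w = lasso_calibration_estimate(y_s, np.full(n, N / n), m_U, s_idx)
print(t_hat)
```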
2.5 Regression tree estimator
The GREG estimator can also be expressed as
$$ \hat{t}_y = \sum_{k \in U} \hat{m}(\boldsymbol{x}_k) + \sum_{k \in s} d_k\left(y_k - \hat{m}(\boldsymbol{x}_k)\right), \tag{2.6} $$
where $\hat{m}(\boldsymbol{x}_k)$ is an estimator of the mean function of $y_k$ given $\boldsymbol{x}_k$, based on the sample data $\{(y_k, \boldsymbol{x}_k) : k \in s\}$. As an alternative to a linear regression model, McConville and Toth (2019) proposed estimating $m(\boldsymbol{x}_k)$ with a regression tree model using the following algorithm:
1. Let $n_{\min}$ be the minimum box size and $\alpha$ be a specified significance level.
2. If the dataset contains at least $2 n_{\min}$ observations, then continue to step 3; otherwise, stop.
3. Among the auxiliary variables $x_1, \ldots, x_p$, choose a variable to split the data. The chosen variable is the one that shows the most significant difference, at level $\alpha$, when testing the null hypothesis of a homogeneous mean of the variable of interest. If no variable leads to a significant difference, then stop.
4. Split the data into two subsets by choosing the split value of the selected variable that results in the largest decrease in the estimated mean square error, while satisfying the requirement that each subset contains at least $n_{\min}$ units.
5. For each of the resulting subsets of the data, return to step 1.
The resulting regression tree model groups the
categories of an auxiliary variable based on their relationship to the variable
of interest and only includes auxiliary variables and interactions associated
with this variable. Importantly, including a categorical variable does not
require a split for each category, potentially reducing the model size substantially
while still capturing important interactions.
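To make the recursion concrete, the following simplified Python sketch performs this kind of recursive partitioning. It is not McConville and Toth's (2019) procedure: the significance test in step 3 is replaced by a crude relative-gain stopping rule (`min_rel_gain`, our own device standing in for the role of $\alpha$), only binary splits on numeric or ordered covariates are considered, and the split criterion is the design-weighted residual sum of squares.

```python
import numpy as np

def split_boxes(X, y, d, min_size=25, min_rel_gain=0.01):
    """Simplified recursive-partitioning sketch (not the exact McConville-Toth algorithm).
    Each split minimizes the design-weighted residual sum of squares, subject to a
    minimum box size; splitting stops when the best relative gain is too small.
    Returns an integer box label for every sample unit."""
    n, p = X.shape
    boxes = np.zeros(n, dtype=int)               # all units start in a single box
    next_label = 1
    stack = [np.arange(n)]                       # boxes still to be examined

    def wsse(idx):                               # design-weighted within-box sum of squares
        w = d[idx]
        mu = np.sum(w * y[idx]) / np.sum(w)
        return np.sum(w * (y[idx] - mu) ** 2)

    while stack:
        idx = stack.pop()
        if idx.size < 2 * min_size:              # too small to split into two valid boxes
            continue
        parent_sse = wsse(idx)
        best = None                              # (gain, variable, threshold)
        for j in range(p):
            xj = X[idx, j]
            for c in np.unique(xj)[:-1]:         # candidate split points
                left, right = idx[xj <= c], idx[xj > c]
                if left.size < min_size or right.size < min_size:
                    continue
                gain = parent_sse - wsse(left) - wsse(right)
                if best is None or gain > best[0]:
                    best = (gain, j, c)
        if best is None or best[0] <= min_rel_gain * parent_sse:
            continue                             # no worthwhile split: stop here
        _, j, c = best
        left, right = idx[X[idx, j] <= c], idx[X[idx, j] > c]
        boxes[right] = next_label                # right child gets a new box label
        next_label += 1
        stack.extend([left, right])              # recurse on both children
    return boxes

# Usage: y depends on the first covariate only, through a single threshold at 0.5
rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 2))
y = 5.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.5, size=400)
print(np.unique(split_boxes(X, y, np.full(400, 10.0), min_size=50, min_rel_gain=0.1)))
```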
After fitting a regression tree model, we obtain a set of boxes $B_1, \ldots, B_L$ which partition the data. Let $\mathbb{1}(\boldsymbol{x}_k \in B_\ell) = 1$ if $\boldsymbol{x}_k \in B_\ell$ and 0 otherwise, for $\ell = 1, \ldots, L$. This means that $\mathbb{1}(\boldsymbol{x}_k \in B_\ell) = 1$ for exactly one box $B_\ell$ for every $k \in U$. For every $k \in U$, the estimator of $m(\boldsymbol{x}_k)$ is given by
$$ \hat{m}(\boldsymbol{x}_k) = \sum_{\ell=1}^{L} \mathbb{1}(\boldsymbol{x}_k \in B_\ell)\, \frac{\sum_{j \in s} d_j\, y_j\, \mathbb{1}(\boldsymbol{x}_j \in B_\ell)}{\hat{N}_\ell}, \tag{2.7} $$
where $\hat{N}_\ell = \sum_{j \in s} d_j\, \mathbb{1}(\boldsymbol{x}_j \in B_\ell)$ is the HT estimator of the population size in box $B_\ell$. The regression tree estimator is obtained by inserting equation (2.7) into the generalized regression estimator, given in equation (2.6), leading to the post-stratified estimator
$$ \hat{t}_{y,\mathrm{tree}} = \sum_{\ell=1}^{L} N_\ell\, \frac{\sum_{k \in s} d_k\, y_k\, \mathbb{1}(\boldsymbol{x}_k \in B_\ell)}{\hat{N}_\ell}, $$
where $N_\ell = \sum_{k \in U} \mathbb{1}(\boldsymbol{x}_k \in B_\ell)$ is the number of units in $U$ that belong to box $B_\ell$.
Since $\hat{t}_{y,\mathrm{tree}}$ can be written as a linear regression estimator with the indicator functions $\mathbb{1}(\boldsymbol{x}_k \in B_1), \ldots, \mathbb{1}(\boldsymbol{x}_k \in B_L)$ as covariates, the regression tree estimator is also a post-stratified estimator, where each box represents a post-stratum. This implies that this estimator is calibrated to the population total of each box, $N_\ell$, providing a data-driven mechanism, dependent on $y_k$, for selecting post-strata that ensures that none of them are empty. As a result, the regression weights are guaranteed to be non-negative. The weights produced by this estimation procedure depend on the variable of interest, $y_k$. Therefore, unlike the GREG approach, a single set of generic weights that can be applied to all study variables is not available. Instead, a set of weights is produced for each survey variable of interest.
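Given box labels for the sampled units and for the full frame, the post-stratified estimate reduces to a few lines; a minimal sketch with our own argument names:

```python
import numpy as np

def tree_poststratified_estimate(y_s, d_s, box_s, box_U):
    """Post-stratified (regression tree) estimate of t_y: for each box, the design-weighted
    sample mean of y in the box is scaled by the known population box count N_l.
    The tree construction guarantees every box contains sampled units, so N_l_hat > 0."""
    t_hat = 0.0
    for b in np.unique(box_U):
        in_box = box_s == b
        N_b = np.sum(box_U == b)                 # population count in box b
        N_b_hat = np.sum(d_s[in_box])            # HT estimate of that count
        t_hat += N_b * np.sum(d_s[in_box] * y_s[in_box]) / N_b_hat
    return t_hat
```

Equivalently, each sampled unit $k$ in box $B_\ell$ receives the weight $w_k = d_k N_\ell / \hat{N}_\ell$, which is non-negative by construction.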
2.6 Variance estimation under stratified simple random sampling
Under stratified simple random sampling, a variance estimator of the model-assisted survey regression estimators described above is obtained by the Taylor linearization method and is given by
$$ \hat{V} = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{1}{n_h\,(n_h - 1)} \sum_{k \in s_h} \left(e_{hk} - \bar{e}_h\right)^2, $$
where $h = 1, \ldots, H$ indexes the strata, $N_h$ is the number of population units in stratum $h$, $n_h$ is the number of sampled units in stratum $h$, $s_h$ is the sample in stratum $h$, $e_{hk} = y_{hk} - \hat{m}(\boldsymbol{x}_{hk})$ is the residual of sample unit $k$ in stratum $h$ under the regression model and $\bar{e}_h = n_h^{-1} \sum_{k \in s_h} e_{hk}$ is the average residual in stratum $h$. The variance estimators readily extend to more complex sampling designs, but for simplicity we have given the expression only for stratified simple random sampling, which is used in the simulation study of Section 3.
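A minimal sketch of this variance estimate, taking the sample residuals, stratum labels and stratum population sizes as inputs; argument names are our own and each stratum is assumed to contain at least two sampled units.

```python
import numpy as np

def strat_srs_variance(e, stratum, N_h):
    """Linearization variance estimate under stratified SRS from residuals e_hk.
    `e` and `stratum` are sample-level arrays; N_h maps a stratum label to its
    population size."""
    v = 0.0
    for h in np.unique(stratum):
        e_h = e[stratum == h]
        n_h = e_h.size
        v += N_h[h] ** 2 * (1.0 - n_h / N_h[h]) * np.var(e_h, ddof=1) / n_h
    return v

# Usage with made-up residuals from two strata
e = np.array([0.5, -0.2, 0.1, -0.4, 0.3, 0.0])
stratum = np.array([1, 1, 1, 2, 2, 2])
print(strat_srs_variance(e, stratum, N_h={1: 120, 2: 80}))
```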