Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 2. Model-assisted estimation under probability sampling
2.1 GREG estimators
Consider the estimation of a finite population total $t_y = \sum_{k \in U} y_k$, where $U = \{1, 2, \ldots, N\}$ is the set of units of the finite population and $y_k$ is the value of the survey variable of interest for the unit $k \in U$. Let $s \subseteq U$, of size $n$, be a sample selected according to a sampling design $P(\cdot)$, where $P(s)$ is the probability of selecting $s$. For $k \in U$, let $\pi_k = \Pr(k \in s)$ denote the first-order inclusion probabilities of the design. We assume $\pi_k > 0$ for all $k \in U$. Additionally, assume $p$ auxiliary variables, $\boldsymbol{x}_k = (x_{1k}, \ldots, x_{pk})^{\top}$, are known for each $k \in U$. A standard approach is to use the Horvitz-Thompson estimator
$$ \hat{t}_{y,\mathrm{HT}} = \sum_{k \in s} d_k\, y_k, $$
where $d_k = 1/\pi_k$ denotes the design weight of unit $k$. Under this strictly design-based framework, the auxiliary data do not impact the form of the estimator but can impact the design weights through the specification of the sampling design.
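For illustration, a minimal Python/NumPy sketch of the Horvitz-Thompson computation, using made-up sample data and our own variable names:

```python
import numpy as np

# Hypothetical sampled y-values and their first-order inclusion probabilities
y_s = np.array([12.0, 7.5, 20.1, 3.3, 15.8])
pi_s = np.array([0.10, 0.25, 0.10, 0.50, 0.25])

d_s = 1.0 / pi_s                 # design weights d_k = 1 / pi_k
t_y_ht = np.sum(d_s * y_s)       # Horvitz-Thompson estimate of the total t_y
print(t_y_ht)
```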
One strategy to use auxiliary data in estimation is to employ a model-assisted estimator of $t_y$ by specifying a working model for the mean of $y_k$ given $\boldsymbol{x}_k$ and using this model to predict the $y$-values. Specifying a linear regression working model leads to the generalized regression (GREG) estimator (Cassel, Särndal and Wretman, 1976). The GREG estimator typically has smaller variance than the Horvitz-Thompson estimator if the working model has some predictive power for $y_k$. Here, we consider the GREG estimator under the linear regression working model
$$ y_k = \boldsymbol{x}_k^{\top}\boldsymbol{\beta} + \varepsilon_k, \tag{2.1} $$
with the $\varepsilon_k$ independent and identically distributed with mean zero and variance $\sigma^2$, and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\top}$. The GREG estimator is given by
$$ \hat{t}_{y,\mathrm{GREG}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}\right), \tag{2.2} $$
with the regression coefficients estimated as
$$ \hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{X}_s\right)^{-1}\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{y}_s, \tag{2.3} $$
where $\boldsymbol{X}_s$ is the $n \times p$ matrix with rows $\boldsymbol{x}_k^{\top}$, $k \in s$, $\boldsymbol{y}_s = (y_k)_{k \in s}$ is an $n$-vector and $\boldsymbol{\Pi}_s = \mathrm{diag}(\pi_k,\, k \in s)$ is an $n \times n$ diagonal matrix of first-order inclusion probabilities for the sampled units.
The GREG estimator can also be written as a weighted sum of the variable of interest, yielding regression weights that are independent of $y_k$ and, therefore, can be applied to any study variable,
$$ \hat{t}_{y,\mathrm{GREG}} = \sum_{k \in s} w_k\, y_k, \qquad w_k = d_k\left\{1 + \left(\boldsymbol{t}_x - \hat{\boldsymbol{t}}_{x,\mathrm{HT}}\right)^{\top}\left(\boldsymbol{X}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{X}_s\right)^{-1}\boldsymbol{x}_k\right\}, \tag{2.4} $$
where $\boldsymbol{t}_x = \sum_{k \in U} \boldsymbol{x}_k$ is the known population total vector of the covariates and $\hat{\boldsymbol{t}}_{x,\mathrm{HT}} = \sum_{k \in s} d_k\, \boldsymbol{x}_k$ is the Horvitz-Thompson estimator of the vector of covariate population totals. The regression weights, $w_k$, are termed calibration weights because they satisfy the calibration constraint $\sum_{k \in s} w_k\, \boldsymbol{x}_k = \boldsymbol{t}_x$. The calibration weight $w_k$ does not depend on the study variable $y_k$. Note that the GREG estimator (2.4) can alternatively be expressed as
$$ \hat{t}_{y,\mathrm{GREG}} = \hat{t}_{y,\mathrm{HT}} + \left(\boldsymbol{t}_x - \hat{\boldsymbol{t}}_{x,\mathrm{HT}}\right)^{\top}\hat{\boldsymbol{\beta}}, $$
which only requires the known population totals $\boldsymbol{t}_x$. For the GREG estimator, the individual population values of the auxiliary variables are not needed.
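To make the GREG computations concrete, the following is a minimal Python/NumPy sketch of (2.2)-(2.4) under simple random sampling. The population, sample size and variable names are hypothetical; the assertions simply verify the calibration constraint and the equivalence of the two forms of the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: an intercept plus two auxiliary variables
N, n = 1000, 100
X_U = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y_U = X_U @ np.array([5.0, 2.0, -1.0]) + rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)         # simple random sample, so pi_k = n/N
d_s = np.full(n, N / n)                          # design weights d_k = 1/pi_k
X_s, y_s = X_U[s], y_U[s]

t_x = X_U.sum(axis=0)                            # known covariate population totals t_x
t_x_ht = (d_s[:, None] * X_s).sum(axis=0)        # HT estimates of those totals

# Survey-weighted regression coefficients, as in (2.3)
A = X_s.T @ (X_s * d_s[:, None])                 # X_s' Pi_s^{-1} X_s
beta_hat = np.linalg.solve(A, X_s.T @ (d_s * y_s))

# GREG estimate via the "known totals only" expression
t_y_greg = np.sum(d_s * y_s) + (t_x - t_x_ht) @ beta_hat

# Equivalent weighted form (2.4): calibration weights w_k
w = d_s * (1.0 + X_s @ np.linalg.solve(A, t_x - t_x_ht))
assert np.allclose(w @ X_s, t_x)                 # calibration constraint
assert np.isclose(w @ y_s, t_y_greg)             # same estimate either way
print(t_y_greg, y_U.sum())
```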
If a variable selection procedure, such as a forward stepwise procedure, is implemented prior to fitting the linear regression model, then the calibration weights will depend on $y_k$, as the selected models may vary across study variables. This type of stepwise survey regression estimator is calibrated to the auxiliary variables selected by the variable selection procedure for a specific variable of interest, $y_k$.
Using a working linear regression model with many auxiliary variables, including interactions of categorical auxiliary variables, can produce substantially variable weights and greatly increase the variance of the GREG estimator. Furthermore, some of the regression weights, $w_k$, may be negative, thus losing the interpretation of a weight as the number of population units represented by the sampled unit.
2.2 Survey regression estimator with lasso
If the linear regression model in (2.1) is sparse, i.e., $p$ is large and, say, only a few of the regression coefficients are nonzero, then the estimation of the zero coefficients in (2.3) leads to extra variation in the GREG estimator (2.2). In this case, model selection to remove extraneous variables could reduce the overall design variance of the GREG estimator, leading to more efficient estimates of finite population totals. The least absolute shrinkage and selection operator (lasso) method, developed by Tibshirani (1996), simultaneously performs model selection and coefficient estimation by shrinking some regression coefficients to zero. The lasso approach estimates coefficients by minimizing the sum of squared residuals subject to a penalty constraint on the sum of the absolute values of the regression coefficients.
McConville et al. (2017) proposed using survey-weighted lasso estimated regression coefficients given by
$$ \hat{\boldsymbol{\beta}}_{\mathrm{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\right\}, $$
where $\lambda \ge 0$ is a penalty parameter. The lasso survey regression estimator for the total $t_y$ is then given by
$$ \hat{t}_{y,\mathrm{lasso}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}\right). $$
The value of the penalty parameter $\lambda$ must be selected prior to obtaining the estimated coefficients. In general, this process of specifying hyperparameters prior to fitting the final model is called hyperparameter tuning. There are several selection criteria that can be used to choose hyperparameter values, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) or cross-validation. In our simulation study, we used a version of cross-validation that incorporates the design weights; see McConville (2011) for a discussion of the selection of the penalty parameter for survey-weighted lasso coefficient estimates.
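The following is a minimal sketch of how the survey-weighted lasso coefficients and the resulting estimate of the total could be computed. The coordinate-descent routine, the convention of leaving the intercept unpenalized, the fixed penalty value `lam` and all data are our own illustrative choices; in practice $\lambda$ would be tuned (e.g., by design-weighted cross-validation) and a packaged solver would typically be used.

```python
import numpy as np

def survey_weighted_lasso(X, y, d, lam, n_iter=200):
    """Coordinate descent for the survey-weighted lasso criterion
    sum_k d_k (y_k - x_k' beta)^2 + lam * sum_j |beta_j|,
    leaving the intercept (assumed to be column 0) unpenalized."""
    n, p = X.shape
    beta = np.zeros(p)
    penalty = np.full(p, lam)
    penalty[0] = 0.0
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual without x_j
            z = np.sum(d * X[:, j] * r_j)
            a = np.sum(d * X[:, j] ** 2)
            beta[j] = np.sign(z) * max(abs(z) - penalty[j] / 2.0, 0.0) / a  # soft threshold
    return beta

# Hypothetical sparse population: only 3 of 11 coefficients are nonzero
rng = np.random.default_rng(1)
N, n = 2000, 150
X_U = np.column_stack([np.ones(N), rng.normal(size=(N, 10))])
beta_true = np.array([4.0, 3.0, 0.0, 0.0, -2.0] + [0.0] * 6)
y_U = X_U @ beta_true + rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)                  # SRS, so pi_k = n/N
d_s = np.full(n, N / n)
beta_lasso = survey_weighted_lasso(X_U[s], y_U[s], d_s, lam=500.0)  # lam is illustrative only

# Lasso survey regression estimate: population prediction total + HT correction of sample residuals
t_y_lasso = (X_U @ beta_lasso).sum() + np.sum(d_s * (y_U[s] - X_U[s] @ beta_lasso))
print(np.round(beta_lasso, 2), t_y_lasso, y_U.sum())
```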
2.3 Survey regression estimator with adaptive lasso
An issue with the use of the lasso criterion is that, by shrinking the regression coefficients towards zero, it yields biased estimates for regression coefficients that are far from zero. Under the adaptive lasso criterion (Zou, 2006), the coefficients in the penalty are weighted by the inverse of a root-$n$ consistent estimator of $\boldsymbol{\beta}$. Therefore, the bias for large coefficients tends to be smaller.
McConville et al. (2017) considered an adaptive lasso survey regression estimator
$$ \hat{t}_{y,\mathrm{alasso}} = \sum_{k \in U} \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}} + \sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}}\right), $$
where
$$ \hat{\boldsymbol{\beta}}_{\mathrm{alasso}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\sum_{k \in s} d_k\left(y_k - \boldsymbol{x}_k^{\top}\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_j|}\right\} $$
and $\hat{\boldsymbol{\beta}} = (\hat{\beta}_1, \ldots, \hat{\beta}_p)^{\top}$ is given by (2.3). The reliance of the adaptive lasso method on the standard weighted linear regression coefficient estimates, $\hat{\boldsymbol{\beta}}$, leads to a loss of efficiency in settings where $p$ is large, because these estimates tend to be very unstable.
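A small sketch of the adaptive reweighting, under the assumption that the survey-weighted least squares fit (2.3) is used as the initial estimator; the function name and the exact penalty scaling are our own illustrative choices.

```python
import numpy as np

def adaptive_lasso_penalties(X_s, y_s, d_s, lam):
    """Per-coefficient penalties lam / |beta_hat_j|, with beta_hat the survey-weighted
    least squares estimate of (2.3); a sketch of the adaptive reweighting only."""
    A = X_s.T @ (X_s * d_s[:, None])
    beta_hat = np.linalg.solve(A, X_s.T @ (d_s * y_s))
    return lam / np.abs(beta_hat)
```

In the coordinate-descent sketch of Section 2.2, the single penalty value would simply be replaced by these per-coefficient penalties, so that coefficients with small initial estimates are penalized more heavily while large coefficients are shrunk less.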
2.4 Lasso calibration estimators
The lasso and adaptive lasso methods do not produce regression weights directly, as the estimators cannot be expressed as weighted combinations of the $y$-values. McConville et al. (2017) developed lasso survey regression weights using a model calibration approach and a ridge regression approximation. These lasso regression weights depend on the variable of interest, $y_k$.
The lasso calibration estimator is calculated by regressing the variable of interest, $y_k$, on an intercept and the lasso-fitted mean function $\hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k) = \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}$. The lasso calibration estimator can be written in the same form as (2.4), where $\boldsymbol{x}_k$ is replaced by $\boldsymbol{z}_k = \left(1,\, \hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k)\right)^{\top}$,
$$ \hat{t}_{y,\mathrm{lasso\text{-}cal}} = \sum_{k \in s} w_{k,\mathrm{lasso}}\, y_k, \qquad w_{k,\mathrm{lasso}} = d_k\left\{1 + \left(\boldsymbol{t}_z - \hat{\boldsymbol{t}}_{z,\mathrm{HT}}\right)^{\top}\left(\boldsymbol{Z}_s^{\top}\boldsymbol{\Pi}_s^{-1}\boldsymbol{Z}_s\right)^{-1}\boldsymbol{z}_k\right\}, \tag{2.5} $$
where $\boldsymbol{Z}_s$ is the $n \times 2$ matrix with rows $\boldsymbol{z}_k^{\top}$, $k \in s$, $\boldsymbol{t}_z = \sum_{k \in U} \boldsymbol{z}_k$ and $\hat{\boldsymbol{t}}_{z,\mathrm{HT}} = \sum_{k \in s} d_k\, \boldsymbol{z}_k$. Similarly, the adaptive lasso calibration estimator is given by the same expression, where the lasso-fitted mean for $k$ in (2.5) is replaced by the adaptive lasso fit, $\hat{m}_{\mathrm{alasso}}(\boldsymbol{x}_k) = \boldsymbol{x}_k^{\top}\hat{\boldsymbol{\beta}}_{\mathrm{alasso}}$. The weights for the lasso calibration estimators are calibrated to the population size $N$ and to the population total of the lasso-fitted mean functions, $\sum_{k \in U} \hat{m}_{\mathrm{lasso}}(\boldsymbol{x}_k)$.
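A minimal sketch of the model-calibration step, assuming the fitted mean function can be evaluated for every unit in the frame; the ridge regression approximation mentioned above is omitted, and the function and argument names are our own.

```python
import numpy as np

def lasso_calibration_estimate(y_s, d_s, m_U, s_idx):
    """Model-calibration sketch: regress y on an intercept and the fitted mean m_hat(x),
    with design weights, and form GREG-type weights as in (2.4) with x_k replaced by
    z_k = (1, m_hat(x_k)). m_U holds the (adaptive) lasso-fitted means for all of U."""
    Z_U = np.column_stack([np.ones(m_U.size), m_U])
    Z_s = Z_U[s_idx]
    t_z = Z_U.sum(axis=0)                          # (N, population total of fitted means)
    t_z_ht = (d_s[:, None] * Z_s).sum(axis=0)
    A = Z_s.T @ (Z_s * d_s[:, None])
    w = d_s * (1.0 + Z_s @ np.linalg.solve(A, t_z - t_z_ht))
    assert np.allclose(w @ Z_s, t_z)               # calibrated to N and to sum of fitted means
    return np.sum(w * y_s), w

# Tiny usage example with made-up fitted means
rng = np.random.default_rng(2)
N, n = 500, 60
m_U = rng.normal(10.0, 2.0, size=N)                # pretend these are lasso-fitted means
s_idx = rng.choice(N, size=n, replace=False)
y_s = m_U[s_idx] + rng.normal(size=n)
t_hat, w = lasso_calibration_estimate(y_s, np.full(n, N / n), m_U, s_idx)
print(t_hat)
```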
2.5 Regression tree estimator
The GREG estimator can also be expressed as
$$ \hat{t}_y = \sum_{k \in U} \hat{m}(\boldsymbol{x}_k) + \sum_{k \in s} d_k\left(y_k - \hat{m}(\boldsymbol{x}_k)\right), \tag{2.6} $$
where $\hat{m}(\boldsymbol{x}_k)$ is an estimator of the mean function of $y_k$ given $\boldsymbol{x}_k$, based on the sample data $\{(y_k, \boldsymbol{x}_k) : k \in s\}$. As an alternative to a linear regression model, McConville and Toth (2019) proposed estimating $m(\boldsymbol{x}_k)$ with a regression tree model using the following algorithm:
1. Let $n_{\min}$ be the minimum box size and $\alpha$ be a specified significance level.
2. If the dataset contains at least $2 n_{\min}$ observations, then continue to step 3; otherwise, stop.
3. Among the auxiliary variables $x_1, \ldots, x_p$, choose a variable to split the data. The chosen variable is the one that shows the most significant difference, at level $\alpha$, when testing the null hypothesis of a homogeneous mean of the variable of interest. If no variable leads to a significant difference, then stop.
4. Split the data into two subsets by choosing the split value of the selected variable that results in the largest decrease in the estimated mean square error, while satisfying the requirement that each subset contains at least $n_{\min}$ units.
5. For each of the resulting subsets of the data, return to step 1.
The resulting regression tree model groups the
categories of an auxiliary variable based on their relationship to the variable
of interest and only includes auxiliary variables and interactions associated
with this variable. Importantly, including a categorical variable does not
require a split for each category, potentially reducing the model size substantially
while still capturing important interactions.
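To make the recursion concrete, the following simplified Python sketch performs this kind of recursive partitioning. It is not McConville and Toth's (2019) procedure: the significance test in step 3 is replaced by a crude relative-gain stopping rule (`min_rel_gain`, our own device standing in for the role of $\alpha$), only binary splits on numeric or ordered covariates are considered, and the split criterion is the design-weighted residual sum of squares.

```python
import numpy as np

def split_boxes(X, y, d, min_size=25, min_rel_gain=0.01):
    """Simplified recursive-partitioning sketch (not the exact McConville-Toth algorithm).
    Each split minimizes the design-weighted residual sum of squares, subject to a
    minimum box size; splitting stops when the best relative gain is too small.
    Returns an integer box label for every sample unit."""
    n, p = X.shape
    boxes = np.zeros(n, dtype=int)               # all units start in a single box
    next_label = 1
    stack = [np.arange(n)]                       # boxes still to be examined

    def wsse(idx):                               # design-weighted within-box sum of squares
        w = d[idx]
        mu = np.sum(w * y[idx]) / np.sum(w)
        return np.sum(w * (y[idx] - mu) ** 2)

    while stack:
        idx = stack.pop()
        if idx.size < 2 * min_size:              # too small to split into two valid boxes
            continue
        parent_sse = wsse(idx)
        best = None                              # (gain, variable, threshold)
        for j in range(p):
            xj = X[idx, j]
            for c in np.unique(xj)[:-1]:         # candidate split points
                left, right = idx[xj <= c], idx[xj > c]
                if left.size < min_size or right.size < min_size:
                    continue
                gain = parent_sse - wsse(left) - wsse(right)
                if best is None or gain > best[0]:
                    best = (gain, j, c)
        if best is None or best[0] <= min_rel_gain * parent_sse:
            continue                             # no worthwhile split: stop here
        _, j, c = best
        left, right = idx[X[idx, j] <= c], idx[X[idx, j] > c]
        boxes[right] = next_label                # right child gets a new box label
        next_label += 1
        stack.extend([left, right])              # recurse on both children
    return boxes

# Usage: y depends on the first covariate only, through a single threshold at 0.5
rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 2))
y = 5.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.5, size=400)
print(np.unique(split_boxes(X, y, np.full(400, 10.0), min_size=50, min_rel_gain=0.1)))
```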
After fitting a regression tree model, we obtain a set of boxes $B_1, \ldots, B_L$ which partition the data. Let $\mathbb{1}(\boldsymbol{x}_k \in B_\ell) = 1$ if $\boldsymbol{x}_k \in B_\ell$ and 0 otherwise, for $\ell = 1, \ldots, L$. This means that $\mathbb{1}(\boldsymbol{x}_k \in B_\ell) = 1$ for exactly one box $B_\ell$ for every $k \in U$. For every $k \in U$, the estimator of $m(\boldsymbol{x}_k)$ is given by
$$ \hat{m}(\boldsymbol{x}_k) = \sum_{\ell=1}^{L} \mathbb{1}(\boldsymbol{x}_k \in B_\ell)\, \frac{\sum_{j \in s} d_j\, y_j\, \mathbb{1}(\boldsymbol{x}_j \in B_\ell)}{\hat{N}_\ell}, \tag{2.7} $$
where $\hat{N}_\ell = \sum_{j \in s} d_j\, \mathbb{1}(\boldsymbol{x}_j \in B_\ell)$ is the HT estimator of the population size in box $B_\ell$. The regression tree estimator is obtained by inserting equation (2.7) into the generalized regression estimator, given in equation (2.6), leading to the post-stratified estimator
$$ \hat{t}_{y,\mathrm{tree}} = \sum_{\ell=1}^{L} N_\ell\, \frac{\sum_{k \in s} d_k\, y_k\, \mathbb{1}(\boldsymbol{x}_k \in B_\ell)}{\hat{N}_\ell}, $$
where $N_\ell = \sum_{k \in U} \mathbb{1}(\boldsymbol{x}_k \in B_\ell)$ is the number of units in $U$ that belong to box $B_\ell$.
Since $\hat{t}_{y,\mathrm{tree}}$ can be written as a linear regression estimator with the indicator functions $\mathbb{1}(\boldsymbol{x}_k \in B_1), \ldots, \mathbb{1}(\boldsymbol{x}_k \in B_L)$ as covariates, the regression tree estimator is also a post-stratified estimator, where each box represents a post-stratum. This implies that this estimator is calibrated to the population total of each box, $N_\ell$, providing a data-driven mechanism, dependent on $y_k$, for selecting post-strata that ensures that none of them are empty. As a result, the regression weights are guaranteed to be non-negative. The weights produced by this estimation procedure depend on the variable of interest, $y_k$. Therefore, unlike the GREG approach, a single set of generic weights that can be applied to all study variables is not available. Instead, a set of weights is produced for each survey variable of interest.
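Given box labels for the sampled units and for the full frame, the post-stratified estimate reduces to a few lines; a minimal sketch with our own argument names:

```python
import numpy as np

def tree_poststratified_estimate(y_s, d_s, box_s, box_U):
    """Post-stratified (regression tree) estimate of t_y: for each box, the design-weighted
    sample mean of y in the box is scaled by the known population box count N_l.
    The tree construction guarantees every box contains sampled units, so N_l_hat > 0."""
    t_hat = 0.0
    for b in np.unique(box_U):
        in_box = box_s == b
        N_b = np.sum(box_U == b)                 # population count in box b
        N_b_hat = np.sum(d_s[in_box])            # HT estimate of that count
        t_hat += N_b * np.sum(d_s[in_box] * y_s[in_box]) / N_b_hat
    return t_hat
```

Equivalently, each sampled unit $k$ in box $B_\ell$ receives the weight $w_k = d_k N_\ell / \hat{N}_\ell$, which is non-negative by construction.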
2.6 Variance estimation under stratified simple random sampling
Under stratified simple random sampling, a variance estimator of the model-assisted survey regression estimators described above is obtained by the Taylor linearization method and is given by
$$ \hat{V} = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{1}{n_h\,(n_h - 1)} \sum_{k \in s_h} \left(e_{hk} - \bar{e}_h\right)^2, $$
where $h = 1, \ldots, H$ indexes the strata, $N_h$ is the number of population units in stratum $h$, $n_h$ is the number of sampled units in stratum $h$, $s_h$ is the sample in stratum $h$, $e_{hk} = y_{hk} - \hat{m}(\boldsymbol{x}_{hk})$ is the residual of sample unit $k$ in stratum $h$ under the regression model and $\bar{e}_h = n_h^{-1} \sum_{k \in s_h} e_{hk}$ is the average residual in stratum $h$. The variance estimators readily extend to more complex sampling designs, but for simplicity we have given the expression only for stratified simple random sampling, which is used in the simulation study of Section 3.
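A minimal sketch of this variance estimate, taking the sample residuals, stratum labels and stratum population sizes as inputs; argument names are our own and each stratum is assumed to contain at least two sampled units.

```python
import numpy as np

def strat_srs_variance(e, stratum, N_h):
    """Linearization variance estimate under stratified SRS from residuals e_hk.
    `e` and `stratum` are sample-level arrays; N_h maps a stratum label to its
    population size."""
    v = 0.0
    for h in np.unique(stratum):
        e_h = e[stratum == h]
        n_h = e_h.size
        v += N_h[h] ** 2 * (1.0 - n_h / N_h[h]) * np.var(e_h, ddof=1) / n_h
    return v

# Usage with made-up residuals from two strata
e = np.array([0.5, -0.2, 0.1, -0.4, 0.3, 0.0])
stratum = np.array([1, 1, 1, 2, 2, 2])
print(strat_srs_variance(e, stratum, N_h={1: 120, 2: 80}))
```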