Relative performance of methods based on model-assisted survey regression estimation: A simulation study
Section 1. Introduction

At Statistics Canada and several other statistical agencies, there is growing interest in leveraging auxiliary data, possibly from administrative sources, to improve the efficiency of estimators. Machine learning techniques have become a popular tool in various disciplines for utilizing such auxiliary information. These methods often do not require the distributional assumptions of more traditional methods and are able to adapt to complex non-linear and non-additive relationships between the outcomes and auxiliary variables. Machine learning methods have been applied to survey data in a variety of contexts such as responsive/adaptive designs, data processing, nonresponse adjustment and weighting (Buskirk, Kirchner, Eck and Signorino, 2018; Kern, Klausch and Kreuter, 2019).

Recently, the use of machine learning techniques to improve the efficiency of estimators of totals and means through model-assisted survey regression estimation under probability sampling has been considered. Model-assisted survey regression estimators of finite population totals may reduce variability and lead to significant gains in efficiency if the available auxiliary variables are strongly associated with the survey variable of interest. Increasingly, a large number of auxiliary variables are available, some of which may be extraneous. In this case, variable selection followed by regression estimation based on the selected model may improve efficiency of the survey regression estimators of finite population totals. We consider finite population estimation using the generalized regression (GREG) estimator with various linear working models (Särndal, Swensson and Wretman, 1992). Model-assisted estimators, using lasso and adaptive lasso methods (McConville, Breidt, Lee and Moisen, 2017) and regression trees (McConville and Toth, 2019), have been applied to survey data. Other nonlinear models, such as penalized splines and neural networks, have been explored for model-assisted estimation; see Breidt and Opsomer (2017) for a survey of these techniques.
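To make the GREG estimator concrete, the following is a minimal sketch under a linear working model, assuming design weights $d_i$, a sample matrix of auxiliary variables and known population totals of those variables. The function names `greg_total` and `greg_weights` and their arguments are illustrative assumptions, not code from the paper.

```python
import numpy as np

def greg_total(y, X, d, tx_pop):
    """GREG estimate of a population total under a linear working model.

    y: sample values, X: n x p sample auxiliary matrix (include a column
    of ones for an intercept), d: design weights, tx_pop: known population
    totals of the p auxiliary variables. All names are hypothetical.
    """
    y, X, d, tx_pop = (np.asarray(a, dtype=float) for a in (y, X, d, tx_pop))
    XtD = X.T * d                              # X' D, shape p x n
    beta = np.linalg.solve(XtD @ X, XtD @ y)   # survey-weighted least squares
    # Horvitz-Thompson total plus a regression adjustment for the
    # discrepancy between known and estimated auxiliary totals.
    return float(d @ y + (tx_pop - d @ X) @ beta)

def greg_weights(X, d, tx_pop):
    """Regression weights w_i = d_i * (1 + x_i' (X'DX)^{-1} (T_x - t_x))."""
    X, d, tx_pop = (np.asarray(a, dtype=float) for a in (X, d, tx_pop))
    XtD = X.T * d
    adj = np.linalg.solve(XtD @ X, tx_pop - d @ X)
    return d * (1.0 + X @ adj)
```

In this sketch the weights returned by `greg_weights` are computed without reference to $y$, so a single set of weights can be applied to any study variable, and they reproduce the known auxiliary totals exactly.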

Another field of research where the use of model-assisted estimators has been proposed is estimation from non-probability samples. Increasing costs and declining response rates are leading to an expanding interest in the use of non-probability samples. However, the process generating a non-probability sample is unknown and such samples are subject to selection bias. Two commonly used approaches to estimation from non-probability samples are quasi-randomization and superpopulation modeling. In the first, the sample is treated as if it were obtained from probability sampling but with unknown selection probabilities. The pseudo-inclusion probabilities are estimated via a propensity model that uses the sample data in combination with some external data set that covers the targeted population. Machine learning techniques have been employed in the estimation of pseudo-inclusion probabilities or, equivalently, in the construction of pseudo-weights. Kern, Li and Wang (2020) investigated several machine learning techniques to construct pseudo-weights using a propensity score-based kernel weighting for non-probability samples. Rafei, Flannagan and Elliott (2020) developed a pseudo-weighting approach using Bayesian Additive Regression Trees.
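The quasi-randomization idea can be illustrated with a deliberately simple stand-in for the propensity model: a saturated model over cells of one categorical covariate. This is not the kernel weighting or BART approach cited above. The sketch assumes the external data set is a weighted reference probability sample; the function name `cell_pseudo_weights` and its arguments are hypothetical.

```python
from collections import Counter

def cell_pseudo_weights(np_cells, ref_cells, ref_weights):
    """Pseudo-weights for non-probability units under a cell-based model.

    np_cells: cell label of each non-probability sample unit.
    ref_cells, ref_weights: cell labels and survey weights of a reference
    probability sample covering the target population (an assumption).
    """
    # Estimated population count per cell from the weighted reference sample.
    n_hat = {}
    for cell, weight in zip(ref_cells, ref_weights):
        n_hat[cell] = n_hat.get(cell, 0.0) + float(weight)

    # Number of non-probability units observed in each cell.
    n_obs = Counter(np_cells)

    missing = set(n_obs) - set(n_hat)
    if missing:
        raise ValueError(f"cells absent from reference sample: {sorted(missing)}")

    # Estimated pseudo-inclusion probability in cell c is n_c / N_hat_c;
    # the pseudo-weight is its inverse.
    return [n_hat[cell] / n_obs[cell] for cell in np_cells]
```

Under this cell model the pseudo-weights within each cell sum to the estimated cell count, so over-represented cells are weighted down and under-represented cells are weighted up.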

In the superpopulation approach, observed values of the variables of interest are assumed to be generated by some model. The model is estimated from the data and, along with external population control data, is used to project the sample to the population. Under this framework, calibration to known population totals of auxiliary variables provides a means of potentially reducing the effect of sample selection bias. Chen, Valliant and Elliott (2018) discussed the implementation of model calibration using adaptive lasso for data based on non-probability sampling. In scenarios where the population totals are estimated, Chen, Valliant and Elliott (2019) incorporated the sampling uncertainty of the benchmarked data, obtained from a probability sample survey, into the variance component of a model-assisted calibration estimator using adaptive lasso regression. Therefore, unlike in the probability sampling context where the use of model-assisted estimation seeks to improve the efficiency of estimators, the use of these techniques in a non-probability sampling context aims to diminish the impact of selection bias.
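As a rough illustration of projecting a sample to the population, the sketch below fits an unweighted linear model to the sample and applies it to known population control totals. It assumes a plain least-squares working model rather than the adaptive lasso used in the cited work, and the name `projection_total` is hypothetical.

```python
import numpy as np

def projection_total(y, X, tx_pop):
    """Model-based projection of a sample to a population total.

    Fits y = X beta by ordinary least squares on the sample (no design
    weights, since selection probabilities are unknown) and returns
    T_x' beta_hat, where tx_pop holds the population control totals.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.asarray(tx_pop, dtype=float) @ beta)

# Hypothetical population of N = 5 units with y = 2 + 3x, x = 1..5,
# so the true total is 55. The sample holds only the units with x >= 3.
X_sample = [[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y_sample = [11.0, 14.0, 17.0]
naive = 5 * np.mean(y_sample)                        # expands the biased mean
projected = projection_total(y_sample, X_sample, [5.0, 15.0])
```

When the working model holds, the projection is unaffected by the over-representation of large units, whereas the naive expansion of the sample mean is not. With a misspecified model this protection is only partial.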

We consider several lasso-based estimators as well as a regression tree estimator and evaluate their performance in both a probability sampling context and a non-probability sampling setup. In Section 2, the model-assisted estimators considered are discussed. The setup for a simulation study under probability sampling is described in Section 3. The results of the simulation study on the root mean square error of the estimators, relative bias of variance estimators and properties of survey weights are presented in Section 4. Except for the GREG estimator, all the model-assisted estimators considered here involve variable selection and yield, if applicable, regression weights that depend on the survey variable of interest, $y$. The impact of using a single set of regression weights for multiple related study variables is also investigated in Section 4. The results of the simulation study using a non-probability sampling scenario are detailed in Section 5. We conclude with a summary of the findings in Section 6.

ISSN: 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word, to the Editor (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-06-21
