Survey Methodology

Release date: December 20, 2024

The journal Survey Methodology Volume 50, Number 2 (December 2024) contains the following sixteen papers:

Waksberg invited paper series

Sample design using models

by Richard Valliant

Abstract

Joseph Waksberg was an important figure in survey statistics mainly through his applied work in the design of samples. He took a design-based approach to sample design by emphasizing uses of randomization with the goal of creating estimators with good design-based properties. Since his time on the scene, advances have been made in the use of models to construct designs and in software to implement elaborate designs. This paper reviews uses of models in balanced sampling, cutoff samples, stratification using models, multistage sampling, and mathematical programming for determining sample sizes and allocations.


Regular papers

Design consistent random forest models for data collected from a complex sample

by Daniell Toth and Kelly S. McConville

Abstract

Random forest models, which are the result of averaging the estimated values from a large number of tree models, represent a useful and flexible tool for modeling the data nonparametrically to provide accurately predicted values. There are many potential applications for these types of models when dealing with survey data. However, survey data is usually collected using an informative sample design, so it is necessary to have an algorithm for creating random forest models that account for this design during model estimation.

The tree models used in the forest are typically obtained by estimating tree models on bootstrapped samples of the original data. Since the models depend on the observed data and the values observed in the sample depend on the informative sample design, the usual method for estimation is likely to lead to a biased random forest model when applied to survey data.

In this article, we provide an algorithm and a set of conditions that produce consistent random forest models under an informative sample design and compare this method to the usual random forest modeling method. We show that ignoring the design can lead to biased model estimates.
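
As a rough illustration of the general idea, and not the authors' algorithm, survey weights can be brought into the fitting step directly, so that each tree's impurity calculations and leaf averages are weight-adjusted. The sketch below does this on entirely hypothetical data via the sample_weight argument of scikit-learn's RandomForestRegressor.

```python
# Sketch: fit a forest that respects design weights by passing them as
# sample_weight (hypothetical data; not the authors' exact algorithm).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=(n, 2))                    # predictors
y = 3.0 * x[:, 0] + rng.normal(size=n)          # study variable
w = rng.uniform(1.0, 10.0, size=n)              # survey weights w_i = 1 / pi_i

rf_ignoring_design = RandomForestRegressor(n_estimators=200, random_state=0)
rf_ignoring_design.fit(x, y)                    # usual fit, design ignored

rf_design_weighted = RandomForestRegressor(n_estimators=200, random_state=0)
rf_design_weighted.fit(x, y, sample_weight=w)   # weight-adjusted fit
```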


Adaptive cluster sampling, a quasi Bayesian approach

by Glen Meeden and Muhammad Nouman Qureshi

Abstract

Adaptive cluster sampling designs were proposed as a method for sampling rare populations whose units tend to appear in clusters. The resulting estimator is not based on any model assumptions and is design unbiased. It can have smaller variance than the standard estimator, which does not incorporate the fact that one is dealing with a rare population. Here we demonstrate that, when adaptive cluster sampling is appropriate, its estimator does not take into account all the information available in the design. We present a quasi Bayesian approach which incorporates the information that is currently ignored. We will see that the resulting estimator is a significant improvement over the current methods.
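
For context, the design-unbiased baseline referred to above is commonly the modified Hansen-Hurwitz estimator, which averages the network means of the initially sampled units. A minimal sketch on hypothetical data follows; this is the baseline the quasi Bayesian approach improves on, not the proposed method itself.

```python
# Sketch of the classical modified Hansen-Hurwitz estimator for adaptive
# cluster sampling: average, over the initial sample, of the mean of the
# network each unit belongs to. All data and labels are hypothetical.
import numpy as np

def acs_hansen_hurwitz(initial_sample, network_of, network_mean):
    return float(np.mean([network_mean[network_of[i]] for i in initial_sample]))

network_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B", 5: "A"}  # unit -> network
network_mean = {"A": 0.0, "B": 12.5}                           # network means
print(acs_hansen_hurwitz([0, 1, 2, 3, 4, 5], network_of, network_mean))  # 6.25
```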


Inference from sampling with response probabilities estimated via calibration

by Caren Hasler

Abstract

A solution to control for nonresponse bias consists of multiplying the design weights of respondents by the inverse of estimated response probabilities to compensate for the nonrespondents. Maximum likelihood and calibration are two approaches that can be applied to obtain estimated response probabilities. We consider a common framework in which these approaches can be compared, and we develop an asymptotic study of the behavior of the resulting estimator when calibration is applied. A logistic regression model for the response probabilities is postulated, and the data are assumed to be missing at random and unclustered. This work makes three main contributions: 1) we show that estimators with response probabilities estimated via calibration are asymptotically equivalent to unbiased estimators, and that estimating the response probabilities via calibration yields a gain in efficiency relative to using the true response probabilities; 2) we show that estimators with response probabilities estimated via calibration are doubly robust to model misspecification, and we explain why double robustness is not guaranteed when maximum likelihood is applied; and 3) we highlight problems related to the estimation of response probabilities, namely the existence of a solution to the estimating equations, convergence problems, and extreme weights. We present the results of a simulation study to illustrate these elements.
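
As a rough sketch of the calibration route, with hypothetical data and variable names, and with the logistic model postulated as in the paper: choose the coefficient vector (called lam below) so that the inverse-probability-weighted respondent totals of the auxiliary variables reproduce the full-sample weighted totals.

```python
# Sketch: estimate response probabilities by calibration, then form
# nonresponse-adjusted weights d_i / p_hat_i. Hypothetical data throughout.
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n)])  # auxiliary variables
d = rng.uniform(1.0, 5.0, size=n)                      # design weights
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 0.8 * x[:, 1])))
r = rng.uniform(size=n) < p_true                       # response indicators

def calibration_equations(lam):
    p = 1.0 / (1.0 + np.exp(-x[r] @ lam))              # postulated logistic model
    return (d[r] / p) @ x[r] - d @ x                   # respondent vs full totals

lam_hat = fsolve(calibration_equations, x0=np.zeros(2))
adjusted_w = d[r] / (1.0 / (1.0 + np.exp(-x[r] @ lam_hat)))  # d_i / p_hat_i
```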


Relaxed calibration of survey weights

by Nicholas T. Longford

Abstract

Population surveys are nowadays rarely analysed in isolation from any auxiliary information, often in the form of population counts, totals and other summaries. Calibration, or benchmarking, by which the weighted sample totals of auxiliary variables are matched to their (known) population totals, is widely applied. Methods for adjusting the weights to satisfy these constraints involve iterative procedures with unknown finite-sample properties. We develop an alternative method in which the weights are calibrated by minimising a quadratic function, requiring no iterations and yielding a unique solution. The relative priority of each constraint is represented by a tuning parameter. The properties of the weights and of the calibration estimator, as functions of these parameters, are explored analytically and by simulations. A connection of the proposed method with ridge calibration is established.
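
The closed form below is a minimal numpy sketch of a ridge-type relaxed calibration in the spirit of the abstract; the objective and the priority parameters (called lam here) are assumptions made for illustration, not the paper's notation.

```python
# Sketch: minimize (w - d)' D^{-1} (w - d) plus a quadratic penalty with
# per-constraint priority lam_k on the benchmark residuals T - X'w.
# The solution is closed form, so no iterative raking is needed; letting
# lam -> infinity recovers exact calibration.
import numpy as np

def relaxed_calibrate(d, X, T, lam):
    D = np.diag(d)
    M = np.diag(1.0 / lam) + X.T @ D @ X                # small k x k system
    return d + D @ X @ np.linalg.solve(M, T - X.T @ d)

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # auxiliary variables
d = rng.uniform(1.0, 4.0, size=n)                       # design weights
T = np.array([800.0, 50.0])                             # known population totals
w = relaxed_calibrate(d, X, T, lam=np.array([1e6, 1e2]))
print(X.T @ w - T)   # residual benchmark error shrinks as lam grows
```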


A hierarchical gamma prior for modeling random effects in small area estimation

by Xueying Tang and Liangliang Zhang

Abstract

Small area estimation (SAE) is becoming increasingly popular among survey statisticians. Since the direct estimates of small areas usually have large standard errors, model-based approaches are often adopted to borrow strength across areas. SAE models often use covariates to link different areas and random effects to account for the additional variation. Recent studies showed that random effects are not necessary for all areas, so global-local (GL) shrinkage priors have been introduced to effectively model the sparsity in random effects. The GL priors vary in tail behavior, and their performance differs under different sparsity levels of random effects. As a result, one needs to fit the model with different choices of priors and then select the most appropriate one based on the deviance information criterion or other evaluation metrics. In this paper, we propose a flexible prior for modeling random effects in SAE. The hyperparameters of the prior determine the tail behavior and can be estimated in a fully Bayesian framework. Therefore, the resulting model is adaptive to the sparsity level of random effects without repetitive fitting. We demonstrate the performance of the proposed prior via simulations and real applications.
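
For orientation, a generic area-level model with a global-local prior on the random effects has the following form (symbols assumed for illustration; the paper's contribution is a specific hierarchical gamma choice within this family):

```latex
\hat{\theta}_i \mid \theta_i \sim N(\theta_i, D_i), \qquad
\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad
u_i \mid \psi_i, \delta \sim N(0, \psi_i \delta),
```

where the local parameters ψ_i adapt the shrinkage area by area and the global parameter δ controls the overall level of sparsity; the tail behavior of the priors placed on these parameters is what distinguishes competing GL priors.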


Design-based estimation of small and empty domains in survey data analysis using order constraints

by Xiyue Liao, Mary C. Meyer and Xiaoming Xu

Abstract

Recent work in survey domain estimation has shown that incorporating a priori assumptions about orderings of population domain means reduces the variance of the estimators and provides smaller confidence intervals with good coverage. Here we show how partial ordering assumptions allow design-based estimation of sample means in domains for which the sample size is zero, with conservative variance estimates and confidence intervals. Order restrictions can also substantially improve estimation and inference in small-size domains. Examples with well-known survey data sets demonstrate the utility of the methods. Code to implement the examples using the R package csurvey is given in the appendix.
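
To see how an empty domain can be estimated at all, consider a generic form of the constrained problem (symbols assumed for illustration, not taken from the paper):

```latex
\hat{\boldsymbol{\theta}} \;=\; \arg\min_{\boldsymbol{\theta}}
\sum_{d \,:\, n_d > 0} \hat{v}_d^{-1} \left( \bar{y}_d - \theta_d \right)^2
\quad \text{subject to} \quad A\boldsymbol{\theta} \ge \mathbf{0},
```

where A encodes the partial ordering. A domain with n_d = 0 contributes nothing to the objective yet is still pinned down by the constraints: if θ_1 ≤ θ_2 ≤ θ_3 is imposed and domain 2 is empty, its estimate must lie between the constrained estimates of domains 1 and 3.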


A small area estimation approach for reconciling differences in two surveys of recreational fishing effort

by Teng Liu, F. Jay Breidt and Jean D. Opsomer

Abstract

Many studies face the problem of comparing estimates obtained with different survey methodology, including differences in frames, measurement instruments, and modes of delivery. The problem arises in multimode surveys and in surveys that are redesigned. Major redesign of survey processes could affect survey estimates systematically, and it is important to quantify and adjust for such discontinuities between the designs to ensure comparability of estimates over time. We propose a small area estimation approach to reconcile two sets of survey estimates, and apply it to two surveys in the Marine Recreational Information Program (MRIP), which monitors recreational fishing along the Atlantic and Gulf coasts of the United States. We develop a log-normal model for the estimates from the two surveys, accounting for temporal dynamics through regression on population size and state-by-wave seasonal factors, and accounting in part for changing coverage properties through regression on wireless telephone penetration. Using the estimated design variances, we develop a regression model that is analytically consistent with the log-normal mean model. We use the modeled design variances in a Fay-Herriot small area estimation procedure to obtain empirical best linear unbiased predictors of the reconciled estimates of fishing effort (requiring predictions at new sets of covariates), and provide an asymptotically valid mean square error approximation.
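
The small area model underlying the procedure is the standard Fay-Herriot form, shown here in generic notation (the paper works with log-scale estimates and modeled design variances):

```latex
\hat{\theta}_i = \theta_i + e_i, \quad e_i \sim N(0, D_i), \qquad
\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + v_i, \quad v_i \sim N(0, A),
```

with the empirical best linear unbiased predictor combining the direct estimate and the regression-synthetic estimate, placing weight γ_i = A/(A + D_i) on the former.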


Fully synthetic data for complex surveys

by Shirley Mathur, Yajuan Si and Jerome P. Reiter

Abstract

When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey.
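
A minimal sketch of the pseudo-population step, a weighted finite population Bayesian bootstrap in its Pólya-urn form, is given below; it is a simplified illustration on hypothetical data, not the authors' full synthesis pipeline.

```python
# Sketch of a weighted finite population Bayesian bootstrap (Polya urn form):
# grow the sample of size n out to a pseudo-population of size N, selecting
# each unit with probability proportional to its remaining weight plus the
# number of times it has already been resampled. Hypothetical data.
import numpy as np

def weighted_fpbb(sample_idx, w, rng):
    """w: design weights, assumed to sum to the population size N."""
    n = len(sample_idx)
    N = int(round(w.sum()))
    l = np.zeros(n)                        # times each unit already resampled
    draws = []
    for k in range(N - n):
        p = (w - 1.0 + l) / (N - n + k)    # Polya selection probabilities
        j = rng.choice(n, p=p / p.sum())
        l[j] += 1
        draws.append(sample_idx[j])
    return np.concatenate([sample_idx, np.array(draws, dtype=int)])

rng = np.random.default_rng(3)
w = np.array([4.0, 2.0, 6.0, 8.0])         # weights summing to N = 20
pseudo_pop = weighted_fpbb(np.arange(4), w, rng)
```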


Models of linkage error for capture-recapture estimation without clerical reviews

by Abel Dasylva, Arthur Goussanou and Christian-Olivier Nambeu

Abstract

The capture-recapture method can be applied to measure the coverage of administrative and big data sources in official statistics. In its basic form, it involves the linkage of two sources under the assumption of a perfect linkage, along with other standard assumptions. In practice, when the linkage is based on quasi-identifiers, linkage errors arise and are a potential source of bias. These errors include false positives, which arise when linking a pair of records from different units, and false negatives, which arise when failing to link a pair of records from the same unit. So far, existing solutions have resorted to costly clerical reviews or have made the restrictive conditional independence assumption. In this work, these requirements are relaxed by instead modeling the number of links from a record. The same approach may be taken to estimate the linkage accuracy without clerical reviews when linking two sources that each have some undercoverage.


Investigating mode effects in interviewer variances using two representative multi-mode surveys

by Wenshan Yu, Michael R. Elliott and Trivellore E. Raghunathan

Abstract

As mixed-mode designs become increasingly popular, their effects on data quality have attracted much scholarly attention. Most studies have focused on the bias properties of mixed-mode designs; few have investigated whether mixed-mode designs have heterogeneous variance structures across modes. While many characteristics of mixed-mode designs, such as varied interviewer usage, systematic differences in respondents, and varying levels of social desirability bias, may lead to heterogeneous variances in mode-specific point estimates of population means, this study specifically investigates whether interviewer variances remain consistent across modes. To address this research question, we utilize data collected from two distinct study designs. In the first design, in which each interviewer works in either the face-to-face (FTF) or the telephone (TEL) mode, we use the Arab Barometer wave 6 Jordan data to examine whether there are mode differences in interviewer variances for 1) sensitive political questions, 2) international items, and 3) item missingness indicators for the international items. In the second design, in which interviewers work in both modes, we draw on the Health and Retirement Study (HRS) 2016 core survey data to examine three topics: 1) the CESD depression scale, 2) interviewer observations, and 3) the physical activity scale. To account for the lack of interpenetrated designs in both data sources, we include respondent-level covariates in our models. We find significant differences in interviewer variances for one of the twelve items in the Arab Barometer study and for three of the eighteen items in the HRS. Overall, the interviewer variances are larger in the FTF mode than in the TEL mode for sensitive items. We conduct simulations to understand the power to detect mode effects with the typically modest interviewer sample sizes.
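
The quantity at issue can be read off a standard mixed model fit separately by mode (generic notation assumed for illustration):

```latex
y_{ij}^{(m)} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}^{(m)} + b_j^{(m)} + \varepsilon_{ij}^{(m)},
\qquad b_j^{(m)} \sim N(0, \sigma_{b,m}^{2}), \qquad
\varepsilon_{ij}^{(m)} \sim N(0, \sigma_{e,m}^{2}),
```

where j indexes interviewers and m the mode; the respondent-level covariates x_ij stand in for interpenetration, and the question is whether the interviewer variance components σ²_{b,m} are equal across modes.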


Robust adaptive survey design for time changes in mixed-mode response propensities

by Shiya Wu, Harm-Jan Boonstra, Mirjam Moerbeek and Barry Schouten

Abstract

Adaptive survey designs (ASDs) tailor recruitment protocols to population subgroups that are relevant to a survey. In recent years, effective ASD optimization has been the topic of research and of several applications. However, the performance of an optimized ASD over time is sensitive to time changes in response propensities. How adaptation strategies can adjust to such variation over time is not yet fully understood. In this paper, we propose a robust optimization approach in the context of sequential mixed-mode surveys employing Bayesian analysis. The approach is formulated as a mathematical programming problem that explicitly accounts for uncertainty due to changes over time. ASD decisions can then be made by considering time-dependent variation in conditional mode response propensities and between-mode correlations in response propensities. The approach is demonstrated using a case study: the 2014-2017 Dutch Health Survey. We evaluate the sensitivity of ASD performance to 1) the budget level and 2) the length of applicable historic time-series data. We find that there is only a moderate dependence on the budget level and that the dependence on historic data is moderated by the amount of seasonality during the year.


Bayesian predictive inference of a finite population mean without specifying the relation between the study variable and the covariates

by Ashley Lockwood and Balgobin Nandram

Abstract

Although we avoid specifying a parametric relationship between the study variable and the covariates, we illustrate the advantage of including a spatial component in our models to better account for the covariates when making Bayesian predictive inference. We treat each unique covariate combination as an individual stratum and then use small area estimation techniques to make inference about the finite population mean of the continuous response variable. The two spatial models used are the conditional autoregressive and simple conditional autoregressive models. We include the spatial effects by creating the adjacency matrix via the Mahalanobis distance between covariates. We also show how to incorporate survey weights into the spatial models when dealing with probability survey data. We compare the results of two non-spatial models, the Scott-Smith model and the Battese, Harter, and Fuller model, to those of the spatial models, and we illustrate the comparison with an application using BMI data from eight counties in California. Our goal is to have neighboring strata yield similar predictions and to increase the difference between strata that are not neighbors. Ultimately, the spatial models show less global pooling than the non-spatial models, which was the desired outcome.
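
A minimal sketch of the adjacency construction described above, declaring two strata to be neighbors when the Mahalanobis distance between their covariate vectors is small, follows; the data and the median distance cutoff are hypothetical choices for illustration.

```python
# Sketch: build a strata adjacency matrix from pairwise Mahalanobis
# distances between covariate vectors. Hypothetical data throughout.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))                  # one covariate vector per stratum
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X[:, None, :] - X[None, :, :]
d2 = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)   # squared Mahalanobis

cutoff = np.median(d2[np.triu_indices(10, k=1)])      # illustrative threshold
A = ((d2 < cutoff) & ~np.eye(10, dtype=bool)).astype(int)  # adjacency matrix
```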


Recursive Neyman algorithm for optimum sample allocation under box constraints on sample sizes in strata

by Jacek Wesołowski, Robert Wieczorkowski and Wojciech Wójciak

Abstract

The optimum sample allocation in stratified sampling is one of the basic issues of survey methodology. It is a procedure of dividing the overall sample size into strata sample sizes in such a way that, for given sampling designs within strata, the variance of the stratified π estimator of the population total (or mean) of a given study variable attains its minimum. In this work, we consider the optimum allocation of a sample under lower and upper bounds imposed jointly on the sample sizes in strata. We are concerned with a variance function of a generic form that, in particular, covers the case of simple random sampling without replacement within strata. The goal of this paper is twofold. First, we establish (using the Karush-Kuhn-Tucker conditions) a generic form of the optimal solution, the so-called optimality conditions. Second, based on the established optimality conditions, we derive an efficient recursive algorithm, named RNABOX, which solves the allocation problem under study. RNABOX can be viewed as a generalization of the classical recursive Neyman allocation algorithm, a popular tool for optimum allocation when only upper bounds are imposed on the strata sample sizes. We implement RNABOX in R as part of our package stratallo, which is available from the Comprehensive R Archive Network (CRAN).
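
For context, here is a minimal sketch of the classical recursive Neyman algorithm in the upper-bounds-only case that RNABOX generalizes; the data are hypothetical, and the paper's own implementation is the stratallo R package.

```python
# Sketch of recursive Neyman allocation with upper bounds only: allocate
# proportionally to A[h] = N_h * S_h; fix any stratum that exceeds its
# bound at that bound (a "take-max" stratum) and reallocate the rest.
# Assumes sum(M) >= n so the problem is feasible. Hypothetical data.
def recursive_neyman(n, A, M):
    alloc = {}
    active = set(range(len(A)))
    while True:
        s = sum(A[h] for h in active)
        trial = {h: n * A[h] / s for h in active}       # Neyman allocation
        over = {h for h in active if trial[h] > M[h]}
        if not over:
            alloc.update(trial)
            return alloc
        for h in over:                                  # fix at the bound
            alloc[h] = M[h]
            n -= M[h]
        active -= over

print(recursive_neyman(100, A=[50.0, 30.0, 20.0], M=[40, 50, 50]))
# -> {0: 40, 1: 36.0, 2: 24.0}
```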


Daily rhythm of data quality: Evidence from the Survey of Unemployed Workers in New Jersey

by Jorge González Chapela

Abstract

This paper investigates whether survey data quality fluctuates over the course of the day. After laying out the argument theoretically, panel data from the Survey of Unemployed Workers in New Jersey are analyzed. Several indirect indicators of response error are investigated, including item nonresponse, interview completion time, rounding, and measures of the quality of time-diary data. The evidence we assemble for a time-of-day-of-interview effect is weak or nonexistent. Item nonresponse and the probability that interview completion time is among the 5% shortest appear to increase in the evening, but a more thorough assessment would require instrumental variables.


Short note

Exploring a skewness conjecture: Expanding Cochran’s rule to a proportion estimated from a complex sample

by Phillip S. Kott and Burton Levine

Abstract

Cochran’s rule states that a standard (Wald) two-sided 95% confidence interval around the mean of a sample drawn from a positively skewed population is reasonable when the sample size is greater than 25 times the square of the population’s skewness coefficient. We investigate whether a variant of this crude rule applies to a proportion estimated from a stratified simple random sample.
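
In symbols, the rule as stated is

```latex
n \;>\; 25\,\gamma_1^{2},
```

where γ₁ is the population skewness coefficient; for example, γ₁ = 2 calls for n > 25 × 4 = 100.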


