Survey Methodology

Release date: June 27, 2019

The journal Survey Methodology Volume 45, Number 2 (June 2019) contains the following ten papers:

Waksberg Invited Paper Series

Conditional calibration and the sage statistician

by Donald B. Rubin

Being a calibrated statistician means using procedures that in long-run practice basically follow the guidelines of Neyman’s approach to frequentist inference, which dominates current statistical thinking. Being a sage (i.e., wise) statistician when confronted with a particular data set means employing some Bayesian and Fiducial modes of thinking to moderate simple Neymanian calibration, even if not doing so formally. This article explicates this marriage of ideas using the concept of conditional calibration, which takes advantage of more recent simulation-based ideas arising in Approximate Bayesian Computation.

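For readers unfamiliar with the simulation-based machinery the abstract alludes to, the following minimal Approximate Bayesian Computation rejection sampler may help fix ideas. It illustrates only the generic ABC mechanics, not Rubin's conditional-calibration construction; the flat prior, binomial data model, observed count and tolerance are all hypothetical choices.

```python
# Minimal ABC rejection sampler, purely to recall the simulation-based idea the
# abstract refers to. The flat Beta(1, 1) prior, the binomial data model, the
# observed count and the tolerance are illustrative assumptions; this is not
# Rubin's conditional-calibration procedure.
import numpy as np

rng = np.random.default_rng(1)
observed_successes, n_trials = 7, 20           # hypothetical observed data
tolerance = 1                                  # acceptance threshold on the summary

accepted = []
for _ in range(100_000):
    theta = rng.beta(1, 1)                     # draw a candidate from the flat prior
    simulated = rng.binomial(n_trials, theta)  # simulate data under that candidate
    if abs(simulated - observed_successes) <= tolerance:
        accepted.append(theta)                 # keep candidates whose simulations are close

print(f"approximate posterior mean: {np.mean(accepted):.3f}")
```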

Regular Papers

A bivariate hierarchical Bayesian model for estimating cropland cash rental rates at the county level

by Andreea Erciulescu, Emily Berg, Will Cecere and Malay Ghosh

The National Agricultural Statistics Service (NASS) of the United States Department of Agriculture (USDA) is responsible for estimating average cash rental rates at the county level. A cash rental rate refers to the market value of land rented on a per acre basis for cash only. Estimates of cash rental rates are useful to farmers, economists, and policy makers. NASS collects data on cash rental rates using a Cash Rent Survey. Because realized sample sizes at the county level are often too small to support reliable direct estimators, predictors based on mixed models are investigated. We specify a bivariate model to obtain predictors of 2010 cash rental rates for non-irrigated cropland using data from the 2009 Cash Rent Survey and auxiliary variables from external sources such as the 2007 Census of Agriculture. We use Bayesian methods for inference and present results for Iowa, Kansas, and Texas. Incorporating the 2009 survey data through a bivariate model leads to predictors with smaller mean squared errors than predictors based on a univariate model.

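For orientation, a generic bivariate area-level model of the kind described here couples the 2009 and 2010 county-level rates through correlated county effects. The formulation below is an illustrative sketch only; the paper's exact specification, covariates, priors and covariance structure may differ.

```latex
% Illustrative bivariate area-level formulation (not necessarily the model
% fitted in the paper): for county i, the direct estimates for the two years
% are modelled jointly, with correlated county effects borrowing strength
% across years and across counties.
\begin{align*}
\begin{pmatrix} \hat{\theta}_{i,2009} \\ \hat{\theta}_{i,2010} \end{pmatrix}
  &= \begin{pmatrix} \theta_{i,2009} \\ \theta_{i,2010} \end{pmatrix} + \boldsymbol{e}_i,
  & \boldsymbol{e}_i &\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}_{e,i}) \quad \text{(sampling errors)}, \\
\begin{pmatrix} \theta_{i,2009} \\ \theta_{i,2010} \end{pmatrix}
  &= \begin{pmatrix} \boldsymbol{x}_i'\boldsymbol{\beta}_{2009} \\ \boldsymbol{x}_i'\boldsymbol{\beta}_{2010} \end{pmatrix} + \boldsymbol{v}_i,
  & \boldsymbol{v}_i &\sim N(\boldsymbol{0}, \boldsymbol{\Sigma}_{v}) \quad \text{(correlated county effects)}.
\end{align*}
```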

Estimation of response propensities and indicators of representative response using population-level information

by Annamaria Bianchi, Natalie Shlomo, Barry Schouten, Damião N. Da Silva and Chris Skinner

In recent years, there has been strong interest in indirect measures of nonresponse bias in surveys or other forms of data collection. This interest originates from gradually decreasing propensities to respond to surveys, in parallel with mounting pressure on survey budgets. These developments have led to a growing focus on the representativeness, or balance, of the responding sample units with respect to relevant auxiliary variables. One example of such a measure is the representativeness indicator, or R-indicator. The R-indicator is based on the design-weighted sample variation of estimated response propensities and presupposes linked auxiliary data. One criticism of the indicator is that it cannot be used in settings where auxiliary information is available only at the population level. In this paper, we propose a new method for estimating response propensities that does not require auxiliary information for non-respondents and is instead based on population-level auxiliary information. These population-based response propensities can then be used to develop R-indicators that employ population contingency tables or population frequency counts. We discuss the statistical properties of the indicators and evaluate their performance in an evaluation study based on real census data and in an application to the Dutch Health Survey.

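As a point of reference, a commonly used form of the R-indicator is one minus twice the design-weighted standard deviation of the estimated response propensities. The sketch below computes that quantity from already-estimated propensities; it does not implement the paper's population-based propensity estimation, and the weighted-variance normalization is a simplifying assumption.

```python
# Minimal sketch of an R-indicator computed as 1 - 2 * S(rho_hat), where S is
# the design-weighted standard deviation of estimated response propensities.
# The propensity estimation step proposed in the paper (using population-level
# auxiliary information) is not shown, and the normalization of the weighted
# variance is a simplifying assumption.
import numpy as np

def r_indicator(propensities, design_weights):
    rho = np.asarray(propensities, dtype=float)
    w = np.asarray(design_weights, dtype=float)
    mean_rho = np.sum(w * rho) / np.sum(w)
    var_rho = np.sum(w * (rho - mean_rho) ** 2) / np.sum(w)
    return 1.0 - 2.0 * np.sqrt(var_rho)

# Hypothetical propensities and weights: uniform propensities give R = 1,
# more variable propensities give a lower (less representative) value.
print(r_indicator([0.5, 0.5, 0.5], [100, 80, 120]))   # 1.0
print(r_indicator([0.2, 0.5, 0.8], [100, 80, 120]))   # about 0.49
```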

Semiparametric quantile regression imputation for a complex survey with application to the Conservation Effects Assessment Project

by Emily Berg and Cindy Yu

Development of imputation procedures appropriate for data with extreme values or nonlinear relationships to covariates is a significant challenge in large-scale surveys. We develop an imputation procedure for complex surveys based on semiparametric quantile regression. We apply the method to the Conservation Effects Assessment Project (CEAP), a large-scale survey that collects data used in quantifying soil loss from crop fields. In the imputation procedure, we first generate imputed values from a semiparametric model for the quantiles of the conditional distribution of the response given a covariate. Then, we estimate the parameters of interest using the generalized method of moments (GMM). We derive the asymptotic distribution of the GMM estimators for a general class of complex survey designs. In simulations meant to represent the CEAP data, we evaluate variance estimators based on the asymptotic distribution and compare the semiparametric quantile regression imputation (QRI) method to fully parametric and nonparametric alternatives. The QRI procedure is more efficient than nonparametric and fully parametric alternatives, and empirical coverages of confidence intervals are within 1% of the nominal 95% level. An application to estimation of mean erosion indicates that QRI may be a viable option for CEAP.

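To make the first step concrete, here is a stripped-down sketch of quantile-regression imputation: fit quantile regressions on the respondents over a grid of quantile levels, then impute each missing response by drawing a level at random and evaluating the fitted conditional quantile at that unit's covariate. The sketch uses ordinary linear quantile regression on simulated data; the paper's semiparametric (spline-based) specification, the complex survey design and the GMM estimation step are not shown.

```python
# Stripped-down quantile-regression imputation on simulated data. Assumptions:
# linear quantile regression (not the paper's semiparametric spline version),
# an equally spaced grid of quantile levels, and no complex-design weighting
# or GMM step.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 2 + 0.8 * x + rng.standard_normal(n) * (0.5 + 0.2 * x)   # heteroscedastic response
missing = rng.random(n) < 0.3                                 # ~30% of y missing

X = sm.add_constant(x)
taus = np.linspace(0.05, 0.95, 19)                            # grid of quantile levels
fits = [sm.QuantReg(y[~missing], X[~missing]).fit(q=tau) for tau in taus]

# Impute: draw a quantile level at random for each missing unit and evaluate
# the corresponding fitted conditional quantile at that unit's covariate value.
X_miss = X[missing]
draw = rng.integers(len(taus), size=X_miss.shape[0])
y_imputed = y.copy()
y_imputed[missing] = [fits[k].predict(X_miss[[i]])[0] for i, k in enumerate(draw)]
```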

Multiple imputation of missing values in household data with structural zeros

by Olanrewaju Akande, Jerome Reiter and Andrés F. Barrientos

We present an approach for imputation of missing items in multivariate categorical data nested within households. The approach relies on a latent class model that (i) allows for household-level and individual-level variables, (ii) ensures that impossible household configurations have zero probability in the model, and (iii) can preserve multivariate distributions both within households and across households. We present a Gibbs sampler for estimating the model and generating imputations. We also describe strategies for improving the computational efficiency of the model estimation. We illustrate the performance of the approach with data that mimic the variables collected in typical population censuses.

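A toy illustration of the "structural zeros" idea: candidate values for a household are drawn and redrawn until they satisfy deterministic edit rules, so that impossible configurations receive zero probability. The roles, probabilities and edit rules below are hypothetical, and the paper embeds this logic inside a full latent class model and Gibbs sampler rather than the simple rejection loop shown here.

```python
# Toy illustration of structural zeros: impossible household configurations
# (e.g., two household heads) receive probability zero by redrawing until the
# edit rules hold. The roles, probabilities and rules are hypothetical; the
# paper uses a latent class model and a Gibbs sampler, not this rejection loop.
import numpy as np

rng = np.random.default_rng(42)
ROLES = ["head", "spouse", "child"]

def violates_edits(roles):
    """Hypothetical edit rules: exactly one head and at most one spouse."""
    return roles.count("head") != 1 or roles.count("spouse") > 1

def impute_household(role_probs, household_size, max_tries=1000):
    """Redraw the whole household until no edit rule is violated."""
    for _ in range(max_tries):
        roles = [str(r) for r in rng.choice(ROLES, size=household_size, p=role_probs)]
        if not violates_edits(roles):
            return roles
    raise RuntimeError("no feasible configuration found")

print(impute_household(role_probs=[0.4, 0.3, 0.3], household_size=3))
```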

An optimisation algorithm applied to the one-dimensional stratification problem

by José André de Moura Brito, Tomás Moura da Veiga and Pedro Luis do Nascimento Silva

This paper presents a new algorithm to solve the one-dimensional optimal stratification problem, which reduces to just determining stratum boundaries. When the number of strata H and the total sample size n are fixed, the stratum boundaries are obtained by minimizing the variance of the estimator of a total for the stratification variable. This algorithm uses the Biased Random Key Genetic Algorithm (BRKGA) metaheuristic to search for the optimal solution. This metaheuristic has been shown to produce good-quality solutions for many optimization problems in modest computing times. The algorithm is implemented in the R package stratbr, available from CRAN (de Moura Brito, do Nascimento Silva and da Veiga, 2017a). Numerical results are provided for a set of 27 populations, enabling comparison of the new algorithm with some competing approaches available in the literature. The algorithm outperforms simpler approximation-based approaches as well as a couple of other optimization-based approaches. It also matches the performance of the best available optimization-based approach, due to Kozak (2004). Its main advantage over Kozak’s approach is the coupling of the optimal stratification with the optimal allocation proposed by de Moura Brito, do Nascimento Silva, Silva Semaan and Maculan (2015), thus ensuring that if the stratification boundaries obtained are globally optimal, then the overall solution, stratification boundaries plus sample allocation, is also globally optimal.

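The objective that any such search has to evaluate can be sketched compactly: given candidate boundaries, form the strata, allocate the fixed sample size n (Neyman allocation below), and return the variance of the stratified estimator of the total. The sketch shows only this objective on simulated data; it implements neither the BRKGA search nor the stratbr package interface.

```python
# Objective function for one-dimensional stratification: variance of the
# stratified estimator of the total for candidate stratum boundaries, with a
# rounded Neyman allocation of the fixed sample size n (so the allocation may
# not sum exactly to n). Data and boundaries are illustrative; neither the
# BRKGA search nor the stratbr interface is shown here.
import numpy as np

def stratified_variance(x, boundaries, n):
    x = np.asarray(x, dtype=float)
    edges = [-np.inf, *sorted(boundaries), np.inf]
    strata = [x[(x > lo) & (x <= hi)] for lo, hi in zip(edges[:-1], edges[1:])]
    N_h = np.array([len(s) for s in strata])
    S_h = np.array([s.std(ddof=1) if len(s) > 1 else 0.0 for s in strata])
    n_h = np.maximum(np.round(n * N_h * S_h / np.sum(N_h * S_h)), 2)  # Neyman, at least 2 per stratum
    n_h = np.minimum(n_h, N_h)                                        # cannot exceed stratum size
    return np.sum(N_h**2 * (1 - n_h / N_h) * S_h**2 / n_h)

x = np.random.default_rng(7).lognormal(mean=2, sigma=1, size=5000)    # skewed study variable
print(stratified_variance(x, boundaries=[5, 15, 40], n=200))          # H = 4 strata
```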

An assessment of accuracy improvement by adaptive survey design

by Carl-Erik Särndal and Peter Lundquist

High nonresponse occurs in many sample surveys today, including important surveys carried out by government statistical agencies. Adaptive data collection can be advantageous in those conditions: Lower nonresponse bias in survey estimates can be gained, up to a point, by producing a well-balanced set of respondents. Auxiliary variables serve a twofold purpose: Used in the estimation phase, through calibrated adjustment weighting, they reduce, but do not entirely remove, the bias. In the preceding adaptive data collection phase, auxiliary variables also play a major role: They are instrumental in reducing the imbalance in the ultimate set of respondents. For such combined use of auxiliary variables, the deviation of the calibrated estimate from the unbiased estimate (under full response) is studied in the article. We show that this deviation is a sum of two components. The reducible component can be decreased through adaptive data collection, all the way to zero if perfectly balanced response is realized with respect to a chosen auxiliary vector. By contrast, the resisting component changes little or not at all under a better-balanced response; it represents the part of the deviation that adaptive design cannot remove. The relative size of the former component is an indicator of the potential payoff from an adaptive survey design.

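The "calibrated adjustment weighting" referred to here can be illustrated with a textbook linear (GREG-type) calibration: design weights are adjusted so that the weighted respondent totals of the auxiliary vector reproduce known population totals. The sketch below is generic and hypothetical; it is not the paper's estimator, and it says nothing about the reducible/resisting decomposition studied in the article.

```python
# Generic linear calibration of respondent design weights to known auxiliary
# totals (a textbook GREG-type adjustment). The data, auxiliary vector and
# totals are hypothetical; this is not the paper's estimator or its
# deviation decomposition.
import numpy as np

def calibrate_linear(d, X, population_totals):
    """Return w = d * (1 + X @ lam) such that X' w equals the population totals."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    t = np.asarray(population_totals, dtype=float)
    lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
    return d * (1.0 + X @ lam)

rng = np.random.default_rng(3)
X_resp = np.column_stack([np.ones(100), rng.integers(18, 80, 100)])  # intercept and age of respondents
d_resp = np.full(100, 50.0)                                          # respondents' design weights
totals = np.array([6000.0, 6000 * 45.0])                             # known population size and total age
w = calibrate_linear(d_resp, X_resp, totals)
print(X_resp.T @ w)                                                  # reproduces the population totals
```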

An alternative way of estimating a cumulative logistic model with complex survey data

by Phillip S. Kott and Peter Frechtel

When fitting an ordered categorical variable with L > 2 levels to a set of covariates using complex survey data, it is common to assume that the elements of the population follow a simple cumulative logistic regression model (the proportional-odds logistic-regression model). This means the probability that the categorical variable is at or below some level is a binary logistic function of the model covariates. Moreover, except for the intercept, the values of the logistic-regression parameters are the same at each level. The conventional “design-based” method for fitting the proportional-odds model is based on pseudo-maximum likelihood. We compare estimates computed using pseudo-maximum likelihood with those computed under an alternative design-sensitive, robust, model-based framework. We show with a simple numerical example how estimates from the two approaches can differ. The alternative approach is easily extended to fit a general cumulative logistic model, in which the parallel-lines assumption can fail. A test of that assumption easily follows.

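For reference, the proportional-odds model described above can be written as follows (sign conventions for the slope vector vary across software):

```latex
% Proportional-odds (cumulative logistic) model: one slope vector beta shared
% by all levels, with ordered intercepts alpha_1 <= ... <= alpha_{L-1}.
\[
  \Pr(Y_i \le \ell \mid \boldsymbol{x}_i)
    = \frac{\exp(\alpha_\ell + \boldsymbol{x}_i'\boldsymbol{\beta})}
           {1 + \exp(\alpha_\ell + \boldsymbol{x}_i'\boldsymbol{\beta})},
  \qquad \ell = 1, \dots, L - 1.
\]
% The general cumulative logistic model replaces beta with level-specific
% slopes beta_ell; the parallel-lines (proportional-odds) assumption is
% precisely that these level-specific slopes coincide.
```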

On combining independent probability samples

by Anton Grafström, Magnus Ekström, Bengt Gunnar Jonsson, Per-Anders Esseen and Göran Ståhl

Merging available sources of information is becoming increasingly important for improving estimates of population characteristics in a variety of fields. In the presence of several independent probability samples from a finite population, we investigate options for a combined estimator of the population total, based either on a linear combination of the separate estimators or on the combined-sample approach. A linear combination estimator based on estimated variances can be biased, because the separate estimators of the population total can be highly correlated with their respective variance estimators. We illustrate the possibility of using the combined sample to estimate the variances of the separate estimators, which results in general pooled variance estimators. These pooled variance estimators use all available information and have the potential to significantly reduce the bias of a linear combination of separate estimators.

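For reference, the linear combination estimator discussed above is typically an inverse-variance weighted average of the separate estimators, written out below. The bias issue noted in the abstract arises when the weights are computed from variance estimators that are correlated with the corresponding point estimators, which is what the proposed pooled variance estimators aim to mitigate.

```latex
% Inverse-variance linear combination of K independent estimators of the total
% (a standard form, shown for reference; the paper's pooled variance estimators
% would replace the separate variance estimates used in the weights).
\[
  \hat{Y}_{\mathrm{comb}} = \sum_{k=1}^{K} \lambda_k \hat{Y}_k,
  \qquad
  \lambda_k = \frac{1 / \widehat{V}(\hat{Y}_k)}{\sum_{j=1}^{K} 1 / \widehat{V}(\hat{Y}_j)},
  \qquad
  \sum_{k=1}^{K} \lambda_k = 1.
\]
```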

Bayesian benchmarking of the Fay-Herriot model using random deletion

by Balgobin Nandram, Andreea L. Erciulescu and Nathan B. Cruze

Benchmarking lower-level estimates to upper-level estimates is an important activity at the United States Department of Agriculture’s National Agricultural Statistics Service (NASS) (e.g., benchmarking county estimates to state estimates for corn acreage). Assuming that a county is a small area, we use the original Fay-Herriot model to obtain a general Bayesian method for benchmarking county estimates to the state estimate (the target). Here the target is assumed known, and the county estimates are obtained subject to the constraint that they must sum to the target. This is external benchmarking; it is important for official statistics well beyond NASS, and it arises more generally in small area estimation. One can benchmark the estimates by “deleting” one of the counties (typically the last one) to incorporate the benchmarking constraint into the model. However, the estimates may change depending on which county is deleted. Our contribution is to give each small area a chance to be deleted, a procedure we call the random deletion benchmarking method. We show empirically that the estimates differ depending on which county is deleted, and that they also differ from those obtained under random deletion. Although these differences may be considered small, random deletion is the most sensible choice because it does not give preferential treatment to any county and can provide a small improvement in precision over benchmarking by deleting the last county.

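For readers less familiar with the setup, the Fay-Herriot model and the external benchmarking constraint can be sketched as below. The priors, the form of the aggregation weights and the exact way the constraint enters the Gibbs sampler are not shown and may differ from the paper.

```latex
% Fay-Herriot area-level model for counties i = 1, ..., m, with an external
% benchmark to the known state target T. Priors and the precise aggregation
% weights a_i are not shown and may differ from the paper.
\begin{align*}
  \hat{\theta}_i &= \theta_i + e_i, & e_i &\sim N(0, \sigma_i^2) \quad \text{(sampling error)}, \\
  \theta_i &= \boldsymbol{x}_i'\boldsymbol{\beta} + v_i, & v_i &\sim N(0, \sigma_v^2) \quad \text{(county effect)},
\end{align*}
\[
  \text{subject to} \qquad \sum_{i=1}^{m} a_i \theta_i = T .
\]
% Solving the constraint for one county's theta and substituting it back into
% the model is the "deletion" described in the abstract; random deletion gives
% each county a chance to play that role rather than always deleting the last one.
```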

