Are probability surveys bound to disappear for the production of official statistics?
Section 1. Introduction

In 1934, Jerzy Neyman laid the foundation for probability survey theory and his design-based approach to inference with an article published in the Journal of the Royal Statistical Society. His article (Neyman, 1934) piqued the interest of a number of statisticians at the time, and the theory was developed further in the following years. Still today, many articles on this topic are published in statistics journals. Rao (2005) provides an excellent review of various developments in probability survey theory during the 20th century (see also Bethlehem, 2009; Rao and Fuller, 2017; Kalton, 2019). Nowadays, national statistical agencies, such as Statistics Canada and the Institut National de la Statistique et des Études Économiques (INSEE) in France, most often use probability surveys to obtain the information they seek about a population of interest.

The popularity of probability surveys for producing official statistics stems largely from the non-parametric nature of the inference approach developed by Neyman (1934). In other words, probability surveys allow for valid inferences about a population without having to rely on model assumptions. This is appealing – even fundamental, according to Deville (1991) – to national statistical agencies that produce official statistics. In fact, these agencies have historically been reluctant to take unnecessary risks, which are unavoidable for approaches that depend on the validity of model assumptions, especially when it is difficult to check the underlying assumptions.
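This model-free property can be illustrated with the classical Horvitz-Thompson estimator, a standard result of design-based theory that we recall here for concreteness. For a probability sample \(s\) drawn from a finite population \(U\) with known inclusion probabilities \(\pi_k > 0\), the population total \(Y = \sum_{k \in U} y_k\) is estimated by

\[
\hat{Y}_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k},
\]

which satisfies \(E_p(\hat{Y}_{HT}) = Y\), where \(E_p\) denotes the expectation with respect to the sampling design. No model is assumed for the values \(y_k\); the only source of randomness is the sample selection, which is under the statistician's control.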

However, estimates from probability surveys can prove inefficient, even to the point of being unusable, particularly when the sample size is small (see, for example, Rao and Molina, 2015). Furthermore, they rest on the assumption that non-sampling errors, such as measurement, coverage or non-response errors, are negligible. National statistical agencies often invest considerable resources to minimize these errors: questionnaires are tested to ensure that respondents fully understand them; survey data are validated using various edit rules; respondents are contacted again, if necessary, to confirm the data collected; non-respondent follow-ups are conducted to reduce the impact of non-response on the estimates; and so on. Despite all these efforts, non-sampling errors persist in practice. There are, of course, adaptations of the theory that take these errors into account, but such adaptations necessarily come with model assumptions and thus with the risk of bias resulting from inadequate assumptions. Probability surveys are not a panacea, but they are generally recognized as providing a reliable source of information about a population, except when non-sampling errors become dominant. Brick (2011) takes the argument further and defends the idea that a properly designed probability survey with a low response rate usually provides estimates with smaller bias than those obtained from a volunteer non-probability survey. Dutwin and Buskirk (2017) present empirical results that corroborate this argument.

For the past few years, a wind of change has been blowing over national statistical agencies, and other data sources are increasingly being explored. Five key factors are behind this trend: i) the decline in response rates in probability surveys in recent years; ii) the high cost of data collection; iii) the increased burden on respondents; iv) the desire for access to “real-time” statistics (Rao, 2020), in other words, the ability to produce statistics practically at the same time as, or very shortly after, the information needs are expressed; and v) the proliferation of non-probability data sources (Rancourt, 2019), such as administrative sources, social media, web surveys, etc. To control the data collection costs of probability surveys and reduce the adverse effects of non-response on the quality of estimates, a number of authors have proposed and evaluated responsive data collection methods (e.g., Laflamme and Karaganis, 2010; Lundquist and Särndal, 2013; Schouten, Calinescu and Luiten, 2013; Beaumont, Haziza and Bocci, 2014; Särndal, Lumiste and Traat, 2016). Tourangeau, Brick, Lohr and Li (2017) review various methods and point out their limited success in reducing non-response bias and costs. Särndal et al. (2016) reach the same conclusion regarding bias. Some surveys conducted by national statistical agencies still have very low response rates, and it becomes risky to rely solely on data collection and estimation methods to correct potential non-response biases. Indeed, a number of authors (e.g., Rivers, 2007; Elliott and Valliant, 2017) have pointed out the similarity between a probability survey with a very low response rate and a non-probability survey. Yet a non-probability survey has the advantages of a usually much larger sample size and a lower cost. Given the above discussion, some have come to believe that probability surveys could gradually disappear (see Couper, 2000; Couper, 2013; Miller, 2017).

However, data from non-probability sources are not without challenges, as noted by Couper (2000), Baker, Brick, Bates, Battaglia, Couper, Dever, Gile and Tourangeau (2013), and Elliott and Valliant (2017), among others. For example, it is well known that non-probability surveys that collect data from volunteers can often lead to estimates with significant selection bias (or participation bias). Bethlehem (2016) provides a bias expression and argues that the potential for bias is usually higher for a non-probability survey than for a probability survey affected by non-response. Meng (2018) shows that this bias comes to dominate the mean squared error as the non-probability sample size increases, which drastically reduces the effective sample size. Therefore, the acquisition of large non-probability samples alone cannot ensure the production of estimates of acceptable quality. The pre-election poll conducted by the Literary Digest magazine to predict the outcome of the 1936 U.S. presidential election is a prime example (Squire, 1988; Elliott and Valliant, 2017). Despite a huge sample of over two million people, the poll failed to predict Franklin Roosevelt’s overwhelming victory; instead, it incorrectly predicted a convincing victory for his opponent, Alfred Landon. The set of poll respondents was highly unrepresentative of the voting population, being made up mainly of car and telephone owners as well as the magazine’s subscribers. Couper (2000) and Elliott and Valliant (2017) cite other, more recent examples of non-probability surveys that led to erroneous conclusions.
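Meng’s argument can be made concrete with his error decomposition, reproduced here in slightly simplified notation (see Meng, 2018, for the exact statement). For the mean \(\bar{y}_n\) of a non-probability sample of size \(n\) taken from a population of size \(N\) with mean \(\bar{Y}_N\),

\[
\bar{y}_n - \bar{Y}_N \;=\; \rho_{R,Y} \times \sqrt{\frac{N-n}{n}} \times \sigma_Y,
\]

where \(R\) is the participation indicator, \(\rho_{R,Y}\) is the finite-population correlation between participation and the variable of interest, and \(\sigma_Y\) is the population standard deviation. Bethlehem’s (2016) bias expression conveys a similar message, relating the bias to the covariance between participation propensities and the variable of interest. Unless \(\rho_{R,Y}\) is close to zero, which self-selection does not guarantee, the error remains driven by \(\rho_{R,Y}\) rather than by the sample size, and the effective sample size can be a minuscule fraction of \(n\).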

Selection bias is not the only challenge that must be overcome when using data from a non-probability source. Another major challenge is the presence of measurement errors (e.g., Couper, 2000). These errors can significantly affect the estimates, especially when data are collected without the help of an experienced interviewer, as is the case for most non-probability sources, in particular volunteer web surveys.

The current context leads to the following question: How can data from a non-probability source be used to minimize, or even eliminate, the data collection costs and respondent burden of a probability survey, while preserving a valid statistical inference framework and acceptable quality? That is the main question this article attempts to answer.

Most of the methods we present integrate data from a probability survey and a non-probability source. Zhang (2012) discusses the concept of statistical validity when integrated data are used to make inferences. We contend that establishing a statistical framework in which valid inferences can be made is essential for the production of official statistics, a view that Rancourt (2019) also appears to share. Without such a framework, the usual properties of estimators, such as bias and variance, are not defined. It then becomes impossible to select estimators based on an objective criterion, such as choosing the linear unbiased estimator with the smallest possible variance. Without a valid statistical inference framework, estimates can still be calculated, but all the usual tools for assessing the quality of those estimates and drawing accurate conclusions about the population characteristics of interest are lost.

In the rest of this article, we distinguish design-based approaches to inference, described in Section 3, from model-based approaches to inference, described in Section 4. For each approach, we consider two scenarios. In the first, the data from the non-probability source match the concepts of interest exactly and are free of measurement errors; they can therefore be used to replace data from a probability survey. In the second, the data from the non-probability source do not exactly reflect the concepts of interest or are subject to measurement errors; although they cannot directly replace data from a probability survey, they can still be used as auxiliary information to enhance it. In Section 5, we provide some additional thoughts. Let us first begin with some background in Section 2.

