Are probability surveys bound to disappear for the production of official statistics?
Section 2. Background
One of the first steps to meet information needs is to define the target population for which that information is sought. We denote this target population by $U$. Then, it is necessary to define the parameters of interest, i.e., what it is desired to know about the target population. In practice, it is often desired to estimate many parameters. To simplify the discussion, we suppose that only one parameter is of interest: the total of the variable $y$, $\theta = \sum_{k \in U} y_k$, where $y_k$ is the value of the variable $y$ for unit $k$ of the population $U$. We use $\mathbf{Y}$ to denote the vector containing the values $y_k$ for $k \in U$. Lastly, a set of procedures must be established for the estimation of the parameter $\theta$ while taking into account various factors, such as the available budget, the respondent burden and the desired precision. During this process, it is necessary to identify the data sources that will be used (probabilistic or not) and a statistical inference framework that will allow for assessing the properties of the estimates produced, such as bias and variance.
The above sequence,
which starts with defining the target population and parameters of interest,
followed by the data sources and estimation procedures, is consistent with the
proposal by Citro (2014). She suggests that national statistical agencies first
determine the information needs along with potential users. Next, they can work
at identifying the data source(s) that will meet those needs while preserving
an acceptable quality of estimates, keeping costs within the established budget
and controlling for respondent burden. It seems preferable to avoid the reverse
procedure, however tempting it is, of first identifying available data sources
and then artificially determining the needs based on what can be produced by
these sources. In general, this kind of procedure cannot adequately meet users’
actual needs.
We assume that we have access to data from a non-probability source (e.g., administrative data or web survey data). Values are observed for a few variables, including a variable $y^*$, for all units of a subset of $U$, denoted as $s_{NP}$. The variable $y^*$ is not necessarily equal to $y$ because of conceptual differences and/or measurement errors. At least, it is hoped that there is a strong association between the two variables. We denote the inclusion indicator in $s_{NP}$ as $\delta_k$; in other words, $\delta_k = 1$ if unit $k$ is in $s_{NP}$ and $\delta_k = 0$ otherwise. The vector of the inclusion indicators $\delta_k$ for $k \in U$ is denoted by $\boldsymbol{\delta}$.
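To make the roles of $y^*$ and $\delta_k$ concrete, the following Python sketch considers a hypothetical naive estimator that simply sums $y_k^*$ over the units of $s_{NP}$. This naive estimator, the function name and the toy values are assumptions introduced for illustration only; they are not taken from the article. The sketch shows that the error of such a total splits into a measurement part (differences between $y_k^*$ and $y_k$ for included units) and a coverage part (units with $\delta_k = 0$).

```python
def naive_error_decomposition(y, y_star, delta):
    """Split the error of the naive total sum(delta_k * y_star_k) into two parts.

    y      : true values y_k for all k in U
    y_star : source values y*_k (only used where delta_k == 1)
    delta  : inclusion indicators in the non-probability source
    """
    theta = sum(y)
    naive_total = sum(d * ys for d, ys in zip(delta, y_star))
    # Measurement/conceptual part: y*_k differs from y_k for included units.
    measurement = sum(d * (ys - yk) for d, ys, yk in zip(delta, y_star, y))
    # Coverage part: units outside s_NP contribute nothing to the naive total.
    coverage = -sum((1 - d) * yk for d, yk in zip(delta, y))
    return naive_total - theta, measurement, coverage


# Toy population of four units (illustrative values only).
y = [10, 20, 30, 40]
y_star = [10, 22, 30, 41]  # entries for units outside s_NP are never used
delta = [1, 1, 0, 0]

total_error, measurement, coverage = naive_error_decomposition(y, y_star, delta)
print(total_error, measurement, coverage)
```

In this toy example the coverage part dominates, which is why a strong association between $y^*$ and $y$ is not sufficient on its own.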
Data from a probability survey may also be available. In that case, a sample $s_P$ of the population $U$ is randomly selected with probability $p(s_P \mid \mathbf{Z})$. The matrix $\mathbf{Z}$ contains information available on the sampling frame that is used to define the sampling design, such as stratum identifiers for each unit of the population. The sample inclusion indicators, $I_k$, $k \in U$, are defined as follows: $I_k = 1$ if unit $k$ is selected in the sample $s_P$; otherwise, $I_k = 0$. We use $\mathbf{I}$ to denote the vector containing the sample inclusion indicators for $k \in U$. The probability that unit $k$ of the population $U$ is chosen in the sample is denoted by $\pi_k = \Pr(I_k = 1 \mid \mathbf{Z})$. Most of the time it is known or can be approximated. We assume that $\pi_k > 0$ for all $k \in U$. For each unit $k \in s_P$, the values of certain variables are collected, which may or may not include the variable $y$.
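As an illustration of how design information determines the sample inclusion indicators and the inclusion probabilities, the sketch below draws a stratified simple random sample without replacement. The choice of this particular design, the function name `stratified_srs` and the toy strata are assumptions made for the example; the article itself does not restrict the sampling design. Here the stratum labels play the role of the design information, and each unit in stratum $h$ has inclusion probability $n_h / N_h$.

```python
import random


def stratified_srs(strata, n_by_stratum, seed=0):
    """Stratified simple random sampling without replacement.

    strata       : stratum label of each population unit (design information)
    n_by_stratum : dict mapping stratum label -> sample size n_h
    Returns (I, pi): sample inclusion indicators and inclusion probabilities.
    """
    rng = random.Random(seed)
    units_by_stratum = {}
    for k, h in enumerate(strata):
        units_by_stratum.setdefault(h, []).append(k)

    I = [0] * len(strata)
    pi = [0.0] * len(strata)
    for h, units in units_by_stratum.items():
        n_h = n_by_stratum[h]
        for k in units:
            pi[k] = n_h / len(units)  # pi_k = n_h / N_h, known before sampling
        for k in rng.sample(units, n_h):
            I[k] = 1  # unit k is selected in the sample
    return I, pi


# Toy frame: six units in stratum "a", four in stratum "b".
Z = ["a"] * 6 + ["b"] * 4
I, pi = stratified_srs(Z, {"a": 3, "b": 1})
print(I, pi)
```

Note that the inclusion probabilities depend only on the stratum labels, not on the values of the variable of interest.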
We use $\boldsymbol{\Omega}$ to denote the set of all the auxiliary data used to make inferences. Among other things, $\boldsymbol{\Omega}$ includes the design information, $\mathbf{Z}$, if a probability sample is used, and potentially other auxiliary variables such as calibration variables, matching variables or explanatory variables of a model (see Sections 3 and 4). The inclusion indicator $\delta_k$ can also be used as an auxiliary variable either for stratifying the population or for calibration (see Section 3). The vector $\boldsymbol{\delta}$ can thus be included in $\mathbf{Z}$ and $\boldsymbol{\Omega}$.
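The following two assumptions are used throughout the article:
Assumption 1: $\mathbf{I}$ is independent of $\mathbf{Y}$ and $\boldsymbol{\Omega}$ after conditioning on $\mathbf{Z}$.
Assumption 2: $\mathbf{I}$ and $\boldsymbol{\delta}$ are independent after conditioning on $\mathbf{Y}$ and $\boldsymbol{\Omega}$.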
Assumption 1 implies that the values of the variables included in $\mathbf{Y}$ and $\boldsymbol{\Omega}$ are not affected by whether or not a unit is included in the sample $s_P$. This is implicit in the literature on probability surveys and results from the very definition of the sampling design, which depends only on $\mathbf{Z}$. Assumption 2 is automatically satisfied if the non-probability source (and thus $\boldsymbol{\delta}$) is available prior to selecting the probability sample. Note that if $\boldsymbol{\delta}$ is used as an auxiliary variable to stratify the population, then $\boldsymbol{\delta}$ is included in $\mathbf{Z}$ and $\boldsymbol{\Omega}$, and Assumption 2 is still satisfied. It will not be satisfied if being selected in $s_P$ affects the provision of data to the non-probability source. For example, being selected in $s_P$ (and contacted) could be an indirect reminder for the selected individual to fill out forms required by the government (non-probability source). It can be expected that Assumptions 1 and 2 are satisfied in most cases.
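The union of $\boldsymbol{\Omega}$ and $\mathbf{Y}$ contains all the information used for making inferences. The various approaches to inference set out in Sections 3 and 4 differ in what they treat as fixed and what they treat as random. For example, in the design-based approach to inference, everything is considered fixed except for the vector $\mathbf{I}$; in other words, design-based inferences are conditional on $\mathbf{Y}$ and $\boldsymbol{\Omega}$. To simplify the notation, we use $\boldsymbol{\Omega}_Y$ to denote the union of $\boldsymbol{\Omega}$ and $\mathbf{Y}$. Thus, design expectations are denoted as $E(\cdot \mid \boldsymbol{\Omega}_Y)$ rather than $E(\cdot \mid \mathbf{Y}, \boldsymbol{\Omega})$. In the design-based approach to inference, an estimator $\hat{\theta}$ of $\theta$ is usually chosen so that the design bias, $E(\hat{\theta} \mid \boldsymbol{\Omega}_Y) - \theta$, is zero or negligible. Under Assumptions 1 and 2, we note that $\Pr(\mathbf{I} \mid \boldsymbol{\delta}, \boldsymbol{\Omega}_Y) = \Pr(\mathbf{I} \mid \mathbf{Z})$.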
For estimating the total $\theta$, an estimator of the form $\hat{\theta} = \sum_{k \in s_P} w_k y_k$ is frequently used, where $w_k$ is a survey weight for unit $k$. The standard basic weight is $w_k = 1/\pi_k$. This weight ensures that the estimator $\hat{\theta}$ is exactly design-unbiased for $\theta$. The basic weight can then be modified using calibration techniques (e.g., Deville and Särndal, 1992; Haziza and Beaumont, 2017). The advantage of this approach is its non-parametric nature: no model assumption is needed for making valid inferences about the population because the first two design moments are controlled by the statistician and are usually known. The approach is not entirely free of assumptions (for example, conditions are needed to ensure the consistency and asymptotic normality of estimators), but it does not require any parametric model.
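The design-unbiasedness of the weighted estimator $\sum_{k \in s_P} y_k / \pi_k$ can be checked numerically. The sketch below is a minimal illustration under an assumed design, simple random sampling without replacement of $n = 2$ units from a toy population of $N = 5$ (both the design and the values are hypothetical). It enumerates every possible sample, so the design expectation is computed exactly rather than by simulation; exact fractions avoid rounding error.

```python
from fractions import Fraction
from itertools import combinations


def ht_estimate(sample, y, pi):
    """Weighted estimator of the total with basic weights w_k = 1 / pi_k."""
    return sum(y[k] / pi[k] for k in sample)


# Toy population values (illustrative only).
y = [Fraction(v) for v in (3, 7, 1, 12, 5)]
N, n = len(y), 2
pi = [Fraction(n, N)] * N          # under SRS, pi_k = n / N for every unit
theta = sum(y)                     # true population total

# Every sample of size n is equally likely under this design.
samples = list(combinations(range(N), n))
p_sample = Fraction(1, len(samples))

# Design expectation: average of the estimator over all possible samples.
design_expectation = sum(p_sample * ht_estimate(s, y, pi) for s in samples)
design_bias = design_expectation - theta
print(design_expectation, theta, design_bias)
```

The population values are held fixed throughout; only the selected sample varies, which mirrors the design-based view in which the inclusion indicators are the sole source of randomness.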
In practice,
non-response and other non-sampling errors are often observed in probability
surveys. Non-response of some sample units is often viewed as an
additional phase of sampling that is not controlled by the statistician. In
other words, the non-response mechanism is not known, unlike the sampling
design. Assuming an adequate model for the non-response mechanism, estimators
with little or no bias can be obtained, for example by weighting the responding
units by the inverse of their estimated response probability. However, this
requires careful modelling of the response indicators. In the rest of this
paper, we ignore non-sampling errors and assume that the estimates from the probability
survey are not biased or, at least, that their bias is small compared to the
bias of the estimates from the non-probability source alone. This assumption
may not always be satisfied in practice, but it is reasonable in many contexts
(see Brick, 2011), especially in large surveys conducted by national
statistical agencies.
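One simple way to weight responding units by the inverse of their estimated response probability is a weighting-class adjustment. The sketch below assumes, for illustration only, that the response probability is constant within known classes and estimates it by the weighted response rate in each class; the function name, the classes and the toy values are hypothetical and are not taken from the article. More elaborate response models (e.g., logistic regression) follow the same principle.

```python
def nonresponse_adjusted_weights(weights, classes, responded):
    """Divide each respondent's weight by the estimated response probability.

    weights   : survey weights of the sampled units
    classes   : class label of each sampled unit (assumed response-homogeneous)
    responded : 1 if the unit responded, 0 otherwise
    Non-respondents receive an adjusted weight of zero.
    """
    wsum_all, wsum_resp = {}, {}
    for w, c, r in zip(weights, classes, responded):
        wsum_all[c] = wsum_all.get(c, 0.0) + w
        if r:
            wsum_resp[c] = wsum_resp.get(c, 0.0) + w

    adjusted = []
    for w, c, r in zip(weights, classes, responded):
        if r:
            p_hat = wsum_resp[c] / wsum_all[c]  # weighted response rate in class c
            adjusted.append(w / p_hat)
        else:
            adjusted.append(0.0)
    return adjusted


# Toy sample: four units in class "A", two in class "B".
weights = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0]
classes = ["A", "A", "A", "A", "B", "B"]
responded = [1, 1, 0, 0, 1, 0]

adjusted = nonresponse_adjusted_weights(weights, classes, responded)
print(adjusted)
```

Within each class, the adjusted weights of the respondents sum to the original weights of all sampled units in that class. The adjustment removes non-response bias only to the extent that the assumed response model is adequate.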
The acquisition of data from non-probability sources is generally inexpensive compared to the cost of collecting data from a probability survey. Therefore, such data would ideally be used to replace data from a probability survey. This data replacement is valid only if $y_k^* = y_k$. This assumption will not be satisfied with all non-probability data sources, but may be realistic with some administrative data sources. In Sections 3 and 4, we will distinguish the methods based on the assumption that $y_k^* = y_k$ from the methods not requiring this assumption. Several methods described in Sections 3 and 4 are also reviewed in the upcoming article by Rao (2020) that was presented to Statistics Canada in summer 2018.
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2020
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa