Are probability surveys bound to disappear for the production of official statistics?
Section 3. Design-based approaches
Design-based approaches yield design-consistent estimators of the population total $t_y = \sum_{i \in U} y_i$ even when the non-probability source produces estimates with a significant selection bias. In this context, the purpose of using a non-probability sample is to reduce the variance of estimators of $t_y$. The efficiency gains achieved can be used to justify a reduction of the probability sample size, and thereby a reduction of data collection costs and respondent burden. The methods that we consider in Sections 3.1 and 3.2 require collecting the values of the variable of interest $y$ in the probability sample, just like the small area estimation methods described in Section 4.4. However, the efficiency gains are usually expected to be more modest than those obtained using small area estimation methods. In Section 3.1, we consider the scenario in which the value $y_i^{NP}$ provided by the non-probability source for a unit $i \in s_{NP}$ is free of measurement error, i.e., $y_i^{NP} = y_i$, whereas in Section 3.2, we consider the scenario in which this assumption is questionable and the non-probability data are used only as auxiliary information.
3.1 Weighting by the inverse of the probability of inclusion in the combined sample
The ideal case occurs when the non-probability sample is a census, i.e., $s_{NP} = U$. In that case, the value of the parameter of interest $t_y$ can be directly calculated without worrying about bias or variance since the absence of measurement error, $y_i^{NP} = y_i$, is assumed in this section. In general, we expect under-coverage in the sense that the non-probability sample $s_{NP}$ is smaller than the population $U$. In a design-based approach, the potential under-coverage bias can be addressed by selecting a probability sample $s_P$ from $U$ and collecting the values of the variable of interest $y$ for the sample units. Ideally, the probability sample is drawn from $U - s_{NP}$, but it is possible that the units in $s_{NP}$ cannot be linked to those of the sampling frame to establish the set $U - s_{NP}$. In general, the larger the non-probability sample, the more it is possible to reduce the size of the probability sample without jeopardizing the desired precision of the estimates.
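Written out, the decomposition that underlies this strategy is simply
$$t_y = \sum_{i \in U} y_i = \sum_{i \in s_{NP}} y_i + \sum_{i \in U - s_{NP}} y_i,$$
where the first term is known exactly under the no-measurement-error assumption of this section and only the second term needs to be estimated from the probability sample.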
It seems desirable to estimate $t_y$ using all the data collected in the combined sample $s_{NP} \cup s_P$. The indicator of inclusion in $s_{NP} \cup s_P$ can be defined as $I_i^{NP \cup P} = \delta_i + (1 - \delta_i) I_i^P$, where $\delta_i = 1$ if $i \in s_{NP}$ and $\delta_i = 0$ otherwise, and $I_i^P$ is the analogous indicator for $s_P$. To obtain a design-unbiased estimator of $t_y$, each unit $i \in s_{NP} \cup s_P$ is weighted by $1/\pi_i^{NP \cup P}$, where $\pi_i^{NP \cup P} = \Pr(i \in s_{NP} \cup s_P)$. Under assumptions 1 and 2, $\pi_i^{NP \cup P} = \delta_i + (1 - \delta_i)\pi_i^P$, where $\pi_i^P = \Pr(i \in s_P)$, and we obtain a weight of 1 for the units of $s_{NP}$ and of $1/\pi_i^P$ for the units of $s_P$ that are not in $s_{NP}$. The resulting estimator is written:
$$\hat{t}_y^{\,NP \cup P} = \sum_{i \in s_{NP} \cup s_P} \frac{y_i}{\pi_i^{NP \cup P}} = \sum_{i \in s_{NP}} y_i + \sum_{i \in s_P} \frac{(1 - \delta_i)\, y_i}{\pi_i^P}. \qquad (3.1)$$
Note that estimator (3.1) requires the indicator $\delta_i$ to be available for all units in the sample $s_P$. For the units $i \in s_{NP} \cap s_P$, we have two values: $y_i^{NP}$, reported in the non-probability source, and $y_i$, collected in the probability sample. In principle, we should have $y_i^{NP} = y_i$, but it is possible that this relationship is not exactly satisfied. These units can be used to validate the assumption $y_i^{NP} = y_i$. If significant differences are observed, it may be preferable not to consider this approach and to rely on the methods in Section 3.2 that use data from the non-probability source as auxiliary data. If we trust the data quality of the non-probability source, it may be advisable not to collect the variable $y$ in the probability sample for the units also present in the non-probability sample, in order to reduce data collection costs and respondent burden.
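As a rough illustration of this validation step, a simple comparison on the overlap units might look as follows (hypothetical data and a purely descriptive check, not a procedure prescribed here):

import numpy as np

# Hypothetical values for the overlap units i in both s_NP and s_P:
# y_np as reported in the non-probability source, y as collected in s_P.
y_np = np.array([12.0, 7.5, 3.2, 9.1, 4.4])
y    = np.array([12.0, 7.5, 3.0, 9.1, 4.4])

diff = y_np - y
print("mean difference:", diff.mean())              # evidence of systematic error?
print("exact agreement rate:", np.mean(diff == 0))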
We can view the problem as if we had two sampling frames: $U$ and $s_{NP}$. A sample $s_P$ is drawn randomly from $U$, and a census is taken from $s_{NP}$. The probability of selection in the combined sample $s_{NP} \cup s_P$ can then be calculated for each unit $i \in s_{NP} \cup s_P$, and the estimator (3.1) is recovered by weighting each unit by the inverse of that probability. This approach was proposed by Bankier (1986) to address the problem of multiple sampling frames. In the context of integrating a probability and a non-probability sample, estimator (3.1) was proposed by Kim and Tam (2020).
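To fix ideas, a minimal numerical sketch of estimator (3.1) follows; the arrays are hypothetical, and the sketch assumes that $y_i$ is observed for every unit of the combined sample, that the inclusion probabilities $\pi_i^P$ are known, and that $\delta_i$ is observed for every unit of $s_P$:

import numpy as np

# Hypothetical non-probability sample: y observed, weight 1.
y_np = np.array([4.0, 6.5, 3.1, 8.2])

# Hypothetical probability sample: y observed, inclusion probabilities pi_p
# known, and delta = 1 flags units that also belong to s_NP.
y_p   = np.array([5.0, 2.2, 7.4, 9.0, 1.8])
pi_p  = np.array([0.01, 0.02, 0.01, 0.05, 0.02])
delta = np.array([0, 1, 0, 0, 1])

# Estimator (3.1): the sum over s_NP plus the weighted sum over the units
# of s_P that are not in s_NP.
t_hat = y_np.sum() + np.sum((1 - delta) * y_p / pi_p)
print(t_hat)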
The last sum of (3.1) is a design-unbiased estimator of $\sum_{i \in U - s_{NP}} y_i$, the total of $y$ over the portion of the population not covered by the non-probability sample. If a vector of auxiliary variables, $\mathbf{x}_i$, is available for the units $i \in s_P$, as well as its population total $\mathbf{t}_x = \sum_{i \in U} \mathbf{x}_i$, then the design weight $1/\pi_i^P$ in (3.1) can be replaced with a calibrated weight $w_i$ (e.g., Deville and Särndal, 1992; Haziza and Beaumont, 2017). The calibrated weights minimize a distance function between $w_i$ and $1/\pi_i^P$ under the constraint of satisfying the calibration equation $\sum_{i \in s_P} w_i \mathbf{x}_i = \mathbf{t}_x$. Ideally, the calibration is done only on the portion of the population not covered by the non-probability sample, i.e., the calibration vector $(1 - \delta_i)\mathbf{x}_i$ is used, and the calibration equation becomes: $\sum_{i \in s_P} w_i (1 - \delta_i)\mathbf{x}_i = \sum_{i \in U - s_{NP}} \mathbf{x}_i$. This is not possible when the total $\sum_{i \in U - s_{NP}} \mathbf{x}_i$ is unknown.
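For concreteness, one common choice of distance function, the chi-square distance considered by Deville and Särndal (1992), gives calibrated weights in closed form. Writing $d_i = 1/\pi_i^P$,
$$w_i = d_i\left(1 + \mathbf{x}_i^{\top}\hat{\boldsymbol{\lambda}}\right), \qquad \hat{\boldsymbol{\lambda}} = \left(\sum_{i \in s_P} d_i\, \mathbf{x}_i \mathbf{x}_i^{\top}\right)^{-1}\left(\mathbf{t}_x - \sum_{i \in s_P} d_i\, \mathbf{x}_i\right),$$
which reproduces the totals $\mathbf{t}_x$ exactly; other distance functions lead to weights that must be computed iteratively.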
Remark: If assumption 2 is not appropriate, then $\pi_i^{NP \cup P} \neq \delta_i + (1 - \delta_i)\pi_i^P$. To get around this problem, all the units for which the data were collected after selecting the sample $s_P$ can be removed from $s_{NP}$. Assumption 2 is then satisfied, but a lot of available data may be omitted. To take advantage of the full set $s_{NP}$, it is necessary to make a few assumptions and partially depart from the design-based approach. Assuming that the selection of the probability sample does not depend on participation in the non-probability sample, i.e., $\Pr(i \in s_P \mid \delta_i) = \pi_i^P$, we can use Bayes' theorem to show that $\Pr(\delta_i = 1 \mid i \in s_P) = \Pr(\delta_i = 1) \equiv p_i$ for the units $i \in s_P$. Therefore, estimating $\pi_i^{NP \cup P} = p_i + (1 - p_i)\pi_i^P$ requires postulating a model for the participation probability $p_i$. Under some assumptions, $p_i$ can be estimated using the data from the probability sample and, for example, a logistic regression model. Estimating $p_i$ can also be done using the methods described in Section 4.3 that do not rely on the validity of assumption 2, such as the method by Chen, Li and Wu (2019). These methods require that the auxiliary variables used to model this probability be available for all units of the combined sample $s_{NP} \cup s_P$. Unlike in Section 4.3, here we can take advantage of the availability of $y_i$ for all units of both samples, and we can use the variable of interest as an auxiliary variable. Then, $t_y$ is estimated by replacing $\pi_i^{NP \cup P} = \delta_i + (1 - \delta_i)\pi_i^P$ in (3.1) with an estimate of $p_i + (1 - p_i)\pi_i^P$. Similar approaches were proposed by Beaumont, Bocci and Hidiroglou (2014) to take into account late respondents in Statistics Canada's National Household Survey, i.e., households that responded to the initial questionnaire after the follow-up probability sample of non-respondents was drawn.
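A rough computational sketch of this adjustment, under the simplified setting described above, might look as follows (hypothetical data; a logistic model for $p_i$ fitted on the probability sample, where both $\delta_i$ and $y_i$ are observed, then applied to every unit of the combined sample):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical probability sample s_P: y and delta observed for every unit,
# design inclusion probabilities pi_p known.
y_sp     = np.array([5.0, 2.2, 7.4, 9.0, 1.8, 6.3, 4.9, 3.3])
pi_sp    = np.array([0.01, 0.02, 0.01, 0.05, 0.02, 0.03, 0.01, 0.04])
delta_sp = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Hypothetical units of s_NP that are not in s_P, with their (assumed known)
# probabilities of inclusion in s_P.
y_snp  = np.array([2.5, 6.1, 3.8, 5.5])
pi_snp = np.array([0.02, 0.01, 0.03, 0.02])

# Model p_i = Pr(delta_i = 1) as a function of y, fitted on s_P; the variable
# of interest can serve as a predictor because it is observed in both samples.
model = LogisticRegression().fit(y_sp.reshape(-1, 1), delta_sp)

# Combined sample s_NP U s_P: the s_NP units not in s_P, plus all of s_P.
y_all  = np.concatenate([y_snp, y_sp])
pi_all = np.concatenate([pi_snp, pi_sp])
p_all  = model.predict_proba(y_all.reshape(-1, 1))[:, 1]

# Estimated probability of inclusion in the combined sample, and the
# corresponding adjusted version of estimator (3.1).
pi_comb = p_all + (1.0 - p_all) * pi_all
t_hat = np.sum(y_all / pi_comb)
print(t_hat)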
3.2 Calibration of the probability sample to the non-probability source
Data from non-probability sources, such as those provided by web panel respondents, can be fraught with measurement errors large enough to cast doubt on the assumption that $y_i^{NP} = y_i$. Therefore, such data cannot be used to directly replace the values of the variable of interest $y$. However, they can be used as auxiliary data to enhance the probability survey using the calibration technique. The non-probability source contains the values $y_i^{NP}$ for $i \in s_{NP}$, and potentially the values of other variables. From all these variables, it is possible to form a vector of auxiliary variables $\mathbf{x}_i^{NP}$, available for $i \in s_{NP}$, that could include an intercept. Its total is denoted as $\mathbf{t}_x^{NP} = \sum_{i \in s_{NP}} \mathbf{x}_i^{NP}$. Another vector of auxiliary variables, $\mathbf{x}_i$, may also be available for $i \in s_P$, as well as its total for the entire population, $\mathbf{t}_x = \sum_{i \in U} \mathbf{x}_i$. The calibrated weights $w_i$, $i \in s_P$, are obtained by minimizing a distance function between $w_i$ and $1/\pi_i^P$ under the constraint of satisfying the calibration equation
$$\sum_{i \in s_P} w_i \begin{pmatrix} \delta_i \mathbf{x}_i^{NP} \\ \mathbf{x}_i \end{pmatrix} = \begin{pmatrix} \mathbf{t}_x^{NP} \\ \mathbf{t}_x \end{pmatrix}.$$
Note that this calibration can be done only if $\delta_i$ and, when $\delta_i = 1$, $\mathbf{x}_i^{NP}$ are available in the probability sample for all units $i \in s_P$. The estimator of $t_y$ is again written as $\hat{t}_y = \sum_{i \in s_P} w_i\, y_i$, where $w_i$ is the calibrated weight satisfying the above calibration equation. No model assumption is required for the validity of the approach, and the resulting estimates remain design-consistent regardless of the strength of the relationship between $y_i$ and the auxiliary variables $\mathbf{x}_i^{NP}$ and $\mathbf{x}_i$. A strong relationship will, however, help reduce the design variance of $\hat{t}_y$. Kim and Tam (2020) discuss the use of such calibration.
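As a sketch of how such weights might be computed, the following uses the chi-square distance (one possible choice, not prescribed here) with entirely hypothetical data; the calibration vector stacks $\delta_i \mathbf{x}_i^{NP}$, taken here to be $\delta_i y_i^{NP}$, with an intercept calibrated to the population size:

import numpy as np

# Hypothetical probability sample: design weights d = 1/pi_p, the indicator
# delta, the value y_np reported by the non-probability source (set to 0
# when delta = 0), and the variable of interest y collected in the survey.
d     = np.array([100.0, 80.0, 120.0, 90.0, 110.0])
delta = np.array([1, 0, 1, 0, 1])
y_np  = np.array([3.1, 0.0, 2.4, 0.0, 4.0])
y     = np.array([5.0, 2.2, 7.4, 9.0, 1.8])

# Calibration vector z_i = (delta_i * y_np_i, 1) and its known totals:
# the total of y_np over s_NP and the population size N (both hypothetical).
Z = np.column_stack([delta * y_np, np.ones_like(y)])
t = np.array([1500.0, 500.0])

# Chi-square distance calibration: w = d * (1 + z' lambda).
lam = np.linalg.solve((Z * d[:, None]).T @ Z, t - Z.T @ d)
w = d * (1.0 + Z @ lam)

print(Z.T @ w)        # reproduces the calibration totals
print(np.sum(w * y))  # calibrated estimator of t_y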
Canada’s Labour
Force Survey (LFS) provides an example of a potential application for this
calibration method. The unemployment rate, defined as the number of unemployed
persons divided by the number of persons in the labour force, is a key
parameter of interest that the LFS estimates. To improve the precision of the
LFS estimates, a calibration variable indicating whether an individual is
receiving employment insurance could be effective because there is a clear connection between receiving employment insurance and being unemployed. The
total of this calibration variable, the number of employment insurance
beneficiaries, is needed for implementing this calibration and is available
from an administrative source. However, applying this method would require
adding a question to the LFS to identify LFS respondents who are receiving employment
insurance. This information could also be obtained through a linkage between
the LFS and the administrative source. It remains to be determined whether such
a calibration variable could yield significant gains in the LFS.