Statistical inference with non-probability survey samples
Section 7. Assumptions revisited
Our discussions of
estimation procedures for non-probability survey samples are under
assumptions A1-A4, and the
focus is on the validity and efficiency of estimators for the finite
population mean under three inferential frameworks. The theoretical results on
model-based prediction, inverse probability weighting and doubly robust
estimation have been rigorously established under those assumptions. It seems
that researchers are triumphant in dealing with the emerging area of
non-probability data sources. However, as pointed out by the 2021 ASA President
Robert Santos in his opinion article entitled “Using Our Superpowers to
Contribute to the Public Good” (Amstat News, May 2021), “Our superpowers are
only as good as their underlying assumptions, assumptions that are all too
often embraced with aplomb, yet cannot be proven.” How to check assumptions
A1-A4 in practical applications
of the methods is a question that can never be fully answered, and yet there
are steps to follow to boost the confidence in using the theoretical results.
It is also important to understand the potential consequences when certain
assumptions become seriously questionable.
7.1 Assumption A1
Assumption A1 states that
$P(R_i = 1 \mid \mathbf{x}_i, y_i) = P(R_i = 1 \mid \mathbf{x}_i)$ for
$i = 1, \ldots, N$, where $R_i$ is the sample inclusion indicator. It is the most
crucial assumption for the validity of the pseudo maximum likelihood estimator
of Chen et al. (2020) and the nonparametric kernel smoothing estimator
presented in Section 4.1.3 for the propensity scores, although all other assumptions
are also involved. It is equivalent to the missing at random (MAR) assumption
in the missing data literature. It is well understood that the MAR assumption
cannot be tested using the sample data itself. The same statement holds for
assumption A1 with non-probability survey samples.
In a nutshell, assumption A1 indicates that the auxiliary variables $\mathbf{x}$
included in the
non-probability sample fully characterize the participation behaviour or the
sample inclusion mechanism for units in the population. Sufficient attention
should be given at the study design stage before data collection, if such a
stage exists, to investigate potential factors and features of units which
might be related to participation and sample inclusion. For human populations,
the factors and features may include demographic variables, social and
economic indicators, and geographical variables.
Assumption A1 leads to the conclusion that the
conditional distribution of $y$ given $\mathbf{x}$ for units in
the non-probability sample is the same as the conditional distribution of
$y$ given $\mathbf{x}$ for units in
the target population. It implies that the auxiliary variables $\mathbf{x}$
should include relevant predictors for the study variable $y$. With the given
datasets $\{(y_i, \mathbf{x}_i), i \in \mathcal{S}_A\}$ and
$\{(\mathbf{x}_i, d_i^B), i \in \mathcal{S}_B\}$, sensitivity
analysis through comparisons of marginal distributions and conditional models
can be helpful in building confidence in assumption A1. For variables which are
available in both $\mathcal{S}_A$ and $\mathcal{S}_B$, one can compare
the empirical distribution functions (or moments) from $\mathcal{S}_A$ to the survey
weighted empirical distribution functions (or moments) from $\mathcal{S}_B$. Marked
differences between the two indicate that $\mathcal{S}_A$ is a
non-probability sample with unequal propensity scores. One possible sensitivity
analysis on assumption A1 is to
select a variable $z$ which has
certain similarities to $y$ and a set of
auxiliary variables $\mathbf{u}$ with both $z$ and $\mathbf{u}$
available from $\mathcal{S}_A$ and $\mathcal{S}_B$. We fit a
conditional model $z \mid \mathbf{u}$ using data from $\mathcal{S}_A$ and a survey
weighted conditional model $z \mid \mathbf{u}$ using data from $\mathcal{S}_B$.
If $\mathbf{u}$ includes all
the key auxiliary variables for assumption A1, the two fitted models should be similar
to each other. Drastic differences between the two fitted models are a strong
sign that either $z$ is itself an
important auxiliary variable for assumption A1 or the assumption is questionable.
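The two checks described above can be sketched in a few lines of Python. This is only an illustration on simulated data, not a procedure prescribed in this paper. The population, the logistic participation mechanism, the linear working model for the check variable given the auxiliary variable, and all names (`weighted_mean`, `weighted_linear_fit`, `u_A`, `d_B`, and so on) are assumptions made for the example. The non-probability sample is analysed without weights and the reference probability sample with its survey weights.

```python
import numpy as np

def weighted_mean(values, weights):
    """Weighted mean; equal weights give the ordinary sample mean."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))

def weighted_linear_fit(u, z, weights):
    """Weighted least squares fit of z on (1, u); returns (intercept, slope)."""
    design = np.column_stack([np.ones(len(u)), np.asarray(u, dtype=float)])
    w = np.asarray(weights, dtype=float)
    xtw = design.T * w  # scale each unit's contribution by its weight
    return np.linalg.solve(xtw @ design, xtw @ np.asarray(z, dtype=float))

# Hypothetical finite population: auxiliary variable u and check variable z.
rng = np.random.default_rng(2022)
N = 20_000
u_pop = rng.normal(size=N)
z_pop = 1.0 + 2.0 * u_pop + rng.normal(size=N)

# Non-probability sample: participation depends on u only (assumed mechanism).
propensity = 1.0 / (1.0 + np.exp(-(-2.0 + u_pop)))
in_A = rng.random(N) < propensity
u_A, z_A = u_pop[in_A], z_pop[in_A]

# Reference probability sample: simple random sample with weights N / n_B.
n_B = 2_000
idx_B = rng.choice(N, size=n_B, replace=False)
u_B, z_B = u_pop[idx_B], z_pop[idx_B]
d_B = np.full(n_B, N / n_B)

# Check 1: unweighted moment from S_A versus survey weighted moment from S_B.
mean_gap = weighted_mean(u_A, np.ones(u_A.size)) - weighted_mean(u_B, d_B)

# Check 2: unweighted fit of z on u from S_A versus survey weighted fit from S_B.
beta_A = weighted_linear_fit(u_A, z_A, np.ones(u_A.size))
beta_B = weighted_linear_fit(u_B, z_B, d_B)
coef_gap = float(np.max(np.abs(beta_A - beta_B)))

print("difference in means of u:", round(mean_gap, 3))
print("largest coefficient difference:", round(coef_gap, 3))
```

In this simulated setting the marginal means differ markedly, which signals unequal propensity scores, while the two fitted conditional models agree closely because participation depends on the auxiliary variable alone. With real data, how large a difference counts as "drastic" remains a judgement call.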
7.2 Assumption A2
A casual look at
assumption A2 may lead one to
believe that it is easily satisfied in practice, since a similar
assumption is widely used in missing data analysis and causal inference. It
turns out that the assumption can be highly problematic, and for scenarios
where the assumption fails to hold, the target population is different from the
one assumed for the estimation methods. The problem is similar to the frame
undercoverage and nonresponse problems which are discussed extensively in probability
sampling.
Assumption A2 states that $\pi_i^A > 0$ for all $i = 1, \ldots, N$. It is
equivalent to stating that every unit in the target population has a non-zero
probability to be included in the non-probability sample. If the sample was
taken by a probability sampling method, this would be the scenario where the
sampling frame is complete and there are no hardcore nonrespondents. For most
non-probability samples, the concept of “sampling frame” is often
irrelevant or simply a convenient list, and the selection and inclusion of
units for the sample may not have a structured process. In her presentation at
the 2021 CANSSI-NISS Workshop, Mary Thompson pointed out that “the statement
that the sample inclusion indicator $R$ is a random
variable is itself an assumption” for
non-probability survey samples.
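Let $\mathcal{U} = \{1, 2, \ldots, N\}$ be the set of $N$
units for the
target population. Let
$\mathcal{U}_0 = \{i \mid i \in \mathcal{U} \text{ and } \pi_i^A > 0\}$. It is apparent
that $\mathcal{U}_0 \subset \mathcal{U}$ and $\mathcal{U}_0 \neq \mathcal{U}$ when assumption
A2 is violated. There are two
typical scenarios in practice. The first can be termed as stochastic
undercoverage, where the non-probability sample $\mathcal{S}_A$ is selected
from $\mathcal{U}_0$ and $\mathcal{U}_0$ itself can be
viewed as a random sample from $\mathcal{U}$. For example,
the contact list of an existing probability survey is used to approach units in
the population for participation in the non-probability sample. In this case
$\mathcal{U}_0$ consists of
units from the probability sample. Another example is a volunteer survey where
the target population consists of adults in a specific city/region but the
participants are recruited from visitors to major shopping centers in the
region over a certain period of time. The subpopulation $\mathcal{U}_0$ includes
visitors to the chosen locations over the sampling period and it is reasonable
to assume that $\mathcal{U}_0$ is a random
sample from the target population. Let $D_i = 1$ if $i \in \mathcal{U}_0$
and $D_i = 0$ otherwise, $i = 1, \ldots, N$.
We have
$P(R_i = 1 \mid \mathbf{x}_i, y_i) = P(R_i = 1 \mid \mathbf{x}_i, y_i, D_i = 1) \, P(D_i = 1 \mid \mathbf{x}_i, y_i)$
for $i = 1, \ldots, N$. If the
subpopulation $\mathcal{U}_0$ is formed with
an underlying stochastic mechanism such that
$P(D_i = 1 \mid \mathbf{x}_i, y_i) > 0$ for all $i$,
we have $P(R_i = 1 \mid \mathbf{x}_i, y_i) > 0$ for $i = 1, \ldots, N$.
In other words,
assumption A2 is valid under
the scenario of stochastic undercoverage for non-probability samples.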
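The argument above can be illustrated with a small simulation. The sketch below assumes a two-stage mechanism in which a unit first enters the covered subpopulation and then decides whether to participate. For simplicity it conditions on a single binary auxiliary variable only. All probabilities and names (`p_cover`, `p_join`, `implied`) are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000

# One binary auxiliary variable x (hypothetical).
x = rng.integers(0, 2, size=N)

# Stage 1: stochastic coverage, with positive probability for every unit.
p_cover = np.where(x == 1, 0.30, 0.10)
D = rng.random(N) < p_cover

# Stage 2: participation among covered units only.
p_join = np.where(x == 1, 0.50, 0.20)
R = D & (rng.random(N) < p_join)

# Implied marginal propensity: product of the two stage probabilities.
implied = p_cover * p_join

# Empirical participation rates by value of x, to compare with the product.
empirical = {value: float(R[x == value].mean()) for value in (0, 1)}
print("implied:", {0: 0.10 * 0.20, 1: 0.30 * 0.50})
print("empirical:", empirical)
```

Because both stage probabilities are positive for every unit, the implied propensity is positive for every unit as well, even though most units are never approached in any single realization of the coverage stage.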
The second scenario is
termed as deterministic undercoverage, where units with certain features
will never be included in the non-probability sample. Suppose that
participation in the non-probability survey requires internet access and a
valid email address, and that 20% of the population have neither access to the
internet nor an email address. This is an example where 20% of the population
have zero propensity scores. There
is no simple fix to the inferential procedures developed under A2. Yilin Chen’s PhD dissertation at
the University of Waterloo (Chen, 2020) contained one chapter dealing with some
specific aspects of the scenario.
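A back-of-the-envelope calculation shows why deterministic undercoverage matters. The sketch below assumes the best case in which the procedures recover the mean of the covered subpopulation exactly. That is a simplifying assumption for illustration, not a result established here. The function name `coverage_bias` and the numerical means are hypothetical.

```python
def coverage_bias(mean_covered, mean_uncovered, uncovered_share):
    """Bias for the full population mean when an estimator targets the
    covered subpopulation mean (best-case assumption for this sketch)."""
    population_mean = ((1.0 - uncovered_share) * mean_covered
                       + uncovered_share * mean_uncovered)
    return mean_covered - population_mean

# 20% of the population have zero propensity scores, as in the example above.
# The two subpopulation means are made-up numbers.
bias = coverage_bias(mean_covered=50.0, mean_uncovered=40.0, uncovered_share=0.20)
print("bias under deterministic undercoverage:", bias)
```

The bias equals the uncovered share times the difference between the two subpopulation means, so it vanishes only when the uncovered units resemble the covered ones on the study variable. That cannot be checked from the non-probability sample alone.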
7.3 Assumption A3
Among all the
assumptions, this one is the least crucial to the validity of the proposed
inferential procedures. Under assumption A3, the full likelihood function for the propensity scores is
given in (4.1). For any parametric model on
$\pi_i^A = \pi(\mathbf{x}_i, \boldsymbol{\alpha})$, the quasi
log-likelihood function $\ell(\boldsymbol{\alpha})$
given in (4.2)
leads to the quasi score functions
$\partial \ell(\boldsymbol{\alpha}) / \partial \boldsymbol{\alpha}$, which remain
unbiased even if assumption A3
is violated. There might be some efficiency loss without assumption A3 in estimating the model parameters
$\boldsymbol{\alpha}$, but the
estimation methods are still valid under the other three assumptions.
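The unbiasedness claim can be checked numerically in a simplified setting. The sketch below assumes a logistic propensity model with auxiliary variables known for the whole population, so that the score at the true parameter is the sum over all units of $(R_i - \pi_i^A)\mathbf{x}_i$. It induces dependence among the inclusion indicators by letting units in the same cluster share one uniform random number. The cluster structure, parameter values and variable names are hypothetical, and this census version is used only for illustration in place of the version that relies on a reference probability sample.

```python
import numpy as np

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(11)
n_clusters, cluster_size = 20, 10
N = n_clusters * cluster_size

alpha = np.array([-1.0, 0.5])                       # hypothetical true parameters
X = np.column_stack([np.ones(N), rng.uniform(-1.0, 1.0, N)])
pi = expit(X @ alpha)                               # propensity scores
cluster = np.repeat(np.arange(n_clusters), cluster_size)

n_rep = 4000
# One uniform draw per cluster and replicate, shared by all units in the
# cluster: indicators are dependent within clusters, yet each unit still
# has marginal inclusion probability pi_i.
shared = rng.random((n_rep, n_clusters))[:, cluster]  # shape (n_rep, N)
R = (shared < pi).astype(float)

# Score of the logistic log-likelihood at the true alpha, divided by N.
scores = (R - pi) @ X / N
mean_score = scores.mean(axis=0)

# Dependence check: variance of the sample size versus its value under A3.
var_ratio = float(R.sum(axis=1).var() / np.sum(pi * (1.0 - pi)))

print("Monte Carlo mean of score / N:", np.round(mean_score, 4))
print("variance ratio relative to independence:", round(var_ratio, 2))
```

The Monte Carlo mean of the score stays close to zero although assumption A3 clearly fails, and the inflated variance ratio shows where the efficiency loss comes from. Variance estimators derived under independence would need separate scrutiny in such a setting.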
7.4 Assumption A4
It is not difficult to
find an existing probability sample from the same target population. It might
be very hard, however, to have a probability survey sample which contains the
desirable auxiliary variables. Existing probability surveys are designed with
specific aims and scientific objectives, and the auxiliary variables included
in the survey are not necessarily relevant to the analysis of a particular
non-probability survey sample. The ultimate goal for satisfying assumption A4 is to identify and gain access to
an existing probability survey sample with a rich collection of demographic
variables, social and economic indicators, and geographical variables.
A rich-people’s problem
(when one has too much money) for assumption A4 may also occur in practice when two or more existing
probability survey samples are available. How to combine all of them for more
efficient analysis of non-probability survey samples is a research topic that
deserves further attention. Some practical guidance on choosing one reference
probability sample from the available alternatives includes the following
considerations.
(i) Check for
availability of important auxiliary variables which are relevant to
characterizing the participation behaviour or have predictive power for the
study variables in the non-probability sample;
(ii) Give first
preference to the one with a larger set of variables that are common to the
non-probability sample;
(iii) Assign
second preference to the probability sample with a larger sample size;
(iv) And lastly, use
the probability sample for which the mode of data collection is the same as the
one for the non-probability sample.
It was shown by Chen
et al. (2020) that two reference probability survey samples with the same
set of common auxiliary variables tend to produce very similar IPW estimators
but the one with a larger sample size leads to better mass imputation
estimators.
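Considerations (i)-(iv) above can be read as a screening step followed by a lexicographic ordering. The sketch below implements that one possible reading. It is not a rule stated in this paper. The class `ReferenceSample`, the function `rank_reference_samples` and the candidate surveys are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceSample:
    name: str
    variables: frozenset
    sample_size: int
    mode: str

def rank_reference_samples(candidates, nonprob_variables, key_variables, nonprob_mode):
    """Screen on (i), then order by (ii), (iii) and (iv) in that priority."""
    required = set(key_variables)
    common_pool = set(nonprob_variables)
    # (i) keep only candidates that carry the important auxiliary variables
    eligible = [c for c in candidates if required <= c.variables]

    def sort_key(c):
        n_common = len(c.variables & common_pool)   # (ii) more common variables
        same_mode = c.mode == nonprob_mode          # (iv) same collection mode
        return (n_common, c.sample_size, same_mode) # (iii) larger sample size

    return sorted(eligible, key=sort_key, reverse=True)

candidates = [
    ReferenceSample("A", frozenset({"age", "sex", "region", "education"}), 5000, "phone"),
    ReferenceSample("B", frozenset({"age", "sex", "education"}), 20000, "web"),
    ReferenceSample("C", frozenset({"age", "sex", "region"}), 30000, "web"),
    ReferenceSample("D", frozenset({"age", "sex", "region", "education"}), 8000, "web"),
]
ranking = rank_reference_samples(
    candidates,
    nonprob_variables={"age", "sex", "region", "education", "income"},
    key_variables={"age", "education"},
    nonprob_mode="web",
)
print([c.name for c in ranking])
```

Here candidate C is screened out despite its size because it lacks a key auxiliary variable, and D is ranked ahead of A because both share four variables with the non-probability sample but D is larger. Other orderings of the criteria are defensible, and in practice the choice would also depend on data access and quality.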
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa