Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 2. A finite-population deterministic identity for actual error
To demonstrate the fruitfulness of the finite-population framework, consider the estimation of the population mean, denoted by $\bar{G}$, of $\{G_i = G(X_i), i \in \mathcal{N}\}$, where $i$ indexes a finite population $\mathcal{N} = \{1, \ldots, N\}$, and the $X_i$ are data collected on individual $i$. For each $i$, let $R_i = 1$ if $G_i$ (or rather $X_i$) is recorded in our sample, and $R_i = 0$ otherwise; hence the sample size is $n = \sum_{i=1}^{N} R_i$. We stress that this is an all-encompassing indicator, which can (and should) be decomposed into $R_i = \prod_{k=1}^{K} R_i^{(k)}$ when the data collection consists of $K$ stages (e.g., $R_i^{(1)}$ indicates whether or not the individual is sampled, and $R_i^{(2)}$ whether the individual responded or not once sampled).
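Let $\{W_i, i \in \mathcal{I}_n\}$ be a set of weights to be determined, where $\mathcal{I}_n = \{i \in \mathcal{N}: R_i = 1\}$ is the index set of the recorded individuals. Let $\bar{G}_W$ be the weighted sample average, expressible in three ways:

$$\bar{G}_W = \frac{\sum_{i \in \mathcal{I}_n} W_i G_i}{\sum_{i \in \mathcal{I}_n} W_i} = \frac{\sum_{i=1}^{N} \tilde{R}_i G_i}{\sum_{i=1}^{N} \tilde{R}_i} = \frac{\mathrm{E}_I(\tilde{R}_I G_I)}{\mathrm{E}_I(\tilde{R}_I)}, \tag{2.1}$$

where $\tilde{R}_i = R_i W_i$ and $\mathrm{E}_I$ is taken with respect to the uniform distribution over the index set $\mathcal{N}$. The first expression in (2.1) simply defines a weighted sample average. With the help of $\tilde{R}_i$, the second expression turns the sample averages into finite-population averages. This trivial re-expression is fundamental because it explicates the role of $\tilde{R}$ in influencing the behavior of $\bar{G}_W$ as an estimator of $\bar{G}$. The third expression reveals a divine probability through the finite-population index (FPI) variable $I$, by utilizing the fact that averaging is the same as taking expectation over a uniformly distributed random index $I$. All finite-population moments then can be expressed via $I$.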
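As a quick sanity check on (2.1), the sketch below evaluates the three expressions on a tiny population and confirms that they coincide. The population values, recording indicators, and weights are hypothetical numbers chosen only for illustration, and exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical toy population of N = 6 individuals (illustrative values only).
G = [Fraction(g) for g in (2, 5, 3, 8, 6, 4)]   # G_i for every i in the population
R = [1, 0, 1, 1, 0, 0]                          # R_i = 1 if individual i is recorded
W = [Fraction(w) for w in (1, 0, 2, 3, 0, 0)]   # W_i; entries for unrecorded i are never used

N = len(G)
R_tilde = [r * w for r, w in zip(R, W)]         # R~_i = R_i * W_i

# First expression: ratio of sums over the sample index set I_n = {i : R_i = 1}.
sample = [i for i in range(N) if R[i] == 1]
first = sum(W[i] * G[i] for i in sample) / sum(W[i] for i in sample)

# Second expression: ratio of sums over the whole finite population.
second = sum(rt * g for rt, g in zip(R_tilde, G)) / sum(R_tilde)


def E_I(values):
    """Expectation over a uniformly distributed index I, i.e. a plain average."""
    return sum(values) / len(values)


# Third expression: ratio of expectations with respect to the uniform index I.
third = E_I([rt * g for rt, g in zip(R_tilde, G)]) / E_I(R_tilde)

print(first, second, third)
```

The factor $1/N$ cancels between numerator and denominator, which is all that separates the second expression from the third.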
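In particular, we can express the actual error of $\bar{G}_W$ via the following identity, where the first expression can be traced back to Hartley and Ross (1954), who used it to express biases in ratio estimators. The second expression was given in Meng (2018) in a slightly different (but equivalent) form:

$$\bar{G}_W - \bar{G} = \frac{\mathrm{Cov}_I(\tilde{R}_I, G_I)}{\mathrm{E}_I(\tilde{R}_I)} = \rho_{\tilde{R}, G} \times \sqrt{\frac{N - n_W}{n_W}} \times \sigma_G. \tag{2.2}$$

Here $\rho_{\tilde{R}, G}$ is the finite-population correlation between $\tilde{R}_I$ and $G_I$, $\sigma_G^2$ is the finite-population variance of $G_I$, and $n_W$ is the effective sample size due to using weights (Kish, 1965)

$$n_W = \frac{n}{1 + \mathrm{CV}_W^2}, \tag{2.3}$$

with $\mathrm{CV}_W$ being the coefficient of variation (i.e., standard deviation/mean) of $\{W_i, i \in \mathcal{I}_n\}$.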
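The identity can be checked numerically. The following is a minimal sketch, assuming a small made-up population and arbitrary weights (none of the numbers come from the article); it computes the actual error directly, then through the covariance form and through the three-factor product, with the effective sample size taken as $n/(1 + \mathrm{CV}_W^2)$ and all variances using the divisor of the relevant set size.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Finite-population covariance (divisor len(xs), not len(xs) - 1)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Hypothetical population, recording indicators, and weights.
G = [2.0, 5.0, 3.0, 8.0, 6.0, 4.0, 7.0, 1.0]
R = [1, 0, 1, 1, 0, 0, 1, 0]
W = [1.0, 0.0, 2.0, 3.0, 0.0, 0.0, 1.5, 0.0]

N, n = len(G), sum(R)
R_tilde = [r * w for r, w in zip(R, W)]

# Actual error of the weighted sample average.
G_W = sum(rt * g for rt, g in zip(R_tilde, G)) / sum(R_tilde)
actual_error = G_W - mean(G)

# First form: covariance over expectation, both with respect to the index I.
first_form = cov(R_tilde, G) / mean(R_tilde)

# Second form: correlation x sqrt((N - n_W) / n_W) x sigma_G.
w_sample = [w for r, w in zip(R, W) if r == 1]
cv2 = cov(w_sample, w_sample) / mean(w_sample) ** 2   # squared CV of the sample weights
n_W = n / (1.0 + cv2)                                 # effective sample size
rho = cov(R_tilde, G) / math.sqrt(cov(R_tilde, R_tilde) * cov(G, G))
sigma_G = math.sqrt(cov(G, G))
second_form = rho * math.sqrt((N - n_W) / n_W) * sigma_G

print(actual_error, first_form, second_form, n_W)
```

Changing `G`, `R`, or `W` to any other values (with at least one recorded unit, a positive weight total, and non-constant `G` and `R_tilde`) should leave the three numbers equal up to floating-point error, which is what "algebraic identity" means here.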
The expression (2.2) is an algebraic identity because it holds for any instances of $\{(R_i, W_i, G_i), i \in \mathcal{N}\}$. Hence no model assumptions are imposed, not even the assumption that $R_i$ (or any quantity) is random, echoing the comment by Mary Thompson, as quoted in Wu (2022), that “the sample inclusion indicator $R$ is a random variable is itself an assumption”. The only requirement is that the recorded $G_i$ is unchanged from the $G_i$ in the target population. (But note this requirement has two components: (1) there is no over-coverage, that is, everyone in the sample belongs to the target population, e.g., no non-eligible voters are surveyed when the target population is eligible voters, and (2) there is no measurement error; extensions to the cases with measurement errors are available, but not pursued in this article.) When we use equal weights, the three factors on the right-hand side of (2.2) reflect respectively (from left to right) data defect, data sparsity, and problem difficulty, as detailed in Meng (2018) and further illustrated in Bradley, Kuriwaki, Isakov, Sejdinovic, Meng and Flaxman (2021) in the context of COVID-19 vaccination surveys.
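For readers who want to see why (2.2) requires nothing beyond algebra, the steps can be sketched as follows, writing $\tilde{R}_i = R_i W_i$ for the weighted indicator and $\mathcal{I}_n$ for the set of recorded units. The symbols $\bar{W}$ and $s_W^2$ (the mean and the divisor-$n$ variance of the weights over the recorded units, so that $\mathrm{CV}_W^2 = s_W^2/\bar{W}^2$) are notation introduced here for the sketch, not taken from the article.

```latex
\begin{align*}
\bar{G}_W - \bar{G}
  &= \frac{\mathrm{E}_I(\tilde{R}_I G_I)}{\mathrm{E}_I(\tilde{R}_I)} - \mathrm{E}_I(G_I)
   = \frac{\mathrm{E}_I(\tilde{R}_I G_I) - \mathrm{E}_I(\tilde{R}_I)\,\mathrm{E}_I(G_I)}{\mathrm{E}_I(\tilde{R}_I)}
   = \frac{\mathrm{Cov}_I(\tilde{R}_I, G_I)}{\mathrm{E}_I(\tilde{R}_I)}, \\[4pt]
\mathrm{E}_I(\tilde{R}_I)
  &= \frac{1}{N}\sum_{i \in \mathcal{I}_n} W_i = \frac{n}{N}\,\bar{W},
\qquad
\mathrm{E}_I(\tilde{R}_I^2)
   = \frac{1}{N}\sum_{i \in \mathcal{I}_n} W_i^2 = \frac{n}{N}\,\bigl(\bar{W}^2 + s_W^2\bigr), \\[4pt]
\frac{\mathrm{Var}_I(\tilde{R}_I)}{\bigl[\mathrm{E}_I(\tilde{R}_I)\bigr]^2}
  &= \frac{N}{n}\,\bigl(1 + \mathrm{CV}_W^2\bigr) - 1
   = \frac{N}{n_W} - 1
   = \frac{N - n_W}{n_W}, \\[4pt]
\frac{\mathrm{Cov}_I(\tilde{R}_I, G_I)}{\mathrm{E}_I(\tilde{R}_I)}
  &= \rho_{\tilde{R}, G}\,\frac{\sqrt{\mathrm{Var}_I(\tilde{R}_I)}}{\mathrm{E}_I(\tilde{R}_I)}\,\sigma_G
   = \rho_{\tilde{R}, G}\,\sqrt{\frac{N - n_W}{n_W}}\,\sigma_G .
\end{align*}
```

The second line uses $R_i^2 = R_i$, and the third line is where the effective sample size $n_W = n/(1 + \mathrm{CV}_W^2)$ emerges exactly rather than as an approximation.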
In particular, when all weights are equal, $\rho_{\tilde{R}, G} = \rho_{R, G}$ is termed the data defect correlation (ddc) in Meng (2018) because it measures the lack of representativeness of the sample via capturing the dependence of the inclusion/recording indicator on the attributes: the higher the dependence, the more biased the sample average becomes for estimating population averages. With the basic strategies of probabilistic sampling or inverse probability weighting, ddc will be zero on average because the expectation of $\mathrm{Cov}_I(\tilde{R}_I, G_I)$ over the sampling mechanism is zero, and it is of order $N^{-1/2}$ because it is essentially an average of $N$ independent terms (Meng, 2018). Our general goal here therefore is to bring down ddc to the order of $N^{-1/2}$ for non-probability samples, which we shall refer to as “miniaturizing ddc” because $N^{-1/2}$ is typically a minuscule number in practice.
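To make the $N^{-1/2}$ order concrete, the sketch below repeatedly draws simple random samples from one fixed population and records the ddc of each draw, then contrasts this with a deliberately selective sample. Everything here is a hypothetical setup for illustration: the population is simulated standard normal values, the sizes and the seed are arbitrary, and the "selective" mechanism (recording the units with the largest values) is an extreme case chosen for contrast, not a mechanism discussed in the article.

```python
import math
import random


def ddc(R, G):
    """Finite-population correlation between a 0/1 indicator R and attribute G."""
    N = len(G)
    mR, mG = sum(R) / N, sum(G) / N
    cov = sum((r - mR) * (g - mG) for r, g in zip(R, G)) / N
    vR = sum((r - mR) ** 2 for r in R) / N
    vG = sum((g - mG) ** 2 for g in G) / N
    return cov / math.sqrt(vR * vG)


rng = random.Random(2022)                # fixed seed so the run is reproducible
N, n, reps = 400, 40, 2000
G = [rng.gauss(0.0, 1.0) for _ in range(N)]   # one fixed finite population

# Simple random sampling without replacement, repeated many times.
draws = []
for _ in range(reps):
    chosen = set(rng.sample(range(N), n))
    R = [1 if i in chosen else 0 for i in range(N)]
    draws.append(ddc(R, G))

mean_ddc = sum(draws) / reps
rms_ddc = math.sqrt(sum(d * d for d in draws) / reps)

# A selective sample for contrast: record the n units with the largest G.
top = set(sorted(range(N), key=lambda i: G[i], reverse=True)[:n])
ddc_selective = ddc([1 if i in top else 0 for i in range(N)], G)

print("mean ddc under SRS:", mean_ddc)
print("rms ddc under SRS: ", rms_ddc, " vs N^(-1/2) =", 1 / math.sqrt(N))
print("ddc, selective:    ", ddc_selective)
```

Under simple random sampling the average ddc should sit near zero with a root-mean-square close to $N^{-1/2}$, whereas the selective sample has a ddc that is larger by roughly an order of magnitude for the same $n$.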
When we use weights, the first term $\rho_{\tilde{R}, G}$ captures the data defect that still exists after the weighting adjustment, since no weights are perfect in practice. Identity (2.2) shows the impact of the weights on both data quality and data quantity. The impact on the nominal effective sample size is never positive because $n_W \le n$, as seen in (2.3). Incidentally, the exactness of (2.3) reveals that this well-known expression is in fact not an approximation (which is often attributed to Kish (1965)), but an exact formula for the reduction of the sample size due to weighting if the weighting had no impact on ddc. However, weighting can have a major positive impact on reducing the overall error by judiciously choosing weights to significantly decrease ddc, though apparently at the price of $n_W < n$. Of course, this is exactly the aim of the quasi-randomization framework, as discussed below. Most importantly, however, (2.2) leads to a unified insight about the variety of methods reviewed in Wu (2022), including an intuitive explanation of the doubly robust property, which has been receiving increased attention for integrating data sources including both probability and non-probability samples (e.g., Yang, Kim and Song, 2020).
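The trade-off between data quality and data quantity can be illustrated with a small deterministic example. The sketch assumes a hypothetical population of two equal-sized groups, a sample that over-represents the group with lower values, and weights equal to the inverse of each group's recording rate; the group labels, values, and sampling pattern are all invented for illustration. The effective sample size is computed as $(\sum W_i)^2 / \sum W_i^2$ over the recorded units, which is algebraically the same as $n/(1 + \mathrm{CV}_W^2)$.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Finite-population covariance (divisor len(xs))."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


N = 1000
# Hypothetical population: groups A and B of 500 units each; G is higher in B.
group = ["A" if i < 500 else "B" for i in range(N)]
G = [(1.0 if group[i] == "A" else 3.0) + 0.1 * (i % 5) for i in range(N)]

# Hypothetical non-probability sample: every 2nd unit of A, every 10th unit of B.
R = [1 if (group[i] == "A" and i % 2 == 0) or (group[i] == "B" and i % 10 == 0) else 0
     for i in range(N)]


def summarize(W):
    """Return (actual error, correlation of R*W with G, effective sample size)."""
    R_tilde = [r * w for r, w in zip(R, W)]
    w_sample = [w for r, w in zip(R, W) if r == 1]
    n_W = sum(w_sample) ** 2 / sum(w * w for w in w_sample)
    rho = cov(R_tilde, G) / math.sqrt(cov(R_tilde, R_tilde) * cov(G, G))
    error = sum(rt * g for rt, g in zip(R_tilde, G)) / sum(R_tilde) - mean(G)
    return error, rho, n_W


err_eq, rho_eq, n_eq = summarize([1.0] * N)                                  # equal weights
err_wt, rho_wt, n_wt = summarize([2.0 if g == "A" else 10.0 for g in group])  # inverse-rate weights

print("equal weights:   error=%.3f  rho=%.4f  n_W=%.1f" % (err_eq, rho_eq, n_eq))
print("inverse weights: error=%.3f  rho=%.4f  n_W=%.1f" % (err_wt, rho_wt, n_wt))
```

In this toy setup the weights cut the actual error from about -0.7 to about -0.1 even though the effective sample size falls from 300 to about 167. The residual error arises because the recorded units in group B are not representative within their group, a small reminder that no weights are perfect.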
Indeed, Zhang (2019, Section 3.1) used the first
expression in (2.2) to define a unified non-parametric asymptotic (NPA)
non-informativeness assumption, which requires that the numerator $\mathrm{Cov}_I(\tilde{R}_I, G_I)$ goes to zero, while keeping the denominator $\mathrm{E}_I(\tilde{R}_I)$ positive, as $N \to \infty$. This unification permits Zhang (2019) to
evaluate the quasi-randomization approach and regression modeling via a common
criterion. The ddc framework echoes this unification, as discussed in
Section 3 below, with Section 4 stressing the same broad message as
emphasized by Zhang (2019). Section 5 harvests another low-hanging fruit
of the ddc formulation, since it provides an immediate explanation of
the celebrated double robustness. Section 6 then ventures into a much
harder area of engineering a more representative sub-sample out of a large
non-representative sample, a worthwhile trade-off because data quality is far
more important than data quantity (Meng, 2018), as briefly reviewed below.
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa