Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 6. Counterbalancing sub-sampling
6.1 The devastating impact of data defect on effective sample size
A key finding, which has surprised many, from studying the data quality issue is how small the size of our “big data” becomes once we take into account the data defect. To see this mathematically, we can equate the mean-squared error (MSE) of the weighted estimator in (2.1) with the MSE of a simple random sampling estimator of size $n_{\text{eff}}$. This yields (see Meng (2018) for derivation)
$$ n_{\text{eff}} = \frac{f_w}{1 - f_w} \cdot \frac{1}{E\!\left[\hat\rho_w^{\,2}\right]}, \qquad (6.1) $$
where $f_w$ is the (weighted) sampling rate and $\hat\rho_w$ is the (weighted) data defect correlation appearing in (2.1), and the expectation is with respect to the conditional distribution of the recording indicator $R$ given the population values of $Y$. It is worthwhile to note that this (conditional) distribution can involve all three types of probability discussed in Section 1.2, because the variations in $R$ and in the weights can come from multiple sources. For example, in typical opinion surveys, there will be (1) design probability in the
sampling indicator, (2) divine probability in formulating the non-response
mechanism, and (3) device probability for estimating the mechanism and the
weights.
Expression (6.1) is the weighted version/extension of
the expression given in Meng (2018) with equal weights, which reveals the
devastating impact of a seemingly tiny ddc. Suppose our sample is 1% of
the population, and it suffers from a half-percent ddc. Applying (6.1)
(with equal weights) with $f = 0.01$ and $\hat\rho_{R,Y} = 0.005$ yields $n_{\text{eff}} \approx 404$, regardless
of the absolute sample size $n$. In the case of the 2020 US presidential
election, 1% of the voting population is about 1.55 million people, and hence
the loss of sample size due to a half-percent ddc is about
1 - (404 / 1,550,000) > 99.97%. Such seemingly
impossible losses have been reported in both election studies (Meng, 2018) and
COVID vaccination studies (Bradley et al., 2021). A most devastating
consequence of such losses is the “big data paradox”: the larger the (apparent)
data size, the surer we fool ourselves, because our false confidence (in both
the technical and the literal sense) goes up with the erroneous data size, while the
actual coverage probability of the incorrectly constructed confidence intervals
becomes vanishingly small (Meng, 2018; Msaouel, 2022).
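To make the arithmetic concrete, the following is a minimal Python sketch (not part of the original development) of the equal-weight version of (6.1); the function name and the rounded population figure are illustrative.

```python
# A minimal sketch illustrating the equal-weight version of (6.1):
# n_eff = f/(1-f) * 1/ddc^2, where f is the sampling rate and ddc the data
# defect correlation.

def effective_sample_size(f, ddc):
    """Effective sample size under sampling rate f and data defect correlation ddc."""
    return (f / (1.0 - f)) / ddc ** 2

f, ddc = 0.01, 0.005               # 1% sample, half-percent ddc
n_eff = effective_sample_size(f, ddc)
print(round(n_eff))                 # about 404, regardless of the absolute sample size

# 2020 US presidential election illustration: 1% of the voting population.
N_voters = 155_000_000              # rough size of the voting population (assumption)
n = f * N_voters                    # about 1.55 million respondents
loss = 1 - n_eff / n
print(f"loss of sample size: {loss:.2%}")   # > 99.97%
```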
A positive implication from this revelation, however, is
that we can trade a great deal of data quantity for data quality, and still end up with
statistically more accurate estimates. Of course, in order to reduce the bias,
we will need some information about it. If we have reliable information on the
value of ddc, we can directly adjust for the bias in estimating the
population average corresponding to the ddc, for example by a Bayesian
approach, similar to that taken by Isakov and Kuriwaki (2020) in their scenario
analysis. Furthermore, if we have sufficient information to construct reliable
weights, we can use the weights to adjust for selection biases as commonly
done. Nevertheless, even in such cases, it may still be useful to create a
representative miniature of the population out of a biased sample for general
purposes, which, for example, can eliminate many practitioners’ anxiety about, and
potential mistakes from, not knowing how to properly use the weights. Indeed, few
really know how to deal with weights, because “Survey weighting is a mess” (Gelman,
2007).
However, creating a representative miniature out of a
biased sample in general is a challenging task, especially because ddc
can (and will) vary with the variable of interest. Nevertheless, just as
weighting is a popular tool despite being far from perfect, let us explore
representative miniaturization and see how far we can push the idea. The
following example is therefore purely for brainstorming purposes; it looks
into a common but challenging scenario in which we have reasonable information or
understanding of the direction of the bias, that is, the sign of the ddc,
but only rather vague information about its magnitude. A good example is the
non-representativeness of election polls, because voters tend not to want to
disclose their preferences when they plan to vote for a socially unpopular
candidate; we therefore know the direction of the bias, but not much about its
degree other than some rough guesses (e.g., a range of 10 percentage points).
6.2 Creating a less biased sub-sample
The basic idea is to use such partial information about
the selection bias to design a biased sub-sampling scheme to counterbalance
the bias in the original sample, such that the resulting sub-samples have a high
likelihood of being less biased than the original sample from our target
population. That is, we create a sub-sampling indicator $S$ such that, with high likelihood, the
correlation between $RS$ and $Y$ is reduced in magnitude compared with the original $\hat\rho_{R,Y}$, to such a degree that it compensates for
the loss of sample size and hence reduces the MSE of our estimator (e.g., the
sample average). We say with high likelihood, in its non-technical
meaning, because without full information on the response/recording mechanism,
we can never guarantee that such a counterbalancing sub-sampling (CBS) scheme will always do
better. However, with judicious execution, we can reduce the likelihood of
making serious mistakes.
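As a brainstorming aid, the following Python simulation sketches what such a counterbalancing sub-sample can do for a binary $Y$; the recording propensities (f1, f0) and sub-sampling probabilities (tau1, tau0) are illustrative values anticipating the notation introduced below, and none of this code comes from the original development.

```python
# Illustrative simulation of counterbalancing sub-sampling (CBS) for a binary Y:
# record Y=1 individuals with higher propensity (f1 > f0), then sub-sample them
# with a lower probability (tau1 < tau0) so the combined indicator R*S is less
# correlated with Y.
import numpy as np

rng = np.random.default_rng(0)
N, p = 1_000_000, 0.5                     # population size and true mean of Y
Y = rng.binomial(1, p, size=N)

f1, f0 = 0.06, 0.04                       # biased recording propensities
R = rng.binomial(1, np.where(Y == 1, f1, f0))

tau1, tau0 = 2 / 3, 1.0                   # sub-sampling probabilities, roughly 1/(f1/f0)
S = rng.binomial(1, np.where(Y == 1, tau1, tau0))

def ddc(indicator, y):
    """Finite-population correlation between a recording indicator and Y."""
    return np.corrcoef(indicator, y)[0, 1]

print("original: error =", Y[R == 1].mean() - p, " ddc =", ddc(R, Y))
print("CBS:      error =", Y[(R * S) == 1].mean() - p, " ddc =", ddc(R * S, Y))
```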
To illustrate, consider the case where $Y$ is binary. Let $f_y = E(R \mid Y = y)$, where $f_y$ is the propensity of responding/reporting for
individuals whose responses will take value $y$. If the sample is representative, then $f_1 - f_0$, like $\hat\rho_{R,Y}$, is miniaturized, meaning that it is on the
order of $N^{-1/2}$. This is most clearly seen via the easily
verifiable identity (see (4.1) of Meng, 2018)
$$ \hat\rho_{R,Y} = (f_1 - f_0)\,\frac{\sigma_Y}{\sigma_R}, \qquad (6.2) $$
where $\sigma^2_Y = p(1-p)$, with $p = \bar Y_N$ the population proportion of $Y = 1$, and $\sigma^2_R = f(1-f)$, with $f = p f_1 + (1-p) f_0$, which is the original sampling rate. A key
ingredient of CBS is to determine $\tau_y$ for $y = 0, 1$, that is, the sub-sampling probabilities of
individuals who reported $Y = 1$ and $Y = 0$, respectively.
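As a quick numerical sanity check of (6.2), one can compare the identity with the empirical correlation in a simulated finite population; the propensity values below are illustrative, not taken from any real survey.

```python
# Numerical check (illustrative values) of identity (6.2):
# rho_{R,Y} = (f1 - f0) * sigma_Y / sigma_R, with f = p*f1 + (1-p)*f0.
import numpy as np

p, f1, f0 = 0.5, 0.06, 0.04
f = p * f1 + (1 - p) * f0                       # original sampling rate
sigma_Y = np.sqrt(p * (1 - p))
sigma_R = np.sqrt(f * (1 - f))
print((f1 - f0) * sigma_Y / sigma_R)            # about 0.046

# Compare with the empirical correlation in a large finite population.
rng = np.random.default_rng(1)
N = 2_000_000
Y = rng.binomial(1, p, size=N)
R = rng.binomial(1, np.where(Y == 1, f1, f0))
print(np.corrcoef(R, Y)[0, 1])                  # close to the identity value
```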
To determine the beneficial choices, let $s = \hat p\,\tau_1 + (1 - \hat p)\,\tau_0$ be the sub-sampling rate, and hence $fs$ the overall rate of the sub-sample relative to the population, where $\hat p$ denotes the proportion of $Y = 1$ observed in the original sample. Then by applying (2.2) (with equal weights)
and (6.2) to both the sample average and the sub-sample average, we see that
the sub-sample average has smaller (actual) error in magnitude if and only if
$$ \bigl|\hat\rho_{RS,Y}\bigr| \sqrt{\frac{1 - fs}{fs}} \;\le\; \bigl|\hat\rho_{R,Y}\bigr| \sqrt{\frac{1 - f}{f}}. \qquad (6.3) $$
Writing $r = f_1/f_0$ and $\tau = \tau_1/\tau_0$, and expressing the unknown proportion $p = \bar Y_N$ via the observed one as $p = \hat p/\{\hat p + r(1 - \hat p)\}$, condition (6.3) becomes
$$ \left| \frac{\hat p\,\tau}{\hat p\,\tau + 1 - \hat p} - p \right| \;\le\; \bigl| \hat p - p \bigr|, \qquad (6.4) $$
where $\hat p$ is observed in the original sample, which
should remind us that $\hat p$ may be rather different from the $p$ we seek, because of the biased $R$-mechanism.
An immediate choice to satisfy (6.4) is to set $\tau = 1/r$, which of course typically is unrealistic
because if we knew the value of $r$, then the problem would be a lot simpler. To
explore how much leeway we have in deviating from this ideal choice, let $A = 2 - \hat p - r(1 - \hat p)$ and $B = r(1 + \hat p) - \hat p$; we can then show that (6.4) is equivalent to
$$ (\tau - 1)\,(B\tau - A) \;\le\; 0. \qquad (6.5) $$
This tells us precisely the permissible choices of $\tau$ without over-correcting (in the magnitude of
the resulting bias):
(i) When $r > 1$, i.e., $\hat p > p$, we can take any $\tau$ such that
$$ \left(\frac{A}{B}\right)_{+} \;\le\; \tau \;\le\; 1; \qquad (6.6) $$
(ii) When $r < 1$, i.e., $\hat p < p$, we can take any $\tau$ such that
$$ 1 \;\le\; \tau \;\le\; \frac{A}{B_{+}}, \qquad (6.7) $$
where the upper bound in (6.7) is taken to be infinite when $B \le 0$.
This pair of results confirms a number of our
intuitions, but also offers some qualifications that are not so obvious. Since
we sub-sample to compensate for the bias in the original sample, $\tau$ and $r$ must stay on opposite sides of 1, i.e., $\tau \le 1 \le r$ or $r \le 1 \le \tau$, as seen in (6.6)-(6.7). To prevent over-corrections, some limits are needed, but it is also possible that the initial
bias is so bad that no sub-sampling scheme can make things worse, which is
reflected by the positivizing function $(x)_+ = \max(x, 0)$
in the two expressions above. However, the
expressions for the limits, as well as for the thresholds that activate the
positivizing functions, are not so obvious. Nor is it obvious that these
expressions depend on the unknown propensities $f_1$ and $f_0$ only indirectly, via the observed $\hat p$ and the ratio $r = f_1/f_0$, and hence only prior knowledge of $r$ is required for implementing or assessing CBS.
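In code, the permissible range can be computed directly from the observed $\hat p$ and a postulated $r$; the helper below is a hypothetical sketch of (6.6)-(6.7), with its name and interface chosen for illustration rather than taken from the original comment.

```python
# Illustrative helper (hypothetical names) for the permissible range of
# tau = tau1/tau0 given the observed p_hat and a postulated r = f1/f0,
# following (6.6)-(6.7).
import math

def permissible_tau_range(p_hat, r):
    """Return (low, high) such that any tau in [low, high] does not over-correct."""
    A = 2 - p_hat - r * (1 - p_hat)
    B = r * (1 + p_hat) - p_hat
    if r >= 1:                        # case (i): Y=1 over-reported, shrink with tau <= 1
        return (max(A / B, 0.0), 1.0)
    # case (ii): Y=1 under-reported, boost with tau >= 1
    return (1.0, A / B if B > 0 else math.inf)

print(permissible_tau_range(0.6, 1.5))   # about (0.44, 1.0); the ideal tau = 1/1.5 is about 0.67
print(permissible_tau_range(0.6, 1.2))   # about (0.70, 1.0)
```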
This observation suggests that it is possible to
implement a beneficial CBS when we can borrow information from other surveys
(or studies) where the response/recording behaviors are of similar nature. For
example, we may learn that a previous similar survey had $r \approx 1.5$ (e.g., those with $Y = 1$ had a 6% chance of being recorded, and those
with $Y = 0$ had only a 4% chance). Taking into account the
uncertainty in the similarity between the two surveys, we might feel
comfortable placing $(1.2, 1.8)$ as the plausible range for $r$ in the current study. Suppose we observe $\hat p = 0.6$; this means that the maximum, over
the range $(1.2, 1.8)$, of the
lower bound on the permissible $\tau$ as given in (6.6) is approximately 0.70.
Therefore, as long as we choose $\tau \ge 0.70$, we are unlikely to over-correct. The price we
pay for this robustness is that the resulting sub-sample is not of as good quality
as it could be, for example, when the underlying $r$ for the current study is indeed 1.5 (in
expectation). Choosing any $\tau \ge 0.70$ will not provide the full correction
provided by $\tau = 1/1.5 \approx 0.67$; that is, the sub-sample
average will still have a positive bias, but with a smaller MSE compared to the
original sample average. Of course, both the feasibility and effectiveness of
such CBS need to be carefully investigated before it can be recommended for
general consumption, especially when going beyond binary $Y$. The literature on inverse sampling (Hinkins,
Oh and Scheuren, 1997; Rao, Scott and Benhin, 2003) is of great relevance for
such investigations, because it also aims to produce simple random samples via
sub-sampling, albeit with a different motivation (to turn complex surveys into
simple ones for ease of analysis).
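For completeness, here is an illustrative sketch (again hypothetical, not from the original comment) of the robustness calculation in the preceding example: maximize the lower bound of (6.6) over the plausible range of $r$, then examine the residual bias when the true $r$ is 1.5.

```python
# Illustrative check of the robustness argument: with p_hat = 0.6 and r believed
# to lie in (1.2, 1.8), choose tau no smaller than the worst-case lower bound
# from (6.6), then see how much bias remains if the true r is 1.5.
import numpy as np

p_hat = 0.6
r_grid = np.linspace(1.2, 1.8, 601)
lower_bounds = (2 - p_hat - r_grid * (1 - p_hat)) / (r_grid * (1 + p_hat) - p_hat)
tau_safe = lower_bounds.max()                  # about 0.70, attained near r = 1.2
print(round(tau_safe, 2))

def subsample_bias(p_hat, r, tau):
    """Bias of the sub-sample proportion when the true propensity ratio is r."""
    p = p_hat / (p_hat + r * (1 - p_hat))      # true proportion implied by r
    p_sub = p_hat * tau / (p_hat * tau + 1 - p_hat)
    return p_sub - p

r_true = 1.5
print(subsample_bias(p_hat, r_true, 1.0))          # original bias: +0.10
print(subsample_bias(p_hat, r_true, tau_safe))     # residual positive bias, about +0.01
print(subsample_bias(p_hat, r_true, 1 / r_true))   # full correction: 0
```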