Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 2. A finite-population deterministic identity for actual error

To demonstrate the fruitfulness of the finite-population framework, consider the estimation of the population mean, denoted by $\bar{G},$ of $\{G_{i} = G(X_{i}): i \in \mathcal{N}\},$ where $\mathcal{N} = \{1, \dots, N\}$ indexes a finite population, and the $X_{i}$'s are data collected on individual $i.$ For each $i,$ let $R_{i} = 1$ if $G_{i}$ (or rather $X_{i}$) is recorded in our sample, and $R_{i} = 0$ otherwise; hence the sample size is $n_{R} = \sum_{i = 1}^{N} R_{i}.$ We stress that this is an all-encompassing indicator, which can (and should) be decomposed into $R_{i} = r_{i}^{(1)}, \dots, r_{i}^{(J)},$ when the data collection consists of $J$ stages (e.g., $r_{i}^{(1)}$ indicates whether or not the $i^{th}$ individual is sampled, and $r_{i}^{(2)}$ whether the individual responded or not once sampled).

Let $\{W_{i}, i \in S\}$ be a set of weights to be determined, where the index set $S = \{i : R_{i} = 1\},$ such that $\sum_{i \in S} W_{i} > 0.$ Let ${\bar{G}}_{W}$ be the weighted sample average, expressible in three ways:

${\bar{G}}_{W} = \frac{\sum_{i \in S} W_{i} G_{i}}{\sum_{i \in S} W_{i}} = \frac{\sum_{i = 1}^{N} R_{i} W_{i} G_{i}}{\sum_{i = 1}^{N} R_{i} W_{i}} = \frac{E_{I} ({\tilde{R}}_{I} G_{I})}{E_{I} ({\tilde{R}}_{I})}, \qquad (2.1)$

where ${\tilde{R}}_{I} = R_{I} W_{I},$ and $E_{I}$ is taken with respect to the uniform distribution over the index set $\mathcal{N} = \{1, \dots, N\}.$ The first expression in (2.1) simply defines a weighted sample average. With the help of $R_{i},$ the second expression turns the sample averages into finite-population averages. This trivial re-expression is fundamental because it explicates the role of $R_{i}$ in influencing the behavior of ${\bar{G}}_{W}$ as an estimator of $\bar{G}.$ The third expression reveals a divine probability through $I,$ the finite-population index (FPI) variable, by utilizing the fact that averaging is the same as taking expectation over a uniformly distributed random index $I.$ All finite-population moments can then be expressed via $E_{I}.$
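The equivalence of the three expressions in (2.1) is easy to check numerically. The following is a minimal sketch on a synthetic finite population; the population size, the distribution of $G_{i},$ the recording mechanism and the weights are all arbitrary choices made for illustration, and `E_I` is a hypothetical helper name standing in for the expectation over the uniform index $I.$

```python
import random

random.seed(0)

N = 1000                                                     # finite-population size (illustrative)
G = [random.gauss(0.0, 1.0) for _ in range(N)]               # G_i for every unit
R = [1 if random.random() < 0.2 else 0 for _ in range(N)]    # recording indicators R_i
W = [random.uniform(0.5, 2.0) for _ in range(N)]             # weights; only matter where R_i = 1

S = [i for i in range(N) if R[i] == 1]                       # index set of recorded units

# First expression: weighted average over the sample S.
first = sum(W[i] * G[i] for i in S) / sum(W[i] for i in S)

# Second expression: the same ratio written as sums over the whole population.
second = sum(R[i] * W[i] * G[i] for i in range(N)) / sum(R[i] * W[i] for i in range(N))


def E_I(values):
    """Expectation over a uniformly distributed index I, i.e. a plain average."""
    return sum(values) / len(values)


# Third expression: ratio of two finite-population expectations.
R_tilde = [R[i] * W[i] for i in range(N)]
third = E_I([R_tilde[i] * G[i] for i in range(N)]) / E_I(R_tilde)

print(first, second, third)
```

The three values agree up to floating-point rounding, since multiplying by $R_{i}$ only adds zeros to the sums and dividing numerator and denominator by $N$ leaves the ratio unchanged.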

In particular, we can express the actual error of ${\bar{G}}_{W}$ via the following identity, where the first expression can be traced back to Hartley and Ross (1954), who used it to express biases in ratio estimators. The second expression was given in Meng (2018) in a slightly different (but equivalent) form:

${\bar{G}}_{W} - \bar{G} = \frac{{Cov}_{I} ({\tilde{R}}_{I}, G_{I})}{E_{I} [{\tilde{R}}_{I}]} = \rho_{\tilde{R}, G} \times \sqrt{\frac{N - n_{W}}{n_{W}}} \times \sigma_{G}. \qquad (2.2)$

Here $\rho_{\tilde{R}, G} = {Corr}_{I} ({\tilde{R}}_{I}, G_{I})$ is the finite-population correlation between ${\tilde{R}}_{I}$ and $G_{I},$ $\sigma_{G}^{2}$ is the finite-population variance of $G_{I},$ and $n_{W}$ is the effective sample size due to using weights (Kish, 1965):

$n_{W} = \frac{n_{R}}{1 + {CV}_{W}^{2}}, \qquad (2.3)$

with ${CV}_{W}$ being the coefficient of variation (i.e., standard deviation/mean) of $\{W_{i}, i \in S\}.$
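As a numerical illustration of (2.2) and (2.3), the sketch below computes the actual error and both expressions on the right-hand side for one synthetic population. The recording mechanism (units with larger $G_{i}$ are more likely to be recorded) and the weights are assumptions chosen only to make the example non-trivial; `mean` and `cov` are hypothetical helpers. All moments use the divisor $N$ (or $n_{R}$ for ${CV}_{W}$), which is what the finite-population definitions require.

```python
import math
import random

random.seed(1)

N = 2000
G = [random.gauss(50.0, 10.0) for _ in range(N)]
# Illustrative selective recording: probability increases with G_i (an assumption).
R = [1 if random.random() < 0.3 / (1.0 + math.exp(-(g - 50.0) / 10.0)) else 0 for g in G]
W = [random.uniform(0.5, 3.0) for _ in range(N)]   # arbitrary positive weights


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Finite-population covariance (divisor len(xs), not len(xs) - 1)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


R_tilde = [r * w for r, w in zip(R, W)]

# Left-hand side of (2.2): weighted sample mean minus population mean.
G_bar = mean(G)
G_bar_W = mean([rt * g for rt, g in zip(R_tilde, G)]) / mean(R_tilde)
actual_error = G_bar_W - G_bar

# First expression on the right-hand side: Cov_I(R~, G) / E_I[R~].
middle = cov(R_tilde, G) / mean(R_tilde)

# Effective sample size (2.3), with CV_W computed over the recorded units only.
W_S = [w for r, w in zip(R, W) if r == 1]
n_R = len(W_S)
cv2_W = cov(W_S, W_S) / mean(W_S) ** 2
n_W = n_R / (1.0 + cv2_W)

# Second expression: rho x sqrt((N - n_W) / n_W) x sigma_G.
rho = cov(R_tilde, G) / math.sqrt(cov(R_tilde, R_tilde) * cov(G, G))
sigma_G = math.sqrt(cov(G, G))
right = rho * math.sqrt((N - n_W) / n_W) * sigma_G

print(actual_error, middle, right)
print(n_R, n_W)
```

The three printed error values coincide up to rounding for any choice of seed, recording mechanism or weights, which is the sense in which (2.2) is deterministic.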

The expression (2.2) is an algebraic identity because it holds for any instance of $\{(G_{i}, R_{i} W_{i}),\ i = 1, \dots, N\}.$ Hence no model assumptions are imposed, not even the assumption that $R$ (or any quantity) is random, echoing the comment by Mary Thompson, as quoted in Wu (2022), that “the sample inclusion indicator $R$ is a random variable is itself an assumption”. The only requirement is that the recorded $G_{i}$ is unchanged from the $G_{i}$'s in the target population. (But note this requirement has two components: (1) there is no over-coverage, that is, everyone in the sample belongs to the target population, e.g., no non-eligible voters are surveyed when the target population is eligible voters, and (2) there is no measurement error; extensions to the cases with measurement errors are available, but not pursued in this article.) When we use equal weights, the three factors on the right-hand side of (2.2) reflect respectively (from left to right) data defect, data sparsity, and problem difficulty, as detailed in Meng (2018) and further illustrated in Bradley, Kuriwaki, Isakov, Sejdinovic, Meng and Flaxman (2021) in the context of COVID-19 vaccination surveys.
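For readers who wish to see why no assumption is needed, the algebra behind (2.2) can be sketched in a few lines. The symbol $\sigma_{\tilde{R}}$ (the finite-population standard deviation of ${\tilde{R}}_{I}$) is introduced here only for this sketch; the last line uses $R_{i}^{2} = R_{i}$ and takes ${CV}_{W}$ with divisor $n_{R}.$

```latex
\begin{align*}
\bar{G}_{W} - \bar{G}
  &= \frac{E_{I}(\tilde{R}_{I} G_{I})}{E_{I}(\tilde{R}_{I})} - E_{I}(G_{I})
   = \frac{E_{I}(\tilde{R}_{I} G_{I}) - E_{I}(\tilde{R}_{I})\, E_{I}(G_{I})}{E_{I}(\tilde{R}_{I})}
   = \frac{\mathrm{Cov}_{I}(\tilde{R}_{I}, G_{I})}{E_{I}(\tilde{R}_{I})} \\
  &= \rho_{\tilde{R}, G} \times \frac{\sigma_{\tilde{R}}}{E_{I}(\tilde{R}_{I})} \times \sigma_{G}, \\[4pt]
\frac{\sigma_{\tilde{R}}^{2}}{[E_{I}(\tilde{R}_{I})]^{2}}
  &= \frac{E_{I}(\tilde{R}_{I}^{2})}{[E_{I}(\tilde{R}_{I})]^{2}} - 1
   = \frac{N^{-1} \sum_{i \in S} W_{i}^{2}}{\bigl(N^{-1} \sum_{i \in S} W_{i}\bigr)^{2}} - 1
   = \frac{N \,(1 + CV_{W}^{2})}{n_{R}} - 1
   = \frac{N - n_{W}}{n_{W}}.
\end{align*}
```

Every step is a rearrangement of finite sums, so the identity holds whenever $\sum_{i \in S} W_{i} > 0$ and both $\sigma_{\tilde{R}}$ and $\sigma_{G}$ are non-zero.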

In particular, when all weights are equal, $\rho_{\tilde{R}, G}$ is termed the data defect correlation (ddc) in Meng (2018) because it measures the lack of representativeness of the sample by capturing the dependence of the inclusion/recording indicator on the attributes: the higher the dependence, the more biased the sample average becomes for estimating population averages. With the basic strategies of probabilistic sampling or inverse probability weighting, ddc will be zero on average because $E (W_{i} R_{i}) = 1,$ and it is of order $O_{p} (N^{- 1 / 2})$ because it is essentially an average of $N$ independent terms (Meng, 2018). Our general goal here therefore is to bring down ddc to $O_{p} (N^{- 1 / 2})$ for non-probability samples, which we shall refer to as “miniaturizing ddc” because $N^{- 1 / 2}$ is typically a minuscule number in practice.
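The contrast between a ddc of order $N^{-1/2}$ and a ddc that does not shrink with $N$ can be seen in a small simulation. The sketch below is only illustrative: it assumes Bernoulli sampling with a constant rate of 0.15 as the probabilistic design, and a hypothetical selective mechanism in which units with $G_{i} > 0$ are recorded with probability 0.2 and the others with probability 0.1. Equal weights are used throughout, and `corr` and `rms_ddc` are hypothetical helper names.

```python
import math
import random


def corr(xs, ys):
    """Finite-population correlation; returns 0.0 if either variance is zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rms_ddc(N, reps, selective, rng):
    """Root-mean-square of the equal-weight ddc, Corr_I(R_I, G_I), over replications."""
    total = 0.0
    for _ in range(reps):
        G = [rng.gauss(0.0, 1.0) for _ in range(N)]
        if selective:
            probs = [0.2 if g > 0 else 0.1 for g in G]   # recording depends on G_i
        else:
            probs = [0.15] * N                           # recording independent of G_i
        R = [1 if rng.random() < p else 0 for p in probs]
        total += corr(R, G) ** 2
    return math.sqrt(total / reps)


rng = random.Random(2)
rms = {}
for N in (100, 400, 1600):
    rms[("random", N)] = rms_ddc(N, 200, False, rng)
    rms[("selective", N)] = rms_ddc(N, 200, True, rng)
    # Under random recording, sqrt(N) * rms stays roughly constant;
    # under selective recording, rms itself stays roughly constant.
    print(N, round(math.sqrt(N) * rms[("random", N)], 2), round(rms[("selective", N)], 3))
```

Under the random mechanism the rescaled quantity $\sqrt{N} \times$ ddc stays near a constant as $N$ grows, whereas under the selective mechanism ddc itself stabilizes at a non-zero value, which is the behavior that miniaturizing ddc aims to remove.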

When we use weights, the first term $\rho_{\tilde{R}, G}$ captures the data defect that still exists after the weighting adjustment, since no weights are perfect in practice. Identity (2.2) shows the impact of the weights on both data quality and data quantity. The impact on the nominal effective sample size $n_{W}$ is never positive because $n_{W} \leq n_{R}$ as seen in (2.3). Incidentally, the exactness of (2.3) reveals that this well-known expression is in fact not an approximation (which is often attributed to Kish (1965)), but an exact formula for the reduction of the sample size due to weighting if the weighting had no impact on ddc. However, weighting can have a major positive impact on reducing the overall error by judiciously choosing weights to significantly decrease ddc, though apparently at the price of $n_{W} < n_{R}.$ Of course, this is exactly the aim of the quasi-randomization framework, as discussed below. Most importantly, however, (2.2) leads to a unified insight about the variety of methods reviewed in Wu (2022), including an intuitive explanation of the doubly robust property, which has been receiving increased attention for integrating data sources including both probability and non-probability samples (e.g., Yang, Kim and Song, 2020).
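The trade-off between data quality and data quantity can be made concrete with a toy comparison of equal weights and inverse probability weights. The sketch below assumes, purely for illustration, that the true recording probabilities $\pi_{i}$ are known (0.2 when $G_{i} > 0$ and 0.1 otherwise), which is rarely the case in practice; `decompose` is a hypothetical helper that returns the actual error, the correlation $\rho_{\tilde{R}, G}$ and $n_{W}$ following (2.2) and (2.3).

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Finite-population covariance (divisor len(xs))."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def decompose(G, R, W):
    """Return (actual error, rho, n_W) for the weighted mean, following (2.2)-(2.3)."""
    R_tilde = [r * w for r, w in zip(R, W)]
    W_S = [w for r, w in zip(R, W) if r == 1]
    n_W = len(W_S) / (1.0 + cov(W_S, W_S) / mean(W_S) ** 2)
    error = cov(R_tilde, G) / mean(R_tilde)
    rho = cov(R_tilde, G) / math.sqrt(cov(R_tilde, R_tilde) * cov(G, G))
    return error, rho, n_W


rng = random.Random(3)
N = 20000
G = [rng.gauss(0.0, 1.0) for _ in range(N)]
pi = [0.2 if g > 0 else 0.1 for g in G]            # assumed-known recording probabilities
R = [1 if rng.random() < p else 0 for p in pi]
n_R = sum(R)

# Equal weights: n_W equals n_R, but the ddc is left untouched.
err_eq, rho_eq, nW_eq = decompose(G, R, [1.0] * N)

# Inverse probability weights: n_W drops below n_R, but the correlation shrinks.
err_ipw, rho_ipw, nW_ipw = decompose(G, R, [1.0 / p for p in pi])

print("n_R:", n_R)
print("equal weights: error, rho, n_W =", err_eq, rho_eq, nW_eq)
print("IPW weights:   error, rho, n_W =", err_ipw, rho_ipw, nW_ipw)
```

In this toy setting the inverse probability weights give up some effective sample size yet reduce the actual error substantially, because the reduction in $\rho_{\tilde{R}, G}$ far outweighs the loss in $n_{W}.$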

Indeed, Zhang (2019, Section 3.1) used the first expression in (2.2) to define a unified non-parametric asymptotic (NPA) non-informativeness assumption, which requires that the numerator ${Cov}_{I} ({\tilde{R}}_{I}, G_{I})$ goes to zero, while keeping the denominator $E_{I} [{\tilde{R}}_{I}]$ positive, as $N \to \infty.$ This unification permits Zhang (2019) to evaluate the quasi-randomization approach and regression modeling via a common criterion. The ddc framework echoes this unification, as discussed in Section 3 below, with Section 4 stressing the same broad message as emphasized by Zhang (2019). Section 5 harvests another low-hanging fruit of the ddc formulation, since it provides an immediate explanation of the celebrated double robustness. Section 6 then ventures into a much harder area of engineering a more representative sub-sample out of a large non-representative sample, a worthwhile trade-off because data quality is far more important than data quantity (Meng, 2018), as briefly reviewed below.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-12-15
