Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 3. A unifying strategy based on data defect correlation
In the setup of Wu (2022), for each individual $j = 1, \ldots, N$ we have a set of attributes $(X_j, Y_j)$, where $Y_j$ is the attribute of interest and $X_j$ is auxiliary, which is useful in two ways. First, reducing the sampling bias due to non-probability sampling becomes possible when the non-probability mechanism can be (fully) explained by $X_j$. Second, by taking advantage of the relationships between $X_j$ and $Y_j$, we can improve the efficiency of our estimation. As a starting point, Wu (2022) assumes that we have two data sources available, which we denote via two recording indicators, $R_j$ and $R_j^{*}$. The main source of the data is a non-probability sample, where we observe both $X_j$ and $Y_j$ for $R_j = 1$, but the recording indicator $R_j$ is determined by a mechanism uncontrolled by any (known) design probability. A second source is (assumed to be) a probability sample, where we observe $X_j$ only, for $R_j^{*} = 1$. This second sample provides information to estimate population auxiliary information about $X$ that is useful for estimating population quantities about $Y$, such as its mean $\bar{Y}_N = N^{-1}\sum_{j=1}^{N} Y_j$. Hence this setup is closely related to that studied in Tan (2013).
Now for any function $g(\cdot)$, let $Z_j = Y_j - g(X_j)$. Clearly we can estimate the population mean $\bar{Y}_N$ via estimating $\bar{Z}_N$ and $\bar{g}_N = N^{-1}\sum_{j=1}^{N} g(X_j)$. From the second sample, $\bar{g}_N$ can be estimated unbiasedly since it involves $X$ only. We therefore can focus on estimating $\bar{Z}_N$, while recognizing that a more principled approach is to set up a likelihood or Bayesian model to estimate all unknown quantities jointly (Pfeffermann, 2017). Applying identity (2.2) with $Y_j$ replaced by $Z_j$ then tells us that our central task is to choose the weight $w$ and/or the function $g$ to miniaturize the ddc $\hat{\rho}_{Rw,Z}$. For our current discussion, it is easier to explain everything via the covariance

  $C_{Rw,Z} = \mathrm{Cov}_J(R_J w_J, Z_J),$   (3.1)

where $J$ is uniformly distributed over $\{1, \ldots, N\}$, instead of the correlation $\hat{\rho}_{Rw,Z}$, because $C_{Rw,Z}$ is a bi-linear function in $Rw$ and $Z$. However, being standardized, $\hat{\rho}_{Rw,Z}$ is more appealing theoretically and for modelling purposes; see Sections 6 and 7.
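The error identity that motivates miniaturizing the ddc can be checked numerically. The sketch below verifies the unweighted special case of identity (2.2), namely $\bar{Y}_n - \bar{Y}_N = \hat{\rho}_{R,Y}\sqrt{(1-f)/f}\,\sigma_Y$ with $f = n/N$, where the population and the (self-)selection mechanism are illustrative choices, not part of the original setup:

```python
import numpy as np

# Numerical check of the unweighted data defect identity (Meng, 2018):
#   Ybar_n - Ybar_N = rho(R, Y) * sqrt((1 - f) / f) * sigma_Y,
# where f = n/N and all moments are finite-population moments.
# The population and recording mechanism below are illustrative.
rng = np.random.default_rng(0)
N = 1000
Y = rng.normal(size=N)
R = (rng.random(N) < 1 / (1 + np.exp(-Y))).astype(float)  # selection favors large Y

f = R.mean()                   # realized sampling fraction n/N
ybar_n = Y[R == 1].mean()      # mean of the non-probability sample
ybar_N = Y.mean()              # population mean
rho = np.corrcoef(R, Y)[0, 1]  # data defect correlation (ddc)
sigma_Y = Y.std()              # population standard deviation (ddof = 0)

lhs = ybar_n - ybar_N
rhs = rho * np.sqrt((1 - f) / f) * sigma_Y
assert abs(lhs - rhs) < 1e-12  # the identity holds exactly, not asymptotically
```

Note that the identity is exact for any realized $R$, which is why driving the ddc toward zero directly controls the actual estimation error.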
The expression in (3.1) tells us immediately how to make it zero in expectation operationally, and in what sense conceptually. For whatever probability we impose on $\mathbf{R} = (R_1, \ldots, R_N)$ (to be specified in later sections), let $\pi_j = \Pr(R_j = 1)$, which we assume will depend on $X_j$ only. Then the linearity of the covariance operator implies that the average covariance with respect to the randomness in $\mathbf{R}$ is given by

  $\mathrm{E}_{\mathbf{R}}\left[\mathrm{Cov}_J(R_J w_J, Z_J)\right] = \mathrm{Cov}_J(\pi_J \tilde{w}_J, Z_J),$   (3.2)

where $\tilde{w}_j = \mathrm{E}(w_j \mid R_j = 1)$. Similarly, if one is willing to posit a joint model for $(X_j, Y_j)$, conditioning on $\mathbf{R}$ in the independence form $Y_j \perp R_j \mid X_j$, then

  $\mathrm{E}_{Y \mid X}\left[\mathrm{Cov}_J(R_J w_J, Z_J)\right] = \mathrm{Cov}_J\left(R_J w_J,\, \mu(X_J) - g(X_J)\right),$   (3.3)

where $\mu(x) = \mathrm{E}(Y \mid X = x)$.
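The averaging step behind (3.2) can be made concrete with a small Monte Carlo sketch. With the weights held fixed across replications, (3.2) reduces to $\mathrm{E}_{\mathbf{R}}[\mathrm{Cov}_J(Rw, Z)] = \mathrm{Cov}_J(\pi w, Z)$; the population, propensities, and weights below are illustrative choices:

```python
import numpy as np

# Monte Carlo sketch of the averaging in (3.2): with weights w_j fixed,
# linearity of the covariance over the population index J gives
#   E_R[Cov_J(R w, Z)] = Cov_J(pi w, Z),  pi_j = Pr(R_j = 1).
# The population, pi, w, and Z below are illustrative.
rng = np.random.default_rng(1)
N = 500
X = rng.normal(size=N)
Z = X + rng.normal(size=N)         # a residual-type variable Z = Y - g(X)
pi = 1 / (1 + np.exp(-X))          # propensity depending on X only
w = 1 + 0.5 * np.abs(X)            # arbitrary fixed weights

def cov_J(a, b):
    """Finite-population covariance over the index J (ddof = 0)."""
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Average the realized covariance over many independent draws of R.
draws = [cov_J((rng.random(N) < pi) * w, Z) for _ in range(20000)]
assert abs(np.mean(draws) - cov_J(pi * w, Z)) < 0.01
```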
Very intuitively, one can ensure a zero covariance or correlation between two variables by making either of them a constant. The two choices then would lead to, respectively, the quasi-randomization approach by making $\pi_j \tilde{w}_j$ a constant (i.e., taking $w_j \propto \pi_j^{-1}$), and the super-population approach by making $\mu(X_j) - g(X_j)$ a constant (e.g., zero). The fact that either one is sufficient to render zero covariance (under the joint model) yields the double robustness, because it does not matter which one. But clearly these are not the only methods to achieve a zero correlation/covariance or double robustness, an emphasis of Kang and Schafer (2007) in their attempt to demystify the doubly robust approach (Robins, Rotnitzky and Zhao, 1994; Robins, 2000; Scharfstein, Rotnitzky and Robins, 1999). See also Tan (2007, 2010) for discussions and comparisons of an array of estimators, including those corresponding to only the quasi-randomization approach or only the super-population approach, some of which are doubly robust.
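The double robustness described above can be illustrated with the familiar augmented inverse-propensity estimator of the population mean, $N^{-1}\sum_j [g(X_j) + (R_j/\pi_j)(Y_j - g(X_j))]$, whose error is governed by (3.2)-(3.3): it is consistent if either $\pi$ or $g$ is correct. The data-generating choices below are hypothetical:

```python
import numpy as np

# Illustrative simulation of double robustness via the augmented
# inverse-propensity estimator:
#   (1/N) * sum_j [ g(X_j) + (R_j / pi_j) * (Y_j - g(X_j)) ].
# Either a correct pi or a correct g suffices; both models below are
# hypothetical choices for the demonstration.
rng = np.random.default_rng(2)
N = 200_000
X = rng.normal(size=N)
Y = 1 + 2 * X + rng.normal(size=N)      # true regression: mu(x) = 1 + 2x
pi_true = 1 / (1 + np.exp(-(0.5 + X)))  # true propensity, depends on X only
R = rng.random(N) < pi_true

def dr_mean(pi, g):
    """Augmented (doubly robust) estimator of the population mean."""
    return np.mean(g + R * (Y - g) / pi)

wrong_g = np.zeros(N)            # badly misspecified regression function
wrong_pi = np.full(N, R.mean())  # misspecified (constant) propensity
true_g = 1 + 2 * X               # correct regression function

truth = Y.mean()
assert abs(dr_mean(pi_true, wrong_g) - truth) < 0.05  # correct pi rescues wrong g
assert abs(dr_mean(wrong_pi, true_g) - truth) < 0.05  # correct g rescues wrong pi
```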
Indeed, because formula (2.2) is an identity for the actual error, any asymptotically unbiased (linear) estimator of the population mean must imply that its corresponding ddc is asymptotically unbiased for zero, and vice versa, with respect to the randomness in $\mathbf{R}$ or in $(X, Y)$. However, it is possible for the ddc to be asymptotically unbiased for zero without assuming any model is correctly specified; see Section 5 for an example. (This “double-plus robustness” is different from the “multiple robustness” of Han and Wang (2013), which still needs to assume the validity of at least one of the posited multiple models.) These two observations suggest that any general sufficient and necessary strategy for ensuring asymptotically consistent/unbiased (linear) estimators for the population mean would be equivalent to miniaturizing the ddc.
As an example of a unified insight that otherwise might not be as intuitive, expression (3.2) suggests that we should include our estimate of $\pi_j$ as a part of the predictor in the regression model for $Y$ given $X$, since that can help to reduce the correlation between $\pi_J \tilde{w}_J$ and $Z_J$, especially when we use constant weights. Using $\hat{\pi}$ as a predictor for $Y$ is generally hard to motivate purely from the regression perspective, especially when we assume $Y$ and $R$ are independent given $X$ (typically a necessary condition to proceed, as discussed in the next section). However, expression (3.2) tells us that for the purpose of estimating the mean of $Y$, it is not absolutely necessary to fit the correct regression model $\mu(x) = \mathrm{E}(Y \mid X = x)$. Rather, it is sufficient to ensure the “residual” $Z_j = Y_j - g(X_j)$ is uncorrelated (or nearly so) with $R_j w_j$ as $j$ varies. However, it is critically important to recognize that it is not sufficient to ensure zero or small correlation only among the observed data, because the behavior of $Z_j$ on $\{j: R_j = 1\}$ tells us little about its behavior on $\{j: R_j = 0\}$. In the setting of Wu (2022), our ability to extrapolate from $\{j: R_j = 1\}$ to $\{j: R_j = 0\}$ depends on the availability of the (independent) auxiliary data indexed by $R_j^{*} = 1$, which allow us to observe some $X_j$ for which $R_j = 0$.
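The warning about relying only on the observed data can be demonstrated: fitting $g$ by least squares on the respondents alone makes the in-sample residuals average to zero, yet the resulting prediction estimator of the population mean can be badly biased, because the respondents reveal little about the residuals off-sample. The misspecified linear model and the selection mechanism below are illustrative:

```python
import numpy as np

# A fit that looks clean on the observed data can still fail off-sample:
# in-sample residuals average to zero by construction (OLS with intercept),
# but the prediction estimator of the population mean is badly biased when
# the regression is misspecified and selection is non-ignorable in X.
rng = np.random.default_rng(4)
N = 50_000
X = rng.normal(size=N)
Y = X**2 + rng.normal(size=N)                # true regression is quadratic
R = rng.random(N) < 1 / (1 + np.exp(-2 * X)) # selection favors large X

D = np.column_stack([np.ones(N), X])         # misspecified: linear in X
beta, *_ = np.linalg.lstsq(D[R], Y[R], rcond=None)
g = D @ beta

# In-sample, the residuals average to zero on the respondents ...
assert abs(np.mean(Y[R] - g[R])) < 1e-8
# ... yet the prediction estimator of the population mean is far off.
assert abs(np.mean(g) - np.mean(Y)) > 0.3
```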
The strategy of including propensity estimates as a predictor has been found beneficial in the related literature. For example, Little and An (2004) included the logit of $\hat{\pi}$ in their imputation model, and reported that the inclusion enhanced the robustness of the imputed mean to the misspecification of the imputation model. The method was further developed and enhanced by Zhang and Little (2009) and by Tan, Flannagan and Elliott (2019), who used the term “Robust-squared” to emphasize the enhanced robustness. In a more recent article on such a strategy for non-probability samples, Liu et al. (2021) emphasized the importance of including the estimated propensity “as a predictor” in $g$ (using the notation of this article). Furthermore, in the literature on targeted maximum likelihood estimation (TMLE) for semi-parametric models for dealing with non-probability data (van der Laan and Rubin, 2006; Luque-Fernandez, Schomaker, Rachet and Schnitzer, 2018) (see also Scharfstein et al. (1999); Tan (2010)), variables such as $R_j/\hat{\pi}(X_j)$ are called clever covariates and are used in the regression models for $Y$. The implementations and theories of TMLE, and the related collaborative TMLE (van der Laan and Gruber, 2009, 2010), are mathematically more involved than those under the finite-population settings discussed below, but the insights gained from (3.2)-(3.3) can provide us with helpful intuitions for understanding the essence of such methods.
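One concrete reason the clever-covariate device works can be sketched as follows: if the inverse propensity $1/\pi_j$ is included as a regressor when fitting $g$ by least squares on the respondents, the normal equations force $\sum_{R_j=1}(Y_j - g(X_j))/\pi_j = 0$, so the inverse-propensity correction term vanishes and the pure prediction estimator coincides with the augmented (doubly robust) one. The data and propensity form below are illustrative:

```python
import numpy as np

# Sketch of the clever-covariate mechanism: including 1/pi_j as a regressor
# in the respondents-only least-squares fit of g forces, via the normal
# equations, sum_{R_j=1} (Y_j - g(X_j)) / pi_j = 0, so the augmented
# correction term is exactly zero. Data and pi below are illustrative.
rng = np.random.default_rng(3)
N = 5000
X = rng.normal(size=N)
Y = np.sin(X) + 2 * X + rng.normal(size=N)
pi = 1 / (1 + np.exp(-X))
R = rng.random(N) < pi

# Design matrix with intercept, X, and the inverse propensity as predictors.
D = np.column_stack([np.ones(N), X, 1 / pi])
beta, *_ = np.linalg.lstsq(D[R], Y[R], rcond=None)
g = D @ beta

# The inverse-propensity correction term is numerically zero ...
assert abs(np.mean(R * (Y - g) / pi)) < 1e-8
# ... hence the prediction and augmented estimators agree exactly.
assert abs(np.mean(g) - np.mean(g + R * (Y - g) / pi)) < 1e-8
```

This orthogonality is the finite-population analogue of the TMLE targeting step, which updates the initial fit precisely so that the relevant score (the correction term above) is solved exactly.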
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022