Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 3. A unifying strategy based on data defect correlation
In the setup of Wu (2022), for each individual $j = 1, \ldots, N$ we have a set of attributes $(X_j, Y_j)$, where $Y_j$ is the attribute of interest and $X_j$ is auxiliary, which is useful in two ways. First, reducing the sampling bias due to non-probability sampling becomes possible when the non-probability mechanism can be (fully) explained by $X_j$. Second, by taking advantage of the relationships between $X_j$ and $Y_j$, we can improve the efficiency of our estimation. As a starting point, Wu (2022) assumes that we have two data sources available, which we denote via two recording indicators, $R_j$ and $R_j^{*}$. The main source of the data is a non-probability sample, where we observe both $X_j$ and $Y_j$ for $R_j = 1$, but the recording indicator $R_j$ is determined by a mechanism uncontrolled by any (known) design probability. A second source is (assumed to be) a probability sample, where we observe $X_j$ only, for $R_j^{*} = 1$. This second sample provides information to estimate population auxiliary information about $X$ that is useful for estimating population quantities about $Y$, such as its mean $\bar{Y}_N = N^{-1}\sum_{j=1}^{N} Y_j$. Hence this setup is closely related to that studied in Tan (2013).
Now for any function $g(\cdot)$, let $Z_j = Y_j - g(X_j)$. Clearly we can estimate the population mean $\bar{Y}_N$ via estimating $\bar{Z}_N$ and $\bar{g}_N = N^{-1}\sum_{j=1}^{N} g(X_j)$. From the second sample, $\bar{g}_N$ can be estimated unbiasedly since it involves $X$ only. We therefore can focus on estimating $\bar{Z}_N$, while recognizing that a more principled approach is to set up a likelihood or Bayesian model to estimate all unknown quantities jointly (Pfeffermann, 2017). Applying identity (2.2) with $Y_j$ replaced by $Z_j$ then tells us that our central task is to choose the weight $w$ and/or the function $g$ to miniaturize the ddc $\hat{\rho}_{Rw,Z}$. For our current discussion, it is easier to explain everything via the covariance

  $C_{Rw,Z} = \mathrm{Cov}_J(R_J w_J, Z_J),$   (3.1)

where $J$ is uniformly distributed over $\{1, \ldots, N\}$, instead of the correlation $\hat{\rho}_{Rw,Z}$, because $C_{Rw,Z}$ is a bi-linear function in $Rw$ and $Z$. However, being standardized, $\hat{\rho}_{Rw,Z}$ is more appealing theoretically and for modelling purposes; see Sections 6 and 7.
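The error identity that motivates miniaturizing the ddc can be checked numerically. The sketch below verifies the unweighted special case of identity (2.2), namely $\bar{Y}_n - \bar{Y}_N = \hat{\rho}_{R,Y}\sqrt{(1-f)/f}\,\sigma_Y$ with $f = n/N$, where the population and the (self-)selection mechanism are illustrative choices, not part of the original setup:

```python
import numpy as np

# Numerical check of the unweighted data defect identity (Meng, 2018):
#   Ybar_n - Ybar_N = rho(R, Y) * sqrt((1 - f) / f) * sigma_Y,
# where f = n/N and all moments are finite-population moments.
# The population and recording mechanism below are illustrative.
rng = np.random.default_rng(0)
N = 1000
Y = rng.normal(size=N)
R = (rng.random(N) < 1 / (1 + np.exp(-Y))).astype(float)  # selection favors large Y

f = R.mean()                   # realized sampling fraction n/N
ybar_n = Y[R == 1].mean()      # mean of the non-probability sample
ybar_N = Y.mean()              # population mean
rho = np.corrcoef(R, Y)[0, 1]  # data defect correlation (ddc)
sigma_Y = Y.std()              # population standard deviation (ddof = 0)

lhs = ybar_n - ybar_N
rhs = rho * np.sqrt((1 - f) / f) * sigma_Y
assert abs(lhs - rhs) < 1e-12  # the identity holds exactly, not asymptotically
```

Note that the identity is exact for any realized $R$, which is why driving the ddc toward zero directly controls the actual estimation error.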
The expression in (3.1) tells us immediately how to make it zero in expectation operationally, and in what sense conceptually. For whatever probability we impose on $\mathbf{R} = (R_1, \ldots, R_N)$ (to be specified in later sections), let $\pi_j = \Pr(R_j = 1)$, which we assume will depend on $X_j$ only. Then the linearity of the covariance operator implies that the average covariance with respect to the randomness in $\mathbf{R}$ is given by

  $\mathrm{E}_{\mathbf{R}}\left[\mathrm{Cov}_J(R_J w_J, Z_J)\right] = \mathrm{Cov}_J(\pi_J \tilde{w}_J, Z_J),$   (3.2)

where $\tilde{w}_j = \mathrm{E}(w_j \mid R_j = 1)$. Similarly, if one is willing to posit a joint model for $(X_j, Y_j)$, conditioning on $\mathbf{R}$ in the independence form $Y_j \perp R_j \mid X_j$, then

  $\mathrm{E}_{Y \mid X}\left[\mathrm{Cov}_J(R_J w_J, Z_J)\right] = \mathrm{Cov}_J\left(R_J w_J,\, \mu(X_J) - g(X_J)\right),$   (3.3)

where $\mu(x) = \mathrm{E}(Y \mid X = x)$.
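The averaging step behind (3.2) can be made concrete with a small Monte Carlo sketch. With the weights held fixed across replications, (3.2) reduces to $\mathrm{E}_{\mathbf{R}}[\mathrm{Cov}_J(Rw, Z)] = \mathrm{Cov}_J(\pi w, Z)$; the population, propensities, and weights below are illustrative choices:

```python
import numpy as np

# Monte Carlo sketch of the averaging in (3.2): with weights w_j fixed,
# linearity of the covariance over the population index J gives
#   E_R[Cov_J(R w, Z)] = Cov_J(pi w, Z),  pi_j = Pr(R_j = 1).
# The population, pi, w, and Z below are illustrative.
rng = np.random.default_rng(1)
N = 500
X = rng.normal(size=N)
Z = X + rng.normal(size=N)         # a residual-type variable Z = Y - g(X)
pi = 1 / (1 + np.exp(-X))          # propensity depending on X only
w = 1 + 0.5 * np.abs(X)            # arbitrary fixed weights

def cov_J(a, b):
    """Finite-population covariance over the index J (ddof = 0)."""
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Average the realized covariance over many independent draws of R.
draws = [cov_J((rng.random(N) < pi) * w, Z) for _ in range(20000)]
assert abs(np.mean(draws) - cov_J(pi * w, Z)) < 0.01
```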
Very intuitively, one can ensure a zero covariance or correlation between two variables by making either of them a constant. The two choices then would lead to, respectively, the quasi-randomization approach by making $\pi_j \tilde{w}_j$ a constant (i.e., taking $w_j \propto \pi_j^{-1}$), and the super-population approach by making $\mu(X_j) - g(X_j)$ a constant (e.g., zero). The fact that either one is sufficient to render zero covariance (under the joint model) yields the double robustness, because it does not matter which one. But clearly these are not the only methods to achieve a zero correlation/covariance or double robustness, an emphasis of Kang and Schafer (2007) in their attempt to demystify the doubly robust approach (Robins, Rotnitzky and Zhao, 1994; Robins, 2000; Scharfstein, Rotnitzky and Robins, 1999). See also Tan (2007, 2010) for discussions and comparisons of an array of estimators, including those corresponding to only the quasi-randomization approach or only the super-population approach, some of which are doubly robust.
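The double robustness described above can be illustrated with the familiar augmented inverse-propensity estimator of the population mean, $N^{-1}\sum_j [g(X_j) + (R_j/\pi_j)(Y_j - g(X_j))]$, whose error is governed by (3.2)-(3.3): it is consistent if either $\pi$ or $g$ is correct. The data-generating choices below are hypothetical:

```python
import numpy as np

# Illustrative simulation of double robustness via the augmented
# inverse-propensity estimator:
#   (1/N) * sum_j [ g(X_j) + (R_j / pi_j) * (Y_j - g(X_j)) ].
# Either a correct pi or a correct g suffices; both models below are
# hypothetical choices for the demonstration.
rng = np.random.default_rng(2)
N = 200_000
X = rng.normal(size=N)
Y = 1 + 2 * X + rng.normal(size=N)      # true regression: mu(x) = 1 + 2x
pi_true = 1 / (1 + np.exp(-(0.5 + X)))  # true propensity, depends on X only
R = rng.random(N) < pi_true

def dr_mean(pi, g):
    """Augmented (doubly robust) estimator of the population mean."""
    return np.mean(g + R * (Y - g) / pi)

wrong_g = np.zeros(N)            # badly misspecified regression function
wrong_pi = np.full(N, R.mean())  # misspecified (constant) propensity
true_g = 1 + 2 * X               # correct regression function

truth = Y.mean()
assert abs(dr_mean(pi_true, wrong_g) - truth) < 0.05  # correct pi rescues wrong g
assert abs(dr_mean(wrong_pi, true_g) - truth) < 0.05  # correct g rescues wrong pi
```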
Indeed, because formula (2.2) is an identity for the actual error, any asymptotically unbiased (linear) estimator of the population mean must imply that its corresponding ddc is asymptotically unbiased for zero, and vice versa, with respect to the randomness in $\mathbf{R}$ or in $(X, Y)$. However, it is possible for the ddc to be asymptotically unbiased for zero without assuming any model is correctly specified; see Section 5 for an example. (This “double-plus robustness” is different from the “multiple robustness” of Han and Wang (2013), which still needs to assume the validity of at least one of the posited multiple models.) These two observations suggest that any general sufficient and necessary strategy for ensuring asymptotically consistent/unbiased (linear) estimators for the population mean would be equivalent to miniaturizing the ddc.
As an example of a unified insight that otherwise might not be as intuitive, expression (3.2) suggests that we should include our estimate of $\pi_j$ as a part of the predictor in the regression model for $Y$ given $X$, since that can help to reduce the correlation between $\pi_J \tilde{w}_J$ and $Z_J$, especially when we use constant weights. Using $\hat{\pi}$ as a predictor for $Y$ is generally hard to motivate purely from the regression perspective, especially when we assume $Y$ and $R$ are independent given $X$ (typically a necessary condition to proceed, as discussed in the next section). However, expression (3.2) tells us that for the purpose of estimating the mean of $Y$, it is not absolutely necessary to fit the correct regression model $\mu(x) = \mathrm{E}(Y \mid X = x)$. Rather, it is sufficient to ensure the “residual” $Z_j = Y_j - g(X_j)$ is uncorrelated (or nearly so) with $R_j w_j$ as $j$ varies. However, it is critically important to recognize that it is not sufficient to ensure zero or small correlation only among the observed data, because the behavior of $Z_j$ on $\{j: R_j = 1\}$ tells us little about its behavior on $\{j: R_j = 0\}$. In the setting of Wu (2022), our ability to extrapolate from $\{j: R_j = 1\}$ to $\{j: R_j = 0\}$ depends on the availability of the (independent) auxiliary data indexed by $R_j^{*} = 1$, which allow us to observe some $X_j$ for which $R_j = 0$.
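The warning about relying only on the observed data can be demonstrated: fitting $g$ by least squares on the respondents alone makes the in-sample residuals average to zero, yet the resulting prediction estimator of the population mean can be badly biased, because the respondents reveal little about the residuals off-sample. The misspecified linear model and the selection mechanism below are illustrative:

```python
import numpy as np

# A fit that looks clean on the observed data can still fail off-sample:
# in-sample residuals average to zero by construction (OLS with intercept),
# but the prediction estimator of the population mean is badly biased when
# the regression is misspecified and selection is non-ignorable in X.
rng = np.random.default_rng(4)
N = 50_000
X = rng.normal(size=N)
Y = X**2 + rng.normal(size=N)                # true regression is quadratic
R = rng.random(N) < 1 / (1 + np.exp(-2 * X)) # selection favors large X

D = np.column_stack([np.ones(N), X])         # misspecified: linear in X
beta, *_ = np.linalg.lstsq(D[R], Y[R], rcond=None)
g = D @ beta

# In-sample, the residuals average to zero on the respondents ...
assert abs(np.mean(Y[R] - g[R])) < 1e-8
# ... yet the prediction estimator of the population mean is far off.
assert abs(np.mean(g) - np.mean(Y)) > 0.3
```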
The strategy of including propensity estimates as a predictor has been found beneficial in the related literature. For example, Little and An (2004) included the logit of $\hat{\pi}$ in their imputation model, and reported that the inclusion enhanced the robustness of the imputed mean to the misspecification of the imputation model. The method was further developed and enhanced by Zhang and Little (2009) and by Tan, Flannagan and Elliott (2019), who used the term “Robust-squared” to emphasize the enhanced robustness. In a more recent article on such a strategy for non-probability samples, Liu et al. (2021) emphasized the importance of including the estimated propensity “as a predictor” in $g$ (using the notation of this article). Furthermore, in the literature on targeted maximum likelihood estimation (TMLE) for semi-parametric models for dealing with non-probability data (van der Laan and Rubin, 2006; Luque-Fernandez, Schomaker, Rachet and Schnitzer, 2018) (see also Scharfstein et al. (1999); Tan (2010)), variables such as $R_j/\hat{\pi}(X_j)$ are called clever covariates and are used in the regression models for $Y$. The implementations and theories of TMLE, and the related collaborative TMLE (van der Laan and Gruber, 2009, 2010), are mathematically more involved than those under the finite-population settings discussed below, but the insights gained from (3.2)-(3.3) can provide us with helpful intuitions for understanding the essence of such methods.
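One concrete reason the clever-covariate device works can be sketched as follows: if the inverse propensity $1/\pi_j$ is included as a regressor when fitting $g$ by least squares on the respondents, the normal equations force $\sum_{R_j=1}(Y_j - g(X_j))/\pi_j = 0$, so the inverse-propensity correction term vanishes and the pure prediction estimator coincides with the augmented (doubly robust) one. The data and propensity form below are illustrative:

```python
import numpy as np

# Sketch of the clever-covariate mechanism: including 1/pi_j as a regressor
# in the respondents-only least-squares fit of g forces, via the normal
# equations, sum_{R_j=1} (Y_j - g(X_j)) / pi_j = 0, so the augmented
# correction term is exactly zero. Data and pi below are illustrative.
rng = np.random.default_rng(3)
N = 5000
X = rng.normal(size=N)
Y = np.sin(X) + 2 * X + rng.normal(size=N)
pi = 1 / (1 + np.exp(-X))
R = rng.random(N) < pi

# Design matrix with intercept, X, and the inverse propensity as predictors.
D = np.column_stack([np.ones(N), X, 1 / pi])
beta, *_ = np.linalg.lstsq(D[R], Y[R], rcond=None)
g = D @ beta

# The inverse-propensity correction term is numerically zero ...
assert abs(np.mean(R * (Y - g) / pi)) < 1e-8
# ... hence the prediction and augmented estimators agree exactly.
assert abs(np.mean(g) - np.mean(g + R * (Y - g) / pi)) < 1e-8
```

This orthogonality is the finite-population analogue of the TMLE targeting step, which updates the initial fit precisely so that the relevant score (the correction term above) is solved exactly.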
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022