Comments on “Statistical inference with non-probability survey samples” – Miniaturizing data defect correlation: A versatile strategy for handling non-probability samples
Section 4. Quasi-randomization or super-population implementations


In a nutshell, the quasi-randomization approach focuses on making $W_{I} π_{I}$ a constant variable (induced by FPI $I$). When our sample is genuinely selected by a probabilistic scheme by design, $π_{i} = \Pr (R_{i} = 1 | x_{i}),$ for $i \in N,$ is a design probability, free of $y_{i},$ but it can depend on $x_{i},$ for example when $x_{i}$ includes a stratifying variable. When the design probability is unavailable, we first need to invoke a divine probability. This could be a natural one given by the finite population, such as the propensity $π_{i} = \Pr_{I} (R_{I} = 1 | A_{I} = A_{i})$ induced by FPI, where $A_{i} = \{y_{i}, x_{i}\},$ or an imagined super-population one, such as the $R_{i}$'s being generated independently from $Ber (π_{i}),$ where $π_{i} = \Pr (R_{i} = 1 | A_{i}) > 0.$ This positivity assumption is necessary if the finite population is pre-specified; otherwise, its imposition defines the finite population that can be studied. (This is a rather relevant practical consideration, such as in election polling, where the finite population may not always be pre-specified, even theoretically.) Since these divine probabilities are unknown and serve as our estimand, we need to assume some device probabilities, such as via a generalized linear model $π_{i} = g (y_{i}, x_{i}),$ to proceed, even though we don’t really believe in any particular choice of $g.$

For our current discussion, suppose our divine probability is given by the super-population Bernoulli model. Let $n_{R} = \sum_{i = 1}^{N} R_{i},$ and $\tilde{p} (A) = \Pr (n_{R} > 0 | A) = 1 - \prod_{i \in N} (1 - π_{i}),$ where $A = \{A_{i}, i \in N\}.$ Because the $R_{i}$ here is controlled by a divine probability, the sample size $n_{R}$ is no longer a design variable to be conditioned upon in our replication scheme; it is generally no longer an ancillary statistic. Nevertheless, we should condition on $n_{R} > 0,$ a universal requirement for constructing data-driven estimates for $\bar{G}.$ Fortunately, this conditioning does not complicate the mathematical simplicity granted by the independence among $π_{i}, i \in N,$ as functions of $A_{i}.$ This is because ${\tilde{π}}_{i} (A) \equiv \Pr (R_{i} = 1 | A, n_{R} > 0) = π_{i} / \tilde{p} (A),$ and the normalizing constant $\tilde{p} (A),$ which depends on the entire $A,$ is not relevant for the developments in this article, such as assigning weights that are proportional to ${\tilde{π}}_{i}^{- 1} (A).$
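The identity ${\tilde{π}}_{i} (A) = π_{i} / \tilde{p} (A)$ can be checked by brute force on a toy population. The sketch below is only an illustration: the four propensities in `pi` are made-up values, and `conditional_inclusion` is a hypothetical helper that enumerates all $2^{N}$ outcomes of independent Bernoulli indicators.

```python
from itertools import product
from math import prod


def conditional_inclusion(pi):
    """Exact Pr(R_i = 1 | n_R > 0) under independent R_i ~ Ber(pi_i),
    obtained by enumerating every outcome vector r in {0, 1}^N."""
    n = len(pi)
    p_nonempty = 0.0      # accumulates Pr(n_R > 0)
    joint = [0.0] * n     # accumulates Pr(R_i = 1, n_R > 0)
    for r in product((0, 1), repeat=n):
        if sum(r) == 0:
            continue      # drop the empty sample
        prob = prod(p if ri else 1.0 - p for p, ri in zip(pi, r))
        p_nonempty += prob
        for i, ri in enumerate(r):
            if ri:
                joint[i] += prob
    return [j / p_nonempty for j in joint], p_nonempty


# Hypothetical propensities, chosen only for illustration.
pi = [0.05, 0.2, 0.5, 0.1]
cond, p_tilde = conditional_inclusion(pi)

# Closed forms from the text: p~(A) = 1 - prod(1 - pi_i), pi~_i = pi_i / p~(A).
p_tilde_closed = 1.0 - prod(1.0 - p for p in pi)
cond_closed = [p / p_tilde_closed for p in pi]

for p, c, cc in zip(pi, cond, cond_closed):
    print(f"pi_i = {p:.2f}  enumerated = {c:.6f}  closed form = {cc:.6f}")
```

Because every ${\tilde{π}}_{i} (A)$ is rescaled by the same constant, weights proportional to ${\tilde{π}}_{i}^{- 1} (A)$ are also proportional to $π_{i}^{- 1},$ which is why the normalizing constant can be ignored.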

Consequently, under this divine probability, which corresponds to (the true model for) the $q$ -model setting in Wu (2022), we have for any chosen $W_{I},$ by (3.1)

$\begin{aligned} E (c_{\tilde{R}, z} | A, n_{R} > 0) &= {Cov}_{I} (W_{I} E [R_{I} | A, n_{R} > 0], y_{I} - m (x_{I})) \\ &= {\tilde{p}}^{- 1} (A) {Cov}_{I} (W_{I} π_{I}, y_{I} - m (x_{I})), \qquad (4.1) \end{aligned}$

where $E$ is with respect to the (unknown) divine probability over $R_{I}$ (for fixed $I$). It follows that, regardless of whether we want to ensure zero expectation in (3.2) or in (4.1), we will impose $W_{I} π_{I} \propto 1,$ that is, $W_{I} \propto π_{I}^{- 1},$ the well-known inverse probability weighting. Therefore, if our postulated model $q$ permits us to reliably capture $π_{i}$ in reality, then $c_{\tilde{R}, z} = O_{p} (N^{- 1 / 2}),$ because it has mean zero (with respect to the divine probability) and it is a weighted average of $N$ essentially independent Bernoulli variables, as seen in (3.1).
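A small simulation can make the effect of $W_{I} \propto π_{I}^{- 1}$ concrete. The sketch below rests on several assumptions that are not from the source: $c_{\tilde{R}, z}$ is taken to be the finite-population covariance between $W_{I} R_{I}$ and $z_{I},$ as the form of (4.1) suggests; the population, the logistic propensities that depend on $y_{i},$ and the constant choice of $m$ are all invented for illustration; and the conditioning on $n_{R} > 0$ is ignored because $\tilde{p} (A)$ is essentially one at this population size.

```python
import random
from math import exp


def fp_cov(u, v):
    """Finite-population covariance Cov_I(u_I, v_I), with I uniform on the N units."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


rng = random.Random(2022)
N = 2000

# A fixed finite population A = {(y_i, x_i)}; all numbers are illustrative.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Hypothetical divine propensities that depend on y_i (a logistic form).
pi = [1.0 / (1.0 + exp(0.5 - 0.3 * yi)) for yi in y]

# Take m(x) constant; a covariance is unchanged by constant shifts,
# so z_i = y_i - mean(y) is used purely for convenience.
y_bar = sum(y) / N
z = [yi - y_bar for yi in y]


def expected_c(w):
    """Cov_I(W_I pi_I, z_I): the expectation of c over R for fixed A."""
    return fp_cov([wi * p for wi, p in zip(w, pi)], z)


def average_simulated_c(w, reps=200):
    """Monte Carlo average of c = Cov_I(W_I R_I, z_I) over replications of R."""
    total = 0.0
    for _ in range(reps):
        r_tilde = [wi if rng.random() < p else 0.0 for wi, p in zip(w, pi)]
        total += fp_cov(r_tilde, z)
    return total / reps


w_equal = [1.0] * N               # no weighting
w_ipw = [1.0 / p for p in pi]     # inverse probability weighting

exp_equal, exp_ipw = expected_c(w_equal), expected_c(w_ipw)
sim_equal, sim_ipw = average_simulated_c(w_equal), average_simulated_c(w_ipw)
print(f"equal weights: expected {exp_equal:.4f}, simulated {sim_equal:.4f}")
print(f"IPW weights:   expected {exp_ipw:.4f}, simulated {sim_ipw:.4f}")
```

With equal weights the expected covariance stays away from zero because $π_{I}$ varies with $y_{I};$ with inverse probability weights $W_{I} π_{I}$ is constant, so the expectation vanishes and only sampling noise remains. The sketch uses the true $π_{i},$ which are of course unavailable in practice.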

This is a randomization-oriented approach because it treats the entire finite population attribute values $A$ as fixed, and the hypothetical replications are generated only by repeated realizations of the recording indicator $R_{I}.$ Of course, in general, the values of $\{π_{i}, i \in N\}$ are unknown, and worse, they are inestimable from a non-probability sample without further assumptions. To proceed, we pose assumptions such as missing at random, i.e., $\Pr (R_{i} = 1 | A_{i}) = \Pr (R_{i} = 1 | x_{i}),$ and we require an auxiliary sample so that we have some values of $x_{i}$ with $R_{i} = 0.$ We also have choices on how to estimate the inclusion propensity $π_{i} = \Pr (R_{i} = 1 | x_{i}),$ parametrically or non-parametrically. These assumptions, requirements, and estimation methods are all essential for practical implementation, as carefully reviewed and discussed by Wu (2022); also see Tan (2010) for a detailed comparison of various estimation strategies. Nevertheless, the overarching idea of quasi-randomization methods is to choose $W_{I}$ to free ${\tilde{R}}_{I} = W_{I} R_{I}$ from $I$ in expectation over the posited hypothetical replications, to regain the freedom guaranteed by probability sampling.

Complementarily, the super-population approaches aim to miniaturize $c_{\tilde{R}, z}$ by making the other variable in $c_{\tilde{R}, z},$ that is, $z_{I},$ free of $I$ in expectation, but over a different hypothetical replication scheme. Here the idea is to choose an $m (x_{i})$ that is a good approximation to $y_{i},$ such that the residual $z_{i} = y_{i} - m (x_{i})$ will be zero in expectation conditioning on $x.$ Typically, this is done by considering a joint model for $\{R_{i}, y_{i}\}$ given $x_{i},$ with a specific regression model $ξ (y | x),$ using the notation in Wu (2022). It is important to recognize that, although we only specify the regression model of $y_{i}$ given $x_{i},$ we must include $R_{i}$ in the replications in order to capture the possible dependence of $R_{i}$ on the entire $A_{i} = \{y_{i}, x_{i}\},$ which is the key concern for non-probability samples. Indeed, it is this joint specification that permits the adoption of the missing at random assumption to reduce $P (y_{i} | x_{i}, R_{i})$ to $P (y_{i} | x_{i}),$ which in turn permits us to focus on specifying a single regression model $ξ (y_{i} | x_{i})$ for both observed and unobserved individuals. Therefore, when we write $E_{ξ},$ we mean the expectation with respect to

$P (R_{i}, y_{i} | x_{i}) = P (R_{i} | x_{i}) P (y_{i} | R_{i}, x_{i}) = π_{i}^{R_{i}} {(1 - π_{i})}^{1 - R_{i}} ξ (y_{i} | x_{i}), \qquad (4.2)$

where $π_{i} = \Pr (R_{i} = 1 | x_{i})$ is left unspecified, unlike with the quasi-randomization approach.

It follows then that, conditioning on $X = \{x_{i}, i \in N\}$ and $n_{R} > 0,$ which does not alter $P (y | X)$ because $y$ and $R$ are independent given $X,$ we have

$E (c_{\tilde{R}, z} | X, n_{R} > 0) = {[\tilde{p} (X)]}^{- 1} {Cov}_{I} (W_{I} π_{I}, E [y_{I} | x_{I}] - m (x_{I})). \qquad (4.3)$

Clearly, (4.3) becomes zero when we choose $m (x_{I}) = E_{ξ} [y_{I} | x_{I}]$ and the $ξ$ model is (first-order) correctly specified, that is, $E_{ξ} [y_{I} | x_{I}] = E [y_{I} | x_{I}].$ This summarizes the super-population approach, and it renders $c_{\tilde{R}, z} = O_{p} (N^{- 1 / 2})$ for reasons similar to those given for the quasi-randomization framework.
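The same kind of toy simulation illustrates (4.3). Everything specific in this sketch is an assumption made for illustration: the covariates, the linear form of $E [y | x],$ the logistic propensity that depends on $x$ only (missing at random), the choice $W_{I} = 1,$ and the reading of $c_{\tilde{R}, z}$ as the finite-population covariance between $W_{I} R_{I}$ and $z_{I}.$ Replications redraw both $y$ and $R$ given the fixed $X,$ following (4.2).

```python
import random
from math import exp


def fp_cov(u, v):
    """Finite-population covariance Cov_I(u_I, v_I), with I uniform on the N units."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


rng = random.Random(4)
N = 1000

# Fixed covariates X; the propensities depend on x only (missing at random)
# and are never used to build weights here.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
pi = [1.0 / (1.0 + exp(0.5 - 0.8 * xi)) for xi in x]


def true_mean(xi):
    """E[y | x] under the hypothetical data-generating model."""
    return 1.0 + 2.0 * xi


def average_c(m, reps=200):
    """Average of c = Cov_I(R_I, y_I - m(x_I)) over joint replications of
    (R, y) given X, with W_I = 1 for every unit."""
    m_x = [m(xi) for xi in x]
    total = 0.0
    for _ in range(reps):
        y = [true_mean(xi) + rng.gauss(0.0, 1.0) for xi in x]
        r = [1.0 if rng.random() < p else 0.0 for p in pi]
        z = [yi - mi for yi, mi in zip(y, m_x)]
        total += fp_cov(r, z)
    return total / reps


c_correct = average_c(true_mean)               # m(x) = E[y | x]
c_misspecified = average_c(lambda xi: 1.0)     # m(x) ignores x

# Right-hand side of (4.3) for the misspecified m, without the p~(X) factor.
c_misspecified_theory = fp_cov(pi, [true_mean(xi) - 1.0 for xi in x])

print(f"correct m:      {c_correct:.4f}")
print(f"misspecified m: {c_misspecified:.4f} (theory {c_misspecified_theory:.4f})")
```

No weights are estimated in this sketch, which mirrors the point that $π_{i}$ is left unspecified in (4.2): a first-order correct $m$ removes the covariance in expectation on its own, whereas a misspecified $m$ leaves a covariance between $π_{I}$ and $E [y_{I} | x_{I}] - m (x_{I}).$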

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word, to the Editor (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2022-12-15
