Are probability surveys bound to disappear for the production of official statistics?
Section 3. Design-based approaches
Design-based approaches yield design-consistent estimators of the population total $t_y = \sum_{i \in U} y_i$ even when the non-probability source produces estimates with a significant selection bias. In this context, the purpose of using a non-probability sample is to reduce the variance of estimators of $t_y$. The efficiency gains achieved can be used to justify a reduction of the probability sample size, and thereby a reduction of data collection costs and respondent burden. The methods that we consider in Sections 3.1 and 3.2 require collecting the values of the variable of interest $y$ in the probability sample, just like the small area estimation methods described in Section 4.4. However, the efficiency gains are usually expected to be more modest than those obtained using small area estimation methods. In Section 3.1, we consider the scenario in which the value $y_i^{NP}$ provided by the non-probability source for a unit $i \in s_{NP}$ is free of measurement error, i.e., $y_i^{NP} = y_i$, whereas in Section 3.2, we consider the scenario in which this assumption is questionable and the non-probability data are used only as auxiliary information.
3.1 Weighting by the inverse of the probability of inclusion in the combined sample
The ideal case occurs when the non-probability sample is a census, i.e., $s_{NP} = U$. In that case, the value of the parameter of interest $t_y$ can be directly calculated without worrying about bias or variance since the absence of measurement error, $y_i^{NP} = y_i$, is assumed in this section. In general, we expect under-coverage in the sense that the non-probability sample $s_{NP}$ is smaller than the population $U$. In a design-based approach, the potential under-coverage bias can be addressed by selecting a probability sample $s_P$ from $U$ and collecting the values of the variable of interest $y$ for the sample units. Ideally, the probability sample is drawn from $U - s_{NP}$, but it is possible that the units in $s_{NP}$ cannot be linked to those of the sampling frame to establish the set $U - s_{NP}$. In general, the larger the non-probability sample, the more it is possible to reduce the size of the probability sample without jeopardizing the desired precision of the estimates.
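Written out, the decomposition that underlies this strategy is simply
$$t_y = \sum_{i \in U} y_i = \sum_{i \in s_{NP}} y_i + \sum_{i \in U - s_{NP}} y_i,$$
where the first term is known exactly under the no-measurement-error assumption of this section and only the second term needs to be estimated from the probability sample.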
It seems desirable to estimate $t_y$ using all the data collected in the combined sample $s_{NP} \cup s_P$. The indicator of inclusion in $s_{NP} \cup s_P$ can be defined as $I_i^{NP \cup P} = \delta_i + (1 - \delta_i) I_i^P$, where $\delta_i = 1$ if $i \in s_{NP}$ and $\delta_i = 0$ otherwise, and $I_i^P$ is the analogous indicator for $s_P$. To obtain a design-unbiased estimator of $t_y$, each unit $i \in s_{NP} \cup s_P$ is weighted by $1/\pi_i^{NP \cup P}$, where $\pi_i^{NP \cup P} = \Pr(i \in s_{NP} \cup s_P)$. Under assumptions 1 and 2, $\pi_i^{NP \cup P} = \delta_i + (1 - \delta_i)\pi_i^P$, where $\pi_i^P = \Pr(i \in s_P)$, and we obtain a weight of 1 for the units of $s_{NP}$ and of $1/\pi_i^P$ for the units of $s_P$ that are not in $s_{NP}$. The resulting estimator is written:
$$\hat{t}_y^{\,NP \cup P} = \sum_{i \in s_{NP} \cup s_P} \frac{y_i}{\pi_i^{NP \cup P}} = \sum_{i \in s_{NP}} y_i + \sum_{i \in s_P} \frac{(1 - \delta_i)\, y_i}{\pi_i^P}. \qquad (3.1)$$
Note that estimator (3.1) requires the indicator $\delta_i$ to be available for all units in the sample $s_P$. For the units $i \in s_{NP} \cap s_P$, we have two values: $y_i^{NP}$, reported in the non-probability source, and $y_i$, collected in the probability sample. In principle, we should have $y_i^{NP} = y_i$, but it is possible that this relationship is not exactly satisfied. These units can be used to validate the assumption $y_i^{NP} = y_i$. If significant differences are observed, it may be preferable not to consider this approach and to rely on the methods in Section 3.2 that use data from the non-probability source as auxiliary data. If we trust the data quality of the non-probability source, it may be advisable not to collect the variable $y$ in the probability sample for the units also present in the non-probability sample, in order to reduce data collection costs and respondent burden.
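As a rough illustration of this validation step, a simple comparison on the overlap units might look as follows (hypothetical data and a purely descriptive check, not a procedure prescribed here):

import numpy as np

# Hypothetical values for the overlap units i in both s_NP and s_P:
# y_np as reported in the non-probability source, y as collected in s_P.
y_np = np.array([12.0, 7.5, 3.2, 9.1, 4.4])
y    = np.array([12.0, 7.5, 3.0, 9.1, 4.4])

diff = y_np - y
print("mean difference:", diff.mean())              # evidence of systematic error?
print("exact agreement rate:", np.mean(diff == 0))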
We can view the problem as if we had two sampling frames: $U$ and $s_{NP}$. A sample $s_P$ is drawn randomly from $U$, and a census is taken from $s_{NP}$. The probability of selection in the combined sample $s_{NP} \cup s_P$ can then be calculated for each unit $i \in s_{NP} \cup s_P$, and the estimator (3.1) is recovered by weighting each unit by the inverse of that probability. This approach was proposed by Bankier (1986) to address the problem of multiple sampling frames. In the context of integrating a probability and a non-probability sample, estimator (3.1) was proposed by Kim and Tam (2020).
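To fix ideas, a minimal numerical sketch of estimator (3.1) follows; the arrays are hypothetical, and the sketch assumes that $y_i$ is observed for every unit of the combined sample, that the inclusion probabilities $\pi_i^P$ are known, and that $\delta_i$ is observed for every unit of $s_P$:

import numpy as np

# Hypothetical non-probability sample: y observed, weight 1.
y_np = np.array([4.0, 6.5, 3.1, 8.2])

# Hypothetical probability sample: y observed, inclusion probabilities pi_p
# known, and delta = 1 flags units that also belong to s_NP.
y_p   = np.array([5.0, 2.2, 7.4, 9.0, 1.8])
pi_p  = np.array([0.01, 0.02, 0.01, 0.05, 0.02])
delta = np.array([0, 1, 0, 0, 1])

# Estimator (3.1): the sum over s_NP plus the weighted sum over the units
# of s_P that are not in s_NP.
t_hat = y_np.sum() + np.sum((1 - delta) * y_p / pi_p)
print(t_hat)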
The last sum of (3.1) is a design-unbiased estimator of $\sum_{i \in U - s_{NP}} y_i$, the total of $y$ over the portion of the population not covered by the non-probability sample. If a vector of auxiliary variables, $\mathbf{x}_i$, is available for the units $i \in s_P$, as well as its population total $\mathbf{t}_x = \sum_{i \in U} \mathbf{x}_i$, then the design weight $1/\pi_i^P$ in (3.1) can be replaced with a calibrated weight $w_i$ (e.g., Deville and Särndal, 1992; Haziza and Beaumont, 2017). The calibrated weights minimize a distance function between $w_i$ and $1/\pi_i^P$ under the constraint of satisfying the calibration equation $\sum_{i \in s_P} w_i \mathbf{x}_i = \mathbf{t}_x$. Ideally, the calibration is done only on the portion of the population not covered by the non-probability sample, i.e., the calibration vector $(1 - \delta_i)\mathbf{x}_i$ is used, and the calibration equation becomes: $\sum_{i \in s_P} w_i (1 - \delta_i)\mathbf{x}_i = \sum_{i \in U - s_{NP}} \mathbf{x}_i$. This is not possible when the total $\sum_{i \in U - s_{NP}} \mathbf{x}_i$ is unknown.
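For concreteness, one common choice of distance function, the chi-square distance considered by Deville and Särndal (1992), gives calibrated weights in closed form. Writing $d_i = 1/\pi_i^P$,
$$w_i = d_i\left(1 + \mathbf{x}_i^{\top}\hat{\boldsymbol{\lambda}}\right), \qquad \hat{\boldsymbol{\lambda}} = \left(\sum_{i \in s_P} d_i\, \mathbf{x}_i \mathbf{x}_i^{\top}\right)^{-1}\left(\mathbf{t}_x - \sum_{i \in s_P} d_i\, \mathbf{x}_i\right),$$
which reproduces the totals $\mathbf{t}_x$ exactly; other distance functions lead to weights that must be computed iteratively.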
Remark: If assumption 2 is not appropriate, then $\pi_i^{NP \cup P} \neq \delta_i + (1 - \delta_i)\pi_i^P$. To get around this problem, all the units for which the data were collected after selecting the sample $s_P$ can be removed from $s_{NP}$. Assumption 2 is then satisfied, but a lot of available data may be omitted. To take advantage of the full set $s_{NP}$, it is necessary to make a few assumptions and partially depart from the design-based approach. Assuming that the selection of the probability sample does not depend on participation in the non-probability sample, i.e., $\Pr(i \in s_P \mid \delta_i) = \pi_i^P$, we can use Bayes' theorem to show that $\Pr(\delta_i = 1 \mid i \in s_P) = \Pr(\delta_i = 1) \equiv p_i$ for the units $i \in s_P$. Therefore, estimating $\pi_i^{NP \cup P} = p_i + (1 - p_i)\pi_i^P$ requires postulating a model for the participation probability $p_i$. Under some assumptions, $p_i$ can be estimated using the data from the probability sample and, for example, a logistic regression model. Estimating $p_i$ can also be done using the methods described in Section 4.3 that do not rely on the validity of assumption 2, such as the method by Chen, Li and Wu (2019). These methods require that the auxiliary variables used to model this probability be available for all units of the combined sample $s_{NP} \cup s_P$. Unlike in Section 4.3, here we can take advantage of the availability of $y_i$ for all units of both samples, and we can use the variable of interest as an auxiliary variable. Then, $t_y$ is estimated by replacing $\pi_i^{NP \cup P} = \delta_i + (1 - \delta_i)\pi_i^P$ in (3.1) with an estimate of $p_i + (1 - p_i)\pi_i^P$. Similar approaches were proposed by Beaumont, Bocci and Hidiroglou (2014) to take into account late respondents in Statistics Canada's National Household Survey, i.e., households that responded to the initial questionnaire after the follow-up probability sample of non-respondents was drawn.
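A rough computational sketch of this adjustment, under the simplified setting described above, might look as follows (hypothetical data; a logistic model for $p_i$ fitted on the probability sample, where both $\delta_i$ and $y_i$ are observed, then applied to every unit of the combined sample):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical probability sample s_P: y and delta observed for every unit,
# design inclusion probabilities pi_p known.
y_sp     = np.array([5.0, 2.2, 7.4, 9.0, 1.8, 6.3, 4.9, 3.3])
pi_sp    = np.array([0.01, 0.02, 0.01, 0.05, 0.02, 0.03, 0.01, 0.04])
delta_sp = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Hypothetical units of s_NP that are not in s_P, with their (assumed known)
# probabilities of inclusion in s_P.
y_snp  = np.array([2.5, 6.1, 3.8, 5.5])
pi_snp = np.array([0.02, 0.01, 0.03, 0.02])

# Model p_i = Pr(delta_i = 1) as a function of y, fitted on s_P; the variable
# of interest can serve as a predictor because it is observed in both samples.
model = LogisticRegression().fit(y_sp.reshape(-1, 1), delta_sp)

# Combined sample s_NP U s_P: the s_NP units not in s_P, plus all of s_P.
y_all  = np.concatenate([y_snp, y_sp])
pi_all = np.concatenate([pi_snp, pi_sp])
p_all  = model.predict_proba(y_all.reshape(-1, 1))[:, 1]

# Estimated probability of inclusion in the combined sample, and the
# corresponding adjusted version of estimator (3.1).
pi_comb = p_all + (1.0 - p_all) * pi_all
t_hat = np.sum(y_all / pi_comb)
print(t_hat)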
3.2 Calibration of the probability sample to the non-probability source
Data from non-probability sources, such as those provided by web panel respondents, can be fraught with measurement errors large enough to cast doubt on the assumption that $y_i^{NP} = y_i$. Therefore, such data cannot be used to directly replace the values of the variable of interest $y$. However, they can be used as auxiliary data to enhance the probability survey using the calibration technique. The non-probability source contains the values $y_i^{NP}$ for $i \in s_{NP}$, and potentially the values of other variables. From all these variables, it is possible to form a vector of auxiliary variables $\mathbf{x}_i^{NP}$, available for $i \in s_{NP}$, that could include an intercept. Its total is denoted as $\mathbf{t}_x^{NP} = \sum_{i \in s_{NP}} \mathbf{x}_i^{NP}$. Another vector of auxiliary variables, $\mathbf{x}_i$, may also be available for $i \in s_P$, as well as its total for the entire population, $\mathbf{t}_x = \sum_{i \in U} \mathbf{x}_i$. The calibrated weights $w_i$, $i \in s_P$, are obtained by minimizing a distance function between $w_i$ and $1/\pi_i^P$ under the constraint of satisfying the calibration equation
$$\sum_{i \in s_P} w_i \begin{pmatrix} \delta_i \mathbf{x}_i^{NP} \\ \mathbf{x}_i \end{pmatrix} = \begin{pmatrix} \mathbf{t}_x^{NP} \\ \mathbf{t}_x \end{pmatrix}.$$
Note that this calibration can be done only if $\delta_i$ and, when $\delta_i = 1$, $\mathbf{x}_i^{NP}$ are available in the probability sample for all units $i \in s_P$. The estimator of $t_y$ is again written as $\hat{t}_y = \sum_{i \in s_P} w_i\, y_i$, where $w_i$ is the calibrated weight satisfying the above calibration equation. No model assumption is required for the validity of the approach, and the resulting estimates remain design-consistent regardless of the strength of the relationship between $y_i$ and the auxiliary variables $\mathbf{x}_i^{NP}$ and $\mathbf{x}_i$. A strong relationship will, however, help reduce the design variance of $\hat{t}_y$. Kim and Tam (2020) discuss the use of such calibration.
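As a sketch of how such weights might be computed, the following uses the chi-square distance (one possible choice, not prescribed here) with entirely hypothetical data; the calibration vector stacks $\delta_i \mathbf{x}_i^{NP}$, taken here to be $\delta_i y_i^{NP}$, with an intercept calibrated to the population size:

import numpy as np

# Hypothetical probability sample: design weights d = 1/pi_p, the indicator
# delta, the value y_np reported by the non-probability source (set to 0
# when delta = 0), and the variable of interest y collected in the survey.
d     = np.array([100.0, 80.0, 120.0, 90.0, 110.0])
delta = np.array([1, 0, 1, 0, 1])
y_np  = np.array([3.1, 0.0, 2.4, 0.0, 4.0])
y     = np.array([5.0, 2.2, 7.4, 9.0, 1.8])

# Calibration vector z_i = (delta_i * y_np_i, 1) and its known totals:
# the total of y_np over s_NP and the population size N (both hypothetical).
Z = np.column_stack([delta * y_np, np.ones_like(y)])
t = np.array([1500.0, 500.0])

# Chi-square distance calibration: w = d * (1 + z' lambda).
lam = np.linalg.solve((Z * d[:, None]).T @ Z, t - Z.T @ d)
w = d * (1.0 + Z @ lam)

print(Z.T @ w)        # reproduces the calibration totals
print(np.sum(w * y))  # calibrated estimator of t_y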
Canada’s Labour
Force Survey (LFS) provides an example of a potential application for this
calibration method. The unemployment rate, defined as the number of unemployed
persons divided by the number of persons in the labour force, is a key
parameter of interest that the LFS estimates. To improve the precision of the
LFS estimates, a calibration variable indicating whether an individual is
receiving employment insurance could be effective because there is a clear connection between receiving employment insurance and being unemployed. The
total of this calibration variable, the number of employment insurance
beneficiaries, is needed for implementing this calibration and is available
from an administrative source. However, applying this method would require
adding a question to the LFS to identify LFS respondents who are receiving employment
insurance. This information could also be obtained through a linkage between
the LFS and the administrative source. It remains to be determined whether such
a calibration variable could yield significant gains in the LFS.