Estimation of response propensities and indicators of representative response using population-level information
Section 6. Discussion
The extension of sample-based to population-based estimators of R-indicators comprises two steps: 1) the estimation of response propensities, and 2) the estimation of the R-indicators based on these propensities. The population-based estimation of response propensities is straightforward when linear models are assumed for response propensities and response influences. The linear link function is reasonable when estimating response propensities under the response rates typical of large-scale national social surveys, as shown in the evaluation study in Section 4. The sample-based estimators contain sample covariance matrices and sample frequencies that can be replaced by population covariance matrices or population frequencies. We identified two types of settings: when population cross-products are available, and when auxiliary information is restricted to marginal population counts only. We labelled the corresponding estimators as Type 1 and Type 2 estimators, respectively. The Type 2 setting is more restrictive than the Type 1 setting.
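Under the linear link, the sample-based propensity estimator reduces to ordinary least squares of the response indicator on the auxiliary variables, and the Type 1 estimator replaces the sample cross-product matrix by its scaled population counterpart. The following is a minimal simulation sketch of this idea, not the paper's exact estimators; all data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of N units: intercept plus two auxiliary variables.
N = 100_000
X_pop = np.column_stack([np.ones(N), rng.normal(size=N), rng.binomial(1, 0.4, N)])

# "True" linear response propensities (illustrative coefficients).
beta_true = np.array([0.6, 0.05, -0.1])
rho_true = np.clip(X_pop @ beta_true, 0.01, 0.99)

# Draw a simple random sample and generate its response set.
n = 2_000
s = rng.choice(N, size=n, replace=False)
X_s = X_pop[s]
r = rng.binomial(1, rho_true[s])          # response indicator on the sample

# Sample-based estimator: OLS of r on X using sample cross-products.
# Note X_s' r only involves respondents (r = 0 for nonrespondents).
beta_sample = np.linalg.solve(X_s.T @ X_s, X_s.T @ r)

# Type 1 population-based estimator: replace the sample cross-product
# matrix by its population counterpart, scaled by the sampling fraction.
beta_type1 = np.linalg.solve((n / N) * (X_pop.T @ X_pop), X_s.T @ r)

rho_hat = X_s @ beta_type1                # estimated propensities on the sample
```

The key practical point visible in the sketch is that the Type 1 normal equations need individual auxiliary data only for respondents; the cross-product matrix comes from the population side.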
Following the estimation of population-based response propensities, we constructed population-based estimators for the R-indicator and examined their properties both theoretically and empirically. The estimators were applied to samples drawn from real data from the 1995 Israel Census, where “true” propensities were calculated according to realistic assumptions for national household social surveys. We have thus addressed the first two research questions posed at the beginning of the paper: how to extend sample-based response propensities and R-indicators to population-based response propensities and R-indicators, and what the statistical properties of population-based R-indicators are.
There are many options for the estimation of R-indicators, depending on the response to the survey. We used propensity-weighted response means since the propensities are available. However, any calibration method can be used, such as linear weighting or adjustment classes. In fact, the set of auxiliary variables used for the estimation of the R-indicators may be a subset of the auxiliary variables used for the estimation of propensities and influences. Parsimonious models may prove more efficient, as it is known that propensity weighting may seriously affect the precision of the estimators. This is a topic for future research.
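Given estimated propensities, the R-indicator itself follows the usual definition R = 1 − 2S(ρ), where S(ρ) is the standard deviation of the response propensities (Schouten, Cobben and Bethlehem, 2009). A small sketch, using a plain weighted variance rather than any particular bias-adjusted form from the paper:

```python
import numpy as np

def r_indicator(propensities, weights=None):
    """R-indicator: R = 1 - 2 * S(rho), where S(rho) is the (optionally
    design-weighted) standard deviation of the response propensities."""
    rho = np.asarray(propensities, dtype=float)
    w = np.ones_like(rho) if weights is None else np.asarray(weights, dtype=float)
    mean = np.sum(w * rho) / np.sum(w)
    var = np.sum(w * (rho - mean) ** 2) / np.sum(w)
    return 1.0 - 2.0 * np.sqrt(var)

# Perfectly representative response: all propensities equal, so R = 1.
print(r_indicator([0.5, 0.5, 0.5]))
```

R ranges from 1 (identical propensities, fully representative response) down toward 0 as the propensities spread out.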
The two properties we examined are the bias and standard errors of the proposed population-based R-indicators. As expected, the bias and standard errors depend on the size of the sample and the type of auxiliary information available: the smaller the sample, the larger the bias and the standard error. When samples are smaller, it becomes more difficult to distinguish sampling variation from response variation. Clearly, the confidence intervals become wider as there is less information in small samples.
The bias-adjusted Type 1 estimators (population cross-products) perform better than the bias-adjusted Type 2 estimators (population marginal counts). This is as expected, given that they employ more information. However, the unadjusted Type 2 estimators have better RRMSE properties than the unadjusted Type 1 estimators. This is a surprising result and points to a suboptimal use of the population cross-products when they are used as “plug-ins” that do not account for any sampling variation. The standard errors of the population-based estimators are larger than those of their sample-based counterparts.
The evaluation study in scenario RR3 shows that, for very high response rates, the population-based R-indicators have larger standard errors and larger bias, mainly due to propensities being estimated outside the interval [0, 1]. For this reason, we proposed a composite estimator with smoothing parameters that vary with the response rate. Standard errors were reduced, but at the cost of increased bias.
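As a rough illustration of the composite idea (the specific smoothing weights used in the paper are not reproduced here), one can shrink the estimated propensities toward the overall response rate, shrinking more at high response rates, and then clip to [0, 1]:

```python
import numpy as np

def composite_propensities(rho_hat, response_rate, lam=None):
    """Illustrative composite (shrinkage) estimator: pull estimated
    propensities toward the overall response rate, then clip to [0, 1].
    The choice lam = 1 - response_rate is a hypothetical rule that smooths
    more aggressively at high response rates, where linear-link estimates
    often fall outside [0, 1]; the paper's scheme may differ."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    if lam is None:
        lam = 1.0 - response_rate
    rho_c = lam * rho_hat + (1.0 - lam) * response_rate
    return np.clip(rho_c, 0.0, 1.0)
```

Shrinking toward a constant reduces the spread of the propensities, which lowers the standard error of the resulting R-indicator but biases it toward 1, matching the trade-off reported in the evaluation.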
From the analyses it becomes apparent that the bias of the Type 1 and Type 2 estimators depends on the number of auxiliary variables, although this dependence was modest in our evaluations. The bias may increase when detailed models with many variables are used for the estimation of response propensities. The rationale is that detailed models allow more sampling variation to be picked up as bias.
The population-based R-indicators have a number of caveats:
Firstly, the choice of auxiliary information that is available at a national level may be more limiting than sample-based auxiliary information, depending on the availability of registers and administrative data. The selection of auxiliary variables should depend on whether they are correlated with the survey target variables. It is also strongly recommended to use population statistics based on registers or administrative data rather than weighted survey counts from other surveys, since the latter may not reflect the true population distribution accurately. One would draw erroneous conclusions about the representativeness of the response if the population estimates are biased.
Secondly, we assume that the survey measures the same quantities as the population information, and we do not investigate the effect of possible departures from this assumption. However, we note that there is an inherent risk of measurement error when comparing survey questions to population statistics. It must be ascertained that the survey questions employed have the same definitions and classifications as the population tables. Hence, it is best to avoid questions that are prone to measurement error, such as questions that require strong cognitive effort or that may elicit socially desirable answers.
Thirdly, in settings where only population information is available, options to improve representativeness during data collection are much more limited, since there is no individual auxiliary information available for the nonrespondents. Nonetheless, in these settings, assessments of representativeness may still be useful in the design of advance and reminder letters, in interviewer training and in paradata collection.
Finally, we do not consider hybrid settings where the R-indicator is based on both linked data and population tables. Nor do we deal with the case where weighted survey estimates could be used when no aggregated population information is available. These choices affect both the bias and the variance estimates of the population-based R-indicators. Such extensions are relatively straightforward but are left to future papers.
Research into population-based R-indicators is still at an early stage, and it is too early to provide a definitive answer to the last research question presented in the introduction, regarding the feasibility and practicability of R-indicators based on aggregate population auxiliary information. As mentioned in the introduction, further uses of these R-indicators are being explored in the context of evaluating and monitoring streamed administrative data and assessing the representativeness of linked records. In addition, Schouten et al. (2011) introduced partial R-indicators under sample-based auxiliary information for evaluating the lack of representativeness due to a specific auxiliary variable or category; these were used for monitoring and evaluating data collection. Schouten and Shlomo (2017) demonstrate the use of partial R-indicators for adaptive survey designs. It is similarly straightforward to define population-based partial R-indicators, and this will be a subject of future work.
The evaluation study on survey representativeness presented in Section 4 is based on real data under realistic assumptions about response probabilities typically found in social surveys conducted at national statistical institutes. Future research needs to assess whether alternative estimators can be constructed that are more precise and, consequently, allow for stronger conclusions regarding the nature of response. A natural avenue to explore is an iterative approach through a modification of the EM algorithm, in which the scores of the nonrespondents on the auxiliary variables are estimated and used to update the response propensity estimates.
We did not consider population-based estimation for other types of models, such as logistic or probit regression. As shown in the numerical evaluation in Section 4, differences between sample-based estimators under the linear and logistic link functions are in general small, but they become more evident as response rates approach 1. For these cases, developing other link functions for population-based estimation is a subject of future research. This would be a useful and natural extension to the theory of R-indicators, as these models are often used in practice and avoid propensities outside the [0, 1] interval.
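The contrast can be illustrated with a small simulated example (all data hypothetical): at a very high response rate, ordinary least squares under the linear link produces fitted propensities above 1, while a logistic fit, here by Newton-Raphson, stays inside (0, 1) by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# High-response-rate setting: true propensities from a logistic model
# with mean response rate around 0.93 (cf. the RR3-type scenario).
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eta = 3.0 + 1.0 * X[:, 1]
rho_true = 1.0 / (1.0 + np.exp(-eta))
r = rng.binomial(1, rho_true)

# Linear link: OLS of the response indicator on X.
beta_lin = np.linalg.solve(X.T @ X, X.T @ r)
rho_lin = X @ beta_lin

# Logistic link: Newton-Raphson steps on the logistic log-likelihood.
beta_log = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta_log)))
    grad = X.T @ (r - p)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    beta_log = beta_log + np.linalg.solve(hess, grad)
rho_log = 1.0 / (1.0 + np.exp(-(X @ beta_log)))

print(rho_lin.max())   # exceeds 1 in this high-response-rate setting
print(rho_log.min(), rho_log.max())   # always strictly inside (0, 1)
```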
Acknowledgements
Part of the research presented here was developed within project RISQ (Representativity Indicators for Survey Quality, www.risq-project.eu), funded by the European 7th Framework Programme. We thank the members of the RISQ project: Katja Rutar from Statistični Urad Republike Slovenije, Geert Loosveldt and Koen Beullens from Katholieke Universiteit Leuven, Øyvin Kleven, Johan Fosen and Li-Chun Zhang from Statistisk Sentralbyrå, Norway, Ana Marujo from the University of Southampton, UK, and Paul Knottnerus, Centraal Bureau voor de Statistiek, for their valuable input. The first author was supported by a STSM Grant from the COST Action IS1004 and by the ex 60% University of Bergamo, Biffignandi grant.
Appendix A
Analytic approximation to the bias of Type 1
estimators
First,
we compute the bias of
under general sampling design. Letting
and
then we can write
Note that
and
where
and
are the second-order sample inclusion
probabilities. Hence, the bias of
with respect to the joint distribution of
sampling design and the response mechanism is given by
Under simple
random sampling without replacement, (A.2) can be simplified to
A response-set
based estimator of
is
More generally,
the Horvitz-Thompson response-set estimator for (A.2) under complex sampling is
given by
Appendix B
Analytic approximation to the bias of Type 2
estimators
The
strategy to compute an analytical bias adjustment for
is to first approximate
by a linear estimator using Taylor linearization techniques. Next,
compute an approximate bias adjustment for
by inserting the linear approximation for
into
In the
following, define, for
and
the estimated totals
where
and
Let
be a
-vector with components
and
be the symmetric
-matrix with
elements
We may write
Define now the
population totals
Notice that
is
unbiased for
is
unbiased for
and
is
unbiased for
Let
Proposition 1. The estimator
defined
in (2.7) may be approximated by
Proof. Following standard Taylor linearization (see Särndal, Swensson and Wretman, 1992 and Bethlehem, 1988), the estimator
may be approximated by
where
and
where
is a
matrix with ones in positions
and
and
zeros elsewhere and
is a
-vector with the
component equal to one and zeros elsewhere.
Inserting the partial derivatives into (B.1) gives the result.
Proposition 2. Under simple random sampling, an
approximate bias for
with
respect to the joint distribution of sampling design and the response mechanism
is given by
where
and
A
response-set based estimator of
is
Proof. Thanks to Proposition 1,
defined
in Appendix A may be approximated as follows
The expected values of the terms
and
are
and
It follows that, under simple random sampling,
becomes
So the total bias under simple random sampling
is obtained by inserting
computed above into (A.1) and following the
proof in Appendix A for the other terms.
The response-set based estimator
of
is obtained by substituting
with
with
with
and
with
Note
that the bias adjustment
corresponds to “plugging-in” Type 2 quantities
instead of
matrix
instead of
and
instead of
into the analytical bias adjustment
developed for
with two additional terms due to the
linearization of
.
More
generally, the Horvitz-Thompson response-set estimator under complex sampling
for the bias adjustment of Type 2 population-based R-indicator is given by
References
Beaumont, J.-F., Bocci, C. and Haziza, D. (2014). An adaptive data collection procedure for call prioritization. Journal of Official Statistics, 30, 607-621.

Bethlehem, J. (1988). Reduction of nonresponse bias through regression estimation. Journal of Official Statistics, 4, 251-260.

Booth, J.G., Butler, R.W. and Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical Association, 89 (428), 1282-1289.

Brick, J.M., and Jones, M.E. (2008). Propensity to respond and nonresponse bias. METRON – International Journal of Statistics, LXVI (1), 51-73.

Copas, J.B. (1983). Regression, prediction and shrinkage. Journal of the Royal Statistical Society, Series B, 45, 311-354.

Copas, J.B. (1993). The shrinkage of point scoring methods. Journal of the Royal Statistical Society, Series C, 42, 315-331.

De Heij, V., Schouten, B. and Shlomo, N. (2015). RISQ manual 2.1. Tools in SAS and R for the computation of R-indicators and partial R-indicators, available at www.risq-project.eu.

Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.

Efron, B., and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. New York: Chapman and Hall.

Kreuter, F. (2013). Improving Surveys with Process and Paradata, Edited monograph, New Jersey: John Wiley & Sons, Inc.

Little, R.J.A. (1986). Survey nonresponse adjustments for estimates of means. International Statistical Review, 54, 139-157.

Little, R.J.A. (1988). Missing-data adjustments in large surveys. Journal of Business and Economic Statistics, 6, 287-301.

Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis with Missing Data, Hoboken, New Jersey: John Wiley & Sons, Inc.

Lundquist, P., and Särndal, C.-E. (2013). Aspects of responsive design with applications to the Swedish Living Conditions Survey. Journal of Official Statistics, 29 (4), 557-582.

MOA (2015). User Instruction Gold Standard, Dutch Market Research Association, available at www.moaweb.nl/sevrices/services/gouden-standaard.html.

Rosenbaum, P.R., and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Särndal, C.-E. (2011). The 2010 Morris Hansen Lecture: Dealing with survey nonresponse in data collection, in estimation. Journal of Official Statistics, 27 (1), 1-21.

Särndal, C.-E., and Lundquist, P. (2014). Accuracy in estimation with nonresponse: A function of degree of imbalance and degree of explanation. Journal of Survey Statistics and Methodology, 2 (4), 361-387.

Särndal, C.-E., and Lundström, S. (2005). Estimation in Surveys with Nonresponse, New York: John Wiley & Sons, Inc.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling, New York: Springer.

Schouten, B., and Shlomo, N. (2017). Selecting adaptive survey design strata with partial R-indicators. International Statistical Review, 85 (1), 143-163.

Schouten, B., Calinescu, M. and Luiten, A. (2013). Optimizing quality of response through adaptive survey designs. Survey Methodology, 39, 1, 29-58. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2013001/article/11824-eng.pdf.

Schouten, B., Cobben, F. and Bethlehem, J. (2009). Indicators for the representativeness of survey response. Survey Methodology, 35, 1, 101-113. Paper available at https://www150.statcan.gc.ca/n1/pub/12-001-x/2009001/article/10887-eng.pdf.

Schouten, B., Shlomo, N. and Skinner, C. (2011). Indicators for monitoring and improving representativeness of response. Journal of Official Statistics, 27, 231-253.

Schouten, B., Cobben, F., Lundquist, P. and Wagner, J. (2016). Does more balanced survey response imply less non-response bias? Journal of the Royal Statistical Society, Series A, 179 (3), 727-748.

Schouten, B., Bethlehem, J., Beulens, K., Kleven, Ø., Loosveldt, G., Rutar, K., Shlomo, N. and Skinner, C. (2012). Evaluating, comparing, monitoring and improving representativeness of survey response through R-indicators and partial R-indicators. International Statistical Review, 80 (3), 382-399.

Shlomo, N., Skinner, C. and Schouten, B. (2012). Estimation of an indicator of the representativeness of survey response. Journal of Statistical Planning and Inference, 142, 201-211.

Van der Laan, D., and Bakker, B. (2015). Indicators for the representativeness of linked sources, NTTS 2015 Proceedings, available at https://ec.europa.eu/eurostat/cros/system/files/NTTS2015%20proceedings.pdf.

Wagner, J. (2012). A comparison of alternative indicators for the risk of nonresponse bias. Public Opinion Quarterly, 76 (3), 555-575.

Wagner, J. (2013). Adaptive contact strategies in telephone and face-to-face surveys. Survey Research Methods, 7 (1), 45-55.

Wagner, J., and Hubbard, F. (2014). Producing unbiased estimates of propensity models during data collection. Journal of Survey Statistics and Methodology, 2, 323-342.

Wolter, K.M. (2007). Introduction to Variance Estimation, 2nd Ed. New York: Springer.