Estimation of response propensities and indicators of representative response using population-level information
Section 3. Estimation of R-indicators based on population totals

Table of contents

In this section, we first briefly review the definition and concepts of R-indicators, and their estimation based on sample-level auxiliary information. Details can be found in Shlomo et al. (2012). Next, applying the theory introduced in Section 2.3, we adapt the sample-based R-indicator to the case where auxiliary information is obtained from population tables and population counts. Further, we investigate the statistical properties of this estimator.

3.1 R-indicators

Schouten et al. (2009) introduce the concept of representative response. A response to a survey is said to be representative with respect to $X$ when response propensities are constant for $X,$ i.e.,

$ρ_{i} = ρ_{X} (x_{i}) = \bar{ρ}, \forall x_{i},$

where $\bar{ρ}$ denotes the average response propensity in the population.

The overall measure of representative response is the R-indicator. The R-indicator associated with a set of population response propensities ${ρ_{i} : i \in U}$ is defined as

$R_{ρ} = 1 - 2 S_{ρ}, (3.1)$

where $S_{ρ}$ denotes the standard deviation of the individual response propensities

$S_{ρ}^{2} = \frac{1}{N - 1} \sum_{U} {(ρ_{i} - {\bar{ρ}}_{U})}^{2} = \frac{N}{N - 1} {\frac{1}{N} \sum_{U} ρ_{i}^{2} - {[\frac{1}{N} \sum_{U} ρ_{i}]}^{2}}, (3.2)$

where ${\bar{ρ}}_{U} = \sum_{U} ρ_{i} / N .$

The R-indicator takes values on the interval $[1 - \sqrt{\frac{N}{N - 1}}, 1]$ with the upper value 1 indicating the most representative response, where the $ρ_{i} ’ s$ display no variation, and the lower value $1 - \sqrt{\frac{N}{N - 1}}$ (which is close to 0 for large surveys) indicating the least representative response, where the $ρ_{i} ’ s$ display maximum variation.

An important related measure of representativeness is the coefficient of variation of the response propensities

${CV}_{ρ} = \frac{S_{ρ}}{{\bar{ρ}}_{U}} . (3.3)$

This is a relevant measure when considering population means or totals as parameters of interest. In those cases, it may be used instead of the R-indicator. For other types of parameters of interest, such as the median or a ratio, other indicators can be used (Brick and Jones, 2008).

The coefficient of variation in (3.3) bounds the absolute nonresponse bias of unadjusted response means for a variable $Y$ divided by its standard deviation. Schouten et al. (2016) also used the coefficient of variation to assess “worst case” nonresponse bias intervals for standard nonresponse adjusted post-survey estimators, such as the generalized regression estimator (GREG) (Deville and Särndal, 1992) and inverse propensity weighting (IPW) (Little, 1988).

3.2 Sample-based R-indicators

In the case of sample-based auxiliary information, it is possible to estimate response propensities for all sampled units. In the following, let ${\hat{ρ}}_{i}$ be an estimator for $ρ_{i} .$ The sample-based estimator for the R-indicator is

${\hat{R}}_{\hat{ρ}} = 1 - 2 {\hat{S}}_{\hat{ρ}}, (3.4)$

where ${\hat{S}}_{\hat{ρ}}^{2}$ is the design-weighted sample variance of the estimated response propensities computed using the first expression in (3.2)

${\hat{S}}_{\hat{ρ}}^{2} = \frac{1}{N - 1} \sum_{s} d_{i} {({\hat{ρ}}_{i} - {\hat{\bar{ρ}}}_{U})}^{2},$

where ${\hat{\bar{ρ}}}_{U} = (\sum_{s} d_{i} {\hat{ρ}}_{i}) / N .$

The sample-based R-indicator defined by (3.4) is a statistic with a certain precision and bias. Shlomo et al. (2012) discuss bias adjustments and confidence intervals for ${\hat{R}}_{\hat{ρ}} .$ These are available in SAS and R code at www.risq-project.eu, and a manual is provided by De Heij, Schouten and Shlomo (2015). We return to the statistical properties in Section 3.4.

3.3 Population-based R-indicators

We demonstrate in Section 4 that the R-indicators depend only mildly on the type of link function when estimating response propensities if response rates are not in the tails, i.e., very high or very low. Furthermore, we obtain similar estimation of R-indicators when population-based response propensities are estimated according to the Type 1 or Type 2 types of information.

In the population-based setting, an estimator for the R-indicator is then

${\tilde{R}}_{\tilde{ρ}} = 1 - 2 {\tilde{S}}_{\tilde{ρ}}, (3.5)$

where

${\tilde{S}}_{\tilde{ρ}}^{2} = \frac{N}{N - 1} {\frac{1}{N} \sum_{r} d_{i} {\tilde{ρ}}_{i} - {(\frac{1}{N} \sum_{r} d_{i})}^{2}}, (3.6)$

and ${\tilde{ρ}}_{i}$ denotes either response propensities computed under Type 1 information $({\tilde{ρ}}_{i, T 1})$ or response propensities estimated under Type 2 information $({\tilde{ρ}}_{i, T 2}) .$

Notice that the estimation of the R-indicator is based on the second expression for $S_{ρ}^{2}$ in (3.2). This choice indeed makes the estimator ${\tilde{S}}_{\tilde{ρ}}^{2}$ linear in ${\tilde{ρ}}_{i},$ which provides an advantage for bias computations as described in Section 3.4. The evaluation study in Section 4 empirically demonstrates that the two expressions for $S_{ρ}^{2}$ are similar for the types of large-scale national surveys under consideration. Furthermore, we use propensity-weighting by ${\tilde{ρ}}_{i}^{- 1}$ to adjust for nonresponse bias. As for standard nonresponse weighting, the validity of this correction depends on the validity of the estimates ${\tilde{ρ}}_{i} .$

We remark that any adjustment technique for nonresponse can be applied to construct estimators for $R_{ρ},$ e.g., calibration estimators such as linear or multiplicative weighting (Särndal and Lundström, 2005) or weighting class adjustments (Little, 1986). It is generally known that propensity weighting may lead to larger standard errors. It may, therefore, be more efficient to use parsimonious models to estimate the R-indicator. For instance, this can be done by stratifying on response propensity classes. However, we did not explore such estimators, and restricted ourselves to the propensity-weighted estimator (3.5). This is a topic for future research.

The estimation of the coefficient of variation (3.3) in the population-based setting is straightforward

${CV}_{\tilde{ρ}} = \frac{{\tilde{S}}_{\tilde{ρ}}}{{\tilde{\bar{ρ}}}_{U}},$

where ${\tilde{\bar{ρ}}}_{U} = \sum_{r} d_{i} / N .$

Despite being straightforward estimators, the population-based R-indicators based on (2.3) and (2.7) are problematic. Their standard errors and biases increase with higher response rates. We will demonstrate this tendency in the evaluation study in Section 4.2. Clearly, more respondents should provide smaller standard errors and reduce bias since the auxiliary variables will not vary as much among the remaining non-respondents. The reason that (2.3) and (2.7) have these properties is that they are natural but naïve estimators that ignore the sampling which causes sample covariances in the denominator of the estimated response propensities to vary along with the numerator. By “plugging” in a fixed population covariance in the denominator, variation from sampling is avoided.

One way to moderate this effect would be to use a composite estimator, i.e., to employ a linear combination of the estimated propensity and the response rate,

${\tilde{ρ}}_{i, T 1}^{C} = (1 - λ) {\tilde{ρ}}_{i, T 1} + λ {\tilde{\bar{ρ}}}_{U}, (3.7)$

with ${\tilde{\bar{ρ}}}_{U} = \sum_{r} d_{i} / N,$ and similarly for Type 2. The composite estimate in (3.7) is similar to a “shrinkage” estimator, e.g., Copas (1983 and 1993), for the variance of the response propensities ${\tilde{S}}_{ρ}^{2}$ given by (3.6). In that case, the optimal $λ$ is usually chosen to minimize the MSE by solving the derivative of the MSE with respect to $λ .$ We return to the choice of $λ$ in Section 3.4 and note here that, given the observed bias and variance properties, $λ$ should be an increasing function of the response rate and should converge to 1 with higher response rates. Estimated response propensities greater than 1 will be drawn closer to 1 by such a $λ$ due to the use of the linear link function under high response rates.

We explored several other possible alternatives to the composite estimator in (3.7), for example, a composite estimator of the population covariance matrix and the response covariance matrix of the $x_{i},$ and response propensities truncated to the interval [0, 1] for high response rates, but this gave worse results compared to the composite estimator in (3.7). In addition, we also investigated a Hájek-type estimate but this gave similar results to those provided by the proposed estimator in (3.6). Another advantage to using the composite estimator in (3.7) is that we can easily construct bias adjustments of the R-indicators similar to the bias adjustments constructed based on the propensities in (2.3) or (2.7).

A promising alternative may be to adopt an EM-algorithm approach in which the missing auxiliary variables for nonrespondents are imputed. Such an approach is, however, very different in nature and we leave this to future research.

3.4 Bias and standard error of the population-based R-indicators

Shlomo et al. (2012) derive analytic approximations for the bias and standard errors of the sample-based estimate of the R-indicator (3.4). The bias in this estimator arises mostly from “plugging in” estimated response propensities in the sample variances. This source of bias is referred to as small sample bias. A much smaller and usually negligible contribution to the bias originates from using sample means rather than population means. Even if the response is representative, i.e., has equal response propensities, some variation in estimated response propensities is found. The bias is inversely proportional to the sample size meaning that the larger the sample, the smaller the bias. Schouten et al. (2009) investigate the bias for different sample sizes. From their analyses, it follows that the bias is relatively small for typical sample sizes used in large-scale surveys in comparison to the standard error of the R-indicators. Also, the bias adjustment is successful in removing the bias.

For the estimated population-based R-indicators, we expect that statistical properties will be quite different from their sample-based counterparts. As these estimators use less information, the standard errors will be larger. The bias of the population-based estimators may also be larger since in addition to the bias that was evident for small sample sizes in the sample-based estimators, the population-based estimators will likely have bias arising from the estimation of the sample means and covariances and from the restriction to (propensity-weighted) response means.

To reduce the bias of the population-based estimators, we propose to adjust ${\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}$ and ${\tilde{S}}_{{\tilde{ρ}}_{T 2}}^{2}$ for bias. This leads to the adjusted version of the estimator for the R-indicator under Type 1 information:

${\tilde{R}}_{{\tilde{ρ}}_{T 1}}^{ADJ} = 1 - 2 {[{\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2} - {\tilde{B}}_{{\tilde{ρ}}_{T 1}} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2})]}^{1 / 2} . (3.8)$

Appendix A derives the general expression for ${\tilde{B}}_{\bar{ρ} T 1} ({\tilde{S}}_{\bar{ρ} T 1}^{2})$ under both simple random sampling and a more general expression under complex sampling. From Appendix A, the response-set based estimator for the bias under simple random sampling is:

$\begin{array}{l} {\tilde{B}}_{{\tilde{ρ}}_{T 1}}^{SRS} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) & = \frac{N}{N - 1} [\frac{N}{n^{2}} \sum_{i \in r} {1 - \frac{n - 1}{N - 1} {\tilde{ρ}}_{i , T 1}} x_{i}^{T} T_{1}^{- 1} x_{i} \\ + \frac{n - 1}{n^{2} (N - 1)} \sum_{i \in r} {\tilde{ρ}}_{i , T 1} - (1 - \frac{n}{N}) \frac{{\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}}{n} - \frac{n_{r}}{n^{2}}], (3.9) \end{array}$

where $n_{r}$ denotes the size of the response set $r .$

In the case of Type 2 information, the adjusted version of the estimator for the R-indicator is as (3.8) with the Type 2 terms replacing the Type 1 information.

Appendix B derives the general expression for the bias of ${\tilde{S}}_{{\tilde{ρ}}_{T 2}}^{2} ,$ ${\tilde{B}}_{\bar{ρ} T 2} ({\tilde{S}}_{\bar{ρ} T 2}^{2}),$ under simple random sampling and the more general case of complex sampling. From Appendix B, the response-set based estimator for the bias under simple random sampling is:

$\begin{array}{l} {\tilde{B}}_{{\tilde{ρ}}_{T 2}}^{SRS} ({\tilde{S}}_{{\tilde{ρ}}_{T 2}}^{2}) & = \frac{N}{N - 1} {\frac{1}{n_{r}^{2}} \sum_{i \in r} {1 - \frac{n - 1}{N - 1} {\tilde{ρ}}_{i , T 2}} x_{i}^{T} {\hat{T}}_{2}^{- 1} \hat{F} {\hat{T}}_{2}^{- 1} \hat{t} \\ - \frac{N}{n n_{r}} \sum_{i \in r} {1 - \frac{n - 1}{N - 1} {\tilde{ρ}}_{i , T 2}} x_{i}^{T} {\hat{T}}_{2}^{- 1} z_{i} z_{i}^{T} {\hat{T}}_{2}^{- 1} \hat{t} \\ + \frac{N}{n^{2}} \sum_{i \in r} {1 - \frac{n - 1}{N - 1} {\tilde{ρ}}_{i , T 2}} x_{i}^{T} {\hat{T}}_{2}^{- 1} x_{i} \\ + \frac{n - 1}{n^{2} (N - 1)} \sum_{i \in r} {\tilde{ρ}}_{i , T 2} - (1 - \frac{n}{N}) \frac{{\tilde{S}}_{{\tilde{ρ}}_{T 2}}^{2}}{n} - \frac{n_{r}}{n^{2}}}, \end{array}$

where $\hat{F} = N n^{- 1} \sum_{r} z_{k} z_{k}^{T},$ $\hat{t} = N n^{- 1} \sum_{r} x_{k},$ and $z_{i} = (x_{i} - {\bar{x}}_{U}) .$

Turning to the composite estimator, it is straightforward to show that (3.7) can be rewritten as

${\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2} = (1 - λ) {\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}, (3.10)$

and its bias equals

$B ({\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2}) = (1 - λ) B ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) - λ S_{ρ}^{2} . (3.11)$

A response-set based estimator for $B ({\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2})$ is obtained using the response-set based estimator developed for $B ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) .$ For the Type 1 estimator and under simple random sampling:

$\begin{array}{l} {\tilde{B}}_{{\tilde{ρ}}_{T 1}^{C}}^{SRS} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2}) & = (1 - λ) {\tilde{B}}_{{\tilde{ρ}}_{T 1}^{C}}^{SRS} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) - λ {\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2} \\ = (1 - λ) \frac{N}{N - 1} [\frac{N}{n^{2}} \sum_{i \in r} {1 - \frac{n - 1}{N - 1} {\tilde{ρ}}_{i , T 1}^{C}} x_{i}^{T} T_{1}^{- 1} x_{i}^{^{^{^{^{^{^{^{^{^{^{^{}}}}}}}}}}}} \\ + \frac{n - 1}{n^{2} (N - 1)} \sum_{i \in r} {\tilde{ρ}}_{i , T 1}^{C} - (1 - \frac{n}{N}) \frac{{\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2}}{n} - \frac{n_{r}}{n^{2}}] - λ {\tilde{S}}_{{\tilde{ρ}}_{T 1}^{C}}^{2} . (3.12) \end{array}$

The same approach applies for Type 2 estimator.

The variance of (3.10) is equal to

$V ({\tilde{S}}_{{\tilde{ρ}}_{T_{1}}^{C}}^{2}) = {(1 - λ)}^{2} V ({\tilde{S}}_{{\tilde{ρ}}_{T_{1}}}^{2}) . (3.13)$

To estimate the variance of ${\tilde{R}}_{\tilde{ρ} T 1}^{ADJ}$ in (3.8) as well as the variance of the composite estimator in (3.13) we need to estimate the variance of ${\tilde{S}}_{{\tilde{ρ}}_{T_{1}}}^{2}$ defined in (3.6) and denoted by $V ({\tilde{S}}_{{\tilde{ρ}}_{T_{1}}}^{2}) .$ To estimate this variance we use resampling methods. More specifically, we employ bootstrap methods (see: Efron and Tibshirani, 1993; Booth, Butler and Hall, 1994 and Wolter, 2007 for the use of bootstrapping methods for finite populations) and assess their performance in the evaluation study in Section 4.

We return now to the choice of $λ$ for the composite estimator in (3.7). The optimal $λ$ can be derived by combining (3.11) and (3.13), and then taking derivatives. Letting $B$ and $V$ denote $B ({\tilde{S}}_{{\tilde{ρ}}_{T_{1}}}^{2})$ and $V ({\tilde{S}}_{{\tilde{ρ}}_{T_{1}}}^{2}),$ respectively, it follows that the optimal $λ$ is

$λ_{opt} = \frac{B (B + S_{ρ}^{2}) + V}{{(B + S_{ρ}^{2})}^{2} + V} . (3.14)$

We note that as the sample size increases, both the $B$ and $V$ terms tend to zero and it is possible that $λ_{opt}$ might be negative. However, based on the evaluation study for the types of large-scale national surveys under consideration, this problem does not arise in practice.

In order to estimate $λ_{opt},$ the quantities $B, V$ and $S_{ρ}^{2}$ need to be estimated. Under Type 1 information and simple random sampling, we propose to estimate $B$ by ${\tilde{B}}_{{\tilde{ρ}}_{T 1}}^{SRS} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2})$ as in (3.9), $S_{ρ}^{2}$ by ${\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2},$ and $V$ by the bootstrap variance estimator of ${\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2} .$ This leads to the population-based Type 1 estimator for $λ_{opt},$ denoted by ${\tilde{λ}}_{opt, T 1},$ and the population-based composite propensities

${\tilde{ρ}}_{i , T 1}^{PC} = (1 - {\tilde{λ}}_{opt, T 1}) {\tilde{ρ}}_{i, T 1} + {\tilde{λ}}_{opt, T 1} {\tilde{\bar{ρ}}}_{U} .$

The corresponding population-based R-indicator is then computed as in (3.5) and its bias-adjusted version as in (3.8), where the bias adjustment is given by (3.12).

We propose to estimate the variance of the population-based composite estimator by linearization

$\frac{{\tilde{V}}^{BT} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) {(1 - {\tilde{λ}}_{opt, T 1})}^{2}}{{\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}},$

where ${\tilde{V}}^{BT} ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2})$ is the bootstrap variance estimator for $V ({\tilde{S}}_{{\tilde{ρ}}_{T 1}}^{2}) .$

The same approach applies for Type 2 information.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified:: 2019-07-04

Language selection

Search and menus

Search

Estimation of response propensities and indicators of representative response using population-level information
Section 3. Estimation of R-indicators based on population totals

3.1 R-indicators

3.2 Sample-based R-indicators

3.3 Population-based R-indicators

3.4 Bias and standard error of the population-based R-indicators

Estimation of response propensities and indicators of representative response using population-level information Section 3. Estimation of R-indicators based on population totals

3.1 R-indicators

3.2 Sample-based R-indicators

3.3 Population-based R-indicators

3.4 Bias and standard error of the population-based R-indicators

Editorial policy

Submission of Manuscripts

Note of appreciation

Standards of service to the public

Copyright

Estimation of response propensities and indicators of representative response using population-level information
Section 3. Estimation of R-indicators based on population totals