Estimation of response propensities and indicators of representative response using population-level information
Section 2. Population-based response propensities

2.1 General notation

We suppose that a sample survey is undertaken, where a sample $s$ is selected from a finite population $U .$ The sizes of $s$ and $U$ are denoted by $n$ and $N,$ respectively. The units in $U$ are labelled $i = 1, 2, \dots, N .$ The sample is assumed to be drawn by a probability sampling design $p (.),$ where the sample $s$ is selected with probability $p (s) .$ The first order inclusion probability of unit $i$ is denoted $π_{i}$ and $d_{i} = π_{i}^{- 1}$ is the design weight. The evaluation study is based on simple random sampling without replacement. Although large-scale national surveys may use more complex two-stage designs, many are generally planned so that all survey units have an equal inclusion probability. We also provide theoretical expressions under more general complex survey designs.

We suppose that the survey is subject to unit nonresponse. The set of responding units is denoted by $r,$ so $r \subset s \subset U .$ We denote summation over the respondents, sample and population by $\sum_{r} ,$ $\sum_{s}$ and $\sum_{U} ,$ respectively. Let $r_{i}$ be the response indicator variable so that $r_{i} = 1$ if unit $i$ responds and $r_{i} = 0$ otherwise. Hence, $r = \{i \in s; r_{i} = 1\} .$ We shall suppose that the typical target of inference is a population mean $\bar{Y} = N^{- 1} \sum_{U} y_{i}$ of a survey variable, taking value $y_{i}$ for unit $i .$

We suppose that the data available for estimation purposes consists first of the values $\{y_{i}; i \in r\}$ of the survey variable, observed only for respondents. Secondly, we suppose that information is available on the values $x_{i} = {(x_{1, i}, x_{2, i}, \dots, x_{K , i})}^{T}$ of a vector of auxiliary variables $X .$ We shall usually suppose each $x_{k , i}$ is a binary indicator variable, where $x_{i}$ represents one or more categorical variables, since this will be the case in the applications we consider, but our presentation allows for general $x_{k , i}$ values. We assume that values of $x_{i}$ are observed for all respondents so that $\{y_{i}, x_{i}; i \in r\}$ is observed.

We distinguish two settings: one in which $x_{i}$ is known for all sample units, i.e., for both respondents and non-respondents, and one in which $x_{i}$ is known only at the aggregate level: the population total $\sum_{U} x_{i}$ and/or the population cross-products $\sum_{U} x_{i} x_{i}^{T} .$ We refer to the two types of information as sample-based auxiliary information and aggregate population-based auxiliary information. The first setting is relevant if the variables making up $X$ are available on a register. However, as outlined in the introduction, in many countries and surveys the availability of auxiliary information on non-respondents may be limited and the second setting using population-based auxiliary information may be more useful.

2.2 Definition of response propensities

The theory of propensity scores was introduced by Rosenbaum and Rubin (1983) and discussed in the context of survey nonresponse by Little (1986; 1988). Response propensities are defined as the conditional expectation of the response indicator variable $r_{i}$ given the values of specified variables and survey conditions: $ρ_{X} (x_{i}) = E_{m} (r_{i} | x_{i}),$ where the vector of auxiliary variables is defined as in Section 2.1. For simplicity, we shall write $ρ_{i} = ρ_{X} (x_{i})$ and hence denote the response propensity just by $ρ_{i} .$ $E_{m} (.)$ denotes expectation with respect to the model underlying the response mechanism. A detailed discussion of response propensities and their properties is presented in Shlomo et al. (2012). They argue that it is desirable to select auxiliary variables constituting $x_{i}$ in such a way that the missing at random assumption, denoted MAR (Little and Rubin, 2002), holds as closely as possible.

2.3 Estimation of response propensities using population-level information

In the case of sample-based auxiliary information, it is possible to estimate response propensities for all sampled units by means of regression models $g (ρ_{i}) = x_{i}^{T} β,$ where $g (.)$ is a link function, $r_{i}$ is the dependent variable, and $x_{i}$ is a vector of explanatory variables. Generally, the response propensities are modelled by generalized linear models. Shlomo et al. (2012) use a logistic link function.

In the population-based setting, it is convenient to consider the identity link function. The identity link function is a good approximation to the more widely used logistic link function when response rates are mid-range, between 30% and 70%, which is the range typically obtained in national and other surveys. We demonstrate this in the evaluation study presented in Section 4, where three ranges of response rates are investigated: low, medium and high. The identity link function also forms the basis for other representativeness indicators in the literature, such as the imbalance and distance indicators proposed by Särndal (2011), some of which are similar to the g-weights calculated in generalized regression (GREG) estimators.
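As a rough numerical illustration of this approximation (not a reproduction of the evaluation study in Section 4), the sketch below compares the logistic function with its tangent line at a propensity of 0.5, using the standard first-order expansion $1/2 + η/4.$ The function names, the grid and the choice of the tangent line as the linear approximation are assumptions made for illustration only.

```python
import math

def logistic(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Logit link: maps a propensity in (0, 1) to the linear predictor scale."""
    return math.log(p / (1.0 - p))

def tangent_at_half(eta):
    """First-order Taylor expansion of the logistic function around eta = 0."""
    return 0.5 + eta / 4.0

def max_abs_gap(p_low, p_high, steps=200):
    """Largest absolute gap between the logistic curve and its tangent line
    over linear predictors corresponding to propensities in [p_low, p_high]."""
    lo, hi = logit(p_low), logit(p_high)
    gap = 0.0
    for j in range(steps + 1):
        eta = lo + (hi - lo) * j / steps
        gap = max(gap, abs(logistic(eta) - tangent_at_half(eta)))
    return gap

mid_gap = max_abs_gap(0.30, 0.70)   # mid-range propensities
wide_gap = max_abs_gap(0.05, 0.95)  # includes very low and very high propensities

print(f"max gap for propensities in [0.30, 0.70]: {mid_gap:.4f}")
print(f"max gap for propensities in [0.05, 0.95]: {wide_gap:.4f}")
```

In this sketch the gap stays small over the mid-range and grows markedly once propensities approach 0 or 1, which is consistent with the restriction to mid-range response rates.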

Under the identity link function we assume that the true response propensities satisfy the “linear probability model”

$ρ_{i} = x_{i}^{T} β, \quad i \in U . \qquad (2.1)$

The linear probability model in (2.1) can be estimated by weighted least squares using the design weights $d_{i} .$ The implied estimator of $ρ_{i}$ is given by

${\hat{ρ}}_{i}^{OLS} = x_{i}^{T} {(\sum_{s} d_{i} x_{i} x_{i}^{T})}^{- 1} \sum_{s} d_{i} x_{i} r_{i}, \quad i \in s . \qquad (2.2)$
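A minimal sketch of how (2.2) might be computed with NumPy is given below. The function name `propensity_wls` and the toy sample (an intercept plus one binary indicator, six units with equal design weights of 10) are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def propensity_wls(x, r, d):
    """Design-weighted least squares fit of the linear probability model,
    returning x_i^T (sum_s d_i x_i x_i^T)^{-1} sum_s d_i x_i r_i for i in s."""
    x = np.asarray(x, dtype=float)   # n x K matrix of auxiliary variables
    r = np.asarray(r, dtype=float)   # response indicators (0/1)
    d = np.asarray(d, dtype=float)   # design weights
    xtx = x.T @ (d[:, None] * x)     # sum_s d_i x_i x_i^T
    xtr = x.T @ (d * r)              # sum_s d_i x_i r_i
    beta = np.linalg.solve(xtx, xtr)
    return x @ beta

# Hypothetical toy sample: column 0 is an intercept, column 1 a binary indicator.
x = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]])
r = np.array([1, 0, 0, 1, 1, 0])
d = np.full(6, 10.0)

rho_hat = propensity_wls(x, r, d)
print(rho_hat)
```

Because the toy model is saturated in the single categorical variable, the fitted propensities equal the weighted response rates within each category (1/3 and 2/3 here).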

In the case of population-based auxiliary information, we first note that $\sum_{s} d_{i} x_{i}$ and $\sum_{s} d_{i} x_{i} x_{i}^{T}$ are unbiased for $\sum_{U} x_{i}$ and $\sum_{U} x_{i} x_{i}^{T},$ respectively, and that in large samples we may expect that $\sum_{s} d_{i} x_{i} \approx \sum_{U} x_{i}$ and $\sum_{s} d_{i} x_{i} x_{i}^{T} \approx \sum_{U} x_{i} x_{i}^{T} .$ It follows from (2.2) that, in the population-based setting, we may approximate ${\hat{ρ}}_{i}^{OLS}$ by

${\tilde{ρ}}_{i, T 1} = x_{i}^{T} T_{1}^{- 1} \sum_{r} d_{k} x_{k}, \quad i \in r, \qquad (2.3)$

where $T_{1} = \sum_{U} x_{j} x_{j}^{T} .$ We note that ${\tilde{ρ}}_{i, T 1}$ is computed only on the set of responding units.
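The following sketch illustrates (2.3) under the assumption that the population cross-product matrix $T_{1}$ is supplied externally. The function name `propensity_type1`, the population of size 60 and the three respondents are hypothetical; only respondent-level values of $x$ and the design weights are used alongside $T_{1}.$

```python
import numpy as np

def propensity_type1(x_resp, d_resp, T1):
    """Population-based propensities x_i^T T1^{-1} sum_r d_k x_k for i in r."""
    x_resp = np.asarray(x_resp, dtype=float)   # respondents' auxiliary values
    d_resp = np.asarray(d_resp, dtype=float)   # respondents' design weights
    weighted_total = x_resp.T @ d_resp         # sum_r d_k x_k
    coef = np.linalg.solve(np.asarray(T1, dtype=float), weighted_total)
    return x_resp @ coef

# Hypothetical population of N = 60 units: intercept plus one binary indicator
# that equals 1 for 30 units, so sum_U x x^T = [[60, 30], [30, 30]].
T1 = np.array([[60.0, 30.0],
               [30.0, 30.0]])

# Hypothetical respondents: one with indicator 0 and two with indicator 1.
x_resp = np.array([[1, 0], [1, 1], [1, 1]])
d_resp = np.full(3, 10.0)

rho_tilde_t1 = propensity_type1(x_resp, d_resp, T1)
print(rho_tilde_t1)
```

No information on non-respondents enters the computation: the non-respondents contribute only through the population matrix $T_{1}.$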

The estimator in (2.3) requires knowledge of the population sums of squares and cross-products $\sum_{U} x_{i} x_{i}^{T}$ of the elements of $x_{i} .$ However, the cross-products might be unknown. In that case, we can estimate $\sum_{s} d_{i} x_{i} x_{i}^{T}$ in (2.2) by rewriting

$\sum_{s} d_{i} x_{i} x_{i}^{T} = \sum_{s} d_{i} (x_{i} - {\bar{x}}_{s}) {(x_{i} - {\bar{x}}_{s})}^{T} + N {\bar{x}}_{s} {\bar{x}}_{s}^{T}, \qquad (2.4)$

where ${\bar{x}}_{s} = \sum_{s} d_{i} x_{i} / N .$ Here ${\bar{x}}_{s}$ may be replaced by the population mean ${\bar{x}}_{U},$ and the covariance matrix

$S_{x x} = N^{- 1} \sum_{s} d_{i} (x_{i} - {\bar{x}}_{s}) {(x_{i} - {\bar{x}}_{s})}^{T} \qquad (2.5)$

may be replaced by its estimate using the response set

${\hat{S}}_{x x} = {(\sum_{s} d_{j} r_{j})}^{- 1} \sum_{s} d_{i} r_{i} (x_{i} - {\bar{x}}_{U}) {(x_{i} - {\bar{x}}_{U})}^{T} . \qquad (2.6)$

We can also estimate (2.6) using propensity weighting by ${\tilde{ρ}}_{i}^{- 1}$ to adjust for nonresponse bias in the variance of the response propensities relative to a set of $X$ variables.
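The decomposition in (2.4), on which (2.5) and (2.6) build, can be checked by expanding the centred cross-products. The short derivation below assumes that the design weights sum to the population size, $\sum_{s} d_{i} = N,$ which holds for example under simple random sampling without replacement, where $d_{i} = N / n .$ This assumption is stated here for the purpose of the check and is not spelled out in the text above.

```latex
\begin{aligned}
\sum_{s} d_{i} (x_{i} - \bar{x}_{s}) (x_{i} - \bar{x}_{s})^{T}
  &= \sum_{s} d_{i} x_{i} x_{i}^{T}
     - \Big(\sum_{s} d_{i} x_{i}\Big) \bar{x}_{s}^{T}
     - \bar{x}_{s} \Big(\sum_{s} d_{i} x_{i}\Big)^{T}
     + \Big(\sum_{s} d_{i}\Big) \bar{x}_{s} \bar{x}_{s}^{T} \\
  &= \sum_{s} d_{i} x_{i} x_{i}^{T}
     - 2 N \bar{x}_{s} \bar{x}_{s}^{T}
     + N \bar{x}_{s} \bar{x}_{s}^{T}
     \qquad \text{(using $\sum_{s} d_{i} x_{i} = N \bar{x}_{s}$ and $\sum_{s} d_{i} = N$)} \\
  &= \sum_{s} d_{i} x_{i} x_{i}^{T} - N \bar{x}_{s} \bar{x}_{s}^{T} .
\end{aligned}
```

Moving $N \bar{x}_{s} \bar{x}_{s}^{T}$ to the left-hand side gives (2.4).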

Combining (2.3), (2.4) and (2.6), we obtain the following estimator:

${\tilde{ρ}}_{i, T 2} = x_{i}^{T} {\hat{T}}_{2}^{- 1} \sum_{r} d_{k} x_{k}, \quad i \in r, \qquad (2.7)$

where ${\hat{T}}_{2} = N {\hat{S}}_{x x} + N {\bar{x}}_{U} {\bar{x}}_{U}^{T} .$
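A sketch of (2.6) and (2.7) in NumPy follows, assuming that only the population totals $\sum_{U} x_{i}$ and the population size $N$ are known. The function name `propensity_type2` and the toy numbers (a population of 60 with an intercept and one binary indicator, three respondents) are hypothetical.

```python
import numpy as np

def propensity_type2(x_resp, d_resp, x_total, N):
    """Propensities from marginal totals only: the cross-product matrix is
    estimated as N * S_hat + N * xbar_U xbar_U^T, with S_hat computed on the
    respondents around the known population mean."""
    x_resp = np.asarray(x_resp, dtype=float)
    d_resp = np.asarray(d_resp, dtype=float)
    xbar_U = np.asarray(x_total, dtype=float) / N     # population mean vector
    centred = x_resp - xbar_U
    # Over the sample, sum_s d_j r_j reduces to the sum of respondents' weights.
    S_hat = centred.T @ (d_resp[:, None] * centred) / d_resp.sum()
    T2_hat = N * S_hat + N * np.outer(xbar_U, xbar_U)
    coef = np.linalg.solve(T2_hat, x_resp.T @ d_resp)  # T2_hat^{-1} sum_r d_k x_k
    return x_resp @ coef, T2_hat

# Hypothetical inputs: N = 60, population totals (60, 30) for (intercept, indicator).
N = 60
x_total = np.array([60.0, 30.0])
x_resp = np.array([[1, 0], [1, 1], [1, 1]])
d_resp = np.full(3, 10.0)

rho_tilde_t2, T2_hat = propensity_type2(x_resp, d_resp, x_total, N)
print(T2_hat)
print(rho_tilde_t2)
```

In this toy case the estimated matrix coincides with the true cross-products because the single binary indicator has population mean 0.5, so its squared deviation from the mean is the same for every unit. With several auxiliary variables, the off-diagonal terms would instead be estimated from the respondents and could differ from the population values.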

We therefore distinguish between two types of aggregated population-based auxiliary information, as denoted by the indices $T_{1}$ in (2.3) and $T_{2}$ in (2.7):

TYPE 1

Full aggregate population-based auxiliary information: the population cross-products are available, i.e., $\sum_{U} x_{i} x_{i}^{T}$ or $\sum_{U} (x_{i} - {\bar{x}}_{U}) {(x_{i} - {\bar{x}}_{U})}^{T},$ where ${\bar{x}}_{U} = \sum_{U} x_{i} / N;$

TYPE 2

Marginal aggregate population-based auxiliary information: only the population marginal counts are available, i.e., $\sum_{U} x_{i} .$

The first type implies that we have available all two-way tables, e.g., age by gender, age by marital status and gender by marital status. This information might be available to a national statistical institute which has access to population registers or detailed population demographics and wishes to use population-based information to monitor data collection due to a lack of sample-based information on the sample frames. The second type is more restrictive as we have only marginal frequency counts, e.g., for age, gender and marital status, without any knowledge about the interactions. This information would be routinely available through websites of national statistical institutes and can therefore be used by marketing and other data collection agencies to monitor their data collection.
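To make the link between two-way tables and the cross-product matrix concrete, the sketch below builds $\sum_{U} x_{i} x_{i}^{T}$ from a hypothetical gender-by-age population table, for $x_{i}$ consisting of an intercept and binary indicators with one reference category per variable. The table counts, the coding of $x_{i}$ and the function names are assumptions for illustration. For binary indicators, each diagonal entry is a marginal count and each off-diagonal entry is a joint count from the two-way table.

```python
import numpy as np

# Hypothetical population table: rows = gender (male, female),
# columns = age group (young, middle, old).
table = np.array([[100, 150, 50],
                  [120, 130, 50]])

def type1_matrix(table):
    """Cross-product matrix for x = (1, female, middle, old), built from counts."""
    table = np.asarray(table, dtype=float)
    N = table.sum()
    female = table[1].sum()
    age_tot = table.sum(axis=0)
    return np.array([
        [N,          female,      age_tot[1],  age_tot[2]],
        [female,     female,      table[1, 1], table[1, 2]],
        [age_tot[1], table[1, 1], age_tot[1],  0.0],   # age groups are exclusive
        [age_tot[2], table[1, 2], 0.0,         age_tot[2]],
    ])

def expand_units(table):
    """Unit-level x vectors implied by the table, used only to check the result."""
    rows = []
    for g in range(2):
        for a in range(3):
            x = [1, int(g == 1), int(a == 1), int(a == 2)]
            rows.extend([x] * int(table[g][a]))
    return np.array(rows, dtype=float)

T1 = type1_matrix(table)
X = expand_units(table)
marginal_totals = T1[0]   # (N, female, middle, old): all that TYPE 2 provides

print(T1)
print(marginal_totals)
```

The first row of the matrix holds the marginal totals, which is the only part available under TYPE 2; the joint counts such as female-by-middle require the two-way table and are available only under TYPE 1.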

ISSN : 1492-0921

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2019-07-04
