
3. Estimating response probabilities

Alina Matei and M. Giovanna Ranalli

3.1 Using logistic regression to estimate $p_{k}$

Several methods for estimating $p_{k}$ have been proposed in the literature. All of them rely on auxiliary information known at the population or sample level. In the case of non-ignorable nonresponse, the variable of interest is itself the cause (or one of the causes) of the response behavior, so a covariance between it and the response probability arises through a direct causal relation (see Groves 2006). In such a case, the response probability $p_{k}$ could be modeled for $k \in s$ using logistic regression as follows

$p_{k} = P(R_{k} = 1 \mid y_{kj}) = \frac{1}{1 + \exp(-(a_{0} + a_{1} y_{kj}))}, \qquad (3.1)$

or as follows

$p_{k} = P(R_{k} = 1 \mid y_{kj}, z_{k}) = \frac{1}{1 + \exp(-(a_{0} + a_{1} y_{kj} + z_{k}^{\prime} \alpha))}, \qquad (3.2)$

where $z_{k} = (z_{k1}, \dots, z_{kt})^{\prime}$ is the vector of values taken by $t \geq 1$ covariates on unit $k$, and $a_{0}$, $a_{1}$, and $\alpha$ are parameters.

Nonresponse bias in the unadjusted respondent total of the variable of interest $y_{j}$ depends on the covariance between the values $y_{kj}$ and $p_{k}$ (see Bethlehem 1988). An example of a covariate that reduces the covariance between $y_{kj}$ and $p_{k}$ is interest in the survey topic, as reflected in knowledge, attitudes, and behaviors related to that topic (see Groves, Couper, Presser, Singer, Tourangeau, Acosta and Nelson 2006). The set of covariates $z_{k}$ could also be related to the variable of interest $y_{j}$ to reduce sampling variance (Little and Vartivarian 2005).
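The link between bias and covariance can be checked numerically. The following sketch is not taken from the article: it builds a synthetic population, assumes a response mechanism of the form (3.1) with arbitrary parameter values, and works with the respondent mean rather than the total. It relies on the usual approximation that the expected respondent mean is close to $\sum p_{k} y_{kj} / \sum p_{k}$; under that approximation, the bias equals the population covariance between $y_{kj}$ and $p_{k}$ divided by the mean response probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population (illustrative values, not data from the article).
N = 10_000
y = rng.normal(loc=50.0, scale=10.0, size=N)

# Assumed non-ignorable response mechanism of the form (3.1);
# a0 and a1 are arbitrary illustrative parameters.
a0, a1 = -2.0, 0.05
p = 1.0 / (1.0 + np.exp(-(a0 + a1 * y)))

# Approximate expectation of the unadjusted respondent mean, and its bias.
expected_resp_mean = np.sum(p * y) / np.sum(p)
approx_bias = expected_resp_mean - y.mean()

# Population covariance between y and p, divided by the mean response probability.
cov_yp = np.mean((y - y.mean()) * (p - p.mean()))
bias_from_cov = cov_yp / p.mean()

print(approx_bias, bias_from_cov)
```

The two printed quantities coincide up to rounding, so in this setting any covariate that lowers the covariance between $y_{kj}$ and $p_{k}$ lowers the approximate bias by the same proportion.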

Since $y_{kj}$ is observed only for respondents, Models (3.1) and (3.2) cannot be estimated. Therefore, covariates $z_{k}$ that are known for both respondents and nonrespondents and are related to the $y_{kj}$'s by a ‘hopefully strong regression’ (Cassel, Särndal and Wretman 1983) are usually used instead, in the following model

$p_{k} = P(R_{k} = 1 \mid z_{k}) = \frac{1}{1 + \exp(-(a_{0} + z_{k}^{\prime} \alpha))}. \qquad (3.3)$

Maximum likelihood can then be used to fit Model (3.3) using the data $(R_{k}, z_{k})$ for $k \in s$. This yields the estimates $\hat{a}_{0}$ and $\hat{\alpha}$ and the estimated response probabilities $\hat{p}_{k} = 1 / [1 + \exp(-(\hat{a}_{0} + z_{k}^{\prime} \hat{\alpha}))]$ to be used in (2.1). This procedure provides some protection against nonresponse bias if $z_{k}$ is a powerful predictor of the response probability and/or of the variable of interest (Kim and Kim 2007).
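As an illustration of this fitting step, here is a minimal sketch of an unweighted maximum likelihood fit of Model (3.3) by Newton-Raphson. The function name `fit_logistic`, the simulated sample, and the parameter values are assumptions made for the example; the article does not prescribe a particular algorithm, and design weights are ignored here.

```python
import numpy as np

def fit_logistic(Z, R, n_iter=50, tol=1e-10):
    """Unweighted ML fit of Model (3.3) by Newton-Raphson (illustrative helper).

    Z: (n, t) array of covariates z_k; R: (n,) array of 0/1 response indicators.
    Returns the estimates (a0_hat, alpha_hat).
    """
    X = np.column_stack([np.ones(len(R)), Z])  # first column carries the intercept a_0
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (R - p)                        # gradient of the log-likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta[0], beta[1:]

# Simulated sample s with t = 2 covariates (hypothetical values).
rng = np.random.default_rng(1)
n = 2_000
Z = rng.normal(size=(n, 2))
true_p = 1.0 / (1.0 + np.exp(-(0.5 + Z @ np.array([1.0, -0.5]))))
R = (rng.random(n) < true_p).astype(float)

a0_hat, alpha_hat = fit_logistic(Z, R)

# Estimated response probabilities for every unit in the sample.
p_hat = 1.0 / (1.0 + np.exp(-(a0_hat + Z @ alpha_hat)))
```

Because the model contains an intercept, the fitted probabilities average to the observed response rate, which gives a quick sanity check on the fit.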

In what follows, we propose a reweighting adjustment system based on an auxiliary variable that measures the propensity of each unit to participate in the survey. To this end, further assumptions on the response model are introduced so that the $p_{k}$'s depend on a single latent auxiliary variable that is connected to the propensity scores of Rosenbaum and Rubin (1983). The proposed approach can be used when no other auxiliary information is available on $k \in s$.

3.2 Latent variables as auxiliary information

To obtain a measure of response propensities, we consider the case in which item nonresponse on the variables of interest is also present. Chambers and Skinner (2003, page 278) note that ‘from a theoretical perspective the difference between unit and item nonresponse is unnecessary. Unit nonresponse is just an extreme form of item nonresponse’. Following them, we assume that, among respondents, item response on the variables of interest is driven by the same attitude and factors that drive unit response. Latent variable models can be used to estimate such factors, which can then serve as covariates in a logistic response model.

As already mentioned, we assume that item nonresponse affects $m$ survey variables of particular interest. A second response indicator is introduced for each item $\ell$. For each item $\ell$ and each unit $k$, a binary variable $x_{k\ell}$ is defined that takes value 1 if unit $k$ answers item $\ell$ and 0 otherwise. Let $x_{k} = (x_{k1}, \dots, x_{k\ell}, \dots, x_{km})^{\prime}$ denote the vector of response indicators for unit $k$ to the $m$ items and let $y_{k} = (y_{k1}, \dots, y_{k\ell}, \dots, y_{km})^{\prime}$ be the study variable vector for unit $k$. Thus $y_{k\ell}$ is the response value of unit $k$ to item $\ell$ and $x_{k\ell}$ is its response indicator.
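The construction of the item response indicators can be sketched as follows. The data are hypothetical, and coding a missing answer as `np.nan` is an assumption of this example rather than a convention from the article.

```python
import numpy as np

# Hypothetical answers of 3 units to m = 3 items; np.nan marks a missing answer.
y = np.array([
    [3.0, 1.0, 2.0],
    [np.nan, 4.0, 1.0],
    [2.0, np.nan, np.nan],
])

# x[k, l] = 1 if unit k answered item l, and 0 otherwise.
x = (~np.isnan(y)).astype(int)

# Number of items answered by each unit.
answered = x.sum(axis=1)
```

Each row of `x` is the vector $x_{k}$ for one unit, and it is these binary indicators, not the answers themselves, that serve as input to the latent variable model.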

Suppose the $x_{k\ell}$'s are related to an assumed underlying latent continuous scale; they are the indicators of a latent variable denoted by $\theta_{k}$. De Menezes and Bartholomew (1996) call the variable $\theta_{k}$ the ‘tendency to respond’ to the survey. We call it here the ‘will to respond to the survey’ of unit $k$. A latent trait model with a single latent variable is used to compute $\theta_{k}$ for each $k \in s$ (see Section 4.4 for details). Assume for the moment that $\theta_{k}$ is known for all sample units and, as with usual auxiliary information, can be used as a covariate. In the absence of other covariates, Model (3.3) is rewritten as

$p_{k} = P(R_{k} = 1 \mid \theta_{k}) = \frac{1}{1 + \exp(-(\alpha_{0} + \alpha_{1} \theta_{k}))}. \qquad (3.4)$

The covariate $\theta_{k}$ can be viewed as a variable explaining the behavior related to the survey topic, and thus as well suited to reducing the covariance between $y_{kj}$ and $p_{k}$ and, therefore, the nonresponse bias. If other suitable auxiliary information is available, it can be added to the model as supplementary covariates. To estimate the parameters of Model (3.4), the value of $\theta_{k}$ has to be available for all units in the sample. The following sections provide details on how to obtain estimated values of $\theta_{k}$ for both respondents and nonrespondents.


Date modified: 2015-11-27


Survey Methodology
