Survey Methodology

5 Variance estimation and weight sharing

Anne Massiani

The calculations presented in this section adapt the techniques developed by Lavallée (2002, Chapter 8.5) for the treatment of cluster non-response (CNR) in the context of indirect sampling. The results are given for the hypothetical case in which the probabilities of response at the different stages of the survey are known. The quantities $e_{k}$ and the quantities ${e^{'}}_{k}$ defined by (4.4) are also assumed to be known. All these quantities will be replaced by estimates in Section 7. Let $1_{\{k \in s_{m}^{B, τ}\}}$ denote the indicator variable that is equal to 1 if household $k$, present in survey year $t$, is included in the sample $s_{m}^{B, τ}$ responding in the $τ^{th}$ wave. Given that an attempt was made to contact it through the longitudinals that it contains, household $k$ belongs to $s_{m}^{B, τ}$ if it responded to the grid and then to the questionnaire in survey year $t$. Therefore, we have:

$E (1_{\{k \in s_{m}^{B, τ}\}} | s_{p}^{A_{2}, t_{τ}}) = P(k \in s_{m}^{B, τ} | s_{p}^{A_{2}, t_{τ}}) = q_{k}^{b} q_{k}^{c} .$ (5.1)

Using Theorem 8.1 of Lavallée (2002, page 151) and Note 1, we can easily verify that the estimator (4.7) can be rewritten in the following form:

${\hat{T}}_{τ} = \sum_{j \in s_{p}^{A_{2}, t_{τ}}} w_{j}^{A_{2}} {\hat{Z}}_{j} = \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \frac{1}{π_{j}^{A_{2}}} {\hat{Z}}_{j},$ (5.2)

where we have noted, for any longitudinal $j \in s_{p}^{A_{2}, t_{τ}}$ belonging to household $k$ in the survey year:

${\hat{Z}}_{j} = \frac{{e^{'}}_{k}}{L_{k} + P_{k}} \frac{1}{q_{k}^{b}} \frac{1}{q_{k}^{c}} 1_{\{k \in s_{m}^{B, τ}\}} .$ (5.3)

The estimator ${\hat{T}}_{τ}$ is consequently reduced to a sum over individuals selected directly. We now decompose the variance of ${\hat{T}}_{τ}$ in a standard way by conditioning on $s_{p}^{A_{2}, t_{τ}} :$

$var ({\hat{T}}_{τ}) = {var}_{s_{p}^{A_{2}, t_{τ}}} [E({\hat{T}}_{τ} | s_{p}^{A_{2}, t_{τ}})] + E_{s_{p}^{A_{2}, t_{τ}}} [var({\hat{T}}_{τ} | s_{p}^{A_{2}, t_{τ}})] .$ (5.4)

For any individual $j$ who is present in the population during the year in which $s_{p}^{A_{2}, t_{τ}}$ is drawn and who belongs to household $k$ during the survey year, let

$Z_{j} = \frac{{e^{'}}_{k}}{L_{k} + P_{k}} .$ (5.5)

Using (5.1), we verify that for all longitudinals $j$ and $j^{'}$ included in $s_{p}^{A_{2}, t_{τ}}$ and belonging to households $k$ and $k^{'}$ respectively during the survey year, we have:

$E({\hat{Z}}_{j} | s_{p}^{A_{2}, t_{τ}}) = Z_{j}$ (5.6)

and

${cov}_{j j^{'}} = cov [({\hat{Z}}_{j}, {\hat{Z}}_{j^{'}}) | s_{p}^{A_{2}, t_{τ}}] = \left\{ \begin{array}{ll} 0 & \text{if } k \neq k^{'} \\ {\left(\frac{{e^{'}}_{k}}{L_{k} + P_{k}}\right)}^{2} \frac{1 - q_{k}^{b} q_{k}^{c}}{q_{k}^{b} q_{k}^{c}} & \text{if } k = k^{'} . \end{array} \right.$ (5.7)
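As a quick sanity check on (5.6) and the $k = k^{'}$ case of (5.7), the following minimal sketch computes the conditional mean and variance of ${\hat{Z}}_{j}$ exactly, treating the household response indicator as a Bernoulli variable with success probability $q_{k}^{b} q_{k}^{c}$ as in (5.1). The function name and the numerical values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def z_hat_moments(z, q_b, q_c):
    """Exact conditional mean and variance of Z_hat = z / (q_b * q_c) * R,
    where R ~ Bernoulli(q_b * q_c) is the household response indicator."""
    q = q_b * q_c
    # (value taken by Z_hat, probability of that value)
    outcomes = [(Fraction(0), 1 - q), (z / q, q)]
    mean = sum(v * p for v, p in outcomes)
    var = sum((v - mean) ** 2 * p for v, p in outcomes)
    return mean, var

# Hypothetical values: e'_k = 6, L_k + P_k = 3, q_k^b = 4/5, q_k^c = 3/4.
z = Fraction(6, 3)
q_b, q_c = Fraction(4, 5), Fraction(3, 4)
q = q_b * q_c
mean, var = z_hat_moments(z, q_b, q_c)

# Two longitudinals of the same household share the same indicator and the
# same Z value, so their covariance equals this variance.
print(mean == z, var == z ** 2 * (1 - q) / q)
```

Exact rational arithmetic is used so that the identities can be checked with equality rather than with a numerical tolerance.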

Formula (5.4) then becomes:

$var({\hat{T}}_{τ}) = \underbrace{\sum_{j = 1}^{J_{τ}} \sum_{j^{'} = 1}^{J_{τ}} \frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}} Z_{j} Z_{j^{'}}}_{V_{A_{2}}^{τ}} + \underbrace{E_{s_{p}^{A_{2}, t_{τ}}} \left\{ \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{τ}}} \frac{1}{π_{j}^{A_{2}}} \frac{1}{π_{j^{'}}^{A_{2}}} {cov}_{j j^{'}} \right\}}_{V_{CNR}^{τ}},$ (5.8)

where $J_{τ}$ designates the number of persons present in the population during the year in which $s_{p}^{A_{2}, t_{τ}}$ is drawn. The first term $V_{A_{2}}^{τ}$ is the portion of the variance due to the mechanism for selecting the longitudinals in $s_{p}^{A_{2}, t_{τ}},$ while the second term $V_{CNR}^{τ}$ is the portion due to households' non-response to the grid and then to the household questionnaire in year $t,$ which constitutes cluster non-response.
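The decomposition (5.8) can be checked by exact enumeration on a toy example. The sketch below assumes a first-wave setting with three hypothetical households that are drawn independently of one another, whose members are all longitudinals ($P_{k} = 0$) and are selected together, and whose response is a Bernoulli trial with probability $q_{k} = q_{k}^{b} q_{k}^{c}$. None of these design choices or numerical values comes from the survey itself.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical wave-1 population: household -> (e'_k, L_k, pi_k, q_k), with P_k = 0
# and q_k = q_k^b * q_k^c. Households are drawn independently (an assumption).
HH = {"a": (F(6), 3, F(1, 2), F(3, 5)),
      "b": (F(4), 2, F(1, 4), F(1, 2)),
      "c": (F(5), 1, F(1, 3), F(4, 5))}
PERSONS = [k for k, (_, size, _, _) in HH.items() for _ in range(size)]  # household of each j

def z(k):
    return HH[k][0] / HH[k][1]                       # Z_j of (5.5)

def pi(k):
    return HH[k][2]

def pi_joint(k, k2):
    return pi(k) if k == k2 else pi(k) * pi(k2)      # members are selected together

def cov(k, k2):
    q = HH[k][3]
    return z(k) ** 2 * (1 - q) / q if k == k2 else F(0)   # (5.7)

def outcomes():
    """Yield (probability, sampled households, responding households)."""
    keys = list(HH)
    for states in product((0, 1, 2), repeat=len(keys)):   # 0 out, 1 non-respondent, 2 respondent
        prob = F(1)
        for k, s in zip(keys, states):
            _, _, p, q = HH[k]
            prob *= (1 - p, p * (1 - q), p * q)[s]
        sampled = {k for k, s in zip(keys, states) if s > 0}
        resp = {k for k, s in zip(keys, states) if s == 2}
        yield prob, sampled, resp

def t_hat(resp):
    # (5.2)-(5.3): only longitudinals in responding households contribute
    return sum(z(k) / HH[k][3] / pi(k) for k in PERSONS if k in resp)

mean_t = sum(p * t_hat(r) for p, _, r in outcomes())
var_t = sum(p * (t_hat(r) - mean_t) ** 2 for p, _, r in outcomes())

# First term of (5.8): double sum over the whole population
v_a2 = sum((pi_joint(k, k2) - pi(k) * pi(k2)) / (pi(k) * pi(k2)) * z(k) * z(k2)
           for k in PERSONS for k2 in PERSONS)
# Second term of (5.8): expectation over samples of the weighted covariances
v_cnr = sum(p * sum(cov(k, k2) / (pi(k) * pi(k2))
                    for k in PERSONS if k in s for k2 in PERSONS if k2 in s)
            for p, s, _ in outcomes())

print(var_t == v_a2 + v_cnr)
```

With these inputs the enumerated variance of ${\hat{T}}_{τ}$ coincides with $V_{A_{2}}^{τ} + V_{CNR}^{τ}$; the check says nothing about designs in which households are not selected independently.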

To obtain an estimator of the variance of ${\hat{T}}_{τ},$ we adapt the variance estimation formula (8.37) of Lavallée (2002, page 154). The following differences should be noted. First, we are ignoring the fact that in practice, the response probabilities will have to be estimated, whereas Lavallée (2002) takes this into account. Second, the estimation method proposed by Lavallée (2002) provides biased estimates, even when it is applied in a situation where the response probabilities are known. Consequently, we have adapted his method so as to obtain an unbiased estimator of the variance. To justify our approach, we first explain below the bias obtained by applying the method of Lavallée (2002) in a case where response probabilities are known. His method consists in estimating $V_{CNR}^{τ}$ by an unbiased estimator ${\hat{V}}_{CNR}^{τ}$ then estimating $V_{A_{2}}^{τ}$ by ${\hat{V}}_{1}^{τ} = {\tilde{V}}_{A_{2}}^{τ} ({\hat{Z}}_{1}, {\hat{Z}}_{2}, \dots)$ where:

${\tilde{V}}_{A_{2}}^{τ} (Z_{1}, Z_{2}, \dots) = \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{τ}}} \frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}} \frac{1}{π_{j j^{'}}^{A_{2}}} Z_{j} Z_{j^{'}}$ (5.9)

is the Horvitz-Thompson estimator of the variance of

${\tilde{T}}_{τ} = \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \frac{1}{π_{j}^{A_{2}}} Z_{j} .$ (5.10)

This leads to the variance estimator:

${\hat{var}}_{L} ({\hat{T}}_{τ}) = {\hat{V}}_{1}^{τ} + {\hat{V}}_{CNR}^{τ} .$ (5.11)

The use of ${\hat{V}}_{1}^{τ},$ which is constructed by replacing the $Z_{j}$ values that appear in (5.9) by ${\hat{Z}}_{j},$ is motivated by the fact that the $Z_{j}$ values are not known for all the longitudinals $j$ in $s_{p}^{A_{2}, t_{τ}},$ but only for the longitudinals who, in the survey year, belong to a household $k$ that responded to the questionnaire, that is, a household $k$ belonging to $s_{m}^{B, τ}$. The use of ${\hat{Z}}_{j}$ values makes it possible to assign more weight to the longitudinals of $s_{p}^{A_{2}, t_{τ}}$ for which the $Z_{j}$ values are known. The problem is that the estimator ${\hat{V}}_{1}^{τ}$ thus constructed does not provide an unbiased estimate of $V_{A_{2}}^{τ}$. This may easily be seen by observing what happens for the diagonal terms: for any longitudinal $j$ belonging to household $k$ during the survey year, the quantity $Z_{j}^{2}$ appearing in (5.9) is replaced by ${({\hat{Z}}_{j})}^{2} = Z_{j}^{2} / {(q_{k}^{b} q_{k}^{c})}^{2} 1_{\{k \in s_{m}^{B, τ}\}},$ whereas a weight increase of only a factor of $1 / (q_{k}^{b} q_{k}^{c})$ is probably a better choice. The same type of problem occurs for the product $Z_{j} Z_{j^{'}}$ when longitudinals $j$ and $j^{'}$ belong to the same household during the survey year. More specifically, we have, for all longitudinals $j$ and $j^{'}$ belonging to $s_{p}^{A_{2}, t_{τ}}$:

$E({\hat{Z}}_{j} {\hat{Z}}_{j^{'}} | s_{p}^{A_{2}, t_{τ}}) = Z_{j} Z_{j^{'}} + {cov}_{j j^{'}}$ (5.12)

which implies

$E[{\hat{V}}_{1}^{τ}] - V_{A_{2}}^{τ} = E_{s_{p}^{A_{2}, t_{τ}}} \left\{ E [{\tilde{V}}_{A_{2}}^{τ} ({\hat{Z}}_{1}, {\hat{Z}}_{2}, \dots) | s_{p}^{A_{2}, t_{τ}}] \right\} - V_{A_{2}}^{τ}$ (5.13)

$= E_{s_{p}^{A_{2}, t_{τ}}} \left[ \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{τ}}} \frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}} \frac{1}{π_{j j^{'}}^{A_{2}}} {cov}_{j j^{'}} \right] .$ (5.14)

Since on the other hand ${\hat{V}}_{CNR}^{τ}$ is an unbiased estimator of $V_{CNR}^{τ},$ we have:

$B^{τ} = E [{\hat{var}}_{L} ({\hat{T}}_{τ})] - var({\hat{T}}_{τ})$ (5.15)

$= E_{s_{p}^{A_{2}, t_{τ}}} \left[ \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{τ}}} \frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}} \frac{1}{π_{j j^{'}}^{A_{2}}} {cov}_{j j^{'}} \right] .$ (5.16)

The bias $B^{τ}$ depends in particular on the term ${cov}_{j j^{'}}$ defined by (5.7), and hence on the probabilities $q_{k}^{b}$ and $q_{k}^{c}$ of responding to the grid and the questionnaire in the survey year. Consider the simple case of the panel responding in the first wave, $τ = 1,$ in which the composition of the households has not yet begun to evolve. The quantity ${cov}_{j j^{'}}$ defined by (5.7) is positive if longitudinals $j$ and $j^{'}$ belong to the same household $k,$ and is zero otherwise. Also, for all longitudinals $j$ and $j^{'}$ belonging to the same household $k,$ we have, in accordance with relation (2.4):

$\frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j j^{'}}^{A_{2}}} = 1 - π_{k}^{A_{2}} .$ (5.17)

Since the latter quantity is also positive, formula (5.16) implies that the bias $B^{1}$ is positive. Moreover, the probabilities of inclusion $π_{k}^{A_{2}}$ are very low, and therefore $1 - π_{k}^{A_{2}} ≃ 1.$ Using (5.16) and (5.17), we are therefore able to make the following approximation:

$B^{1} ≃ E_{s_{p}^{A_{2}, t_{1}}} \left[ \sum_{j \in s_{p}^{A_{2}, t_{1}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{1}}} \frac{1}{π_{j}^{A_{2}}} \frac{1}{π_{j^{'}}^{A_{2}}} {cov}_{j j^{'}} \right] = V_{CNR}^{1},$ (5.18)

where, as noted above, $V_{CNR}^{τ}$ is defined by (5.8) and corresponds in the first wave to the portion of the variance due to the non-response observed between the grid and the questionnaire. Consequently, the estimator ${\hat{var}}_{L} ({\hat{T}}_{1})$ overestimates the variance of ${\hat{T}}_{1}$ and the error committed is of the order of magnitude of $V_{CNR}^{1} .$ The bias may be relatively large if the probabilities of response to the questionnaire are low. As regards the other waves, the quantity $(π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}) / π_{j j^{'}}^{A_{2}}$ that appears in (5.16) depends on the households $k_{1}$ and ${k^{'}}_{1}$ to which the longitudinals $j$ and $j^{'}$ belonged during the year of their selection, and it is no longer easy to obtain an order of magnitude of the bias $B^{τ} .$
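The first-wave approximation (5.18) can be illustrated on a single hypothetical sample. The sketch below evaluates, for one realized sample, the double sum inside the expectation of (5.16) and the double sum inside the expectation of (5.18), and compares them. It assumes that members of a household share the inclusion probability $π_{k}^{A_{2}}$ and are selected together, which is what (5.17) implies, and that distinct households are selected independently; the sample, the inclusion probabilities and the response probabilities are invented for the illustration.

```python
from fractions import Fraction as F

# Hypothetical wave-1 sample, one row per longitudinal j:
# (household id, Z_j, pi_k, q_k) with q_k = q_k^b * q_k^c.
SAMPLE = [
    ("a", F(2), F(1, 200), F(3, 5)),
    ("a", F(2), F(1, 200), F(3, 5)),
    ("b", F(5), F(1, 500), F(1, 2)),
]

def cov(j, j2):
    """(5.7): zero across households, Z^2 (1 - q) / q within a household."""
    return j[1] ** 2 * (1 - j[3]) / j[3] if j[0] == j2[0] else F(0)

def pi_joint(j, j2):
    """Joint inclusion probability: pi_k within a household,
    product of the two probabilities across households (assumption)."""
    return j[2] if j[0] == j2[0] else j[2] * j2[2]

def inner_sums(sample):
    """Return the double sums appearing inside the expectations of (5.16) and (5.18)."""
    bias_term = sum((pi_joint(j, j2) - j[2] * j2[2]) / (j[2] * j2[2] * pi_joint(j, j2))
                    * cov(j, j2)
                    for j in sample for j2 in sample)
    cnr_term = sum(cov(j, j2) / (j[2] * j2[2]) for j in sample for j2 in sample)
    return bias_term, cnr_term

bias_term, cnr_term = inner_sums(SAMPLE)
ratio = bias_term / cnr_term
print(float(ratio))
```

Because each within-household term of the first sum is the corresponding term of the second multiplied by $1 - π_{k}^{A_{2}},$ the ratio lies between the smallest and the largest value of $1 - π_{k}^{A_{2}}$ in the sample, and so stays close to 1 when the inclusion probabilities are small.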

We introduce a term correcting the bias $B^{τ}$ and we give our variance estimation formula in Proposition 1 below. Keeping in mind that $m_{k}$ designates the set of all persons who comprise household $k$ during survey year $t$ (cf. page 11), let ${\tilde{m}}_{k}$ be the set, of cardinality $L_{k},$ consisting of the longitudinals $j$ belonging to $m_{k}$.

Proposition 1: An unbiased estimate of the variance of ${\hat{T}}_{τ}$ is given by

$\hat{var} ({\hat{T}}_{τ}) = {\hat{V}}_{1}^{τ} + {\hat{V}}_{2}^{τ},$ (5.19)

where

${\hat{V}}_{1}^{τ} = {\tilde{V}}_{A_{2}}^{τ} ({\hat{Z}}_{1}, {\hat{Z}}_{2}, \dots) = \sum_{j \in s_{p}^{A_{2}, t_{τ}}} \sum_{j^{'} \in s_{p}^{A_{2}, t_{τ}}} \frac{π_{j j^{'}}^{A_{2}} - π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}}{π_{j}^{A_{2}} π_{j^{'}}^{A_{2}}} \frac{1}{π_{j j^{'}}^{A_{2}}} {\hat{Z}}_{j} {\hat{Z}}_{j^{'}}$

and

${\hat{V}}_{2}^{τ} = \sum_{k \in s_{m}^{B, τ}} \sum_{j, j^{'} \in {\tilde{m}}_{k}} \frac{1}{π_{j j^{'}}^{A_{2}}} {(\frac{{e^{'}}_{k}}{L_{k} + P_{k}})}^{2} \frac{1 - q_{k}^{b} q_{k}^{c}}{{(q_{k}^{b} q_{k}^{c})}^{2}} .$

The proof of Proposition 1 is provided in Appendix B.
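Proposition 1 can also be checked by exact enumeration in a small case. The sketch below is not the proof of Appendix B; it only verifies that, for a hypothetical first-wave population of three independently drawn households whose members are all longitudinals ($P_{k} = 0$) and whose response probabilities $q_{k} = q_{k}^{b} q_{k}^{c}$ are known, the expectation of ${\hat{V}}_{1}^{τ} + {\hat{V}}_{2}^{τ}$ equals the variance of ${\hat{T}}_{τ}$. All names and values are assumptions made for the illustration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical wave-1 population: household -> (e'_k, L_k, pi_k, q_k), with P_k = 0
# and q_k = q_k^b * q_k^c. Households are drawn independently (an assumption).
HH = {"a": (F(6), 3, F(1, 2), F(3, 5)),
      "b": (F(4), 2, F(1, 4), F(1, 2)),
      "c": (F(5), 1, F(1, 3), F(4, 5))}

def pi_joint(k, k2):
    return HH[k][2] if k == k2 else HH[k][2] * HH[k2][2]

def estimates(resp):
    """Return (T_hat, V1_hat, V2_hat) for a given list of responding households."""
    # One row per longitudinal j of a responding household: (k, Z_hat_j, pi_j).
    # Longitudinals of non-respondents have Z_hat_j = 0 and drop out of every sum.
    rows = [(k, HH[k][0] / HH[k][1] / HH[k][3], HH[k][2])
            for k in resp for _ in range(HH[k][1])]
    t_hat = sum(zh / p for _, zh, p in rows)
    v1 = sum((pi_joint(k, k2) - p * p2) / (p * p2 * pi_joint(k, k2)) * zh * zh2
             for k, zh, p in rows for k2, zh2, p2 in rows)
    v2 = sum(sum(1 / pi_joint(k, k)                  # 1 / pi_jj' for j, j' in m~_k
                 for _ in range(HH[k][1]) for _ in range(HH[k][1]))
             * (HH[k][0] / HH[k][1]) ** 2 * (1 - HH[k][3]) / HH[k][3] ** 2
             for k in resp)
    return t_hat, v1, v2

def expectation(f):
    """Exact expectation of f(T_hat, V1_hat, V2_hat) over sampling and response."""
    keys = list(HH)
    total = F(0)
    for states in product((0, 1, 2), repeat=len(keys)):   # 0 out, 1 non-respondent, 2 respondent
        prob = F(1)
        for k, s in zip(keys, states):
            _, _, p, q = HH[k]
            prob *= (1 - p, p * (1 - q), p * q)[s]
        resp = [k for k, s in zip(keys, states) if s == 2]
        total += prob * f(*estimates(resp))
    return total

mean_t = expectation(lambda t, v1, v2: t)
true_var = expectation(lambda t, v1, v2: (t - mean_t) ** 2)
e_v1 = expectation(lambda t, v1, v2: v1)
e_v2 = expectation(lambda t, v1, v2: v2)

print(e_v1 + e_v2 == true_var)
```

The enumeration covers the 27 joint outcomes of selection and response, so the comparison is exact for this toy design; it does not address the case where the response probabilities are estimated.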

Note 3: The estimator $\hat{var} ({\hat{T}}_{τ}) = {\hat{V}}_{1}^{τ} + {\hat{V}}_{2}^{τ}$ is the sum of two biased estimators whose biases are brought into balance by construction, with the result that $\hat{var} ({\hat{T}}_{τ})$ gives an unbiased estimate of the variance of ${\hat{T}}_{τ} .$

Note 4: Proposition 1 is based on the assumption that the values $e_{k}$ and the response probabilities are known, which enables us to conclude that the estimator $\hat{var} ({\hat{T}}_{τ})$ given by formula (5.19) is unbiased. In practice, these quantities must be estimated. The consequence of this is that the estimator of variance thus obtained is no longer unbiased but only asymptotically unbiased, provided that the non-response models can be considered correct and that their parameters are estimated by an appropriate method.

Date modified: 2017-09-20
