Variance estimation under monotone non-response for a panel survey
Section 2. Correction of non-response and attrition
2.1 Notation and main assumptions
We are interested in a finite population $U$. A sample $s_0$ is first selected
according to some sampling design $p(\cdot)$, and we assume that the
first-order inclusion probabilities $\pi_k$ are strictly positive for any
$k \in U$. This first sampling phase corresponds to the original inclusion of
units in the sample.
We consider the case of a panel survey in which only the units in the original
sample $s_0$ are followed over time, without reentry or late entry units at
subsequent times to represent possible newborns. We are therefore interested
in estimating some parameter defined over the population $U$ for some study
variable $y_t$, taking the value $y_{kt}$ for the unit $k$ at time $t$.
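The units in the sample $s_0$ are followed at subsequent times
$\delta = 1, \ldots, t$, and the sample is prone to unit non-response at each
time. We write $r_k^\delta$ for the response indicator for unit $k$ at time
$\delta$, and $s_\delta$ for the subset of respondents at time $\delta$. We
assume monotone non-response, resulting in the nested sequence
$s_0 \supset s_1 \supset \ldots \supset s_t$. For $\delta = 1, \ldots, t$, we
write $p_k^\delta$ for the probability that some unit $k \in s_{\delta-1}$ is a
respondent at time $\delta$.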
We assume that the data are missing at random, i.e. the response probability
$p_k^\delta$ at time $\delta$ can be explained by the variables observed at
times $0, \ldots, \delta - 1$, including the variables of interest; see for
example Zhou and Kim (2012). Also, we assume that at any time $\delta$ the
units answer independently of one another, and we write
$p_{kl}^\delta = p_k^\delta p_l^\delta$ for the probability that two distinct
units $k$ and $l$ answer jointly at time $\delta$.
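
The response mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_monotone_response`, the array layout (one row per unit of the original sample, one column per follow-up time) and the constant response probability of 0.8 are hypothetical choices made for the illustration, not part of the source.

```python
import numpy as np

def simulate_monotone_response(p, rng):
    """Simulate response indicators under monotone non-response.

    p   : (n, t) array; p[k, d] is the probability that unit k responds at
          follow-up time d + 1, given that it responded at all earlier times.
    rng : numpy random Generator.
    Returns an (n, t) 0/1 array r; r[k, d] = 1 if unit k responds at time d + 1.
    """
    p = np.asarray(p, dtype=float)
    n, t = p.shape
    r = np.zeros((n, t), dtype=int)
    still_in = np.ones(n, dtype=bool)  # every unit of the original sample starts in
    for d in range(t):
        # Units answer independently of one another (Bernoulli draws) ...
        answers = rng.random(n) < p[:, d]
        # ... and a unit that dropped out earlier never comes back (monotone).
        still_in = still_in & answers
        r[:, d] = still_in
    return r

rng = np.random.default_rng(0)
p = np.full((1000, 3), 0.8)   # hypothetical response probabilities
r = simulate_monotone_response(p, rng)
sizes = r.sum(axis=0)         # numbers of respondents at times 1, 2, 3
```

By construction, each column of `r` is dominated by the previous one, so the respondent sets are nested and `sizes` is non-increasing.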
2.2 Reweighted estimator
We are interested in estimating the total $Y(t) = \sum_{k \in U} y_{kt}$ at
time $t$. In practice, the response probabilities at each time are unknown and
need to be estimated. We assume that at each time $\delta$ the probability of
response is parametrically modeled as
$$ p_k^\delta = p^\delta(z_k^\delta, \alpha^\delta) \qquad (2.1) $$
for some known function $p^\delta(\cdot, \cdot)$, where $z_k^\delta$ is a
vector of variables observed for all the units in $s_{\delta-1}$, and
$\alpha^\delta$ denotes some unknown parameter. Here and elsewhere, the
superscript $\delta$ will be used when we account for non-response at time
$\delta$, as for the probability $p_k^\delta$ of unit $k$ being a respondent at
time $\delta$.
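Following the approach in Kim and Kim (2007), we assume that the true
parameter is estimated by $\hat{\alpha}^\delta$, the solution of the
estimating equation
$$ \frac{\partial}{\partial \alpha^\delta} \sum_{k \in s_{\delta-1}} \omega_k^\delta \left\{ r_k^\delta \ln(p_k^\delta) + (1 - r_k^\delta) \ln(1 - p_k^\delta) \right\} = 0, \qquad (2.2) $$
with $\omega_k^\delta$ some weight of unit $k$ in the estimating equation.
Customary choices for these weights include $\omega_k^\delta = 1$ and
$\omega_k^\delta = \pi_k^{-1}$; see Fuller and An (1998), Beaumont (2005) and
Kim and Kim (2007).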
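The estimated response probability at time $\delta$ is
$\hat{p}_k^\delta = p^\delta(z_k^\delta, \hat{\alpha}^\delta)$, i.e. the
response model evaluated at the estimated parameter. The propensity score
adjusted estimator at time $t$, which will be simply called the reweighted
estimator in what follows, is defined as
$$ \hat{Y}_t(t) = \sum_{k \in s_t} \frac{y_{kt}}{\pi_k \, \hat{p}_k^{1 \to t}} \quad \text{with} \quad \hat{p}_k^{1 \to t} = \prod_{\delta=1}^{t} \hat{p}_k^\delta. \qquad (2.3) $$
Here and elsewhere, the subscript $t$ will be used when the sample observed at
time $t$ is used for estimation, as for $\hat{Y}_t(t)$, which makes use of the
sample $s_t$. We simplify the notation as $\hat{Y}_t$ when the total at time
$t$ is estimated by using the sample observed at time $t$.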
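
As an illustration of how the reweighted estimator combines the design weights with the estimated response probabilities of all times up to the current one, here is a minimal sketch. The function name `reweighted_total`, its arguments and the toy numbers are hypothetical; the estimated probabilities are assumed to be already available, one column per time.

```python
import numpy as np

def reweighted_total(y_t, pi, p_hat, in_s_t):
    """Reweighted (propensity score adjusted) estimator of a total at time t.

    y_t    : (n,) study variable at time t for the units of the original
             sample; only the entries of the respondents at time t are used.
    pi     : (n,) first-order inclusion probabilities.
    p_hat  : (n, t) estimated response probabilities, one column per time.
    in_s_t : (n,) boolean, True for the respondents at time t.
    """
    y_t = np.asarray(y_t, dtype=float)
    pi = np.asarray(pi, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    in_s_t = np.asarray(in_s_t, dtype=bool)
    # Cumulative estimated response probability from time 1 to time t.
    p_cum = p_hat.prod(axis=1)
    # Final weight: design weight times the inverse cumulative probability.
    w = 1.0 / (pi[in_s_t] * p_cum[in_s_t])
    return float(np.sum(w * y_t[in_s_t]))

# Toy example with four sampled units and two follow-up times.
y = [10.0, 20.0, 30.0, 40.0]
pi = [0.5, 0.5, 0.5, 0.5]
p_hat = [[0.8, 0.5], [0.8, 0.5], [0.5, 0.5], [0.5, 0.5]]
in_s_2 = [True, False, True, False]
total_hat = reweighted_total(y, pi, p_hat, in_s_2)
```

In this toy example the two respondents receive final weights 5 and 8, so `total_hat` equals 290 up to floating-point rounding.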
2.3 Variance computation
Under some regularity assumptions on the response mechanisms and some
regularity conditions on the response model functions $p^\delta(\cdot, \cdot)$,
we obtain from Theorem 1 in Kim and Kim (2007) that we can write
where
and where for any
we denote by
the value of
evaluated at
and
From (2.5), we obtain that
with
the estimator of
computed on
Using a proof by induction, it follows from (2.4) and (2.7) that the
reweighted estimator $\hat{Y}_t$ is approximately unbiased for the total
$Y(t)$. Also, the variance of $\hat{Y}_t$
may be asymptotically approximated by
The first term on the right-hand side of (2.8) is the variance due to the
sampling design, which we denote by $V^{p}(\hat{Y}_t)$. The second term on the
right-hand side of (2.8) is the variance due to non-response, which we denote
by $V^{nr}(\hat{Y}_t)$.
From (2.5), this asymptotic variance is given
by
where
We note that for each of its components
the term
in (2.10) includes a centering term
which is essentially a prediction of
by means of regressors
This centering is due to the estimation of the
response probabilities. Suppressing these centering terms, equations (2.9) and
(2.10) would lead to the variance of the estimator of
we would obtain by replacing in (2.3) the
estimated probabilities by their true values. The variance of this estimator is
usually larger than that of the reweighted estimator in (2.3); see also Beaumont
(2005), equation (5.7) and Kim and Kim (2007), equation (17), for the case
2.4 Variance estimation
At time
an approximately unbiased estimator for the
variance due to the sampling design
is
where
and where
if
and
otherwise. Following equation (25) in Kim and Kim (2007),
may be approximately unbiasedly estimated at
time
by
where
This leads to the global variance estimator at time
A simplified estimator of the variance due to
non-response is obtained by ignoring the prediction terms
for each of the
variance components. After some algebra, this
leads to the simplified variance estimator
The main advantage of this simplified variance estimator is that it only
requires the knowledge of the estimated response probabilities. On the other
hand, the computation of the variance estimator in (2.12) requires the
knowledge of the response models used at all times. The simplified variance
estimator is therefore of particular interest for secondary users of the survey
data, for whom the estimated response probabilities may be the only available
information related to the response modeling. This simplified variance
estimator will tend to overestimate the variance due to non-response of
if the prediction term
partly explains
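
The reason why only the estimated response probabilities are needed can be made explicit with a short derivation. The following is a sketch and not necessarily the exact expression of the source. It assumes independent responses at each time, drops the prediction terms, and estimates each of the $t$ non-response variance components from the respondents at time $t$ by inverse probability weighting. The notation $\hat{v}^{nr}_{\delta}$ for the component associated with time $\delta$, and $\hat{p}_k^{1 \to \delta} = \prod_{j=1}^{\delta} \hat{p}_k^{j}$ (with $\hat{p}_k^{1 \to 0} = 1$) for the cumulative estimated response probability, are introduced here for illustration.

```latex
\begin{align*}
\hat{v}^{nr}_{\delta}
  &= \sum_{k \in s_t}
     \frac{\hat{p}_k^{1 \to \delta-1}}{\hat{p}_k^{1 \to t}} \,
     \hat{p}_k^{\delta} \left(1 - \hat{p}_k^{\delta}\right)
     \left( \frac{y_{kt}}{\pi_k \, \hat{p}_k^{1 \to \delta}} \right)^{2}
   = \sum_{k \in s_t}
     \frac{1}{\hat{p}_k^{1 \to t}}
     \left( \frac{y_{kt}}{\pi_k} \right)^{2}
     \left( \frac{1}{\hat{p}_k^{1 \to \delta}}
          - \frac{1}{\hat{p}_k^{1 \to \delta-1}} \right), \\
\sum_{\delta=1}^{t} \hat{v}^{nr}_{\delta}
  &= \sum_{k \in s_t}
     \frac{1}{\hat{p}_k^{1 \to t}}
     \left( \frac{y_{kt}}{\pi_k} \right)^{2}
     \left( \frac{1}{\hat{p}_k^{1 \to t}} - 1 \right)
   = \sum_{k \in s_t}
     \frac{1 - \hat{p}_k^{1 \to t}}{\left(\hat{p}_k^{1 \to t}\right)^{2}}
     \left( \frac{y_{kt}}{\pi_k} \right)^{2}.
\end{align*}
```

The sum over $\delta$ telescopes, so the result depends on the response modeling only through the cumulative estimated probabilities $\hat{p}_k^{1 \to t}$, which is consistent with the remark that a secondary user does not need the response models themselves.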
2.5 Application to the logistic regression model
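In the particular case when a logistic regression model is used at each time
$\delta$, the model (2.1) may be rewritten, with the notation of (2.1), as
$$ \mathrm{logit}(p_k^\delta) = (z_k^\delta)^\top \alpha^\delta. \qquad (2.18) $$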
We obtain
and the estimator for the variance due to
non-response is given by (2.12), with
If the reweighted estimator is computed at time
the estimator in (2.12) for the variance due
to non-response may be rewritten as
If the reweighted estimator is computed at time
the estimator in (2.12) for the variance due
to non-response may be rewritten as
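
To make the logistic case concrete, the following sketch fits the response model at one given time by Newton-Raphson. It assumes that the estimating equation is the weighted log-likelihood score, which for the logistic model reduces to $\sum_k \omega_k (r_k - p_k) z_k = 0$; this form, the function name `fit_logistic_response` and its arguments are assumptions made for the illustration. The inputs are restricted to the units that responded at the previous time.

```python
import numpy as np

def fit_logistic_response(z, r, omega=None, n_iter=25, tol=1e-10):
    """Solve sum_k omega_k (r_k - p_k) z_k = 0 with logit(p_k) = z_k' alpha.

    z     : (n, q) covariates for the units eligible at the current time.
    r     : (n,) 0/1 response indicators at the current time.
    omega : (n,) weights in the estimating equation (defaults to 1).
    Returns (alpha_hat, p_hat).
    """
    z = np.asarray(z, dtype=float)
    r = np.asarray(r, dtype=float)
    omega = np.ones(len(r)) if omega is None else np.asarray(omega, dtype=float)
    alpha = np.zeros(z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-z @ alpha))
        # Weighted score of the logistic model.
        score = z.T @ (omega * (r - p))
        # Minus the Jacobian of the score: sum_k omega_k p_k (1 - p_k) z_k z_k'.
        info = (z * (omega * p * (1.0 - p))[:, None]).T @ z
        step = np.linalg.solve(info, score)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    p_hat = 1.0 / (1.0 + np.exp(-z @ alpha))
    return alpha, p_hat

# Intercept-only check: the fitted probability is the (weighted) response rate.
z0 = np.ones((5, 1))
r0 = np.array([1, 1, 0, 1, 0])
alpha_hat, p_fit = fit_logistic_response(z0, r0)
```

With an intercept only, the fitted probability coincides with the overall response rate, which gives a quick sanity check of the solver.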
In practice, the model of Response Homogeneity Groups (RHG) is often assumed
when correcting for unit non-response. Under this model, it is assumed that at
each time $\delta$ the sub-sample $s_{\delta-1}$ may be partitioned into
$C^\delta$ groups $s_{\delta-1}^{1}, \ldots, s_{\delta-1}^{C^\delta}$ such that
the response probability $p_k^\delta$ is constant inside a group. This model is
a particular case of the logistic regression model in (2.18), obtained with
$z_k^\delta$ the vector of group membership indicators,
and the variance due to non-response is estimated accordingly. Explicit
formulas are given in the Appendix.
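
Under the RHG model, a logistic regression on group membership indicators is a saturated model, so solving a weighted score equation of the form $\sum_k \omega_k (r_k - p_k) z_k = 0$ yields the weighted response rate within each group. The sketch below computes these rates directly for one given time. The function name `rhg_response_probabilities`, its arguments and the toy data are hypothetical, and the score form is an assumption made for the illustration.

```python
import numpy as np

def rhg_response_probabilities(group, r, omega=None):
    """Estimated response probabilities under response homogeneity groups.

    group : (n,) group labels for the units eligible at the current time.
    r     : (n,) 0/1 response indicators at the current time.
    omega : (n,) weights in the estimating equation (defaults to 1).
    Returns an (n,) array: the weighted response rate of each unit's group.
    """
    group = np.asarray(group)
    r = np.asarray(r, dtype=float)
    omega = np.ones(len(r)) if omega is None else np.asarray(omega, dtype=float)
    # Map each label to an integer code 0, ..., (number of groups - 1).
    labels, idx = np.unique(group, return_inverse=True)
    num = np.bincount(idx, weights=omega * r, minlength=len(labels))
    den = np.bincount(idx, weights=omega, minlength=len(labels))
    # Broadcast the group-level rates back to the units.
    return (num / den)[idx]

group = np.array(["a", "a", "a", "b", "b"])
resp = np.array([1, 0, 1, 1, 1])
p_rhg = rhg_response_probabilities(group, resp)
```

In the toy data, two of the three units in group "a" respond and both units in group "b" respond, so the estimated probabilities are 2/3 and 1 respectively.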