3 Methodology
Iván A. Carrillo and Alan F. Karr
3.1 Motivation
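Assume that (in a non-survey context) interest lies in the vector parameter $\beta$ in the following model:
$$E[y_{ij} \mid x_{ij}] = \mu_{ij} = g^{-1}(x_{ij}'\beta), \qquad \mathrm{Var}[y_{ij} \mid x_{ij}] = \phi\,\nu(\mu_{ij}), \tag{3.1}$$
where $y_{ij}$ is the response variable for subject $i$ at wave $j$ $(j = 1, \ldots, J)$; $x_{ij}$ is a $p \times 1$ vector of covariates, $X_i = (x_{i1}, \ldots, x_{iJ})'$ is a $J \times p$ matrix; $g$ is a monotonic one-to-one differentiable "link function"; $\nu$ is the "variance function" with known form; and $\phi > 0$ is the "dispersion parameter." Since, in general, the covariance matrix of $y_i = (y_{i1}, \ldots, y_{iJ})'$ is hard to specify, we model it as a "working" covariance matrix $V_i = A_i^{1/2} R(\alpha) A_i^{1/2}$, where $A_i = \mathrm{diag}[\phi\,\nu(\mu_{i1}), \ldots, \phi\,\nu(\mu_{iJ})]$ and $R(\alpha)$ is a "working" correlation matrix, both of dimension $J \times J$, and $\alpha$ is a vector that fully characterizes $R(\alpha)$ (see Liang and Zeger 1986).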
To estimate $\beta$ we select a (single-cohort) sample of $n$ elements from model (3.1) and we (intend to) measure each of them at $J$ occasions. If all the elements in the sample
respond at every single occasion, the task can be completed with the usual
generalized estimating equation (GEE) methodology of Liang and Zeger (1986).
However, it is rarely the case in any study that all subjects respond at all
waves. It is more common for some elements in the sample to drop out of
the study.
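Under this situation, and
assuming that the missing responses can be regarded as missing at random or MAR
(see Rubin 1976), in particular that the dropout at a given wave does not
depend on the current (unobserved) value, Robins, Rotnitzky and Zhao (1995) proposed
to estimate $\beta$ by solving the estimating equations
$$\sum_{i=1}^{n} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \hat{\Delta}_i (y_i - \mu_i) = 0,$$
where $y_i = (y_{i1}, \ldots, y_{iJ})'$, $\mu_i = (\mu_{i1}, \ldots, \mu_{iJ})'$, $V_i$ is the working covariance matrix, $\hat{\Delta}_i = \mathrm{diag}(R_{i1}/\hat{q}_{i1}, \ldots, R_{iJ}/\hat{q}_{iJ})$, $R_{ij}$ is the response indicator for subject $i$ at wave $j$, and $\hat{q}_{ij}$ is an estimate of the probability that subject $i$
is observed through wave $j$.
For survey applications,
one would use the estimating equation
$$\sum_{i \in s} w_i \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \hat{\Delta}_i (y_i - \mu_i) = 0,$$
where $w_i$ is the survey weight for subject $i$. Another way of writing this equation is
$$\sum_{i \in s} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} W_i (y_i - \mu_i) = 0,$$
with $W_i = w_i \hat{\Delta}_i$.
We notice that the
diagonal elements of the weight matrix $W_i$ in this last form are simply wave-specific nonresponse-adjusted
survey weights whenever the subject is observed, and are equal to zero whenever
the subject is missing. This feature in and of itself suggests a solution to
the multi-cohort problem, which will be presented in the next section.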
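To make the weighted estimating equation concrete, the following is a minimal sketch for the special case of an identity link, where the equation is linear in the parameter and can be solved in closed form. The function names, the toy data, the response-probability estimates and the exchangeable working correlation are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def solve_weighted_gee_identity(X, y, W, V_inv):
    """Solve sum_i X_i' V^{-1} W_i (y_i - X_i beta) = 0 for beta (identity link).

    X: (n, J, p) covariates; y: (n, J) responses (any finite filler where missing);
    W: (n, J) diagonal entries of each W_i (0 where the subject is unobserved);
    V_inv: (J, J) inverse working covariance, assumed common to all subjects.
    """
    n, J, p = X.shape
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    for i in range(n):
        M = X[i].T @ V_inv @ np.diag(W[i])  # p x J
        lhs += M @ X[i]
        rhs += M @ y[i]
    return np.linalg.solve(lhs, rhs)

def weighted_gee_score(beta, X, y, W, V_inv):
    """Value of the weighted estimating function at beta."""
    resid = y - X @ beta  # (n, J)
    return sum(X[i].T @ V_inv @ (W[i] * resid[i]) for i in range(len(y)))

# Toy data (entirely made up for illustration).
rng = np.random.default_rng(0)
n, J, p = 40, 3, 2
X = rng.normal(size=(n, J, p))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=(n, J))

# Monotone dropout pattern and hypothetical response-probability estimates.
R = np.ones((n, J))
R[::4, 1:] = 0.0                      # every 4th subject drops out after wave 1
R[1::4, 2] = 0.0                      # another quarter drops out after wave 2
q_hat = np.array([1.0, 0.75, 0.5])    # assumed P(observed through wave j)
w = rng.uniform(5.0, 15.0, size=n)    # hypothetical survey weights
W = w[:, None] * R / q_hat            # diagonal of W_i = w_i * Delta_i

# Exchangeable working correlation with unit variances (an assumption).
V_inv = np.linalg.inv(0.3 * np.ones((J, J)) + 0.7 * np.eye(J))
beta_hat = solve_weighted_gee_identity(X, y, W, V_inv)
```

Because the weights multiply the residuals before the working covariance is applied, whatever value is stored for an unobserved response has no effect on the solution.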
3.2 A novel approach to combining cohorts in longitudinal surveys
Based on the discussion
in the previous section, if we have a fixed-panel, fixed-panel-plus-'births', repeated-panel,
rotating-panel, split-panel, or refreshment sample survey, we propose to
estimate the superpopulation parameter $\beta$ in model (3.1) by the solution to the estimating equations:
$$\sum_{i \in s} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} W_i (y_i - \mu_i) = 0, \tag{3.2}$$
where the sum is
over the sample $s$, i.e.,
over all the elements selected (for the first time) in any of the samples $s_1, \ldots, s_J$. The diagonal matrix $W_i$ is $\mathrm{diag}(w_{i1} I_{i1}, \ldots, w_{iJ} I_{iJ})$, with $w_{ij}$ being the (nonresponse-adjusted)
cross-sectional weight for subject $i$ at wave $j$ (as long as subject $i$ is part of sample $s_j$) and $I_{ij}$ is the indicator of whether subject $i$ belongs to finite population $U_j$ or not. In Section 3.2.1 we argue why this is
a reasonable estimation procedure, and in Section 3.2.2 we discuss the missing
value issue.
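As an illustration of how the diagonal weight matrices might be assembled when cohorts enter and leave at different waves, here is a small sketch. It assumes the cross-sectional weights are stored in a subject-by-wave array with NaN marking waves at which a subject is not in the sample; that storage convention, the function name and the numbers are hypothetical.

```python
import numpy as np

def weight_diagonals(cross_sectional_weights):
    """Return the diagonals of the per-subject weight matrices.

    Input: (n, J) array of cross-sectional weights, NaN where the subject
    is not part of the wave-j sample. Output: same shape, NaN replaced by 0,
    so that absent waves contribute nothing to the estimating equation.
    """
    w = np.asarray(cross_sectional_weights, dtype=float)
    return np.where(np.isnan(w), 0.0, w)

nan = np.nan
raw = np.array([
    [10.0, 11.0, 12.5],   # cohort 1, kept at all three waves
    [10.0, 11.0, nan],    # cohort 1, rotated out before wave 3
    [nan, 20.0, 21.0],    # cohort 2, first selected at wave 2
    [nan, nan, 30.0],     # cohort 3, first selected at wave 3
])
W_diag = weight_diagonals(raw)

# Column sums are the weighted sample counts at each wave, which should
# track the size of the corresponding cross-sectional population.
wave_totals = W_diag.sum(axis=0)
```

Each row of `W_diag` is the diagonal of one subject's weight matrix, so a subject first selected at wave 2 simply carries a zero in the wave-1 position.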
The cross-sectional
weights are such that the wave-$j$ sample represents the wave-$j$ finite population when used in conjunction with said weights.
This means that, for each observation in the wave-$j$ sample, there has to be a survey weight, which could be regarded as the number of units
that such an observation represents in that population. However, remember that the sample is composed of different sets of subjects, or
different subsamples (the different cohorts), and the integration of these
subsamples into a single cross-sectional weight variable may not be a straightforward task.
For the SDR, the
construction of the cross-sectional weight for a given wave is not too complicated, as the different
cohorts are selected independently, from non-overlapping populations. The base
weight in that case is easy to compute, and all that remains is the adjustment
for things like attrition and calibration to known totals in the population.
On the other hand, in
other situations, for example, when a frame of new members does not
exist, the new cohort may need to be selected from the overall population at
the given wave, or from a frame containing new members plus some old
members, or from multiple frames. In such cases, the building of the
cross-sectional weights may not be as straightforward, and the theory of
multiple frames may need to be used. We refer the reader to the works of Lohr
(2007) and Rao and Wu (2010), and references therein, for cases like that.
Expression (3.2) is a
generalization of equation (2.25) in Vieira (2009). The latter is applicable
only when all the subjects have the same number of observations or any missing
responses can be regarded as missing completely at random or MCAR (see Rubin
1976). As discussed in Robins et al. (1995), using such an equation when the missing responses are not MCAR produces
inconsistent estimators; therefore, with a rotation scheme like that of the
SDR, where not all subjects are dropped (or kept) with the same probabilities,
its use would not be appropriate. The adequacy of equation (3.2) in that case
and when there are missing responses is addressed in Sections 3.2.1 and 3.2.2,
respectively. If all subjects have cross-sectional weights that do not vary
over time (or have a single longitudinal weight) equation (3.2) reduces to
equation (2.25) in Vieira (2009).
3.2.1 Unbiasedness
The unbiasedness property
of the estimating function is important because, as Song (2007, Section 5.4)
argues, it is the most crucial assumption in order to obtain a consistent
estimator.
Let us define the so-called "census estimator" to be the
solution to the following finite population estimating equation:
(3.3)
where the sum is
over i.e.,
over all the elements who became members of the target population in any of and In order to show design-unbiasedness of the
estimating function we need to show that its design expectation is
for any
The sampling design
characteristics of a longitudinal survey can be thought of as those of a
multiphase sample, as can be seen in Särndal, Swensson and Wretman (1992,
Section 9.9). We therefore use the methodology of multiphase sampling for the
derivations. We assume, without loss of generality, that there are only three
waves; the derivations with just three waves show the patterns for general with respect to unbiasedness and variance.
As we mentioned earlier,
we assume that is the cross-sectional weight for subject at wave if that subject belongs to and zero otherwise. From the theory of
multiphase sampling we have that for and for and and for where is the inclusion probability of subject in sample and is the conditional inclusion probability of
subject in sample given
Using to denote the expectation with respect to the
sampling design, we have:
(3.4)
where and For example, for we obtain:
(3.5)
where
and similarly we can show that and From these expressions and equation (3.4) we
conclude that for any which means that the estimating function is design-unbiased for the finite population
estimating function.
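The role of the multiphase weights in this argument can be checked numerically on a toy example. The sketch below assumes simple random sampling without replacement at both phases (an assumption chosen only so that every outcome is equally likely) and enumerates every possible two-phase sample, weighting each retained unit by the inverse of the product of its first-phase inclusion probability and its conditional second-phase inclusion probability. The function name and the population values are made up.

```python
from fractions import Fraction
from itertools import combinations

def two_phase_expectation(z, n1, n2):
    """Exact design expectation of the weighted sample total of z.

    Phase 1: SRS of n1 units from the N = len(z) population units.
    Phase 2: SRS of n2 units from the phase-1 sample.
    Weight:  1 / (pi_1 * pi_{2|1}) = (N / n1) * (n1 / n2).
    """
    N = len(z)
    weight = Fraction(N, n1) * Fraction(n1, n2)
    total = Fraction(0)
    count = 0
    for s1 in combinations(range(N), n1):
        for s2 in combinations(s1, n2):
            total += weight * sum(z[i] for i in s2)
            count += 1
    # All (s1, s2) pairs are equally likely under this design.
    return total / count

z = [3, 1, 4, 1, 5, 9]          # hypothetical population values
expected = two_phase_expectation(z, n1=4, n2=2)
population_total = sum(z)
```

Here `expected` equals `population_total` exactly, which is the scalar analogue of the weighted estimating function having the finite population estimating function as its design expectation.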
Furthermore, as the
target of inference is the superpopulation parameter, we need to guarantee that
the model for is such that is satisfied, where represents the expectation with respect to
model For if this is the case, we have:
so that the
estimating function is model-design unbiased. The requirement means that the mean model needs to be
correctly specified; consequently, one needs to pay attention to residual
diagnostics for the particular model being fitted.
3.2.2 A note on nonresponse
In the SDR, as in any
other (longitudinal) survey, there is nonresponse. Some sampled individuals
choose not to participate at all, whereas some subjects participate in some
waves but not in others. The SDR remedies this situation by making a
nonresponse adjustment to the cross-sectional survey weights.
Assume that the
nonresponse adjustment at wave is a multiplication by the inverse of the
estimated wave response probability For example, the nonresponse-adjusted weight
for a person who did respond at wave 3 (and was first selected at wave
2), i.e., for would be
We need to redefine the
estimating equation, to include only the respondents, as where the sum is over the respondent set i.e.,
over all the elements who belonged for the first time in any of the respondent
sets and the matrix is Also, denote by the set of cohort respondents at wave Obviously, if
If additionally, the
response mechanism can be assumed to be MAR, we then have, for
example for
(3.6)
where
and The third equality in (3.6) requires that the
nonresponse model used for satisfies This means that in the model for we have to include as much available
information, thought to influence the nonresponse propensity, as possible, in
order for this assumption (i.e., the
MAR assumption) to be tenable. For example, if the nonresponse is thought to be
independent across waves, one should include, in the model for as many variables from the corresponding wave
as possible. If, on the other hand, it is reasonable to assume that the
response propensity at a given wave depends on previous responses (and possibly
response history), then those responses should be included in the response
model, and so on.
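A single-wave nonresponse adjustment of the kind described above, multiplication by the inverse of an estimated response probability, can be sketched as follows. The weighting-class estimate used in the example (the weighted response rate within one class) is only one common choice and is an assumption here, as are the function name and the numbers; the response model actually used for the SDR is not reproduced.

```python
import numpy as np

def nonresponse_adjust(weights, responded, response_prob):
    """Divide respondents' weights by their estimated response probability.

    Nonrespondents receive weight 0. Probabilities must lie in (0, 1].
    """
    weights = np.asarray(weights, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    p = np.broadcast_to(np.asarray(response_prob, dtype=float), weights.shape)
    if np.any((p <= 0.0) | (p > 1.0)):
        raise ValueError("response probabilities must be in (0, 1]")
    return np.where(responded, weights / p, 0.0)

# One hypothetical weighting class at a given wave.
base = np.array([10.0, 10.0, 20.0, 20.0])
resp = np.array([True, False, True, True])

# Weighted response rate within the class, used as the estimated probability.
p_hat = base[resp].sum() / base.sum()
adjusted = nonresponse_adjust(base, resp, p_hat)
```

With this particular choice of `p_hat`, the adjusted weights of the respondents add up to the total base weight of the class, so the respondents stand in for the nonrespondents in their class.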
The design as well as the
model-design unbiasedness follow immediately from (3.6) together with the
previous section. Hereafter we therefore ignore the issue of nonresponse for
notational simplicity.
3.3 Variance and variance estimation
We now develop a (Taylor
Series) linearization for the variance of the proposed estimator. The basic
technique is due to Binder (1983). For simplicity in the derivations and
notation we divide through by we redefine
and
where Let be our estimator, which satisfies and let be the "census estimator," which satisfies Assume and with and We can write the total error of as Sampling
Error + Model Error. After some straightforward calculations, the
total variance, or more precisely the total MSE, can be decomposed as:
(3.7)
where for any matrix is the "sampling variance" component, is the cross "sampling-model variance"
component, and Furthermore, by Taylor series expansions we
can obtain the following approximations: and where we define and
We then get, for and in (3.7),
(3.8)
(3.9)
where and the derivation of (3.9) can be found in the
Appendix.
In conclusion, so far we
have found that:
(3.10)
In (3.10) all the
terms can be estimated by "plugging in" the estimate except for the term this is the subject of the next section.
If the sampling fraction
is small, i.e., the first term in expression (3.10) is a good
approximation for the total variance; i.e.,
the expression for is simply (and lower order terms). If, on the other
hand, the sampling fraction is large, both terms in (3.10) are required.
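When only the design-based term is retained, linearization variances of this kind typically take a "sandwich" form: the inverse of the derivative of the estimating function on the outside and the variance of the estimating function in the middle. The sketch below shows that generic computation only; it is not a transcription of expression (3.10), and the matrices are arbitrary illustrative numbers.

```python
import numpy as np

def sandwich_variance(bread, meat):
    """Generic linearization variance: bread^{-1} @ meat @ bread^{-T}.

    bread: (p, p) derivative of the estimating function with respect to the
           parameter, evaluated at the estimate (assumed invertible).
    meat:  (p, p) estimated variance of the estimating function.
    """
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T

# Arbitrary illustrative inputs.
bread = np.array([[2.0, 0.0],
                  [0.0, 4.0]])
meat = np.array([[4.0, 2.0],
                 [2.0, 16.0]])
var_beta = sandwich_variance(bread, meat)
```

For a large sampling fraction this design-based piece alone would not be enough, since the text states that both terms of (3.10) are then required.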
3.3.1 Design variance of the estimating function
In order to derive an
expression for we assume as before. The methodology is that of
two-phase sampling (more precisely, multiphase sampling), as discussed in
Chapter 9 of Särndal et al. (1992). After some derivations (see Appendix), and defining and we obtain:
(3.11)
where for
and in the Appendix
we show that:
for and In general, we have proved the following
Property 3.1 The (design)
variance of can be decomposed as:
(3.12)
(3.13)
where we let whenever and to get (3.13) we have changed variables
and used the independence among cohorts.
In (3.11), (3.12), and (3.13)
we have assumed that the cohorts are design-independent. However, in some cases
this assumption may not be tenable; an example of such a case is the multiple
frame situation discussed in the first part of Section 3.2. Another instance in
which it may not be appropriate to assume cohort independence is when weight
adjustments cross cohorts, which is the case of the SDR; we discuss this issue
in Section 5. Calculations for the case of three cohorts, in the Appendix, show
that (3.13) holds for the variance terms even without independence. The
Appendix also identifies conditions under which it is a good approximation for
the covariance terms.
3.3.2 Estimation
The estimation of in (3.10) can be achieved as follows. and can be estimated by can be estimated by where can be estimated by
We use (3.13) in Property
3.1 to estimate As long as there is a method to estimate the
variance of (cross-sectional) Horvitz-Thompson (H-T) estimators, expression (3.13)
can be used. If we define we notice that each of the terms involved in
the computation of (3.13), terms like is simply the variance of a wave H-T estimator. Obviously, the variance
estimation method needs to account for the sampling design as well as for any
nonresponse and calibration adjustments performed, but this does not present
any additional complications beyond what is found in any cross-sectional
problem, as everything is implemented cross-sectionally. The SDR uses
replication to estimate variances of cross-sectional estimators, but any method
of design variance estimation can be used.
We use the
cross-sectional replicate weights that SDR provides, but we do not re-estimate
the parameter of interest at each replicate. First, note that we require
replication only for the estimation of the "meat" of the design variance. Secondly, although the parameter estimate does appear in the expression for the H-T
estimator whose variance needs to be calculated (and re-calculated at each
replicate), the work of Roberts, Binder, Kovačević, Pantel and Phillips (2003), who apply the
"estimating function bootstrap" (Hu and Kalbfleisch 2000) to survey data, shows
that in a setting like ours, it is not necessary to re-compute the estimator at
each replicate, but that the full-sample estimator suffices. This
simplification speeds up the computation of the replicate estimates.
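The computational shortcut can be sketched as follows: each subject's contribution to the estimating function is evaluated once at the full-sample estimate, and only the weights change from replicate to replicate. The function name is hypothetical, and the scale factor depends on the replication method and must be supplied by the user; the example uses a delete-one jackknife with equal weights purely because its result can be checked against a textbook formula, not because the SDR uses it.

```python
import numpy as np

def replicate_meat(full_weights, replicate_weights, scores, scale):
    """Replication estimate of the variance of a weighted total of scores.

    full_weights:      (n,) full-sample weights.
    replicate_weights: (R, n) replicate weights.
    scores:            (n, p) per-subject estimating-function contributions,
                       computed once at the full-sample parameter estimate.
    scale:             method-specific multiplier (an input, not derived here).
    """
    full_total = full_weights @ scores          # (p,)
    rep_totals = replicate_weights @ scores     # (R, p)
    dev = rep_totals - full_total
    return scale * dev.T @ dev

# Hypothetical scores for n = 4 subjects and p = 2 parameters.
scores = np.array([[1.0, 2.0],
                   [0.5, -1.0],
                   [2.0, 0.0],
                   [-1.5, 3.0]])
n = scores.shape[0]
w = 25.0
full_w = np.full(n, w)

# Delete-one jackknife replicate weights: drop one unit, reweight the rest.
rep_w = np.full((n, n), w * n / (n - 1))
np.fill_diagonal(rep_w, 0.0)

meat = replicate_meat(full_w, rep_w, scores, scale=(n - 1) / n)
```

In this equal-weight jackknife case the result coincides with `w**2 * n` times the sample covariance matrix of the scores, the usual with-replacement variance estimate of a weighted total.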
By way of illustration,
say we currently are at wave i.e.,
we are estimating the term in (3.13). The replicate of the first term is where is the replicate weight for subject at wave and the replicate of the second term is where is the replicate weight for subject at wave