State space time series modelling of the Dutch Labour Force Survey: Model selection and mean squared errors estimation
Section 2. The Dutch Labour Force Survey
2.1 The DLFS design
The DLFS has been based on a rotating panel design since
October 1999. Every month, a sample of addresses is drawn according to a
stratified two-stage sample design. Strata are formed by geographical regions;
municipalities are the primary sampling units and addresses are the secondary
sampling units. All households residing on one address are included in the
sample. In this paper, the DLFS data observed from January 2001 until June 2010
are considered. During this period, data in the first wave were collected by
means of computer assisted personal interviewing (CAPI) by interviewers who
visit sampled households at home. After a maximum of six attempts, an
interviewer leaves a letter with the request to contact the interviewer by
telephone to make an appointment for an interview. When a household member
cannot be contacted, another member of the same household may respond on that
person's behalf (proxy interviewing). Respondents are re-interviewed four times at quarterly intervals. In
these four subsequent waves, data are collected by means of computer assisted
telephone interviewing (CATI). During these re-interviews, a condensed
questionnaire is applied to establish any changes in the labour market position
of the respondents. Proxy interviewing is also allowed during these
re-interviews. Mobile phone numbers and unlisted landline numbers are collected
in the first wave to limit panel attrition. When the rotating panel design was
introduced for the DLFS, the gross sample size averaged about 6,200 addresses
per month, with about 65% of households responding completely. The response
rates in the follow-up waves are about 90% relative to the preceding wave.
The general regression (GREG) estimator (Särndal
et al. 1992) is applied to estimate the total unemployed labour force.
This estimator accounts for the complexity of the sample design and uses
auxiliary information available from registers to correct, at least partially,
for selective non-response. Let $Y_t^{t-j}$ denote the GREG estimate of the
total number of unemployed in month $t$ based on the wave of respondents that
entered the survey in month $t-j$. Five such estimates are obtained per month,
one for each $j \in \{0, 3, 6, 9, 12\}$. The GREG estimator for this population
total is defined as:
$$Y_t^{t-j} = \sum_{k=1}^{n_t^{t-j}} w_{k,t} \sum_{l=1}^{m_{k,t}} y_{l,k,t}, \qquad (2.1)$$
with $y_{l,k,t}$ representing the sample observations that are equal to 1 if
the $l$-th person in the $k$-th household is unemployed, and zero otherwise;
$n_t^{t-j}$ is the number of responding households in that wave; $m_{k,t}$ is
the number of persons aged 15 or above in the $k$-th household; $w_{k,t}$ are
the regression weights for household $k$ at time $t$.
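The method of Lemaître and Dufour (1987) is used to obtain
equal weights for all persons within the same household:
$$w_{k,t} = \frac{1}{\pi_{k,t}} \left[ 1 + \left( \mathbf{X}_t - \hat{\mathbf{X}}_t^{HT} \right)' \left( \sum_{k'} \frac{\mathbf{x}_{k',t}\,\mathbf{x}_{k',t}'}{\pi_{k',t}\, g_{k',t}} \right)^{-1} \frac{\mathbf{x}_{k,t}}{g_{k,t}} \right], \qquad (2.2)$$
where $\pi_{k,t}$ is the inclusion probability of household $k$ at time $t$,
$g_{k,t}$ is the size of household $k$ at time $t$, and
$\mathbf{x}_{k,t} = \sum_{l} \mathbf{x}_{l,k,t}$, with $\mathbf{x}_{l,k,t}$
being a $Q$-dimensional vector with the weighting model auxiliary information
on the $l$-th person in the $k$-th household at time $t$;
$\hat{\mathbf{X}}_t^{HT} = \sum_{k} \mathbf{x}_{k,t} / \pi_{k,t}$ is the
Horvitz-Thompson estimator of the auxiliary totals. Vector $\mathbf{X}_t$
contains population totals of auxiliary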
variables. The weighting model is defined by the following variables (with the
number of categories in brackets): Age(5)Gender + Geographic Region(44) +
Gender(2) × Age(21) + Age(5) × Marital Status(2) + Ethnicity(8), where ×
stands for interaction of variables, and Age(5)Gender is a variable classified
into eight classes where Age has five categories, with the second, third and
fourth age categories each being split by gender.
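The variance of the GREG estimator $Y_t^{t-j}$ is approximated by:
$$\widehat{\operatorname{var}}\left(Y_t^{t-j}\right) = \sum_{h=1}^{H} \frac{n_{h,t}}{n_{h,t}-1} \sum_{k=1}^{n_{h,t}} \left( w_{k,t}\hat{e}_{k,t} - \frac{1}{n_{h,t}} \sum_{k'=1}^{n_{h,t}} w_{k',t}\hat{e}_{k',t} \right)^{2}, \qquad (2.3)$$
where the GREG residuals are
$\hat{e}_{k,t} = \sum_{l} \left( y_{l,k,t} - \mathbf{x}_{l,k,t}' \hat{\boldsymbol{\beta}}_t \right)$,
summed over the persons $l$ in household $k$; $n_{h,t}$ is the number of
households in stratum $h$ (with $H$ being the total number of strata); vector
$\hat{\boldsymbol{\beta}}_t$ is a Horvitz-Thompson type estimator for the
regression coefficient that is obtained from regressing the target variable on
the auxiliary variables from the sample.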
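To make the weighting step concrete, the following Python sketch computes household-level regression weights and the resulting weighted total. It assumes the commonly used integrated-weighting form in which each household's auxiliary totals are divided by the household size, so that all persons in a household share one weight. The function names, the two auxiliary variables and all numbers are hypothetical and are not taken from the DLFS.

```python
import numpy as np

def household_greg_weights(x_hh, pi, g, X_pop):
    """Household-level regression weights (assumed integrated-weighting form).

    x_hh  : (n, Q) household totals of the auxiliary variables
    pi    : (n,)   household inclusion probabilities
    g     : (n,)   household sizes
    X_pop : (Q,)   known population totals of the auxiliary variables
    """
    x_hh = np.asarray(x_hh, dtype=float)
    pi = np.asarray(pi, dtype=float)
    g = np.asarray(g, dtype=float)
    d = 1.0 / pi                                  # design weights
    X_ht = d @ x_hh                               # Horvitz-Thompson estimate of X
    T = (x_hh * (d / g)[:, None]).T @ x_hh        # sum_k x_k x_k' / (pi_k g_k)
    lam = np.linalg.solve(T, X_pop - X_ht)        # T^{-1} (X - X_ht)
    return d * (1.0 + (x_hh / g[:, None]) @ lam)

def greg_total(w, y_hh):
    """Weighted sum of the household counts of unemployed persons."""
    return float(np.dot(w, y_hh))

# Hypothetical toy data: 6 households, auxiliaries = (persons, persons aged 15-24)
x_hh = np.array([[1, 0], [2, 1], [3, 1], [4, 2], [2, 0], [5, 1]], dtype=float)
g = x_hh[:, 0].copy()                  # household size equals the person count here
pi = np.full(6, 0.01)                  # equal inclusion probabilities
X_pop = np.array([1800.0, 520.0])      # assumed register totals
y_hh = np.array([0, 1, 0, 1, 0, 2], dtype=float)

w = household_greg_weights(x_hh, pi, g, X_pop)
estimate = greg_total(w, y_hh)
```

A useful check on any implementation of this kind is the calibration property: the weighted sample totals of the auxiliary variables, `w @ x_hh`, should reproduce the population totals `X_pop`.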
2.2 The STS model for the DLFS
Statistics Netherlands decided to switch to a time series model-based
production approach in June 2010 for two reasons. The first was that the sample
sizes were too small to produce sufficiently precise monthly estimates. With a
net sample size of about 4,000 households in the first wave on average, the
GREG estimates of the unemployed labour force had a coefficient of variation of
about 4% at the national level, which was considered too volatile for official
statistical publications. Moreover, monthly unemployment figures must be
published for six domains based on a classification of gender and age, and the
design-based estimates for these domains have much higher coefficients of
variation. The second problem with the DLFS is the so-called rotation group
bias (RGB), which refers to systematic differences between the estimates of
different waves (see, e.g., Bailar 1975 or Pfeffermann 1991). Common causes of
the RGB are panel attrition, panel effects, and differences in questionnaires
and modes used in the subsequent waves. In the case of the DLFS, the first wave
estimates are assumed to be the most reliable, with the subsequent waves
systematically underestimating the unemployed labour force. See van den Brakel
and Krieg (2009) for a more detailed discussion.
Both problems are addressed with a structural time series (STS) model, which
uses five series of GREG estimates for the five different waves as input. With
an STS model, an observed series is decomposed into several unobserved components,
e.g., trend and seasonal. The Kalman filter, optionally in combination with a
smoothing algorithm, can be applied to extract these components from the
observed time series. By doing so, estimates of the components that define the
signal for unemployment are separated from unexplained variance of the
population parameter and from the sampling variance. This generally results in
less volatile point estimates, with substantially smaller standard errors
compared to those of the GREG estimates. By modelling the systematic
differences between the five input series, the model also accounts for the RGB
of the rotating panel.
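As an illustration of how a Kalman filter pools several noisy series into one signal estimate, the sketch below filters five simulated series that all observe the same random-walk level. This is a deliberately simplified toy model, not the DLFS model: it has no seasonal component, no RGB and no autocorrelated survey errors, and all variances, the seed and the function name `kalman_filter` are assumptions chosen for the example.

```python
import numpy as np

def kalman_filter(y, Z, T, H, Q, a0, P0):
    """Kalman filter for y_t = Z a_t + eps_t, a_{t+1} = T a_t + eta_t.

    Returns the filtered states and their covariance matrices.
    """
    a, P = a0.copy(), P0.copy()
    filtered, filtered_var = [], []
    for y_t in y:
        v = y_t - Z @ a                      # one-step-ahead prediction error
        F = Z @ P @ Z.T + H                  # prediction error covariance
        K = P @ Z.T @ np.linalg.inv(F)       # Kalman gain
        a_f = a + K @ v                      # filtered state
        P_f = P - K @ Z @ P                  # filtered state covariance
        filtered.append(a_f)
        filtered_var.append(P_f)
        a = T @ a_f                          # predict the next state
        P = T @ P_f @ T.T + Q
    return np.array(filtered), np.array(filtered_var)

# Toy setup (all values hypothetical): five series observe one common level
rng = np.random.default_rng(0)
n = 60
level = 100.0 + np.cumsum(rng.normal(0.0, np.sqrt(0.1), size=n))
y = level[:, None] + rng.normal(0.0, 2.0, size=(n, 5))

Z = np.ones((5, 1))            # every series loads on the level
H = 4.0 * np.eye(5)            # observation error variance per series
T = np.eye(1)                  # random-walk level
Q = np.array([[0.1]])          # level disturbance variance

filtered, filtered_var = kalman_filter(
    y, Z, T, H, Q, a0=np.array([100.0]), P0=np.array([[100.0]])
)
```

In this toy setting the filtered variance of the level settles well below the variance of any single input series (4.0) and below the variance of their simple average (0.8), which is the mechanism behind the smaller standard errors described above.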
In each month $t$, a five-dimensional vector
$\mathbf{Y}_t = \left( Y_t^{t}, Y_t^{t-3}, Y_t^{t-6}, Y_t^{t-9}, Y_t^{t-12} \right)'$
is observed, containing GREG estimates of the total unemployed labour force
based on the five waves. Based on Pfeffermann (1991), van den Brakel and Krieg
(2009) developed the following model for the GREG estimates:
$$\mathbf{Y}_t = \mathbf{1}_5 \theta_t + \boldsymbol{\lambda}_t + \mathbf{e}_t. \qquad (2.4)$$
Here, $\mathbf{1}_5$ is a five-dimensional column vector of ones, $\theta_t$
is the unknown (scalar) true population parameter, $\boldsymbol{\lambda}_t$ is
a vector containing state variables for the RGB, and $\mathbf{e}_t$ is a vector
of the survey errors that are correlated with their counterparts from previous
waves (the structure will be presented later). For the true population
parameter, it is assumed that:
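$$\theta_t = L_t + \gamma_t + \varepsilon_t,$$
which is the sum of a stochastic trend $L_t$, a stochastic seasonal component
$\gamma_t$, and an irregular component $\varepsilon_t$. For the stochastic
trend $L_t$, the so-called smooth-trend model is assumed:
$$L_t = L_{t-1} + R_{t-1}, \qquad R_t = R_{t-1} + \eta_{R,t},$$
where $L_t$ and $R_t$ represent the level and slope of the true population
parameter, respectively, with the slope disturbance term being distributed as:
$$\eta_{R,t} \overset{iid}{\sim} N\left(0, \sigma_R^2\right).$$
For the seasonal component $\gamma_t$, the trigonometric model is assumed:
$$\gamma_t = \sum_{l=1}^{6} \gamma_{l,t},$$
where each of these six harmonics follows the process:
$$\gamma_{l,t} = \cos(h_l)\,\gamma_{l,t-1} + \sin(h_l)\,\gamma^{*}_{l,t-1} + \omega_{l,t},$$
$$\gamma^{*}_{l,t} = -\sin(h_l)\,\gamma_{l,t-1} + \cos(h_l)\,\gamma^{*}_{l,t-1} + \omega^{*}_{l,t},$$
with $h_l = \pi l / 6$ being the $l$-th seasonal frequency, $l = 1, \ldots, 6$.
The zero-expectation stochastic terms $\omega_{l,t}$ and $\omega^{*}_{l,t}$ are
assumed to be normally and independently distributed and to possess the same
variance within and across all the harmonics, such that:
$$\operatorname{cov}\left(\omega_{l,t}, \omega_{l',t'}\right) = \operatorname{cov}\left(\omega^{*}_{l,t}, \omega^{*}_{l',t'}\right) = \begin{cases} \sigma_\omega^2 & \text{if } l = l' \text{ and } t = t', \\ 0 & \text{otherwise,} \end{cases} \qquad \operatorname{cov}\left(\omega_{l,t}, \omega^{*}_{l',t'}\right) = 0 \text{ for all } l, l', t, t'.$$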
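The smooth trend and the trigonometric seasonal can be written as one linear transition matrix acting on a state vector, which is the form a Kalman filter needs. The Python sketch below builds such a matrix for monthly data, using a two-element state for every one of the six harmonics as in the description above (implementations often drop the redundant second state of the last harmonic, whose frequency is $\pi$). The function names and the ordering of the state vector are assumptions made for this illustration.

```python
import numpy as np

def block_diag(*blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

def trend_block():
    # State (level, slope): level_t = level_{t-1} + slope_{t-1},
    # slope_t = slope_{t-1} + disturbance
    return np.array([[1.0, 1.0],
                     [0.0, 1.0]])

def harmonic_block(l, period=12):
    # Rotation by the l-th seasonal frequency h_l = 2*pi*l/period (= pi*l/6 monthly)
    h = 2.0 * np.pi * l / period
    return np.array([[np.cos(h), np.sin(h)],
                     [-np.sin(h), np.cos(h)]])

def signal_system(period=12):
    """Transition matrix T and selection vector z for trend plus seasonal."""
    harmonics = [harmonic_block(l, period) for l in range(1, period // 2 + 1)]
    T = block_diag(trend_block(), *harmonics)
    # The signal adds the level and the first element of each harmonic pair
    z = np.array([1.0, 0.0] + [1.0, 0.0] * len(harmonics))
    return T, z

T, z = signal_system()
T_seasonal = T[2:, 2:]     # 12 x 12 block holding the six harmonics
```

Without disturbances, applying the seasonal block twelve times returns every harmonic to its starting position, so the seasonal pattern repeats exactly every year; the disturbance terms are what allow the pattern to change gradually.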
The second component in (2.4) is the RGB. It is assumed
that the first wave is unbiased, as motivated in van den Brakel and Krieg
(2009). The RGBs for the follow-up waves are time-dependent and are modelled as
random walk processes. The rationale behind this is that field-work procedures
are subject to frequent changes. Apart from that, response rates change
gradually over time. This makes the RGB time-dependent, as illustrated by van
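den Brakel and Krieg (2015), Figure 4.3. The RGB vector for the five waves can
be written in the following form:
$$\boldsymbol{\lambda}_t = \left( \lambda_{0,t}, \lambda_{3,t}, \lambda_{6,t}, \lambda_{9,t}, \lambda_{12,t} \right)',$$
with:
$$\lambda_{0,t} = 0, \qquad \lambda_{j,t} = \lambda_{j,t-1} + \eta_{\lambda,j,t}, \quad j \in \{3, 6, 9, 12\},$$
where the first subscript indicates how many months ago the wave entered the
survey. It is assumed that the RGB disturbances are not correlated across
different waves and are normally distributed:
$$\eta_{\lambda,j,t} \sim N\left(0, \sigma_\lambda^2\right),$$
with equal variances in all the four waves.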
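The RGB specification can be illustrated with a short simulation: the first-wave bias is fixed at zero and the biases of the four follow-up waves evolve as independent random walks with a common disturbance variance. The starting values, the standard deviation and the function name below are hypothetical; negative starting values are used only to mimic follow-up waves that underestimate unemployment.

```python
import numpy as np

def simulate_rgb(n_months, sigma_lambda, lambda_init, seed=0):
    """Simulate RGB paths for five waves; column 0 is the first wave."""
    rng = np.random.default_rng(seed)
    lam = np.zeros((n_months, 5))          # first wave stays at zero throughout
    lam[0, 1:] = lambda_init               # starting biases of waves 2 to 5
    for t in range(1, n_months):
        # Independent random-walk steps with the same variance in all four waves
        steps = rng.normal(0.0, sigma_lambda, size=4)
        lam[t, 1:] = lam[t - 1, 1:] + steps
    return lam

# 114 months corresponds to January 2001 through June 2010
rgb = simulate_rgb(
    n_months=114,
    sigma_lambda=500.0,                                  # hypothetical value
    lambda_init=np.array([-40e3, -45e3, -50e3, -50e3]),  # hypothetical values
)
```

Because the random walks have no mean reversion, the simulated biases drift slowly rather than fluctuating around a fixed level, which matches the idea of gradual changes in field-work procedures and response rates.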
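The last component in (2.4) contains the survey errors for the five GREG
estimates, i.e.,
$\mathbf{e}_t = \left( e_t^{t}, e_t^{t-3}, e_t^{t-6}, e_t^{t-9}, e_t^{t-12} \right)'$.
To account for sampling error heterogeneity caused by changes in the sample
sizes over time, the sampling errors are modelled proportionally to the
design-based standard errors according to the following measurement error
model proposed by Binder and Dick (1990):
$$e_t^{t-j} = k_t^{t-j}\, \tilde{e}_t^{t-j}, \quad j \in \{0, 3, 6, 9, 12\},$$
where $k_t^{t-j} = \sqrt{\widehat{\operatorname{var}}\left(Y_t^{t-j}\right)}$
and $\tilde{e}_t^{t-j}$ are standardised sampling errors that follow a
stationary process defined later in the text. Here,
$\widehat{\operatorname{var}}\left(Y_t^{t-j}\right)$ are the design-based
variance estimates obtained from the micro data using (2.3). They are treated
as a priori known sampling variances in the STS model.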
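Since the sample in the first wave has no overlap with samples observed in the
past, $\tilde{e}_t^{t}$ can be modelled as white noise with
$E\left(\tilde{e}_t^{t}\right) = 0$ and
$\operatorname{var}\left(\tilde{e}_t^{t}\right) = \sigma_{\nu,0}^2$. The
variance of the survey errors $e_t^{t}$ will be equal to the variance of the
GREG estimates if the maximum likelihood estimate of $\sigma_{\nu,0}^2$ is
approximately equal to unity.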
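The survey errors in the follow-up waves are correlated with the survey errors
from the preceding waves. The autocorrelation coefficient is estimated from the
survey data using the approach proposed by Pfeffermann, Feder and Signorelli
(1998). The autocorrelation structure is modelled with an AR(1) model where the
autocorrelation coefficient is obtained with the Yule-Walker equations (van den
Brakel and Krieg 2009):
$$\tilde{e}_t^{t-j} = \rho\, \tilde{e}_{t-3}^{t-j} + \nu_t^{t-j}, \qquad \nu_t^{t-j} \sim N\left(0, \sigma_{\nu,j}^2\right), \quad j \in \{3, 6, 9, 12\},$$
where $\tilde{e}_{t-3}^{t-j}$ is the standardised error of the same panel three
months earlier. It is assumed that the first-order autocorrelation coefficient
$\rho$ is common for all the four waves. Its estimate is used as a priori
information in the model. Since $\tilde{e}_t^{t-j}$ is an AR(1) process,
$\operatorname{var}\left(\tilde{e}_t^{t-j}\right) = \sigma_{\nu,j}^2 / \left(1 - \rho^2\right)$.
The variance of the sampling error $e_t^{t-j}$ is approximately equal to
$\widehat{\operatorname{var}}\left(Y_t^{t-j}\right)$ if the maximum likelihood
estimate of $\sigma_{\nu,j}^2$ is approximately equal to $1 - \rho^2$. Five
different hyperparameters $\sigma_{\nu,j}^2$, $j \in \{0, 3, 6, 9, 12\}$, are
assumed for the survey error components of the five waves.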
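The link between the AR(1) disturbance variance and the variance of the survey error can be checked numerically. The sketch below computes the stationary variance of an AR(1) process in closed form, confirms it by iterating the variance recursion, and then rescales it by a design-based variance as in the proportional error model. The value of the autocorrelation coefficient, the design variance and the function names are hypothetical.

```python
def ar1_stationary_variance(rho, sigma2_nu):
    """Stationary variance of e_t = rho * e_{t-1} + nu_t with var(nu_t) = sigma2_nu."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return sigma2_nu / (1.0 - rho ** 2)

def iterate_variance(rho, sigma2_nu, n_steps=200, start=0.0):
    """Iterate v <- rho^2 * v + sigma2_nu, which converges to the stationary variance."""
    v = start
    for _ in range(n_steps):
        v = rho ** 2 * v + sigma2_nu
    return v

def survey_error_variance(design_variance, rho, sigma2_nu):
    """Variance of a standardised AR(1) error scaled by the design-based variance."""
    return design_variance * ar1_stationary_variance(rho, sigma2_nu)

rho = 0.3                          # hypothetical autocorrelation coefficient
sigma2_nu = 1.0 - rho ** 2         # disturbance variance that standardises the error
design_variance = 150.0 ** 2       # hypothetical design-based variance estimate

standardised_var = ar1_stationary_variance(rho, sigma2_nu)
implied_var = survey_error_variance(design_variance, rho, sigma2_nu)
```

With the disturbance variance set to one minus the squared autocorrelation, the standardised error has unit variance, so the implied survey error variance coincides with the design-based variance estimate.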
The disturbance variances, together with the autocorrelation parameter $\rho$,
are collected in a hyperparameter vector called $\boldsymbol{\psi}$, and the
vector containing only the disturbance variances is called
$\boldsymbol{\psi}_\sigma$. To avoid negative estimates, the disturbance
variance hyperparameters in $\boldsymbol{\psi}_\sigma$ are estimated on a
log-scale. The quasi-maximum likelihood method is used (see, e.g., Harvey
1989), where the design-based variance estimates and the estimate of $\rho$ are
treated as known. The numerical analysis in this paper is conducted with
OxMetrics 5 (Doornik 2007) in combination with the SsfPack 3.0 package
(Koopman, Shephard and Doornik 2008).
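The effect of estimating a variance on a log-scale can be shown with a much simpler likelihood than that of the STS model. The Python sketch below fits the variance of independent zero-mean Gaussian observations by Newton iterations on the log-variance, so the implied variance is positive at every step without any constraint. It is an illustration of the reparameterisation only; it uses neither OxMetrics nor SsfPack, and the data, seed and function names are assumptions.

```python
import numpy as np

def neg_loglik_logvar(u, y):
    """Negative Gaussian log-likelihood of zero-mean data, with u = log(variance)."""
    n = y.size
    return 0.5 * (n * (np.log(2.0 * np.pi) + u) + np.sum(y ** 2) / np.exp(u))

def fit_log_variance(y, u0=0.0, n_iter=50):
    """Newton iterations on u = log(variance); exp(u) is positive at every step."""
    n, s = y.size, np.sum(y ** 2)
    u = u0
    for _ in range(n_iter):
        # Newton step u - f'(u) / f''(u) for f(u) = 0.5 * (n * u + s * exp(-u))
        u = u + 1.0 - n * np.exp(u) / s
    return u

rng = np.random.default_rng(1)
y = rng.normal(0.0, 2.0, size=500)     # simulated data, true variance 4

u_hat = fit_log_variance(y)
sigma2_hat = np.exp(u_hat)             # back-transformed variance estimate
```

Back-transforming with the exponential function returns the usual maximum likelihood estimate, the mean of the squared observations, while the optimiser itself works on an unrestricted scale.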