State space time series modelling of the Dutch Labour Force Survey: Model selection and mean squared errors estimation
Section 2. The Dutch Labour Force Survey

2.1 The DLFS design

The DLFS has been based on a rotating panel design since October 1999. Every month, a sample of addresses is drawn according to a stratified two-stage sample design. Strata are formed by geographical regions; municipalities are the primary sampling units and addresses are the secondary sampling units. All households residing at one address are included in the sample. In this paper, the DLFS data observed from January 2001 until June 2010 are considered. During this period, data in the first wave were collected by means of computer assisted personal interviewing (CAPI) by interviewers who visit sampled households at home. After a maximum of six attempts, an interviewer leaves a letter with the request to contact the interviewer by telephone to make an appointment for an interview. When a household member cannot be contacted, proxy interviewing by members of the same household is allowed. Respondents are re-interviewed four times at quarterly intervals. In these four subsequent waves, data are collected by means of computer assisted telephone interviewing (CATI). During these re-interviews, a condensed questionnaire is applied to establish any changes in the labour market position of the respondents. Proxy interviewing is also allowed during these re-interviews. Mobile phone numbers and unlisted landline numbers are collected in the first wave to avoid panel attrition. When the rotating panel design for the DLFS was introduced, the gross sample size was about 6,200 addresses per month on average, with about 65% completely responding households. The response rates in the follow-up waves are about 90% compared to the preceding wave.

The general regression (GREG) estimator (Särndal et al. 1992) is applied to estimate the total unemployed labour force. This estimator accounts for the complexity of the sample design and uses auxiliary information available from registers to correct, at least partially, for selective non-response. Let $Y_{t}^{j}$ denote the GREG estimate of the total number of unemployed in month $t$ based on the $j^{th}$ wave of respondents. Five such estimates are obtained per month, each of them being respectively based on the sample that entered the survey in month $t - l,$ $l = {0, 3, 6, 9, 12} .$ The GREG estimator for this population total is defined as:

$Y_{t}^{j} = \sum_{k \in s} w_{k, t} \left( \sum_{i = 1}^{n_{k, t}} y_{i, k, t} \right), \qquad (2.1)$

with $y_{i, k, t}$ representing the sample observations that are equal to 1 if the $i^{th}$ person in the $k^{th}$ household is unemployed, and zero otherwise; $n_{k, t}$ is the number of persons aged 15 or above in the $k^{th}$ household; $w_{k, t}$ are the regression weights for household $k$ at time $t .$ The method of Lemaître and Dufour (1987) is used to obtain equal weights for all persons within the same household:

$w_{k, t} = \frac{1}{π_{k, t}} \left[ 1 + {\left( X_{t} - \sum_{k \in s} \frac{x_{k, t}}{π_{k, t}} \right)}^{'} {\left( \sum_{k \in s} \frac{x_{k, t} x_{k, t}^{'}}{π_{k, t} g_{k, t}} \right)}^{- 1} \frac{x_{k, t}}{g_{k, t}} \right], \qquad (2.2)$

where $π_{k, t}$ is the inclusion probability of household $k$ at time $t,$ $g_{k, t}$ is the size of household $k$ at time $t,$ and $x_{k, t} = \sum_{i = 1}^{n_{k, t}} x_{i, k, t},$ with $x_{i, k, t}$ being a $J$-dimensional vector with the weighting model auxiliary information on the $i^{th}$ person in the $k^{th}$ household at time $t .$ Vector $X_{t}$ contains population totals of auxiliary variables. The weighting model is defined by the following variables (with the number of categories in brackets): Age(5)Gender + Geographic Region(44) + Gender(2) $\times$ Age(21) + Age(5) $\times$ Marital Status(2) + Ethnicity(8), where $\times$ stands for interaction of variables, and Age(5)Gender is a variable classified into eight classes where Age has five categories, with the second, third and fourth age category being itemized into two genders.
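As a minimal sketch of how (2.1) and (2.2) fit together, the following NumPy code computes household weights of the form (2.2) for a toy sample and checks that they reproduce the auxiliary population totals. The function name `greg_household_weights`, the four households, and all numbers are hypothetical illustrations, not DLFS data or the production implementation.

```python
import numpy as np

def greg_household_weights(pi, g, x, X_pop):
    """Household regression weights following the form of (2.2).

    pi    : (n,) inclusion probabilities of the sampled households
    g     : (n,) household sizes
    x     : (n, J) household totals of the auxiliary variables
    X_pop : (J,) known population totals of the auxiliary variables
    """
    d = 1.0 / pi                                # design weights 1 / pi_k
    X_ht = (d[:, None] * x).sum(axis=0)         # Horvitz-Thompson estimate of X
    M = (x * (d / g)[:, None]).T @ x            # sum_k x_k x_k' / (pi_k g_k)
    lam = np.linalg.solve(M, X_pop - X_ht)      # M^{-1} (X - X_ht)
    return d * (1.0 + (x / g[:, None]) @ lam)   # one weight per household

# Hypothetical toy sample: 4 households, J = 2 auxiliary variables.
pi = np.full(4, 0.01)
g = np.array([2.0, 1.0, 3.0, 1.0])
x = np.array([[2.0, 1.0], [1.0, 0.0], [3.0, 2.0], [1.0, 1.0]])
X_pop = np.array([720.0, 390.0])
y = np.array([1.0, 0.0, 0.0, 1.0])              # unemployed persons per household

w = greg_household_weights(pi, g, x, X_pop)
Y_greg = float(w @ y)                           # estimate in the form of (2.1)
print("weighted auxiliary totals:", w @ x)      # should match X_pop
print("GREG estimate:", Y_greg)
```

Because every person in a household shares the household weight, summing $y_{i, k, t}$ within the household before weighting, as in (2.1), is equivalent to weighting each person individually.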

The variance of the GREG estimator $Y_{t}^{j}$ is approximated by:

$\hat{Var} (Y_{t}^{j}) = \sum_{h = 1}^{H} \frac{n_{h, t}}{n_{h, t} - 1} \left( \sum_{k = 1}^{n_{h, t}} {(w_{k, t} {\hat{e}}_{k, t})}^{2} - \frac{1}{n_{h, t}} {\left( \sum_{k = 1}^{n_{h, t}} w_{k, t} {\hat{e}}_{k, t} \right)}^{2} \right), \quad j = 1, 2, 3, 4, 5, \qquad (2.3)$

where the GREG residuals are ${\hat{e}}_{k, t} = \sum_{i = 1}^{n_{k, t}} (y_{i, k, t} - x_{i, k, t}^{^{'}} {\hat{β}}_{t});$ $n_{h, t}$ is the number of households in stratum $h$ (with $H$ being the total number of strata); vector ${\hat{β}}_{t}$ is a Horvitz-Thompson type estimator for the regression coefficient that is obtained from regressing the target variable on the auxiliary variables from the sample.
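The stratum-wise structure of (2.3) can be sketched as follows. The function name `greg_variance`, the two strata, and the residual values are hypothetical; the residuals ${\hat{e}}_{k, t}$ are taken as given here rather than computed from a fitted regression.

```python
import numpy as np

def greg_variance(w, e_hat, stratum):
    """Stratified variance approximation in the form of (2.3).

    w       : household regression weights
    e_hat   : household-level GREG residuals
    stratum : stratum label of each household
    """
    w, e_hat, stratum = map(np.asarray, (w, e_hat, stratum))
    total = 0.0
    for h in np.unique(stratum):
        we = (w * e_hat)[stratum == h]     # w_k * e_k for households in stratum h
        n_h = we.size
        if n_h < 2:
            raise ValueError("each stratum needs at least two households")
        total += n_h / (n_h - 1) * (np.sum(we**2) - np.sum(we)**2 / n_h)
    return float(total)

# Hypothetical toy input: two strata with two households each.
v = greg_variance(
    w=[100.0, 100.0, 100.0, 100.0],
    e_hat=[0.5, -0.5, 0.2, -0.2],
    stratum=[0, 0, 1, 1],
)
print(v)
```

In this toy input the weighted residuals sum to zero within each stratum, so only the sum-of-squares term contributes.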

2.2 The STS model for the DLFS

Statistics Netherlands switched to a time series model-based production approach in June 2010 for two reasons. The first was that the sample sizes were too small to produce sufficiently precise monthly estimates. With a net sample size of about 4,000 households in the first wave on average, the GREG estimates of the unemployed labour force had a coefficient of variation of about 4% at the national level, which was considered too volatile for official statistical publications. In addition, monthly unemployment figures must be published for six domains based on a classification of gender and age, and the design-based estimates for these domains have much higher coefficients of variation. The second problem with the DLFS is the so-called rotation group bias (RGB), which refers to systematic differences between the estimates of different waves (see, e.g., Bailar 1975 or Pfeffermann 1991). Common causes of the RGB are panel attrition, panel effects, and differences in the questionnaires and modes used in the subsequent waves. In the case of the DLFS, the first wave estimates are assumed to be the most reliable, with the subsequent waves systematically underestimating the unemployed labour force. See van den Brakel and Krieg (2009) for a more detailed discussion.

Both problems are solved with a structural time series (STS) model, which uses the five series of GREG estimates for the five different waves as input. With an STS model, an observed series is decomposed into several unobserved components, e.g., a trend and a seasonal component. The Kalman filter, optionally in combination with a smoothing algorithm, can be applied to extract these components from the observed time series. By doing so, estimates of the components that define the signal for unemployment are separated from the unexplained variance of the population parameter and from the sampling variance. This generally results in less volatile point estimates, with substantially smaller standard errors than those of the GREG estimates. By modelling the systematic differences between the five input series, the model also accounts for the RGB of the rotating panel.
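To illustrate why filtering reduces the standard errors, the sketch below runs a scalar Kalman filter on a deliberately simplified local level model, $y_{t} = μ_{t} + e_{t},$ $μ_{t} = μ_{t - 1} + η_{t}.$ This is not the five-dimensional model of this paper; the function name `local_level_filter`, the series, and both variances are hypothetical values chosen only for illustration.

```python
def local_level_filter(y, var_obs, var_level, a0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + e_t, mu_t = mu_{t-1} + eta_t.

    var_obs may vary over time, mimicking sampling variances that are
    treated as known; p0 is a large (near-diffuse) initial state variance.
    """
    a, p = a0, p0
    filtered, variances = [], []
    for obs, r in zip(y, var_obs):
        p = p + var_level          # prediction step for the random-walk level
        k = p / (p + r)            # Kalman gain
        a = a + k * (obs - a)      # update the level with the new observation
        p = (1.0 - k) * p          # variance of the filtered level
        filtered.append(a)
        variances.append(p)
    return filtered, variances

# Hypothetical monthly direct estimates with a known sampling variance.
y = [400.0, 420.0, 390.0, 410.0, 405.0, 415.0]
var_obs = [256.0] * len(y)
filtered, variances = local_level_filter(y, var_obs, var_level=25.0)
for obs, a, p in zip(y, filtered, variances):
    print(f"observed {obs:6.1f}  filtered {a:7.2f}  variance {p:7.2f}")
```

In this toy model the filtered variance is always below the sampling variance of the corresponding observation, because each filtered estimate also borrows strength from earlier months.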

In each month $t,$ a five-dimensional vector $Y_{t} = {(Y_{t}^{1} Y_{t}^{2} Y_{t}^{3} Y_{t}^{4} Y_{t}^{5})}^{'}$ is observed, containing GREG estimates of the total number of the unemployed labour force based on the five waves. Based on Pfeffermann (1991), van den Brakel and Krieg (2009) developed the following model for the GREG estimates $Y_{t} :$

$Y_{t} = 1_{5} ξ_{t} + λ_{t} + e_{t}, \qquad (2.4)$

Here, $1_{5}$ is a five-dimensional column vector of ones, $ξ_{t}$ is the unknown (scalar) true population parameter, $λ_{t}$ is a vector containing state variables for the RGB, and $e_{t}$ is a vector of the survey errors that are correlated with their counterparts from previous waves (the structure will be presented later). For the true population parameter, it is assumed that $ξ_{t} = L_{t} + γ_{t} + ε_{t},$ which is the sum of a stochastic trend $L_{t},$ a stochastic seasonal component $γ_{t},$ and an irregular component $ε_{t} \overset{iid}{\sim} N (0, σ_{ε}^{2}) .$

For the stochastic trend $L_{t},$ the so-called smooth-trend model is assumed:

$\begin{aligned} L_{t} & = L_{t - 1} + R_{t - 1}, \\ R_{t} & = R_{t - 1} + η_{R, t}, \end{aligned}$

where $L_{t}$ and $R_{t}$ represent the level and slope of the true population parameter, respectively, with the slope disturbance term being distributed as $η_{R, t} \overset{iid}{\sim} N (0, σ_{R}^{2}) .$
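The recursion of the smooth-trend model can be sketched with a short simulation. The function name `simulate_smooth_trend`, the starting values, and the value of $σ_{R}$ are hypothetical. Note the order of the updates: the level moves by the previous slope, and only the slope receives a disturbance.

```python
import random

def simulate_smooth_trend(n, level0, slope0, sigma_r, seed=0):
    """Simulate n values of L_t from the smooth-trend model."""
    rng = random.Random(seed)
    level, slope = level0, slope0
    levels = [level]
    for _ in range(n - 1):
        level = level + slope                      # L_t = L_{t-1} + R_{t-1}
        slope = slope + rng.gauss(0.0, sigma_r)    # R_t = R_{t-1} + eta_{R,t}
        levels.append(level)
    return levels

# With sigma_r = 0 the trend is an exact straight line.
line = simulate_smooth_trend(5, level0=400.0, slope0=2.0, sigma_r=0.0)
print(line)

# With sigma_r > 0 the slope drifts slowly, giving a smooth, curved trend.
path = simulate_smooth_trend(24, level0=400.0, slope0=2.0, sigma_r=0.5)
print([round(v, 1) for v in path[:6]])
```

Since the level equation has no disturbance of its own, the second differences of $L_{t}$ equal the lagged slope disturbances, which is what makes the resulting trend smooth.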

For the seasonal component $γ_{t},$ the trigonometric model is assumed:

$γ_{t} = \sum_{l =1}^{6} γ_{t, l},$

where each of these six harmonics follows the process:

$\begin{aligned} γ_{t, l} & = \cos (h_{l}) γ_{t - 1, l} + \sin (h_{l}) γ_{t - 1, l}^{*} + ω_{t, l}, \\ γ_{t, l}^{*} & = - \sin (h_{l}) γ_{t - 1, l} + \cos (h_{l}) γ_{t - 1, l}^{*} + ω_{t, l}^{*}, \end{aligned}$

with $h_{l} = \frac{π l}{6}$ being the $l^{th}$ seasonal frequency, $l = 1, \dots, 6 .$ The zero-expectation stochastic terms $ω_{t, l}$ and $ω_{t, l}^{*}$ are assumed to be normally and independently distributed and to possess the same variance within and across all the harmonics, such that:

$\begin{aligned} Cov (ω_{t, l}, ω_{t^{'}, l^{'}}) & = Cov (ω_{t, l}^{*}, ω_{t^{'}, l^{'}}^{*}) = \begin{cases} σ_{ω}^{2} & \text{if } l = l^{'} \text{ and } t = t^{'}, \\ 0 & \text{if } l \neq l^{'} \text{ or } t \neq t^{'}, \end{cases} \\ Cov (ω_{t, l}, ω_{t, l}^{*}) & = 0 \text{ for all } l \text{ and } t . \end{aligned}$
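The deterministic part of the trigonometric seasonal model is a set of rotations, one per harmonic. The sketch below builds the block-diagonal transition matrix and checks two properties that hold when the disturbances are switched off: the pattern repeats after 12 months, and the seasonal effects sum to zero over a year. The function name `seasonal_transition` and the starting state are hypothetical.

```python
import numpy as np

def seasonal_transition(n_harmonics=6, period=12):
    """Block-diagonal transition matrix of (gamma_l, gamma_l^*), l = 1..6."""
    T = np.zeros((2 * n_harmonics, 2 * n_harmonics))
    for l in range(1, n_harmonics + 1):
        h = 2.0 * np.pi * l / period          # h_l = pi * l / 6 for monthly data
        c, s = np.cos(h), np.sin(h)
        i = 2 * (l - 1)
        T[i:i + 2, i:i + 2] = [[c, s], [-s, c]]
    return T

T = seasonal_transition()
z = np.tile([1.0, 0.0], 6)                    # selects gamma_{t,l}, skips gamma^*
state = np.arange(1.0, 13.0)                  # arbitrary hypothetical start values

gammas = []
for _ in range(12):                           # one year without disturbances
    gammas.append(float(z @ state))           # gamma_t = sum_l gamma_{t,l}
    state = T @ state

print(np.allclose(np.linalg.matrix_power(T, 12), np.eye(12)))
print(abs(sum(gammas)) < 1e-9)
```

For $l = 6$ the frequency is $h_{6} = π,$ so $\sin (h_{6}) = 0$ and $γ_{t, 6}^{*}$ does not feed into $γ_{t, 6};$ implementations therefore often drop that state, although the sketch keeps it for symmetry.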

The second component in (2.4) is the RGB. It is assumed that the first wave is unbiased, as motivated in van den Brakel and Krieg (2009). The RGBs for the follow-up waves are time-dependent and are modelled as random walk processes. The rationale behind this is that field-work procedures are subject to frequent changes. Apart from that, response rates change gradually over time. This makes the RGB time-dependent, as illustrated by van den Brakel and Krieg (2015), Figure 4.3. The RGB vector for the five waves can be written in the following form: $λ_{t} = {(0 λ_{t}^{2} λ_{t}^{3} λ_{t}^{4} λ_{t}^{5})}^{'},$ with:

$λ_{t}^{j} = λ_{t - 1}^{j} + η_{λ, t}^{j}, j = {2, 3, 4, 5} .$

It is assumed that the RGB disturbances are not correlated across different waves and are normally distributed: $η_{λ, t}^{j} \overset{iid}{\sim} N (0, σ_{λ}^{2}),$ with equal variances in all four follow-up waves.

The last component in (2.4) contains the survey errors for the five GREG estimates, i.e., $e_{t} = {(e_{t}^{1} e_{t}^{2} e_{t}^{3} e_{t}^{4} e_{t}^{5})}^{'} .$ To account for sampling error heterogeneity caused by changes in the sample sizes over time, the sampling errors are modelled proportionally to the design-based standard errors according to the following measurement error model proposed by Binder and Dick (1990): $e_{t}^{j} = {\tilde{e}}_{t}^{j} z_{t}^{j},$ where $z_{t}^{j} = \sqrt{\hat{Var} (Y_{t}^{j})}$ and ${\tilde{e}}_{t}^{j}$ are standardised sampling errors that follow a stationary process defined later in the text. Here, $\hat{Var} (Y_{t}^{j})$ are the design-based variance estimates obtained from the micro data using (2.3). They are treated as a priori known sampling variances in the STS model.

Since the sample in the first wave has no overlap with samples observed in the past, ${\tilde{e}}_{t}^{1}$ can be modelled as white noise with $E ({\tilde{e}}_{t}^{1}) = 0$ and $Var ({\tilde{e}}_{t}^{1}) = σ_{v_{1}}^{2} .$ The variance of the survey errors $e_{t}^{1}$ will be equal to the variance of the GREG estimates if the maximum likelihood estimate of $σ_{v_{1}}^{2}$ is approximately equal to unity.

The survey errors in the follow-up waves are correlated with the survey errors from the preceding waves. This autocorrelation coefficient is estimated from the survey data using the approach proposed by Pfeffermann, Feder and Signorelli (1998). The autocorrelation structure is modelled with an AR(1) model where the autocorrelation coefficient is obtained with the Yule-Walker equations (van den Brakel and Krieg 2009):

${\tilde{e}}_{t}^{j} = ρ {\tilde{e}}_{t - 3}^{j - 1} + ν_{t}^{j}, \quad ν_{t}^{j} \overset{iid}{\sim} N (0, σ_{v_{j}}^{2}), \quad j = 2, 3, 4, 5 .$

It is assumed that the first-order autocorrelation coefficient is common for all the four waves. Its estimate is used as a priori information in the model. Since ${\tilde{e}}_{t}^{j}$ is an AR(1) process, $Var ({\tilde{e}}_{t}^{j}) = σ_{v_{j}}^{2} / (1 - ρ^{2}) .$ The variance of the sampling error $e_{t}^{j}$ is approximately equal to $\hat{Var} (Y_{t}^{j})$ if the maximum likelihood estimate of $σ_{v_{j}}^{2}$ is approximately equal to $(1 - ρ^{2}) .$ Five different hyperparameters $σ_{v_{j}}^{2} , j = {1,2,3,4,5},$ are assumed for the survey error components of the five waves.
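The unit-variance condition can be checked with a short recursion. As a sketch, assume that $ν_{t}^{j}$ is uncorrelated with ${\tilde{e}}_{t - 3}^{j - 1},$ and take $σ_{v_{1}}^{2} = 1$ and $σ_{v_{j}}^{2} = 1 - ρ^{2}$ for $j = 2, \dots, 5 .$ Then:

$Var ({\tilde{e}}_{t}^{j}) = ρ^{2} Var ({\tilde{e}}_{t - 3}^{j - 1}) + σ_{v_{j}}^{2} = ρ^{2} \cdot 1 + (1 - ρ^{2}) = 1, \quad j = 2, \dots, 5,$

where the second equality uses $Var ({\tilde{e}}_{t - 3}^{j - 1}) = 1,$ which holds for the first wave by assumption and for the later waves by induction. Consequently, $Var (e_{t}^{j}) = {(z_{t}^{j})}^{2} Var ({\tilde{e}}_{t}^{j}) = \hat{Var} (Y_{t}^{j})$ under these values.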

The disturbance variances, together with the autocorrelation parameter $ρ,$ are collected in a hyperparameter vector $θ = {(σ_{R}^{2} σ_{ω}^{2} σ_{ε}^{2} σ_{λ}^{2} σ_{v_{1}}^{2} σ_{v_{2}}^{2} σ_{v_{3}}^{2} σ_{v_{4}}^{2} σ_{v_{5}}^{2} ρ)}^{'},$ and the vector containing only the disturbance variances is denoted $θ_{σ} = {(σ_{R}^{2} σ_{ω}^{2} σ_{ε}^{2} σ_{λ}^{2} σ_{v_{1}}^{2} σ_{v_{2}}^{2} σ_{v_{3}}^{2} σ_{v_{4}}^{2} σ_{v_{5}}^{2})}^{'} .$ To avoid negative estimates, the disturbance variance hyperparameters in $θ_{σ}$ are estimated on a log-scale. The quasi-maximum likelihood method is used (see, e.g., Harvey 1989), where the estimate $\hat{ρ}$ is treated as known. The numerical analysis in this paper is conducted with OxMetrics 5 (Doornik 2007) in combination with the SsfPack 3.0 package (Koopman, Shephard and Doornik 2008).
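The effect of estimating a variance on a log-scale can be sketched on a toy problem. The example below maximises a zero-mean Gaussian likelihood over $ψ = \log σ^{2}$ with Newton steps, so that $σ^{2} = \exp (ψ)$ is positive at every iteration. This is not the quasi-likelihood of the STS model, which is evaluated with the Kalman filter; the function name `fit_log_variance` and the data are hypothetical.

```python
import math

def fit_log_variance(y, psi0=0.0, n_iter=50):
    """Maximise the zero-mean Gaussian log-likelihood over psi = log(sigma^2).

    Up to a constant, loglik(psi) = -n/2 * psi - s/2 * exp(-psi), with
    s = sum(y_i^2); the Newton update simplifies to psi + 1 - n*exp(psi)/s.
    """
    n = len(y)
    s = sum(v * v for v in y)
    psi = psi0
    for _ in range(n_iter):
        psi += 1.0 - n * math.exp(psi) / s    # Newton step in psi
    return psi

# Hypothetical residuals; the closed-form ML estimate is sum(y^2) / n.
y = [1.2, -0.7, 0.3, -1.5, 0.9]
psi_hat = fit_log_variance(y)
sigma2_hat = math.exp(psi_hat)                # back-transform: always positive
print(sigma2_hat, sum(v * v for v in y) / len(y))
```

The optimiser works on an unconstrained real number, and the back-transformation guarantees a positive variance without explicit constraints.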

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2017-06-22
