# Small area estimation for unemployment using latent Markov models

Section 2. Data and preliminary analysis

As already mentioned, LMAs are unplanned domains for the LFS. The sampling design is as follows. Within a given LAU1, municipalities are classified as Self-Representing Areas (larger municipalities) and Non-Self-Representing Areas (smaller municipalities). In Self-Representing Areas, a stratified cluster sampling design is applied: each municipality is a single stratum and households are selected by systematic sampling. In Non-Self-Representing Areas, the sample is based on a stratified two-stage sampling design: municipalities are Primary Sampling Units, while households are Secondary Sampling Units. Primary Sampling Units are divided into strata of equal population size. One Primary Sampling Unit is drawn from each stratum without replacement and with probability proportional to its population size. Secondary Sampling Units are selected by systematic sampling within each sampled Primary Sampling Unit. All members of each sample household, in both Self-Representing and Non-Self-Representing Areas, are interviewed. In each quarter, about 70,000 households and 1,350 municipalities are included in the sample. Note that some LMAs (usually the smallest ones) have a very small sample size. Furthermore, about one third of the LMAs are usually not included in the sample at all (i.e., they have a zero sample size).
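The two selection mechanisms used in Non-Self-Representing Areas can be sketched as follows. This is only a minimal illustration, not the LFS production procedure: the municipality names, population sizes, household labels and sample size are hypothetical, and the function names are introduced here for illustration.

```python
import random

def draw_psu_pps(stratum, rng):
    """Draw one municipality (PSU) from a stratum with probability
    proportional to its population size."""
    names = [name for name, _ in stratum]
    sizes = [size for _, size in stratum]
    return rng.choices(names, weights=sizes, k=1)[0]

def systematic_sample(units, n, rng):
    """Select n units by systematic sampling: random start, fixed step."""
    step = len(units) / n
    start = rng.random() * step          # random start in [0, step)
    return [units[int(start + k * step)] for k in range(n)]

rng = random.Random(0)

# Hypothetical stratum: (municipality, population size) pairs.
stratum = [("A", 1200), ("B", 800), ("C", 500)]
psu = draw_psu_pps(stratum, rng)

# Hypothetical list of 100 households in the selected municipality.
households = [f"{psu}-hh{j}" for j in range(100)]
sample = systematic_sample(households, 10, rng)
```

In Self-Representing Areas only the second function would be needed, since each municipality is its own stratum and is included with certainty.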

The LFS follows a rotating panel sampling design, according to a 2-(2)-2 scheme: households are interviewed in two consecutive quarters and, after a two-quarter break, they are interviewed for two additional consecutive quarters. Although the LFS panel design induces correlation among quarterly estimates, due to partial overlap of the sample units, we do not account for it in our model specification in the application illustrated in Section 5. In any case, we expect that this does not affect the comparison among different methods.
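The overlap induced by the 2-(2)-2 scheme can be made explicit with a small sketch. It assumes, purely for illustration, that a new cohort of households of equal size enters the sample in every quarter; quarters are indexed by integers and the function names are hypothetical.

```python
# Offsets (in quarters) at which a household is interviewed after entry:
# two consecutive quarters, a two-quarter break, two more quarters.
OFFSETS = (0, 1, 4, 5)

def interview_quarters(entry):
    """Quarters in which a household entering at `entry` is interviewed."""
    return [entry + k for k in OFFSETS]

def cohorts_in_sample(t):
    """Entry quarters of the cohorts that are interviewed in quarter t."""
    return {t - k for k in OFFSETS}

def overlap_share(t, lag):
    """Share of the quarter-t cohorts that are also interviewed at t + lag."""
    current = cohorts_in_sample(t)
    later = cohorts_in_sample(t + lag)
    return len(current & later) / len(current)

schedule = interview_quarters(1)
shares = {lag: overlap_share(10, lag) for lag in range(1, 7)}
```

Under this assumption, half of the cohorts are shared between consecutive quarters and between the same quarter of consecutive years, while samples two quarters apart do not overlap; this partial overlap is the source of the correlation among quarterly estimates.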

In this work we model quarterly unemployment incidences for 611 LMAs for the period 2004-Q1 to 2014-Q4 (44 quarters). Figure 2.1 shows the map of direct estimates for the first and the last quarter of the observation period. Figure 2.2, on the other hand, shows all the direct estimates for each small area in two NUTS2 areas: Lombardy (left panel), a rich region in the North of Italy, and Sicily (right panel), an island region in the South that is much less wealthy. We observe, in general, that direct estimates are extremely variable and that unemployment decreased over the first three years and then started to increase considerably.

Direct estimates in unplanned domains are characterized by a high Coefficient of Variation (CV), which is used as a measure of the uncertainty associated with the estimates. In addition, 6,762 out of 26,884 direct estimates cannot be computed because the sample size is zero. Usually, in Official Statistics, an estimate for a Labor Force parameter with a CV greater than 33.3% is considered too unreliable and is not recommended for release. Estimates with a CV between 16.6% and 33.3% must be released with caveats because their sampling variability is quite high, while estimates with a CV smaller than 16.6% are of sufficient accuracy and have no release restrictions; see Statistics Canada (2016, page 35). In our data, the vast majority of direct estimates have a very large CV and cannot be considered reliable, as shown in Figure 2.3.
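The release rule above can be written as a short classification function. The sketch below assumes the standard definition of the CV as the square root of the estimated MSE divided by the estimate; the treatment of the exact boundary values (16.6% and 33.3%) and the function names are choices made here for illustration only.

```python
import math

N_AREAS, N_QUARTERS = 611, 44
N_ESTIMATES = N_AREAS * N_QUARTERS        # area-by-quarter cells
SHARE_MISSING = 6762 / N_ESTIMATES        # cells with zero sample size

def cv_percent(estimate, mse):
    """CV in percent, assuming CV = sqrt(MSE) / estimate.
    Returns None when the estimate is zero (CV not computable)."""
    if estimate == 0:
        return None
    return 100.0 * math.sqrt(mse) / estimate

def release_class(cv):
    """Map a CV (in percent) to the release classes described in the text."""
    if cv is None:
        return "not computable"
    if cv < 16.6:
        return "no restrictions"
    if cv <= 33.3:                        # boundary handling is an assumption
        return "release with caveats"
    return "not recommended for release"

# Hypothetical estimate of 5% unemployment with an MSE of 0.0001.
example = release_class(cv_percent(0.05, 0.0001))
```

The constant `SHARE_MISSING` shows that roughly one quarter of the area-by-quarter cells have no direct estimate at all.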

## Description for Figure 2.1

Figure showing two Italy maps to illustrate the direct estimates of unemployment incidence in 5 categories: not applicable, less than 1.71%, between 1.71 and 3%, between 3 and 4.78% and more than 4.78%. The first map shows the unemployment estimates for the first quarter of 2004. The North of Italy shows lower unemployment estimates, mainly below 3%. The South shows unemployment estimates that are generally higher than 4.78%.

The second map presents unemployment estimates for the fourth quarter of 2014, which have increased for most areas.

## Description for Figure 2.2

Figure made of two line charts of the quarterly direct estimates of unemployment incidences in two NUTS2 regions, Lombardy on the left and Sicily on the right, from 2004-Q1 to 2014-Q4. In both charts, time is on the x-axis and unemployment is on the y-axis. For the Lombardy region, the y-axis goes from 0 to 0.12, and the unemployment estimates are lower for this region. For the Sicily region, the y-axis goes from 0 to 0.30. In both regions, direct estimates are extremely variable; unemployment decreases over the first three years and then starts to increase considerably.

## Description for Figure 2.3

Figure illustrating, for each quarter, the distribution of the sampled small areas according to the CV of the direct estimates. The LMAs are on the y-axis and time is on the x-axis. The three CV classes are lower than 16.6%, between 16.6% and 33.3%, and higher than 33.3%. The graph shows that the vast majority of direct estimates for the LMAs have a very large CV, in the third class, and cannot be considered reliable. Only a small proportion of the direct estimates for the LMAs have a CV lower than 16.6%.

The basic idea of SAE is to introduce a statistical model to exploit the relationship between the variable of interest and some covariates for which population information is available. Auxiliary variables available for these data are the population rates in $\text{sex} \times 7$ age classes (15-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-74). Since LFS estimates are not seasonally adjusted, we take seasonality into account using year and quarter effects through 10 and 3 dummy variables, respectively.
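The count of 10 year dummies and 3 quarter dummies follows from dropping one reference category from the 11 years (2004 to 2014) and from the 4 quarters. A minimal sketch of this coding is given below; the choice of 2004 and the first quarter as reference categories is an assumption made for illustration, as the text does not state which categories are omitted.

```python
def time_dummies(year, quarter, base_year=2004, n_years=11):
    """Return the 10 year dummies followed by the 3 quarter dummies.

    The reference categories (base_year and quarter 1) are assumptions:
    they are coded as all zeros.
    """
    year_part = [1 if year == base_year + k else 0 for k in range(1, n_years)]
    quarter_part = [1 if quarter == q else 0 for q in (2, 3, 4)]
    return year_part + quarter_part

# One row of dummies for each of the 44 quarters from 2004-Q1 to 2014-Q4.
rows = [time_dummies(y, q) for y in range(2004, 2015) for q in range(1, 5)]
```

Each row has 13 entries, and the reference quarter 2004-Q1 is the only one coded as a vector of zeros.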

The CVs of direct estimates are themselves estimates, and their precision is a function of the sample size. Therefore, they are subject to substantial sampling error that can affect small area modeling in different ways (Rao and Yu, 1994), and smoothing the estimated Mean Squared Errors (MSEs) is necessary (see Rao, 2003, Chapter 5). In this work, we propose to use a regression model with a logarithmic transformation of the CV and of the MSE (see Wolter, 2007, Chapter 7). In particular, our approach is based on two steps. In the first step, we model the CV and then compute the smoothed MSE from this model. In the second step, we model the MSE directly for those small areas for which we do not have a valid CV (i.e., for those LMAs with a zero estimate).

Let $\widehat{\theta}_{it}$ be the direct survey estimate for small area $i = 1, \dots, m,$ with $m = 611,$ at time $t = 1, \dots, T,$ with $T = 44.$ Let $\text{CV}_{it}$ denote the corresponding estimate of the CV. Note that Italy is divided into four geographic areas, called broad-areas (North-West, North-East, Center, and South and Islands), and that each LMA belongs to only one of these broad-areas. In order to smooth estimates of MSEs, we have the following auxiliary information:

- ${M}_{it}$ is the population size at time $t$ of the broad-area to which LMA $i$ belongs;
- ${N}_{it}$ is the population size of LMA $i$ at time $t;$
- ${r}_{it}$ is a 14-dimensional column-vector that contains population rates in $\text{sex} \times 7$ age classes, for LMA $i$ at time $t.$

At the first step of the proposed procedure, we fit the following regression model for each broad-area:

$$\log\left(\text{CV}_{it}\right) = \beta_{0} + \log\left(\widehat{\theta}_{it}\right)\beta_{1} + \log\left(\frac{N_{it}}{M_{it}}\right)\beta_{2} + \log\left(r_{it}\right)^{\prime}\beta_{3} + \log\left(1_{14} - r_{it}\right)^{\prime}\beta_{4}, \qquad (2.1)$$

where ${1}_{14}$ is a 14-dimensional column vector of ones. The use of the log-transformation and the choice of the covariates have been assessed using standard model selection techniques, such as AIC and adjusted $R^{2}.$ Using predictions denoted by $\widehat{\text{CV}}_{it}$ from this model, smoothed MSEs are obtained as

$$\widehat{\text{MSE}}_{it} = \left(\widehat{\text{CV}}_{it} \times \widehat{\theta}_{it}\right)^{2}.$$
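The first step can be sketched with ordinary least squares on simulated data for a single broad-area. This is an illustration under stated assumptions rather than the authors' implementation: the data, the coefficient values and the function names are hypothetical, the fit uses NumPy's `lstsq`, the predicted CV is obtained by simply exponentiating the fitted log-CV, and the smoothed MSE is computed under the standard definition $\text{CV} = \sqrt{\text{MSE}} / \widehat{\theta},$ i.e., as the square of the predicted CV times the direct estimate.

```python
import numpy as np

def design_step1(theta, N, M, r):
    """Design matrix of model (2.1): intercept, log(theta), log(N/M),
    log(r) and log(1 - r), where r has one column per sex-by-age class."""
    return np.column_stack([
        np.ones(len(theta)),
        np.log(theta),
        np.log(N / M),
        np.log(r),
        np.log1p(-r),
    ])

def smooth_mse_step1(cv, theta, N, M, r):
    """Fit log(CV) by least squares and return smoothed MSEs,
    assuming CV = sqrt(MSE) / theta."""
    X = design_step1(theta, N, M, r)
    beta, *_ = np.linalg.lstsq(X, np.log(cv), rcond=None)
    cv_hat = np.exp(X @ beta)             # naive back-transformation
    return (cv_hat * theta) ** 2

# Simulated inputs for one broad-area (all values are hypothetical).
rng = np.random.default_rng(0)
n = 200
theta = rng.uniform(0.01, 0.30, n)                  # positive direct estimates
N = rng.uniform(5e3, 5e5, n)                        # LMA population sizes
M = np.full(n, 1.5e7)                               # broad-area population size
r = rng.dirichlet(np.full(14, 20.0), size=n)        # 14 population rates
cv = np.exp(-1.0 - 0.5 * np.log(theta) - 0.3 * np.log(N / M)
            + rng.normal(0.0, 0.2, n))              # noisy "observed" CVs

mse_smooth = smooth_mse_step1(cv, theta, N, M, r)
```

Only cells with a strictly positive direct estimate can enter this step, since both $\log(\widehat{\theta}_{it})$ and the CV are undefined when the estimate is zero.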

In the second step of the proposed procedure, we consider the cases with $\widehat{\theta}_{it} = 0.$ Here CVs cannot be computed, but MSEs are still available: direct estimates are based on calibrated weights, and MSE estimates are based on the residuals of a generalized regression that accounts for the auxiliary variables used in the calibration constraints. For these cases, MSEs are modeled directly and separately for each broad-area using the following model:

$$\log\left(\text{MSE}_{it}\right) = \beta_{0} + \log\left(\frac{N_{it}}{M_{it}}\right)\beta_{1} + \log\left(r_{it}\right)^{\prime}\beta_{2} + \log\left(1_{14} - r_{it}\right)^{\prime}\beta_{3}.$$

Smoothed MSEs are obtained as predictions from this model. Note that we have resorted to this two-step procedure because the former model, the one for CVs, fitted better than the latter, the one for MSEs, in our application. Figure 2.4 reports the final output of this two-step procedure and displays the original and the smoothed MSEs versus unemployment incidence for all sampled areas.
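The second step can be sketched in the same way for the cells with a zero direct estimate. As before, this is a hedged illustration on simulated data for one broad-area: the inputs, coefficient values and function names are hypothetical, the fit uses NumPy's `lstsq`, and the prediction is a plain exponentiation of the fitted log-MSE, which is one possible reading of "predictions from this model".

```python
import numpy as np

def design_step2(N, M, r):
    """Design matrix of the MSE model: intercept, log(N/M),
    log(r) and log(1 - r); the direct estimate does not appear."""
    return np.column_stack([
        np.ones(len(N)),
        np.log(N / M),
        np.log(r),
        np.log1p(-r),
    ])

def smooth_mse_step2(mse, N, M, r):
    """Fit log(MSE) by least squares and return the smoothed MSEs."""
    X = design_step2(N, M, r)
    beta, *_ = np.linalg.lstsq(X, np.log(mse), rcond=None)
    return np.exp(X @ beta)               # naive back-transformation

# Simulated cells with a zero direct estimate (all values are hypothetical).
rng = np.random.default_rng(1)
n0 = 120
N0 = rng.uniform(5e3, 5e4, n0)                      # small LMA populations
M0 = np.full(n0, 1.5e7)                             # broad-area population size
r0 = rng.dirichlet(np.full(14, 20.0), size=n0)      # 14 population rates
mse_raw = np.exp(-9.0 - 0.4 * np.log(N0 / M0)
                 + rng.normal(0.0, 0.3, n0))        # noisy "observed" MSEs

mse_smooth0 = smooth_mse_step2(mse_raw, N0, M0, r0)
```

The final set of smoothed MSEs would then combine the step-one values for cells with a positive estimate and these step-two values for cells with a zero estimate.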

## Description for Figure 2.4

Scatter plot comparing the original and smoothed MSEs vs the unemployment incidence for all sampled areas. MSEs are on the y-axis, going from 0 to 0.007. Unemployment is on the x-axis, ranging from 0 to 0.35. Initial MSEs are more scattered and generally higher than the smoothed MSEs.
