Estimation of domain discontinuities using Hierarchical Bayesian Fay-Herriot models
Section 1. Introduction
Official statistics
produced by national statistical institutes are generally based on repeated
sample surveys. Much of their value lies in their continuity, enabling
developments in society and the economy to be monitored, and policy actions
decided. Besides sampling errors, survey estimates are subject to various
sources of non-sampling error that have a systematic effect on the outcomes of
a survey. As long as the survey process is kept constant, this bias component
is not visible. This is often an argument for keeping the survey process of a
repeated survey unchanged for as long as possible. From time to time, however,
changes are needed to improve efficiency, reduce survey-related costs, or meet
new requirements. A prominent example is the introduction of mixed-mode surveys
that include web-based questionnaires in official statistics. A redesign of the
survey process generally has systematic effects on the survey estimates, since
the biases induced by the aforementioned non-sampling errors change, which
disturbs comparability with figures published in the past.
Systematic differences in
the outcomes of a repeated survey due to redesign of the survey process are
called discontinuities. To avoid the implementation of a new survey process
disturbing the comparability of estimates over time, it is important to
quantify these discontinuities. This avoids confounding real change in the
parameters of interest with changing measurement bias due to alteration of the
survey process.
Several methods to
quantify discontinuities are proposed in the literature (van den Brakel,
Smith and Compton, 2008). A reliable and straightforward approach is to
run the old and new approach side by side for some period of time, further
referred to as a parallel run. Ideally this is based on a randomized experiment
embedded in the probability sample of the survey (van den Brakel, 2008). In
this paper small area estimation methods for estimating domain discontinuities
are proposed. We consider the situation where the regular survey, used for the
production of official figures, is conducted at its full sample size in
parallel with an alternative approach. Due to budget limitations, the sample
assigned to the alternative approach is often not large enough to detect
differences of a prespecified minimum size at prespecified significance and
power levels using standard direct estimators, particularly for subpopulations
or domains.
To explain the problem addressed in this paper, some notation is introduced.
Let $\theta_i$ denote the real population value of a variable of interest for
domain $i$. Furthermore, $\hat{\theta}_i^{r}$ and $\hat{\theta}_i^{a}$ denote
direct estimates of $\theta_i$ based on the regular survey and the alternative
survey approach, respectively. Since the regular survey is conducted at the
regular sample size, $\hat{\theta}_i^{r}$ is a reliable direct estimate for
$\theta_i$, at least for the planned domains. Due to the reduced sample size of
the new survey in the parallel run, $\hat{\theta}_i^{a}$ will, however, be
insufficiently precise. More precise domain estimates with the small sample
available under the new approach can be obtained with the Fay-Herriot (FH)
model (Fay and Herriot, 1979), which is defined as
$$\hat{\theta}_i^{a} = x_i^{\top}\beta + v_i + e_i,$$
with $x_i$ a vector with covariates at the domain level, $\beta$ the regression
coefficients, $v_i$ the random domain effects and $e_i$ the sampling error. To
obtain more precise domain estimates for the alternative approach, van den
Brakel, Buelens and Boonstra (2016) proposed a hierarchical Bayesian (HB)
univariate FH model, where sample estimates of the regular survey are
considered as potential auxiliary variables in a model selection procedure.
This implies that $\hat{\theta}_i^{r}$ is used as a covariate in $x_i$ besides
the usual covariates that are available from registers or censuses. This
results in an area level model with measurement error (Ybarra and Lohr, 2008).
The use of reliable direct estimates observed in the regular survey
significantly increased the precision of the domain estimates for the
alternative approach conducted at reduced sample size (van den Brakel
et al., 2016).
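To illustrate how an FH-type model borrows strength from the regression part, the following Python sketch computes the composite predictor that arises when the regression prediction and both variance components are treated as known. This is a deliberate simplification and not the HB fit used in the paper, where these quantities are estimated. The function name `fh_composite` and all argument names are hypothetical.

```python
def fh_composite(direct, synthetic, sampling_var, model_var):
    """Composite FH-type predictor with KNOWN variance components.

    Assumptions (illustrative only, not the paper's HB procedure):
      direct       - direct domain estimates
      synthetic    - regression predictions x_i' beta, taken as given
      sampling_var - sampling variances of the direct estimates
      model_var    - variance of the random domain effects
    Returns one (prediction, conditional variance) pair per domain.
    """
    results = []
    for y, s, psi in zip(direct, synthetic, sampling_var):
        if model_var + psi <= 0.0:
            raise ValueError("model_var + sampling_var must be positive")
        # Weight on the direct estimate: close to 1 when the direct
        # estimate is precise, close to 0 when it is noisy.
        gamma = model_var / (model_var + psi)
        prediction = gamma * y + (1.0 - gamma) * s
        # Variance of the predictor when all parameters are known.
        variance = gamma * psi
        results.append((prediction, variance))
    return results


# Toy numbers: equal sampling and model variance give equal weights.
example = fh_composite([10.0], [6.0], [1.0], 1.0)
```

In this toy case the prediction lies halfway between the direct estimate and the regression prediction, and its variance is half the sampling variance of the direct estimate.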
Let $\tilde{\theta}_i^{a}$ denote the small area prediction for $\theta_i$
based on the aforementioned FH model under the small sample assigned to the
alternative survey approach. In the approach followed by van den Brakel et al.
(2016), point estimates for domain discontinuities are obtained as the
difference between the direct estimate obtained with the regular survey,
$\hat{\theta}_i^{r}$, and the model-based domain prediction obtained under the
alternative approach, i.e.,
$$\hat{\Delta}_i = \hat{\theta}_i^{r} - \tilde{\theta}_i^{a}.$$
The use of the direct estimate of the regular survey as an auxiliary variable
in the small domain predictions of the alternative survey results in strong
positive correlations between both estimators, which cannot be ignored when
computing the standard errors for the discontinuities. More precisely,
$$\operatorname{var}(\hat{\Delta}_i) = \operatorname{var}(\hat{\theta}_i^{r}) + \operatorname{var}(\tilde{\theta}_i^{a}) - 2\operatorname{cov}(\hat{\theta}_i^{r}, \tilde{\theta}_i^{a}).$$
Since $\hat{\theta}_i^{r}$ is also used as a covariate in $x_i$ in the FH model
for $\hat{\theta}_i^{a}$, $\operatorname{cov}(\hat{\theta}_i^{r}, \tilde{\theta}_i^{a})$
will be nonzero. To this end, two analytic approximations for the standard
errors of the discontinuities are proposed. The first approach combines the
design-based variance estimate of the direct estimator of the regular survey,
$\operatorname{var}(\hat{\theta}_i^{r})$, with the posterior variance of the HB
domain predictions of the alternative survey,
$\operatorname{var}(\tilde{\theta}_i^{a})$, and a design-based estimator for
the covariance between both point estimates,
$\operatorname{cov}(\hat{\theta}_i^{r}, \tilde{\theta}_i^{a})$. This approach
is unstable in the sense that even negative variance estimates occur in the
case of strong positive covariance estimates. A related issue is that
design-based and model-based variance approximations are combined in one
uncertainty measure for the discontinuities. Therefore a second analytic
approximation was proposed, where a design-based estimator for the variance of
the HB domain predictions $\tilde{\theta}_i^{a}$ is derived and combined with
the design-based variance for the direct estimator for the regular survey and
the design-based covariance between both point estimates.
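The instability of the first approximation can be made concrete with a small Python sketch. It only evaluates the variance of a difference of two correlated estimators from three separately supplied components; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def discontinuity_variance(var_regular, var_alternative, covariance):
    """Variance of (regular estimate - alternative prediction)."""
    return var_regular + var_alternative - 2.0 * covariance


def discontinuity_se(var_regular, var_alternative, covariance):
    """Standard error, or None when the plugged-in variance is negative."""
    variance = discontinuity_variance(var_regular, var_alternative, covariance)
    return math.sqrt(variance) if variance >= 0.0 else None


# Admissible case: |cov| <= sqrt(4.0 * 1.0) = 2.0, so the variance is positive.
ok_variance = discontinuity_variance(4.0, 1.0, 1.5)

# Components estimated separately need not respect that bound; a covariance
# estimate of 3.0 then yields a negative variance estimate.
bad_variance = discontinuity_variance(4.0, 1.0, 3.0)
```

A true covariance can never exceed the square root of the product of the two variances, so a negative value can only arise because the three components are estimated separately, here even under different inferential frameworks.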
Several references to
design-based mean squared error estimation in small area estimation can be
found in the literature. Gonzalez
and Waksberg (1973) introduced the
concept of an average design-based mean squared error of a set of synthetic
estimators and proposed an estimator that, however, can be unstable and take
negative values. Marker (1995) proposed a more stable but biased estimator for the
design-based mean squared error for small area estimates, which can also take
negative values. Lahiri and
Pramanik (2019) proposed a design-based
estimator that cannot take negative values, following the concepts of an
average design-based mean squared error, originally introduced by Gonzalez and Waksberg (1973). Rivest and
Belmonte (2000) proposed an estimator for
the mean squared error that measures the uncertainty with respect to the
sampling design conditional on the random effects of the model and assuming
normality of the sampling model. Rao, Rubin-Bleuer and Estevao (2018) and Pfeffermann and Ben-Hur (2018) also propose a model for the design-based mean
squared error of small area estimators. Rao et al. (2018) estimate
the model parameters through restricted maximum likelihood, while Pfeffermann and Ben-Hur (2018) apply a bootstrap method.
The complications with
variance estimation of domain discontinuities under a univariate FH model can
also be circumvented by setting up a full Bayesian framework for the analysis
of the domain discontinuities. Two approaches are proposed in this paper. The
first approach is a bivariate FH model to model the direct estimates under the
regular and alternative approach simultaneously, i.e., a bivariate area level
model for the vector $(\hat{\theta}_i^{r}, \hat{\theta}_i^{a})^{\top}$. The random
component of this model accounts for the correlation between the domain
parameters under the regular and alternative approach. The precision of the
estimated discontinuities is improved by increasing the effective sample size
within the domains by means of cross-sectional correlations. In addition, a
positive correlation between the random domain effects further decreases the
standard error of the estimated discontinuities. The second approach uses a
univariate FH model for the direct estimates of the discontinuities, i.e., a
univariate FH model for $\hat{\theta}_i^{r} - \hat{\theta}_i^{a}$. This method is
considered a less complex alternative to the bivariate FH model. It is,
however, anticipated that it is harder to construct good prediction models,
since the available covariates from registers might be good predictors for the
target variables of the sample survey but probably not for systematic
differences between two estimates for the same variable obtained with
different survey processes.
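As a sketch of the structure described above, one common way to write a bivariate area level model is given below. The notation is an assumption made for illustration: $\hat{\theta}_i^{r}$ and $\hat{\theta}_i^{a}$ are the direct estimates under the regular and alternative approach, and the symbols $\Sigma_v$, $\sigma_r$, $\sigma_a$ and $\rho$ are introduced here and need not match the specification used later in the paper.

```latex
\begin{pmatrix} \hat{\theta}_i^{r} \\ \hat{\theta}_i^{a} \end{pmatrix}
=
\begin{pmatrix} x_i^{r\top}\beta^{r} \\ x_i^{a\top}\beta^{a} \end{pmatrix}
+
\begin{pmatrix} v_i^{r} \\ v_i^{a} \end{pmatrix}
+
\begin{pmatrix} e_i^{r} \\ e_i^{a} \end{pmatrix},
\qquad
\begin{pmatrix} v_i^{r} \\ v_i^{a} \end{pmatrix}
\sim N\!\left(0, \Sigma_v\right),
\quad
\Sigma_v =
\begin{pmatrix}
\sigma_r^{2} & \rho\,\sigma_r\sigma_a \\
\rho\,\sigma_r\sigma_a & \sigma_a^{2}
\end{pmatrix}.

% Variance of the difference of the two random domain effects:
\operatorname{var}\!\left(v_i^{r} - v_i^{a}\right)
= \sigma_r^{2} + \sigma_a^{2} - 2\rho\,\sigma_r\sigma_a .
```

The second expression shows why a positive correlation $\rho$ between the random domain effects reduces the uncertainty of their difference: the larger $\rho$, the more of the two variances cancels.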
The univariate FH model
proposed by van den Brakel
et al. (2016) was applied to
estimate domain discontinuities in five key target variables of the Dutch Crime
Victimization Survey (CVS) using data obtained in a parallel run where the
regular survey is conducted at the regular sample size and the alternative
survey at a sample size that is about one fourth of the regular sample size. In
this paper the bivariate FH model and the univariate FH model for the domain
discontinuities are also applied to the same redesign of the CVS. The results
are compared with the univariate FH model proposed in van den Brakel et al. (2016).
Model selection in this
paper is based on a step-forward selection procedure that minimizes the WAIC
(Watanabe, 2010, 2013). To avoid selecting over-parameterized models, it is
proposed to add a covariate in the step-forward selection procedure only if it
decreases the WAIC by more than the standard error of the WAIC. This prevents
the selection of several covariates that each improve the WAIC only marginally,
which would result in models that tend to overfit the data.
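The selection rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `fit_waic` is a hypothetical callback that fits a model with the given covariates and returns its WAIC and the standard error of that WAIC, and the sketch assumes the standard error of the candidate model is the one used in the comparison. The toy WAIC table is invented purely to exercise the rule.

```python
def forward_select(candidates, fit_waic):
    """Step-forward selection: add a covariate only if it lowers the WAIC
    by more than the WAIC standard error of the candidate model."""
    selected = []
    best_waic, _ = fit_waic(selected)
    remaining = list(candidates)
    while remaining:
        # Evaluate every one-covariate extension of the current model.
        trials = [(fit_waic(selected + [c]), c) for c in remaining]
        (waic, se), best = min(trials, key=lambda t: t[0][0])
        if best_waic - waic <= se:
            break  # improvement is within one standard error: stop
        selected.append(best)
        remaining.remove(best)
        best_waic = waic
    return selected, best_waic


# Hypothetical WAIC table (value, standard error) keyed by covariate set.
toy_table = {
    frozenset(): (100.0, 5.0),
    frozenset({"a"}): (80.0, 5.0),
    frozenset({"b"}): (95.0, 5.0),
    frozenset({"a", "b"}): (78.0, 5.0),
}


def toy_fit(covariates):
    return toy_table[frozenset(covariates)]


chosen, chosen_waic = forward_select(["a", "b"], toy_fit)
```

In the toy table, adding covariate `a` lowers the WAIC by 20, well above the standard error of 5, whereas adding `b` afterwards lowers it by only 2, so the procedure stops after the first step.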
The FH model (Fay and Herriot, 1979) is frequently applied in the context of small area
estimation (Rao and Molina,
2015). FH models are particularly
appropriate if auxiliary information is available at the domain level. Datta,
Ghosh, Nangia and Natarajan (1996) employed a multivariate FH model fitted in an HB
framework to estimate median income. Multivariate FH models fitted in a
frequentist framework are considered in Gonzalez-Manteiga, Lombardia, Molina,
Morales and Santamaria (2008) and Benavent and Morales (2016). Several
authors proposed time-series FH models that use sample information from previous
editions of a survey as a form of small area estimation (Rao and Yu, 1994; Datta, Lahiri, Maiti and Lu, 1999; You and Rao, 2000; Esteban, Morales, Perez and Santamaria, 2012; Marhuenda, Molina and Morales, 2013). Pfeffermann and Burck (1990), Pfeffermann and Tiller (2006), van den Brakel
and Krieg (2016) and Bollineni-Balabay,
van den Brakel, Palm and Boonstra (2017) are some examples of FH
time-series models cast in a state-space framework. Boonstra and van den Brakel (2019) discuss how FH time-series models can be expressed
either in a state-space framework and fitted with the Kalman filter or, alternatively,
as time-series multilevel models in a hierarchical Bayesian
framework and estimated using a Gibbs sampler.
The paper is structured
as follows. In Section 2 the Crime Victimization Survey, the redesign and
the setup of the parallel run are described. The bivariate FH model is
explained in Section 3, including the HB framework and the model selection
and evaluation approach. Results are presented in Section 4. The paper
ends with a discussion in Section 5.
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2021
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa