Bayesian inference for a variance component model using pairwise composite likelihood with survey data
Section 1. Introduction
Multi-stage survey designs are used in many
population-based surveys. Increasingly, multilevel models have been used to
make inferences when data are obtained from a multi-stage survey.
Desiring to improve such inferences Rao, Verret and
Hidiroglou (2013) (RVH) proposed using a weighted log pairwise composite
likelihood approach. There is an extensive literature on composite likelihoods:
see review papers by Varin (2008), Varin, Reid and Firth (2011) and Yi (2017),
and many applications. In their Section 4 RVH describe a unified approach
applicable to both linear and generalized linear models. Important aspects of
their work include (a) obtaining design-consistent point estimates of mean and
regression parameters and variance components, and (b) using only first-order
inclusion probabilities and second-order probabilities within clusters. In
particular, RVH work in (a) is important because of design inconsistency when
the number of clusters (first-stage units) grows while the cluster sample sizes
remain small (Pfeffermann, Skinner, Holmes, Goldstein and Rasbash, 1998).
Unlike the pseudo-likelihood approach in common use (Rabe-Hesketh and Skrondal,
2006) their method ensures that (a) holds for outcomes from generalized linear
models. The research in RVH has been extended by Yi, Rao and Li (2016) (YRL),
who provide a more general framework, additional theory and extensive
simulations.
Two related developments have led to our research.
First, there is increasing interest in the use of Bayesian methods for
inferences from survey data. Section 5 has a general reference together
with an introduction to papers describing extensive use of Bayesian methods at
the National Agricultural Statistical Service of the US Department of
Agriculture. Second, there is (Bayesian) literature demonstrating the
possibility of overstated precision by using unadjusted composite
likelihoods, e.g., Ribatet, Cooley and Davison (2012) (RCD) and Stoehr and
Friel (2018).
Our approach is to start with a posterior distribution
taken proportional to the product of a composite likelihood and prior
distribution. Comparing this approximate posterior distribution with one using
the full likelihood, we show that inferences based on the approximate posterior
exhibit overstated precision. Making adjustments to the posterior distribution
based on the composite likelihood as in RCD, we then use simulations to compare
the three ways of formulating a posterior distribution, i.e., those based on
the full, composite and adjusted composite likelihoods. This is done by visual
evaluation of the graphs of the posterior densities and coverages (over
repeated simulations) of 95% credible intervals for the model parameters.
The methodology is described in Section 2.3. The
adjustments to the approximate posterior distribution based on a composite
likelihood are derived from a transformation of the logarithm of the composite
likelihood at its mode, designed so that the negative of the inverse of the
curvature matrix of the approximate posterior density at its mode will match
the corresponding
posterior variance-covariance matrix of the parameters. This is similar to the
property in frequentist inference that the inverse of the observed Fisher
information matrix (negative of the Hessian of the log likelihood at its mode)
estimates the variance-covariance matrix of the maximum likelihood estimates.
To focus on the main issue we use a “noninformative” prior
distribution for the parameters of our model, described below. Then the
corresponding posterior density is close to the normalized likelihood, and
advances shown in a Bayesian context would also be seen in a frequentist
model-based approach.
To simplify the initial investigation a standard linear
random effects (intercept only) superpopulation model is assumed. Consider a
survey population drawn from that superpopulation and composed of a large
number of clusters, each of a common size, say, Let denote the continuous response variable for
elementary unit in cluster with and Then we write
where all and are
independent, and , and are
parameters.
We also begin by assuming that the survey sampling
design is a simple random sample of clusters, where is a positive integer. This has the advantage
that the model (1.1) holds not only for the superpopulation and the finite
population but also (replacing by for the sample itself, arising from generation
of the population followed by the selection of the sample using the sampling
design. It ensures that the likelihood function to be used in Bayesian
inference is well defined. As well, it can be shown that Bayesian inference
from the sample for the parameters of model (1.1) would be interpretable also
in terms of the frequentist theory for analytic uses of survey data (Skinner,
Holt and Smith, 1989).
Our work is valuable because we show the perils of using
an unadjusted pairwise composite likelihood to form an approximate
posterior distribution for inference even in this very simple and
straightforward case. Extensions to unequal probability sampling designs are
discussed in Section 4.
The proposed adjustment leads to excellent frequentist
properties for inference about the mean The posterior mean of has low bias in the frequency sense, and the
frequentist coverage of credible intervals aligns with the nominal levels. For it provides a significant improvement over
using the unadjusted composite likelihood. However, the coverage falls short of
the nominal level, leading to the need for additional research about how to
modify the adjustment.
The rest of the paper is structured as follows. Section 2
provides the definitions of the full, composite and adjusted composite likelihoods
and the prior distributions. There follows a description of the curvature
adjustment and the reasons for its use. The simulation studies are described in
Section 3 including the model, prior distributions, sample sizes and their
settings, number of replications, etc. This section also describes how the
results are displayed together with a summary of our findings. Extensions to
unequal probability sampling cases are discussed in Section 4. Conclusions
are in Section 5.
ISSN : 1492-0921
Editorial policy
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.
Submission of Manuscripts
Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).
Note of appreciation
Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.
Standards of service to the public
Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa