Bayesian inference for a variance component model using pairwise composite likelihood with survey data
Section 2. Full likelihood, pairwise likelihood and Bayesian implementation
2.1 Model and formulae
As in Section 1, let $y_{ij}$ denote the response variable for second-stage
unit $j$ in first-stage unit $i$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. We use lower case letters to represent realized values of random variables. Let $y$ denote the sample data, with $y_i = (y_{i1}, \ldots, y_{i n_i})^{\mathrm{T}}$ for $i = 1, \ldots, m$, where T denotes transpose.
In a more general random effects model, we might assume
that, conditional on random effects $v_i$, the $y_{ij}$ are independently distributed as
$$y_{ij} \mid v_i \sim f(y_{ij} \mid v_i; \beta), \quad j = 1, \ldots, n_i,$$
for $i = 1, \ldots, m$, where $f$ is a
known density function and $\beta$ is the
associated parameter vector. Next, we model the random effects by assuming that the $v_i$ are
independent and identically distributed as
$$v_i \sim g(v_i; \gamma), \quad i = 1, \ldots, m,$$
where $g$ is a
given density function indexed by the parameter vector $\gamma$.
Let $\theta = (\beta^{\mathrm{T}}, \gamma^{\mathrm{T}})^{\mathrm{T}}$ be the vector of model parameters which is of
interest. In the frequentist framework, the maximum likelihood method is
commonly used to conduct inference about $\theta$ by maximizing the likelihood function
$$L(\theta) = \prod_{i=1}^{m} L_i(\theta), \quad \text{where} \quad L_i(\theta) = \int \Big\{\prod_{j=1}^{n_i} f(y_{ij} \mid v_i; \beta)\Big\}\, g(v_i; \gamma)\, dv_i.$$
An alternative to the likelihood method is the composite
likelihood approach (Lindsay, 1988). In particular, the pairwise likelihood
method has often been employed. Let $f_1(y_{ij}; \theta)$ be the marginal density of $y_{ij}$ determined by the model above.
For $j < k$, let $f_2(y_{ij}, y_{ik}; \theta)$ be the
joint density for the paired responses $(y_{ij}, y_{ik})$ determined by the model. Then a
marginal pairwise likelihood function can be formulated as
$$L_{\mathrm{PL}}(\theta) = \prod_{i=1}^{m} \Big[\Big\{\prod_{j=1}^{n_i} f_1(y_{ij}; \theta)^{w_{ij}}\Big\} \Big\{\prod_{j<k} f_2(y_{ij}, y_{ik}; \theta)^{w_{ijk}}\Big\}\Big],$$
where $w_{ij}$ and $w_{ijk}$ are
weights that can be user-specified to enhance efficiency or to facilitate some
specific features of the formulation. Discussion on choosing weights can be
found in Cox and Reid (2004), Lindsay, Yi and Sun (2011), Varin, Reid and Firth
(2011), and Yi (2017). To confine our attention to the use of marginal pairwise
likelihoods, in line with the approach of RVH, here we consider the case with $w_{ij} = 0$ and $w_{ijk} = 1$, so that only the pairwise terms contribute.
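To make the pure pairwise case concrete, the sketch below evaluates a pairwise log likelihood for clustered normal data as in model (1.1), with every pair weight equal to one and no singleton terms. The function name and the simulated-data setup are illustrative, not part of the paper.

```python
import numpy as np

def pairwise_loglik(clusters, mu, sig_u2, sig_e2):
    """Log pairwise composite likelihood for the one-way random effects
    model: sum of bivariate normal log densities over all within-cluster
    pairs (j < k), with all pair weights equal to one."""
    # covariance of any within-cluster pair (y_ij, y_ik), j != k
    cov = np.array([[sig_u2 + sig_e2, sig_u2],
                    [sig_u2, sig_u2 + sig_e2]])
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    total = 0.0
    for y in clusters:
        n = len(y)
        for j in range(n):
            for k in range(j + 1, n):
                r = np.array([y[j] - mu, y[k] - mu])
                # bivariate normal log density of the pair
                total += -np.log(2 * np.pi) - 0.5 * logdet - 0.5 * r @ cov_inv @ r
    return total

rng = np.random.default_rng(1)
m, n, mu0, sig_u, sig_e = 20, 5, 3.0, 1.0, 0.5
clusters = [mu0 + rng.normal(0, sig_u) + rng.normal(0, sig_e, size=n)
            for _ in range(m)]
print(pairwise_loglik(clusters, mu0, sig_u**2, sig_e**2))
```

As expected, the pairwise log likelihood is larger near the data-generating parameter values than at distant ones.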
Returning to the special case of model (1.1), suppose
that $\sigma_e^2$ is known, and take $\theta$ to consist of $\mu$ and $\sigma_u^2$. In a Bayesian approach it is necessary to
choose a prior distribution for $\theta$. We will assume a prior distribution in which $\mu$ and $\sigma_u^2$ are independent, with a uniform distribution
with large support for $\mu$, and a distribution for $\sigma_u^2$ that is close to uniform on an interval
assumed to contain the support of the full likelihood function for $\sigma_u^2$ with high probability. Gelman (2006) presents
a thorough treatment of choosing a prior distribution for $\sigma_u$ in the random effects model (1.1). He
recommends using a uniform prior for $\sigma_u$ for moderate to large values of $m$, but a half-Cauchy prior for smaller values of $m$ (see, especially, Sections 3.2 and 5.2 of
Gelman, 2006). The half-Cauchy prior is supported on $(0, \infty)$ and is given by
$$p(\sigma_u) = \frac{2}{\pi A\{1 + (\sigma_u/A)^2\}}, \quad \sigma_u > 0,$$
where $A$ is a scale
hyperparameter.
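For concreteness, the half-Cauchy density with scale hyperparameter $A$ can be checked numerically; the value $A = 25$ below is only an illustrative choice.

```python
import numpy as np

def half_cauchy_pdf(sigma, A):
    """Half-Cauchy density: 2 / [pi * A * {1 + (sigma/A)^2}] for sigma > 0."""
    return 2.0 / (np.pi * A * (1.0 + (sigma / A) ** 2))

A = 25.0                                 # illustrative scale hyperparameter
grid = np.linspace(0.0, 1e5, 2_000_001)  # wide grid; the tail is heavy
f = half_cauchy_pdf(grid, A)
mass = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))  # trapezoid rule
print(mass)  # close to 1; the heavy tail holds the small remaining mass
```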
2.2 Unadjusted pairwise composite likelihood
Again, assume model (1.1) and, taking $\sigma_e^2$ to be known, let $\theta = (\mu, \sigma_u^2)^{\mathrm{T}}$ be the vector of model parameters. We are
interested in comparing the performance of the posterior distribution of $\theta$ based on the full likelihood with that based on the
pairwise likelihood, together with the adjusted pairwise posterior distribution
to be described below.
To start, consider a simple situation where $\sigma_u^2$ also is assumed to be known and only $\mu$ is unknown. Let $\pi(\mu)$ be a prior density for $\mu$. Then the posterior density of $\mu$ is
$$\pi_{\mathrm{FL}}(\mu \mid y) \propto \pi(\mu) L(\mu),$$
where the
subscript FL indicates that it is based on the full likelihood $L(\mu)$. In contrast, we
consider the pairwise likelihood $L_{\mathrm{PL}}(\mu)$, with the weights taken as above, and then define
$$\pi_{\mathrm{PL}}(\mu \mid y) \propto \pi(\mu) L_{\mathrm{PL}}(\mu)$$
to be the “pairwise”
posterior density of $\mu$. We wish to compare the variances of $\mu$ derived
from $\pi_{\mathrm{FL}}(\mu \mid y)$ and $\pi_{\mathrm{PL}}(\mu \mid y)$, shown in
the following theorem, of which the derivations are straightforward.
Theorem: Assume that $\pi(\mu)$ is a uniform prior and that the design is balanced, with $n_i = n$ for all $i$. Then
(a) $\pi_{\mathrm{FL}}(\mu \mid y)$ is a normal density with mean $\bar{y} = (mn)^{-1}\sum_{i}\sum_{j} y_{ij}$ and variance $V_{\mathrm{FL}} = (\sigma_e^2 + n\sigma_u^2)/(mn)$;
(b) $\pi_{\mathrm{PL}}(\mu \mid y)$ is a normal density with mean $\bar{y}$ and variance $V_{\mathrm{PL}} = (\sigma_e^2 + 2\sigma_u^2)/\{mn(n-1)\}$.

The theorem shows that when $n$ is greater than 2, the variance $V_{\mathrm{PL}}$ derived from
the “pairwise” posterior density is smaller than the variance $V_{\mathrm{FL}}$ of the posterior density $\pi_{\mathrm{FL}}(\mu \mid y)$; the two are equal when $n = 2$. This finding is intuitively reasonable,
because the pairwise likelihood is effectively taking all $n(n-1)/2$ pairs of observations within each cluster to
be independent. It motivates us to examine an adjusted version of $\pi_{\mathrm{PL}}(\mu \mid y)$, to be discussed in the sequel.
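This comparison can be checked numerically. The closed forms coded below are our reconstruction for the balanced case $n_i = n$ with $\sigma_u^2$, $\sigma_e^2$ known and a flat prior on $\mu$: each cluster contributes information $n/(\sigma_e^2 + n\sigma_u^2)$ to the full posterior, while each of its $n(n-1)/2$ pairs contributes $2/(\sigma_e^2 + 2\sigma_u^2)$ to the pairwise posterior.

```python
import numpy as np

def v_full(m, n, sig_u2, sig_e2):
    """Full-likelihood posterior variance of mu (flat prior, balanced design)."""
    return (sig_e2 + n * sig_u2) / (m * n)

def v_pair(m, n, sig_u2, sig_e2):
    """Unadjusted pairwise posterior variance of mu."""
    return (sig_e2 + 2 * sig_u2) / (m * n * (n - 1))

m, sig_u2, sig_e2 = 30, 1.0, 0.5
# the two variances coincide at n = 2 ...
print(v_full(m, 2, sig_u2, sig_e2), v_pair(m, 2, sig_u2, sig_e2))
# ... and the pairwise one is strictly smaller for every n > 2
for n in (3, 5, 10):
    assert v_pair(m, n, sig_u2, sig_e2) < v_full(m, n, sig_u2, sig_e2)
```

The understatement of the posterior variance grows with the cluster size $n$, which is the issue the adjustment of Section 2.3 addresses.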
For the case where $\sigma_u^2$ is also unknown, it can be shown that a
similar kind of adjustment is needed. Assuming independent uniform priors for $\mu$ and $\sigma_u^2$, it is straightforward to show that
$$\pi_{\mathrm{FL}}(\mu, \sigma_u^2 \mid y) \propto \prod_{i=1}^{m} |\Sigma_i|^{-1/2} \exp\Big\{-\frac{1}{2} (y_i - \mu 1_{n_i})^{\mathrm{T}} \Sigma_i^{-1} (y_i - \mu 1_{n_i})\Big\},$$
where
$$\Sigma_i = \sigma_u^2\, 1_{n_i} 1_{n_i}^{\mathrm{T}} + \sigma_e^2 I_{n_i}, \tag{2.7}$$
$1_{n_i}$ represents the $n_i \times 1$ unit
vector, and $I_{n_i}$ stands
for the $n_i \times n_i$ identity
matrix.

After some algebra the pairwise composite likelihood
posterior (PL) can be shown to be
$$\pi_{\mathrm{PL}}(\mu, \sigma_u^2 \mid y) \propto \prod_{i=1}^{m} \prod_{j<k} |\Sigma_{(2)}|^{-1/2} \exp\Big\{-\frac{1}{2} (y_{i(jk)} - \mu 1_2)^{\mathrm{T}} \Sigma_{(2)}^{-1} (y_{i(jk)} - \mu 1_2)\Big\},$$
where $y_{i(jk)} = (y_{ij}, y_{ik})^{\mathrm{T}}$. Note that $\Sigma_{(2)}$ is
defined in (2.7) with $n_i = 2$.
Assuming independent uniform priors for $\mu$ and $\sigma_u^2$, we consider the posterior density of $\sigma_u^2$ with $\mu$ integrated out. To assess the relative
precisions of Bayesian inference in the two cases, we must use approximations
because of the complexity of the two posterior densities. Specifically, we
compare the curvature of the log posterior and the log pairwise posterior
densities for $\sigma_u^2$ at their modes. The ratio of the latter to the
former can be shown, for large $m$, to be greater than one, implying that
using the unadjusted pairwise posterior density for $\sigma_u^2$ would
overestimate the precision of estimation of $\sigma_u^2$.
Thus, for both $\mu$ and $\sigma_u^2$ (or $\sigma_u$), basing an approximate log likelihood for
Bayesian inference directly on the pairwise composite likelihood would lead to
posterior intervals that are too narrow.
Note: In Section 3 the parameter vector is set to be $(\mu, \sigma_u)^{\mathrm{T}}$ (with the variance $\sigma_u^2$ replaced by the standard deviation $\sigma_u$), and a half-Cauchy prior distribution is used
for $\sigma_u$. However, the comparison of the full and
pairwise log posterior densities will remain similar under the appropriate
transformations.
2.3 Curvature adjustment for the log pairwise
likelihood
In this section we motivate the curvature adjustment for
the log pairwise likelihood from the standpoint of estimating function theory,
as presented, for example, by Jørgensen and Knudsen (2004).
First, we note that if $X$ has a $p$-variate normal distribution with mean vector $\xi$ and variance-covariance matrix $\Sigma$, the logarithm of the multivariate density of $X$ has the form
$$\log f(x) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \xi)^{\mathrm{T}} \Sigma^{-1} (x - \xi). \tag{2.9}$$
The
expression in (2.9), as a function of $x$, has its
maximum at $x = \xi$ and
curvature, or second derivative matrix (Hessian), at the maximum equal to $-\Sigma^{-1}$. Intuitively, this correspondence between the
curvature of the log density at the maximum and the inverse of the covariance
matrix can be expected to hold approximately for a multivariate density that is
close to being normal.
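This curvature fact can be verified by finite differences; the $2 \times 2$ covariance matrix below is an arbitrary illustrative choice.

```python
import numpy as np

def log_mvn(x, xi, cov_inv, logdet, p):
    """Log density of N_p(xi, Sigma), with Sigma^{-1} and log|Sigma| precomputed."""
    return -0.5 * (p * np.log(2 * np.pi) + logdet + (x - xi) @ cov_inv @ (x - xi))

xi = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)
p = 2

# central-difference Hessian of the log density at its maximum x = xi
h = 1e-4
hess = np.zeros((p, p))
for a in range(p):
    for b in range(p):
        ea, eb = np.eye(p)[a] * h, np.eye(p)[b] * h
        hess[a, b] = (log_mvn(xi + ea + eb, xi, cov_inv, logdet, p)
                      - log_mvn(xi + ea - eb, xi, cov_inv, logdet, p)
                      - log_mvn(xi - ea + eb, xi, cov_inv, logdet, p)
                      + log_mvn(xi - ea - eb, xi, cov_inv, logdet, p)) / (4 * h * h)

print(np.max(np.abs(hess + cov_inv)))  # near zero: Hessian equals -Sigma^{-1}
```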
Consider a model in which the distribution of the
observation variable $Y$ depends on a vector parameter $\theta$. Given an observation $y$, the log likelihood is denoted $\ell(\theta) = \log f(y; \theta)$, where $f(y; \theta)$ is the density of $Y$. Under regularity conditions (e.g., Lehmann,
1999, Chapter 7), the MLE $\hat\theta$ is found by solving the system
$$S(\theta) = 0, \tag{2.10}$$
where $S(\theta)$ denotes
the score function, the gradient of $\ell(\theta)$. The system (2.10) is an unbiased (vector)
estimating equation, and $\hat\theta$ is optimally efficient, having minimal asymptotic
variance-covariance matrix (in the sense of positive definite difference) among
solutions of unbiased estimating equation systems. In regular cases (e.g.,
Lehmann, 1999, Chapter 7) the score function satisfies the second
Bartlett identity (e.g., Lindsay, 1988):
$$\mathrm{Var}\{S(\theta)\} = -E\{\nabla S(\theta)\}, \tag{2.11}$$
where Var
denotes a variance-covariance matrix, and $\nabla$ represents a gradient. As well,
asymptotically, through a Taylor series approximation of $S(\hat\theta)$ about $\theta$, we have:
$$\mathrm{Var}(\hat\theta) \approx \big[E\{-\nabla S(\theta)\}\big]^{-1}. \tag{2.12}$$
Thus,
standard (frequentist) likelihood inference estimates the variance-covariance
of $\hat\theta$ as the
inverse of the observed Fisher information matrix
$$I(\hat\theta) = -\nabla S(\hat\theta), \tag{2.13}$$
which is the
negative of the Hessian (curvature matrix) of the log likelihood function at
its maximum.
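As a standalone check of (2.13) (a standard textbook example, not taken from the paper): for an i.i.d. Exponential sample with rate $\theta$, $\ell(\theta) = n\log\theta - \theta\sum_i y_i$, the MLE is $\hat\theta = n/\bar{y}$, and the observed information is $n/\hat\theta^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.exponential(scale=1 / 2.5, size=200)    # true rate 2.5
n = len(y)

loglik = lambda t: n * np.log(t) - t * y.sum()  # Exponential(rate t) log likelihood
theta_hat = n / y.sum()                         # MLE of the rate

# observed Fisher information: negative second derivative of loglik at the MLE
h = 1e-4
obs_info = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2

print(obs_info, n / theta_hat**2)               # the two agree closely
```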
In Bayesian inference, if $\pi(\theta)$ is a prior density for $\theta$, the logarithm of the posterior density for $\theta$ is
$$\log \pi(\theta \mid y) = \ell(\theta) + \log \pi(\theta) - \log c(y), \tag{2.14}$$
where $c(y)$ is the normalizing constant.
If the prior
density is flat in areas of appreciable likelihood, the posterior density of $\theta$, which
quantifies the inference about $\theta$, approximates a density with mode at $\hat\theta$ and the
curvature of its logarithm equal to the negative of the Fisher information, making
the posterior variance-covariance of $\theta$ approximately equal to the inverse of $I(\hat\theta)$ in (2.13).
Thus the Bayesian estimation of $\theta$ is
efficient in the frequentist sense; alternatively, the frequentist inference is
close to the Bayesian inference.
Suppose that, in the frequentist context, the score
function is replaced by another estimating function $g(\theta)$ that is unbiased in the sense of having
expectation 0. See, for example, Lindsay, Yi and Sun (2011). Then the estimator $\tilde\theta$ solving $g(\theta) = 0$ is no longer optimally efficient. However, it
is consistent, and its variance can be estimated by the delta method, or linearization
of the function $g(\theta)$. We might wish to think of treating $g(\theta)$ as a stand-in for a score vector, or as the gradient
with respect to $\theta$ of a substitute for the log likelihood
function. In particular, composite likelihood equations might be thought of as
stand-ins for score estimating equations.
A question is then whether a substitute for the log
likelihood function having gradient $g(\theta)$ could play the role of the log likelihood in
Bayesian inference, and lead to an approximately correct posterior when
substituted into (2.14), and if not, whether there are principled ways in which
we could correct it.
Thus, suppose we have an alternative to the score
function, namely an estimating function $g(\theta)$ that is unbiased for $\theta$ in the sense of having
$$E\{g(\theta)\} = 0.$$
Suppose the
solution $\tilde\theta$ of the
equation $g(\theta) = 0$ maximizes a function $a(\theta)$, which we
would like to think of as an alternative to the log likelihood function; for
example, $a(\theta)$ could be
a log pairwise composite likelihood function, and $g(\theta) = \nabla a(\theta)$. Then $a(\theta)$ would be
approximately equal, up to an additive constant, to what the log posterior density would be if the prior
were non-informative, and if we took $a(\theta)$ to be a
stand-in for the log likelihood function. The stand-in posterior
variance-covariance of $\theta$ would be
approximately the inverse of the negative of the curvature matrix of $a(\theta)$ at $\tilde\theta$. By
estimating function theory (e.g., Heyde, 1997), using the same kind of Taylor
series approximation as in (2.12), the frequentist variance-covariance of $\tilde\theta$ satisfies
$$\mathrm{Var}(\tilde\theta) \approx \big[E\{-\nabla g(\theta)\}\big]^{-1} \mathrm{Var}\{g(\theta)\} \big[E\{-\nabla g(\theta)\}^{\mathrm{T}}\big]^{-1}. \tag{2.15}$$
If $a(\theta)$ were the
log pairwise composite likelihood function, we would have, in the notation of
RCD,
$$\mathrm{Var}(\tilde\theta) \approx H(\theta_0)^{-1} J(\theta_0) H(\theta_0)^{-1}, \tag{2.16}$$
where $\theta_0$ is the
true value of $\theta$, $H(\theta_0)$ is minus
the expectation of $\nabla g(\theta_0)$, and $J(\theta_0)$ is equal
to the variance-covariance matrix of $g(\theta_0)$, the
gradient of $a(\theta)$ at $\theta_0$.
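A Monte Carlo illustration of the sandwich form, writing $H$ for minus the expected gradient of the estimating function and $J$ for its variance (RCD's notation). The estimating function $g(\mu) = \sum_i (y_i - \mu)$ below is unbiased but, when $\sigma^2 \neq 1$, not information-unbiased: $H = n$ while $J = n\sigma^2$, so the curvature-only variance $1/n$ understates the truth while the sandwich $H^{-1} J H^{-1} = \sigma^2/n$ is correct. The setup is illustrative, not from the paper.

```python
import numpy as np

# g(mu) = sum_i (y_i - mu) for y_i iid N(mu0, sigma^2):
# H = -E{g'(mu)} = n, J = Var{g(mu)} = n * sigma^2, so H != J unless sigma = 1.
rng = np.random.default_rng(11)
mu0, sigma, n, reps = 0.0, 3.0, 50, 20_000

est = np.empty(reps)
for r in range(reps):
    y = rng.normal(mu0, sigma, size=n)
    est[r] = y.mean()                 # solution of g(mu) = 0

H, J = n, n * sigma**2
sandwich = J / H**2                   # H^{-1} J H^{-1} = sigma^2 / n
naive = 1 / H                         # curvature-only variance, 1 / n
print(est.var(), sandwich, naive)     # empirical variance matches the sandwich
```

The empirical variance of the estimator tracks the sandwich value, while the curvature-only value is far too small, mirroring the over-precision of the unadjusted pairwise posterior.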
If $g(\theta)$ had the property (analogous to (2.11)) that
$$\mathrm{Var}\{g(\theta)\} = -E\{\nabla g(\theta)\}, \tag{2.17}$$
so that $J(\theta) = H(\theta)$, then the
right-hand side of (2.15) or of (2.16) would be approximately the same as the
stand-in posterior variance-covariance of $\theta$.
The property (2.17) is called information unbiasedness
of an estimating function (Lindsay, 1982). Given a $g(\theta)$ that does not satisfy (2.17), then to produce
an adjusted objective function whose gradient approximately satisfies (2.17), we could set
$$a^{*}(\theta) = a\{\tilde\theta + C(\theta - \tilde\theta)\}, \tag{2.18}$$
for a
constant matrix $C$, so that
the gradient of $a^{*}(\theta)$ is $C^{\mathrm{T}}$ times
the gradient of $a$ evaluated at $\tilde\theta + C(\theta - \tilde\theta)$, while
the point estimate of $\theta$ that
maximizes $a^{*}(\theta)$, namely $\tilde\theta$, and its
approximate variance-covariance, are unchanged.

We want the negative curvature of $a^{*}(\theta)$ at $\tilde\theta$ to equal $H J^{-1} H$, the inverse of the sandwich variance in (2.16), and it can be shown that this is equivalent to
$$C^{\mathrm{T}} H C = H J^{-1} H, \tag{2.19}$$
which is a
curvature adjustment like the one in RCD, who suggest taking the solution
of (2.19) that sets $C = M^{-1} M_A$, where $M^{\mathrm{T}} M = H$ and $M_A^{\mathrm{T}} M_A = H J^{-1} H$.
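At the matrix level, RCD's choice can be sketched as follows: take square roots $M$ and $M_A$ with $M^{\mathrm{T}} M = H$ and $M_A^{\mathrm{T}} M_A = H J^{-1} H$ (e.g., via Cholesky factors), set $C = M^{-1} M_A$, and verify that $C^{\mathrm{T}} H C$ recovers $H J^{-1} H$. The $H$ and $J$ below are arbitrary illustrative positive definite matrices; the adjusted objective would then be $a^{*}(\theta) = a\{\tilde\theta + C(\theta - \tilde\theta)\}$.

```python
import numpy as np

def curvature_adjustment(H, J):
    """Return C solving C^T H C = H J^{-1} H via C = M^{-1} M_A,
    where M^T M = H and M_A^T M_A = H J^{-1} H (Cholesky-type roots)."""
    target = H @ np.linalg.inv(J) @ H
    M = np.linalg.cholesky(H).T         # upper triangular, M^T M = H
    M_A = np.linalg.cholesky(target).T  # M_A^T M_A = H J^{-1} H
    return np.linalg.solve(M, M_A)      # C = M^{-1} M_A

# illustrative sensitivity and variability matrices (both positive definite)
H = np.array([[4.0, 1.0], [1.0, 3.0]])
J = np.array([[6.0, 1.5], [1.5, 5.0]])

C = curvature_adjustment(H, J)
target = H @ np.linalg.inv(J) @ H
print(np.max(np.abs(C.T @ H @ C - target)))  # near zero: (2.19) holds
```

Any matrix of the form $O M^{-1} M_A$ with $O$ orthogonal also solves (2.19); the triangular choice above is simply the convenient one.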