Bayesian inference for a variance component model using pairwise composite likelihood with survey data
Section 2. Full likelihood, pairwise likelihood and Bayesian implementation
2.1 Model and formulae
As in Section 1, let $y_{ij}$ denote the response variable for second-stage
unit $j$ in first-stage unit $i$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. We use lower case letters to represent realized values of random variables. Let $y$ denote the sample data, with $y_i = (y_{i1}, \ldots, y_{i n_i})^{\mathrm{T}}$ for $i = 1, \ldots, m$, where T denotes transpose.
In a more general random effects model, we might assume
that, conditional on random effects $v_i$, the $y_{ij}$ are independently distributed as
$$y_{ij} \mid v_i \sim f(y_{ij} \mid v_i; \beta), \quad j = 1, \ldots, n_i,$$
for $i = 1, \ldots, m$, where $f$ is a
known density function and $\beta$ is the
associated parameter vector. Next, we model the random effects by assuming that the $v_i$ are
independent and identically distributed as
$$v_i \sim g(v_i; \gamma), \quad i = 1, \ldots, m,$$
where $g$ is a
given density function indexed by the parameter vector $\gamma$.
Let $\theta = (\beta^{\mathrm{T}}, \gamma^{\mathrm{T}})^{\mathrm{T}}$ be the vector of model parameters which is of
interest. In the frequentist framework, the maximum likelihood method is
commonly used to conduct inference about $\theta$ by maximizing the likelihood function
$$L(\theta) = \prod_{i=1}^{m} L_i(\theta), \quad \text{where} \quad L_i(\theta) = \int \Big\{\prod_{j=1}^{n_i} f(y_{ij} \mid v_i; \beta)\Big\}\, g(v_i; \gamma)\, dv_i.$$
An alternative to the likelihood method is the composite
likelihood approach (Lindsay, 1988). In particular, the pairwise likelihood
method has often been employed. Let $f_1(y_{ij}; \theta)$ be the marginal density of $y_{ij}$ determined by the model above.
For $j < k$, let $f_2(y_{ij}, y_{ik}; \theta)$ be the
joint density for the paired responses $(y_{ij}, y_{ik})$ determined by the model. Then a
marginal pairwise likelihood function can be formulated as
$$L_{\mathrm{PL}}(\theta) = \prod_{i=1}^{m} \Big[\Big\{\prod_{j=1}^{n_i} f_1(y_{ij}; \theta)^{w_{ij}}\Big\} \Big\{\prod_{j<k} f_2(y_{ij}, y_{ik}; \theta)^{w_{ijk}}\Big\}\Big],$$
where $w_{ij}$ and $w_{ijk}$ are
weights that can be user-specified to enhance efficiency or to facilitate some
specific features of the formulation. Discussion on choosing weights can be
found in Cox and Reid (2004), Lindsay, Yi and Sun (2011), Varin, Reid and Firth
(2011), and Yi (2017). To confine our attention to the use of marginal pairwise
likelihoods, in line with the approach of RVH, here we consider the case with $w_{ij} = 0$ and $w_{ijk} = 1$, so that only the pairwise terms contribute.
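To make the pure pairwise case concrete, the sketch below evaluates a pairwise log likelihood for clustered normal data as in model (1.1), with every pair weight equal to one and no singleton terms. The function name and the simulated-data setup are illustrative, not part of the paper.

```python
import numpy as np

def pairwise_loglik(clusters, mu, sig_u2, sig_e2):
    """Log pairwise composite likelihood for the one-way random effects
    model: sum of bivariate normal log densities over all within-cluster
    pairs (j < k), with all pair weights equal to one."""
    # covariance of any within-cluster pair (y_ij, y_ik), j != k
    cov = np.array([[sig_u2 + sig_e2, sig_u2],
                    [sig_u2, sig_u2 + sig_e2]])
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    total = 0.0
    for y in clusters:
        n = len(y)
        for j in range(n):
            for k in range(j + 1, n):
                r = np.array([y[j] - mu, y[k] - mu])
                # bivariate normal log density of the pair
                total += -np.log(2 * np.pi) - 0.5 * logdet - 0.5 * r @ cov_inv @ r
    return total

rng = np.random.default_rng(1)
m, n, mu0, sig_u, sig_e = 20, 5, 3.0, 1.0, 0.5
clusters = [mu0 + rng.normal(0, sig_u) + rng.normal(0, sig_e, size=n)
            for _ in range(m)]
print(pairwise_loglik(clusters, mu0, sig_u**2, sig_e**2))
```

As expected, the pairwise log likelihood is larger near the data-generating parameter values than at distant ones.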
Returning to the special case of model (1.1), suppose
that $\sigma_e^2$ is known, and take $\theta$ to consist of $\mu$ and $\sigma_u^2$. In a Bayesian approach it is necessary to
choose a prior distribution for $\theta$. We will assume a prior distribution in which $\mu$ and $\sigma_u^2$ are independent, with a uniform distribution
with large support for $\mu$, and a distribution for $\sigma_u^2$ that is close to uniform on an interval
assumed to contain the support of the full likelihood function for $\sigma_u^2$ with high probability. Gelman (2006) presents
a thorough treatment of choosing a prior distribution for $\sigma_u$ in the random effects model (1.1). He
recommends using a uniform prior for $\sigma_u$ for moderate to large values of $m$, but a half-Cauchy prior for smaller values of $m$ (see, especially, Sections 3.2 and 5.2 of
Gelman, 2006). The half-Cauchy prior is supported on $(0, \infty)$ and is given by
$$p(\sigma_u) = \frac{2}{\pi A\{1 + (\sigma_u/A)^2\}}, \quad \sigma_u > 0,$$
where $A$ is a scale
hyperparameter.
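For concreteness, the half-Cauchy density with scale hyperparameter $A$ can be checked numerically; the value $A = 25$ below is only an illustrative choice.

```python
import numpy as np

def half_cauchy_pdf(sigma, A):
    """Half-Cauchy density: 2 / [pi * A * {1 + (sigma/A)^2}] for sigma > 0."""
    return 2.0 / (np.pi * A * (1.0 + (sigma / A) ** 2))

A = 25.0                                 # illustrative scale hyperparameter
grid = np.linspace(0.0, 1e5, 2_000_001)  # wide grid; the tail is heavy
f = half_cauchy_pdf(grid, A)
mass = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))  # trapezoid rule
print(mass)  # close to 1; the heavy tail holds the small remaining mass
```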
2.2 Unadjusted pairwise composite likelihood
Again, assume model (1.1) and, taking $\sigma_e^2$ to be known, let $\theta = (\mu, \sigma_u^2)^{\mathrm{T}}$ be the vector of model parameters. We are
interested in comparing the performance of the posterior distribution of $\theta$ based on the full likelihood with that based on the
pairwise likelihood, together with the adjusted pairwise posterior distribution
to be described below.
To start, consider a simple situation where $\sigma_u^2$ also is assumed to be known and only $\mu$ is unknown. Let $\pi(\mu)$ be a prior density for $\mu$. Then the posterior density of $\mu$ is
$$\pi_{\mathrm{FL}}(\mu \mid y) \propto \pi(\mu) L(\mu),$$
where the
subscript FL indicates that it is based on the full likelihood $L(\mu)$. In contrast, we
consider the pairwise likelihood $L_{\mathrm{PL}}(\mu)$, with the weights taken as above, and then define
$$\pi_{\mathrm{PL}}(\mu \mid y) \propto \pi(\mu) L_{\mathrm{PL}}(\mu)$$
to be the “pairwise”
posterior density of $\mu$. We wish to compare the variances of $\mu$ derived
from $\pi_{\mathrm{FL}}(\mu \mid y)$ and $\pi_{\mathrm{PL}}(\mu \mid y)$, shown in
the following theorem, of which the derivations are straightforward.
Theorem: Assume that $\pi(\mu)$ is a uniform prior and that the design is balanced, with $n_i = n$ for all $i$. Then
(a) $\pi_{\mathrm{FL}}(\mu \mid y)$ is a normal density with mean $\bar{y} = (mn)^{-1}\sum_{i}\sum_{j} y_{ij}$ and variance $V_{\mathrm{FL}} = (\sigma_e^2 + n\sigma_u^2)/(mn)$;
(b) $\pi_{\mathrm{PL}}(\mu \mid y)$ is a normal density with mean $\bar{y}$ and variance $V_{\mathrm{PL}} = (\sigma_e^2 + 2\sigma_u^2)/\{mn(n-1)\}$.

The theorem shows that when $n$ is greater than 2, the variance $V_{\mathrm{PL}}$ derived from
the “pairwise” posterior density is smaller than the variance $V_{\mathrm{FL}}$ of the posterior density $\pi_{\mathrm{FL}}(\mu \mid y)$; the two are equal when $n = 2$. This finding is intuitively reasonable,
because the pairwise likelihood is effectively taking all $n(n-1)/2$ pairs of observations within each cluster to
be independent. It motivates us to examine an adjusted version of $\pi_{\mathrm{PL}}(\mu \mid y)$, to be discussed in the sequel.
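This comparison can be checked numerically. The closed forms coded below are our reconstruction for the balanced case $n_i = n$ with $\sigma_u^2$, $\sigma_e^2$ known and a flat prior on $\mu$: each cluster contributes information $n/(\sigma_e^2 + n\sigma_u^2)$ to the full posterior, while each of its $n(n-1)/2$ pairs contributes $2/(\sigma_e^2 + 2\sigma_u^2)$ to the pairwise posterior.

```python
import numpy as np

def v_full(m, n, sig_u2, sig_e2):
    """Full-likelihood posterior variance of mu (flat prior, balanced design)."""
    return (sig_e2 + n * sig_u2) / (m * n)

def v_pair(m, n, sig_u2, sig_e2):
    """Unadjusted pairwise posterior variance of mu."""
    return (sig_e2 + 2 * sig_u2) / (m * n * (n - 1))

m, sig_u2, sig_e2 = 30, 1.0, 0.5
# the two variances coincide at n = 2 ...
print(v_full(m, 2, sig_u2, sig_e2), v_pair(m, 2, sig_u2, sig_e2))
# ... and the pairwise one is strictly smaller for every n > 2
for n in (3, 5, 10):
    assert v_pair(m, n, sig_u2, sig_e2) < v_full(m, n, sig_u2, sig_e2)
```

The understatement of the posterior variance grows with the cluster size $n$, which is the issue the adjustment of Section 2.3 addresses.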
For the case where $\sigma_u^2$ is also unknown, it can be shown that a
similar kind of adjustment is needed. Assuming independent uniform priors for $\mu$ and $\sigma_u^2$, it is straightforward to show that
$$\pi_{\mathrm{FL}}(\mu, \sigma_u^2 \mid y) \propto \prod_{i=1}^{m} |\Sigma_i|^{-1/2} \exp\Big\{-\frac{1}{2} (y_i - \mu 1_{n_i})^{\mathrm{T}} \Sigma_i^{-1} (y_i - \mu 1_{n_i})\Big\},$$
where
$$\Sigma_i = \sigma_u^2\, 1_{n_i} 1_{n_i}^{\mathrm{T}} + \sigma_e^2 I_{n_i}, \tag{2.7}$$
$1_{n_i}$ represents the $n_i \times 1$ unit
vector, and $I_{n_i}$ stands
for the $n_i \times n_i$ identity
matrix.

After some algebra the pairwise composite likelihood
posterior (PL) can be shown to be
$$\pi_{\mathrm{PL}}(\mu, \sigma_u^2 \mid y) \propto \prod_{i=1}^{m} \prod_{j<k} |\Sigma_{(2)}|^{-1/2} \exp\Big\{-\frac{1}{2} (y_{i(jk)} - \mu 1_2)^{\mathrm{T}} \Sigma_{(2)}^{-1} (y_{i(jk)} - \mu 1_2)\Big\},$$
where $y_{i(jk)} = (y_{ij}, y_{ik})^{\mathrm{T}}$. Note that $\Sigma_{(2)}$ is
defined in (2.7) with $n_i = 2$.
Assuming independent uniform priors for $\mu$ and $\sigma_u^2$, we consider the posterior density of $\sigma_u^2$ with $\mu$ integrated out. To assess the relative
precisions of Bayesian inference in the two cases, we must use approximations
because of the complexity of the two posterior densities. Specifically, we
compare the curvature of the log posterior and the log pairwise posterior
densities for $\sigma_u^2$ at their modes. The ratio of the latter to the
former can be shown, for large $m$, to be greater than one, implying that
using the unadjusted pairwise posterior density for $\sigma_u^2$ would
overestimate the precision of estimation of $\sigma_u^2$.
Thus, for both $\mu$ and $\sigma_u^2$ (or $\sigma_u$), basing an approximate log likelihood for
Bayesian inference directly on the pairwise composite likelihood would lead to
posterior intervals that are too narrow.
Note: In Section 3 the parameter vector is set to be $(\mu, \sigma_u)^{\mathrm{T}}$ (with the variance $\sigma_u^2$ replaced by the standard deviation $\sigma_u$), and a half-Cauchy prior distribution is used
for $\sigma_u$. However, the comparison of the full and
pairwise log posterior densities will remain similar under the appropriate
transformations.
2.3 Curvature adjustment for the log pairwise
likelihood
In this section we motivate the curvature adjustment for
the log pairwise likelihood from the standpoint of estimating function theory,
as presented, for example, by Jørgensen and Knudsen (2004).
First, we note that if $X$ has a $p$-variate normal distribution with mean vector $\xi$ and variance-covariance matrix $\Sigma$, the logarithm of the multivariate density of $X$ has the form
$$\log f(x) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \xi)^{\mathrm{T}} \Sigma^{-1} (x - \xi). \tag{2.9}$$
The
expression in (2.9), as a function of $x$, has its
maximum at $x = \xi$ and
curvature, or second derivative matrix (Hessian), at the maximum equal to $-\Sigma^{-1}$. Intuitively, this correspondence between the
curvature of the log density at the maximum and the inverse of the covariance
matrix can be expected to hold approximately for a multivariate density that is
close to being normal.
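This curvature fact can be verified by finite differences; the $2 \times 2$ covariance matrix below is an arbitrary illustrative choice.

```python
import numpy as np

def log_mvn(x, xi, cov_inv, logdet, p):
    """Log density of N_p(xi, Sigma), with Sigma^{-1} and log|Sigma| precomputed."""
    return -0.5 * (p * np.log(2 * np.pi) + logdet + (x - xi) @ cov_inv @ (x - xi))

xi = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
cov_inv = np.linalg.inv(cov)
_, logdet = np.linalg.slogdet(cov)
p = 2

# central-difference Hessian of the log density at its maximum x = xi
h = 1e-4
hess = np.zeros((p, p))
for a in range(p):
    for b in range(p):
        ea, eb = np.eye(p)[a] * h, np.eye(p)[b] * h
        hess[a, b] = (log_mvn(xi + ea + eb, xi, cov_inv, logdet, p)
                      - log_mvn(xi + ea - eb, xi, cov_inv, logdet, p)
                      - log_mvn(xi - ea + eb, xi, cov_inv, logdet, p)
                      + log_mvn(xi - ea - eb, xi, cov_inv, logdet, p)) / (4 * h * h)

print(np.max(np.abs(hess + cov_inv)))  # near zero: Hessian equals -Sigma^{-1}
```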
Consider a model in which the distribution of the
observation variable $Y$ depends on a vector parameter $\theta$. Given an observation $y$, the log likelihood is denoted $\ell(\theta) = \log f(y; \theta)$, where $f(y; \theta)$ is the density of $Y$. Under regularity conditions (e.g., Lehmann,
1999, Chapter 7), the MLE $\hat\theta$ is found by solving the system
$$S(\theta) = 0, \tag{2.10}$$
where $S(\theta)$ denotes
the score function, the gradient of $\ell(\theta)$. The system (2.10) is an unbiased (vector)
estimating equation, and $\hat\theta$ is optimally efficient, having minimal asymptotic
variance-covariance matrix (in the sense of positive definite difference) among
solutions of unbiased estimating equation systems. In regular cases (e.g.,
Lehmann, 1999, Chapter 7) the score function satisfies the second
Bartlett identity (e.g., Lindsay, 1988):
$$\mathrm{Var}\{S(\theta)\} = -E\{\nabla S(\theta)\}, \tag{2.11}$$
where Var
denotes a variance-covariance matrix, and $\nabla$ represents a gradient. As well,
asymptotically, through a Taylor series approximation of $S(\hat\theta)$ about $\theta$, we have:
$$\mathrm{Var}(\hat\theta) \approx \big[E\{-\nabla S(\theta)\}\big]^{-1}. \tag{2.12}$$
Thus,
standard (frequentist) likelihood inference estimates the variance-covariance
of $\hat\theta$ as the
inverse of the observed Fisher information matrix
$$I(\hat\theta) = -\nabla S(\hat\theta), \tag{2.13}$$
which is the
negative of the Hessian (curvature matrix) of the log likelihood function at
its maximum.
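As a standalone check of (2.13) (a standard textbook example, not taken from the paper): for an i.i.d. Exponential sample with rate $\theta$, $\ell(\theta) = n\log\theta - \theta\sum_i y_i$, the MLE is $\hat\theta = n/\bar{y}$, and the observed information is $n/\hat\theta^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.exponential(scale=1 / 2.5, size=200)    # true rate 2.5
n = len(y)

loglik = lambda t: n * np.log(t) - t * y.sum()  # Exponential(rate t) log likelihood
theta_hat = n / y.sum()                         # MLE of the rate

# observed Fisher information: negative second derivative of loglik at the MLE
h = 1e-4
obs_info = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2

print(obs_info, n / theta_hat**2)               # the two agree closely
```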
In Bayesian inference, if $\pi(\theta)$ is a prior density for $\theta$, the logarithm of the posterior density for $\theta$ is
$$\log \pi(\theta \mid y) = \ell(\theta) + \log \pi(\theta) - \log c(y), \tag{2.14}$$
where $c(y)$ is the normalizing constant.
If the prior
density is flat in areas of appreciable likelihood, the posterior density of $\theta$, which
quantifies the inference about $\theta$, approximates a density with mode at $\hat\theta$ and the
curvature of its logarithm equal to the negative of the Fisher information, making
the posterior variance-covariance of $\theta$ approximately equal to the inverse of $I(\hat\theta)$ in (2.13).
Thus the Bayesian estimation of $\theta$ is
efficient in the frequentist sense; alternatively, the frequentist inference is
close to the Bayesian inference.
Suppose that, in the frequentist context, the score
function is replaced by another estimating function $g(\theta)$ that is unbiased in the sense of having
expectation 0. See, for example, Lindsay, Yi and Sun (2011). Then the estimator $\tilde\theta$ solving $g(\theta) = 0$ is no longer optimally efficient. However, it
is consistent, and its variance can be estimated by the delta method, or linearization
of the function $g(\theta)$. We might wish to think of treating $g(\theta)$ as a stand-in for a score vector, or as the gradient
with respect to $\theta$ of a substitute for the log likelihood
function. In particular, composite likelihood equations might be thought of as
stand-ins for score estimating equations.
A question is then whether a substitute for the log
likelihood function having gradient $g(\theta)$ could play the role of the log likelihood in
Bayesian inference, and lead to an approximately correct posterior when
substituted into (2.14), and if not, whether there are principled ways in which
we could correct it.
Thus, suppose we have an alternative to the score
function, namely an estimating function $g(\theta)$ that is unbiased for $\theta$ in the sense of having
$$E\{g(\theta)\} = 0.$$
Suppose the
solution $\tilde\theta$ of the
equation $g(\theta) = 0$ maximizes a function $a(\theta)$, which we
would like to think of as an alternative to the log likelihood function; for
example, $a(\theta)$ could be
a log pairwise composite likelihood function, and $g(\theta) = \nabla a(\theta)$. Then $a(\theta)$ would be
approximately equal, up to an additive constant, to what the log posterior density would be if the prior
were non-informative, and if we took $a(\theta)$ to be a
stand-in for the log likelihood function. The stand-in posterior
variance-covariance of $\theta$ would be
approximately the inverse of the negative of the curvature matrix of $a(\theta)$ at $\tilde\theta$. By
estimating function theory (e.g., Heyde, 1997), using the same kind of Taylor
series approximation as in (2.12), the frequentist variance-covariance of $\tilde\theta$ satisfies
$$\mathrm{Var}(\tilde\theta) \approx \big[E\{-\nabla g(\theta)\}\big]^{-1} \mathrm{Var}\{g(\theta)\} \big[E\{-\nabla g(\theta)\}^{\mathrm{T}}\big]^{-1}. \tag{2.15}$$
If $a(\theta)$ were the
log pairwise composite likelihood function, we would have, in the notation of
RCD,
$$\mathrm{Var}(\tilde\theta) \approx H(\theta_0)^{-1} J(\theta_0) H(\theta_0)^{-1}, \tag{2.16}$$
where $\theta_0$ is the
true value of $\theta$, $H(\theta_0)$ is minus
the expectation of $\nabla g(\theta_0)$, and $J(\theta_0)$ is equal
to the variance-covariance matrix of $g(\theta_0)$, the
gradient of $a(\theta)$ at $\theta_0$.
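A Monte Carlo illustration of the sandwich form, writing $H$ for minus the expected gradient of the estimating function and $J$ for its variance (RCD's notation). The estimating function $g(\mu) = \sum_i (y_i - \mu)$ below is unbiased but, when $\sigma^2 \neq 1$, not information-unbiased: $H = n$ while $J = n\sigma^2$, so the curvature-only variance $1/n$ understates the truth while the sandwich $H^{-1} J H^{-1} = \sigma^2/n$ is correct. The setup is illustrative, not from the paper.

```python
import numpy as np

# g(mu) = sum_i (y_i - mu) for y_i iid N(mu0, sigma^2):
# H = -E{g'(mu)} = n, J = Var{g(mu)} = n * sigma^2, so H != J unless sigma = 1.
rng = np.random.default_rng(11)
mu0, sigma, n, reps = 0.0, 3.0, 50, 20_000

est = np.empty(reps)
for r in range(reps):
    y = rng.normal(mu0, sigma, size=n)
    est[r] = y.mean()                 # solution of g(mu) = 0

H, J = n, n * sigma**2
sandwich = J / H**2                   # H^{-1} J H^{-1} = sigma^2 / n
naive = 1 / H                         # curvature-only variance, 1 / n
print(est.var(), sandwich, naive)     # empirical variance matches the sandwich
```

The empirical variance of the estimator tracks the sandwich value, while the curvature-only value is far too small, mirroring the over-precision of the unadjusted pairwise posterior.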
If $g(\theta)$ had the property (analogous to (2.11)) that
$$\mathrm{Var}\{g(\theta)\} = -E\{\nabla g(\theta)\}, \tag{2.17}$$
so that $J(\theta) = H(\theta)$, then the
right-hand side of (2.15) or of (2.16) would be approximately the same as the
stand-in posterior variance-covariance of $\theta$.
The property (2.17) is called information unbiasedness
of an estimating function (Lindsay, 1982). Given a $g(\theta)$ that does not satisfy (2.17), then to produce
an adjusted objective function whose gradient approximately satisfies (2.17), we could set
$$a^{*}(\theta) = a\{\tilde\theta + C(\theta - \tilde\theta)\}, \tag{2.18}$$
for a
constant matrix $C$, so that
the gradient of $a^{*}(\theta)$ is $C^{\mathrm{T}}$ times
the gradient of $a$ evaluated at $\tilde\theta + C(\theta - \tilde\theta)$, while
the point estimate of $\theta$ that
maximizes $a^{*}(\theta)$, namely $\tilde\theta$, and its
approximate variance-covariance, are unchanged.

We want the negative curvature of $a^{*}(\theta)$ at $\tilde\theta$ to equal $H J^{-1} H$, the inverse of the sandwich variance in (2.16), and it can be shown that this is equivalent to
$$C^{\mathrm{T}} H C = H J^{-1} H, \tag{2.19}$$
which is a
curvature adjustment like the one in RCD, who suggest taking the solution
of (2.19) that sets $C = M^{-1} M_A$, where $M^{\mathrm{T}} M = H$ and $M_A^{\mathrm{T}} M_A = H J^{-1} H$.
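At the matrix level, RCD's choice can be sketched as follows: take square roots $M$ and $M_A$ with $M^{\mathrm{T}} M = H$ and $M_A^{\mathrm{T}} M_A = H J^{-1} H$ (e.g., via Cholesky factors), set $C = M^{-1} M_A$, and verify that $C^{\mathrm{T}} H C$ recovers $H J^{-1} H$. The $H$ and $J$ below are arbitrary illustrative positive definite matrices; the adjusted objective would then be $a^{*}(\theta) = a\{\tilde\theta + C(\theta - \tilde\theta)\}$.

```python
import numpy as np

def curvature_adjustment(H, J):
    """Return C solving C^T H C = H J^{-1} H via C = M^{-1} M_A,
    where M^T M = H and M_A^T M_A = H J^{-1} H (Cholesky-type roots)."""
    target = H @ np.linalg.inv(J) @ H
    M = np.linalg.cholesky(H).T         # upper triangular, M^T M = H
    M_A = np.linalg.cholesky(target).T  # M_A^T M_A = H J^{-1} H
    return np.linalg.solve(M, M_A)      # C = M^{-1} M_A

# illustrative sensitivity and variability matrices (both positive definite)
H = np.array([[4.0, 1.0], [1.0, 3.0]])
J = np.array([[6.0, 1.5], [1.5, 5.0]])

C = curvature_adjustment(H, J)
target = H @ np.linalg.inv(J) @ H
print(np.max(np.abs(C.T @ H @ C - target)))  # near zero: (2.19) holds
```

Any matrix of the form $O M^{-1} M_A$ with $O$ orthogonal also solves (2.19); the triangular choice above is simply the convenient one.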