3 Pseudo-likelihood-based selection with BIC

Chen Xu, Jiahua Chen and Harold Mantel

3.1 BIC in surveys

With the model settings described in Section 2, it is clear that if the measurement $(y_{i}, x_{i})$ is observed for every unit in the population $D,$ the randomness that the probability sampling design introduces into the data disappears entirely. In this situation, the selection of influential variables is based on the entire population, and the classical selection criteria developed in non-survey (purely model-based) settings remain valid for model-design-based inference. In particular, let $s \subseteq \{1, \dots, p\}$ be an arbitrary set of $τ (s)$ covariates, which corresponds to a candidate model in the form of (2.1). The "census-based" BIC (Schwarz 1978) selects the model (covariates) that minimizes

${BIC}_{N} (s) = - 2 l_{N} ({\overset{⌣}{β}}_{s}) + τ (s) \log N, \qquad (3.1)$

where $l_{N} (β) = \sum_{i = 1}^{N} \log f (y_{i}; x_{i} β)$ is the census log-likelihood function and ${\overset{⌣}{β}}_{s}$ is the maximizer of $l_{N} (β)$ based on $s .$ It can be seen that the BIC (3.1) is a decreasing function of the maximized log-likelihood and an increasing function of the number of variables included in the model. Hence, a lower BIC implies either a simpler model (fewer explanatory variables), a better fit (higher maximized likelihood), or both. A model with balanced complexity and goodness of fit is preferred.

We note that the census BIC (3.1) is conceptual, because observing $(y_{i}, x_{i})$ for all units in $D$ is usually not feasible in applications. Instead, a representative sample $d = \{i_{1}, \dots, i_{n}\} \subset \{1, \dots, N\}$ of $n$ units is typically drawn from $D,$ and measurements are observed only for the sampled units. Because of the intrinsic dependence structure among the sampled units, a full likelihood on $d$ is in general prohibitively expensive to compute. As an alternative, model-design-based inference frequently uses a pseudo-log-likelihood function of the form

$l_{n} (β) = \sum_{i \in d} w_{i} \log f (y_{i}; x_{i} β), \qquad (3.2)$

with $w_{i} = k / P (i \in d)$ denoting the survey weight for the $i^{th}$ unit. The scaling parameter $k$ in $w_{i}$ does not affect the pseudo-likelihood-based inference, since rescaling all weights by a common factor leaves the maximizer of $l_{n} (β)$ unchanged. For simplicity of presentation, we choose $k = n / N$ so that $n^{- 1} l_{n} (β)$ is design-unbiased for $N^{- 1} l_{N} (β) .$ Maximizing $l_{n} (β)$ over $β$ leads to a maximum pseudo-likelihood estimator (MPLE) $\hat{β}$ for $β,$ i.e.,

$\hat{β} = \arg \max_{β} l_{n} (β) .$

Under appropriate sampling designs, $\hat{β}$ is often consistent for $β$ at the rate $n^{- 1 / 2}$ under the joint randomization framework. The idea of using pseudo-likelihood for inference on model parameters has been widely adopted in the literature (see, e.g., Binder 1983; Godambe and Thompson 1986; Molina and Skinner 1992).
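As a minimal sketch of how an MPLE can be computed, the following assumes a Bernoulli working model with a logit link for $f (y_{i}; x_{i} β)$ and Poisson sampling with inclusion probabilities that depend on $y_{i}$. The model, the design, the simulated population and the function names `pseudo_loglik` and `fit_mple` are assumptions made for illustration only; they are not taken from the paper. The last lines check numerically that rescaling the weights leaves $\hat{β}$ unchanged.

```python
import numpy as np

def pseudo_loglik(beta, X, y, w):
    """Survey-weighted Bernoulli log-likelihood, an instance of (3.2)."""
    eta = X @ beta
    return float(np.sum(w * (y * eta - np.logaddexp(0.0, eta))))

def fit_mple(X, y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson maximizer of the pseudo-log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical finite population (assumption, for illustration only).
rng = np.random.default_rng(0)
N = 5000
X_pop = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
y_pop = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X_pop @ beta_true))))

# Informative Poisson sampling: units with y = 1 are over-sampled.
pi = np.where(y_pop == 1, 0.15, 0.05)
sampled = rng.random(N) < pi
X, y = X_pop[sampled], y_pop[sampled]
n = int(sampled.sum())

w = (n / N) / pi[sampled]                 # w_i = k / P(i in d) with k = n / N
beta_hat = fit_mple(X, y, w)
beta_rescaled = fit_mple(X, y, 3.0 * w)   # a different scaling constant k

print("MPLE:", beta_hat)
print("pseudo-log-likelihood at MPLE:", pseudo_loglik(beta_hat, X, y, w))
```

Because both the weighted score and the weighted Hessian scale linearly in $k,$ every Newton step is the same for `w` and `3.0 * w`, which is why the two fits coincide up to rounding.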

In this paper, we aim to develop an analogue of the BIC based on the pseudo-likelihood. Following the super-population formulation described in Section 2, let $β_{s}$ be the $τ (s)$-dimensional coefficient vector of model $s$ and let $ν_{s}$ be the prior density of $β_{s} .$ Then a pseudo-marginal density function of the data is given by

$P_{n} (y | s) = \int L_{n} (y; β_{s}) ν_{s} (β_{s}) d β_{s}$

with $L_{n} (y; β_{s}) = \exp \{l_{n} (y; β_{s})\} .$ Consequently, we may regard the following expression as the pseudo-posterior probability of the model $s :$

$P_{n} (s | y) = \frac{P_{n} (y | s) P (s)}{\sum_{s \in S} P (s) P_{n} (y | s)}, \qquad (3.3)$

where $S$ denotes the collection of all candidate models. In the spirit of Bayesian analysis, the model with the highest $P_{n} (s | y)$ is then considered to be the one that receives the most support from the data. Since $\sum_{s \in S} P (s) P_{n} (y | s)$ does not depend on any specific model, the highest $P_{n} (s | y)$ is achieved by the model that maximizes the corresponding $P_{n} (y | s) P (s)$ . When the uniform prior $P (s) = ζ$ is used and the weight scaling is chosen as $k = n / N,$ we obtain a Laplace approximation under some regularity conditions (see Xu and Chen 2012):

$- 2 \log \{P_{n} (y | s)\} = - 2 l_{n} ({\hat{β}}_{s}) + τ (s) \log n + O_{p} (1) .$
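A heuristic sketch of where the $\log n$ term comes from may be helpful. The matrix $H_{n}$ below is notation introduced only for this illustration, and the rigorous conditions are those of Xu and Chen (2012). Assume that $H_{n} = - n^{- 1} \partial^{2} l_{n} ({\hat{β}}_{s}) / \partial β_{s} \partial β_{s}^{T}$ converges to a positive definite matrix, which is plausible under $k = n / N$ because the weights then sum to roughly $n,$ and that $ν_{s}$ is positive and bounded near ${\hat{β}}_{s} .$ A second-order expansion of $l_{n}$ around ${\hat{β}}_{s}$ turns the integral into an approximately Gaussian one:

$P_{n} (y | s) \approx \exp \{l_{n} ({\hat{β}}_{s})\} ν_{s} ({\hat{β}}_{s}) {(2 π)}^{τ (s) / 2} {| n H_{n} |}^{- 1 / 2} .$

Since $| n H_{n} | = n^{τ (s)} | H_{n} |,$ taking $- 2 \log$ of both sides gives

$- 2 \log \{P_{n} (y | s)\} \approx - 2 l_{n} ({\hat{β}}_{s}) + τ (s) \log n + \log | H_{n} | - τ (s) \log (2 π) - 2 \log ν_{s} ({\hat{β}}_{s}),$

where the last three terms remain bounded as $n$ grows and are absorbed into the $O_{p} (1)$ remainder.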

Accordingly, we choose the model $s$ that minimizes

${BIC}_{n} (s) = - 2 l_{n} ({\hat{β}}_{s}) + τ (s) \log n . \qquad (3.4)$

Compared with the census BIC (3.1), BIC (3.4) replaces the maximized census log-likelihood with the maximized survey-weighted pseudo-log-likelihood, and $\log N$ with $\log n .$ The survey weighting is potentially helpful in avoiding sampling errors that might otherwise lead to biased inferences for the target population. We refer to (3.4) as a pseudo-likelihood-based version of BIC in the context of surveys. In the joint randomization framework, we establish the selection consistency of BIC (3.4) through an implementation procedure based on penalized pseudo-likelihood (PPL), as will be seen in Section 4.

3.2 Implementing BIC via penalized pseudo-likelihood

In applications, a straightforward way to implement BIC is best-subset selection, where BIC is evaluated and compared for each candidate model. However, this procedure can be computationally impractical when the number of covariates is large. Alternatively, penalized likelihood methods have recently been used as computationally efficient procedures for implementing a selection criterion. These methods exclude variables from the model by estimating their coefficients to be zero, and shrink the other coefficients accordingly. By varying the penalty on the likelihood, we can obtain a series of models with differing sparsity. To avoid an exhaustive search of the entire model space, the selection criterion is used to pick an optimal one among these sparse models. The effectiveness of this implementation strategy has been illustrated in the non-survey context for BIC (Wang, Li and Tsai 2007; Liu, Wang and Liang 2011) and GCV (Fan and Li 2001; Xie, Pan and Shen 2008) among others.
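To make the cost of best-subset selection concrete, the following minimal sketch evaluates BIC (3.4) for every subset of $p = 6$ covariates, that is, $2^{6} = 64$ candidate models. It assumes a unit-variance Gaussian working model, for which $- 2 l_{n} (β)$ equals the weighted residual sum of squares up to a constant shared by all models. The simulated data, the made-up weights and the function names `bic_n` and `best_subset` are illustrative assumptions, not part of the paper.

```python
from itertools import combinations

import numpy as np

def bic_n(X, y, w, subset):
    """BIC_n(s) of (3.4), up to a model-free constant, for a unit-variance
    Gaussian working model fitted by weighted least squares."""
    n = len(y)
    resid = y
    if len(subset) > 0:
        Xs = X[:, list(subset)]
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(Xs * sw[:, None], y * sw, rcond=None)
        resid = y - Xs @ coef
    return float(np.sum(w * resid**2) + len(subset) * np.log(n))

def best_subset(X, y, w):
    """Evaluate BIC_n on all 2^p subsets and return the minimizer."""
    p = X.shape[1]
    candidates = [s for size in range(p + 1)
                  for s in combinations(range(p), size)]
    scores = {s: bic_n(X, y, w, s) for s in candidates}
    return min(scores, key=scores.get), scores

# Hypothetical sample (assumption): covariates 0 and 3 are influential.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Made-up survey weights, rescaled to sum to n (mimicking k = n / N).
w = rng.uniform(0.5, 1.5, size=n)
w *= n / w.sum()

best, scores = best_subset(X, y, w)
print("models evaluated:", len(scores))
print("selected covariates:", best)
```

The number of fits doubles with each additional covariate, so the same loop with $p = 30$ would require more than a billion model fits.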

In the same spirit, we propose a penalized pseudo-likelihood (PPL) procedure for implementing BIC (3.4) with survey data. Specifically, following pseudo-likelihood (3.2) with $k = n / N,$ we define the survey-weighted penalized estimator ${\hat{β}}_{λ}$ as the maximizer of the penalized pseudo-likelihood function

$Q_{n} (β) = l_{n} (β) - n \sum_{j = 1}^{p} ϕ_{λ} (| β_{j} |), \qquad (3.5)$

where $ϕ_{λ} (\cdot)$ is a penalty function indexed by a tuning parameter $λ$ that controls the size of the penalty. With an appropriate choice of $ϕ_{λ} (\cdot),$ the estimator ${\hat{β}}_{λ}$ contains zero estimates for some coefficients and thus automatically produces a sparse model. Sparsity of ${\hat{β}}_{λ}$ typically requires $ϕ_{λ} (\cdot)$ to be singular at the origin. Some popular choices of $ϕ_{λ} (\cdot)$ include the $L_{γ}$ penalty (Frank and Friedman 1993; Tibshirani 1996), i.e., $ϕ_{λ} (| β |) = λ {| β |}^{γ}$ with $γ \in (0,1],$ and the SCAD penalty (Fan and Li 2001), which is defined through its derivative:

${ϕ^{'}}_{λ} (| β |) = λ \left\{ I (| β | \leq λ) + \frac{{(a λ - | β |)}_{+}}{(a - 1) λ} I (| β | > λ) \right\} \qquad (3.6)$

with $a = 3.7$ being a common choice.
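The SCAD derivative (3.6) can be sketched directly in code. The function `scad_derivative` below transcribes (3.6). The function `scad_penalty` is its antiderivative with $ϕ_{λ} (0) = 0,$ written out here for illustration since the text above only states the derivative. Both function names are assumptions of this sketch.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty, as in (3.6)."""
    b = np.abs(np.asarray(beta, dtype=float))
    tail = np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam)
    return lam * ((b <= lam) + tail * (b > lam))

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty obtained by integrating (3.6) from 0 to |beta|."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                           # |beta| <= lam
    quadratic = (2.0 * a * lam * b - b**2 - lam**2) / (2.0 * (a - 1.0))
    flat = lam**2 * (a + 1.0) / 2.0                            # |beta| > a*lam
    return np.where(b <= lam, linear,
                    np.where(b <= a * lam, quadratic, flat))

# Three regimes for lam = 1 and a = 3.7:
print(scad_derivative(0.5, 1.0))   # constant slope lam near the origin
print(scad_derivative(2.35, 1.0))  # slope decays linearly between lam and a*lam
print(scad_derivative(4.0, 1.0))   # no further penalty growth beyond a*lam
```

Near the origin the penalty behaves like the $L_{1}$ penalty, which produces exact zeros, while beyond $a λ$ its slope is zero, so large coefficients are not shrunk further.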

For a properly specified $ϕ_{λ} (\cdot),$ different values of $λ$ lead to estimators ${\hat{β}}_{λ}$ and hence models of differing sparsity. These sparse models (indexed by $λ$ ) naturally form a collection of candidate models. BIC (3.4) can then be used to select an optimal model within this collection. To be more specific, let $Ω$ be the range of $λ$ and let $s_{λ}$ denote the model produced by ${\hat{β}}_{λ} .$ We treat $S_{Ω} = \{s_{λ} : λ \in Ω\}$ as the collection of candidate models under consideration, and select the model $s^{*} \in S_{Ω}$ such that ${BIC}_{n} (s^{*}) = \min_{λ \in Ω} {BIC}_{n} (s_{λ}) .$ We refer to this selection procedure as the penalized pseudo-likelihood-based BIC method (PPL-BIC). Compared with traditional best-subset selection, the PPL-BIC procedure considers only the models produced by the survey-weighted penalized estimators, and can therefore be much less computationally expensive.
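The PPL-BIC procedure can be sketched as follows under simplifying assumptions that are not part of the paper: a unit-variance Gaussian working model, the $L_{1}$ penalty ( $γ = 1$ ) rather than SCAD, a coordinate-descent solver for (3.5), and a finite grid standing in for $Ω .$ Following (3.4), each support $s_{λ}$ is scored with the unpenalized weighted refit on that support. The data, weights and function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def ppl_lasso(X, y, w, lam, n_sweeps=200):
    """Coordinate descent for (3.5) with a Gaussian working model and
    the L1 penalty: minimizes 0.5 * sum(w * r^2) + n * lam * sum(|beta|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (w[:, None] * X**2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]            # partial residual without j
            z = np.sum(w * X[:, j] * resid)
            beta[j] = soft_threshold(z, n * lam) / col_scale[j]
            resid -= X[:, j] * beta[j]
    return beta

def bic_n(X, y, w, support):
    """BIC_n of (3.4), up to a constant, using the weighted refit on support."""
    n = len(y)
    resid = y
    if support.any():
        Xs = X[:, support]
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(Xs * sw[:, None], y * sw, rcond=None)
        resid = y - Xs @ coef
    return float(np.sum(w * resid**2) + support.sum() * np.log(n))

def ppl_bic(X, y, w, lambdas):
    """Pick the model on the penalized path with the smallest BIC_n."""
    best = None
    for lam in lambdas:
        support = ppl_lasso(X, y, w, lam) != 0.0
        score = bic_n(X, y, w, support)
        if best is None or score < best[0]:
            best = (score, lam, support)
    return best

# Hypothetical sample (assumption): covariates 0 and 3 are influential.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 1.5, size=n)
w *= n / w.sum()                                   # made-up weights summing to n

lambdas = np.geomspace(1.0, 0.01, 15)              # finite grid standing in for Omega
score, lam_star, support = ppl_bic(X, y, w, lambdas)
print("selected lambda:", lam_star)
print("selected covariates:", np.flatnonzero(support))
```

Here only 15 penalized fits are needed, compared with the 64 fits an exhaustive search over six covariates would require, and the gap widens rapidly as $p$ grows.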

Date modified: 2017-09-20

Survey Methodology
