Firth’s penalized likelihood for proportional hazards regressions for complex surveys
Section 1. Introduction
The Cox proportional hazards regression model (Cox, 1972) is widely used to analyze
survival data. It is a semiparametric model that explains the effect of
explanatory variables on hazard rates. The model assumes a linear form for the
effect of the explanatory variables but allows an unspecified form for the
underlying survivor function. The parameters of the model are estimated by
maximizing a partial likelihood (Cox,
1972, 1975).
For estimating canonical parameters in exponential family distributions, Firth (1993)
suggested multiplying the likelihood by the Jeffreys prior to obtain a maximum
likelihood estimate that is first-order unbiased. The penalized likelihood is
of the form
$$L^{*}(\boldsymbol{\beta}) = L(\boldsymbol{\beta})\,\lvert I(\boldsymbol{\beta})\rvert^{1/2},$$
where $L(\boldsymbol{\beta})$ is the unpenalized likelihood, $I(\boldsymbol{\beta})$ is the information matrix, and $\boldsymbol{\beta}$ is the vector of regression parameters. Firth's
penalized likelihood is a very useful technique in practice, not only to reduce
bias but also to correct for monotone likelihoods.
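For reference, on the log scale the penalty adds half the log determinant of the information matrix to the log likelihood, and differentiating the penalty term by term yields the modified score equations of Firth (1993):
$$l^{*}(\boldsymbol{\beta}) = l(\boldsymbol{\beta}) + \tfrac{1}{2}\log\lvert I(\boldsymbol{\beta})\rvert, \qquad U_j^{*}(\boldsymbol{\beta}) = U_j(\boldsymbol{\beta}) + \tfrac{1}{2}\,\operatorname{tr}\!\left\{ I(\boldsymbol{\beta})^{-1}\, \frac{\partial I(\boldsymbol{\beta})}{\partial \beta_j} \right\}, \quad j = 1, \ldots, p.$$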
Proportional hazards regression models often suffer from
monotone likelihoods, in which the likelihood converges to a finite value but
at least one parameter diverges (Heinze,
1999). Firth’s penalized likelihood is also used to correct monotone
likelihoods and to obtain parameter estimates that converge (Heinze, 1999; Heinze and Schemper, 2001;
Heinzel, Rüdiger and Schilling,
2002).
Although Firth's penalized likelihood is useful for reducing bias and for obtaining estimates from monotone likelihoods, the penalized likelihood has not been studied for complex surveys involving unequal weights. It is reasonable to use a weighted likelihood for complex surveys to
compensate for unequal weighting (Fuller,
1975; Binder and Patak, 1994). Survey data sets commonly include design
weights or analysis weights for which the sum of the weights is an estimator of
the population size. However, these unscaled weights will not appropriately
scale the information matrix that is used in the penalty term. It is desirable
for proportional hazards regression parameters for survey data to have the
following two properties:
- Invariance: The point estimates and
standard errors for the regression parameters should be invariant to the scale
of the weights.
- Closeness: The Taylor linearized variance
for the estimated regression parameters should be close to the delete-one
jackknife variance.
In this article, we first show that if the Firth correction is not used, then both the invariance and closeness properties are satisfied;
but if the Firth correction is used with the unscaled weights, then the point
estimates and the standard errors are not invariant to the scale of the
weights. That is, if the weights are multiplied by a constant and the Firth
correction is used, then the point estimates and standard errors will be
different. We then propose a commonsense weight scaling method and demonstrate
that the Firth correction using the scaled weights has both properties. The
only difference between the scaled and unscaled weights is that the sum of the scaled weights equals the sample size, whereas the sum of the unscaled weights is an estimator of the population size.
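As an illustration, here is a minimal SAS sketch of this scaling, assuming the unscaled weights are stored in a variable w in a data set MyData (both names hypothetical):

```sas
/* Compute the sample size n and the sum of the unscaled weights. */
proc sql noprint;
   select count(*), sum(w) into :n, :wsum from MyData;
quit;

/* Rescale so that the scaled weights sum to the sample size n. */
data MyData;
   set MyData;
   w_scaled = w * &n / &wsum;
run;
```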
1.1 Example that uses unscaled weights
We used a data
set from a study of 65 myeloma patients who were treated with alkylating agents
(Lee, Wei and Amato, 1992) to
demonstrate the properties of Firth’s penalized likelihood that uses unscaled
weights. Survival times in months were recorded for each patient. Patients who
were alive after the study period were considered to be censored. The following
variables were available for each patient:
- Time: Survival time in months,
- Vstatus: Patient status, zero or one, indicating
whether the patient was alive or dead, respectively,
- LogBUN: Log of blood urea nitrogen level,
- HGB: Blood hemoglobin level.
To create a monotone likelihood, we added a new
explanatory variable, Contrived, such that its value at all event times is the
largest of all values in the risk set (see the example “Firth’s Correction for
Monotone Likelihood” in “The PHREG Procedure” in SAS Institute Inc. (2018)). The variable Contrived has the value
1 if the observed survival time is less than or equal to 65; otherwise it has
the value 0.
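For instance, assuming the study data are stored in a SAS data set named Myeloma (a hypothetical name), the variable could be created in a DATA step:

```sas
data Myeloma;
   set Myeloma;
   Contrived = (Time <= 65);   /* 1 if survival time <= 65 months, 0 otherwise */
run;
```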
To demonstrate the effect of weights in Firth’s
penalized likelihood, we created three weight variables, w1, w3, and w5, with
the values of 1, 1,000, and 100,000 for each observation, respectively.
Proportional hazards regression parameters are estimated by maximizing a
weighted likelihood as described in Section 1.2. Because w1 has the value 1 for all observations, using w1 in the analysis is equivalent to performing the unweighted analysis.
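Continuing the hypothetical Myeloma data set from above, the three weight variables might be created as follows:

```sas
data Myeloma;
   set Myeloma;
   w1 = 1;         /* using w1 reproduces the unweighted analysis */
   w3 = 1000;
   w5 = 100000;
run;
```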
We fitted the following two proportional hazards models
using the PHREG procedure in SAS/STAT® (see “The PHREG Procedure” in
SAS Institute Inc. (2018)):
$$\lambda(t) = \lambda_0(t)\exp\left(\beta_1\,\text{LogBUN} + \beta_2\,\text{HGB}\right),$$
$$\lambda(t) = \lambda_0(t)\exp\left(\beta_1\,\text{LogBUN} + \beta_2\,\text{HGB} + \beta_3\,\text{Contrived}\right),$$
where $\lambda(t)$ and $\lambda_0(t)$ are the hazard function and the baseline hazard function, respectively. Firth's penalized likelihood is not required in
order to fit the first model without Contrived (the likelihood converged in
three iteration steps), but the second model containing the variable Contrived
does not converge without the Firth penalty in the likelihood. Table 1.1
displays the value of the likelihood and the three regression coefficients for
14 iterations. Although the objective function and the coefficients for LogBUN and HGB converge to finite values after the fourth iteration, the coefficient for Contrived diverges. This is an example of a monotone likelihood for the variable Contrived. Because of this monotonicity, Firth's penalized likelihood must be used to fit the second model containing Contrived.
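One way these two fits might be specified with the PHREG procedure (the data set name is hypothetical; the FIRTH option and the WEIGHT statement are documented PHREG features):

```sas
/* Model 1: LogBUN and HGB only; converges without the Firth correction */
proc phreg data=Myeloma;
   weight w1;
   model Time*Vstatus(0) = LogBUN HGB;
run;

/* Model 2: adds Contrived; the monotone likelihood requires FIRTH */
proc phreg data=Myeloma;
   weight w1;   /* repeat with w3 and w5 to vary the weight scale */
   model Time*Vstatus(0) = LogBUN HGB Contrived / firth;
run;
```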
Table 1.1
Maximum likelihood iteration history showing a monotone likelihood for the variable Contrived
| Iteration number | Likelihood value | LogBUN | HGB | Contrived |
|---|---|---|---|---|
| 1 | -140.693405 | 1.994882 | -0.084319 | 1.466331 |
| 2 | -137.784163 | 1.679468 | -0.109068 | 2.778361 |
| 3 | -136.971190 | 1.714061 | -0.111564 | 3.938095 |
| 4 | -136.707893 | 1.718174 | -0.112273 | 5.003054 |
| 5 | -136.616426 | 1.718755 | -0.112370 | 6.027436 |
| 6 | -136.583520 | 1.718829 | -0.112382 | 7.036445 |
| 7 | -136.571515 | 1.718839 | -0.112384 | 8.039764 |
| 8 | -136.567113 | 1.718841 | -0.112384 | 9.040985 |
| 9 | -136.565495 | 1.718841 | -0.112384 | 10.041434 |
| 10 | -136.564900 | 1.718841 | -0.112384 | 11.041600 |
| 11 | -136.564681 | 1.718841 | -0.112384 | 12.041660 |
| 12 | -136.564601 | 1.718841 | -0.112384 | 13.041683 |
| 13 | -136.564571 | 1.718841 | -0.112384 | 14.041691 |
| 14 | -136.564560 | 1.718841 | -0.112384 | 15.041694 |
If Contrived is not used as an explanatory variable,
then all three sets of weights produce the same point estimates and Taylor
linearized variance estimates (Table 1.2). The delete-one jackknife
variance estimates are also the same for all three sets of weights. Thus, the
point estimates and the standard errors are invariant to the scale of the
weights when the Firth correction is not used.
Table 1.2
Parameter estimates and standard errors without the Firth correction for all three sets of weights
| Parameter | Estimate | Std. Err. |
|---|---|---|
| LogBUN | 1.674 | 0.583 |
| HGB | -0.119 | 0.060 |
However, if the unscaled weights are used, then the
point estimates for Contrived are not invariant to the scale of the weights.
Table 1.3 displays the parameter estimates for three sets of weights when
Contrived is used as an explanatory variable (and Firth’s penalized likelihood
is applied). Because the likelihood is not monotone in LogBUN and HGB (Table 1.1), the point estimates for these two coefficients are not affected by the scale of the weights.
Table 1.3
Parameter estimates with the Firth correction and unscaled weights
| Parameter | Estimate (w1) | Std. Err. (w1) | Estimate (w3) | Std. Err. (w3) | Estimate (w5) | Std. Err. (w5) |
|---|---|---|---|---|---|---|
| LogBUN | 1.722 | 0.584 | 1.719 | 1.85E-2 | 1.719 | 1.85E-3 |
| HGB | -0.112 | 0.061 | -0.112 | 1.93E-3 | -0.112 | 1.93E-4 |
| Contrived | 3.815 | 1.558 | 10.629 | 1.38 | 14.633 | 1.02 |
If Contrived is not used as an explanatory variable, then the ratio of jackknife standard errors to Taylor linearized standard errors is 1.13 for LogBUN and 1.10 for HGB, for all three sets of weights. Thus the ratio of the jackknife variance to the Taylor linearized variance for the unpenalized likelihood is invariant to the scale of the weights, and it is reasonable to expect the ratio to remain invariant when the penalized likelihood is used.
1.2 A brief review of point and variance estimates
for regression parameters for finite populations
Before we discuss the weight scaling method, we briefly
review point and variance estimates for regression parameters for proportional
hazards regression for complex surveys involving unequal weights. Lin and Wei (1989), Binder (1990, 1992), Lin (2000), and Boudreau and Lawless (2006) discussed pseudo-maximum likelihood estimation of proportional hazards regression parameters for survey data. For a more general description of estimating regression parameters for complex surveys, see Kish and Frankel (1974), Godambe and Thompson (1986), Pfeffermann (1993), Korn and Graubard (1999, Chapter 3), Chambers and Skinner (2003, Chapter 2), and Fuller (2009, Section 6.5). Wolter (2007) described different variance estimation techniques for survey data.
Let $U = \{1, 2, \ldots, N\}$ be the set of indices and let $\mathcal{F}_N$ be the set of values for a finite population of size $N$. The survival time of each member of the finite population is assumed to follow its own hazard function, expressed as
$$\lambda_i(t) = \lambda_0(t) \exp\{\mathbf{z}_i'(t)\boldsymbol{\beta}\},$$
where $\lambda_0(t)$ is an arbitrary and unspecified baseline hazard function, $\mathbf{z}_i(t)$ is a vector of size $p$ of explanatory variables for the $i^{\text{th}}$ unit at time $t$, and $\boldsymbol{\beta}$ is a vector of unknown regression parameters.
The partial likelihood function introduced by Cox (1972, 1975) eliminates the unknown baseline hazard $\lambda_0(t)$ and accounts for censored survival times. If the entire population is observed, then this partial likelihood function can be used to estimate $\boldsymbol{\beta}$. Let $\boldsymbol{\beta}_N$ be the desired estimator. Assuming a working model with uncorrelated responses, $\boldsymbol{\beta}_N$ is obtained by maximizing the partial log likelihood $l_U(\boldsymbol{\beta}) = \log L_U(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$, where $L_U(\boldsymbol{\beta})$ is Cox's partial likelihood function.
Assume that a probability sample $A$ is selected from the finite population $U$. Let $\pi_i$ be the selection probability and $w_i = 1/\pi_i$ be the sampling weight for unit $i$. Further assume that the explanatory variables $\mathbf{z}_i(t)$ and the survival time $t_i$ are available for every unit in sample $A$. A design unbiased estimator for the finite population log likelihood is
$$\hat{l}(\boldsymbol{\beta}) = \sum_{i \in A} w_i\, l_i(\boldsymbol{\beta}),$$
where $l_i(\boldsymbol{\beta})$ denotes the log likelihood contribution of unit $i$. A sample-based estimator $\hat{\boldsymbol{\beta}}$ for the finite population quantity $\boldsymbol{\beta}_N$ can be obtained by maximizing the partial pseudo-log likelihood $\hat{l}(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$. The design-based variance for $\hat{\boldsymbol{\beta}}$ is obtained by assuming that the set of finite population values $\mathcal{F}_N$ is fixed.
The weighted Breslow likelihood can be expressed as
$$\hat{L}(\boldsymbol{\beta}) = \prod_{k=1}^{K} \frac{\exp\left\{\boldsymbol{\beta}' \sum_{i \in D_k} w_i \mathbf{z}_i(t_{(k)})\right\}}{\left[\sum_{j \in R_k} w_j \exp\{\boldsymbol{\beta}'\mathbf{z}_j(t_{(k)})\}\right]^{\sum_{i \in D_k} w_i}},$$
where $R_k$ is the risk set just before the $k^{\text{th}}$ ordered event time $t_{(k)}$, $D_k$ is the set of individuals who fail at $t_{(k)}$, and $K$ is the number of distinct event times. The point estimates for $\boldsymbol{\beta}_N$ are obtained by maximizing $\hat{L}(\boldsymbol{\beta})$.
Although the weights are sufficient for estimating regression coefficients for the finite population, stratification and clustering information must also be used to estimate the sampling variability. To do so, you can use either the Taylor series linearization method or a replication method.
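Both methods are available in the SURVEYPHREG procedure in SAS/STAT; a sketch, assuming hypothetical stratum, PSU, and weight variables named Stratum, PSU, and w:

```sas
/* Taylor series linearization (the default variance method) */
proc surveyphreg data=MyData varmethod=taylor;
   strata  Stratum;
   cluster PSU;
   weight  w;
   model Time*Vstatus(0) = LogBUN HGB;
run;

/* Delete-one jackknife replication */
proc surveyphreg data=MyData varmethod=jackknife;
   strata  Stratum;
   cluster PSU;
   weight  w;
   model Time*Vstatus(0) = LogBUN HGB;
run;
```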
1.2.1 Analytic variance estimator using the Taylor
series linearization method
The Taylor series linearization method uses a sum of
squares of the weighted score residuals to estimate the sampling variability.
Define
$$S^{(0)}(\boldsymbol{\beta}, t) = \sum_{j \in R(t)} w_j \exp\{\boldsymbol{\beta}'\mathbf{z}_j(t)\} \quad \text{and} \quad S^{(1)}(\boldsymbol{\beta}, t) = \sum_{j \in R(t)} w_j \exp\{\boldsymbol{\beta}'\mathbf{z}_j(t)\}\,\mathbf{z}_j(t),$$
where $R(t)$ is the risk set just before time $t$, and let $\bar{\mathbf{z}}(\boldsymbol{\beta}, t) = S^{(1)}(\boldsymbol{\beta}, t) / S^{(0)}(\boldsymbol{\beta}, t)$. The score residual for the $i^{\text{th}}$ subject is
$$\mathbf{L}_i(\boldsymbol{\beta}) = \delta_i\left\{\mathbf{z}_i(t_i) - \bar{\mathbf{z}}(\boldsymbol{\beta}, t_i)\right\} - \sum_{k:\, t_{(k)} \le t_i} \frac{\exp\{\boldsymbol{\beta}'\mathbf{z}_i(t_{(k)})\}}{S^{(0)}(\boldsymbol{\beta}, t_{(k)})} \left\{\mathbf{z}_i(t_{(k)}) - \bar{\mathbf{z}}(\boldsymbol{\beta}, t_{(k)})\right\} \sum_{j \in D_k} w_j,$$
where $\delta_i$ is the event indicator.
Then the Taylor linearized variance estimator is
$$\hat{V}(\hat{\boldsymbol{\beta}}) = \hat{I}^{-1}(\hat{\boldsymbol{\beta}})\, \hat{\mathbf{G}}\, \hat{I}^{-1}(\hat{\boldsymbol{\beta}}),$$
where $\hat{I}(\hat{\boldsymbol{\beta}})$ is the observed information matrix and the $p \times p$ matrix $\hat{\mathbf{G}}$ is defined as
$$\hat{\mathbf{G}} = \sum_{i \in A}\sum_{j \in A} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}}\, \{w_i \mathbf{L}_i(\hat{\boldsymbol{\beta}})\}\{w_j \mathbf{L}_j(\hat{\boldsymbol{\beta}})\}',$$
where $\pi_{ij}$ are the joint inclusion probabilities for units $i$ and $j$.
In particular, for stratified cluster designs in which the PSUs are selected by using a simple random sample without replacement, the matrix $\hat{\mathbf{G}}$ reduces to
$$\hat{\mathbf{G}} = \sum_{h=1}^{H} (1 - f_h)\, \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left(\mathbf{e}_{hi} - \bar{\mathbf{e}}_h\right)\left(\mathbf{e}_{hi} - \bar{\mathbf{e}}_h\right)',$$
where $\mathbf{e}_{hi}$ is the weighted sum of the score residuals, $\mathbf{L}_j(\hat{\boldsymbol{\beta}})$, in stratum $h$ and PSU $i$; $\bar{\mathbf{e}}_h$ is the mean of the $\mathbf{e}_{hi}$ in stratum $h$; $n_h$ is the number of PSUs in stratum $h$; and $f_h$ is the sampling fraction in stratum $h$.
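To make the stratified formula concrete, the following SAS/IML sketch evaluates $\hat{\mathbf{G}}$ for a single parameter ($p = 1$), using made-up PSU-level weighted score-residual totals:

```sas
proc iml;
   /* Hypothetical weighted score-residual totals e_hi:           */
   /* row h = stratum, column i = PSU (n_h = 3 PSUs per stratum). */
   e = {1.2  0.8  1.0,
        0.5  0.9  0.7};
   f = {0.1, 0.2};                  /* sampling fractions f_h     */
   G = 0;
   do h = 1 to nrow(e);
      eh = e[h, ];
      nh = ncol(eh);
      d  = eh - eh[:];              /* deviations from the stratum mean */
      G  = G + (1 - f[h]) * (nh / (nh - 1)) * (d * d`);
   end;
   print G;                         /* the 1 x 1 matrix G-hat     */
quit;
```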
These estimators are well studied in the sample survey literature. For example, Binder (1992) and Lin (2000) provide conditions under which $\hat{\boldsymbol{\beta}}$ and $\hat{V}(\hat{\boldsymbol{\beta}})$ are consistent. Chambless and Boyle (1985) derived the design-based variance and asymptotic normality for discrete proportional hazards models.
1.2.2 Replication variance estimator using the delete-one
jackknife method
The jackknife method is a commonly used replication variance estimation method for complex surveys. To create replicates, it deletes (assigns a zero weight to) one PSU at a time from the full sample. In each replicate, the sampling weights of the remaining PSUs are modified by the jackknife coefficient $\alpha_r$. The modified weights are called replicate weights.
Let PSU $i$ in stratum $h$ be omitted from the $r^{\text{th}}$ replicate; then the replicate weights and jackknife coefficients are given by
$$w_j^{(r)} = \begin{cases} 0 & \text{if unit } j \text{ is in stratum } h \text{ and PSU } i, \\[4pt] \dfrac{n_h}{n_h - 1}\, w_j & \text{if unit } j \text{ is in stratum } h \text{ but not in PSU } i, \\[4pt] w_j & \text{otherwise}, \end{cases}$$
and $\alpha_r = (n_h - 1)/n_h$, respectively, for all observation units $j$. The number of PSUs in stratum $h$ is $n_h$.
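A sketch of this construction for a single replicate, using hypothetical macro variables &del_h and &del_psu for the deleted PSU and &n_h for the number of PSUs in its stratum:

```sas
data Replicate;
   set MyData;
   if Stratum = &del_h then do;
      if PSU = &del_psu then w_rep = 0;      /* deleted PSU gets zero weight  */
      else w_rep = w * &n_h / (&n_h - 1);    /* remaining PSUs are reweighted */
   end;
   else w_rep = w;                           /* other strata are unchanged    */
run;
```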
The jackknife method can be applied to estimate
variances for the estimated regression parameters for Cox’s model because the
model parameters are solutions of a set of estimating equations that are smooth
functions of totals (the corresponding score functions are given in Section 2).
Properties of jackknife variance estimators for proportional hazards regression models are discussed in Shao and Tu (1995, Section 8.3).
To apply the jackknife method, model parameters are estimated by using the full sample and by using every replicate sample. Let $\hat{\boldsymbol{\beta}}$ be the estimated proportional hazards regression coefficients from the full sample, and let $\hat{\boldsymbol{\beta}}^{(r)}$ be the estimated regression coefficients from the $r^{\text{th}}$ replicate. Then the covariance matrix of $\hat{\boldsymbol{\beta}}$ is estimated by
$$\hat{V}_{\text{JK}}(\hat{\boldsymbol{\beta}}) = \sum_{r=1}^{R} \alpha_r \left(\hat{\boldsymbol{\beta}}^{(r)} - \hat{\boldsymbol{\beta}}\right)\left(\hat{\boldsymbol{\beta}}^{(r)} - \hat{\boldsymbol{\beta}}\right)',$$
where $R$ is the total number of replicates. If the sampling fractions are not ignorable, then the covariance matrix of $\hat{\boldsymbol{\beta}}$ is estimated by
$$\hat{V}_{\text{JK}}(\hat{\boldsymbol{\beta}}) = \sum_{r=1}^{R} (1 - f_h)\, \alpha_r \left(\hat{\boldsymbol{\beta}}^{(r)} - \hat{\boldsymbol{\beta}}\right)\left(\hat{\boldsymbol{\beta}}^{(r)} - \hat{\boldsymbol{\beta}}\right)',$$
where $f_h$ is the sampling fraction in stratum $h$ (the stratum from which the $r^{\text{th}}$ replicate deletes a PSU).
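As a worked illustration of the first formula, a SAS/IML sketch for a single coefficient with four hypothetical replicate estimates:

```sas
proc iml;
   b_full = 1.70;                        /* full-sample estimate                 */
   b_rep  = {1.68, 1.73, 1.69, 1.71};    /* replicate estimates                  */
   alpha  = j(4, 1, 3/4);                /* alpha_r = (n_h - 1)/n_h with n_h = 4 */
   v_jk   = sum(alpha # (b_rep - b_full)##2);
   print v_jk;                           /* jackknife variance estimate          */
quit;
```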
In practice, both Taylor linearized variance and jackknife variance estimates are used to construct Wald $100(1 - \alpha)\%$ confidence intervals with $n - H$ degrees of freedom, where $n$ is the number of PSUs (or the number of replicates) and $H$ is the number of strata.
It is straightforward to show that the jackknife
variance estimator is algebraically equivalent to the Taylor linearized
estimator for design linear estimators. But for design nonlinear estimators,
such as the regression coefficients for proportional hazards regression models,
the jackknife method tends to produce slightly higher variance estimates than the
Taylor linearized method (Fuller, 2009).
Note that if the full sample estimate suffers from a monotone likelihood, then it is very likely that most replicate samples will also suffer from monotone likelihoods. This will result in many "unusable" replicate estimates.
Survey data analysis procedures in SAS/STAT support both
Taylor linearized and replication variance estimation methods (Mukhopadhyay,
An, Tobias and Watts, 2008).