An alternative way of estimating a cumulative logistic model with complex survey data
Section 1. Introduction: Fitting a regression model with complex survey data
The goal of this paper is to show an alternative way of estimating a cumulative logistic model (also called the ordinal logistic model or the ordinal regression model), that is, a regression model with a categorical dependent variable having more than two ordered categories, given complex survey data. The standard estimation methods cannot be implemented with most conventional “design-based” software, such as SAS (SAS Institute Inc., 2015), except when the “parallel line assumption” holds, as we shall see.
The standard “design-based” framework for fitting a regression model to survey data was introduced by Fuller (1975) for linear regression and by Binder (1983) more generally. This framework treats the finite population as a realization of independent trials from a conceptual population. A maximum likelihood regression estimator could, in principle, be estimated from the finite-population values. The goal in the Fuller/Binder framework is to estimate the conceptual maximum-likelihood estimator, or its limit as the population grows arbitrarily large, from survey data. Skinner (1989) refers to this as the “pseudo-maximum-likelihood” approach.
Kott (2018) describes an alternative model-based approach to estimating regression models with complex survey data, dubbed “design-sensitive” robust model-based estimation. Following Kott (2007), the standard model is defined in this approach in this manner:

$$y_k = f(x_k^T \beta) + \varepsilon_k, \qquad (1.1)$$

where $E(\varepsilon_k \mid x_k) = 0$. Although apparently very general, there is a key restriction imposed by the standard model in equation (1.1): $E(\varepsilon_k \mid x_k) = 0$ no matter the value of $x_k$. This assumption can fail, in which case the standard model is not appropriate in the population being analyzed.

In the extended model, $E(\varepsilon_k \mid x_k) = 0$ in equation (1.1) is replaced by $E(x_k \varepsilon_k) = 0$. Unlike the standard model, the more general and more robust extended model rarely fails.

With an independent identically distributed (iid) population $U$ of $N$ elements, it is easy to see that $\operatorname{plim}_{N \to \infty} N^{-1} \sum_{k \in U} x_k \varepsilon_k = 0$ under the extended model. Given a complex sample $S$ with weights $\{w_k\}$, each (nearly) equal to the inverse of the corresponding element’s selection probability, $\operatorname{plim}_{N \to \infty} N^{-1} \sum_{k \in S} w_k x_k \varepsilon_k = 0$ under mild conditions on the sampling design. The parenthetical “nearly” needs to be added when the weights include adjustments for unit nonresponse or coverage errors in the frame which the analyst assumes have been accounted for in an asymptotically unbiased manner. Calibration weight adjustments for statistical efficiency are another reason to add “nearly”.
Whether the analyst assumes the standard or the extended model holds in the population, solving for $b$ in the weighted estimating equation (Godambe and Thompson, 1986)

$$\sum_{k \in S} w_k \left[ y_k - f(x_k^T b) \right] x_k = 0$$

provides a consistent estimator $b$ for $\beta$ under mild conditions.
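As an illustration only, the following sketch solves a weighted estimating equation of the form $\sum_{k \in S} w_k [y_k - f(x_k^T b)] x_k = 0$ for a logistic mean function $f$ by Newton-Raphson. The form of the equation, the toy data, the weights, and all function names are assumptions made for this example; they are not taken from the paper or from any survey package.

```python
import math

def logistic(t):
    """Logistic mean function f(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))

def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= m * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def score(b, X, y, w):
    """Weighted estimating function: sum_k w_k [y_k - f(x_k'b)] x_k."""
    g = [0.0] * len(b)
    for xk, yk, wk in zip(X, y, w):
        resid = yk - logistic(sum(xj * bj for xj, bj in zip(xk, b)))
        for j, xj in enumerate(xk):
            g[j] += wk * resid * xj
    return g

def fit_weighted_logistic(X, y, w, max_iter=50, tol=1e-10):
    """Newton-Raphson solution of score(b) = 0, starting from b = 0."""
    p = len(X[0])
    b = [0.0] * p
    for _ in range(max_iter):
        # Negative Jacobian of the score: sum_k w_k f (1 - f) x_k x_k'
        H = [[0.0] * p for _ in range(p)]
        for xk, wk in zip(X, w):
            f = logistic(sum(xj * bj for xj, bj in zip(xk, b)))
            for i in range(p):
                for j in range(p):
                    H[i][j] += wk * f * (1.0 - f) * xk[i] * xk[j]
        step = solve_linear(H, score(b, X, y, w))
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    return b

# Hypothetical toy sample: an intercept and one binary covariate.
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
y = [0, 1, 0, 1, 1, 0]
w = [10.0, 20.0, 10.0, 5.0, 15.0, 10.0]

b_hat = fit_weighted_logistic(X, y, w)
```

With an intercept and a single binary covariate the model is saturated, so the solution can be checked by hand: the fitted probabilities equal the weighted proportions of $y = 1$ in each covariate group.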
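The pseudo-maximum-likelihood estimating equation in Binder is

$$\sum_{k \in S} w_k \, \frac{f'(x_k^T b)}{v_k} \left[ y_k - f(x_k^T b) \right] x_k = 0,$$

where $f'(\theta) = df(\theta)/d\theta$ and $v_k = E(\varepsilon_k^2 \mid x_k)$. For logistic, Poisson, and ordinary least squares (OLS) linear regression, $f'(x_k^T \beta)/v_k$ is the same constant for every $k$ (it equals 1 for logistic and Poisson regression), so this equation has the same solution as the weighted estimating equation. This need not be the case for general least squares (GLS) linear regression, however, even when the elements are uncorrelated. It also need not be the case for a cumulative logistic regression model.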
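The statement about logistic, Poisson, and OLS regression can be checked directly. The short derivation below assumes that the claim concerns the ratio of the derivative of the mean function to the conditional variance. The notation is an assumption of this sketch: $\theta_k = x_k^T \beta$ is the linear predictor, $f$ the mean function, and $v_k$ the conditional variance of $y_k$ given $x_k$.

```latex
\begin{align*}
\text{Logistic (binary } y_k\text{):}\quad
  & f(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}, \quad
    f'(\theta_k) = f(\theta_k)\,[1 - f(\theta_k)] = v_k
    \;\Rightarrow\; \frac{f'(\theta_k)}{v_k} = 1, \\
\text{Poisson:}\quad
  & f(\theta) = e^{\theta}, \quad
    f'(\theta_k) = e^{\theta_k} = E(y_k \mid x_k) = v_k
    \;\Rightarrow\; \frac{f'(\theta_k)}{v_k} = 1, \\
\text{OLS:}\quad
  & f(\theta) = \theta, \quad
    f'(\theta_k) = 1, \quad v_k = \sigma^2
    \;\Rightarrow\; \frac{f'(\theta_k)}{v_k} = \frac{1}{\sigma^2}
    \text{ (constant in } k\text{)}.
\end{align*}
```

In each case the ratio does not vary with $k$, so it factors out of a weighted sum over the sample. When $v_k$ varies with $k$ while $f'$ does not, as in heteroscedastic linear regression, the ratio is no longer constant.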
The cumulative logistic model is a multinomial logistic regression model for $L$ categories with a natural ordering (e.g., always, frequently, sometimes, never). Being in the first category is assumed to fit a logistic model. Being in either the first or second category is assumed to fit a logistic model. Being in the first, second, or third category is assumed to fit a logistic model, and so forth.
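The general cumulative logistic model is (splitting out the intercept from the rest of the covariates)

$$E(y_{\ell k} \mid x_k) = \frac{\exp(\alpha_\ell + x_k^T \beta_\ell)}{1 + \exp(\alpha_\ell + x_k^T \beta_\ell)}$$

for $\ell = 1, \dots, L - 1$, where $y_{\ell k} = 1$ when $k$ is in one of the first $\ell$ categories, 0 otherwise. The parallel-lines assumption is that $\beta_\ell = \beta$ for all integer values of $\ell$ less than $L$, with each such value having its own intercept $\alpha_\ell$.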
The cumulative logistic model under the parallel-lines assumption is often called a proportional-odds model. We will call it the “simple cumulative logistic model,” although it is more commonly referred to as the cumulative logistic model (or the ordinal logistic model).
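To make the cumulative coding concrete, the sketch below builds the $L - 1$ cumulative indicators for one respondent and computes category probabilities under the parallel-lines assumption as differences of successive cumulative logistic probabilities. The function names, intercepts, and slope are hypothetical values chosen for illustration; categories are assumed to be coded $1, \dots, L$ in their natural order.

```python
import math

def logistic(t):
    """Logistic function exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))

def cumulative_indicators(category, n_categories):
    """Return [y_1, ..., y_{L-1}], where y_l = 1 if category <= l, else 0."""
    return [1 if category <= level else 0 for level in range(1, n_categories)]

def category_probabilities(alphas, beta, x):
    """Category probabilities under the parallel-lines assumption.

    alphas: L - 1 increasing intercepts, one per cumulative split.
    beta:   common slope vector shared by every split.
    """
    xb = sum(bj * xj for bj, xj in zip(beta, x))
    # Cumulative probabilities P(category <= l); the last one is 1 by definition.
    cum = [logistic(a + xb) for a in alphas] + [1.0]
    return [cum[0]] + [cum[l] - cum[l - 1] for l in range(1, len(cum))]

# Hypothetical example with L = 4 ordered categories
# (1 = always, 2 = frequently, 3 = sometimes, 4 = never).
indicators = cumulative_indicators(category=2, n_categories=4)
probs = category_probabilities(alphas=[-1.0, 0.0, 1.5], beta=[0.8], x=[0.5])
```

Because the slopes are shared, the intercepts must be increasing for every category probability to be positive. Without the parallel-lines assumption, the fitted cumulative curves can cross for some covariate values.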
Finding the $a_\ell$ and $b_\ell$ that satisfy the estimating equation

$$\sum_{k \in S} w_k \left[ y_{\ell k} - f(a_\ell + x_k^T b_\ell) \right] \begin{pmatrix} 1 \\ x_k \end{pmatrix} = 0 \qquad (1.4)$$

for $\ell = 1, \dots, L - 1$, where $f(\theta) = \exp(\theta)/[1 + \exp(\theta)]$, can be used for estimating the general cumulative logistic model. This is not the pseudo-maximum-likelihood estimating equation used for the simple cumulative logistic model in the surveylogistic routine in SAS/STAT 14.1 (An (2002, page 7) discusses the multivariate pseudo-maximum-likelihood estimating equation fit by this procedure), the logistic routine in SUDAAN 11 (Research Triangle Institute, 2012), or the gologit2 routine in STATA (Williams, 2005). Only the STATA routine allows the $\beta_\ell$ to vary.
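Given nominal categories and complex survey data, SAS and SUDAAN can fit the general multinomial logistic model,

$$E(y_{\ell k} \mid x_k) = \frac{\exp(\alpha_\ell + x_k^T \beta_\ell)}{1 + \sum_{j=1}^{L-1} \exp(\alpha_j + x_k^T \beta_j)}$$

for $\ell = 1, \dots, L - 1$, with $y_{\ell k} = 1$ when $k$ is in the $\ell$th category, 0 otherwise; this is not the same thing as the general cumulative logistic model, which these programs cannot estimate with complex survey data.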
In what follows, we introduce a modest example of a simple cumulative logistic model. Given complex survey data, we fit the model both with the pseudo-maximum-likelihood technique and with equation (1.4). The latter is accomplished by creating a data set with $L - 1$ observations for each respondent $k$ (note that $y_{1k}, \dots, y_{L-1,k}$ are in the same primary sampling unit). We follow Kott (2018) and call this fitting method the “design-sensitive” technique, even though, strictly speaking, it is model based. Moreover, the pseudo-maximum-likelihood approach is also sensitive to the design weights and other aspects of the sampling design.
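The data-set construction described above can be sketched as follows: each respondent is expanded into $L - 1$ rows, one per cumulative split, each carrying the respondent’s weight and primary sampling unit. The record layout, field names, and toy values are hypothetical, and the sketch assumes that one intercept dummy per split is how the separate intercepts would be supplied to a binary logistic routine.

```python
def stack_respondents(respondents, n_categories):
    """Expand each respondent into L - 1 binary observations.

    Row l for respondent k holds y_lk = 1 if the respondent's category is
    among the first l categories, else 0. The weight and PSU are copied
    unchanged, so all rows of a respondent stay in the same PSU.
    """
    rows = []
    for r in respondents:
        for level in range(1, n_categories):
            row = {
                "id": r["id"],
                "psu": r["psu"],
                "weight": r["weight"],
                "level": level,
                "y": 1 if r["category"] <= level else 0,
                "x": list(r["x"]),
            }
            # One intercept dummy per split, so each split gets its own intercept.
            for l in range(1, n_categories):
                row[f"int_{l}"] = 1 if l == level else 0
            rows.append(row)
    return rows

# Hypothetical respondents with L = 4 ordered categories coded 1..4.
respondents = [
    {"id": 1, "psu": "A", "weight": 120.0, "category": 1, "x": [0.0]},
    {"id": 2, "psu": "A", "weight": 80.0, "category": 3, "x": [1.0]},
    {"id": 3, "psu": "B", "weight": 150.0, "category": 4, "x": [1.0]},
]

stacked = stack_respondents(respondents, n_categories=4)
```

A weighted binary logistic fit on the stacked rows, with the intercept dummies and common covariates as regressors and the primary sampling unit as the cluster identifier, would then treat the rows of one respondent as clustered rather than independent.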
The article goes on to test the parallel-lines assumption. A simple example is presented in Section 2. Section 3 concludes with a discussion.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Her Majesty the Queen in Right of Canada as represented by the Minister of Industry, 2019
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa