An alternative way of estimating a cumulative logistic model with complex survey data
Section 1. Introduction: Fitting a regression model with complex survey data

The goal of this paper is to show an alternative way of estimating a cumulative logistic model (also called the ordinal logistic model or the ordinal regression model) given complex survey data. A cumulative logistic model is a regression model whose categorical dependent variable has more than two ordered categories. As we shall see, the standard estimation methods cannot be implemented with most conventional “design-based” software, such as SAS (SAS Institute Inc., 2015), except when the “parallel line assumption” holds.

The standard “design-based” framework for fitting a regression model to survey data was introduced by Fuller (1975) for linear regression and by Binder (1983) more generally. This framework treats the finite population as a realization of independent trials from a conceptual population. A maximum-likelihood regression estimator could, in principle, be computed from the finite-population values. The goal in the Fuller/Binder framework is to estimate this conceptual maximum-likelihood estimator, or its limit as the population grows arbitrarily large, from survey data. Skinner (1989) refers to this as the “pseudo-maximum-likelihood” approach.

Kott (2018) describes an alternative model-based approach to estimating regression models with complex survey data, dubbed “design sensitive” robust model-based estimation. Following Kott (2007), the standard model in this approach is defined as

$y_{k} = f(x_{k}^{T} β) + ε_{k},$ where $E(ε_{k} | x_{k}) = 0.$ (1.1)

Although apparently very general, the standard model in equation (1.1) imposes a key restriction: $E(ε_{k}) = 0$ no matter the value of $x_{k}.$ This assumption can fail, in which case the standard model is not appropriate for the population being analyzed.

In the extended model, $E(ε_{k} | x_{k}) = 0$ in equation (1.1) is replaced by $E(x_{k} ε_{k}) = 0.$ Unlike the standard model, the more general, and therefore more robust, extended model rarely fails.

With an independent and identically distributed (iid) population $U$ of $N$ elements, it is easy to see that

$\operatorname{plim} \left\{ N^{-1} \sum_{U} [y_{k} - f(x_{k}^{T} β)] x_{k} \right\} = 0$

under the extended model. Given a complex sample $S$ with weights $\{w_{k}\},$ each (nearly) equal to the inverse of the corresponding element’s selection probability,

$\operatorname{plim} \left\{ N^{-1} \sum_{S} w_{k} [y_{k} - f(x_{k}^{T} β)] x_{k} \right\} = 0$ (1.2)

under mild conditions on the sampling design. The parenthetical “nearly” is needed when the weights include adjustments for unit nonresponse or for coverage errors in the frame, provided the analyst assumes those adjustments account for the errors in an asymptotically unbiased manner. Calibration weight adjustments for statistical efficiency are another reason to add “nearly”.

Whether the analyst assumes the standard or the extended model holds in the population, solving for $b$ in the weighted estimating equation (Godambe and Thompson, 1986)

$\sum_{S} w_{k} [y_{k} - f(x_{k}^{T} b)] x_{k} = 0$ (1.3)

provides a consistent estimator for $β$ under mild conditions.
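As an illustration of how a weighted estimating equation of this form can be solved numerically, the following is a minimal sketch that assumes $f$ is the logistic function and uses Newton-Raphson iterations. The function name `solve_weighted_logistic`, the toy data, and the weights are hypothetical and are not taken from the paper or from any survey package.

```python
import numpy as np

def solve_weighted_logistic(X, y, w, tol=1e-10, max_iter=50):
    """Solve sum_k w_k [y_k - f(x_k^T b)] x_k = 0 for b, with f logistic.

    Hypothetical helper: plain Newton-Raphson, no variance estimation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))        # f(x_k^T b)
        score = X.T @ (w * (y - p))             # left-hand side of the equation
        # Negative Jacobian of the score: X^T diag(w_k p_k (1 - p_k)) X
        info = X.T @ (X * (w * p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Toy sample (made up): an intercept column plus one covariate.
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
w = np.array([10, 20, 15, 5, 30, 20], dtype=float)   # survey weights w_k

b_hat = solve_weighted_logistic(X, y, w)
# The weighted estimating function evaluated at the solution should be ~0.
residual = X.T @ (w * (y - 1.0 / (1.0 + np.exp(-X @ b_hat))))
```

The sketch returns only the point estimate; design-based variance estimation, which depends on the strata and primary sampling units, is deliberately left out.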

The pseudo-maximum-likelihood estimating equation in Binder is

$\sum_{S} w_{k} \frac{f^{'} (x_{k}^{T} b)}{v_{k}} [y_{k} - f (x_{k}^{T} b)] x_{k} = 0,$

where $v_{k} = E(ε_{k}^{2} | x_{k}).$ For logistic, Poisson, and ordinary least squares (OLS) linear regression, $f^{'}(x_{k}^{T} β) / v_{k} = 1.$ This equality may not hold for generalized least squares (GLS) linear regression, however, even when the elements are uncorrelated. It also need not hold for a cumulative logistic regression model.
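
As a brief check of the stated equality in two of these cases, the following sketch writes $η_{k} = x_{k}^{T} β$ (a shorthand introduced only here) and assumes that the standard model holds, with a binary $y_{k}$ in the logistic case and a variance equal to the mean in the Poisson case:

$f(η_{k}) = \frac{\exp(η_{k})}{1 + \exp(η_{k})}, \quad f^{'}(η_{k}) = f(η_{k}) [1 - f(η_{k})] = v_{k}$ (logistic),

$f(η_{k}) = \exp(η_{k}), \quad f^{'}(η_{k}) = \exp(η_{k}) = v_{k}$ (Poisson).

In both cases the factor $f^{'}(x_{k}^{T} b) / v_{k}$ equals 1, so the pseudo-maximum-likelihood estimating equation coincides with equation (1.3).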

The cumulative logistic model is a multinomial logistic regression model for $L$ categories with a natural ordering (e.g., always, frequently, sometimes, never). Being in the first category is assumed to fit a logistic model. Being in either the first or second category is assumed to fit a logistic model. Being in the first, second, or third category is assumed to fit a logistic model, and so forth.

The general cumulative logistic model is (splitting out the intercept from the rest of the covariates)

$E (y_{l k} | x_{k}) = \frac{\exp (α_{l} + x_{k}^{T} β_{l})}{1 + \exp (α_{l} + x_{k}^{T} β_{l})}$ for $l = 1, \dots, L - 1,$

where $y_{l k} = 1$ when $k$ is in one of the first $l$ categories, 0 otherwise. The parallel-lines assumption is that $β_{l} = β$ for all integer values of $l$ less than $L,$ with each such value having its own intercept $(α_{l}).$ The cumulative logistic model under the parallel-lines assumption is often called a proportional-odds model. We will call it the “simple cumulative logistic model,” although it is more commonly referred to as the cumulative logistic model (or the ordinal logistic model).
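
The cumulative indicators $y_{l k}$ defined above can be built directly from an ordered category code. The following is a small sketch that assumes the categories are coded 1 through $L$; the function name `cumulative_indicators` and the example codes are hypothetical.

```python
import numpy as np

def cumulative_indicators(category, L):
    """Return an (n, L-1) array whose column l-1 holds y_{lk}.

    y_{lk} = 1 when respondent k is in one of the first l categories,
    0 otherwise. Assumes categories are coded 1, ..., L (an assumption
    of this sketch, not something the paper specifies).
    """
    category = np.asarray(category)
    levels = np.arange(1, L)                      # l = 1, ..., L-1
    return (category[:, None] <= levels[None, :]).astype(int)

# Five hypothetical respondents on a four-category ordered scale.
category = np.array([1, 3, 2, 4, 1])
Y = cumulative_indicators(category, L=4)
```

Each row of `Y` is nondecreasing from left to right, since being in one of the first $l$ categories implies being in one of the first $l + 1.$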

Finding the $a_{l}$ and $b_{l}$ that satisfy the estimating equation:

$\sum_{k \in S} w_{k} \left[ y_{l k} - \frac{\exp(a_{l} + x_{k}^{T} b_{l})}{1 + \exp(a_{l} + x_{k}^{T} b_{l})} \right] \begin{bmatrix} 1 \\ x_{k} \end{bmatrix} = 0$ for $l = 1, \dots, L - 1$ (1.4)

can be used to estimate the general cumulative logistic model. Equation (1.4) is not the pseudo-maximum-likelihood estimating equation that the following routines solve for the simple cumulative logistic model: the surveylogistic routine in SAS/STAT 14.1 (An, 2002, page 7, discusses the multivariate pseudo-maximum-likelihood estimating equation fit by this procedure), the logistic routine in SUDAAN 11 (Research Triangle Institute, 2012), and the gologit2 routine in STATA (Williams, 2005). Only the STATA routine allows the $b_{l}$ to vary.

Given $L$ nominal categories and complex survey data, SAS and SUDAAN can fit the general multinomial logistic model,

$E (y_{l k} | x_{k}) = \frac{\exp (α_{l} + x_{k}^{T} β_{l})}{1 + \sum_{j = 1}^{L - 1} \exp (α_{j} + x_{k}^{T} β_{j})}$ for $l = 1, \dots, L - 1,$

with $y_{l k} = 1$ when $k$ is in the $l$th category, 0 otherwise; this is not the same thing as the general cumulative logistic model, which these programs cannot estimate with complex survey data.

In what follows, we introduce a modest example of a simple cumulative logistic model. Given complex survey data, we fit the model both with the pseudo-maximum-likelihood technique and with equation (1.4). The latter is accomplished by creating a data set with $L - 1$ observations for each respondent $k$ (note that $y_{1 k}, \dots, y_{L - 1, k}$ are in the same primary sampling unit). We follow Kott (2018) and call this fitting method the “design-sensitive” technique, even though, strictly speaking, it is model based. Moreover, the pseudo-maximum-likelihood approach is also sensitive to the design weights and other aspects of the sampling design.
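
The stacked data set described above can be sketched as follows. This is one plausible reading, not the paper's own code: it assumes the parallel-lines restriction $b_{l} = b,$ so that each respondent contributes $L - 1$ rows that share the covariates, weight, and primary sampling unit (PSU) label and differ only in the indicator $y_{l k}$ and in a dummy variable selecting the intercept $a_{l}.$ Solving the weighted logistic estimating equation on the stacked rows then amounts to equation (1.4) with the covariate components summed over $l.$ All function names and data values are hypothetical.

```python
import numpy as np

def stack_cumulative(category, X, w, psu, L):
    """Build L-1 stacked rows per respondent (hypothetical helper)."""
    category = np.asarray(category)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z_parts, y_parts = [], []
    for l in range(1, L):
        dummies = np.zeros((n, L - 1))
        dummies[:, l - 1] = 1.0                    # selects the intercept a_l
        Z_parts.append(np.hstack([dummies, X]))    # covariates shared across l
        y_parts.append((category <= l).astype(float))   # y_{lk}
    Z = np.vstack(Z_parts)
    y = np.concatenate(y_parts)
    # Each respondent's weight and PSU label repeat on all L-1 rows.
    return Z, y, np.tile(np.asarray(w, dtype=float), L - 1), np.tile(psu, L - 1)

def solve_weighted_logistic(Z, y, w, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of sum w [y - f(z^T theta)] z = 0."""
    theta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        score = Z.T @ (w * (y - p))
        info = Z.T @ (Z * (w * p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Made-up sample: 8 respondents, L = 3 ordered categories, one covariate.
category = np.array([3, 2, 3, 1, 2, 1, 2, 1])
X = np.arange(8, dtype=float).reshape(-1, 1)
w = np.array([12, 8, 15, 10, 9, 11, 14, 7], dtype=float)
psu = np.array([1, 1, 2, 2, 3, 3, 4, 4])

Z, y, w_s, psu_s = stack_cumulative(category, X, w, psu, L=3)
theta = solve_weighted_logistic(Z, y, w_s)
a_hat, b_hat = theta[:2], theta[2:]                # intercepts a_l, common slope b
score_at_solution = Z.T @ (w_s * (y - 1.0 / (1.0 + np.exp(-Z @ theta))))
```

Carrying the PSU label onto every stacked row keeps all $L - 1$ observations for a respondent in the same primary sampling unit, which a design-based variance estimator would need; the sketch itself stops at the point estimates.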

The article goes on to test the parallel-lines assumption. A simple example is presented in Section 2. Section 3 concludes with a discussion.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word, to the Editor (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: Semi-annual

Ottawa

Date modified: 2019-07-04
