### The Research Data Centres Information and Technical Bulletin

# Assessing the impact of potentially influential observations in weighted logistic regression

**By Bridget L. Ryan, John Koval, Bradley Corbett, Amardeep Thind, M. Karen Campbell, and Moira Stewart.**

- Introduction
- Data and methods
- Results
- Discussion
- References
- Appendix 1 – Algorithm for assessing potentially influential observations in weighted logistic regression in SAS 9.1
- Appendix 2 – Assessing potentially influential observations in weighted logistic regression in SAS 9.1
- Appendix 3 – Assessing potentially influential observations in weighted logistic regression in SAS – SAS output

## Introduction

Influential observations in logistic regression are those that have a notable effect on some aspect of the fit of the linear logistic model, such as the parameter estimates or fit statistics. Collett (2003) and Hosmer and Lemeshow (2000) explain at length how to identify influential observations in classical logistic regression. The use of datasets with large sample sizes (e.g., Statistics Canada survey data) is thought to mitigate concerns about potentially influential observations by minimizing the contribution of any given observation. However, influential observations can still arise in these large samples. For example, an observation with a large weight can have a large impact on the parameter estimates (Macnab et al., 2005). Therefore, it is important to identify potentially influential observations when conducting logistic regression using Statistics Canada data. Few papers address influence diagnostics specifically for complex survey data (Roberts et al., 1987, is one of these); unfortunately, the diagnostics developed in these papers are not available in any of the common statistical packages for analysis of complex survey data. However, Heeringa et al. (2010, p. 245), for example, recommend the following:

“Use one or more of the techniques described in Chapter 5 of Hosmer and Lemeshow (2000) to evaluate the fit of the model for individual patterns of covariates. If the complex sample logistic regression modeling program in your chosen software system (e.g., SAS PROC SURVEYLOGISTIC) does not include the full set of diagnostic capabilities of the standard programs, use standard programs (e.g., SAS PROC LOGISTIC) with a weight specification. As mentioned before, the weighted estimates of parameters and predicted probabilities will be identical, and serious breakdowns in the model for specific covariate patterns should be identifiable even though the standard program does not correctly reflect the variances and covariances of the parameter estimates given the complex sample design”.

This paper seeks to implement this recommendation for diagnostics of coefficient sensitivity by describing a straightforward algorithm and code for examining potentially influential observations with weighted data using SAS software (SAS Institute Inc., 2009).

## Data and methods

### Data source and sample

The algorithm and code described in this paper were applied in a study that examined the factors associated with family physician utilization for adolescents in Canada (Ryan et al., 2011). The study employed a cross-sectional design, conducting a secondary analysis of data for adolescent and young adult respondents to the 2005 (Cycle 3.1) Canadian Community Health Survey (CCHS) (Statistics Canada, questionnaire, 2005; Statistics Canada, User’s Guide, 2005). The sample sizes for the study were 4985 respondents for early adolescents (12 to 14 years old); 8718 for middle adolescents (15 to 19 years old); and 6681 for young adults (20 to 24 years old).

Permission was received from the Statistics Canada Research Data Centre (RDC) to access these data at The University of Western Ontario. Approval from The University of Western Ontario Health Sciences Research Ethics Board was not required because this was a secondary analysis of survey data with no possibility of identification of individual survey respondents.

### Study analysis

The full study analysis, described elsewhere (Ryan et al., 2011) and summarized here, was conducted separately for each of the three age groups: early adolescence, middle adolescence, and young adulthood. Two logistic regressions were conducted for each age group, resulting in a total of six regressions. The analysis used design-based software, employing sampling weights to adjust the sample for the unequal probability of selection and bootstrapping to adjust the confidence intervals for the complex survey design effect. The binary outcome for the first regression was whether or not the adolescent had used family physician services within the last 12 months. Among respondents who had used services, the outcome of the second logistic regression was whether the adolescent was a high user (4 or more visits) or a low user (1 to 3 visits).

The independent variables were chosen according to Andersen’s Behavioral Model of Health Services Use (Andersen, 1995). Wherever possible, the same variables were used for each of the three age groups to facilitate comparison across groups, and non-significant variables were left in the models to facilitate reporting across the age groups. Predisposing variables available and used were: age, sex, school attendance and educational attainment, ethnicity, community belonging, marital status (young adults), and work status (middle adolescents and young adults). Enabling variables used were: household income adequacy, living arrangement (young adults), having a regular medical doctor, and geography (urban or rural). Perceived need variables were: self-perceived general health, self-perceived mental health, opinion of own weight, and stress (available for middle adolescents and young adults only). Evaluated need variables were: BMI category and the number of chronic conditions. Health practice variables used were: physical activity, smoking, sexual activity (available for middle adolescents and young adults only), and alcohol drinking. The CCHS does not provide health care system or external environment variables; however, province was used as a measure of context.

### Identifying potentially influential observations

In the full study, each of the six logistic regression models was evaluated to determine whether any observations in the dataset had an undue influence on the parameter estimates. The identification of potentially influential observations was conducted in SAS Version 9.1 (SAS Institute Inc., 2009). SAS PROC LOGISTIC will fit a logistic model using weights and can produce several of the diagnostic influence statistics described in Hosmer and Lemeshow (2000). These statistics do not take all of the survey design into account (for example, in how variance estimates are made), and the large sample size makes it too unwieldy to plot their values for every data point. They can still be useful, however, in allowing the researcher to identify cases that have potentially undue influence on the parameter estimates when using the weights in Statistics Canada survey data. It should be noted that SAS Version 9.3 provides PROC SURVEYLOGISTIC; however, the required diagnostic statistics are not available in that procedure.

The examination of potentially influential observations focused on two main statistics, the confidence interval displacement diagnostic (C diagnostic) and the DFbeta diagnostic as suggested by Pregibon (1981). These statistics were calculated and output into separate datasets using commands that followed the logistic regression command. Appendix 1 contains the algorithm that was followed for the identification and examination of the potentially influential observations. Appendix 2 contains the SAS code that was used to identify the potentially influential observations. It should be noted that the formulas for calculating the C statistics and the DFbetas each contain variance elements which ideally should be estimated using the bootstrap method. As mentioned above, SAS will estimate the model using bootstrapping; however, it cannot calculate these statistics with bootstrapping.

The ‘confidence interval displacement diagnostic’ provides scalar measures of the influence of individual observations on the logistic regression parameter estimates. A scalar measure gives the magnitude of the influence on the estimates but not its direction. One C statistic is calculated for each observation for the overall logistic regression. The C diagnostic is based on the same principle as the Cook distance in linear regression theory (SAS Institute Inc., 2009). Observations that have a C statistic greater than one are generally considered possible influential observations (Hosmer and Lemeshow, 2000, p. 180). However, given that the variance has been estimated without appropriate bootstrapping to account for the design effect, the suggested cut-off values must be used with caution. It is important to examine any unusually large values as an indication of potential influence (Hosmer and Lemeshow, 2000). Therefore, the code also includes PROC UNIVARIATE, which will print out the five lowest and highest values regardless of absolute size.
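The screening step just described, listing the five lowest and five highest values and then checking them against the cut-off of one, can be illustrated with a minimal Python sketch. This is not the SAS code used in the study; the function names and the C statistics below are hypothetical and serve only to show the logic.

```python
def extreme_values(values, k=5):
    """Return the k lowest and k highest (observation number, value) pairs.

    Observation numbers start at 1. This mimics the idea of an
    extreme-observations listing; it is not PROC UNIVARIATE itself.
    """
    ranked = sorted(enumerate(values, start=1), key=lambda pair: pair[1])
    return ranked[:k], ranked[-k:]


def flag_above(values, cutoff):
    """Return (observation number, value) pairs whose value exceeds the cut-off."""
    return [(obs, v) for obs, v in enumerate(values, start=1) if v > cutoff]


# Hypothetical C statistics for twelve observations (illustration only).
c_stats = [0.02, 0.15, 1.40, 0.01, 0.30, 0.07, 2.10, 0.05, 0.11, 0.09, 0.60, 0.03]

lowest, highest = extreme_values(c_stats)
print("Five lowest: ", lowest)
print("Five highest:", highest)
print("Above cut-off of 1:", flag_above(c_stats, 1.0))
```

Listing the extremes regardless of the cut-off keeps unusually large values visible even when none of them exceeds one, which matters here because the cut-off rests on variances that do not reflect the survey design.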

The formula for the C statistic used by SAS (SAS Institute Inc., 2008) is given below. It is based on the statistic developed by Pregibon (1981) specifically for logistic regression:

${C}_{j}={\chi}_{j}^{2}{h}_{jj}/{\left(1-{h}_{jj}\right)}^{2}$ (1)

where

$\chi_j^2=\frac{w_j\left(r_j-p_j\right)^2}{p_j q_j},$ (2)

and

$h_{jj}=w_j p_j q_j\left(1, x_j'\right)\hat{V}_b\left(\begin{array}{c}1\\ x_j\end{array}\right).$ (3)

Moreover,

$r_j$ is the response (0 or 1),

$w_j$ is the weight of the $j$th observation,

$x_j$ is the vector of explanatory variables for the $j$th observation,

$\pi_j$ is the probability of a response for the $j$th observation, which is given by $\pi_j = F(\beta_0 + \beta' x_j)$, where $F(\cdot)$ is the inverse link function,

$\mathbf{b}$ is the maximum likelihood estimate (MLE) of $(\beta_0, \beta_1, \ldots, \beta_s)'$,

$s$ is the number of variables,

$\hat{V}_b$ is the estimated covariance matrix of $\mathbf{b}$,

$p_j$ is the estimate of $\pi_j$ evaluated at $\mathbf{b}$,

and $q_j = 1 - p_j$.

A limitation of the C statistic is that it is a summary measure of change over all the coefficients in the model. Therefore, it is important to examine the changes in the individual coefficients (Hosmer and Lemeshow, 2000, p. 181). The DFbeta is the standardized difference in the parameter estimate due to deleting each given observation. DFbetas are useful in detecting observations that are causing changes in coefficients (SAS Institute Inc., 2009). The underlying distribution of the DFbetas is unknown, so there is no certain determination of what constitutes ‘large’. The convention, therefore, is to use the value of 2, which coincides approximately with the usual critical value of the normal distribution (1.96). For any given variable, then, observations that have a DFbeta greater than two in absolute value are considered possible influential observations. As with the C statistic, the standard error has been estimated without appropriate bootstrapping, so the suggested cut-off values must again be used with caution. It is important to examine any unusually large values as an indication of potential influence. Therefore, the SAS code also includes PROC UNIVARIATE, which will print out the five lowest and highest values regardless of absolute size.

Possible influential observations were identified using the following formula given by SAS (SAS Institute Inc., 2008) (developed by Pregibon, 1981).

$\mathrm{DFbeta}_{ij}=\frac{\Delta_i b_j^1}{\sigma_i}, \quad i=0,1,\ldots,s,$ (4)

where

$\sigma_i$ is the standard error of the $i$th component of $\mathbf{b}$,

$\Delta_i b_j^1$ is the $i$th component of the one-step difference, and

$\Delta b_j^1=\left(\frac{w_j\left(r_j-p_j\right)}{1-h_{jj}}\right)\hat{V}_b\left(\begin{array}{c}1\\ x_j\end{array}\right).$ (5)

In other words, $\Delta b_j^1$ is an approximation to the change, $b - b_j^1$, in the vector of parameter estimates due to the omission of the $j$th observation.
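To make formulas (1) to (5) concrete, the following Python sketch fits a small weighted logistic regression by Newton-Raphson and then computes $h_{jj}$, $C_j$ and the DFbetas for each observation. It is an illustration under stated assumptions, not the SAS implementation: the data, weights and function names are hypothetical, the link is assumed to be the logit, and $\hat{V}_b$ is taken to be the inverse of the weighted information matrix (a model-based variance that, as noted above, does not reflect the survey design).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(n):
            if i != col:
                factor = m[i][col] / m[col][col]
                m[i] = [u - factor * v for u, v in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def information(rows, w, b):
    """Weighted information matrix: sum over j of w_j p_j q_j (1, x_j)'(1, x_j)."""
    k = len(b)
    info = [[0.0] * k for _ in range(k)]
    for row, wj in zip(rows, w):
        pj = 1.0 / (1.0 + math.exp(-dot(b, row)))
        for i in range(k):
            for l in range(k):
                info[i][l] += wj * pj * (1.0 - pj) * row[i] * row[l]
    return info


def fit_weighted_logistic(x, r, w, iterations=25):
    """Return (b, V_b): weighted MLE and the inverse information matrix at b."""
    rows = [[1.0] + list(xj) for xj in x]
    k = len(rows[0])
    b = [0.0] * k
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-dot(b, row))) for row in rows]
        score = [sum(wj * (rj - pj) * row[i]
                     for row, rj, wj, pj in zip(rows, r, w, p)) for i in range(k)]
        step = solve(information(rows, w, b), score)
        b = [bi + si for bi, si in zip(b, step)]
    info = information(rows, w, b)
    cols = [solve(info, [1.0 if i == l else 0.0 for i in range(k)]) for l in range(k)]
    v = [[cols[l][i] for l in range(k)] for i in range(k)]
    return b, v


def influence_diagnostics(x, r, w, b, v):
    """Per observation: leverage h (3), C statistic (1)-(2), DFbetas (4)-(5)."""
    se = [math.sqrt(v[i][i]) for i in range(len(b))]
    out = []
    for xj, rj, wj in zip(x, r, w):
        row = [1.0] + list(xj)
        pj = 1.0 / (1.0 + math.exp(-dot(b, row)))
        qj = 1.0 - pj
        v_row = [dot(v_i, row) for v_i in v]            # V_b (1, x_j)'
        h = wj * pj * qj * dot(row, v_row)              # formula (3)
        chi2 = wj * (rj - pj) ** 2 / (pj * qj)          # formula (2)
        c = chi2 * h / (1.0 - h) ** 2                   # formula (1)
        delta = [wj * (rj - pj) / (1.0 - h) * t for t in v_row]  # formula (5)
        dfbeta = [d / s for d, s in zip(delta, se)]     # formula (4)
        out.append({"h": h, "c": c, "dfbeta": dfbeta})
    return out


# Hypothetical data: one covariate, eight observations, unequal weights.
x = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
r = [0, 0, 1, 0, 1, 0, 1, 1]
w = [1.0, 0.8, 1.2, 1.0, 0.9, 2.5, 1.1, 0.5]

b, v = fit_weighted_logistic(x, r, w)
for obs, d in enumerate(influence_diagnostics(x, r, w, b, v), start=1):
    worst = max(abs(t) for t in d["dfbeta"])
    print(obs, round(d["h"], 3), round(d["c"], 3), round(worst, 3))
```

One simple check on such a computation is that the $h_{jj}$ values sum to the number of estimated parameters (here two), since they are the diagonal of the weighted hat matrix.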

### Assessing potentially influential observations

After identifying potentially influential observations, the next step was to run the logistic regressions again, excluding all observations identified as potentially influential by either statistic. Parameter estimates were compared between the regression with all cases and the regression without the potentially influential observations. Researchers must decide how large a change in parameter estimates is considered important for any given study (Rothman and Greenland, 1998). Changes in parameter estimates of more than 10% were considered to be substantial for this study. Where parameter estimates change substantially, the observations should be examined carefully to determine whether the influential observations share any common covariate patterns.
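The comparison described above amounts to a small calculation per parameter, sketched here in Python. The variable names and parameter estimates are hypothetical, and the percentage change is assumed to be taken relative to the estimate from the regression with all cases.

```python
def percent_changes(full, reduced):
    """Absolute and percentage change in each estimate, relative to the full model.

    full, reduced: dicts mapping parameter name to estimate.
    """
    changes = {}
    for name, estimate in full.items():
        diff = abs(reduced[name] - estimate)
        if estimate != 0:
            pct = 100.0 * diff / abs(estimate)
        else:
            pct = float("inf") if diff > 0 else 0.0
        changes[name] = (diff, pct)
    return changes


def flag_substantial(changes, threshold=10.0):
    """Names of parameters whose percentage change exceeds the threshold."""
    return [name for name, (_, pct) in changes.items() if pct > threshold]


# Hypothetical estimates (illustration only).
all_cases = {"intercept": 0.50, "sex": -1.20, "rural": 0.08}
without_flagged = {"intercept": 0.52, "sex": -1.21, "rural": 0.10}

changes = percent_changes(all_cases, without_flagged)
for name, (diff, pct) in changes.items():
    print(f"{name}: absolute change {diff:.3f}, percentage change {pct:.1f}%")
print("Changed by more than 10%:", flag_substantial(changes))
```

Small estimates can show large percentage changes from very small absolute shifts, so it is worth reading the absolute and percentage columns together before drawing conclusions.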

Researchers must decide whether these observations are part of the study population or not. If they cannot be deemed outliers, and are in fact part of the study population, they should stay in the model.

## Results

While all six regressions from the full study were examined to identify potentially influential observations, only one is reported here for illustration: the regression for the young adult age group with the outcome of whether or not the respondent had used family physician services. Appendix 3 provides an annotated example of output for the C statistics. Perturbation was used to alter the observation numbers and C statistics to protect confidentiality.

Eleven cases were identified with large C statistics, which suggests possible influential observations and warrants further investigation. The DFbetas were then reviewed, and no cases had large DFbetas for any variable. The lack of cases with large DFbetas suggests that there were no potentially influential observations causing undue instability in the parameter estimates. The eleven potentially influential observations based on the C statistics were removed and the logistic regression was run again. Three non-significant parameters changed by more than 10%; however, none of these changed from non-significant to significant in the regression. Therefore, the decision was made to include all cases in the reported regression model.

## Discussion

This paper provides an algorithm and SAS code that can be easily applied to analyses using complex survey data such as the Canadian Community Health Survey in order to identify possible influential observations in logistic regression models. Caution is advised in using automatic cut-off values for identifying potentially influential observations. Kleinbaum, Kupper, and Muller (1988, p. 201) state that “some observation must be the most extreme in every sample. It would be silly to delete automatically this most extreme observation, or some cluster of extreme observations, based on statistical testing procedures. The goal of regression diagnostics in evaluating outliers is to warn the data analyst to examine more closely such extreme observations. Scientific judgment is more important here than statistical tests, once influential observations have been flagged”. Rather, the researcher must make a decision on handling potentially influential observations, informed by knowledge of the study population and a careful examination of the data as described herein. This decision should be reported in the results section of a manuscript, and the discussion should explain the reasoning and the possible effects of the decision.

## References

Andersen RM. 1995. "Revisiting the behavioral model and access to medical care: does it matter?". J Health Soc Behav. Vol. 36, no. 1, 1-10.

Collett D. 2003. Modelling Binary Data. [2nd ed]. Boca Raton, FL: Chapman & Hall/CRC Press.

Gagné C, Roberts G, Keown LA. 2014. "Weighted estimation and bootstrap variance estimation for analyzing survey data: How to implement in selected software". The Research Data Centres Information and Technical Bulletin. (Winter) Vol. 6, no. 1, 5-70. Statistics Canada Catalogue no. 12-002-X.

Heeringa S, West B, Berglund P. 2010. Applied Survey Data Analysis. Boca Raton, FL: Chapman and Hall/CRC Press.

Hosmer DW, Lemeshow S. 2000. Applied Logistic Regression. [2nd ed]. New York: Wiley.

Kleinbaum DG, Kupper LL, Muller KE. 1988. Applied Regression Analysis and Other Multivariable Methods. [2nd ed]. Belmont, CA: Duxbury Press.

Macnab JJ, Koval JJ, Speechley KN, Campbell MK. 2005. "Influential observations in weighted analyses: examples from the National Longitudinal Survey of Children and Youth (NLSCY)". Chronic Dis Can. Vol. 26, no. 1, 1-8. See http://www.ncbi.nlm.nih.gov/pubmed/16117839. (accessed January 31, 2012).

Pregibon D. 1981. "Logistic Regression Diagnostics". Annals of Statistics. Vol. 9, no. 4, 705-724.

Roberts G, Rao JNK, Kumar S. 1987. "Logistic-Regression Analysis of Sample Survey Data". Biometrika. Vol. 74, no. 1, 1-12. See http://biomet.oxfordjournals.org/content/74/1/1.full.pdf. (accessed January 31, 2012).

Rothman KJ, Greenland S. 1998. Modern Epidemiology. [2nd ed]. Philadelphia, PA: Lippincott Williams & Wilkins.

Ryan BL, Stewart M, Campbell MK, Koval J, Thind A. 2011. "Understanding adolescent and young adult use of family physician services: a cross-sectional analysis of the Canadian Community Health Survey". BMC Family Practice. Vol. 12, no. 118, 1-10.

SAS Institute Inc. 2009. SAS/STAT 9.1 Software. Cary, NC.

SAS Institute Inc. 2008. Regression Diagnostics, SAS 9.1 Online Documentation. (Path - SAS/STAT; SAS/STAT User's Guide; The Logistic Procedure; Details; Regression Diagnostics).

Statistics Canada. 2005. Canadian Community Health Survey (CCHS) Cycle 3.1. (questionnaire). http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SurvId=1630&InstaId=22642&SDDS=3226 [Accessed from: URL: http://www23.statcan.gc.ca/imdb-bmdi/instrument/3226_Q1_V3-eng.pdf]

Statistics Canada. 2005. Canadian Community Health Survey (CCHS) Cycle 3.1. User's Guide. http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SurvId=1630&InstaId=22642&SDDS=3226 [Accessed from: URL: http://www23.statcan.gc.ca/imdb-bmdi/document/3226_D7_T9_V3-eng.pdf]

## Appendix 1 – Algorithm for assessing potentially influential observations in weighted logistic regression in SAS 9.1

- Develop the logistic regression model in design-based software using population weights and bootstrapping.
- Determine the potentially influential observations in SAS 9.1 (Appendix 2).
    - Run the logistic regression with standardized weights (Gagné, Roberts, and Keown, 2014).
    - Output the confidence interval displacement statistics (C statistics) from the logistic regression into a temporary file.
    - Run PROC UNIVARIATE for the C statistic. Examine the extreme values. If all five of the highest values are greater than one (suggesting there may be more than five), then PROC PRINT can be used to print all cases where the C statistic is greater than one.
    - Output the DFbeta statistics from the logistic regression into another temporary file.
    - Run PROC UNIVARIATE for the DFbeta statistics. Examine the extreme values. If all five are greater than two in absolute value (suggesting there may be more than five), then PROC PRINT can be used to print all cases where the absolute value of the DFbeta is greater than 2.
- Remove the influential observations and determine the effect on the model.
    - Create a duplicate dataset and delete the potentially influential observations identified by either the C statistics or the DFbetas.
    - Run the logistic regression again using standardized weights on this duplicate dataset.
    - Create an Excel spreadsheet with a column for the parameter estimates (all cases used) and a second column for the parameter estimates (potentially influential observations deleted). In the third and fourth columns, calculate the absolute difference and percentage difference for each parameter estimate between the two logistic regressions.
    - Determine how large a change in parameter estimates constitutes influence for the particular study; for example, a change in a parameter estimate of 10%.
    - Compare the differences to see whether removal of potentially influential observations affected the parameter estimates. Flag percentage differences greater than 10% as parameter estimates that were substantially changed by influential observations.
    - Examine reasons for differences, such as large weights, possibly miscoded data, or unique covariate patterns.
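The algorithm calls for standardized weights. A minimal Python sketch of one common standardization follows, in which each survey weight is divided by the mean weight so that the weights sum to the sample size. That definition is an assumption made here for illustration; the function name and weights are hypothetical, and the cited bulletin (Gagné, Roberts, and Keown, 2014) should be consulted for the procedure actually recommended.

```python
def standardize_weights(weights):
    """Rescale weights so they sum to the number of observations.

    Assumed definition: standardized weight = weight / mean weight.
    """
    if not weights:
        raise ValueError("weights must not be empty")
    if any(wt <= 0 for wt in weights):
        raise ValueError("weights must be positive")
    mean_weight = sum(weights) / len(weights)
    return [wt / mean_weight for wt in weights]


# Hypothetical survey weights (illustration only).
survey_weights = [10.0, 20.0, 30.0, 40.0]
standardized = standardize_weights(survey_weights)
print(standardized)
print("Sum of standardized weights:", sum(standardized))
```

Rescaling in this way leaves the weighted parameter estimates unchanged, because every weight is multiplied by the same constant, but it keeps the apparent sample size equal to the number of respondents rather than the population total.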

## Appendix 2 – Assessing potentially influential observations in weighted logistic regression in SAS 9.1

\*Code written in bold refers to headings;

\*Code written in italics refers to file and variable names that will vary depending on dataset and variables being used;

## Appendix 3 – Assessing potentially influential observations in weighted logistic regression in SAS – SAS output

The output from the logistic regression will appear first in its usual format followed by “The Univariate Procedure” output and/or the “Proc Print” output as shown below.

**A. Output for the Extreme Confidence Interval Displacement values**

**B. Output for the Proc Print Confidence Interval Displacement values greater than 1**