Linear Models
Generalized linear models (Littell, chapter 10)

This site supports tutorial instruction on linear models, based on Littell et al. (2002), SAS for Linear Models.  The materials in this site are appropriate for someone who has a reasonable command of basic linear regression.  A basic regression course usually assumes that we are modeling effects based on independent and identically distributed observations (e.g. a single cross section with simple random sampling).  The materials in this site extend the linear model to the analysis of data arising from more complex design and sampling scenarios (e.g. experimental and quasi-experimental designs, and cases with nested or clustered samples).  The site is provided by Robert Hanneman, in the Department of Sociology at the University of California, Riverside.  Your comments and suggestions are welcome, as is your use of any materials you find here.

10.1  Introduction

All examples so far have been cases where the outcome can be assumed to be drawn from a normal distribution.  Generalized linear models retain all the power of the models on the "right hand side," but extend the analysis to predict the mean of variables that cannot reasonably be assumed to be normally distributed.  Such models are also sometimes discussed as "probability models," or under the names of specific types (e.g. logistic regression, Poisson regression, models for count data, and so on).

This chapter deals with some of the most common and important models of this type.  A separate volume on PROC GENMOD is useful further reading after mastering the basics in this (long) chapter.

One such generalization deals with outcomes that have two possible values -- Bernoulli outcomes.  These are often measured as a 1/0 outcome for each trial, or as the number of "successes" out of a number of trials at the various levels of X.  The binomial distribution is a better assumption for such cases.

Another major generalization is the analysis of variables that are measured as counts.  Such variables cannot have negative values, and hence are not normal.  Poisson and negative binomial distributions are better than normal for such cases.

Analysis of survival time, or time until an event, commonly assumes an underlying exponential or gamma distribution (see also other techniques for analyzing survival data).

Sometimes we are interested in analyzing the variance of an outcome under different X, rather than the mean outcome under different X.  For such models, a transformation of the chi-square distribution, with Y defined as the variance of the outcome, is commonly used.

Why should one use generalized linear models, rather than simply transforming a dependent variable to try to accomplish the same analysis?  Transformations often stabilize the variance but do not remove skewness, and they result in models that are expressed in very strange terms (effects on the transformed scale rather than on the outcome itself).

Nelder and Wedderburn (1972) integrated the previously disparate and separate approaches to models for non-normal cases in a framework called "generalized linear models."  The key element of their approach is to describe any given model in terms of its link function and its variance function.

The variance function describes the relationship between the mean and the variance of the dependent variable.  This allows the proper calculation of the variance (and everything that depends on it) under non-normal conditions.  The link function describes the (usually) non-linear relationship between the mean of the dependent variable and the linear right hand side.
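A few standard pairings illustrate the idea (each line gives the distribution, its variance function, and its canonical link):

normal:    Var(Y) = constant;      identity link,  mu = Xb
binomial:  Var(Y) = mu(1 - mu)/n;  logit link,     log(mu/(1 - mu)) = Xb
Poisson:   Var(Y) = mu;            log link,       log(mu) = Xb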

Most non-normal dependent variables have a "characteristic" or "canonical" link function.  In logistic regression, for example, we assume that the dependent variable is generated by a binomial process, and that the log odds (logit) of the mean of Y is linearly associated with X.

Separating the assumptions about the distribution of Y from the functional form of the relationship between the mean of Y and X creates a general classification that allows a wide range of possible models with different assumptions.

PROC GENMOD covers most such applications.  More specific details for particular models are available from some other PROCs.
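As a sketch of the general syntax (the data set and variable names here are hypothetical), a GENMOD call specifies the distribution and link as options on the MODEL statement:

```sas
proc genmod data=mydata;
  class group;                                        * classification factor;
  model y = group x / dist=poisson link=log type1 type3;  * distribution, link, tests;
run;
```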

GENMOD can handle many repeated-measures and split-plot designs, but it does not fit random-effects models.  PROC NLMIXED fits some such models, and PROC GLIMMIX also applies, but they are not discussed here.

10.2  The logistic and probit regression models

Logistic regression is helpful for data given in the form of numbers of successes and trials at each of a number of levels of X.  What is modeled is pi (the probability of success, or the mean of a 0-1 variable).  The distribution of Y is assumed to be binomial.

10.2.1  Logistic Regression: The Challenger Shuttle O-ring data example

Twenty-three launches were observed, which happened to occur at 16 different ambient temperatures.  Whether thermal distress on the O-ring was observed is recorded for each launch.  Pi, the probability of failure, is modeled as a function of a direct (continuous) variable, temperature.  The link function is assumed to be the log odds, or logit.  GENMOD code is given.
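A minimal sketch of the kind of code described, assuming the data set holds the number of distressed O-rings (td) and the number at risk (total) at each temperature -- these variable names are hypothetical:

```sas
proc genmod data=challenger;
  model td/total = temp / dist=binomial link=logit;  * events/trials syntax;
run;
```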

Goodness-of-fit criteria are given (deviance, scaled deviance, Pearson chi-square, scaled Pearson chi-square).  All of these summarize the size of the residuals.  Large values indicate lack of fit, and can be tested against a chi-square distribution with the df given.

The log-likelihood is also given.  Another approach to measuring goodness of fit is to run the model with no IVs (model y = ;), then run the full model, and calculate the percentage change in -2(log likelihood) -- this is the basis of various pseudo-R2 statistics (note: SPSS reports these statistics).
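The comparison can be sketched as two runs (variable names hypothetical; exact pseudo-R2 formulas vary by author):

```sas
* null model -- no IVs;
proc genmod data=challenger;
  model td/total = / dist=binomial link=logit;
run;

* full model;
proc genmod data=challenger;
  model td/total = temp / dist=binomial link=logit;
run;

* McFadden-style pseudo-R2 = 1 - (-2LL_full)/(-2LL_null),
  computed by hand from the two printed log-likelihoods;
```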

Parameters are tested against no effect with the Wald chi-square statistic.

Parameters are interpreted as effects on the predicted logit or log-odds.  This is often not very helpful.

10.2.2.  Using the Inverse link to get the predicted probability

With a little programming to compute the estimated logits and convert them to predicted probabilities, one can obtain the actual predicted probability and its confidence interval at desired levels of X.  These can be graphed.
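One way to sketch the inverse-link step, assuming the estimated logit (eta) and its confidence bounds (eta_l, eta_u) have already been captured in a data set -- all names here are hypothetical:

```sas
data prob_hat;
  set logit_hat;
  p_hat   = exp(eta)   / (1 + exp(eta));    * inverse logit;
  p_lower = exp(eta_l) / (1 + exp(eta_l));  * CI bounds transformed the same way;
  p_upper = exp(eta_u) / (1 + exp(eta_u));
run;
```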

10.2.3  Alternative logistic regression using 0-1 data

The same problem can be approached by entering one data line for each observation (rather than one line summarizing successes and trials at each level of X).  The goodness-of-fit statistics and df are affected by this, but the parameter estimates are not.

10.2.4  An Alternative Link:  Probit Regression

If one believes that the success/failure at any X is actually the discretely observed outcome of a continuous underlying normal distribution, then the cumulative normal distribution function (the "probit" or "normit") is used as the link.  All the steps -- running the model, generating estimated probits (which are actually z-scores), and computing estimated probabilities and confidence intervals -- are modifications of the logit example.
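Under the probit link the same steps apply, with the inverse link being the standard normal CDF (the SAS PROBNORM function); a sketch with hypothetical names:

```sas
proc genmod data=challenger;
  model td/total = temp / dist=binomial link=probit;
run;

data prob_hat;
  set probit_hat;
  p_hat = probnorm(eta);   * inverse probit: standard normal CDF of the estimated probit;
run;
```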

10.3  Binomial models for analysis of variance and covariance

Exactly the same approach applies when the right-hand side contains classification variables or a mix of classification and direct variables.  Both logit and probit links are commonly applied.

10.3.1.  Logistic ANOVA

An example is given of favorable and unfavorable outcomes of a drug versus control, across the blocking factor of eight trials or clinics.  Again the data can be entered either for each combination of Xs, or as individual observations.

If we treat both factors as fixed, we just list them in the CLASS statement and proceed, specifying the dist= and link= options.
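A sketch of the fixed-effects specification for this example (data set and variable names hypothetical):

```sas
proc genmod data=clinics;
  class clinic trt;                                       * block and treatment as fixed factors;
  model fav/total = clinic trt / dist=binomial link=logit;
  lsmeans trt;                                            * predicted logits by treatment;
run;
```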

Useful:  you can compute the LSMEANS for the treatment variable (or for the blocking variable, though we usually don't care).  These are the predicted logits for the levels of the treatment, controlling for the blocking factor.  A little programming is shown that produces the predicted probabilities and confidence intervals -- which are not produced by default.

As with ANOVA, contrasts can be used to test specific differences.

Both the main-effects-only model and the two-way interaction model are shown for the example.  The interaction model asks:  is the difference between the control and treatment groups the same across the clinics?  In this case, it is the saturated model.

10.3.2.  ANOVA with a Probit link

The analysis is repeated specifying the probit link, along with estimated values and contrasts.  Code is given to generate predicted values, and the "delta rule" (delta method) is used to create code that estimates the confidence interval at each level.

10.3.3  Logistic Analysis of Covariance

An example shows a drug against a standard (a classification variable), with levels of dosage (a direct variable), affecting the number living out of 20 trials for each combination of Xs.  Again, the data could also be entered case-wise rather than cell-wise.  The model examined allows for differences in slopes.

Code is shown to test the equal-slopes assumption and to estimate the separate slopes.
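A sketch of the unequal-slopes model; the trt*dose term carries the slope differences, so testing it (e.g. with the Type 1/Type 3 statistics) tests the equal-slopes assumption.  All names are hypothetical:

```sas
proc genmod data=drugs;
  class trt;
  model alive/n = trt dose trt*dose / dist=binomial link=logit type1 type3;
run;
```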

LSMEANS can be used to get the adjusted predicted logits for each treatment.  Contrasts can be applied.

10.4  Count data and overdispersion

Count outcomes were, historically, viewed as the outcomes of Poisson processes.  Naturally occurring count data, however, often displays overdispersion due to correlated errors in time or space, or other forms of non-independence of the observations.  For such cases, the negative binomial distribution is a widely used alternative.

One approach is to use Poisson, but adjust the standard errors.  Alternatively, use the negative binomial.

10.4.1  An Insect Count Example

There are 10 levels in the experiment, a control group and a 3 by 3 factorial.  Each of these was implemented at each of four locations -- a randomized complete block design.  The dependent variable is the count of the number of insects.

A dummy variable is created to indicate membership in the control group; the two crossed treatment factors are each coded 1, 2, 3 and entered as CLASS variables.  The control-group dummy and the interaction of the two treatment factors are included as effects.  Type 1 and Type 3 analyses are done (these are the sequential and simultaneous tests, respectively).

10.4.2 Model Checking

The output gives the deviance and Pearson chi-squares, their df, and the value/df ratios.  In this case, these residual statistics are large (significant with 27 df), which suggests that the model is not a good fit to the data.  There are several possible causes, including the wrong distribution and the wrong link.

Plot the standardized residuals against the predicted means:  heteroscedasticity suggests the variance function is not adequate (try the negative binomial instead).  Patterned residuals may suggest a non-log link, the need for an interaction, or a non-linear transformation of some X.

Plot predicted Y against the estimated link function -- analogous to the standard normal plot used to check normality of residuals in OLS regression.

10.4.3  Correction for Overdispersion

Failure to correct for overdispersion results in standard error estimates that are too small, and too much confidence in the results.

By adding the dscale option along with dist=poisson on the MODEL statement, a dispersion parameter is estimated and corrected estimates are generated.  The scaled deviance is the deviance divided by the dispersion parameter.
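A sketch of the corrected model for the insect-count example (factor and data set names hypothetical):

```sas
proc genmod data=counts;
  class block ctl a b;
  model count = block ctl a*b / dist=poisson link=log dscale type1 type3;  * dscale estimates the dispersion parameter;
run;
```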

Code is shown to get the estimates for particular cells, and to do contrasts.  Code is also given to generate the estimated actual count and confidence intervals at particular values of X.

This corrects standard errors, but may be too simple.

10.4.4  Fitting Negative Binomial

Alternatively, one can use the negative binomial distribution.  The canonical link is the log.

10.4.5  PROC GENMOD to fit negative binomial with log link

All as before, except the MODEL statement:

model .... / dist=negbin type1 type3;

A dispersion parameter is estimated, and tested for significance.

As before, contrasts and estimates can be applied.

10.4.7  Advanced application - user-supplied programs

10.5  Generalized linear models with repeated measures -- Generalized Estimating Equations (GEE)

PROC GENMOD uses generalized estimating equations (Liang and Zeger, 1986) to extend the generalized linear model to the case of repeated measures (as in panel models with non-normally distributed outcomes).

10.5.1  A Poisson Repeated-Measures Example

ID identifies a given patient.  There are two levels of the treatment:  trt=0 (placebo) and trt=1 (drug).  The outcome is the number of seizures in a two-week period.  A baseline measurement is taken, then observations are made for four 2-week intervals.  The log of the baseline measure and the log of age are included as covariates.  Time period (that is, trend) and the time-by-treatment interaction (that is, different trends for the placebo and treatment groups) are included in the model.

10.5.2  Using GENMOD to do a GEE analysis of repeated measures.

Code is shown to translate the compact four-measures-per-line format into the necessary one-line-per-measure format, so that we have one line of data per observation.
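A sketch of that restructuring step, assuming the compact file carries the four counts in variables y1-y4 (all names hypothetical):

```sas
data long;
  set seizure;
  array s{4} y1-y4;          * the four repeated counts on one line;
  do time = 1 to 4;
    y = s{time};             * one output line per measurement;
    output;
  end;
  keep id trt log_base log_age time y;
run;
```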

model y = trt  time  trt*time  log_base  trt*log_base  log_age / dist=poisson link=log type1 type3;
repeated subject=id / type=exch corrw;

The REPEATED statement implements GEE.  GEE models the correlated errors among the repeated measures within each ID through a "working correlation" matrix.

The error-correlation structures available include:
IND (independent)
EXCH (exchangeable, or compound symmetry)
AR(1) (first-order autoregressive)
UN (unstructured)

The corrw option prints the estimated working correlation matrix.

GEE can be applied to other variance functions and link functions.

Data sets

Outputs 10.1 through 10.4  data challenger.sas

Outputs 10.5 through 10.6 data o_ring.sas

Outputs 10.7 through 10.8 data challenger2.sas

Outputs 10.9 through 10.13 data a.sas

Output 10.14 data prob_hat.sas

Outputs 10.15 through 10.16 data s10_15.sas

Outputs 10.17 through 10.23 data fr_t7_3.sas

Outputs 10.24 through 10.38 data counts.sas

Outputs 10.39 through 10.42 data leppik.sas