Linear Models
Understanding linear models concepts (Littell, chapter 6)

This site supports tutorial instruction on linear models, based on Littell, et al. (2002) SAS for Linear Models.  The materials in this site are appropriate for someone who has a reasonable command of basic linear regression.  In a basic regression course, it is usually assumed that we are interested in modeling effects based on independent and identically distributed observations (e.g. a single cross section with simple random sampling).  In the materials in this site, we expand the application of the linear model to the analysis of data arising from more complex design and sampling scenarios (e.g. experimental and quasi-experimental designs, cases with nested or clustered samples).  The site is provided by Robert Hanneman, in the Department of Sociology at the University of California, Riverside.  Your comments and suggestions are welcome, as is your use of any materials you find here.
This page is organized in parallel to the Littell text:
6.1  Introduction

Both fixed effects and mixed effects models use dummy variables to parameterize effects -- that is, they express effects as deviations of each of the categories from a reference category.  GLM estimates the effects of all sets of dummy variables simultaneously and jointly as in regression; MIXED, however, separates the dummy variable sets for fixed and random effects, and estimates them separately.

The GLM or regression approach is particularly useful for unbalanced data -- where cell frequencies and/or cell variances may differ greatly.  All approaches to parameterization raise the issue of what functions of the parameters are estimable.


6.2  The dummy-variable model

The points in this section are illustrated with a simple one-way classification with random assignment.

6.2.1  The Simplest Case

The score of a case is equal to the mean of its condition plus an error (uniqueness) term.  The mean of the condition is expressed as the deviation of that condition's mean from some reference value.  There is, however, an identification/estimability problem: a parameterization with a reference value plus an effect for every group has more parameters than there are group means, so it is underidentified.

6.2.2  Parameter estimation

There are two approaches to applying restrictions to achieve estimability.  One is to fix the effect of one category at zero.  The intercept then becomes the mean of that group, and all other effects are expressed as deviations from it.  The alternative is to define the intercept as the mean of the group means ("grand mean") and express group effects as deviations from it, with one group's effect excluded to achieve estimability (it is implied by the others, because the deviations sum to zero).
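The two restrictions can be compared on a small example.  The sketch below uses made-up scores for three hypothetical groups (the data and the function names are illustrative assumptions, not taken from Littell).  It relies on the fact that, in a one-way classification, the least-squares fitted value for each case is its group mean.

```python
from statistics import mean

# Made-up scores for three hypothetical groups (not from the text).
scores = {"A": [4.0, 6.0], "B": [7.0, 9.0, 8.0], "C": [1.0, 3.0]}
group_means = {g: mean(v) for g, v in scores.items()}

def reference_restriction(means, ref):
    """Fix the effect of `ref` at zero: intercept = mean of the reference group."""
    intercept = means[ref]
    effects = {g: m - intercept for g, m in means.items()}
    return intercept, effects

def grand_mean_restriction(means):
    """Intercept = unweighted mean of the group means; effects sum to zero."""
    intercept = mean(list(means.values()))
    effects = {g: m - intercept for g, m in means.items()}
    return intercept, effects

ref_intercept, ref_effects = reference_restriction(group_means, "C")
gm_intercept, gm_effects = grand_mean_restriction(group_means)

# Both parameterizations reproduce the same group means.
for g in group_means:
    print(g, ref_intercept + ref_effects[g], gm_intercept + gm_effects[g])
```

The intercepts and effects differ between the two restrictions, but intercept plus effect is the same group mean either way.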

Alternatively, the "generalized inverse" methodology can be applied instead of explicit restrictions.  GLM uses this method, and parameterizes results with the intercept as the mean of a reference category and the effects as deviations from it.
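In matrix terms, the idea can be sketched as follows.  The notation is standard least-squares notation and is an assumption here, since this page gives no matrix formulas: X is the design matrix containing the intercept and a dummy variable for every group, Y is the vector of scores, b is a solution vector, and (X'X)^- is any generalized inverse of X'X.

```latex
\begin{aligned}
X'X\,b &= X'Y
  && \text{(normal equations; } X'X \text{ is singular when every group has a dummy)} \\
b &= (X'X)^{-}X'Y
  && \text{(one solution for each choice of generalized inverse)} \\
Xb &= X(X'X)^{-}X'Y
  && \text{(the fitted values are the same for every choice)}
\end{aligned}
```

Different generalized inverses give different solution vectors b, but the fitted values do not depend on the choice.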

6.2.3  Proc GLM for ANOVA

The class statement creates dummy variables for the full matrix of effects (in alphabetical order of the category names).  All levels are included, but the effect of the last is set to zero; add the / solution option to the model statement to get parameter estimates.
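The dummy-variable coding described above can be imitated in a few lines.  This is only a sketch of the idea, not of what SAS does internally: the function name and the treatment labels are hypothetical, and Python's sorted() stands in for the alphabetical ordering of category names.

```python
def class_dummies(levels):
    """Return (sorted level names, rows of [intercept, one dummy per level])."""
    names = sorted(set(levels))
    rows = [[1] + [1 if lev == name else 0 for name in names] for lev in levels]
    return names, rows

# Hypothetical class variable for six cases (illustrative values only).
treatment = ["drug", "control", "placebo", "control", "drug", "placebo"]
names, rows = class_dummies(treatment)

# The dummy columns always add up to the intercept column, so the full
# design matrix is not of full rank.
dependent = all(row[0] == sum(row[1:]) for row in rows)

# Setting the last level's effect to zero amounts to dropping its column;
# that level becomes the reference category.
reference = names[-1]
reduced_rows = [row[:-1] for row in rows]

print(names, reference, dependent)
```

With these labels the last level in sorted order is "placebo", so its cases are coded with zeros on all remaining dummies.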

6.2.4  Estimable functions in a one-way classification

Estimable functions are functions of model parameters (e.g. the difference between two parameters, the difference between a parameter and the average of two others, etc.) that are invariant to the generalized inverse used (i.e. regardless of which group is excluded).  The estimable functions for a particular model can be obtained from SAS by adding the / e option to the model statement.  The lsmeans statement produces the group means (adjusted group means, if there are other factors in the design).  Contrasts are linear functions of the parameters whose coefficients sum to zero.  Contrasts are set up using the parameter vector for all groups -- not the intercept (p. 178).  Means are reconstructed by adding the intercept to the weighted coefficients for the entire vector.
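The invariance that defines an estimable function can be checked directly.  The sketch below uses made-up group means and hypothetical helper names; each "solution" sets a different group's effect to zero, which corresponds to a different choice of excluded group.

```python
# Made-up group means for three hypothetical groups (not from the text).
group_means = {"A": 5.0, "B": 8.0, "C": 2.0}

def solution(means, ref):
    """One solution of the normal equations: the effect of `ref` is zero."""
    sol = {"u": means[ref]}
    sol.update({g: m - means[ref] for g, m in means.items()})
    return sol

def linear_function(sol, weights):
    """Weighted sum of the parameters named in `weights`."""
    return sum(w * sol[name] for name, w in weights.items())

# One solution per choice of excluded (reference) group.
solutions = [solution(group_means, ref) for ref in group_means]

functions = {
    "a_A - a_B": {"A": 1, "B": -1},                     # contrast
    "a_A - (a_B + a_C)/2": {"A": 1, "B": -0.5, "C": -0.5},  # contrast
    "u + a_A": {"u": 1, "A": 1},                        # mean of group A
    "a_A alone": {"A": 1},                              # not estimable
}

values = {name: [linear_function(s, w) for s in solutions]
          for name, w in functions.items()}
for name, vals in values.items():
    print(name, vals)
```

The two contrasts and the group mean take the same value under every solution, while the single effect a_A changes with the reference group, so it is not estimable on its own.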


6.3  Two-way classification: Unbalanced data

The two most common two-factor experimental designs are the two-way factorial (usually with interaction) and the randomized blocks design.  The randomized blocks design is also a full cross, in which each level of treatment occurs in each level of another factor, but here the other factor is usually a blocking factor, which is sometimes treated as random.  The randomized blocks design frequently has one observation per cell; the factorial usually has multiple (but balanced) observations per cell.

6.3.1  General Considerations

$y_{ijk} = u + a_i + b_j + (ab)_{ij} + e_{ijk}$.  The mean for any cell is the sum of $u$ and the appropriate $a_i$, $b_j$, and $(ab)_{ij}$.

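With unbalanced data, the standard ANOVA calculations may be biased.  When cell frequencies are unequal, the factors are no longer orthogonal, so differences in the raw means of one factor partly reflect the effects of the other factors.  The solution is to adjust each factor's means for the other factors in the model.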
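A small made-up example shows how raw and adjusted means can diverge.  In this sketch the adjusted mean for a level of A is taken to be the unweighted mean of its cell means, which is the usual least-squares mean for a two-way model with interaction and no empty cells; the data and function names are hypothetical.

```python
from statistics import mean

# Made-up unbalanced 2 x 2 data: (level of A, level of B) -> scores.
cells = {
    ("a1", "b1"): [10.0, 12.0, 11.0],
    ("a1", "b2"): [15.0],
    ("a2", "b1"): [14.0],
    ("a2", "b2"): [20.0, 22.0, 21.0],
}

def raw_mean(level):
    """Ordinary mean of all scores at one level of A (cells weighted by n)."""
    return mean([y for (a, _), ys in cells.items() if a == level for y in ys])

def adjusted_mean(level):
    """Unweighted mean of the cell means at one level of A."""
    return mean([mean(ys) for (a, _), ys in cells.items() if a == level])

raw_difference = raw_mean("a2") - raw_mean("a1")
adjusted_difference = adjusted_mean("a2") - adjusted_mean("a1")
print(raw_difference, adjusted_difference)
```

Because a1 cases fall mostly in b1 and a2 cases mostly in b2, the raw difference between the A means absorbs part of the B effect; the adjusted difference weights both levels of B equally.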

6.3.2  SS in PROC GLM

Four types of SS are provided by GLM:

Type I:  the incremental (sequential) SS of the regression method; each effect is adjusted only for the effects that precede it in the model statement.  Sometimes useful for assessing whether interactions are needed, given main effects.

Type II:  adjusts the SS of each effect for all other effects that do not contain it.  This is easier to understand with an example.  In the two-way factorial with interaction, A is adjusted for B, but not for AB (because AB contains A); B is adjusted for A, but not for AB; AB is adjusted for the main effects A and B.  Type II is most useful where there is no interaction; this approach is called "fitting constants".

Type III:  Yates's weighted squares of means.  Useful for comparing main effects in the presence of interaction.  Each effect is adjusted for all other effects (including interactions).

Type IV:  used mostly where there are empty cells.
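The difference between Type I and Type II SS can be seen by fitting a sequence of regression models and differencing their residual SS.  The sketch below is an illustration under simplifying assumptions: the data are made up, each factor has two levels (so one 0/1 dummy per factor suffices), the model has main effects only, and the helper functions are hypothetical rather than anything SAS provides.

```python
# Made-up unbalanced data: (A dummy, B dummy, score).
obs = [(0, 0, 10.0), (0, 0, 12.0), (0, 0, 11.0), (0, 1, 15.0),
       (1, 0, 14.0), (1, 1, 20.0), (1, 1, 22.0), (1, 1, 21.0)]
data = [dict(zip(("A", "B", "y"), o)) for o in obs]

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(terms):
    """Residual SS for a model with an intercept plus the named main effects."""
    rows = [[1.0] + [float(d[t]) for t in terms] for d in data]
    y = [d["y"] for d in data]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(c * x for c, x in zip(beta, r)) for r in rows]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))

sse_u, sse_a, sse_b, sse_ab = sse([]), sse(["A"]), sse(["B"]), sse(["A", "B"])

type1_a = sse_u - sse_a    # A adjusted for the intercept only (A entered first)
type1_b = sse_a - sse_ab   # B adjusted for the intercept and A
type2_a = sse_b - sse_ab   # A adjusted for the intercept and B
type2_b = sse_a - sse_ab   # B adjusted for the intercept and A

print(type1_a, type1_b, type2_a, type2_b)
```

The Type I SS add up to the total model SS, while the Type II SS for A is much smaller than its Type I SS here because A and B are strongly associated in these unbalanced data.  With balanced data the two types would coincide.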

6.3.3  Interpreting SS in reduction notation

This section shows, for each effect and each SS type, what is controlled for, using the reduction notation.
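For the two-way model with interaction, the reduction notation can be summarized as in the sketch below.  It is written from the definitions of the SS types rather than copied from the text, and it assumes that A is entered before B in the model statement.  Here R(x | z) denotes the reduction in residual SS obtained by adding x to a model that already contains z.

```latex
\begin{array}{llll}
\text{Effect} & \text{Type I} & \text{Type II} & \text{Type III} \\
\hline
A  & R(a \mid u)        & R(a \mid u, b)     & R(a \mid u, b, ab)^{*} \\
B  & R(b \mid u, a)     & R(b \mid u, a)     & R(b \mid u, a, ab)^{*} \\
AB & R(ab \mid u, a, b) & R(ab \mid u, a, b) & R(ab \mid u, a, b)
\end{array}
```

The starred entries are meaningful only under sum-to-zero restrictions on the parameters; without restrictions the interaction dummies already span the main effects, and the reduction would be zero.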

6.3.4  Interpreting SS in the u-model notation

The basic model is $y_{ijk} = u_{ij} + e_{ijk}$: an individual score is the cell mean plus an individual error.  Each cell mean is generated from the parameters as $u_{ij} = u + a_i + b_j + (ab)_{ij}$.
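Written in terms of cell means, the hypotheses behind the main-effect SS become easier to read.  The sketch below gives the standard results for the A effect when there are no empty cells; the symbols n_{ij} (cell frequency), n_{i.} (total frequency at level i of A), and J (number of levels of B) are notation assumed here, not taken from this page.

```latex
\begin{aligned}
\text{Type I (A entered first):} \quad
  & H_0:\ \sum_{j} \frac{n_{ij}}{n_{i.}}\, u_{ij}
    \ \text{ is the same for every level } i \text{ of A} \\
\text{Type III:} \quad
  & H_0:\ \frac{1}{J} \sum_{j} u_{ij}
    \ \text{ is the same for every level } i \text{ of A}
\end{aligned}
```

The Type I hypothesis weights the cell means by the cell frequencies, so it depends on how unbalanced the data are; the Type III hypothesis weights the cell means equally.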

6.3.5  An Example of an Unbalanced two-way classification.

The examples use the data sets teachers.sas, twoway.sas, and twoway2.sas.

The remainder of this very long section is an extended example of all possible SS, lsmeans, etc. for two-way data, interpreted in terms of estimable functions, showing where problems may arise with unbalanced data.


6.4  Mixed-model issues

This section largely replicates the earlier mixed model analysis of the grasses data set for the split-plot experiment.


6.5  ANOVA issues for unbalanced mixed models

The basic point is that MIXED is a sounder approach to obtaining parameter estimates than GLM.


6.6  GLS and likelihood methodology for mixed models

MIXED uses generalized least squares (GLS) and likelihood-based methods.

The model is $Y = XB + ZU + e$, where $XB$ are the fixed effects, $ZU$ are the random effects, and $e$ is the residual error.

A key to the method is specifying the covariance matrix of Y -- that is, the relationships of non-independence or covariation among the observations.  This matrix depends on the structure of the random component.
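In the standard mixed-model formulation this can be sketched as follows.  The symbols G (covariance matrix of the random effects U), R (covariance matrix of the residuals e), and V (covariance matrix of Y) are assumptions introduced here for illustration and are not defined on this page.

```latex
\begin{aligned}
\operatorname{Var}(U) &= G, \qquad \operatorname{Var}(e) = R \\
V = \operatorname{Var}(Y) &= Z G Z' + R \\
\hat{B} &= (X' V^{-1} X)^{-} X' V^{-1} Y
  && \text{(GLS estimate of the fixed effects)}
\end{aligned}
```

In practice G and R are unknown, so their parameters are estimated (for example by likelihood methods) and the estimated V is used in the GLS formula.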

Not discussed at this point, but important, are the various restrictions that can be imposed on this covariance matrix to deal with repeated observations, pooling, and other forms of blocking by random factors.  The SPSS programs for mixed models address this issue explicitly by trying to provide "wizards" that impose the correct restrictions.


Data sets

teachers.sas

twoway.sas

twoway2.sas
