Linear Models
Analyzing unbalanced data, basic methods (Littell, chapter 5)

This site supports tutorial instruction on linear models, based on Littell, et al. (2002) SAS for Linear Models.  The materials in this site are appropriate for someone who has a reasonable command of basic linear regression.  In a basic regression course, it is usually assumed that we are interested in modeling effects from independent and identically distributed observations (e.g. a single cross section with simple random sampling).  In the materials in this site, we expand the application of the linear model to the analysis of data arising from more complex design and sampling scenarios (e.g. experimental and quasi-experimental designs, cases with nested or clustered samples).  The site is provided by Robert Hanneman, in the Department of Sociology at the University of California, Riverside.  Your comments and suggestions are welcome, as is your use of any materials you find here.
The materials on this page parallel the sections of Littell's text:
5.1  Introduction

"Unbalanced" is difficult to define precisely.  Here, it means unequal numbers of observations in cells of a design.  Most non-experimental and quasi-experimental designs (and some experimental) yield unbalanced data.  Experimental designs often yield unbalanced data because of subject mortality, non-response, etc.

The basic problem with unbalanced data is the choice of reference point:  what is the grand mean, or the effect mean, against which group means are compared to compute effects?  Should that mean weight each cell by its number of observations, or give every cell equal weight?  Secondarily, unequal cell sizes make the analysis more sensitive to heterogeneity of variance across cells, which creates problems for valid standard error estimates.
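
To make the weighting question concrete, here is a minimal Python sketch.  It is not taken from Littell's text, and the cell means and cell sizes are invented for illustration.  It computes the mean for one treatment in two ways: as the unweighted average of the cell means, and as the average weighted by cell size.

```python
# Hypothetical cell summaries for one treatment across three clinics:
# (cell mean, number of patients). The numbers are invented for illustration.
cells = [(10.0, 2), (20.0, 8), (30.0, 10)]


def unweighted_mean(cells):
    """Average of the cell means: every cell counts equally."""
    return sum(mean for mean, _ in cells) / len(cells)


def weighted_mean(cells):
    """Average over observations: cells count in proportion to their size."""
    total_n = sum(n for _, n in cells)
    return sum(mean * n for mean, n in cells) / total_n


print(unweighted_mean(cells))  # 20.0
print(weighted_mean(cells))    # 24.0
```

With equal cell sizes the two functions return the same value, which is why the question never comes up with balanced data.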

In the example used in the text, there is a drug treatment factor with two levels (drug A, drug B).  Clinical trials (variable: study) are conducted on patients at a number of clinics.  Two measures of outcome (flush0 and flush) are taken, but only one dependent variable is analyzed in this chapter.

Where the effects are regarded as fixed, the common approach to unbalanced means comparisons involves using different types of SS.  SAS GLM can be used for this purpose.  Where some effects are regarded as random in the presence of unbalanced data, new complexities arise, and MIXED is called for.


5.2  Applied concepts of analyzing unbalanced data

The data set and the dependent variable Flush are introduced.

The cells of the design are defined by the crossing of the treatment factor TRT (which drug, A or B?) and the clinical trial factor (STUDY, or which clinic was used).  In this case, the data are (slightly) unbalanced because of problems of study administration, resulting in unequal numbers of subjects and some missing data.

5.2.1  ANOVA for unbalanced data

There are four types of sums of squares.  Type IV is not illustrated here, because it is equal to type III when there are observations in all cells of the design.  GLM is used to generate the three remaining types of SS for a model with treatment, study, and the treatment*study interaction.

The goal is to compare treatments.  The idea is to pool observations across studies.  But a significant interaction suggests that the effects of treatment vary across clinics.  This interaction calls for explanation, ideally through the inclusion of other variables describing clinic differences that account for it.  In the current case, no such information is available, so the best we can do is "average across populations" to see if there appears to be a main effect of drug.

The type III SS is based on estimating the overall treatment mean as the average of cell means -- that is, it gives equal weight to each cell, regardless of the number of observations in that cell, in determining the grand mean and treatment effects.  This is a reasonable approach where the interest is solely in whether there is a treatment effect, but the power of the test can be very low if there are any cells with low frequencies.

5.2.2  Contrasts and estimates with unbalanced data

If there are no empty cells, the CONTRAST and ESTIMATE statements work the same way as with balanced data.  In most cases, tests of differences in means are set up in the same way for balanced and unbalanced data.

5.2.3  LSMEANS statement

The LSMEANS statement computes the effect means across cells.  With balanced data, this is the simple mean of the observations.  With unbalanced data, it is the mean of the cell means, with each cell weighted equally.  Correct standard errors and correct tests for differences in means are generated.
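
The arithmetic behind a least-squares mean in this setting can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not SAS's actual computation: it assumes a two-way fixed effects model with interaction, no empty cells, a common error variance, and a residual mean square (mse) that has already been obtained.  All of the numbers are hypothetical.

```python
import math


def lsmean_and_se(cell_means, cell_sizes, mse):
    """Mean of cell means for one treatment, with its standard error.

    Assumes every cell shares the error variance estimated by `mse`.
    The variance of one cell mean is mse / n, so the variance of the
    average of k cell means is (mse / k**2) * sum(1 / n).
    """
    k = len(cell_means)
    lsmean = sum(cell_means) / k
    se = math.sqrt(mse * sum(1.0 / n for n in cell_sizes)) / k
    return lsmean, se


# Hypothetical cell means and cell sizes for one drug across three clinics.
lsmean, se = lsmean_and_se([10.0, 20.0, 30.0], [2, 8, 10], mse=4.0)
print(lsmean, se)
```

Note that the smallest cell dominates the sum of 1/n, so a single sparse cell inflates the standard error of the whole treatment mean.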

5.2.4  Other hypotheses and types of SS

In some cases, we might want to test differences in treatment means using data weighted by the observed cell sizes (e.g. giving more weight to cells with more observations).  This is awkward to do in GLM, but can be done by computing weights and testing a contrast of treatment means weighted by cell sizes (illustrated in text).  The SS for the contrast using weights proportional to cell size will differ from the standard contrast -- which assumes that the groups (clinics in this case) are to be weighted equally.

Point:  GLM, by default, calculates statistics as if the design were balanced, by using the mean of group means as the point of comparison.  It is possible to correct some comparisons and tests to reflect the differences in group sizes by using weighted contrasts.
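
One way to see the difference between the two kinds of contrast is to write out the coefficients placed on the cell means.  The Python sketch below is an illustration, not the text's SAS code: the function names and the data are hypothetical, and weighting each cell by its share of the treatment's observations is assumed as the weighting scheme.

```python
def contrast_coefficients(sizes_a, sizes_b, weighted):
    """Coefficients on the cell means for the A-minus-B comparison."""
    if weighted:
        # Each cell counts in proportion to its share of the treatment's n.
        n_a, n_b = sum(sizes_a), sum(sizes_b)
        coef_a = [n / n_a for n in sizes_a]
        coef_b = [-n / n_b for n in sizes_b]
    else:
        # Each clinic counts equally, as in the standard contrast.
        k = len(sizes_a)
        coef_a = [1.0 / k] * k
        coef_b = [-1.0 / k] * k
    return coef_a, coef_b


def contrast_estimate(means_a, means_b, coef_a, coef_b):
    """Value of the contrast: sum of coefficient times cell mean."""
    return (sum(c * m for c, m in zip(coef_a, means_a))
            + sum(c * m for c, m in zip(coef_b, means_b)))


# Hypothetical cell means and sizes for two clinics.
means_a, sizes_a = [10.0, 20.0], [2, 8]
means_b, sizes_b = [12.0, 16.0], [5, 5]

for weighted in (False, True):
    ca, cb = contrast_coefficients(sizes_a, sizes_b, weighted)
    print(weighted, contrast_estimate(means_a, means_b, ca, cb))
```

With these invented numbers the equally weighted difference is 1 and the size-weighted difference is 4 (up to floating-point rounding), because drug A has most of its patients in the clinic with the higher mean.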


5.3   Issues associated with empty cells

In this section, the data set drugs1 is used, rather than the edited set (drugs) that was used above.  In the set drugs1, there is one clinic that had no patients who received drug A, so one cell in the design is empty (it is not clear whether this is a "design" zero or a "sampling" zero; more likely the latter, in this case).

5.3.1  Effects of empty cells on SS

In this case, we see that all types of SS have the same degrees of freedom, but that the SSIII differs from SSIV.  SSIV is substantially smaller for clinic than is SSIII.  Estimated treatment and error SS are the same.  The text implies, but doesn't state, that type IV SS is probably more appropriate.  An explanation is promised in chapter 6.

5.3.2  Effects of empty cells on contrasts, estimates, and LSMEANS

Contrasts and estimates in GLM cannot be estimated in the presence of one or more zero cells.  The LS mean for the treatment that has a zero cell (treatment A, in this case) is also printed as non-estimable.
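
The reason the LS mean becomes non-estimable follows from its definition as an equally weighted average of cell means: if one of the cells has no data, one term of the average does not exist.  A minimal Python sketch of that logic, with a hypothetical function name and invented numbers (an empty cell is represented by None):

```python
def mean_of_cell_means(cell_means):
    """Equally weighted average of cell means, or None if any cell is empty.

    An empty cell is passed as None. Dropping it silently would change
    what the average means, so the result is reported as not estimable.
    """
    if any(m is None for m in cell_means):
        return None  # non-estimable
    return sum(cell_means) / len(cell_means)


# Hypothetical cell means by clinic; drug A was not given at the third clinic.
drug_a = [10.0, 20.0, None]
drug_b = [12.0, 16.0, 14.0]

print(mean_of_cell_means(drug_a))  # None
print(mean_of_cell_means(drug_b))  # 14.0
```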


5.4  Some problems with unbalanced mixed-model data

The key issue with unbalanced data is the construction of meaningful parameters (and hence contrasts and tests).  The key problem with this is what to use as an overall mean (which is the same issue as how to weight unequal cells).  This section examines a GLM approach and a MIXED approach to anova with unbalanced cells.

Assume that the clinics involved in the study (STUDY) are a random draw of possible clinics, and that we wish to generalize our test of TRT effects across a potential population of all possible STUDYs.  The model is the same as before (including interaction), but we now assume that STUDY and its interactions are random factors.  We assume that the random effects are zero mean, normally distributed, and independent.

The objectives are the same as with balanced design analysis -- we want estimates of treatment effects and contrasts that generalize to the population average across STUDYs.


5.5  Using the GLM procedure to analyze unbalanced mixed-model data

This section analyzes the data with unequal cell sizes, but no zero cells.

ANOVA mean squares for effects and errors were developed for, and best apply to, fixed effects and balanced data.  There are problems in applying them to random effects and unbalanced data.  The mean square for an effect is somewhat problematic where data are unbalanced (is a weighted or unweighted effect mean appropriate?  what about zero cells and unequal cell variance?).  The error mean squares are problematic as well.  Expected mean squares can be used, as with balanced data, but the variance components may not match up with the mean squares, so there may be no single mean square with the right expectation to serve as the denominator.  Finally, MS effect and MS error are correlated with unbalanced data -- that is, the error component covaries with the effect due to differences in cell sizes -- so conventional F tests do not (strictly) apply.

5.5.1  Approximate F statistics

To test the effect (in this example, the effect of treatment), a numerator mean square for treatment must be chosen.  The author suggests, without justification here, that the regular type III SS and df be used.  The denominator is more troublesome.  Run the model in GLM with the RANDOM statement to get the expected mean squares.  The TEST option on the RANDOM statement causes SAS to construct a weighted combination of mean squares as an approximate error term, providing a test for inference when STUDY is treated as random.  This test will generally be more conservative than the fixed effects test.
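
The standard device for this kind of approximate denominator is Satterthwaite's method: the error term is a linear combination of mean squares, and its degrees of freedom are approximated from the components.  The Python sketch below shows only that arithmetic.  The mean squares, df, and coefficients are hypothetical; in practice the coefficients would be read off the expected mean squares that GLM prints.

```python
def satterthwaite(mean_squares, dfs, coefs):
    """Synthetic mean square sum(a_i * MS_i) and its approximate df.

    df is approximated by (sum a_i MS_i)^2 / sum((a_i MS_i)^2 / df_i).
    """
    terms = [a * ms for a, ms in zip(coefs, mean_squares)]
    synthetic_ms = sum(terms)
    approx_df = synthetic_ms ** 2 / sum(t ** 2 / df for t, df in zip(terms, dfs))
    return synthetic_ms, approx_df


# Hypothetical values: MS for TRT*STUDY (6 df) and MS error (100 df),
# combined with coefficients 0.9 and 0.1.
denominator_ms, denominator_df = satterthwaite([12.0, 4.0], [6, 100], [0.9, 0.1])

ms_treatment = 30.0  # hypothetical type III mean square for TRT
f_approx = ms_treatment / denominator_ms
print(denominator_ms, denominator_df, f_approx)
```

Because the interaction mean square carries most of the weight here, the approximate denominator df stay close to the interaction's 6 rather than the error's 100, which is the source of the loss in power relative to the fixed effects test.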

5.5.2  Contrast, estimate, and lsmeans

Be careful:  the RANDOM statement must be used, but it does not automatically apply the right corrections to contrasts and means -- this is true for both balanced and unbalanced data.  In particular, the F test for contrasts of treatment means is most likely wrong.  The value of the contrast obtained with ESTIMATE is wrong, and cannot be corrected.

In sum, there are complex problems in using GLM or other packages that assume fixed effects for getting correct tests where the model is, in fact, mixed.


5.6  Using the MIXED procedure to analyze unbalanced mixed-model data

This section repeats the analysis of section 5.5, using the data with unbalanced cells but no zero cells.

PROC MIXED is run in the same way, whether data are balanced or unbalanced.  

The example shows that there are still problems in estimating the variance component for the random effect STUDY (the counterpart, in the GLM version, of the SS for STUDY).  But, rather than a negative component, the REML estimate is zero.  Useful contrasts, estimates, and means are produced, with approximate df and consistent tests.
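
The contrast between a negative component and a zero estimate can be illustrated with a simple moment calculation.  The Python sketch below is a simplification and does not reproduce what MIXED does internally: it uses the balanced one-way formula (MS effect minus MS error, divided by the number of observations per level) and then truncates at zero to mimic an estimate that is constrained to be non-negative.  The mean squares are hypothetical.

```python
def moment_variance_component(ms_effect, ms_error, n_per_level):
    """ANOVA (method-of-moments) estimate for a balanced one-way layout.

    Nothing stops this from going negative when ms_effect < ms_error.
    """
    return (ms_effect - ms_error) / n_per_level


def truncated_at_zero(estimate):
    """A variance cannot be negative, so a constrained estimate stops at 0."""
    return max(0.0, estimate)


# Hypothetical mean squares where the random effect's MS is below MS error.
raw = moment_variance_component(ms_effect=3.0, ms_error=5.0, n_per_level=4)
print(raw)                     # -0.5
print(truncated_at_zero(raw))  # 0.0
```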


5.7  Using the GLM and MIXED procedures to analyze mixed-model data with empty cells

This section considers the data set that does have a zero cell (no cases of one drug condition in one of the clinics).

With zero cells, GLM type III and type IV expected mean squares will differ.

MIXED results differ somewhat from the non-zero cell case.

There is one important difference.  PROC GLM will not produce estimates of LS means or contrasts in the presence of zero cells, because it treats all effects as fixed for this purpose.  MIXED looks at the estimability of the fixed effects only and hence, in this case, produces valid estimates.  There is missing data in one cell of the whole design (one drug not given at one clinic), but if we consider only the fixed effect (drug), data (albeit unbalanced) are present for both levels, and MIXED can estimate means and contrasts.


5.8  Summary and conclusions about using the GLM and MIXED procedures to analyze unbalanced mixed-model data

Mostly, the chapter is an advertisement for why MIXED is better than GLM in the presence of unbalanced data and data with zero cells.

In general, F tests need to be used cautiously with any kind of unbalanced data -- the F tests are only approximate.  The more unbalanced, the worse the approximation.

In general, MIXED is superior where the model is conceptualized as mixed, that is, where some effects are treated as random.  This is because pseudo GLS methods are used to estimate the random effects separately, and these estimates are then used to correct the fixed effects estimates.


Data sets

drugs1.sas   (includes code to create data set drugs.sas)
