Linear Models
Basic linear regression (Littell, chapter 2)

This site supports tutorial instruction on linear models, based on Littell et al. (2002), SAS for Linear Models.  The materials in this site are appropriate for someone who has a reasonable command of basic linear regression.  In a basic regression course, it is usually assumed that we are interested in modeling effects based on independent and identically distributed observations (e.g. a single cross section with simple random sampling).  In the materials in this site, we expand the application of the linear model to the analysis of data arising from more complex design and sampling scenarios (e.g. experimental and quasi-experimental designs, and cases with nested or clustered samples).  The site is provided by Robert Hanneman, Department of Sociology, University of California, Riverside.  Your comments and suggestions are welcome, as is your use of any materials you find here.
The notes below are organized parallel to Littell's text:
2.1  Introduction

linear regression equation: independent and dependent variables, intercept and slope, simple and multiple linear regression.

the equation with an error term, along with assumptions about its distribution, is the "statistical model"

the equation with sample estimates of the population parameters is the "estimating equation"

the main SAS procedures for doing regression are REG, GLM, and MIXED

2.2  The REG Procedure

In SAS:
proc reg;
model dependents = independents / options;
run;
options on the model statement include, e.g., noint, p, clm, cli

2.2.1  run proc reg on:
Cost of cattle market operations = B0 + B1(number of cattle sold) + e  using data MARKET

be able to interpret:
degrees of freedom, sums of squares, F test
root MSE, Rsq, adj. Rsq, Coefficient of variation
prediction equation
standard errors of coefficients, t for null, and probability
how to use standard errors to construct confidence intervals

assumptions of linearity and homoskedasticity can be roughly evaluated by graphing
proc reg data=market;
model cost = cattle;
plot cost*cattle;
run;

2.2.2  adding options to print predicted values and examine confidence intervals:
model cost=cattle / p clm cli;
p produces the predicted value for each case, as well as the estimated standard error of the mean of y at each observed level of x
clm produces the 95% confidence interval (lower and upper bounds) for the mean of y across all cases with the same value of x (think of this group of cases with the same score on x as a subpopulation).
cli produces the 95% prediction interval for an individual future observation of y at each value of x

Important ideas:
1.  each value of x can be thought of as defining a subpopulation of cases (even if there happens to be only one observation at that value, as in this case); one can estimate quantities, such as the mean predicted value and its confidence interval, for each of these subpopulations
2.  clm gives the confidence interval for the predicted mean of the subpopulation defined by each value of x; the cli limits are wider, because they reflect the sampling variability of the estimated mean plus the variability of an individual future observation of y
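Putting these pieces together, a minimal sketch of the full call (assuming the MARKET data set with variables cost and cattle, as above):

```sas
proc reg data=market;
   model cost = cattle / p clm cli;
   * p: predicted values; run after the model statement;
run;
```

The p, clm, and cli output appears case by case below the usual regression tables.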

2.2.3  Several independent variables
using the data set AUCTION, add the numbers of other animals sold in the market to create a multiple regression model
proc reg data=auction;
model cost = cattle calves hogs sheep;
run;

understand and interpret:
the null hypothesis and meaning of the overall F test and its degrees of freedom
parameter estimates, predicted values, estimated standard errors of coefficients, t and p

2.2.4  Sequential (SS1) and Partial (SS2) sums of squares
model cost=cattle calves hogs sheep / ss1 ss2;
reduction notation for the two SS types: R(b1|b0) means the SS due to b1, adjusting for b0
F tests can be based on any Type I or Type II SS, using MSE as the denominator
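As a complete run (a sketch, assuming the AUCTION data set from above):

```sas
proc reg data=auction;
   model cost = cattle calves hogs sheep / ss1 ss2;
   * ss1 and ss2 request the sequential and partial sums of squares;
run;
```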

2.2.5  Tests for subsets and linear combinations
The TEST statement in REG, placed after the model statement, allows testing hypotheses about single coefficients or linear combinations, for example:
model cost=cattle calves hogs sheep;
label_for_this_test: test hogs=0;  tests the hypothesis that the coefficient for hogs is zero
label2: test hogs=0, sheep=0;  performs a joint test of both hypotheses
label3: test intercept=0;  tests that b0 is 0
label4: test hogs=1;  tests whether the hogs coefficient differs from 1
label5: test hogs-calves=.5;  tests whether the hogs coefficient exceeds the calves coefficient by exactly .5
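The TEST statements above can be collected into one runnable sketch (assuming the AUCTION data set; the labels hogzero and jointly are arbitrary names chosen here):

```sas
proc reg data=auction;
   model cost = cattle calves hogs sheep;
   hogzero: test hogs = 0;             * is the hogs coefficient zero?;
   jointly: test hogs = 0, sheep = 0;  * joint F test of both hypotheses;
run;
```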

2.2.6  Restricted models
like the TEST statement, RESTRICT concerns hypotheses about parameters, but it re-estimates the model with the restrictions imposed
model cost=cattle calves hogs sheep;
restrict intercept=0, hogs-sheep=0;
estimates model with no intercept and with two coefficients forced to be equal.
output provides a single degree of freedom test for the significance of the difference due to restriction, as opposed to freeing the parameter(s)
caution:  when using the  / noint option in PROC REG, R2 and other statistics are not correctly estimated.
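A sketch of the restricted fit described above (assuming the AUCTION data set):

```sas
proc reg data=auction;
   model cost = cattle calves hogs sheep;
   restrict intercept = 0, hogs - sheep = 0;  * no intercept and equal hogs/sheep coefficients;
run;
```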

2.2.7.  Exact linear dependency
In the AUCTION data set, volume is the sum of the four livestock types.  Entering it along with the four component parts creates an exact linear dependency; SAS prints a warning message and sets one of the coefficients to zero.  This does not bias the remaining coefficients.

2.3  The GLM Procedure

2.3.1.  GLM for linear regression
proc glm;

model cost=cattle calves hogs sheep;
run;
produces the same output as REG

there are differences in the meaning of the SS types in some ANOVA models.  Type III SS in GLM corresponds to Type II (partial) SS in REG.

2.3.2  Using contrast statements to test regression parameters
Contrasts test hypotheses about linear combinations of regression parameters
model ...
contrast 'contrast_name'  effect values;  for example:
contrast 'hogcost=0' intercept 0 cattle 0 calves 0 hogs 1 sheep 0; tests whether the coefficient for hogs is zero, ignoring the other coefficients
contrast 'hogcost=sheepcost' hogs 1 sheep -1; tests whether the difference between coefficient for hogs and coefficient for sheep is zero.
contrast 'hogcost=sheepcost=0' hogs 1, sheep 1;  jointly tests the two contrasts (hogs zero and sheep zero) with a single 2 df F test.
output identifies each contrast by name, and provides a proper F test for it.
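The contrast examples above can be combined into one sketch (assuming the AUCTION data set):

```sas
proc glm data=auction;
   model cost = cattle calves hogs sheep;
   contrast 'hogcost=0'           hogs 1;           * hogs coefficient zero?;
   contrast 'hogcost=sheepcost'   hogs 1 sheep -1;  * hogs and sheep coefficients equal?;
   contrast 'hogcost=sheepcost=0' hogs 1, sheep 1;  * joint 2 df test;
run;
```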

2.3.3.  Using ESTIMATE to estimate linear combinations of parameters
ESTIMATE has the same syntax as CONTRAST, but it prints the estimate of the linear combination of parameters, its standard error, and a test of whether it differs significantly from zero, for example...
model  ....;
estimate 'hogcost=sheepcost' hogs 1 sheep -1;
this takes the coefficient for hogs, subtracts the coefficient for sheep, prints the difference, and tests whether it is significantly different from zero.
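As a runnable sketch (assuming the AUCTION data set):

```sas
proc glm data=auction;
   model cost = cattle calves hogs sheep;
   estimate 'hogcost=sheepcost' hogs 1 sheep -1;  * prints the difference and its t test;
run;
```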

2.4  Statistical Background

2.4.1  the linear regression model
terminology for representing the equation in matrix form: Y is the column vector of y scores for the cases; X is the matrix of cases by x variables (with a leading column of ones for the intercept); e is the vector of residuals; beta is the vector of regression parameters.  Part of the OLS solution is the matrix (X'X)^-1, often called C; multiplied by the MSE, C gives the matrix of variances and covariances of the regression parameter estimates.
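In the usual notation, the model and the standard OLS results just described can be written as:

```latex
Y = X\beta + e, \qquad
\hat{\beta} = (X'X)^{-1}X'Y, \qquad
C = (X'X)^{-1}, \qquad
\widehat{\mathrm{Var}}(\hat{\beta}) = \mathrm{MSE}\cdot C
```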

2.4.2  partitioning sums of squares
total, model, and error SS.  MSE = error SS / (n - m - 1), where m is the number of independent variables
tests are often done by comparing the reduction in error SS of one model relative to another in which it is nested, with numerator df equal to the difference in the number of parameters
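The comparison of nested models is the usual F test; writing SSE_r and SSE_f for the error SS of the reduced and full models, and q for the number of restrictions:

```latex
F = \frac{(SSE_r - SSE_f)/q}{MSE_f}
```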

2.4.3  hypothesis tests and confidence intervals
three most common cases: the test that all slopes are simultaneously zero (the standard overall F test); the test that some subset of the parameters is zero (done by comparing the full model to a reduced model); and confidence intervals for the mean of a particular subpopulation defined by simultaneous scores on the several x variables.
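For the third case, the standard formula for the subpopulation defined by a column vector of x scores x_0 (with C = (X'X)^{-1} as before) is:

```latex
\hat{y}_0 = x_0'\hat{\beta}, \qquad
\hat{y}_0 \pm t_{\alpha/2,\,n-m-1}\,\sqrt{\mathrm{MSE}\; x_0' C\, x_0}
```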

2.4.4. using the generalized inverse
The generalized inverse is used when the model is under-identified, that is, when X'X is singular (as when an exact linear dependency exists among the independent variables).  Any number of parameterizations is possible in this circumstance.  But the SS error is not affected by the choice of parameterization; individual coefficients, and non-estimable linear combinations of them, are affected.