Linear Models

Basic linear regression (Littell, chapter 2)

This site supports tutorial instruction on linear models, based on Littell et al. (2002)

The notes below are organized parallel to Littell's text:

2.1 Introduction

linear regression equation: independent and dependent variables, intercept and slope, simple and multiple linear regression.

equation with an error term, along with assumptions about its distribution, is the "statistical model"

equation with sample estimates of population parameters is the "estimating equation"

the main SAS procedures for doing regression are REG, GLM, and MIXED

Return to the chapter table of contents

2.2 The REG Procedure

In SAS:

model dependent = independents / options;

2.2.1 run proc reg on:

Cost of cattle market operations = B0 + B1(number of cattle sold) + e using data MARKET

be able to interpret:

degrees of freedom, sums of squares, F test

root MSE, Rsq, adj. Rsq, Coefficient of variation

prediction equation

standard errors of coefficients, t for null, and probability

how to use standard errors to construct confidence intervals

assumptions of linearity and homoskedasticity can be roughly evaluated by graphing

model cost = cattle;

plot cost*cattle;

run;
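The arithmetic behind this output can also be traced outside SAS. The sketch below (Python, not SAS) fits a simple regression to small made-up numbers standing in for the MARKET data, which is not reproduced here, and recovers root MSE, Rsq, the slope's standard error, t, p, and a 95% confidence interval.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for MARKET (cost vs. number of cattle sold)
cattle = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
cost = np.array([12.0, 18.0, 29.0, 35.0, 44.0, 49.0])

n = len(cost)
X = np.column_stack([np.ones(n), cattle])      # design matrix with intercept
b, *_ = np.linalg.lstsq(X, cost, rcond=None)   # b[0] = intercept, b[1] = slope

resid = cost - X @ b
sse = resid @ resid                            # error SS
sst = np.sum((cost - cost.mean()) ** 2)        # total SS
mse = sse / (n - 2)                            # error df = n - m - 1
r_sq = 1 - sse / sst
root_mse = np.sqrt(mse)

# standard error of the slope, t for H0: slope = 0, and a 95% CI
C = np.linalg.inv(X.T @ X)                     # (X'X)^-1
se_b1 = np.sqrt(mse * C[1, 1])
t = b[1] / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)
tcrit = stats.t.ppf(0.975, n - 2)
ci = (b[1] - tcrit * se_b1, b[1] + tcrit * se_b1)
print(b, r_sq, root_mse, t, p, ci)
```

With these numbers the slope is about 1.54 and Rsq above .99, so the t test is strongly significant; the same quantities appear in REG's parameter-estimates table.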

2.2.2 adding options to label data and examine confidence intervals:

clm produces the 95% confidence interval lower and upper bounds for the mean predicted value calculated across all cases with the same value of x (think of this group of cases with the same score on x as a subpopulation).

cli produces the 95% prediction interval for an individual future observation of y at each value of x

Important ideas:

1. each value of x can be thought of as defining a population of cases (even if there happens to be only one observation, as in this case); one can estimate values, such as the mean and confidence interval of the predicted value, for each of these populations

2. clm gives the confidence interval for the predicted mean of the subpopulation defined by each value of x; cli limits are wider, because they add the variability of a single future y around that mean to the sampling variability of the estimated mean
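The width difference follows directly from the two interval formulas. A Python sketch with made-up numbers (the MARKET data are not reproduced here):

```python
import numpy as np
from scipy import stats

# Hypothetical data; shows why cli bands are wider than clm bands at the same x
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([12.0, 18.0, 29.0, 35.0, 44.0, 49.0])
n = len(y)
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.sum((y - X @ b) ** 2) / (n - 2)
C = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 15.0])            # predict at x = 15
yhat = x0 @ b
lev = x0 @ C @ x0                     # leverage term x0'(X'X)^-1 x0
tcrit = stats.t.ppf(0.975, n - 2)

se_mean = np.sqrt(mse * lev)          # clm: CI for the subpopulation mean
se_indiv = np.sqrt(mse * (1 + lev))   # cli: interval for one new observation
clm = (yhat - tcrit * se_mean, yhat + tcrit * se_mean)
cli = (yhat - tcrit * se_indiv, yhat + tcrit * se_indiv)
print(clm, cli)
```

The extra `1` inside `se_indiv` is the variance of a single y around its subpopulation mean; that is the only difference between the two intervals.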

2.2.3 Several independent variables

using the data set AUCTION, add the numbers of other
animals sold in the market to create a multiple regression model

proc reg data=auction;
model cost=cattle calves hogs sheep;
run;

understand and interpret:

the H0, the meaning of the F test, and its df

Rsq and adjustment to Rsq

parameter estimates, predicted values, estimated standard errors of coefficients, t and p

2.2.4 Sequential (SS1) and Partial (SS2) sums of squares

F tests can be based on any Type I or Type II SS, using MSE as the denominator
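Sequential SS can be computed by hand as the drop in error SS as each variable enters, in model-statement order. A sketch with simulated stand-in data (AUCTION is not reproduced here), using only two predictors for brevity:

```python
import numpy as np

# Type I (sequential) SS: each variable's SS is the reduction in error SS
# when it is added to the variables already in the model.
rng = np.random.default_rng(0)
n = 30
cattle = rng.uniform(0, 50, n)
calves = rng.uniform(0, 30, n)
cost = 5 + 1.2 * cattle + 0.8 * calves + rng.normal(0, 2, n)  # made-up model

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

ones = np.ones(n)
sse0 = sse(ones[:, None], cost)                            # intercept only
sse1 = sse(np.column_stack([ones, cattle]), cost)          # + cattle
sse2 = sse(np.column_stack([ones, cattle, calves]), cost)  # + calves

ss1_cattle = sse0 - sse1          # Type I SS for cattle
ss1_calves = sse1 - sse2          # Type I SS for calves
mse = sse2 / (n - 3)              # full-model MSE is the F denominator
f_calves = ss1_calves / mse
print(ss1_cattle, ss1_calves, f_calves)
```

Type II (partial) SS work the same way, except each variable is added last, after all the others.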

2.2.5 Tests for subsets and linear combinations

The test statement in REG after the model allows testing of hypotheses about
single coefficients or combinations, for example

model cost=cattle calves hogs sheep;
label_for_this_test: test hogs=0;  tests the hypothesis that the b for
hogs is zero
label2: test hogs=0, sheep=0;  performs two tests
label3: test intercept=0;  is a test that b0 is 0
label4: test hogs=1;  tests whether the hogs coefficient differs from 1
label5: test hogs-calves=.5;  tests whether the hogs coefficient minus the calves coefficient equals .5
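Behind each TEST statement is the general linear hypothesis L*beta = c. A sketch of that computation (Python, simulated stand-in data) for the last example, hogs - calves = .5:

```python
import numpy as np
from scipy import stats

# F test of the linear hypothesis L*beta = c; made-up data with two predictors
rng = np.random.default_rng(1)
n = 40
hogs = rng.uniform(0, 20, n)
calves = rng.uniform(0, 20, n)
cost = 3 + 1.0 * hogs + 0.5 * calves + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), hogs, calves])
b, *_ = np.linalg.lstsq(X, cost, rcond=None)
df_err = n - X.shape[1]
mse = np.sum((cost - X @ b) ** 2) / df_err
C = np.linalg.inv(X.T @ X)

L = np.array([[0.0, 1.0, -1.0]])     # hogs coefficient minus calves coefficient
c = np.array([0.5])                  # hypothesized value of the difference
diff = L @ b - c
F = (diff @ np.linalg.inv(L @ (mse * C) @ L.T) @ diff) / L.shape[0]
p = stats.f.sf(F, L.shape[0], df_err)
print(F, p)
```

Multiple equations in one TEST statement just add rows to L, giving a multi-df F test.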

2.2.6 Restricted models

like tests of particular coefficients, restrictions express hypotheses about parameters; RESTRICT re-estimates the model with those hypotheses imposed

model cost=cattle calves hogs sheep;
restrict intercept=0, hogs-sheep=0;

estimates the model with no intercept and with the two coefficients forced to be equal.

output provides a single degree of freedom test for the significance of the difference due to restriction, as opposed to freeing the parameter(s)

caution: when using the / noint option in PROC REG, R2 and other statistics are not correctly estimated.
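That single-df test amounts to comparing the error SS of the restricted and full models. A sketch (Python, simulated stand-in data) of one restriction, forcing two coefficients to be equal:

```python
import numpy as np
from scipy import stats

# Restricted vs. full model: the restriction hogs = sheep is imposed by
# replacing the two columns with their sum. Made-up data.
rng = np.random.default_rng(2)
n = 40
hogs = rng.uniform(0, 20, n)
sheep = rng.uniform(0, 20, n)
cost = 4 + 0.9 * hogs + 0.9 * sheep + rng.normal(0, 1, n)

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

ones = np.ones(n)
X_full = np.column_stack([ones, hogs, sheep])
X_restr = np.column_stack([ones, hogs + sheep])   # equal-coefficient model
sse_full = sse(X_full, cost)
sse_restr = sse(X_restr, cost)

df_err = n - X_full.shape[1]
F = (sse_restr - sse_full) / (sse_full / df_err)  # 1 restriction -> 1 df
p = stats.f.sf(F, 1, df_err)
print(F, p)
```

Since the restricted model is nested in the full model, its error SS can never be smaller; the F test asks whether the increase is more than chance.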

2.2.7. Exact linear dependency

In the AUCTION data set, volume is the sum of the four livestock types.
Entering it along with the four component parts creates an exact linear
dependency; SAS prints a warning message and sets one of the coefficients
to zero. This does not bias the remaining coefficients.
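The dependency shows up as a rank-deficient design matrix. A sketch with simulated stand-in data (note that NumPy's least-squares routine resolves the singularity by returning the minimum-norm solution rather than zeroing one coefficient, as SAS does; the fitted values are the same either way):

```python
import numpy as np

# Exact linear dependency: a "volume" column equal to the sum of four columns
rng = np.random.default_rng(3)
n = 25
parts = rng.uniform(0, 10, (n, 4))               # cattle, calves, hogs, sheep
volume = parts.sum(axis=1)                       # exactly the sum of the four
cost = 2 + parts @ np.array([1.0, 0.8, 0.6, 0.4]) + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), parts, volume])
print(np.linalg.matrix_rank(X), X.shape[1])      # rank 5, but 6 columns

b, res, rank, sv = np.linalg.lstsq(X, cost, rcond=None)
fitted = X @ b                                   # fitted values are unaffected
```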

Return to the chapter table of contents

2.3 The GLM Procedure

2.3.1. GLM for linear regression

proc glm;

model cost=cattle calves hogs sheep;

run;

produces the same output as REG

there are differences in the meaning of the SS types in some ANOVA models: Type III SS in GLM corresponds to Type II (partial) SS in REG.

2.3.2 Using contrast statements to test regression parameters

Contrasts test hypotheses about linear combinations of regression parameters

model ...

contrast 'contrast_name' effect values; for
example:

contrast 'hogcost=0' intercept 0 cattle 0 calves 0 hogs 1 sheep 0;

tests whether the coefficient for hogs is zero, ignoring the other
coefficients

contrast 'hogcost=sheepcost' hogs 1 sheep -1;

tests whether the difference between the coefficient for hogs and the coefficient for sheep is zero.

contrast 'hogcost=sheepcost=0' hogs 1, sheep 1;

jointly tests the two contrasts with a single 2 df F test: the hogs coefficient is zero and the sheep coefficient is zero.

output identifies each contrast by name, and provides a proper F test for it.

2.3.3. Using ESTIMATE to estimate linear combinations of parameters

ESTIMATE works the same as contrast, but creates linear parameter estimates and
tests them for significant difference from zero, for example...

model ....;
estimate 'hogcost=sheepcost' hogs 1 sheep -1;

takes the coefficient for hogs, subtracts the coefficient for sheep,
prints the difference with its standard error, and tests whether the difference is significant.
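What ESTIMATE reports can be sketched directly: the value of the linear combination, its standard error from the C matrix, and a t test against zero (Python, simulated stand-in data):

```python
import numpy as np
from scipy import stats

# Estimate of a linear combination L'b with its standard error and t test
rng = np.random.default_rng(4)
n = 40
hogs = rng.uniform(0, 20, n)
sheep = rng.uniform(0, 20, n)
cost = 4 + 1.2 * hogs + 0.7 * sheep + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), hogs, sheep])
b, *_ = np.linalg.lstsq(X, cost, rcond=None)
df_err = n - X.shape[1]
mse = np.sum((cost - X @ b) ** 2) / df_err
C = np.linalg.inv(X.T @ X)

L = np.array([0.0, 1.0, -1.0])       # 'hogcost=sheepcost': hogs 1 sheep -1
est = L @ b                          # estimated difference in coefficients
se = np.sqrt(mse * (L @ C @ L))      # its standard error
t = est / se
p = 2 * stats.t.sf(abs(t), df_err)
print(est, se, t, p)
```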

Return to the chapter table of contents

2.4 Statistical Background

2.4.1 the linear regression model

terminology for representing the equation in matrix form: **Y** = column
vector of Y scores for individuals; **X** = matrix of cases by x variables; **e**
= vector of residuals for the cases; **β** = vector of
regression parameters. Part of the OLS solution is (**X**'**X**)⁻¹,
often called **C**; multiplied by the MSE, it gives the **matrix of variances and covariances of the
regression parameter estimates**.
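The matrix-form solution can be verified on tiny numbers (Python sketch, made-up data):

```python
import numpy as np

# OLS in matrix form: b = (X'X)^-1 X'Y, with C = (X'X)^-1 so that
# MSE * C is the variance-covariance matrix of the estimates.
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0],
              [1.0, 8.0]])           # intercept column plus one x variable
Y = np.array([3.0, 6.0, 7.0, 10.0])

C = np.linalg.inv(X.T @ X)           # the C matrix
b = C @ X.T @ Y                      # OLS estimates: intercept 1.0, slope 1.1
e = Y - X @ b                        # residual vector
mse = e @ e / (len(Y) - X.shape[1])  # error SS / (n - m - 1)
cov_b = mse * C                      # var-cov matrix of the estimates
print(b, cov_b)
```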

2.4.2 partitioning sums of squares

total, model, and error SS. MSE = error SS / (n-m-1)

tests are often done by comparing the reduction in sums of squares of one model
relative to another in which it is nested, with df equal to the difference in number of parameters

2.4.3 hypothesis tests and confidence intervals

three most common procedures: testing that all slopes are simultaneously zero (the standard F
test); testing that some subset of parameters is zero (done by comparing the full
model to a reduced model); and constructing confidence intervals for a particular
sub-population defined by simultaneous scores on the several X variables.

2.4.4. using the generalized inverse

The generalized inverse is used when the model is under-identified, e.g., when all
values of a classification variable are entered as independent variables. Any number of
parameterizations are possible in this circumstance, but the SS error is not
affected by the parameterization; individual coefficients, and linear combinations of
them, are affected.
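The invariance of the error SS across parameterizations can be demonstrated directly (Python sketch, simulated data): a generalized-inverse fit of a rank-deficient X and a fit with the redundant column dropped give different coefficients but identical error SS.

```python
import numpy as np

# Two parameterizations of the same under-identified model
rng = np.random.default_rng(5)
n = 20
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
x3 = x1 + x2                         # exact dependency makes X'X singular
y = 1 + 2 * x1 + 3 * x2 + rng.normal(0, 0.5, n)

X_full = np.column_stack([np.ones(n), x1, x2, x3])
b_g = np.linalg.pinv(X_full) @ y     # generalized-inverse (minimum-norm) fit
sse_g = np.sum((y - X_full @ b_g) ** 2)

X_drop = np.column_stack([np.ones(n), x1, x2])   # drop the redundant column
b_d, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
sse_d = np.sum((y - X_drop @ b_d) ** 2)
print(sse_g, sse_d)                  # identical error SS
```

Both design matrices span the same column space, so the fitted values, and hence the error SS, must agree even though the coefficient vectors differ.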

Return to the chapter table of contents

Data sets

MARKET.sav

market.sas

AUCTION.sav

Return to the chapter table of contents

return to Linear Models home page