Agenda

  • Linear Regression
  • Bank Salaries Example
  • Likelihood
  • Reference and Conjugate Priors

Linear Regression

  • Linear regression is about predicting a continuous response variable \(y\) from one or more predictor variables \(x\).
  • The linear regression model specifies the mean of the response variable as a linear combination of the predictor variables, i.e. \(E[y | x] = x′\beta\).
  • Although \(x\) is often random, we always condition on it, hence we treat it as fixed.
  • Since the covariates can be continuous or categorical variables, standard ANOVA and ACOVA models are special cases of these linear models.
  • Theoretical results rely heavily on multivariate normality.

Bank Salaries Example

  • Schafer (1987) reported data from 1977 on 93 bank employees.
  • Response variable is beginning salary in dollars.
  • Along with the intercept, there are four covariates:
    • \(x_2\): sex, an indicator variable for male.
    • \(x_3\): years of education.
    • \(x_4\): months of experience.
    • \(x_5\): time at hiring as measured in months after January 1, 1969.

Bank Salaries Example

# Read the 1977 bank salary data from the BIDA course website
Bankdata <- read.table(file="http://www.ics.uci.edu/~wjohnson/BIDA/Ch9/BankSalaryData.txt",
                       header=TRUE, sep="")
attach(Bankdata)
# Use treatment contrasts for any factors (R's default coding)
options(contrasts=c("contr.treatment","contr.poly"))
head(Bankdata)
##   Sex BegSal Educ Exper Time
## 1   1   3900   12   0.0    1
## 2   2   4620   12  11.5   22
## 3   1   4020   10  44.0    7
## 4   2   5040   15  14.0    3
## 5   1   4290   12   5.0   30
## 6   2   5100   12 180.0   15

Bank Salaries Example

The Sampling Model

  • Typical linear model specifies that for known predictor variables \(x_i′ = (1, x_{i2}, \dots, x_{ir})\) and unknown regression coefficients \(\beta = (\beta_1, \beta_2, \dots, \beta_r)′\), \[ \begin{aligned} y_i & = \beta_1 + \beta_2 x_{i2} + \dots + \beta_r x_{ir} + \epsilon_i \\ & = x_i'\beta + \epsilon_i \\ \epsilon_i | \beta , \tau & \stackrel{ind}{\sim} N(0, 1 / \tau), \end{aligned} \] for \(i = 1, \dots, n\).
  • Equivalently, \[ y_i | \beta , \tau \stackrel{ind}{\sim} N(x_i'\beta, 1 / \tau). \]

The Sampling Model

  • Model can be written completely in matrix form as \[ \begin{aligned} \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} & = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1r} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2r} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \dots & x_{nr} \end{bmatrix} \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{r} \end{bmatrix} & + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} \\ \\ Y_{n \times 1} & = \quad \quad \quad \quad X_{n \times r} \quad \quad \quad \quad \quad \beta_{r \times 1} & + \epsilon_{n \times 1}. \end{aligned} \]
  • Common for the first column of \(X\) to consist of 1s in order to accommodate an intercept parameter, but there is no requirement for it.

The Sampling Model

  • Conditions on the \(\epsilon\)s imply that \(E[\epsilon]=0\) and \(\text{Cov} [\epsilon]= \tau^{-1} I_n\), where \(I_n\) is the \(n \times n\) identity matrix.
  • Moreover \[ \epsilon | \tau \sim N_n(0, \tau^{-1} I_n). \]
  • Then in matrix notation the multiple linear regression model is \[ Y = X \beta + \epsilon \quad \text{ and } \quad \epsilon | \tau \sim N_n(0, \tau^{-1} I_n). \]
  • What makes this specifically a regression model (rather than an arbitrary linear model) is that \(X\) has full column rank, i.e., \(\text{rank}(X) = r\).
  • Unless \(n \ge r\), the problem degenerates because there are more parameters than observations and \(\text{rank}(X) < r\).
  • Further, it is easy to show \[ Y | \beta, \tau \sim N_n(X \beta, \tau^{-1} I_n). \]
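
A tiny generative sketch of this sampling model may help fix ideas (all values here are hypothetical, not the bank data):

# Simulate Y = X beta + eps with eps | tau ~ N(0, (1/tau) I)
set.seed(1)
n.sim <- 50
X.sim <- cbind(1, runif(n.sim, 0, 10))     # intercept plus one predictor
beta.sim <- c(2, 3); tau.sim <- 4          # hypothetical true values
y.sim <- as.vector(X.sim %*% beta.sim +
                   rnorm(n.sim, sd = 1 / sqrt(tau.sim)))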

Likelihood

  • The likelihood as a function of \(\beta\) and \(\tau\) is given by \[ \begin{aligned} L(\beta, \tau) & \propto \prod_{i=1}^{n} \tau^{1/2} \exp \left\{ -\frac{\tau}{2} (y_i - x_i'\beta)^2 \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} . \end{aligned} \]
  • One can show the maximum likelihood estimate (MLE) of \(\beta\) is \[ \hat{\beta} = (X'X)^{-1} X'Y \] and that the MLE of \(\tau\) is \(\hat{\tau} = n / SSE\) where \[ SSE = (Y - X\hat{\beta})'(Y - X\hat{\beta}) \] is the sum of squares for the error (checked numerically in the sketch below).
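
A minimal R sketch computing these MLEs directly from the matrix formulas; it assumes the Bankdata columns loaded earlier and that Sex = 2 codes male in this file:

# Design matrix with intercept column; closed-form MLEs
X <- cbind(1, Sex == 2, Educ, Exper, Time)   # assumes Sex = 2 means male
y <- BegSal
n <- length(y); r <- ncol(X)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'Y
SSE <- sum((y - X %*% beta.hat)^2)           # error sum of squares
tau.hat <- n / SSE                           # MLE of tau
MSE <- SSE / (n - r)                         # unbiased estimate of sigma^2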

Likelihood

  • Standard unbiased (frequentist) estimator of \(\sigma^2 = 1/\tau\) is the mean squared error (MSE) defined by \[ MSE = SSE / dfe \text{ where } dfe = (n-r) \] is the error degrees of freedom.
  • Simple linear regression predicts a continuous (measurement) response using only an intercept and one predictor variable, so for simple linear regression the likelihood is a function of the 3 parameters \(\beta_1\), \(\beta_2\), and \(\tau\).
  • Multiple linear regression simply has more than one predictor variable in which case the likelihood is a function of the \(r+1\) parameters \(\theta = (\beta_1, \dots, \beta_r, \tau)'\).

Prior Specification

Need to specify a prior for \(\theta' = ( \beta' , \tau)\).

  1. Standard improper reference (SIR) prior.
  2. Diffuse proper reference prior.
  3. Conjugate normal-gamma prior.
  4. Informative prior with \(\beta\) and \(\tau\) independent.
  5. Partially informative prior.

Reference Priors

  • The standard improper reference prior (SIR) for linear regression analysis is \[ p(\beta, \tau) \propto 1 / \tau \]
  • Least squares estimates are also the estimates that correspond to this prior, i.e. the frequentist regression estimates given by virtually all computer packages (illustrated in the sketch below).
  • However, there is clearly a difference in interpretation.
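
Because the SIR prior reproduces least squares numerically, the posterior means and scales in the next table can be read off a standard lm() fit. A sketch, assuming Bankdata is loaded and that Sex = 2 codes male:

# Frequentist fit whose estimates and standard errors match the
# posterior means and scales under the SIR prior
fit <- lm(BegSal ~ I(Sex == 2) + Educ + Exper + Time, data = Bankdata)
summary(fit)     # coefficient table: posterior means and scales
confint(fit)     # 95% limits: match the 2.5% and 97.5% columns below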

Bank Salaries Example

  • Posterior results using a SIR prior for a regression analysis of the salary data of bank employees in 1977 are given below.
  • In addition to inferences for \(\beta\) and \(\tau\), we include predictive inference for a new observation \(y_f\) with covariate vector corresponding to a male with 16 years of school, 54.5 months of experience, and hired 33 months after the start date for the study, i.e., \(x_f′ = (1, 1, 16, 54.5, 33.0)\) with corresponding mean \(x_f′ \beta\).

Bank Salaries Example

Parameter        Mean       Scale      2.5%      97.5%
Constant         3526.4     327.2      2875.2    4177.6
Sex              722.3      117.8      488.2     956.4
Educ             90.02      24.69      40.95     139.09
Exp              1.27       0.59       0.10      2.43
Time             23.43      5.20       13.09     33.76
\(x_f'\beta\)    6531.2     145.7      6241.6    6820.8
\(y_f\)          6531.2     527.9      5482.0    7580.4
\(\tau\)         3.88e-6    0.585e-6   2.82e-6   5.11e-6

Bank Salaries Example

  • All the parameters, except \(\tau\), have marginal posterior \(t(88)\) distributions and the posterior means and scales are exactly as in least squares output.
  • For parameters with \(t\) distributions, report the scale parameter.
  • Standard deviation is \(\sqrt{dfe / (dfe - 2)}\) times the scale parameter.
  • Percentiles are obtained as Mean \(\pm 1.987 \, (\text{Scale})\) where \(1.987 = t(0.975,88)\) (checked numerically below).
  • \(R^2 = 0.51 = 51\%\) meaning 51% of the variability in the salaries is explained by this regression.
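
For example, the Educ row of the table follows from this rule (a quick R check):

qt(0.975, df = 88)                           # 1.987, the t multiplier
90.02 + c(-1, 1) * qt(0.975, 88) * 24.69     # reproduces (40.95, 139.09)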

Bank Salaries Example

  • A superficial interpretation of the output suggests that we are 95% sure a male makes, on average, between $488 and $956 more than his female counterpart who has the same education, experience, and month of hiring.
  • We are also 95% sure that each additional year of education corresponds to a higher starting salary, on average, of between $41 to $139 with other variables fixed, and so on.
  • From the 95% prediction interval, we are 95% sure that this future individual will make a salary between $5,482 and $7,580.
  • For inference about \(\sigma\), we start with the 95% PI for \(\tau\) and convert it into a 95% PI for \(\sigma\) by taking reciprocal square roots of the endpoints, which reverses their order (see below).
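
Concretely, using the \(\tau\) limits from the table (a quick R check):

tau.PI <- c(2.82e-6, 5.11e-6)        # 95% PI for tau
sigma.PI <- rev(1 / sqrt(tau.PI))    # sigma = 1/sqrt(tau); order reverses
sigma.PI                             # roughly (442, 596) dollars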

Posterior Analysis

  • Recall \[ \hat{\beta} = (X'X)^{-1} X'Y . \]
  • Then one can rewrite the likelihood as \[ \begin{aligned} L(\beta, \tau) & \propto \tau^{n/2} \exp \left\{ -\frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} \left[ (X \hat{\beta} - X\beta)'(X \hat{\beta} - X\beta) + (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right] \right\} \\ & = \tau^{r/2} \exp \left\{ -\frac{\tau}{2} (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta) \right\} \\ & \quad \times \tau^{(n-r)/2} \exp \left\{ -\frac{\tau}{2} (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right\}. \end{aligned} \]

Posterior Analysis

  • Using the SIR prior \(p(\beta, \tau) \propto 1/\tau\), the posterior is of the form \[ \begin{aligned} p(\beta, \tau | Y) & \propto \tau^{r/2} \exp \left\{ -\frac{\tau}{2} (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta) \right\} \\ & \quad \times \tau^{(n-r)/2 - 1} \exp \left\{ -\frac{\tau}{2} (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right\}. \end{aligned} \]
  • First term is proportional to a multivariate normal density for \(\beta\) conditional on \(\tau\) and the second term is proportional to a Gamma density for \(\tau\).
  • In other words, \[ p(\beta, \tau | Y) = p(\beta | \tau, Y) p(\tau | Y). \]

Posterior Analysis

  • The full conditionals are known, i.e. \[ \begin{aligned} \beta | \tau, Y & \sim N_r \left( \hat{\beta} , (\tau X'X)^{-1} \right) = N_r \left( \hat{\beta} , \sigma^2 (X'X)^{-1} \right) \\ \tau | Y & \sim \text{Gamma} \left( \frac{n-r}{2} , \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{2} \right). \end{aligned} \]
  • By standardization it follows that \[ \frac{c'\beta - c'\hat{\beta}}{\sqrt{\sigma^2 c' (X'X)^{-1}c}} \Big\vert \tau , Y \sim N(0,1) \] and \[ \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{\sigma^2} \Big\vert Y = (SSE) \tau \Big\vert Y \sim \text{Gamma} \left( \frac{n-r}{2} , \frac{1}{2} \right) = \chi^2_{n-r}. \]
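
The Gamma–chi-squared identity used here, \(\text{Gamma}(\nu/2, 1/2) = \chi^2_\nu\), can be checked numerically (with \(\nu = n - r = 88\) as in the bank data):

x <- c(50, 88, 120)
all.equal(dgamma(x, shape = 88 / 2, rate = 1 / 2),
          dchisq(x, df = 88))        # TRUE: the densities coincide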

Posterior Analysis

  • Classical arguments yield \[ \frac{c'\beta - c'\hat{\beta}}{\sqrt{MSE c' (X'X)^{-1}c}} \Big\vert Y \sim t(n-r) . \]
  • When predicting a new observation \(y_f\) with covariate vector \(x_f\), we have \[ \frac{y_f - x_f'\hat{\beta}}{\sqrt{MSE [1 + x_f' (X'X)^{-1}x_f ]}} \Big\vert Y \sim t(n-r) . \]
  • This prior gives posterior means and posterior probability intervals for linear functions of the regression coefficients and for new observations that agree numerically with the least squares estimates and frequentist confidence intervals (see the predict() sketch below).
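
These t results are exactly what R's predict() reports for a least squares fit; a sketch reproducing the intervals for \(x_f\), assuming the fit object from the earlier lm() sketch:

# 95% interval for the mean x_f'beta, and 95% prediction interval for y_f
newx <- data.frame(Sex = 2, Educ = 16, Exper = 54.5, Time = 33.0)
predict(fit, newdata = newx, interval = "confidence")   # for x_f'beta
predict(fit, newdata = newx, interval = "prediction")   # for y_f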

Posterior Analysis

  • Probability intervals for the variance or precision are numerically equivalent to the frequentist intervals.
  • As far as standard point or interval estimation is concerned, Bayesians have little problem with frequentist normal theory regression results.
  • However, Bayesian testing results are typically quite different from Neyman-Pearson testing results.
  • Inference for nonlinear functions of regression coefficients, e.g., \(c_1′ \beta / c_2′ \beta\), is more difficult for frequentists and is typically based on large sample approximations. Bayesian estimation is straightforward if it is based on a Monte Carlo sample from the posterior distribution.
  • Simulation from this posterior can be accomplished using the method of composition, i.e. generate a value \(\tau_k\) from its marginal Gamma posterior and then sample \(\beta_k\) from the conditional normal posterior (sketched below).
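
A minimal R sketch of the method of composition under the SIR prior, assuming X, y, beta.hat, SSE, n, and r from the earlier sketch (mvrnorm is from the MASS package):

library(MASS)                        # for mvrnorm

XtXi <- solve(t(X) %*% X)
K <- 10000
tau.samp <- rgamma(K, shape = (n - r) / 2, rate = SSE / 2)    # tau | Y
beta.samp <- t(sapply(tau.samp, function(tau)
  mvrnorm(1, mu = drop(beta.hat), Sigma = XtXi / tau)))       # beta | tau, Y

# Monte Carlo inference for a nonlinear function, e.g. beta_3 / beta_2
ratio <- beta.samp[, 3] / beta.samp[, 2]
quantile(ratio, c(0.025, 0.975))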

Proper Reference Prior

  • A commonly used reference prior that is proper approximates the SIR prior, namely \[ \beta_j \stackrel{iid}{\sim} N(0,b) \perp \tau \sim \text{Gamma}(c,c), \] where \(b\) is large and \(c\) is small.
  • A standard choice has been \(b = 10^6\) with \(c = 10^{-3}\).
  • The prior approximates the SIR prior since the kernel of a \(\text{Gamma}(c,c)\) density is \(\tau^{c-1}e^{-c\tau}\), so for small \(c\) we have \(p(\tau) \approx C/\tau\) where \(C\) is a constant (see the numeric check below). Similarly, \(p(\beta)\) is approximately uniform when \(b\) is large.
  • Actual values of \(b\) and \(c\) in the normal and gamma distributions may be varied depending on the scale of the measurements being considered.
  • This prior can be used in the primary data analysis or in a sensitivity analysis when an informative prior is available.
  • Posterior analysis must be obtained by simulation, in particular with the MCMC techniques discussed later.
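
A quick numeric illustration that \(\text{Gamma}(c,c)\) is nearly proportional to \(1/\tau\) for small \(c\): if \(p(\tau) \approx C/\tau\), then \(\tau \, p(\tau)\) should be roughly constant.

c0 <- 1e-3
tau <- 10^seq(-6, 0, by = 1)               # spans several orders of magnitude
dgamma(tau, shape = c0, rate = c0) * tau   # roughly constant, about 1e-3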

Conjugate Priors

  • We do not actually recommend conjugate priors for linear regression because they are hard to elicit.
  • Nonetheless, the mathematical results are pretty, and they have historical and practical significance.
  • Practical significance arises in Bayesian nonparametric analysis involving Dirichlet process mixture models (Chapter 15 of our text).

Conjugate Priors

  • With sampling distribution \[ Y | \beta, \tau \sim N_n (X\beta, (1/\tau)I_n) \] the conjugate prior is \[ \beta|\tau \sim N_r(\beta_0, (1/\tau)C_0), \quad \tau \sim \text{Gamma}(a,b). \]
  • Equivalently we can specify the prior \[ \tilde{X} \beta | \tau \sim N_r (\tilde{Y}, (1/\tau)D(\tilde{w})), \] with a known nonsingular \(r \times r\) matrix \(\tilde{X}\), an \(r\) vector \(\tilde{Y}\), and a diagonal weight matrix \(D(\tilde{w})\).
  • The two forms for the prior are actually equivalent with \[ \beta_0 = \tilde{X}^{-1} \tilde{Y}, \quad C_0 = \{ \tilde{X}' [D(\tilde{w})]^{-1} \tilde{X} \}^{-1} \] and a somewhat more complicated and non-unique procedure for determining \(\tilde{X}\), \(\tilde{Y}\), and \(D(\tilde{w})\) from \(\beta_0\) and \(C_0\).

Conjugate Priors

  • With the second form of prior for \(\beta\), one can establish \[ \beta | \tau,Y \sim N_r \left( \hat{\beta}, \left[ \tau \left( X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X} \right) \right]^{-1} \right), \] where \[ \hat{\beta} = (X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X})^{-1} [X'Y + \tilde{X}'D^{-1}(\tilde{w}) \tilde{Y}]. \]

Conjugate Priors

  • One can also show that, marginally, \[ \tau | Y \sim \text{Gamma}(Bdfe/2, (Bdfe)BMSE/2) \] where \(Bdfe = n+2a\) and \[ BMSE = \frac{(Y-X \hat{\beta})'(Y-X \hat{\beta}) + (\tilde{Y}-\tilde{X} \hat{\beta})' D^{-1}(\tilde{w}) (\tilde{Y}-\tilde{X} \hat{\beta}) + 2b}{Bdfe} \]
  • One can regard \(2a\) as the number of prior observations for \(\tau\) because of its role relative to \(n\) in \(Bdfe\).

Conjugate Priors

  • Given these results, the marginal posterior is \[ \frac{c'\beta - c' \hat{\beta}}{\sqrt{BMSE c'[X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X}]^{-1} c}} \Big\vert Y \sim t(Bdfe). \]
  • For predicting a new observation \(y_f\) with covariate vector \(x_f\), we get the predictive distribution \[ \frac{y_f - x_f' \hat{\beta}}{\sqrt{BMSE \left( 1 + x_f'[X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X}]^{-1} x_f \right)}} \Big\vert Y \sim t(Bdfe). \]
  • Much of the analysis can be obtained by making minor changes to the output from a weighted least squares regression program.

Bank Salaries Example

  • Specified five covariate vectors to generate a conjugate prior. Combined they are \[ \tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \\ \tilde{x}_5' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 12 & 0 & 0 \\ 1 & 1 & 12 & 0 & 0 \\ 1 & 0 & 15 & 0 & 0 \\ 1 & 0 & 15 & 100 & 0 \\ 1 & 0 & 12 & 0 & 24 \\ \end{bmatrix} . \]
  • The vector \(\tilde{x}_1' = (1,0,12,0,0)\) corresponds to a female with 12 years of education, no previous experience, and starting work on January 1, 1969.
  • From this point, \(\tilde{x}_2\) changes the sex, \(\tilde{x}_3\) changes the education, \(\tilde{x}_4\) changes the experience, and \(\tilde{x}_5\) changes the time hired.

Bank Salaries Example

  • Thinking about the mean salary for each set of covariates, we chose best guesses \(\tilde{Y}' = (4000, 4500, 5000, 5500, 6000)\) for the corresponding means. These reflect a prior belief that starting salaries are higher for equally qualified men than for women; that salary increases with education, experience, and time; and that three years of education at the beginning of the study has equivalent worth to having been employed at a two-year later date.
  • Weights \(\tilde{w}_i\) are all chosen to be 0.4, so that the prior carries the same weight as 2 = 5(0.4) sampled observations.
  • In other words, \[ \begin{bmatrix} 1 & 0 & 12 & 0 & 0 \\ 1 & 1 & 12 & 0 & 0 \\ 1 & 0 & 15 & 0 & 0 \\ 1 & 0 & 15 & 100 & 0 \\ 1 & 0 & 12 & 0 & 24 \\ \end{bmatrix} \beta \ \Big\vert \ \tau \sim N \left( \begin{bmatrix} 4000 \\ 4500 \\ 5000 \\ 5500 \\ 6000 \end{bmatrix}, \frac{1}{\tau} \begin{bmatrix} 0.4 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0 & 0 \\ 0 & 0 & 0 & 0.4 & 0 \\ 0 & 0 & 0 & 0 & 0.4 \\ \end{bmatrix} \right) . \]
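
Translating this prior into the \((\beta_0, C_0)\) form is a one-line inversion; an R sketch using the matrices above:

Xt <- rbind(c(1, 0, 12,   0,  0),
            c(1, 1, 12,   0,  0),
            c(1, 0, 15,   0,  0),
            c(1, 0, 15, 100,  0),
            c(1, 0, 12,   0, 24))
Yt <- c(4000, 4500, 5000, 5500, 6000)
Dw <- diag(0.4, 5)
beta0 <- solve(Xt, Yt)                   # prior mean: beta_0 = Xt^{-1} Yt
C0 <- solve(t(Xt) %*% solve(Dw) %*% Xt)  # C_0 = {Xt' D^{-1} Xt}^{-1}
beta0   # (0, 500, 333.33, 5, 83.33): implied prior coefficient means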

Bank Salaries Example

  • Combining this prior for \(\beta\) given \(\tau\) with the improper prior \[ p(\tau) \propto 1/\tau, \] the posterior results are as if \(a = b = 0\) in the Gamma prior.
  • Posterior results can be obtained from a weighted least squares regression on the 93 original observations taken together with the 5 prior observations, with weights determined by the diagonal elements of the prior covariance matrix (sketched below).
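
A sketch of that augmented weighted regression, assuming X and y from the earlier sketch and Xt, Yt from the previous slide; each prior row gets weight \(1/0.4 = 2.5\) since lm() weights are reciprocal variances:

X.aug <- rbind(X, Xt)                         # 93 data rows + 5 prior rows
y.aug <- c(y, Yt)
w.aug <- c(rep(1, nrow(X)), rep(1 / 0.4, 5))  # data weight 1, prior weight 2.5
fit.c <- lm(y.aug ~ X.aug - 1, weights = w.aug)
summary(fit.c)   # with a = b = 0, matches the conjugate-prior table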

Bank Salaries Example

Parameter        Mean       Scale      2.5%      97.5%
Constant         3470.4     320.1      2834.7    4106.0
Sex              708.5      114.4      481.4     935.7
Educ             93.35      24.08      45.54     141.17
Exp              1.35       0.57       0.21      2.49
Time             23.84      5.00       13.90     33.77
\(x_f'\beta\)    6532.7     142.9      6248.9    6816.5
\(y_f\)          6532.7     519.3      5501.4    7564.0
\(\tau\)         4.01e-6    0.588e-6   2.94e-6   5.24e-6

Bank Salaries Example

  • For everything except \(\tau\), the tabulated means and scales are exactly as in weighted least squares output.
  • Report posterior scale parameters for the \(t(93)\) distributions rather than the posterior standard deviations (now \(t(0.975,93) = 1.9858\)).
  • Inferences are similar to those with the SIR prior. Slight changes in estimated regression coefficients.
  • Posterior scale factors are uniformly slightly smaller here. The implication is that the data and the prior information are not obviously inconsistent with one another and that the prior information actually reduces posterior uncertainty, although only slightly.

Next Time

  • Independence Priors
  • Gibbs Sampling
  • FEV Data Example