- Linear Regression
- Bank Salaries Example
- Likelihood
- Reference and Conjugate Priors

- Linear regression is about predicting a continuous response variable \(y\) from one or more predictor variables \(x\).
- The linear regression model specifies the mean of the response variable as a linear combination of the predictor variables, i.e. \(E[y | x] = x′\beta\).
- Although \(x\) is often random, we always condition on it, hence we treat it as fixed.
- Since covariate combinations can consist of continuous variables or categorical variables, standard ANOVA and ACOVA models are special cases of these linear models.
- Theoretical results rely heavily on multivariate normality.

- Schafer (1987) reported data from 1977 on 93 bank employees.
- Response variable is beginning salary in dollars.
- Along with the intercept, there are four covariates:
- \(x_2\): sex, an indicator variable for `male`.
- \(x_3\): years of education.
- \(x_4\): months of experience.
- \(x_5\): time at hiring as measured in months after January 1, 1969.

```r
Bankdata <- read.table(file="http://www.ics.uci.edu/~wjohnson/BIDA/Ch9/BankSalaryData.txt",
                       header=TRUE, sep="")
attach(Bankdata)
options(contrasts=c("contr.treatment", "contr.poly"))
head(Bankdata)
```

```
##   Sex BegSal Educ Exper Time
## 1   1   3900   12   0.0    1
## 2   2   4620   12  11.5   22
## 3   1   4020   10  44.0    7
## 4   2   5040   15  14.0    3
## 5   1   4290   12   5.0   30
## 6   2   5100   12 180.0   15
```

- Typical linear model specifies that for known predictor variables \(x_i′ = (1, x_{i2}, \dots, x_{ir})\) and unknown regression coefficients \(\beta = (\beta_1, \beta_2, \dots, \beta_r)′\), \[ \begin{aligned} y_i & = \beta_1 + \beta_2 x_{i2} + \dots + \beta_r x_{ir} + \epsilon_i \\ & = x_i'\beta + \epsilon_i \\ \epsilon_i | \beta , \tau & \stackrel{ind}{\sim} N(0, 1 / \tau), \end{aligned} \] for \(i = 1, \dots, n\).
- Equivalently, \[ y_i | \beta , \tau \stackrel{ind}{\sim} N(x_i'\beta, 1 / \tau). \]

- Model can be written completely in matrix form as \[ \begin{aligned} \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} & = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1r} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2r} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \dots & x_{nr} \end{bmatrix} \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{r} \end{bmatrix} & + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} \\ \\ Y_{n \times 1} & = \quad \quad \quad \quad X_{n \times r} \quad \quad \quad \quad \quad \beta_{r \times 1} & + \epsilon_{n \times 1}. \end{aligned} \]
- Common for the first column of \(X\) to consist of 1s in order to accommodate an intercept parameter but there is no requirement for it.

- Conditions on the \(\epsilon\)s imply that \(E[\epsilon]=0\) and \(\text{Cov} [\epsilon]= \tau^{-1} I_n\), where \(I_n\) is the \(n \times n\) identity matrix.
- Moreover \[ \epsilon | \tau \sim N_n(0, \tau^{-1} I_n). \]
- Then in matrix notation the multiple linear regression model is \[ Y = X \beta + \epsilon \quad \text{ and } \quad \epsilon | \tau \sim N_n(0, \tau^{-1} I_n). \]
- What makes this specifically a regression model (rather than an arbitrary linear model) is that \(X\) has full column rank, i.e., \(\text{rank}(X) = r\).
- Unless \(n \ge r\), the problem degenerates: there are more parameters than observations, so \(\text{rank}(X) \le n < r\).
- Further, it is easy to show \[ Y | \beta, \tau \sim N_n(X \beta, \tau^{-1} I_n). \]
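As a quick numerical sketch of this sampling model (in Python, with made-up values for \(\beta\) and \(\tau\); the names and numbers are all illustrative), one can simulate \(Y = X\beta + \epsilon\) directly:

```python
import numpy as np

# Hypothetical small design: an intercept column of 1s plus one continuous predictor.
rng = np.random.default_rng(0)
n, r = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta = np.array([2.0, 0.5])  # illustrative "true" regression coefficients
tau = 4.0                    # precision, so sigma = 1/sqrt(tau) = 0.5

# Y = X beta + eps, with eps | tau ~ N_n(0, (1/tau) I_n)
eps = rng.normal(0.0, 1.0 / np.sqrt(tau), size=n)
Y = X @ beta + eps
print(Y.shape)
```

Conditionally on \(X\), each simulated \(y_i\) is an independent draw from \(N(x_i'\beta, 1/\tau)\), matching the model above.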

- The likelihood as a function of \(\beta\) and \(\tau\) is given by \[ \begin{aligned} L(\beta, \tau) & \propto \prod_{i=1}^{n} \tau^{1/2} \exp \left\{ -\frac{\tau}{2} (y_i - x_i'\beta)^2 \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} . \end{aligned} \]
- One can show the maximum likelihood estimate (MLE) of \(\beta\) is \[
\hat{\beta} = (X'X)^{-1} X'Y
\] and that the MLE of \(\tau\) is \(\hat{\tau} = n / SSE\), where \[
SSE = (Y - X\hat{\beta})'(Y - X\hat{\beta})
\] is the **sum of squares for the error**.

- Standard unbiased (frequentist) estimator of \(\sigma^2 = 1/\tau\) is the mean squared error (MSE) defined by \[
MSE = SSE / dfe \text{ where } dfe = (n-r)
\] is the **degrees of freedom for error**.
- Simple linear regression predicts a continuous (measurement) response using only an intercept and one predictor variable, so for simple linear regression the likelihood is a function of the 3 parameters \(\beta_1\), \(\beta_2\), and \(\tau\).
- Multiple linear regression simply has more than one predictor variable, in which case the likelihood is a function of the \(r+1\) parameters \(\theta = (\beta_1, \dots, \beta_r, \tau)'\).
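The estimators above are easy to check numerically. The following Python sketch (on simulated data, so every number is illustrative) computes \(\hat{\beta}\), \(SSE\), \(\hat{\tau}\), and \(MSE\), and cross-checks \(\hat{\beta}\) against a least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
Y = X @ beta_true + rng.normal(0.0, 0.5, size=n)

# MLE of beta via the normal equations: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

resid = Y - X @ beta_hat
SSE = resid @ resid        # sum of squares for the error
tau_hat = n / SSE          # MLE of tau
MSE = SSE / (n - r)        # unbiased estimator of sigma^2 = 1/tau

# Cross-check: the MLE agrees with ordinary least squares.
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_ls))  # True
```

Note the MLE \(\hat{\tau} = n/SSE\) and the unbiased \(1/MSE = (n-r)/SSE\) differ only in the degrees-of-freedom correction.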

Need to specify a prior for \(\theta' = ( \beta' , \tau)\).

- Standard improper reference (SIR) prior.
- Diffuse proper reference prior.
- Conjugate normal-gamma prior.
- Informative prior with \(\beta\) and \(\tau\) independent.
- Partially informative prior.

- The standard improper reference prior (SIR) for linear regression analysis is \[ p(\beta, \tau) \propto 1 / \tau \]
- The estimates corresponding to this prior are precisely the least squares estimates, i.e. the frequentist regression estimates given by virtually all computer packages.
- However, there is clearly a difference in interpretation.

- Posterior results using a SIR prior for a regression analysis of the salary data of bank employees in 1977 are given below.
- In addition to inferences for \(\beta\) and \(\tau\), we include predictive inference for a new observation \(y_f\) with covariate vector corresponding to a male with 16 years of school, 54.5 months of experience, and hired 33 months after the start date for the study, i.e., \(x_f′ = (1, 1, 16, 54.5, 33.0)\) with corresponding mean \(x_f′ \beta\).

Parameters | Mean | Scale | 2.5% | 97.5% |
---|---|---|---|---|
Constant | 3526.4 | 327.2 | 2875.2 | 4177.6 |
Sex | 722.3 | 117.8 | 488.2 | 956.4 |
Educ | 90.02 | 24.69 | 40.95 | 139.09 |
Exp | 1.27 | 0.59 | 0.10 | 2.43 |
Time | 23.43 | 5.20 | 13.09 | 33.76 |
\(x_f′ \beta\) | 6531.2 | 145.7 | 6241.6 | 6820.8 |
\(y_f\) | 6531.2 | 527.9 | 5482.0 | 7580.4 |
\(\tau\) | 3.88e-6 | 0.585e-6 | 2.82e-6 | 5.11e-6 |

- All the parameters, except \(\tau\), have marginal posterior \(t(88)\) distributions and the posterior means and scales are exactly as in least squares output.
- For parameters with \(t\) distributions, report the scale parameter.
- Standard deviation is \(\sqrt{dfe / (dfe - 2)}\) times the scale parameter.
- Percentiles are obtained as Mean \(\pm 1.987 (Scale)\) where \(1.987 = t(0.975,88)\).
- \(R^2 = 0.51 = 51\%\) meaning 51% of the variability in the salaries is explained by this regression.
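For instance, the Educ row of the table can be reproduced from its posterior mean and scale; a short sketch using `scipy.stats.t`:

```python
from scipy import stats

# Educ row: marginal posterior is t(88) with mean 90.02 and scale 24.69.
mean, scale, dfe = 90.02, 24.69, 88

tq = stats.t.ppf(0.975, dfe)             # about 1.987
lo, hi = mean - tq * scale, mean + tq * scale

# Posterior standard deviation = sqrt(dfe / (dfe - 2)) times the scale.
sd = (dfe / (dfe - 2)) ** 0.5 * scale
print(round(lo, 2), round(hi, 2))        # recovers the 2.5% and 97.5% columns
```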

- A superficial interpretation of the output suggests that we are 95% sure a male makes, on average, between $488 and $956 more than his female counterpart who has the same education, experience, and month of hiring.
- We are also 95% sure that each additional year of education corresponds to a higher starting salary, on average, of between $41 to $139 with other variables fixed, and so on.
- From the 95% prediction interval, we are 95% sure that the future individual will make a salary between $5,482 and $7,580.
- For inference about \(\sigma\), we start with the 95% PI for \(\tau\) and convert it into a 95% PI for \(\sigma\) by taking inverse square roots.
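Using the \(\tau\) row of the table above, the conversion amounts to:

```python
# 95% PI for tau from the table; since sigma = 1/sqrt(tau) is decreasing in tau,
# the endpoints reverse order after transforming.
tau_lo, tau_hi = 2.82e-6, 5.11e-6
sigma_lo, sigma_hi = tau_hi ** -0.5, tau_lo ** -0.5
print(sigma_lo, sigma_hi)  # roughly $442 to $595
```

This works because \(\tau \mapsto 1/\sqrt{\tau}\) is monotone, so probability intervals transform directly.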

- Recall \[ \hat{\beta} = (X'X)^{-1} X'Y . \]
- Then one can rewrite the likelihood as \[ \begin{aligned} L(\beta, \tau) & \propto \tau^{n/2} \exp \left\{ -\frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\ & = \tau^{n/2} \exp \left\{ -\frac{\tau}{2} \left[ (X \hat{\beta} - X\beta)'(X \hat{\beta} - X\beta) + (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right] \right\} \\ & = \tau^{r/2} \exp \left\{ -\frac{\tau}{2} (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta) \right\} \\ & \quad \times \tau^{(n-r)/2} \exp \left\{ -\frac{\tau}{2} (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right\}. \end{aligned} \]

- Using the SIR prior \(p(\beta, \tau) \propto 1/\tau\), the posterior is of the form \[ \begin{aligned} p(\beta, \tau | Y) & \propto \tau^{r/2} \exp \left\{ -\frac{\tau}{2} (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta) \right\} \\ & \quad \times \tau^{(n-r)/2 - 1} \exp \left\{ -\frac{\tau}{2} (Y - X\hat{\beta})'(Y - X\hat{\beta}) \right\}. \end{aligned} \]
- First term is proportional to a multivariate normal density for \(\beta\) conditional on \(\tau\) and the second term is proportional to a Gamma density for \(\tau\).
- In other words, \[ p(\beta, \tau | Y) = p(\beta | \tau, Y) p(\tau | Y). \]

- The full conditionals are known, i.e. \[ \begin{aligned} \beta | \tau, Y & \sim N_r \left( \hat{\beta} , (\tau X'X)^{-1} \right) = N_r \left( \hat{\beta} , \sigma^2 (X'X)^{-1} \right) \\ \tau | Y & \sim \text{Gamma} \left( \frac{n-r}{2} , \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{2} \right). \end{aligned} \]
- By standardization it follows that \[ \frac{c'\beta - c'\hat{\beta}}{\sqrt{\sigma^2 c' (X'X)^{-1}c}} \Big\vert \tau , Y \sim N(0,1) \] and \[ \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{\sigma^2} \Big\vert Y = (SSE) \tau \Big\vert Y \sim \text{Gamma} \left( \frac{n-r}{2} , \frac{1}{2} \right) = \chi^2_{n-r}. \]

- Classical arguments yield \[ \frac{c'\beta - c'\hat{\beta}}{\sqrt{MSE c' (X'X)^{-1}c}} \Big\vert Y \sim t(n-r) . \]
- When predicting a new observation \(y_f\) with covariate vector \(x_f\), we have \[ \frac{y_f - x_f'\hat{\beta}}{\sqrt{MSE [1 + x_f' (X'X)^{-1}x_f ]}} \Big\vert Y \sim t(n-r) . \]
- This prior gives posterior means and posterior probability intervals for linear functions of the regression coefficients and for new observations that agree numerically with the least squares estimates and frequentist confidence interval.
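These interval formulas can be sketched in Python on simulated data (all numbers illustrative; `c` and `x_f` are arbitrary example vectors):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, r = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
SSE = (Y - X @ beta_hat) @ (Y - X @ beta_hat)
MSE = SSE / (n - r)
tq = stats.t.ppf(0.975, n - r)

# 95% posterior interval for c'beta (here c picks off the slope)
c = np.array([0.0, 1.0])
se_cb = np.sqrt(MSE * (c @ XtX_inv @ c))
slope_int = (c @ beta_hat - tq * se_cb, c @ beta_hat + tq * se_cb)

# 95% predictive interval for a new y_f at covariate vector x_f
x_f = np.array([1.0, 0.5])
center = x_f @ beta_hat
half = tq * np.sqrt(MSE * (1 + x_f @ XtX_inv @ x_f))
print(slope_int, (center - half, center + half))
```

The predictive interval is wider than the interval for \(x_f'\beta\) because of the extra `1 +` term, which accounts for the variability of the new observation itself.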

- Probability intervals for the variance or precision are numerically equivalent to the frequentist intervals.
- As far as standard point or interval estimation is concerned, Bayesians have little problem with frequentist normal theory regression results.
- However, Bayesian testing results are typically quite different from Neyman-Pearson testing results.
- Inference for nonlinear functions of regression coefficients, e.g., \(c_1′ \beta / c_2′ \beta\), is more difficult for frequentists and is typically based on large sample approximations. Bayesian estimation is straightforward if it is based on a Monte Carlo sample from the posterior distribution.
- Simulation from this posterior can be accomplished using the method of composition, i.e. generate a value \(\tau_k\) from its marginal Gamma posterior and then sampling \(\beta_k\) from the conditional normal posterior.
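A minimal Python sketch of the composition sampler, using simulated data (note `numpy`'s gamma generator is parameterized by shape and *scale*, so the rate \(SSE/2\) becomes `scale=2/SSE`):

```python
import numpy as np

def composition_sample(X, Y, n_draws, rng):
    """Sample (beta, tau) from the SIR-prior posterior by composition:
    draw tau from its marginal Gamma posterior, then beta | tau from the
    conditional normal posterior."""
    n, r = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    SSE = (Y - X @ beta_hat) @ (Y - X @ beta_hat)
    # tau | Y ~ Gamma((n-r)/2, rate=SSE/2); numpy uses scale = 1/rate
    taus = rng.gamma(shape=(n - r) / 2.0, scale=2.0 / SSE, size=n_draws)
    # beta | tau, Y ~ N_r(beta_hat, (tau X'X)^{-1})
    betas = np.array([rng.multivariate_normal(beta_hat, XtX_inv / t) for t in taus])
    return betas, taus

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=n)
betas, taus = composition_sample(X, Y, 5000, rng)
print(betas.mean(axis=0))  # close to the least squares estimate
```

Nonlinear functionals such as \(c_1'\beta / c_2'\beta\) are then estimated by applying the function to each draw and summarizing the resulting sample.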

- Commonly used reference prior that is not improper is an approximation to the SIR prior, namely \[ \beta_j \stackrel{iid}{\sim} N(0,b) \perp \tau \sim \text{Gamma}(c,c), \] where \(b\) is large and \(c\) is small.
- A standard choice for \(b\) has been \(10^6\) with \(c = 10^{−3}\).
- Prior approximates the SIR since the kernel of a \(\text{Gamma}(c,c)\) is \(\tau^{c-1}e^{-c\tau}\), so for small \(c\) we have \(p(\tau) \approx C/\tau\) where \(C\) is a constant. Similarly, \(p(\beta)\) is approximately uniform when \(b\) is large.
- Actual values of \(b\) and \(c\) in the normal and gamma distributions may be varied depending on the scale of the measurements being considered.
- This prior can be used in the primary data analysis or in a sensitivity analysis when an informative prior is available.
- Posterior analysis must be obtained by simulation, in particular with the MCMC techniques discussed later.

- Do not actually recommend conjugate priors for linear regression because they are hard to elicit.
- Nonetheless, the mathematical results are pretty, and they have historical and practical significance.
- Practical significance arises in Bayesian nonparametric analysis involving Dirichlet process mixture models (Chapter 15 of our text).

- With sampling distribution \[ Y | \beta, \tau \sim N_n (X\beta, (1/\tau)I_n) \] the conjugate prior is \[ \beta|\tau \sim N_r(\beta_0, (1/\tau)C_0), \quad \tau \sim \text{Gamma}(a,b). \]
- Equivalently we can specify the prior \[ \tilde{X} \beta | \tau \sim N_r (\tilde{Y}, (1/\tau)D(\tilde{w})), \] with known \(r \times r\) matrix \(\tilde{X}\), \(r\) vector \(\tilde{Y}\), and diagonal matrix \(D(\tilde{w})\).
- The two forms for the prior are actually equivalent with \[ \beta_0 = \tilde{X}^{-1} \tilde{Y}, \quad C_0 = \{ \tilde{X}' [D(\tilde{w})]^{-1} \tilde{X} \}^{-1} \] and a somewhat more complicated and non-unique procedure for determining \(\tilde{X}\), \(\tilde{Y}\), and \(D(\tilde{w})\) from \(\beta_0\) and \(C_0\).
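As an illustration, the \((\tilde{X}, \tilde{Y}, D(\tilde{w}))\) form can be converted to \((\beta_0, C_0)\) numerically; here we use the prior covariate matrix, guesses, and weights that are specified for the bank-salary data later in this section:

```python
import numpy as np

# Prior covariate matrix, best guesses, and weights from the bank-salary example.
X_tilde = np.array([
    [1., 0., 12.,   0.,  0.],
    [1., 1., 12.,   0.,  0.],
    [1., 0., 15.,   0.,  0.],
    [1., 0., 15., 100.,  0.],
    [1., 0., 12.,   0., 24.],
])
Y_tilde = np.array([4000., 4500., 5000., 5500., 6000.])
D_w = np.diag([0.4] * 5)

# beta_0 = X_tilde^{-1} Y_tilde  and  C_0 = (X_tilde' D^{-1} X_tilde)^{-1}
beta_0 = np.linalg.solve(X_tilde, Y_tilde)
C_0 = np.linalg.inv(X_tilde.T @ np.linalg.inv(D_w) @ X_tilde)
print(beta_0)  # implied prior mean for (beta_1, ..., beta_5)
```

The implied prior mean works out to \(\beta_0 = (0, 500, 333.3, 5, 83.3)'\); each coordinate can be read off directly from the elicited differences between rows of \(\tilde{X}\) (e.g. \(\beta_{0,2} = 4500 - 4000 = 500\)).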

- With the second form of prior for \(\beta\), one can establish \[ \beta | \tau,Y \sim N_r \left( \hat{\beta}, \left[ \tau (X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X}) \right]^{-1} \right), \] where \[ \hat{\beta} = (X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X})^{-1} [X'Y + \tilde{X}'D^{-1}(\tilde{w}) \tilde{Y}]. \]

- One can also show that, marginally, \[ \tau | Y \sim \text{Gamma}(Bdfe/2, (Bdfe)BMSE/2) \] where \(Bdfe = n+2a\) and \[ BMSE = \frac{(Y-X \hat{\beta})'(Y-X \hat{\beta}) + (\tilde{Y}-\tilde{X} \hat{\beta})' D^{-1}(\tilde{w}) (\tilde{Y}-\tilde{X} \hat{\beta}) + 2b}{Bdfe} \]
- One can regard \(2a\) as the number of *prior* observations for \(\tau\) because of its role relative to \(n\) in \(Bdfe\).

- Given these results, the marginal posterior is \[ \frac{c'\beta - c' \hat{\beta}}{\sqrt{BMSE c'[X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X}]^{-1} c}} \Big\vert Y \sim t(Bdfe). \]
- For predicting a new observation \(y_f\) with covariate vector \(x_f\), we get the predictive distribution \[ \frac{y_f - x_f' \hat{\beta}}{\sqrt{BMSE \left( 1 + x_f'[X'X + \tilde{X}'D^{-1}(\tilde{w}) \tilde{X}]^{-1} x_f \right)}} \Big\vert Y \sim t(Bdfe). \]
- Much of the analysis can be obtained by making minor changes to the output from a weighted least squares regression program.

- Specified five covariate vectors to generate a conjugate prior. Combined they are \[ \tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \\ \tilde{x}_5' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 12 & 0 & 0 \\ 1 & 1 & 12 & 0 & 0 \\ 1 & 0 & 15 & 0 & 0 \\ 1 & 0 & 15 & 100 & 0 \\ 1 & 0 & 12 & 0 & 24 \\ \end{bmatrix} . \]
- The vector \(\tilde{x}_1' = (1,0,12,0,0)\) corresponds to a female with 12 years of education, no previous experience, and starting work on January 1, 1969.
- From this point, \(\tilde{x}_2\) changes the sex, \(\tilde{x}_3\) changes the education, \(\tilde{x}_4\) changes the experience, and \(\tilde{x}_5\) changes the time hired.

- Thinking about the mean salary for each set of covariates, we chose best guesses for the means of \(\tilde{Y}' = (4000, 4500, 5000, 5500, 6000)\), which reflects a prior belief that starting salaries are higher for equally qualified men than women; a belief that salary is increasing as a function of education, experience, and time; and that three years of education at the beginning of the study has equivalent worth to having been employed at a two-year later date.
- Weights \(\tilde{w}_i\) are all chosen to be 0.4, so that the prior carries the same weight as \(5(0.4) = 2\) sampled observations.
- In other words, \[ \begin{bmatrix} 1 & 0 & 12 & 0 & 0 \\ 1 & 1 & 12 & 0 & 0 \\ 1 & 0 & 15 & 0 & 0 \\ 1 & 0 & 15 & 100 & 0 \\ 1 & 0 & 12 & 0 & 24 \\ \end{bmatrix} \beta \,\Big\vert\, \tau \sim N \left( \begin{bmatrix} 4000 \\ 4500 \\ 5000 \\ 5500 \\ 6000 \end{bmatrix}, \frac{1}{\tau} \begin{bmatrix} 0.4 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0 & 0 \\ 0 & 0 & 0 & 0.4 & 0 \\ 0 & 0 & 0 & 0 & 0.4 \\ \end{bmatrix} \right) . \]

- Combining this prior for \(\beta\) given \(\tau\) with the improper prior \[ p(\tau) \propto 1/\tau, \] the posterior results are as if \(a = b = 0\) in the Gamma prior.
- Posterior results can then be obtained from a weighted least squares regression analysis of the 93 observations in the original data taken together with the 5 weighted prior observations, using weights from the diagonal elements of the prior covariance matrix.
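This augmented weighted least squares trick can be sketched in Python with simulated data and a hypothetical two-row prior (the real analysis would use the 93 bank observations and the 5 prior rows above; all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 93
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([5.0, 1.0]) + rng.normal(0.0, 1.0, size=n)

# Hypothetical prior in the (X_tilde, Y_tilde, D(w)) form.
X_tilde = np.array([[1.0, 0.0], [1.0, 1.0]])
Y_tilde = np.array([5.0, 6.0])
w = np.array([0.4, 0.4])

# Stack the prior rows under the data; data rows get weight 1,
# prior row i gets precision weight 1 / w_i.
X_aug = np.vstack([X, X_tilde])
Y_aug = np.concatenate([Y, Y_tilde])
W = np.diag(np.concatenate([np.ones(n), 1.0 / w]))

# Weighted least squares on the augmented system gives the posterior mean.
beta_post = np.linalg.solve(X_aug.T @ W @ X_aug, X_aug.T @ W @ Y_aug)

# Agrees with the closed form (X'X + X~'D^{-1}X~)^{-1} (X'Y + X~'D^{-1}Y~).
D_inv = np.diag(1.0 / w)
direct = np.linalg.solve(X.T @ X + X_tilde.T @ D_inv @ X_tilde,
                         X.T @ Y + X_tilde.T @ D_inv @ Y_tilde)
print(np.allclose(beta_post, direct))  # True
```

The augmented normal equations are algebraically identical to the conjugate posterior mean, which is why a standard WLS routine reproduces it.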

Parameters | Mean | Scale | 2.5% | 97.5% |
---|---|---|---|---|
Constant | 3470.4 | 320.1 | 2834.7 | 4106.0 |
Sex | 708.5 | 114.4 | 481.4 | 935.7 |
Educ | 93.35 | 24.08 | 45.54 | 141.17 |
Exp | 1.35 | 0.57 | 0.21 | 2.49 |
Time | 23.84 | 5.00 | 13.90 | 33.77 |
\(x_f′ \beta\) | 6532.7 | 142.9 | 6248.9 | 6816.5 |
\(y_f\) | 6532.7 | 519.3 | 5501.4 | 7564.0 |
\(\tau\) | 4.01e-6 | 0.588e-6 | 2.94e-6 | 5.24e-6 |

- For everything except \(\tau\), the tabulated means and scales are exactly as in weighted least squares output.
- Report posterior scale parameters for the \(t(93)\) distributions rather than the posterior standard deviations (now \(t(0.975,93) = 1.9858\)).
- Inferences are similar to those with the SIR prior. Slight changes in estimated regression coefficients.
- Posterior scale factors are uniformly slightly smaller here. The implication is that the data and the prior information are not obviously inconsistent with one another and that the prior information actually reduces posterior uncertainty, although only slightly.

- Independence Priors
- Gibbs Sampling
- FEV Data Example