--- title: "STAT 203: Lecture 18" author: "James M. Flegal" output: ioslides_presentation: smaller: yes --- ## Agenda - FEV Data Example - Independence Priors - Gibbs Sampling - Group Exercise ## Independence Priors - **Independence priors** used most often in practice, that is we assume prior independence of $\beta$ and $\tau$ by letting $p(\beta, \tau) = p(\beta)p(\tau)$. - The method of Bedrick, Christensen, and Johnson (1997), or BCJ method, of constructing an informative prior for β analogous to the method used for logistic regression. - Goal is not to find the perfect prior to characterize someone's beliefs but to find a sensible prior that incorporates some basic scientific or experiential knowledge. - Frame our discussion of prior elicitation around the FEV study from Rosner (2006) on pulmonary function (lung capacity) in adolescents. ## FEV Data Example - Response $y$ is forced expiratory volume (FEV), which measures the volume of air in liters expelled in 1 second of a forceful breath. - Lung function is expected to increase during adolescence, but smoking may slow its progression. - The association between smoking and lung capacity in adolescents is investigated using data from 345 adolescents between the ages of 10 and 19. - The predictor variables include a constant, age in years ($x_2$), a indicator variable for smoking status ($x_3$), and the interaction term $x_4 = x_2 x_3$. A predictor vector is $x' = (1,x_2,x_3,x_4)$. ## FEV Data Example ```{r} FEVdata <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch7/FEVdataAge10to19.txt" ,header=T, sep="\t") attach(FEVdata) options(contrasts=c("contr.treatment","contr.poly")) head(FEVdata) ``` ## Interaction Plot with LOWESS Curve ```{r, echo=FALSE} ## Interaction plot. LOWESS plot for the FEV data plot(Age,FEV,xlab="Age", ylab="FEV") lines(lowess(Age[Smoke==1],FEV[Smoke==1],f=.95),lwd=2,cex=1.2) lines(lowess(Age[Smoke==0],FEV[Smoke==0],f=.95),lty=2,lwd=2,cex=1.2) legend("bottomright", c("Smoker", "Nonsmoker"), lty=c(1,2)) ``` ## Bayesian Posterior Means Plot ```{r, echo=FALSE} BayesSmoker <- c(3.058,3.124,3.191,3.258,3.324,3.391,3.458,3.525,3.591,3.658) BayesNonsmoker <- c(2.772,2.982,3.192,3.402,3.612,3.821,4.031,4.241,4.451,4.661) Age1 <- seq(10,19,1) plot(Age,FEV,xlab="Age", ylab="FEV", lwd=1.5) lines(Age1,BayesSmoker,type='l', lty=1, lwd=2,cex=1.2) lines(Age1,BayesNonsmoker,type='l', lty=2, lwd=2,cex=1.2) legend("bottomright", c("Smoker", "Nonsmoker"), lty=c(1,2), lwd=2) ``` ## Sampling Model - With sampling model \[ Y|\beta ,\tau \sim N_n (X\beta, \tau^{-1} I_n), \] our standard informative prior has independent components \[ \beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b). \] - Our task is to determine values $a$, $b$, $\beta_0$, and $C_0$ that accurately reflect available prior information. ## Prior on $\beta$ - Similar to before, for $\beta$ we specify an equivalent prior \[ \tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})), \] with known $r \times r$ matrix $\tilde{X}$, $r$ vector $\tilde{Y}$, and diagonal matrix $D(\tilde{w})$. - We then induce a prior on $\beta$. - As with the conjugagte prior, multiply $\tilde{X} \beta$ on the left by $\tilde{X}^{-1}$. This gives us \[ \beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ). \] - This BCJ method of constructing an informative prior above is analogous to the method used for logistic regression and similar to that used in the conjugate prior example. 
## Prior on $\beta$

- The motivation is that individual regression coefficients $\beta_j$ are difficult to think about, since they are only indirectly related to anything observable.
- Instead of specifying a prior directly on $\beta$, we specify a prior on values that are more closely related to observables and induce the prior for $\beta$.
- It is relatively easy to elicit a prior for the mean value of something that is potentially observable.
- We elicit a prior for the mean of the observations corresponding to a particular set of predictor variables.

## Prior on $\beta$

- Technically, we elicit priors on the conditional means
\[
\tilde{m}_i = E[y|\tilde{x}_i] = \tilde{x}_i' \beta
\]
for $r$ subpopulations defined by different predictor vectors $\tilde{x}_i$, $i = 1, \dots, r$.
- The resulting prior has the form
\[
\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),
\]
and the induced prior on $\beta$ has the form
\[
\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).
\]

## FEV Model Priors

- With four regression parameters in the FEV model, we pick four covariate combinations $\tilde{x}_i$, $i = 1, 2, 3, 4$, corresponding to circumstances that the expert can assess independently. In particular, we use prior experience to specify information about the subpopulations of 11-year-old nonsmokers, 13-year-old smokers, 16-year-old nonsmokers, and 18-year-old smokers.
- In matrix form we have
\[
\tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \end{bmatrix}
= \begin{bmatrix}
1 & 11 & 0 & 0 \\
1 & 13 & 1 & 13 \\
1 & 16 & 0 & 0 \\
1 & 18 & 1 & 18 \\
\end{bmatrix} .
\]
- FEV values display variability within subpopulations. We are interested in the mean FEV value $\tilde{m}_i$ for the four circumstances in $\tilde{X}$ and ultimately in placing normal priors on the mean FEVs for the four types of adolescents defined by these covariate combinations.

## FEV Model Priors

- As an example, suppose that our medical collaborator expects the average FEV among all 18-year-old smokers in the sampled population to be 3.3, and is 99% sure that the mean FEV is less than 4.0 in this group. We take $\tilde{y}_4 = 3.3$ to be the mean of $\tilde{m}_4$. The prior variance reflects our uncertainty about where the mean lies, and if a normal prior is assumed then the 99th percentile of $\tilde{m}_4$ satisfies
\[
4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_4},
\]
so $\tilde{w}_4 = (0.7/2.33)^2 \approx 0.09$ and
\[
\tilde{m}_4 \sim N(3.3, 0.09).
\]
- Similar steps yield
\[
\tilde{m}_1 \sim N(2.8, 0.04), \quad
\tilde{m}_2 \sim N(3.0, 0.04), \quad
\tilde{m}_3 \sim N(4.0, 0.04).
\]

## FEV Model Priors

- It is important to choose the $\tilde{x}_i$s so that the $\tilde{m}_i$s can be regarded as independent.
- Then we have a prior of the form
\[
\tilde{X} \beta = \tilde{m} = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \\ \tilde{m}_3 \\ \tilde{m}_4 \end{bmatrix}
\sim N \left(
\begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix},
\begin{bmatrix}
0.04 & 0 & 0 & 0 \\
0 & 0.04 & 0 & 0 \\
0 & 0 & 0.04 & 0 \\
0 & 0 & 0 & 0.09
\end{bmatrix}
\right) .
\]
- We need to obtain $p(\beta)$, the prior density on $\beta$, from the density of $\tilde{X} \beta$.

## FEV Model Priors

- Naturally, $p(\beta)$ has a multivariate normal distribution with appropriately transformed mean vector and covariance matrix. In particular,
\[
\beta = \tilde{X}^{-1} \tilde{m} \sim N \left(
\tilde{X}^{-1} \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix},
\tilde{X}^{-1} \begin{bmatrix}
0.04 & 0 & 0 & 0 \\
0 & 0.04 & 0 & 0 \\
0 & 0 & 0.04 & 0 \\
0 & 0 & 0 & 0.09
\end{bmatrix} \tilde{X}^{-1 \prime}
\right),
\]
so
\[
\beta \sim N \left(
\begin{bmatrix} 0.16 \\ 0.24 \\ 2.06 \\ -0.18 \end{bmatrix},
\begin{bmatrix}
0.603 & -0.043 & -0.603 & 0.043 \\
-0.043 & 0.003 & 0.043 & -0.003 \\
-0.603 & 0.043 & 1.73 & -0.119 \\
0.043 & -0.003 & -0.119 & 0.008
\end{bmatrix}
\right).
\]
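## Computing the Induced Prior in `R`

- The induced prior mean $\tilde{X}^{-1} \tilde{Y}$ and covariance $\tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}$ can be computed directly in `R`. Below is a minimal sketch (the object names are our own) that should reproduce the values on the previous slide up to rounding.

```{r}
## Induce the prior on beta from the elicited prior on Xtilde %*% beta
Xtilde <- rbind(c(1, 11, 0, 0),
                c(1, 13, 1, 13),
                c(1, 16, 0, 0),
                c(1, 18, 1, 18))
Ytilde <- c(2.8, 3.0, 4.0, 3.3)            # elicited prior means
Dw     <- diag(c(0.04, 0.04, 0.04, 0.09))  # elicited prior variances
Xinv   <- solve(Xtilde)
beta0  <- Xinv %*% Ytilde          # prior mean of beta
C0     <- Xinv %*% Dw %*% t(Xinv)  # prior covariance of beta
round(beta0, 3)
round(C0, 3)
```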
## Independence Priors

- With the conjugate prior we also used an $\tilde{X}$ to elicit prior means $\tilde{Y}$, but there we treated the $\tilde{w}$s rather cavalierly.
- The independence prior approach makes it easier to elicit realistic information about variability.
- With a conjugate prior, information about variability related to regression coefficients must be conditional on $\tau$. For example, when asking the expert for, say, a 99th percentile that reflects uncertainty about the mean FEV for 18-year-old smokers, that value must be conditional on the variance $\sigma^2$.

## Prior on $\tau$

- We now construct an informative prior on $\tau$ or $\sigma$ using the FEV example.
- To choose $a$ and $b$ for a Gamma$(a,b)$ prior on $\tau$ we again consider 18-year-old smokers.
- Elicit the largest observation the expert would expect to see from this group, given our best guess for $\tilde{m}_4$.
- This provides a best guess for the 95th percentile of FEVs among 18-year-old smokers.
- Under normality, the 95th percentile is $\tilde{m}_4 + 1.645 \sigma$. But the best guess is conditional on $\tilde{m}_4 = \tilde{y}_4$, so with the expert's best guess for the 95th percentile being 5 and with $\tilde{y}_4 = 3.3$, our best guess for $\sigma$, say $\sigma_0$, satisfies $5 = 3.3 + 1.645 \sigma_0$.
- Hence $\sigma_0 = 1.7/1.645$, or $\tau_0 = (1.645/1.7)^2 = 0.94$ for $\tau$. Take this value to be the mode of the gamma prior for $\tau$, that is,
\[
0.94 = \frac{a-1}{b} \text{ or } a(b) = 0.94b + 1.
\]

## Prior on $\tau$

- To complete our specification of the gamma prior, we elicit an uncertainty about the 95th percentile.
- In addition to a best guess for the 95th percentile, we give an operative lower bound for it by specifying that we are 95% certain that the 95th percentile is greater than 4. It follows that
\[
\begin{aligned}
0.95 & = P(\mu + 1.645 \sigma > 4 | \mu = 3.3) \\
& = P(3.3 + 1.645 \sigma > 4 | \mu = 3.3) \\
& = P(1.645 \sigma > 0.7) = P(\sigma > 0.7/1.645) \\
& = P(\tau < (1.645/0.7)^2) = P(\tau < 5.52).
\end{aligned}
\]
- Take a grid of $b$ values and find the Gamma$(a(b),b)$ distribution that has 95th percentile 5.52 (see the `R` sketch on the next slide).
- Numerically via `R`, we can see that
\[
\tau \stackrel{approx}{\sim} \text{Gamma}(1.73, 0.78).
\]
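## Prior on $\tau$ in `R`

- One way to carry out the grid search just described; this is a minimal sketch in which the grid range and object names are our own choices. The selected values should be close to the Gamma$(1.73, 0.78)$ prior reported on the previous slide.

```{r}
## Grid search for the gamma prior on tau: fix the mode at 0.94 via
## a(b) = 0.94 b + 1, then choose b so the 95th percentile of tau is 5.52
b.grid <- seq(0.01, 5, by = 0.01)
a.grid <- 0.94 * b.grid + 1
q95    <- qgamma(0.95, shape = a.grid, rate = b.grid)
b.star <- b.grid[which.min(abs(q95 - 5.52))]
a.star <- 0.94 * b.star + 1
round(c(a = a.star, b = b.star), 2)
```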
## Partial Prior Information

- In the absence of substantive prior knowledge, the SIR prior will likely suffice.
- However, if substantive prior information is available, we would be loath to ignore it.
- Sometimes the available information is insufficient to specify a complete BCJ prior on $r$ regression coefficients, especially when $r$ is moderate or large. Eliciting real prior information can be time-consuming.
- It is possible to place a partial prior on the regression coefficients when we are only able to specify information for $r_1 < r$ mean values corresponding to $r_1$ predictor vectors $\tilde{x}_i$.
- The simplest case is $r_1 = 1$, where prior information is available for one **typical** individual in the population.
- The remaining parameters receive improper flat priors or independent normal priors with mean 0 and large variances.

## Gibbs Sampling

- For normal-gamma independence priors, the posterior distribution $p(\beta, \tau | Y)$ is analytically intractable, but the full conditionals needed for Gibbs sampling are standard distributions.
- Recall that for linear regression we have the likelihood
\[
L(\beta, \tau) \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\}
\]
and the independence prior specified by
\[
\beta \sim N_r (\beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b),
\]
or equivalently
\[
\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).
\]

## Gibbs Sampling

- Thus we have
\[
\begin{aligned}
p(\beta, \tau | Y) & \propto L(\beta, \tau) p(\tau) p(\beta) \\
& \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\
& \quad \times \tau^{a-1} e^{-b\tau} \exp \left\{ - \frac{1}{2} (\tilde{X} \beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X} \beta - \tilde{Y}) \right\}.
\end{aligned}
\]
- Dropping multiplicative terms that do not involve $\tau$, the full conditional for $\tau$ has density
\[
p(\tau | \beta, Y) \propto \tau^{\frac{n+2a}{2}-1} \exp \left\{ - \frac{\tau}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right\},
\]
or
\[
\tau | \beta, Y \sim \text{Gamma}\left( \frac{n+2a}{2}, \frac{1}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right).
\]

## Gibbs Sampling

- Because Gibbs sampling involves sampling repeatedly from this distribution for successive $\beta$ iterates, it may be more computationally efficient to write
\[
(Y - X\beta)'(Y - X\beta) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta)
\]
with $\hat{\beta}$ the least squares estimate.
- The first term on the right-hand side is constant in $\beta$, so the repeated computations involve only the second term, which consists of $r$-dimensional matrix products rather than $n$-dimensional products.

## Gibbs Sampling

- To obtain the full conditional for $\beta$, rewrite the joint density as
\[
\begin{aligned}
p&(\beta, \tau | Y) \propto \tau^{n/2 + a-1} e^{-b\tau} \times \\
& \exp \left\{ - \frac{1}{2} (X\beta - Y)' (\tau^{-1}I_n)^{-1} (X\beta - Y) - \frac{1}{2} (\tilde{X}\beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X}\beta - \tilde{Y})\right\}.
\end{aligned}
\]
- Dropping multiplicative terms that do not involve $\beta$ gives
\[
p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2}
\begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix}'
\begin{bmatrix} \tau^{-1}I_n & 0 \\ 0 & D(\tilde{w}) \end{bmatrix}^{-1}
\begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix} \right\}.
\]

## Gibbs Sampling

- Define
\[
\tilde{\beta} = \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \left( \tau X'Y + \tilde{X}' D(\tilde{w})^{-1} \tilde{Y} \right).
\]
- Using some matrix algebra and dropping terms not involving $\beta$, the full conditional density becomes
\[
p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} (\beta - \tilde{\beta})' \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right] (\beta - \tilde{\beta}) \right\},
\]
which implies
\[
\beta | \tau , Y \sim N_r \left( \tilde{\beta}, \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \right).
\]
- A short `R` sketch of the resulting sampler is given on the next slide.
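## Gibbs Sampling in `R`

- Below is a minimal sketch of a Gibbs sampler built from the two full conditionals above; the function name, arguments, and the use of `MASS::mvrnorm` for the multivariate normal draw are our own choices, not a prescribed implementation.

```{r}
## Sketch of a Gibbs sampler for the normal-gamma independence prior
library(MASS)

gibbs.lm <- function(niter, Y, X, Xtilde, Ytilde, Dw, a, b, tau.init = 1) {
  n <- length(Y); r <- ncol(X)
  XtX  <- crossprod(X)              # X'X
  XtY  <- crossprod(X, Y)           # X'Y
  Dinv <- solve(Dw)
  prior.prec <- t(Xtilde) %*% Dinv %*% Xtilde   # Xtilde' D(w)^{-1} Xtilde
  prior.term <- t(Xtilde) %*% Dinv %*% Ytilde   # Xtilde' D(w)^{-1} Ytilde
  out <- matrix(NA, niter, r + 1)
  tau <- tau.init
  for (i in 1:niter) {
    ## beta | tau, Y
    prec   <- tau * XtX + prior.prec
    btilde <- drop(solve(prec, tau * XtY + prior.term))
    beta   <- mvrnorm(1, btilde, solve(prec))
    ## tau | beta, Y
    resid <- Y - drop(X %*% beta)
    tau   <- rgamma(1, shape = (n + 2 * a) / 2,
                    rate = (2 * b + sum(resid^2)) / 2)
    out[i, ] <- c(beta, tau)
  }
  colnames(out) <- c(paste0("beta", 1:r), "tau")
  out
}

## Example call (using Xtilde, Ytilde, Dw from the earlier prior slide):
## draws <- gibbs.lm(10000, FEV, cbind(1, Age, Smoke, Age * Smoke),
##                   Xtilde, Ytilde, Dw, a = 1.73, b = 0.78)
```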
## Exercise

1. Analyze the FEV data with the BCJ prior presented in these slides using `R` with a Gibbs sampler.
2. Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
3. Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
4. Create a table with posterior means, standard deviations, medians, and a 90% credible interval for $\beta_0, \beta_1, \beta_2, \beta_3, \tau$. Include MCSEs for each of your estimates.
5. Compare estimates with the BCJ prior to those obtained using the SIR prior.