## Agenda

• FEV Data Example
• Independence Priors
• Gibbs Sampling
• Group Exercise

## Independence Priors

• Independence priors are used most often in practice; that is, we assume prior independence of $$\beta$$ and $$\tau$$ by letting $$p(\beta, \tau) = p(\beta)p(\tau)$$.
• We use the method of Bedrick, Christensen, and Johnson (1997), or BCJ method, to construct an informative prior for $$\beta$$, analogous to the method used for logistic regression.
• Goal is not to find the perfect prior to characterize someone's beliefs but to find a sensible prior that incorporates some basic scientific or experiential knowledge.
• Frame our discussion of prior elicitation around the FEV study from Rosner (2006) on pulmonary function (lung capacity) in adolescents.

## FEV Data Example

• Response $$y$$ is forced expiratory volume (FEV), which measures the volume of air in liters expelled in 1 second of a forceful breath.
• Lung function is expected to increase during adolescence, but smoking may slow its progression.
• The association between smoking and lung capacity in adolescents is investigated using data from 345 adolescents between the ages of 10 and 19.
• The predictor variables include a constant, age in years ($$x_2$$), an indicator variable for smoking status ($$x_3$$), and the interaction term $$x_4 = x_2 x_3$$. A predictor vector is $$x' = (1,x_2,x_3,x_4)$$.

## FEV Data Example

FEVdata <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch7/FEVdataAge10to19.txt", header=TRUE)
attach(FEVdata)
options(contrasts=c("contr.treatment","contr.poly"))
head(FEVdata)
##   Age   FEV Smoke
## 1  19 5.102     0
## 2  19 3.519     1
## 3  19 3.345     1
## 4  18 2.906     0
## 5  18 3.082     0
## 6  18 4.220     0

## Sampling Model

• With sampling model $Y|\beta ,\tau \sim N_n (X\beta, \tau^{-1} I_n),$ our standard informative prior has independent components $\beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b).$
• Our task is to determine values $$a$$, $$b$$, $$\beta_0$$, and $$C_0$$ that accurately reflect available prior information.

## Prior on $$\beta$$

• Similar to before, for $$\beta$$ we specify an equivalent prior $\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),$ with known $$r \times r$$ matrix $$\tilde{X}$$, $$r$$ vector $$\tilde{Y}$$, and diagonal matrix $$D(\tilde{w})$$.
• We then induce a prior on $$\beta$$.
• As with the conjugate prior, multiply $$\tilde{X} \beta$$ on the left by $$\tilde{X}^{-1}$$. This gives us $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ).$
• This BCJ method of constructing an informative prior above is analogous to the method used for logistic regression and similar to that used in the conjugate prior example.

## Prior on $$\beta$$

• Motivation is that individual regression coefficients $$\beta_j$$ are difficult to think about, since they are only indirectly related to anything observable.
• Instead of specifying a prior directly on $$\beta$$, we specify a prior on values that are more closely related to observables and induce the prior for $$\beta$$.
• Relatively easy to elicit a prior for the mean value of something that is potentially observable.
• We elicit a prior for the mean of the observations corresponding to a particular set of predictor variables.

## Prior on $$\beta$$

• Technically, we elicit priors on conditional means $\tilde{m}_i = E[y|\tilde{x}_i] = \tilde{x}_i' \beta$ for $$r$$ subpopulations defined by different predictor vectors $$\tilde{x}_i, i = 1,\dots,r$$.
• Resulting prior has the form $\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),$ and the induced prior on $$\beta$$ has the form $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).$

## FEV Model Priors

• With four regression parameters in the FEV model, we pick four covariate combinations $$\tilde{x}_i, i = 1,2,3,4$$, corresponding to circumstances that the expert can assess independently. In particular, we use prior experience to specify information about the subpopulations of 11-year-old nonsmokers, 13-year-old smokers, 16-year-old nonsmokers, and 18-year-old smokers.
• In matrix form we have $\tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \end{bmatrix} = \begin{bmatrix} 1 & 11 & 0 & 0 \\ 1 & 13 & 1 & 13 \\ 1 & 16 & 0 & 0 \\ 1 & 18 & 1 & 18 \\ \end{bmatrix} .$
• FEV values display variability within subpopulations. We are interested in the mean FEV value $$\tilde{m}_i$$ for the four circumstances in $$\tilde{X}$$ and ultimately in placing normal priors on the mean FEVs for the four types of adolescents defined by these covariate combinations.

## FEV Model Priors

• As an example, suppose that our medical collaborator expects the average FEV among all 18-year-old smokers in the sampled population to be 3.3, and is 99% sure that the mean FEV is less than 4.0 in this group. We take $$\tilde{y}_4 = 3.3$$ to be the mean of $$\tilde{m}_4$$. The prior variance reflects our uncertainty about where the mean lies, and if a normal prior is assumed then the 99th percentile of $$\tilde{m}_4$$ satisfies $4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_4},$ so $\tilde{m}_4 \sim N(3.3, 0.09).$
• Similar steps yield $\tilde{m}_1 \sim N(2.8, 0.04),\\ \tilde{m}_2 \sim N(3.0, 0.04),\\ \tilde{m}_3 \sim N(4.0, 0.04).$
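The elicitation arithmetic above is easy to script. The short sketch below (our own illustration, not code from the slides; the function name is hypothetical) recovers a prior variance from a stated mean and 99th percentile:

```r
# Solve  q99 = mean + qnorm(0.99) * sqrt(w)  for the prior variance w
elicit_var <- function(prior_mean, q99) ((q99 - prior_mean) / qnorm(0.99))^2

elicit_var(3.3, 4.0)  # 18-year-old smokers: roughly 0.09
```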

## FEV Model Priors

• Important to choose the $$\tilde{x}_i$$s so that the $$\tilde{m}_i$$s can be regarded independently.
• Then we have a prior of the form, $\tilde{X} \beta = \tilde{m} = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \\ \tilde{m}_3 \\ \tilde{m}_4 \end{bmatrix} \sim N \left( \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \right) .$
• We need to obtain $$p(\beta)$$, the prior density on $$\beta$$, from the density of $$\tilde{X} \beta$$.

## FEV Model Priors

• Naturally, $$p(\beta)$$ has a multivariate normal distribution with appropriately transformed mean vector and covariance matrix. In particular, $\beta = \tilde{X}^{-1} \tilde{m} \sim N \left( \tilde{X}^{-1} \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \tilde{X}^{-1} \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \tilde{X}^{-1 \prime} \right)$ so $\beta \sim N \left( \begin{bmatrix} 0.16 \\ 0.24 \\ 2.06 \\ -0.18 \end{bmatrix}, \begin{bmatrix} 0.603 & -0.043 & -0.603 & 0.043 \\ -0.043 & 0.003 & 0.043 & -0.003 \\ -0.603 & 0.043 & 1.73 & -0.119 \\ 0.043 & -0.003 & -0.119 & 0.008 \end{bmatrix} \right)$
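These numbers can be reproduced directly in R. The sketch below (variable names are ours) transforms the elicited means and variances into the induced prior mean and covariance for $$\beta$$:

```r
Xt <- rbind(c(1, 11, 0, 0),   # 11-year-old nonsmoker
            c(1, 13, 1, 13),  # 13-year-old smoker
            c(1, 16, 0, 0),   # 16-year-old nonsmoker
            c(1, 18, 1, 18))  # 18-year-old smoker
ytilde <- c(2.8, 3.0, 4.0, 3.3)            # elicited prior means
Dw     <- diag(c(0.04, 0.04, 0.04, 0.09))  # elicited prior variances

Xinv  <- solve(Xt)
beta0 <- drop(Xinv %*% ytilde)     # prior mean: (0.16, 0.24, 2.06, -0.18)
C0    <- Xinv %*% Dw %*% t(Xinv)   # prior covariance; C0[1,1] is about 0.603
```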

## Independence Priors

• With conjugate priors we also used an $$\tilde{X}$$ to elicit prior means $$\tilde{Y}$$, but there we treated the $$\tilde{w}$$s rather cavalierly.
• The independent priors approach makes it easier to elicit realistic information about variability.
• With a conjugate prior, information about variability related to regression coefficients must be conditional on $$\tau$$. For example, when asking the expert for, say, a 99th percentile that reflects uncertainty about the mean FEV for 18-year-old smokers, that value must be conditional on the variance $$\sigma^2$$.

## Prior on $$\tau$$

• Now construct an informative prior on $$\tau$$ or $$\sigma$$ using the FEV example.
• To choose $$a$$ and $$b$$ for a Gamma$$(a,b)$$ prior on $$\tau$$ we again consider 18-year-old smokers.
• Elicit the largest observation the expert would expect to see from this group, given our best guess for $$\tilde{m}_4$$.
• This provides a best guess for the 95th percentile of FEVs among 18-year-old smokers.
• Under normality, the 95th percentile is $$\tilde{m}_4 + 1.645 \sigma$$. But the best guess is conditional on $$\tilde{m}_4 = \tilde{y}_4$$, so with the expert's best guess for the 95th percentile being 5 and with $$\tilde{y}_4 = 3.3$$ , our best guess for $$\sigma$$, say $$\sigma_0$$, satisfies $$5 = 3.3+1.645 \sigma_0$$.
• Hence $$\sigma_0 = 1.7/1.645$$ or $$\tau_0 = (1.645/1.7)^2 = 0.94$$ for $$\tau$$. Take this value to be the mode of the gamma prior for $$\tau$$, that is $0.94 = \frac{a-1}{b} \text{ or } a(b) = 0.94b+1.$

## Prior on $$\tau$$

• To complete our specification of the gamma prior, we quantify our uncertainty about the 95th percentile.
• In addition to a best guess for the 95th percentile, we give an operative lower bound to it by specifying that we are 95% certain that the 95th percentile is greater than 4. It follows that \begin{aligned} 0.95 & = P(\mu + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(3.3 + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(1.645 \sigma > 0.7) = P(\sigma > 0.7/1.645 ) \\ & = P(\tau < (1.645/0.7)^2 ) = P(\tau < 5.52 ). \end{aligned}
• Take a grid of $$b$$ values and find the Gamma$$(a(b),b)$$ distribution that has 95th percentile 5.52.
• Numerically via R, we can see that $\tau \stackrel{approx}{\sim} \text{Gamma}(1.73, 0.78).$
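The grid search over $$b$$ can be sketched as follows (an illustration under the constraint $$a(b) = 0.94b + 1$$, which keeps the prior mode at 0.94; the variable names are ours):

```r
# For each b on a grid, set a(b) = 0.94*b + 1, then pick the b whose
# Gamma(a(b), b) distribution has 95th percentile closest to 5.52
b_grid <- seq(0.01, 5, by = 0.001)
q95    <- qgamma(0.95, shape = 0.94 * b_grid + 1, rate = b_grid)
b_hat  <- b_grid[which.min(abs(q95 - 5.52))]
a_hat  <- 0.94 * b_hat + 1
c(a_hat, b_hat)   # close to (1.73, 0.78)
```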

## Partial Prior Information

• In the absence of substantive prior knowledge, the SIR prior will likely suffice.
• However, if substantive prior information is available, we would be loath to ignore it.
• Sometimes the available information is insufficient to specify a complete BCJ prior on $$r$$ regression coefficients, especially when $$r$$ is moderate or large. Eliciting real prior information can be time consuming.
• Possible to place a partial prior on regression coefficients when we are only able to specify information for $$r_1 < r$$ mean values corresponding to $$r_1$$ predictor vectors $$\tilde{x}_i$$.
• The simplest case is $$r_1 = 1$$ and prior information is available for one typical individual in the population.
• The remaining parameters receive improper flat priors or independent normal priors with mean 0 and large variances.

## Gibbs Sampling

• For normal-gamma independence priors, the posterior distribution $$p(\beta, \tau | Y)$$ is analytically intractable, but the full conditionals needed for Gibbs sampling are standard distributions.
• Recall we have for linear regression the likelihood $L(\beta, \tau) \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\}$ and the independence prior specified by $\beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b),$ or equivalently $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ).$

## Gibbs Sampling

• Thus we have \begin{aligned} p(\beta, \tau | Y) & \propto L(\beta, \tau) p(\tau) p(\beta) \\ & \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\ & \quad \times \tau^{a-1} e^{-b\tau} \exp \left\{ - \frac{1}{2} (\tilde{X} \beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X} \beta - \tilde{Y}) \right\}. \end{aligned}
• Dropping multiplicative terms that do not involve $$\tau$$, the full conditional for $$\tau$$ has density $p(\tau | \beta,Y) \propto \tau^{\frac{n+2a}{2}-1} \exp \left\{ - \frac{\tau}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right\},$ or $\tau | \beta, Y \sim \text{Gamma}\left( \frac{n+2a}{2}, \frac{1}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right).$

## Gibbs Sampling

• Because Gibbs sampling involves sampling repeatedly from this distribution for successive $$\beta$$ iterates, it may be more computationally efficient to write $(Y - X\beta)'(Y - X\beta) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta)$ with $$\hat{\beta}$$ the least squares estimate.
• The first term on the right-hand side is a constant in $$\beta$$ so the repeated computations involve only the second term, which consists of $$r$$ dimensional matrix products rather than $$n$$ dimensional products.

## Gibbs Sampling

• To obtain the full conditional for $$\beta$$, rewrite the joint density as \begin{aligned} p&(\beta, \tau | Y) \propto \tau^{n/2 + a-1} e^{-b\tau} \times \\ & \exp \left\{ - \frac{1}{2} (X\beta - Y)' (\tau^{-1}I_n)^{-1} (X\beta - Y) - \frac{1}{2} (\tilde{X}\beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X}\beta - \tilde{Y})\right\}. \end{aligned}
• Dropping multiplicative terms that do not involve $$\beta$$ gives $p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix}' \begin{bmatrix} \tau^{-1}I_n & 0 \\ 0 & D(\tilde{w}) \end{bmatrix}^{-1} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix} \right\}.$

## Gibbs Sampling

• Define $\tilde{\beta} = \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \left( \tau X'Y + \tilde{X}' D(\tilde{w})^{-1} \tilde{Y} \right).$
• Using some matrix algebra and dropping terms not involving $$\beta$$, posterior density becomes $p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} (\beta - \tilde{\beta}) ' \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right] (\beta - \tilde{\beta}) \right\} ,$ which implies $\beta | \tau , Y \sim N_r \left( \tilde{\beta}, \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \right).$
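Putting the two full conditionals together gives a complete Gibbs sampler. The sketch below runs it on simulated data with the FEV design structure (the exercise would instead build `X` and `y` from FEVdata); all variable names are our own:

```r
set.seed(1)

## Simulated stand-in for the FEV data (same design structure)
n     <- 345
age   <- sample(10:19, n, replace = TRUE)
smoke <- rbinom(n, 1, 0.3)
X <- cbind(1, age, smoke, age * smoke)
y <- X %*% c(0.16, 0.24, 2.06, -0.18) + rnorm(n, sd = 1)

## BCJ prior from the slides
Xt   <- rbind(c(1,11,0,0), c(1,13,1,13), c(1,16,0,0), c(1,18,1,18))
ytld <- c(2.8, 3.0, 4.0, 3.3)
Dinv <- diag(1 / c(0.04, 0.04, 0.04, 0.09))
a <- 1.73; b <- 0.78

P0  <- t(Xt) %*% Dinv %*% Xt    # prior precision  C0^{-1} = X~' D(w)^{-1} X~
Pm0 <- t(Xt) %*% Dinv %*% ytld  # prior precision times prior mean

XtX  <- crossprod(X)
Xty  <- crossprod(X, y)
bhat <- solve(XtX, Xty)                  # least squares estimate
SSE  <- drop(crossprod(y - X %*% bhat))  # constant part of (Y-Xb)'(Y-Xb)

B    <- 5000
beta <- matrix(0, B, 4)
tau  <- numeric(B)
bcur <- drop(bhat)
for (s in 1:B) {
  ## tau | beta, Y ~ Gamma((n+2a)/2, [2b + (Y-Xb)'(Y-Xb)]/2),
  ## using the SSE + (bhat-b)'X'X(bhat-b) decomposition
  quad <- drop(t(bhat - bcur) %*% XtX %*% (bhat - bcur))
  tcur <- rgamma(1, shape = (n + 2 * a) / 2, rate = (2 * b + SSE + quad) / 2)
  ## beta | tau, Y ~ N(btld, V)  with  V = (tau X'X + P0)^{-1}
  V    <- solve(tcur * XtX + P0)
  btld <- V %*% (tcur * Xty + Pm0)
  bcur <- drop(btld + t(chol(V)) %*% rnorm(4))
  beta[s, ] <- bcur
  tau[s]    <- tcur
}
colMeans(beta)   # posterior means of the regression coefficients
mean(tau)        # posterior mean of the precision
```

With an informative prior and $$n = 345$$ observations, the posterior means land near the values used to generate the data; the same loop applied to the real FEVdata is the core of the exercise below.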

## Exercise

1. Analyze the FEV data with the BCJ prior presented in these slides using R with a Gibbs sampler.
2. Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
3. Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
4. Create a table with posterior means, standard deviations, medians, and a 90% credible interval for $$\beta_0, \beta_1, \beta_2, \beta_3, \tau$$. Include MCSEs for each of your estimates.
5. Compare estimates with the BCJ prior to those obtained using the SIR prior.