## Agenda

• FEV Data Example
• Independence Priors
• Gibbs Sampling
• Group Exercise

## Independence Priors

• Independence priors are used most often in practice; that is, we assume prior independence of $$\beta$$ and $$\tau$$ by letting $$p(\beta, \tau) = p(\beta)p(\tau)$$.
• We use the method of Bedrick, Christensen, and Johnson (1997), or BCJ method, to construct an informative prior for $$\beta$$, analogous to the method used for logistic regression.
• Goal is not to find the perfect prior to characterize someone's beliefs but to find a sensible prior that incorporates some basic scientific or experiential knowledge.
• Frame our discussion of prior elicitation around the FEV study from Rosner (2006) on pulmonary function (lung capacity) in adolescents.

## FEV Data Example

• Response $$y$$ is forced expiratory volume (FEV), which measures the volume of air in liters expelled in 1 second of a forceful breath.
• Lung function is expected to increase during adolescence, but smoking may slow its progression.
• The association between smoking and lung capacity in adolescents is investigated using data from 345 adolescents between the ages of 10 and 19.
• The predictor variables include a constant, age in years ($$x_2$$), an indicator variable for smoking status ($$x_3$$), and the interaction term $$x_4 = x_2 x_3$$. A predictor vector is $$x' = (1,x_2,x_3,x_4)$$.

## FEV Data Example

FEVdata <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch7/FEVdataAge10to19.txt", header=TRUE)
attach(FEVdata)
options(contrasts=c("contr.treatment","contr.poly"))
head(FEVdata)
##   Age   FEV Smoke
## 1  19 5.102     0
## 2  19 3.519     1
## 3  19 3.345     1
## 4  18 2.906     0
## 5  18 3.082     0
## 6  18 4.220     0

## Sampling Model

• With sampling model $Y|\beta ,\tau \sim N_n (X\beta, \tau^{-1} I_n),$ our standard informative prior has independent components $\beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b).$
• Our task is to determine values $$a$$, $$b$$, $$\beta_0$$, and $$C_0$$ that accurately reflect available prior information.

## Prior on $$\beta$$

• Similar to before, for $$\beta$$ we specify an equivalent prior $\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),$ with known $$r \times r$$ matrix $$\tilde{X}$$, $$r$$ vector $$\tilde{Y}$$, and diagonal matrix $$D(\tilde{w})$$.
• We then induce a prior on $$\beta$$.
• As with the conjugate prior, multiply $$\tilde{X} \beta$$ on the left by $$\tilde{X}^{-1}$$. This gives us $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ).$
• This BCJ method of constructing an informative prior above is analogous to the method used for logistic regression and similar to that used in the conjugate prior example.

## Prior on $$\beta$$

• Motivation is that individual regression coefficients $$\beta_j$$ are difficult to think about, since they are only indirectly related to anything observable.
• Instead of specifying a prior directly on $$\beta$$, we specify a prior on values that are more closely related to observables and induce the prior for $$\beta$$.
• Relatively easy to elicit a prior for the mean value of something that is potentially observable.
• We elicit a prior for the mean of the observations corresponding to a particular set of predictor variables.

## Prior on $$\beta$$

• Technically, we elicit priors on conditional means $\tilde{m}_i = E[y|\tilde{x}_i] = \tilde{x}_i' \beta$ for $$r$$ subpopulations defined by different predictor vectors $$\tilde{x}_i, i = 1,\dots,r$$.
• Resulting prior has the form $\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),$ and the induced prior on $$\beta$$ has the form $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).$

## FEV Model Priors

• With four regression parameters in the FEV model, we pick four covariate combinations $$\tilde{x}_i, i = 1,2,3,4$$, corresponding to circumstances that the expert can assess independently. In particular, we use prior experience to specify information about the subpopulations of 11-year-old nonsmokers, 13-year-old smokers, 16-year-old nonsmokers, and 18-year-old smokers.
• In matrix form we have $\tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \end{bmatrix} = \begin{bmatrix} 1 & 11 & 0 & 0 \\ 1 & 13 & 1 & 13 \\ 1 & 16 & 0 & 0 \\ 1 & 18 & 1 & 18 \\ \end{bmatrix} .$
• FEV values display variability within subpopulations. We are interested in the mean FEV value $$\tilde{m}_i$$ for the four circumstances in $$\tilde{X}$$ and ultimately in placing normal priors on the mean FEVs for the four types of adolescents defined by these covariate combinations.

## FEV Model Priors

• As an example, suppose that our medical collaborator expects the average FEV among all 18-year-old smokers in the sampled population to be 3.3, and is 99% sure that the mean FEV is less than 4.0 in this group. We take $$\tilde{y}_4 = 3.3$$ to be the mean of $$\tilde{m}_4$$. The prior variance reflects our uncertainty about where the mean lies, and if a normal prior is assumed then the 99th percentile of $$\tilde{m}_4$$ satisfies $4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_4},$ so $\tilde{m}_4 \sim N(3.3, 0.09).$
• Similar steps yield $\tilde{m}_1 \sim N(2.8, 0.04),\\ \tilde{m}_2 \sim N(3.0, 0.04),\\ \tilde{m}_3 \sim N(4.0, 0.04).$
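The elicitation arithmetic above is easy to script. The short sketch below (our own illustration, not code from the slides; the function name is hypothetical) recovers a prior variance from a stated mean and 99th percentile:

```r
# Solve  q99 = mean + qnorm(0.99) * sqrt(w)  for the prior variance w
elicit_var <- function(prior_mean, q99) ((q99 - prior_mean) / qnorm(0.99))^2

elicit_var(3.3, 4.0)  # 18-year-old smokers: roughly 0.09
```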

## FEV Model Priors

• Important to choose the $$\tilde{x}_i$$s so that the $$\tilde{m}_i$$s can be regarded independently.
• Then we have a prior of the form, $\tilde{X} \beta = \tilde{m} = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \\ \tilde{m}_3 \\ \tilde{m}_4 \end{bmatrix} \sim N \left( \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \right) .$
• We need to obtain $$p(\beta)$$, the prior density on $$\beta$$, from the density of $$\tilde{X} \beta$$.

## FEV Model Priors

• Naturally, $$p(\beta)$$ has a multivariate normal distribution with appropriately transformed mean vector and covariance matrix. In particular, $\beta = \tilde{X}^{-1} \tilde{m} \sim N \left( \tilde{X}^{-1} \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \tilde{X}^{-1} \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \tilde{X}^{-1 \prime} \right)$ so $\beta \sim N \left( \begin{bmatrix} 0.16 \\ 0.24 \\ 2.06 \\ -0.18 \end{bmatrix}, \begin{bmatrix} 0.603 & -0.043 & -0.603 & 0.043 \\ -0.043 & 0.003 & 0.043 & -0.003 \\ -0.603 & 0.043 & 1.73 & -0.119 \\ 0.043 & -0.003 & -0.119 & 0.008 \end{bmatrix} \right)$
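These numbers can be reproduced directly in R. The sketch below (variable names are ours) transforms the elicited means and variances into the induced prior mean and covariance for $$\beta$$:

```r
Xt <- rbind(c(1, 11, 0, 0),   # 11-year-old nonsmoker
            c(1, 13, 1, 13),  # 13-year-old smoker
            c(1, 16, 0, 0),   # 16-year-old nonsmoker
            c(1, 18, 1, 18))  # 18-year-old smoker
ytilde <- c(2.8, 3.0, 4.0, 3.3)            # elicited prior means
Dw     <- diag(c(0.04, 0.04, 0.04, 0.09))  # elicited prior variances

Xinv  <- solve(Xt)
beta0 <- drop(Xinv %*% ytilde)     # prior mean: (0.16, 0.24, 2.06, -0.18)
C0    <- Xinv %*% Dw %*% t(Xinv)   # prior covariance; C0[1,1] is about 0.603
```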

## Independence Priors

• With conjugate priors we also used an $$\tilde{X}$$ to elicit prior means $$\tilde{Y}$$, but there we treated the $$\tilde{w}$$s rather cavalierly.
• The independent priors approach makes it easier to elicit realistic information about variability.
• With a conjugate prior, information about variability related to regression coefficients must be conditional on $$\tau$$. For example, when asking the expert for, say, a 99th percentile that reflects uncertainty about the mean FEV for 18-year-old smokers, that value must be conditional on the variance $$\sigma^2$$.

## Prior on $$\tau$$

• Now construct an informative prior on $$\tau$$ or $$\sigma$$ using the FEV example.
• To choose $$a$$ and $$b$$ for a Gamma$$(a,b)$$ prior on $$\tau$$ we again consider 18-year-old smokers.
• Elicit the largest observation the expert would expect to see from this group, given our best guess for $$\tilde{m}_4$$.
• This provides a best guess for the 95th percentile of FEVs among 18-year-old smokers.
• Under normality, the 95th percentile is $$\tilde{m}_4 + 1.645 \sigma$$. But the best guess is conditional on $$\tilde{m}_4 = \tilde{y}_4$$, so with the expert's best guess for the 95th percentile being 5 and with $$\tilde{y}_4 = 3.3$$ , our best guess for $$\sigma$$, say $$\sigma_0$$, satisfies $$5 = 3.3+1.645 \sigma_0$$.
• Hence $$\sigma_0 = 1.7/1.645$$ or $$\tau_0 = (1.645/1.7)^2 = 0.94$$ for $$\tau$$. Take this value to be the mode of the gamma prior for $$\tau$$, that is $0.94 = \frac{a-1}{b} \text{ or } a(b) = 0.94b+1.$

## Prior on $$\tau$$

• To complete our specification of the gamma prior, we quantify our uncertainty about the 95th percentile.
• In addition to a best guess for the 95th percentile, we give an operative lower bound to it by specifying that we are 95% certain that the 95th percentile is greater than 4. It follows that \begin{aligned} 0.95 & = P(\mu + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(3.3 + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(1.645 \sigma > 0.7) = P(\sigma > 0.7/1.645 ) \\ & = P(\tau < (1.645/0.7)^2 ) = P(\tau < 5.52 ). \end{aligned}
• Take a grid of $$b$$ values and find the Gamma$$(a(b),b)$$ distribution that has 95th percentile 5.52.
• Numerically via R, we can see that $\tau \stackrel{approx}{\sim} \text{Gamma}(1.73, 0.78).$
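The grid search over $$b$$ can be sketched as follows (an illustration under the constraint $$a(b) = 0.94b + 1$$, which keeps the prior mode at 0.94; the variable names are ours):

```r
# For each b on a grid, set a(b) = 0.94*b + 1, then pick the b whose
# Gamma(a(b), b) distribution has 95th percentile closest to 5.52
b_grid <- seq(0.01, 5, by = 0.001)
q95    <- qgamma(0.95, shape = 0.94 * b_grid + 1, rate = b_grid)
b_hat  <- b_grid[which.min(abs(q95 - 5.52))]
a_hat  <- 0.94 * b_hat + 1
c(a_hat, b_hat)   # close to (1.73, 0.78)
```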

## Partial Prior Information

• In the absence of substantive prior knowledge, the SIR prior will likely suffice.
• However, if substantive prior information is available, we would be loath to ignore it.
• Sometimes the available information is insufficient to specify a complete BCJ prior on $$r$$ regression coefficients, especially when $$r$$ is moderate or large. Eliciting real prior information can be time consuming.
• Possible to place a partial prior on regression coefficients when we are only able to specify information for $$r_1 < r$$ mean values corresponding to $$r_1$$ predictor vectors $$\tilde{x}_i$$.
• The simplest case is $$r_1 = 1$$ and prior information is available for one typical individual in the population.
• The remaining parameters receive improper flat priors or independent normal priors with mean 0 and large variances.

## Gibbs Sampling

• For normal-gamma independence priors, the posterior distribution $$p(\beta, \tau | Y)$$ is analytically intractable, but the full conditionals needed for Gibbs sampling are standard distributions.
• Recall we have for linear regression the likelihood $L(\beta, \tau) \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\}$ and the independence prior specified by $\beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b),$ or equivalently $\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ).$

## Gibbs Sampling

• Thus we have \begin{aligned} p(\beta, \tau | Y) & \propto L(\beta, \tau) p(\tau) p(\beta) \\ & \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\ & \quad \times \tau^{a-1} e^{-b\tau} \exp \left\{ - \frac{1}{2} (\tilde{X} \beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X} \beta - \tilde{Y}) \right\}. \end{aligned}
• Dropping multiplicative terms that do not involve $$\tau$$, the full conditional for $$\tau$$ has density $p(\tau | \beta,Y) \propto \tau^{\frac{n+2a}{2}-1} \exp \left\{ - \frac{\tau}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right\},$ or $\tau | \beta, Y \sim \text{Gamma}\left( \frac{n+2a}{2}, \frac{1}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right).$

## Gibbs Sampling

• Because Gibbs sampling involves sampling repeatedly from this distribution for successive $$\beta$$ iterates, it may be more computationally efficient to write $(Y - X\beta)'(Y - X\beta) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta)$ with $$\hat{\beta}$$ the least squares estimate.
• The first term on the right-hand side is a constant in $$\beta$$ so the repeated computations involve only the second term, which consists of $$r$$ dimensional matrix products rather than $$n$$ dimensional products.

## Gibbs Sampling

• To obtain the full conditional for $$\beta$$, rewrite the joint density as \begin{aligned} p&(\beta, \tau | Y) \propto \tau^{n/2 + a-1} e^{-b\tau} \times \\ & \exp \left\{ - \frac{1}{2} (X\beta - Y)' (\tau^{-1}I_n)^{-1} (X\beta - Y) - \frac{1}{2} (\tilde{X}\beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X}\beta - \tilde{Y})\right\}. \end{aligned}
• Dropping multiplicative terms that do not involve $$\beta$$ gives $p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix}' \begin{bmatrix} \tau^{-1}I_n & 0 \\ 0 & D(\tilde{w}) \end{bmatrix}^{-1} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix} \right\}.$

## Gibbs Sampling

• Define $\tilde{\beta} = \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \left( \tau X'Y + \tilde{X}' D(\tilde{w})^{-1} \tilde{Y} \right).$
• Using some matrix algebra and dropping terms not involving $$\beta$$, posterior density becomes $p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} (\beta - \tilde{\beta}) ' \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right] (\beta - \tilde{\beta}) \right\} ,$ which implies $\beta | \tau , Y \sim N_r \left( \tilde{\beta}, \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \right).$
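Putting the two full conditionals together gives a complete Gibbs sampler. The sketch below runs it on simulated data with the FEV design structure (the exercise would instead build `X` and `y` from FEVdata); all variable names are our own:

```r
set.seed(1)

## Simulated stand-in for the FEV data (same design structure)
n     <- 345
age   <- sample(10:19, n, replace = TRUE)
smoke <- rbinom(n, 1, 0.3)
X <- cbind(1, age, smoke, age * smoke)
y <- X %*% c(0.16, 0.24, 2.06, -0.18) + rnorm(n, sd = 1)

## BCJ prior from the slides
Xt   <- rbind(c(1,11,0,0), c(1,13,1,13), c(1,16,0,0), c(1,18,1,18))
ytld <- c(2.8, 3.0, 4.0, 3.3)
Dinv <- diag(1 / c(0.04, 0.04, 0.04, 0.09))
a <- 1.73; b <- 0.78

P0  <- t(Xt) %*% Dinv %*% Xt    # prior precision  C0^{-1} = X~' D(w)^{-1} X~
Pm0 <- t(Xt) %*% Dinv %*% ytld  # prior precision times prior mean

XtX  <- crossprod(X)
Xty  <- crossprod(X, y)
bhat <- solve(XtX, Xty)                  # least squares estimate
SSE  <- drop(crossprod(y - X %*% bhat))  # constant part of (Y-Xb)'(Y-Xb)

B    <- 5000
beta <- matrix(0, B, 4)
tau  <- numeric(B)
bcur <- drop(bhat)
for (s in 1:B) {
  ## tau | beta, Y ~ Gamma((n+2a)/2, [2b + (Y-Xb)'(Y-Xb)]/2),
  ## using the SSE + (bhat-b)'X'X(bhat-b) decomposition
  quad <- drop(t(bhat - bcur) %*% XtX %*% (bhat - bcur))
  tcur <- rgamma(1, shape = (n + 2 * a) / 2, rate = (2 * b + SSE + quad) / 2)
  ## beta | tau, Y ~ N(btld, V)  with  V = (tau X'X + P0)^{-1}
  V    <- solve(tcur * XtX + P0)
  btld <- V %*% (tcur * Xty + Pm0)
  bcur <- drop(btld + t(chol(V)) %*% rnorm(4))
  beta[s, ] <- bcur
  tau[s]    <- tcur
}
colMeans(beta)   # posterior means of the regression coefficients
mean(tau)        # posterior mean of the precision
```

With an informative prior and $$n = 345$$ observations, the posterior means land near the values used to generate the data; the same loop applied to the real FEVdata is the core of the exercise below.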

## Exercise

1. Analyze the FEV data with the BCJ prior presented in these slides using R with a Gibbs sampler.
2. Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
3. Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
4. Create a table with posterior means, standard deviations, medians, and a 90% credible interval for $$\beta_0, \beta_1, \beta_2, \beta_3, \tau$$. Include MCSEs for each of your estimates.
5. Compare estimates with the BCJ prior to those obtained using the SIR prior.