- FEV Data Example
- Independence Priors
- Gibbs Sampling
- Group Exercise

- **Independence priors** are used most often in practice; that is, we assume prior independence of \(\beta\) and \(\tau\) by letting \(p(\beta, \tau) = p(\beta)p(\tau)\).
- The method of Bedrick, Christensen, and Johnson (1997), or BCJ method, constructs an informative prior for \(\beta\) analogous to the method used for logistic regression.
- Goal is not to find the perfect prior to characterize someone's beliefs but to find a sensible prior that incorporates some basic scientific or experiential knowledge.
- Frame our discussion of prior elicitation around the FEV study from Rosner (2006) on pulmonary function (lung capacity) in adolescents.

- Response \(y\) is forced expiratory volume (FEV), which measures the volume of air in liters expelled in 1 second of a forceful breath.
- Lung function is expected to increase during adolescence, but smoking may slow its progression.
- The association between smoking and lung capacity in adolescents is investigated using data from 345 adolescents between the ages of 10 and 19.
- The predictor variables include a constant, age in years (\(x_2\)), an indicator variable for smoking status (\(x_3\)), and the interaction term \(x_4 = x_2 x_3\). A predictor vector is \(x' = (1,x_2,x_3,x_4)\).

```r
FEVdata <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch7/FEVdataAge10to19.txt",
                      header = TRUE, sep = "\t")
attach(FEVdata)
options(contrasts = c("contr.treatment", "contr.poly"))
head(FEVdata)
```

```
##   Age   FEV Smoke
## 1  19 5.102     0
## 2  19 3.519     1
## 3  19 3.345     1
## 4  18 2.906     0
## 5  18 3.082     0
## 6  18 4.220     0
```

- With sampling model \[ Y|\beta ,\tau \sim N_n (X\beta, \tau^{-1} I_n), \] our standard informative prior has independent components \[ \beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b). \]
- Our task is to determine values \(a\), \(b\), \(\beta_0\), and \(C_0\) that accurately reflect available prior information.

- Similar to before, for \(\beta\) we specify an equivalent prior \[ \tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})), \] with known \(r \times r\) matrix \(\tilde{X}\), \(r\) vector \(\tilde{Y}\), and diagonal matrix \(D(\tilde{w})\).
- We then induce a prior on \(\beta\).
- As with the conjugate prior, multiply \(\tilde{X} \beta\) on the left by \(\tilde{X}^{-1}\). This gives us \[ \beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ). \]
- This BCJ method of constructing an informative prior above is analogous to the method used for logistic regression and similar to that used in the conjugate prior example.

- Motivation is that individual regression coefficients \(\beta_j\) are difficult to think about, since they are only indirectly related to anything observable.
- Instead of specifying a prior directly on \(\beta\), we specify a prior on values that are more closely related to observables and induce the prior for \(\beta\).
- Relatively easy to elicit a prior for the mean value of something that is potentially observable.
- We elicit a prior for the mean of the observations corresponding to a particular set of predictor variables.

- Technically, we elicit priors on conditional means \[ \tilde{m}_i = E[y|\tilde{x}_i] = \tilde{x}_i' \beta \] for \(r\) subpopulations defined by different predictor vectors \(\tilde{x}_i, i = 1,\dots,r\).
- Resulting prior has the form \[ \tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})), \] and the induced prior on \(\beta\) has the form \[ \beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}). \]

- With four regression parameters in the FEV model, we pick four covariate combinations \(\tilde{x}_i, i = 1,2,3,4\), corresponding to circumstances that the expert can assess independently. In particular, we use prior experience to specify information about the subpopulations of 11-year-old nonsmokers, 13-year-old smokers, 16-year-old nonsmokers, and 18-year-old smokers.
- In matrix form we have \[ \tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \end{bmatrix} = \begin{bmatrix} 1 & 11 & 0 & 0 \\ 1 & 13 & 1 & 13 \\ 1 & 16 & 0 & 0 \\ 1 & 18 & 1 & 18 \\ \end{bmatrix} . \]
- FEV values display variability within subpopulations. We are interested in the mean FEV value \(\tilde{m}_i\) for the four circumstances in \(\tilde{X}\) and ultimately in placing normal priors on the mean FEVs for the four types of adolescents defined by these covariate combinations.

- As an example, suppose that our medical collaborator expects the average FEV among all 18-year-old smokers in the sampled population to be 3.3, and is 99% sure that the mean FEV is less than 4.0 in this group. We take \(\tilde{y}_4 = 3.3\) to be the mean of \(\tilde{m}_4\). The prior variance reflects our uncertainty about where the mean lies, and if a normal prior is assumed then the 99th percentile of \(\tilde{m}_4\) satisfies \[ 4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_4}, \] so \[ \tilde{m}_4 \sim N(3.3, 0.09). \]
- Similar steps yield \[ \tilde{m}_1 \sim N(2.8, 0.04),\\ \tilde{m}_2 \sim N(3.0, 0.04),\\ \tilde{m}_3 \sim N(4.0, 0.04). \]
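This percentile-matching step is easy to check numerically. A minimal sketch in Python (the slides use R; the helper name is ours):

```python
from scipy.stats import norm

def normal_prior_from_percentile(best_guess, upper, prob=0.99):
    """Solve upper = best_guess + z * sqrt(w) for the prior variance w,
    where z is the `prob` quantile of the standard normal."""
    z = norm.ppf(prob)                 # about 2.33 for prob = 0.99
    sd = (upper - best_guess) / z
    return best_guess, sd ** 2

# 18-year-old smokers: best guess 3.3, 99% sure the mean FEV is below 4.0
mean, var = normal_prior_from_percentile(3.3, 4.0)
print(round(mean, 2), round(var, 2))   # 3.3 0.09
```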

- Important to choose the \(\tilde{x}_i\)s so that the \(\tilde{m}_i\)s can be regarded as independent.
- Then we have a prior of the form \[ \tilde{X} \beta = \tilde{m} = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \\ \tilde{m}_3 \\ \tilde{m}_4 \end{bmatrix} \sim N \left( \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \right) . \]
- We need to obtain \(p(\beta)\), the prior density on \(\beta\), from the density of \(\tilde{X} \beta\).

- Naturally, \(\beta\) has a multivariate normal prior with appropriately transformed mean vector and covariance matrix. In particular, \[ \beta = \tilde{X}^{-1} \tilde{m} \sim N \left( \tilde{X}^{-1} \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix}, \tilde{X}^{-1} \begin{bmatrix} 0.04 & 0 & 0 & 0 \\ 0 & 0.04 & 0 & 0 \\ 0 & 0 & 0.04 & 0 \\ 0 & 0 & 0 & 0.09 \end{bmatrix} \tilde{X}^{-1 \prime} \right) \] so \[ \beta \sim N \left( \begin{bmatrix} 0.16 \\ 0.24 \\ 2.06 \\ -0.18 \end{bmatrix}, \begin{bmatrix} 0.603 & -0.043 & -0.603 & 0.043 \\ -0.043 & 0.003 & 0.043 & -0.003 \\ -0.603 & 0.043 & 1.73 & -0.119 \\ 0.043 & -0.003 & -0.119 & 0.008 \end{bmatrix} \right) \]
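The induced mean and covariance can be verified numerically; a sketch in Python (the slides use R):

```python
import numpy as np

# Elicitation subpopulations: 11-year-old nonsmokers, 13-year-old smokers,
# 16-year-old nonsmokers, 18-year-old smokers.
X_tilde = np.array([[1.0, 11, 0,  0],
                    [1.0, 13, 1, 13],
                    [1.0, 16, 0,  0],
                    [1.0, 18, 1, 18]])
y_tilde = np.array([2.8, 3.0, 4.0, 3.3])   # elicited prior means
D = np.diag([0.04, 0.04, 0.04, 0.09])      # elicited prior variances

X_inv = np.linalg.inv(X_tilde)
beta0 = X_inv @ y_tilde      # induced prior mean: (0.16, 0.24, 2.06, -0.18)
C0 = X_inv @ D @ X_inv.T     # induced prior covariance
```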

- With conjugate priors we also used an \(\tilde{X}\) to elicit prior means \(\tilde{Y}\), but there we treated the \(\tilde{w}\)s rather cavalierly.
- The independent priors approach makes it easier to elicit realistic information about variability.
- With a conjugate prior, information about variability related to regression coefficients must be conditional on \(\tau\). For example, when asking the expert for, say, a 99th percentile that reflects uncertainty about the mean FEV for 18-year-old smokers, that value must be conditional on the variance \(\sigma^2\).

- Now construct an informative prior on \(\tau\) or \(\sigma\) using the FEV example.
- To choose \(a\) and \(b\) for a Gamma\((a,b)\) prior on \(\tau\) we again consider 18-year-old smokers.
- Elicit the largest observation the expert would expect to see from this group, given our best guess for \(\tilde{m}_4\).
- This provides a best guess for the 95th percentile of FEVs among 18-year-old smokers.
- Under normality, the 95th percentile is \(\tilde{m}_4 + 1.645 \sigma\). But the best guess is conditional on \(\tilde{m}_4 = \tilde{y}_4\), so with the expert's best guess for the 95th percentile being 5 and with \(\tilde{y}_4 = 3.3\), our best guess for \(\sigma\), say \(\sigma_0\), satisfies \(5 = 3.3+1.645 \sigma_0\).
- Hence \(\sigma_0 = 1.7/1.645\) or \(\tau_0 = (1.645/1.7)^2 = 0.94\) for \(\tau\). Take this value to be the mode of the gamma prior for \(\tau\), that is \[ 0.94 = \frac{a-1}{b} \text{ or } a(b) = 0.94b+1. \]
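The arithmetic behind the mode constraint can be checked directly; a sketch in Python (the slides use R; the function name is ours):

```python
# Best guesses implied by the elicited 95th percentile of 5 for
# 18-year-old smokers, given m4 = 3.3.
sigma0 = 1.7 / 1.645            # solves 5 = 3.3 + 1.645 * sigma0
tau0 = (1.645 / 1.7) ** 2       # about 0.94; used as the gamma prior mode

def a_of_b(b, mode=0.94):
    """Shape a making the Gamma(a, b) mode (a - 1)/b equal to `mode`."""
    return mode * b + 1

print(round(tau0, 2), round(a_of_b(0.78), 2))   # 0.94 1.73
```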

- To complete our specification of the gamma prior, we elicit our uncertainty about the 95th percentile.
- In addition to a best guess for the 95th percentile, we give an operative lower bound to it by specifying that we are 95% certain that the 95th percentile is greater than 4. It follows that \[ \begin{aligned} 0.95 & = P(\mu + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(3.3 + 1.645 \sigma > 4 | \mu = 3.3) \\ & = P(1.645 \sigma > 0.7) = P(\sigma > 0.7/1.645 ) \\ & = P(\tau < (1.645/0.7)^2 ) = P(\tau < 5.52 ). \end{aligned} \]
- Take a grid of \(b\) values and find the Gamma\((a(b),b)\) distribution that has 95th percentile 5.52.
- Numerically via `R`, we can see that \[ \tau \stackrel{approx}{\sim} \text{Gamma}(1.73, 0.78). \]
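The grid search is short once the percentile constraint is written down; a sketch in Python with `scipy` (the slides use R), noting that `scipy` parameterizes the gamma by shape and scale = 1/rate:

```python
import numpy as np
from scipy.stats import gamma

mode = 0.94                                   # elicited prior mode for tau
bs = np.linspace(0.05, 5, 2000)               # grid of rate values b
a = mode * bs + 1                             # a(b) keeps the mode fixed
q95 = gamma.ppf(0.95, a, scale=1.0 / bs)      # 95th percentile of each prior
b_hat = bs[np.argmin(np.abs(q95 - 5.52))]     # match the elicited constraint
a_hat = mode * b_hat + 1
print(round(a_hat, 2), round(b_hat, 2))       # close to (1.73, 0.78)
```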

- In the absence of substantive prior knowledge, the SIR prior will likely suffice.
- However, if substantive prior information is available, we would be loath to ignore it.
- Sometimes the available information is insufficient to specify a complete BCJ prior on \(r\) regression coefficients, especially when \(r\) is moderate or large. Eliciting real prior information can be time consuming.
- Possible to place a partial prior on regression coefficients when we are only able to specify information for \(r_1 < r\) mean values corresponding to \(r_1\) predictor vectors \(\tilde{x}_i\).
- The simplest case is \(r_1 = 1\), where prior information is available for one **typical** individual in the population.
- The remaining parameters receive improper flat priors or independent normal priors with mean 0 and large variances.

- For normal-gamma independence priors, the posterior distribution \(p(\beta, \tau | Y)\) is analytically intractable, but the full conditionals needed for Gibbs sampling are standard distributions.
- Recall we have for linear regression the likelihood \[ L(\beta, \tau) \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \] and the independence prior specified by \[ \beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b), \] or equivalently \[ \beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ). \]

- Thus we have \[ \begin{aligned} p(\beta, \tau | Y) & \propto L(\beta, \tau) p(\tau) p(\beta) \\ & \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\ & \quad \times \tau^{a-1} e^{-b\tau} \exp \left\{ - \frac{1}{2} (\tilde{X} \beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X} \beta - \tilde{Y}) \right\}. \end{aligned} \]
- Dropping multiplicative terms that do not involve \(\tau\), the full conditional for \(\tau\) has density \[ p(\tau | \beta,Y) \propto \tau^{\frac{n+2a}{2}-1} \exp \left\{ - \frac{\tau}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right\}, \] or \[ \tau | \beta, Y \sim \text{Gamma}\left( \frac{n+2a}{2}, \frac{1}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right). \]

- Because Gibbs sampling involves sampling repeatedly from this distribution for successive \(\beta\) iterates, it may be more computationally efficient to write \[ (Y - X\beta)'(Y - X\beta) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta) \] with \(\hat{\beta}\) the least squares estimate.
- The first term on the right-hand side is a constant in \(\beta\) so the repeated computations involve only the second term, which consists of \(r\) dimensional matrix products rather than \(n\) dimensional products.
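This identity holds because the cross term \((Y - X\hat{\beta})'X(\hat{\beta} - \beta)\) vanishes by the normal equations, and it is easy to confirm on simulated data; a sketch in Python (the slides use R):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, r - 1))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares estimate
SSE = (Y - X @ beta_hat) @ (Y - X @ beta_hat)      # constant across iterates
XtX = X.T @ X                                      # r x r, precomputed once

beta = rng.normal(size=r)                          # an arbitrary beta iterate
lhs = (Y - X @ beta) @ (Y - X @ beta)              # n-dimensional product
rhs = SSE + (beta_hat - beta) @ XtX @ (beta_hat - beta)   # r-dimensional
print(np.isclose(lhs, rhs))                        # True
```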

- To obtain the full conditional for \(\beta\), rewrite the joint density as \[ \begin{aligned} p&(\beta, \tau | Y) \propto \tau^{n/2 + a-1} e^{-b\tau} \times \\ & \exp \left\{ - \frac{1}{2} (X\beta - Y)' (\tau^{-1}I_n)^{-1} (X\beta - Y) - \frac{1}{2} (\tilde{X}\beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X}\beta - \tilde{Y})\right\}. \end{aligned} \]
- Dropping multiplicative terms that do not involve \(\beta\) gives \[ p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix}' \begin{bmatrix} \tau^{-1}I_n & 0 \\ 0 & D(\tilde{w}) \end{bmatrix}^{-1} \begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix} \right\}. \]

- Define \[ \tilde{\beta} = \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \left( \tau X'Y + \tilde{X}' D(\tilde{w})^{-1} \tilde{Y} \right). \]
- Using some matrix algebra and dropping terms not involving \(\beta\), posterior density becomes \[ p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} (\beta - \tilde{\beta}) ' \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right] (\beta - \tilde{\beta}) \right\} , \] which implies \[ \beta | \tau , Y \sim N_r \left( \tilde{\beta}, \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \right). \]
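The two full conditionals assemble directly into a Gibbs sampler; a sketch in Python (the slides use R; the function name and interface are ours):

```python
import numpy as np

def gibbs_normal_gamma(Y, X, X_tilde, y_tilde, w_tilde, a, b,
                       n_iter=5000, seed=0):
    """Gibbs sampler alternating between the two full conditionals:
    tau  | beta, Y ~ Gamma((n + 2a)/2, (2b + SSE(beta))/2)  and
    beta | tau,  Y ~ N_r(beta_tilde, [tau X'X + X~' D^{-1} X~]^{-1})."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    D_inv = np.diag(1.0 / w_tilde)
    prior_prec = X_tilde.T @ D_inv @ X_tilde      # X~' D(w~)^{-1} X~
    prior_shift = X_tilde.T @ D_inv @ y_tilde     # X~' D(w~)^{-1} Y~
    beta = np.linalg.solve(XtX, XtY)              # start at least squares
    betas, taus = np.empty((n_iter, r)), np.empty(n_iter)
    for t in range(n_iter):
        resid = Y - X @ beta
        tau = rng.gamma((n + 2 * a) / 2, 1.0 / (b + resid @ resid / 2))
        prec = tau * XtX + prior_prec
        cov = np.linalg.inv(prec)
        mean = cov @ (tau * XtY + prior_shift)    # beta_tilde at current tau
        beta = rng.multivariate_normal(mean, cov)
        betas[t], taus[t] = beta, tau
    return betas, taus
```

For the FEV analysis, `X_tilde`, `y_tilde`, and `w_tilde` would be the BCJ elicitation quantities from the earlier slides, with \(a\) and \(b\) from the gamma elicitation.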

- Analyze the FEV data with the BCJ prior presented in these slides using `R` with a Gibbs sampler.
- Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
- Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
- Create a table with posterior means, standard deviations, medians, and a 90% credible interval for \(\beta_0, \beta_1, \beta_2, \beta_3, \tau\). Include MCSEs for each of your estimates.
- Compare estimates with the BCJ prior to those obtained using the SIR prior.