--- title: "STAT 203: Lecture 18" author: "James M. Flegal" output: ioslides_presentation: smaller: yes --- ## Agenda - FEV Data Example - Independence Priors - Gibbs Sampling - Group Exercise ## Independence Priors - **Independence priors** used most often in practice, that is we assume prior independence of $\beta$ and $\tau$ by letting $p(\beta, \tau) = p(\beta)p(\tau)$. - The method of Bedrick, Christensen, and Johnson (1997), or BCJ method, of constructing an informative prior for β analogous to the method used for logistic regression. - Goal is not to find the perfect prior to characterize someone's beliefs but to find a sensible prior that incorporates some basic scientific or experiential knowledge. - Frame our discussion of prior elicitation around the FEV study from Rosner (2006) on pulmonary function (lung capacity) in adolescents. ## FEV Data Example - Response $y$ is forced expiratory volume (FEV), which measures the volume of air in liters expelled in 1 second of a forceful breath. - Lung function is expected to increase during adolescence, but smoking may slow its progression. - The association between smoking and lung capacity in adolescents is investigated using data from 345 adolescents between the ages of 10 and 19. - The predictor variables include a constant, age in years ($x_2$), a indicator variable for smoking status ($x_3$), and the interaction term $x_4 = x_2 x_3$. A predictor vector is $x' = (1,x_2,x_3,x_4)$. ## FEV Data Example ```{r} FEVdata <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch7/FEVdataAge10to19.txt" ,header=T, sep="\t") attach(FEVdata) options(contrasts=c("contr.treatment","contr.poly")) head(FEVdata) ``` ## Interaction Plot with LOWESS Curve ```{r, echo=FALSE} ## Interaction plot. LOWESS plot for the FEV data plot(Age,FEV,xlab="Age", ylab="FEV") lines(lowess(Age[Smoke==1],FEV[Smoke==1],f=.95),lwd=2,cex=1.2) lines(lowess(Age[Smoke==0],FEV[Smoke==0],f=.95),lty=2,lwd=2,cex=1.2) legend("bottomright", c("Smoker", "Nonsmoker"), lty=c(1,2)) ``` ## Bayesian Posterior Means Plot ```{r, echo=FALSE} BayesSmoker <- c(3.058,3.124,3.191,3.258,3.324,3.391,3.458,3.525,3.591,3.658) BayesNonsmoker <- c(2.772,2.982,3.192,3.402,3.612,3.821,4.031,4.241,4.451,4.661) Age1 <- seq(10,19,1) plot(Age,FEV,xlab="Age", ylab="FEV", lwd=1.5) lines(Age1,BayesSmoker,type='l', lty=1, lwd=2,cex=1.2) lines(Age1,BayesNonsmoker,type='l', lty=2, lwd=2,cex=1.2) legend("bottomright", c("Smoker", "Nonsmoker"), lty=c(1,2), lwd=2) ``` ## Sampling Model - With sampling model \[ Y|\beta ,\tau \sim N_n (X\beta, \tau^{-1} I_n), \] our standard informative prior has independent components \[ \beta \sim N_r ( \beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b). \] - Our task is to determine values $a$, $b$, $\beta_0$, and $C_0$ that accurately reflect available prior information. ## Prior on $\beta$ - Similar to before, for $\beta$ we specify an equivalent prior \[ \tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})), \] with known $r \times r$ matrix $\tilde{X}$, $r$ vector $\tilde{Y}$, and diagonal matrix $D(\tilde{w})$. - We then induce a prior on $\beta$. - As with the conjugagte prior, multiply $\tilde{X} \beta$ on the left by $\tilde{X}^{-1}$. This gives us \[ \beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime} ). \] - This BCJ method of constructing an informative prior above is analogous to the method used for logistic regression and similar to that used in the conjugate prior example. 
## Prior on $\beta$

- The motivation is that individual regression coefficients $\beta_j$ are difficult to think about, since they are only indirectly related to anything observable.
- Instead of specifying a prior directly on $\beta$, we specify a prior on values that are more closely related to observables and induce the prior for $\beta$.
- It is relatively easy to elicit a prior for the mean value of something that is potentially observable.
- We elicit a prior for the mean of the observations corresponding to a particular set of predictor variables.

## Prior on $\beta$

- Technically, we elicit priors on the conditional means
\[
\tilde{m}_i = E[y|\tilde{x}_i] = \tilde{x}_i' \beta
\]
for $r$ subpopulations defined by different predictor vectors $\tilde{x}_i$, $i = 1, \dots, r$.
- The resulting prior has the form
\[
\tilde{X} \beta \sim N_r (\tilde{Y}, D(\tilde{w})),
\]
and the induced prior on $\beta$ has the form
\[
\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).
\]

## FEV Model Priors

- With four regression parameters in the FEV model, we pick four covariate combinations $\tilde{x}_i$, $i = 1, 2, 3, 4$, corresponding to circumstances that the expert can assess independently. In particular, we use prior experience to specify information about the subpopulations of 11-year-old nonsmokers, 13-year-old smokers, 16-year-old nonsmokers, and 18-year-old smokers.
- In matrix form we have
\[
\tilde{X} = \begin{bmatrix} \tilde{x}_1' \\ \tilde{x}_2' \\ \tilde{x}_3' \\ \tilde{x}_4' \end{bmatrix}
= \begin{bmatrix}
1 & 11 & 0 & 0 \\
1 & 13 & 1 & 13 \\
1 & 16 & 0 & 0 \\
1 & 18 & 1 & 18 \\
\end{bmatrix} .
\]
- FEV values display variability within subpopulations. We are interested in the mean FEV value $\tilde{m}_i$ for the four circumstances in $\tilde{X}$ and ultimately in placing normal priors on the mean FEVs for the four types of adolescents defined by these covariate combinations.

## FEV Model Priors

- As an example, suppose that our medical collaborator expects the average FEV among all 18-year-old smokers in the sampled population to be 3.3, and is 99% sure that the mean FEV is less than 4.0 in this group. We take $\tilde{y}_4 = 3.3$ to be the mean of $\tilde{m}_4$. The prior variance reflects our uncertainty about where the mean lies, and if a normal prior is assumed then the 99th percentile of $\tilde{m}_4$ satisfies
\[
4.0 = 3.3 + 2.33 \sqrt{\tilde{w}_4},
\]
so $\tilde{w}_4 = (0.7/2.33)^2 \approx 0.09$ and
\[
\tilde{m}_4 \sim N(3.3, 0.09).
\]
- Similar steps yield
\[
\tilde{m}_1 \sim N(2.8, 0.04), \quad
\tilde{m}_2 \sim N(3.0, 0.04), \quad
\tilde{m}_3 \sim N(4.0, 0.04).
\]

## FEV Model Priors

- It is important to choose the $\tilde{x}_i$s so that the $\tilde{m}_i$s can be regarded as independent.
- Then we have a prior of the form
\[
\tilde{X} \beta = \tilde{m} = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \\ \tilde{m}_3 \\ \tilde{m}_4 \end{bmatrix}
\sim N \left(
\begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix},
\begin{bmatrix}
0.04 & 0 & 0 & 0 \\
0 & 0.04 & 0 & 0 \\
0 & 0 & 0.04 & 0 \\
0 & 0 & 0 & 0.09
\end{bmatrix}
\right) .
\]
- We need to obtain $p(\beta)$, the prior density on $\beta$, from the density of $\tilde{X} \beta$.

## FEV Model Priors

- Naturally, $p(\beta)$ has a multivariate normal distribution with appropriately transformed mean vector and covariance matrix. In particular,
\[
\beta = \tilde{X}^{-1} \tilde{m} \sim N \left(
\tilde{X}^{-1} \begin{bmatrix} 2.8 \\ 3.0 \\ 4.0 \\ 3.3 \end{bmatrix},
\tilde{X}^{-1} \begin{bmatrix}
0.04 & 0 & 0 & 0 \\
0 & 0.04 & 0 & 0 \\
0 & 0 & 0.04 & 0 \\
0 & 0 & 0 & 0.09
\end{bmatrix} \tilde{X}^{-1 \prime}
\right),
\]
so
\[
\beta \sim N \left(
\begin{bmatrix} 0.16 \\ 0.24 \\ 2.06 \\ -0.18 \end{bmatrix},
\begin{bmatrix}
0.603 & -0.043 & -0.603 & 0.043 \\
-0.043 & 0.003 & 0.043 & -0.003 \\
-0.603 & 0.043 & 1.73 & -0.119 \\
0.043 & -0.003 & -0.119 & 0.008
\end{bmatrix}
\right).
\]
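## Computing the Induced Prior in `R`

- The induced prior mean $\tilde{X}^{-1} \tilde{Y}$ and covariance $\tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}$ can be computed directly in `R`. Below is a minimal sketch (the object names are our own) that should reproduce the values on the previous slide up to rounding.

```{r}
## Induce the prior on beta from the elicited prior on Xtilde %*% beta
Xtilde <- rbind(c(1, 11, 0, 0),
                c(1, 13, 1, 13),
                c(1, 16, 0, 0),
                c(1, 18, 1, 18))
Ytilde <- c(2.8, 3.0, 4.0, 3.3)            # elicited prior means
Dw     <- diag(c(0.04, 0.04, 0.04, 0.09))  # elicited prior variances
Xinv   <- solve(Xtilde)
beta0  <- Xinv %*% Ytilde          # prior mean of beta
C0     <- Xinv %*% Dw %*% t(Xinv)  # prior covariance of beta
round(beta0, 3)
round(C0, 3)
```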
## Independence Priors

- With the conjugate prior we also used an $\tilde{X}$ to elicit prior means $\tilde{Y}$, but there we treated the $\tilde{w}$s rather cavalierly.
- The independence prior approach makes it easier to elicit realistic information about variability.
- With a conjugate prior, information about variability related to regression coefficients must be conditional on $\tau$. For example, when asking the expert for, say, a 99th percentile that reflects uncertainty about the mean FEV for 18-year-old smokers, that value must be conditional on the variance $\sigma^2$.

## Prior on $\tau$

- We now construct an informative prior on $\tau$ or $\sigma$ using the FEV example.
- To choose $a$ and $b$ for a Gamma$(a,b)$ prior on $\tau$ we again consider 18-year-old smokers.
- Elicit the largest observation the expert would expect to see from this group, given our best guess for $\tilde{m}_4$.
- This provides a best guess for the 95th percentile of FEVs among 18-year-old smokers.
- Under normality, the 95th percentile is $\tilde{m}_4 + 1.645 \sigma$. But the best guess is conditional on $\tilde{m}_4 = \tilde{y}_4$, so with the expert's best guess for the 95th percentile being 5 and with $\tilde{y}_4 = 3.3$, our best guess for $\sigma$, say $\sigma_0$, satisfies $5 = 3.3 + 1.645 \sigma_0$.
- Hence $\sigma_0 = 1.7/1.645$, or $\tau_0 = (1.645/1.7)^2 = 0.94$ for $\tau$. Take this value to be the mode of the gamma prior for $\tau$, that is,
\[
0.94 = \frac{a-1}{b} \text{ or } a(b) = 0.94b + 1.
\]

## Prior on $\tau$

- To complete our specification of the gamma prior, we elicit an uncertainty about the 95th percentile.
- In addition to a best guess for the 95th percentile, we give an operative lower bound for it by specifying that we are 95% certain that the 95th percentile is greater than 4. It follows that
\[
\begin{aligned}
0.95 & = P(\mu + 1.645 \sigma > 4 | \mu = 3.3) \\
& = P(3.3 + 1.645 \sigma > 4 | \mu = 3.3) \\
& = P(1.645 \sigma > 0.7) = P(\sigma > 0.7/1.645) \\
& = P(\tau < (1.645/0.7)^2) = P(\tau < 5.52).
\end{aligned}
\]
- Take a grid of $b$ values and find the Gamma$(a(b),b)$ distribution that has 95th percentile 5.52 (see the `R` sketch on the next slide).
- Numerically via `R`, we can see that
\[
\tau \stackrel{approx}{\sim} \text{Gamma}(1.73, 0.78).
\]
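## Prior on $\tau$ in `R`

- One way to carry out the grid search just described; this is a minimal sketch in which the grid range and object names are our own choices. The selected values should be close to the Gamma$(1.73, 0.78)$ prior reported on the previous slide.

```{r}
## Grid search for the gamma prior on tau: fix the mode at 0.94 via
## a(b) = 0.94 b + 1, then choose b so the 95th percentile of tau is 5.52
b.grid <- seq(0.01, 5, by = 0.01)
a.grid <- 0.94 * b.grid + 1
q95    <- qgamma(0.95, shape = a.grid, rate = b.grid)
b.star <- b.grid[which.min(abs(q95 - 5.52))]
a.star <- 0.94 * b.star + 1
round(c(a = a.star, b = b.star), 2)
```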
## Partial Prior Information

- In the absence of substantive prior knowledge, the SIR prior will likely suffice.
- However, if substantive prior information is available, we would be loath to ignore it.
- Sometimes the available information is insufficient to specify a complete BCJ prior on $r$ regression coefficients, especially when $r$ is moderate or large. Eliciting real prior information can be time-consuming.
- It is possible to place a partial prior on the regression coefficients when we are only able to specify information for $r_1 < r$ mean values corresponding to $r_1$ predictor vectors $\tilde{x}_i$.
- The simplest case is $r_1 = 1$, where prior information is available for one **typical** individual in the population.
- The remaining parameters receive improper flat priors or independent normal priors with mean 0 and large variances.

## Gibbs Sampling

- For normal-gamma independence priors, the posterior distribution $p(\beta, \tau | Y)$ is analytically intractable, but the full conditionals needed for Gibbs sampling are standard distributions.
- Recall that for linear regression we have the likelihood
\[
L(\beta, \tau) \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\}
\]
and the independence prior specified by
\[
\beta \sim N_r (\beta_0, C_0) \perp \tau \sim \text{Gamma}(a,b),
\]
or equivalently
\[
\beta \sim N_r (\tilde{X}^{-1} \tilde{Y}, \tilde{X}^{-1} D(\tilde{w}) \tilde{X}^{-1 \prime}).
\]

## Gibbs Sampling

- Thus we have
\[
\begin{aligned}
p(\beta, \tau | Y) & \propto L(\beta, \tau) p(\tau) p(\beta) \\
& \propto \tau^{n/2} \exp \left\{ - \frac{\tau}{2} (Y - X\beta)'(Y - X\beta) \right\} \\
& \quad \times \tau^{a-1} e^{-b\tau} \exp \left\{ - \frac{1}{2} (\tilde{X} \beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X} \beta - \tilde{Y}) \right\}.
\end{aligned}
\]
- Dropping multiplicative terms that do not involve $\tau$, the full conditional for $\tau$ has density
\[
p(\tau | \beta, Y) \propto \tau^{\frac{n+2a}{2}-1} \exp \left\{ - \frac{\tau}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right\},
\]
or
\[
\tau | \beta, Y \sim \text{Gamma}\left( \frac{n+2a}{2}, \frac{1}{2} [2b + (Y - X\beta)'(Y - X\beta)] \right).
\]

## Gibbs Sampling

- Because Gibbs sampling involves sampling repeatedly from this distribution for successive $\beta$ iterates, it may be more computationally efficient to write
\[
(Y - X\beta)'(Y - X\beta) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta)' (X'X) (\hat{\beta} - \beta)
\]
with $\hat{\beta}$ the least squares estimate.
- The first term on the right-hand side is constant in $\beta$, so the repeated computations involve only the second term, which consists of $r$-dimensional matrix products rather than $n$-dimensional products.

## Gibbs Sampling

- To obtain the full conditional for $\beta$, rewrite the joint density as
\[
\begin{aligned}
p&(\beta, \tau | Y) \propto \tau^{n/2 + a-1} e^{-b\tau} \times \\
& \exp \left\{ - \frac{1}{2} (X\beta - Y)' (\tau^{-1}I_n)^{-1} (X\beta - Y) - \frac{1}{2} (\tilde{X}\beta - \tilde{Y})' D(\tilde{w})^{-1} (\tilde{X}\beta - \tilde{Y})\right\}.
\end{aligned}
\]
- Dropping multiplicative terms that do not involve $\beta$ gives
\[
p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2}
\begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix}'
\begin{bmatrix} \tau^{-1}I_n & 0 \\ 0 & D(\tilde{w}) \end{bmatrix}^{-1}
\begin{bmatrix} X\beta - Y \\ \tilde{X}\beta - \tilde{Y} \end{bmatrix} \right\}.
\]

## Gibbs Sampling

- Define
\[
\tilde{\beta} = \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \left( \tau X'Y + \tilde{X}' D(\tilde{w})^{-1} \tilde{Y} \right).
\]
- Using some matrix algebra and dropping terms not involving $\beta$, the full conditional density becomes
\[
p(\beta | \tau , Y) \propto \exp \left\{ - \frac{1}{2} (\beta - \tilde{\beta})' \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right] (\beta - \tilde{\beta}) \right\},
\]
which implies
\[
\beta | \tau , Y \sim N_r \left( \tilde{\beta}, \left[ \tau X'X + \tilde{X}' D(\tilde{w})^{-1} \tilde{X} \right]^{-1} \right).
\]
- A short `R` sketch of the resulting sampler is given on the next slide.
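## Gibbs Sampling in `R`

- Below is a minimal sketch of a Gibbs sampler built from the two full conditionals above; the function name, arguments, and the use of `MASS::mvrnorm` for the multivariate normal draw are our own choices, not a prescribed implementation.

```{r}
## Sketch of a Gibbs sampler for the normal-gamma independence prior
library(MASS)

gibbs.lm <- function(niter, Y, X, Xtilde, Ytilde, Dw, a, b, tau.init = 1) {
  n <- length(Y); r <- ncol(X)
  XtX  <- crossprod(X)              # X'X
  XtY  <- crossprod(X, Y)           # X'Y
  Dinv <- solve(Dw)
  prior.prec <- t(Xtilde) %*% Dinv %*% Xtilde   # Xtilde' D(w)^{-1} Xtilde
  prior.term <- t(Xtilde) %*% Dinv %*% Ytilde   # Xtilde' D(w)^{-1} Ytilde
  out <- matrix(NA, niter, r + 1)
  tau <- tau.init
  for (i in 1:niter) {
    ## beta | tau, Y
    prec   <- tau * XtX + prior.prec
    btilde <- drop(solve(prec, tau * XtY + prior.term))
    beta   <- mvrnorm(1, btilde, solve(prec))
    ## tau | beta, Y
    resid <- Y - drop(X %*% beta)
    tau   <- rgamma(1, shape = (n + 2 * a) / 2,
                    rate = (2 * b + sum(resid^2)) / 2)
    out[i, ] <- c(beta, tau)
  }
  colnames(out) <- c(paste0("beta", 1:r), "tau")
  out
}

## Example call (using Xtilde, Ytilde, Dw from the earlier prior slide):
## draws <- gibbs.lm(10000, FEV, cbind(1, Age, Smoke, Age * Smoke),
##                   Xtilde, Ytilde, Dw, a = 1.73, b = 0.78)
```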
## Exercise

1. Analyze the FEV data with the BCJ prior presented in these slides using `R` with a Gibbs sampler.
2. Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
3. Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
4. Create a table with posterior means, standard deviations, medians, and a 90% credible interval for $\beta_0, \beta_1, \beta_2, \beta_3, \tau$. Include MCSEs for each of your estimates.
5. Compare estimates with the BCJ prior to those obtained using the SIR prior.