Agenda

  • Motivating Example
  • Binomial Regression
  • Trauma Data Example
  • Posterior Sampling with mcmc Package
  • Exercise on O-Ring Data

O-Ring Data

  • On January 28, 1986, the space shuttle Challenger exploded shortly after takeoff.
  • Explosion was caused by field O-rings that failed due to the low atmospheric temperature at takeoff, 31 degrees Fahrenheit.
  • Data on O-ring failure \(y\) along with temperature at takeoff \(\check{x}\) from the 23 pre-Challenger space shuttle launches. Each flight is viewed as an independent trial.
  • Result of a trial is 1 if any field O-rings failed on the flight and 0 if all the O-rings functioned properly.
  • We want to estimate the probability of O-ring failure as a function of temperature.

O-Ring Data

  • Data available here.
# Oring <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch8/OringData.txt", 
#                      header=TRUE)
Oring <- read.table("OringData.txt", header=TRUE)
head(Oring)
##   Flight Failure Temperature
## 1     14       1          53
## 2      9       1          57
## 3     23       1          58
## 4     10       1          63
## 5      1       0          66
## 6      5       0          67

O-Ring Data

  • Let \(\theta_i\) be the probability that any O-ring fails on launch \(i\).
  • Model the probability of O-ring failure as a simple function of temperature, i.e. use a straight line to model a function of the probabilities \[ \text{logit}(\theta_i) = \log \left[ \theta_i / (1-\theta_i) \right] = \beta_1 + \beta_2 \check{x}_i, \] where \(\check{x}_i\) is the temperature corresponding to launch \(i\).
  • Define \[ x_i = \left[ \begin{matrix} 1 \\ \check{x}_i \end{matrix} \right] \text{ and } \beta = \left[ \begin{matrix} \beta_1 \\ \beta_2 \end{matrix} \right], \] then write the model as \(\text{logit}(\theta_i) = x_i ' \beta\).
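  • As a quick numerical check (our sketch), base R's qlogis and plogis are the logit and inverse-logit functions:
theta <- 0.2
(eta <- qlogis(theta))    # logit(theta) = log(theta / (1 - theta))
log(theta / (1 - theta))  # same value, computed directly
plogis(eta)               # inverse logit recovers theta = 0.2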

Standardization

  • Often convenient in regression problems to standardize continuous covariates so as to improve the performance of computational methods.
  • Simplest standardization is to subtract the mean from each continuous covariate (see the sketch after this list).
  • Writing \(\bar{x}\) as the sample mean, O-ring model becomes \[ y_i | \theta_i \sim \text{Bin}(1, \theta_i), \quad \text{logit}(\theta_i) = \alpha + \beta_2 \left( \check{x}_i - \bar{x} \right). \]
  • For the O-ring data, standardization is helpful but not necessary.
  • In more complicated problems, numerical procedures may be difficult to perform without standardization.
  • Standardization can also affect the specification of prior information.
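  • A minimal sketch of mean-centering for the O-ring data (the names Temp.c and model1.c are ours):
# Center the temperature covariate; the slope estimate is unchanged and
# the intercept becomes alpha = beta1 + beta2 * mean(Temperature).
Oring$Temp.c <- Oring$Temperature - mean(Oring$Temperature)
model1.c <- glm(Failure ~ Temp.c, data = Oring, family = binomial())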

Binomial Regression

  • Consider data \(y_i\), \(i = 1, \dots, n\) with \[ y_i | \theta_i \stackrel{ind}{\sim} \text{Bin}(N_i, \theta_i). \]
  • Associated with each observation \(y_i\) are \(r\) fixed predictor variables \(x_{i1}, \dots, x_{ir}\).
  • Models use linear functions of the predictor variables, but we need some function of the probabilities that transforms values between 0 and 1 into values on the real line.

Binomial Regression

  • A convenient class of functions that does this consists of the inverses of cumulative distribution functions.
  • If \(F\) is a cdf with inverse \(F^{-1}\), then the general binomial regression model specifies \[ F^{-1} (\theta_i) = \sum_{j=1}^r \beta_j x_{ij} = x_i' \beta, \] where \(x_i' = \left( x_{i1}, \dots, x_{ir} \right)\) and \(\beta = \left( \beta_1, \dots, \beta_r \right)'\).
  • Equivalently, \[ \theta_i = F(x_i' \beta). \]
  • Usually \(x_{i1}=1\) so \(x_i' = \left( 1, x_{i2}, \dots, x_{ir} \right)\) and \(\beta_1\) is an intercept.
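  • As an illustration (not from the slides), glm's binomial family lets us choose \(F\) via its link argument, e.g. the logistic cdf (logit link) or the standard normal cdf (probit link):
# Logit link: F = plogis, i.e. theta = exp(eta) / (1 + exp(eta))
glm(Failure ~ Temperature, data = Oring, family = binomial(link = "logit"))
# Probit link: F = pnorm, the standard normal cdf
glm(Failure ~ Temperature, data = Oring, family = binomial(link = "probit"))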

Link Functions

[Figures: link functions]

Likelihood

  • Likelihood function for independent binomial observations written as a vector \(y = (y_1, \dots, y_n)'\) is \[ \begin{aligned} L (\beta | y) & = \prod_{i=1}^n L (\beta | y_i) \\ & = \prod_{i=1}^n {N_i \choose y_i} \theta_i^{y_i} \left( 1 - \theta_i \right)^{N_i - y_i} \\ & = \prod_{i=1}^n {N_i \choose y_i} \left[ F (x_i' \beta ) \right]^{y_i} \left[ 1 - F (x_i' \beta ) \right]^{N_i - y_i} . \end{aligned} \]
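  • A direct R transcription of this log likelihood (our sketch; the function name loglik is ours, and N is the vector of trial sizes, all 1 for the O-ring data):
# Binomial log likelihood at beta; x is the n x r design matrix.
loglik <- function(beta, x, y, N) {
     theta <- plogis(as.numeric(x %*% beta))  # theta_i = F(x_i' beta), F logistic
     sum(dbinom(y, size = N, prob = theta, log = TRUE))
}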

O-Ring Data MLE

model1 <- glm(Failure ~ Temperature, data = Oring,
     family = binomial(), x = TRUE)
summary(model1)
## 
## Call:
## glm(formula = Failure ~ Temperature, family = binomial(), data = Oring, 
##     x = TRUE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.7613  -0.3783   0.4524   2.2175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## Temperature  -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5
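  • As a follow-up sketch (ours), the plug-in MLE estimate of the failure probability at the 31 degree Challenger launch temperature, an extrapolation well below the observed temperatures:
# Estimated P(O-ring failure) at 31 degrees Fahrenheit
predict(model1, newdata = data.frame(Temperature = 31), type = "response")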

Trauma Data

Dr. Turner Osler, a trauma surgeon and former head of the Burn Unit at the University of New Mexico Trauma Center, provided data on survival of patients admitted to the University of New Mexico Trauma Center between 1991 and 1994.

  • The predictor variables are injury severity score (ISS), revised trauma score (RTS), AGE, and type of injury (TI). TI is either blunt (\(TI = 0\)), e.g., the result of a car crash, or penetrating (\(TI = 1\)), e.g., gunshot or stab wounds.
  • ISS is an overall index of a patient's injuries based on the approximately 1,300 injuries cataloged in the Abbreviated Injury Scale. Takes on values from 0 for a patient with no injuries to 75 for a patient with severe injuries in three or more body areas.
  • RTS is an index of physiologic status derived from the Glasgow Coma Scale. It is a weighted average of a number of measurements on an incoming patient such as systolic blood pressure and respiratory rate. Takes on values from 0 for a patient with no vital signs to 7.84 for a patient with normal vital signs.

Trauma Data

  • Data available here.
# trauma <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch8/trauma300.txt", 
#                      header=TRUE)
traumadata <- read.table("trauma300.txt", header=TRUE)
attach(traumadata)
head(traumadata)
##     ID death ISS TI    RTS AGE
## 1 2979     0   1  1 7.8408  25
## 2 1167     0   9  0 7.8408  63
## 3  116     0  29  0 2.9304  32
## 4 1118     0  10  0 7.8408  61
## 5 3126     0   2  1 7.8408  43
## 6 2173     0   9  0 7.8408  27

Trauma Data

[Figures: trauma data]
Trauma Data

Dr. Osler proposed a logistic regression model to estimate the probability of a patient's death using an intercept; the predictors ISS, RTS, AGE, and TI; and an interaction between AGE and TI. AGE is viewed as a surrogate for a patient's physiologic reserve, i.e., their ability to withstand trauma.

The model \[ \text{logit}(\theta_i) = \beta_1 + \beta_2 ISS_i + \beta_3 RTS_i + \beta_4 AGE_i + \beta_5 TI_i + \beta_6 (AGE*TI)_i \] is similar to logistic models used by trauma centers throughout the United States.

Introduce \(x_{i6} = (AGE*TI)_i\) and correct for the mean values of all the continuous predictor variables, then the model becomes \[ \text{logit}(\theta_i) = \alpha + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}. \]

Trauma Data

# Center the continuous predictors; TI is binary and left uncentered,
# while the AGE*TI interaction is centered at its mean.
x.check <- cbind(ISS-mean(ISS), RTS-mean(RTS), AGE-mean(AGE), TI, AGE*TI-mean(AGE*TI))
colnames(x.check) <- c("x2", "x3", "x4", "x5", "x6")
head(x.check)
##          x2        x3    x4 x5        x6
## [1,] -13.28  0.553934 -6.41  1 17.633333
## [2,]  -5.28  0.553934 31.59  0 -7.366667
## [3,]  14.72 -4.356466  0.59  0 -7.366667
## [4,]  -4.28  0.553934 29.59  0 -7.366667
## [5,] -12.28  0.553934 11.59  1 35.633333
## [6,]  -5.28  0.553934 -4.41  0 -7.366667
model2 <- glm(death ~ x.check, family = binomial(), x = TRUE)

Trauma Data

## 
## Call:
## glm(formula = death ~ x.check, family = binomial(), x = TRUE)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5388  -0.2921  -0.1912  -0.1135   2.9019  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.934673   0.596240  -6.599 4.14e-11 ***
## x.checkx2    0.082666   0.027508   3.005  0.00265 ** 
## x.checkx3   -0.552683   0.170824  -3.235  0.00121 ** 
## x.checkx4    0.053456   0.016382   3.263  0.00110 ** 
## x.checkx5    1.338003   1.333584   1.003  0.31571    
## x.checkx6   -0.005013   0.031696  -0.158  0.87433    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157.31  on 299  degrees of freedom
## Residual deviance: 100.03  on 294  degrees of freedom
## AIC: 112.03
## 
## Number of Fisher Scoring iterations: 7

Binomial Regression Analysis

  • But this class isn’t about that frequentist analysis; we want a Bayesian analysis.
  • For our Bayesian analysis we assume the same data model as the frequentist, i.e. we continue to use the logit link function.
  • Bayesian analysis requires specification of a prior distribution.
  • We assume the prior distribution of the parameters (the regression coefficients) makes them independent and identically normally distributed with mean 0 and variance \(\sigma^2 = 25\).

Unnormalized Posterior

  • Suppose \(\eta = x\beta\), where \(x\) is the design matrix, so the linear predictor for observation \(i\) is \(\eta_i = x_i'\beta\).
  • The log unnormalized posterior (log likelihood plus log prior) density for this model is \[ \sum_{i=1}^n y_i \log \left[ \frac{e^{\eta_i}}{1+e^{\eta_i}} \right] + \sum_{i=1}^n (N_i - y_i) \log \left[ \frac{1}{1+e^{\eta_i}} \right] - \frac{1}{2 \sigma^2} \sum_{j=1}^r \beta_j^2. \]

Unnormalized Posterior

  • The log unnormalized posterior density is calculated by the following R function.
x <- model2$x
y <- model2$y
lupost <- function(beta, x, y) {
     eta <- as.numeric(x %*% beta)
     # Stable evaluation of log(p) and log(q): branch on the sign of eta so
     # the argument of exp() is always negative (details on the next slide).
     logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(- eta)))
     logq <- ifelse(eta < 0, - log1p(exp(eta)), - eta - log1p(exp(- eta)))
     logl <- sum(logp[y == 1]) + sum(logq[y == 0])
     # Add the log prior: beta_j iid N(0, 25), so -sum(beta^2) / (2 * 25)
     return(logl - sum(beta^2) / 50)
}

Tricky Calculation

  • The tricky calculation of the log likelihood avoids overflow and catastrophic cancellation in calculation of \(\log(p)\) and \(\log(q)\) where \[ \begin{aligned} p & = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}} \\ q & = \frac{1}{1+e^{\eta}} = \frac{e^{-\eta}}{1+e^{-\eta}}. \end{aligned} \]
  • So taking logs gives \[ \begin{aligned} \log (p) & = \eta - \log (1+e^{\eta}) = - \log (1+e^{-\eta}) \\ \log (q) & = - \log (1+e^{\eta}) = -\eta - \log (1+e^{-\eta}). \end{aligned} \]
  • To avoid overflow, we always choose the case where the argument of \(\exp\) is negative.

Tricky Calculation

  • We have also avoided catastrophic cancellation when \(| \eta |\) is large.
  • If \(\eta\) is large and positive, then \[ \begin{aligned} p & \approx 1 \\ q & \approx 0 \\ \log (p) & \approx - e^{-\eta} \\ \log (q) & \approx -\eta - e^{-\eta} \end{aligned} \] and our use of the R function log1p, which calculates \(\log(1+x)\) accurately for small \(x\), avoids all problems.
  • Case where \(\eta\) is large and negative is similar.
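  • A small demonstration (ours, not in the original) of the overflow problem for large positive \(\eta\):
eta <- 800
log(1 / (1 + exp(eta)))    # naive log(q): exp(800) overflows to Inf, giving -Inf
- eta - log1p(exp(- eta))  # stable form: returns -800, the correct value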

Posterior Simulations

  • Use mcmc package and metrop function to obtain posterior simulations.
  • Following code runs the Metropolis algorithm to simulate the posterior.
library(mcmc)
set.seed(1)
beta.init <- as.numeric(coefficients(model2))
out <- metrop(lupost, beta.init, 1e3, x = x, y = y)
names(out)
##  [1] "accept"       "batch"        "initial"      "final"       
##  [5] "accept.batch" "initial.seed" "final.seed"   "time"        
##  [9] "lud"          "nbatch"       "blen"         "nspac"       
## [13] "scale"        "debug"
out$accept
## [1] 0

Posterior Simulations

  • Acceptance rate is 0! We need to adjust the scale input.
  • Generally accepted that an acceptance rate of about 20% is right, although this recommendation is based on the asymptotic analysis of a toy problem (simulating a multivariate normal distribution) for which one would never use MCMC and which is very unrepresentative of difficult MCMC applications.
  • The 20% magic number must be considered like other rules of thumb: there are examples where it fails, but it is something simple to tell beginners that works all right for many problems.

Posterior Simulations

out <- metrop(out, scale=0.1, x = x, y = y)
out$accept
## [1] 0.01
out <- metrop(out, scale=0.03, x = x, y = y)
out$accept
## [1] 0.223
  • Here the first argument to each instance of the metrop function is the output of a previous invocation. The Markov chain continues where the previous run stopped, doing just what it would have done if it had kept going, the initial state and random seed being the final state and random seed of the previous invocation. Everything stays the same except for the arguments supplied (here scale).

Posterior Simulations

  • Argument scale controls the size of the Metropolis normal random walk proposal.
  • Default is scale = 1.
  • Big steps give lower acceptance rates; small steps give higher. We want something around 20%.
  • Also possible to make scale a vector or a matrix (see the sketch after this list).
  • Because each run starts where the last one stopped (when the first argument to metrop is the output of the previous invocation), each run serves as “burn-in” for its successor.
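  • One possible matrix scale (an assumption on our part, not from the slides) shapes the proposal with a Cholesky factor of the MLE's asymptotic covariance; metrop proposes state + scale %*% z with z standard normal:
# Sketch: matrix-scaled random walk proposal using the glm fit's covariance.
V <- vcov(model2)
out <- metrop(out, scale = 0.5 * t(chol(V)), x = x, y = y)
out$accept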

Diagnostics

  • That does it for the acceptance rate. So let’s do a longer run and look at the results.
out <- metrop(out, nbatch = 1e4, x = x, y = y)
out$accept
## [1] 0.1994
out$time
##    user  system elapsed 
##   1.224   0.058   1.295

Diagnostics

plot(ts(out$batch))  # time series plot of each component of the chain

Diagnostics

acf(out$batch[,1:3])  # autocorrelations for the first three components

Diagnostics

acf(out$batch[,4:6])  # autocorrelations for the last three components

Diagnostics

acf(out$batch)  # autocorrelations and cross-correlations of all components

Estimate and MCSE

  • Very high correlation!
  • Need extremely long run with large batch sizes to ensure reasonable Monte Carlo estimates and MCSEs (batch means sketched below).
  • Your turn to work on a simpler problem.
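  • Before the exercise, a minimal sketch of batch means (run lengths are illustrative and the names post.mean and mcse are ours; the high autocorrelation here calls for much longer runs):
# Rerun with 100 batches of length 100; out$batch then holds batch means.
out <- metrop(out, nbatch = 100, blen = 100, x = x, y = y)
post.mean <- apply(out$batch, 2, mean)              # posterior mean estimates
mcse <- apply(out$batch, 2, sd) / sqrt(out$nbatch)  # batch-means MCSEs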

Exercise

  • Write a Metropolis normal random walk sampler using the metrop function with the O-Ring Data.
  • Find a reasonable scale and illustrate the correlation in your sampler.
  • How long did you run your sampler? Why?
  • Group 1: Find predictive probabilities of O-ring failure along with 90% probability intervals. What do you observe for temperatures near the freezing point of 32 degrees?
  • Group 2: Perform a prior sensitivity analysis. Be sure to include an informative prior similar to the one suggested in the text and at least one non-informative prior. How much impact does your prior selection have on posterior means, posterior medians, and credible intervals for parameters in the model?

Presentations

  • You will have 10-15 minutes to present your results on Thursday. Group members can decide who will be presenting.
  • Need to identify a group leader prior to the end of class today.