- Mixed Models
- Toenail Fungus Example
- Gibbs Sampling and Centering
- Group Exercise


- Introduce hierarchical models into logistic regression.
- Specifically, allow some regression coefficients to be random in the sampling model.
- Two primary reasons for introducing random effects are that they induce correlation among Bernoulli observations when appropriate, and that they adjust for missing or unknown covariates/predictors common to a group of observations, all without fundamentally changing the nature of the statistical inference.

- Suppose we sample 100 hospitals and then randomly sample 20 patients from each hospital.
- At the end of each patient's stay, we ask them if they were satisfied with their care in the hospital (binary response).
- In addition, we track certain hospital characteristics like the number of nursing staff per patient, the hospital's average number of patient contacts with doctors, etc.
- Might also track patient-specific items like their age and their specific disease or hospital unit admission.
- No matter how much covariate information is collected, there will always be variables that were omitted. To the extent that these are hospital variables and not patient variables, we can accommodate them by incorporating an overall hospital effect.

- Let \(x_{ij}\) denote the vector of covariate information for the \(j\)th patient in the \(i\)th hospital (excluding an intercept), let \(y_{ij}\) be the indicator of the satisfaction of patient \(j\) in hospital \(i\), and let \(\theta_{ij} = P(y_{ij}=1 | x_{ij})\).
- The model \[ \text{logit} (\theta_{ij}) = \mu + x'_{ij}\beta \] allows for the effects of measured hospital and patient characteristics on patient satisfaction.
- Allows us to estimate a patient's satisfaction for any hospital in the data and to predict patient satisfaction for any new hospital on which we have the appropriate covariates.

- If we suspect that important hospital variables have been omitted, a natural surrogate for them is to add an effect for each hospital, namely \[ \text{logit} (\theta_{ij}) = x'_{ij}\beta + \gamma_{i}. \]
- If \(\gamma_{i}\) is treated as a fixed effect, the model does not allow us to discuss hospitals in general. Every hospital has its own unique effect. There is no possibility of predicting patient satisfaction for a new hospital because there is no way to estimate \(\gamma\) for a new hospital.
- Moreover, the hospital variables incorporated in \(x_{ij}\) would be redundant. (Hospital variables are elements of the vector \(x_{ij}\) that never vary with \(j\).)

- However, if we assume that the \(\gamma_{i}\)s are random effects, e.g., \[ \gamma_{i} | \mu, \tau \stackrel{iid}{\sim} N (\mu, 1 / \tau), \] each hospital is sampled from a population of hospitals and we can discuss both the mean and variability of this population.
- Satisfaction probabilities for patients in each hospital go up or down depending on the sign of \(\gamma_{i} - \mu\).
- Hospital with \(\gamma_{i} = \mu\) is a typical hospital, which allows us to make point predictions of the probability of satisfaction for a patient with covariates \(x\) from a new hospital.
- Moreover, the variability in the \(\gamma_{i}\)s affects the variability in the probabilities of satisfaction for patients from a new hospital.
- This model also correlates responses within each hospital, i.e., \(\{ y_{ij}: j = 1,\dots,20 \}\) are all correlated, while responses from different hospitals are independent. The fact that all patients from hospital \(i\) share the same random \(\gamma_{i}\) effect induces the correlation.
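The correlation induced by a shared random effect is easy to see in a quick simulation. The following is a sketch of my own (in Python, with arbitrary values of \(\mu\) and \(\tau\), not taken from the text): pairs of Bernoulli responses that share a hospital effect \(\gamma_i\) come out positively correlated.

```python
# Hypothetical simulation: a shared random effect induces positive
# correlation among Bernoulli responses within the same group.
import numpy as np

rng = np.random.default_rng(1)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

a_groups, n_per = 2000, 2          # many hospitals, 2 patients each
mu, tau = 0.0, 1.0                 # gamma_i | mu, tau ~ N(mu, 1/tau)
gamma = rng.normal(mu, 1.0 / np.sqrt(tau), size=a_groups)
theta = expit(gamma)               # no covariates: logit(theta_i) = gamma_i
y = rng.binomial(1, theta[:, None], size=(a_groups, n_per))

# Correlation between the two patients in the same hospital
within_corr = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
print(round(within_corr, 2))       # positive (patients share gamma_i)
```

Responses in different hospitals have independent \(\gamma_i\)s, so the same computation across hospitals would give a correlation near zero.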

- Usual binomial regression sampling model is for \(k = 1,\dots,n\), \[ y_k | \theta_k \stackrel{ind}{\sim} \text{Bin}(N_k, \theta_k)\\ \text{logit} (\theta_{k}) = x'_{k}\beta. \]
- In a slight change of notation, write the predictor variables as two vectors \(x_k\) and \(z_k\) of dimensions \(r\) and \(q\), respectively.
- If we similarly write the regression coefficients as two vectors \(\beta\) and \(\gamma\), we can write a binomial regression model as \[ y_k | \theta_k \stackrel{ind}{\sim} \text{Bin}(N_k, \theta_k)\\ \text{logit} (\theta_{k}) = x'_{k}\beta + z'_{k} \gamma. \]
- Point is simply to have the regression coefficients, and their associated predictor variables, separated into two parts.

- For a mixed model, we change the assumptions on one part of the regression coefficients.
- In standard regression, the coefficients are fixed unknown parameters.
- In a mixed model, some of the regression coefficients are assumed to be random.
- In particular, we assume for models with a fixed intercept \[ \gamma | \xi \sim N_q (0, \Sigma_{\gamma}(\xi)) \] with \(\Sigma_{\gamma}(\xi)\) a known positive definite matrix given the value of some parameter vector \(\xi\).
- If the model \(x'_{k}\beta\) does not include an intercept, the elements of \(\gamma\) would have a common mean \(\mu\).

- In most applications, the random effects are due to the observations falling into groups, so the elements of \(z_{k}\) are indicator variables for the groups (like hospitals).
- Moreover, the random group effects are viewed as exchangeable.
- For the various groups \(i\) and observations within groups \(j\), consider a model \[ y_{ij} | \theta_{ij} \stackrel{ind}{\sim} \text{Bin} (N_i, \theta_{ij}) \\ \text{logit} (\theta_{ij}) = x'_{ij}\beta + \gamma_{i} \\ \gamma_i | \tau \stackrel{iid}{\sim} N(0, 1/\tau) . \]
- Let \(i = 1, \dots, a\) and \(j = 1, \dots, n_i\).

- Toenail fungus, a fairly common problem, can disfigure and sometimes destroy the nail.
- Afflicts between 2% and 18% of people worldwide.
- Following the text, consider data from a clinical trial on toenail fungus reported by De Backer et al. (1996).
- Randomized study compares two oral treatments for dermatophyte onychomycosis infection: terbinafine and itraconazole. These are well-known, commercially available treatments.

- Specifically consider data on an unpleasant side effect: the degree of separation of the nail plate from the nail bed.
- Scored in four categories (0, absent; 1, mild; 2, moderate; 3, severe).
- For the \(a = 294\) patients, the response was evaluated at seven visits (approximately weeks 0, 4, 8, 12, 24, 36, and 48).
- A total of 937 measurements were made on the 146 itraconazole (\(Trt_i = 0\)) patients and 971 measurements on the 148 terbinafine (\(Trt_i = 1\)) patients.

- Data available through the beneficence of Novartis, Belgium, on Wes Johnson's website.

```r
toenail <- read.table("http://www.ics.uci.edu/~wjohnson/BIDA/Ch8/ToenailData.txt",
                      header=TRUE)
attach(toenail)
head(toenail)
```

```
##   ID y Trt       Time Visit
## 1  1 1   1  0.0000000     1
## 2  1 1   1  0.8571429     2
## 3  1 1   1  3.5357143     3
## 4  1 0   1  4.5357143     4
## 5  1 0   1  7.5357143     5
## 6  1 0   1 10.0357143     6
```

- We examine a logit mixed effects model for the `toenail` data.
- Let \(y_{ij} = 1\) if individual \(i\) has moderate or severe toenail separation at time \(j\), with \(y_{ij} = 0\) if toenail separation is absent or mild.
- Consider the sampling model, \[ \begin{aligned} y_{ij} | \theta_{ij} & \stackrel{ind}{\sim} \text{Bern}(\theta_{ij}) , \\ \text{logit} (\theta_{ij}) & = \gamma_i + \beta_1 Trt_i + \beta_2 Time_{ij} + \beta_3 Trt_i \times Time_{ij}, \\ \gamma_i | \mu, \tau & \stackrel{iid}{\sim} N(\mu , 1/\tau), \end{aligned} \] for \(i = 1, \dots, 294\) and \(j = 1, \dots, n_i \le 7\).

- Here \(\gamma_i\) is a random effect for each subject, \(Trt\) is the binary treatment indicator, and \(Time\) is the visit time measured in four-week periods.
- Model also includes an interaction, \(Trt \times Time\).
- The \(\beta_k\)s are fixed effects and do not include an intercept, so the random effects \(\gamma_i\) have a non-zero mean \(\mu\).
- Random effects \(\gamma_i\) are included to introduce correlation among the repeated observations on the same individual.

- To look at a typical individual, consider a case with \(\gamma_i = \mu\).
- The log-odds are then straight line functions of \(Time\), one line for each treatment, i.e. \[ \text{logit} (\theta) = \begin{cases} \mu + \beta_2 Time, & \text{ if } Trt=0 , \\ (\mu + \beta_1) + (\beta_2 + \beta_3) Time , & \text{ if } Trt=1. \end{cases} \]
- Thus, \(\beta_1\) is the differential effect between the treatments at \(Time = 0\).
- Since this is a randomized experiment, we have good reason to believe that \(\beta_1\) should be 0.
- Presuming that these treatments actually work, we should see the log-odds decreasing with time, thus we expect \(\beta_2\) and \(\beta_2 + \beta_3\) both to be negative.
- The effect \(\beta_3\) is the difference in slopes between the two treatments. A negative differential slope indicates that treatment 1 is better because the log-odds are decreasing faster with time.
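To make the two lines concrete, here is a small sketch (my own Python code with made-up coefficient values, not fitted results) evaluating the log-odds and probabilities for each treatment at a typical individual (\(\gamma_i = \mu\)):

```python
# Sketch of the two log-odds lines for a "typical" individual (gamma_i = mu),
# using hypothetical illustrative values of mu and the betas (not fitted).
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

mu, beta1, beta2, beta3 = -2.0, 0.0, -0.3, -0.1    # hypothetical values

def log_odds(time, trt):
    # logit(theta) = mu + beta1*Trt + beta2*Time + beta3*Trt*Time
    return mu + beta1 * trt + (beta2 + beta3 * trt) * time

times = np.arange(0, 13, 4)                        # four-week periods
p0 = expit(log_odds(times, trt=0))                 # itraconazole line
p1 = expit(log_odds(times, trt=1))                 # terbinafine line
print(np.round(p0, 3), np.round(p1, 3))
```

With \(\beta_3 < 0\), the \(Trt=1\) probabilities fall faster with time, which is exactly the "differential slope" interpretation above.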

- Require priors for both the regression parameters \(\beta = (\beta_1, \beta_2, \beta_3)'\) and the parameters \(\mu\) and \(\tau\) of the normal distribution for the \(\gamma_i\)s.
- Can use independent normal and gamma reference priors. Specifically, \(N_3(0, 100 \times I_3)\), \(N(0,100)\), and \(\text{Gamma}(2, 0.5)\) priors for \(\beta\), \(\mu\), and \(\tau\), respectively.
- Since the sample size is large, the choice of priors for \(\beta\) and \(\mu\) has little effect; however, that is not always the case for \(\tau\).

- Likelihood function of a general binomial mixed model is not too different from the model without random effects. In this special case, the likelihood is computed directly as \[ L(\beta, \mu, \tau) = \prod_{i} \int \prod_{j} \left( \frac{ e^{x_{ij}' \beta + \gamma_{i}} }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{y_{ij}} \left( \frac{ 1 }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{1-y_{ij}} \tau^{1/2} e^{-\tau(\gamma_i - \mu)^2 / 2} d \gamma_i . \]
- Integral over \(\gamma_i\) is the likelihood contribution for the \(i\)th group, e.g., the \(i\)th hospital or individual.
- Alternatively, we can treat the \(\gamma_i\)s as parameters. Putting them into a vector \(\gamma\) leads to \[ L(\beta, \gamma, \mu, \tau) = \prod_{i} \prod_{j} \left( \frac{ e^{x_{ij}' \beta + \gamma_{i}} }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{y_{ij}} \left( \frac{ 1 }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{1-y_{ij}}. \]

- Then treating the \(\gamma_i\)s as parameters, as a step in a Bayesian analysis to find the posterior of \(\beta\), \(\mu\), and \(\tau\), we equivalently obtain \[ \begin{aligned} L(\beta, & \mu, \tau) = \int f(y | \beta, \gamma, \mu, \tau) p(\gamma | \beta, \mu, \tau) d \gamma = \int L(\beta, \gamma, \mu, \tau) p(\gamma | \beta, \mu, \tau) d \gamma \\ & = \int \prod_{i} \prod_{j} \left( \frac{ e^{x_{ij}' \beta + \gamma_{i}} }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{y_{ij}} \left( \frac{ 1 }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{1-y_{ij}} \left[ \prod_{i} \tau^{1/2} e^{-\tau(\gamma_i - \mu)^2 / 2} \right] d \gamma \\ & = \prod_{i} \int \prod_{j} \left( \frac{ e^{x_{ij}' \beta + \gamma_{i}} }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{y_{ij}} \left( \frac{ 1 }{1 + e^{x_{ij}' \beta + \gamma_{i}}} \right)^{1-y_{ij}} \tau^{1/2} e^{-\tau(\gamma_i - \mu)^2 / 2} d \gamma_i . \end{aligned} \]
- Integrand of the penultimate expression is the augmented data likelihood.
- This times the prior \(p(\beta, \mu, \tau)\) is used to obtain full conditional distributions.
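The integral defining each group's likelihood contribution has no closed form, but it is one-dimensional and easy to evaluate numerically. A sketch of my own (Python/SciPy quadrature, with hypothetical data and parameter values; note the normal density here includes its \((2\pi)^{-1/2}\) constant, which the proportional expressions above drop):

```python
# Illustrative numerical check: the i-th group's likelihood contribution is
# the Bernoulli product integrated against the N(mu, 1/tau) density of gamma_i.
import numpy as np
from scipy import integrate, stats

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# One group with n_i = 4 observations; hypothetical linear predictors x_ij'beta
eta = np.array([-1.0, -0.5, 0.0, 0.5])
y = np.array([0, 0, 1, 1])
mu, tau = 0.0, 2.0

def integrand(g):
    theta = expit(eta + g)
    bern = np.prod(theta**y * (1 - theta)**(1 - y))
    return bern * stats.norm.pdf(g, loc=mu, scale=1 / np.sqrt(tau))

L_i, _ = integrate.quad(integrand, -10, 10)
print(L_i)   # a number strictly between 0 and 1
```

Evaluating this product of one-dimensional integrals at many \((\beta, \mu, \tau)\) values is what makes direct (non-augmented) computation of the likelihood feasible but tedious; the augmented-data approach avoids the quadrature entirely.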

- Complicated models, such as binomial mixed models, typically have many alternative parameterizations.
- Not clear whether \(\gamma\) should be treated as a parameter vector or whether it should be integrated out prior to the analysis.
- Effective sampling from the posterior often depends on the exact parameterization.
- Illustrate how two parameterizations for binomial mixed models affect the Gibbs sampler. Both treat \(\gamma\) as a parameter rather than integrating it out.

- Suppose \[ y_i | \theta_i \stackrel{ind}{\sim} \text{Bin} (N_i , \theta_i), \quad \text{logit}(\theta_i) = x_i' \beta + \gamma_i , \quad i = 1, \dots, n, \] and \[ \gamma_1, \dots, \gamma_n | \tau \stackrel{iid}{\sim} N(0, 1/\tau). \]
- Note this is a special case of our model with \(j=1\).
- Consider priors (text does not typically recommend normal priors for \(\beta\)) \[ \beta \sim N (\beta_0, \Sigma_0) \perp \tau \sim \text{Gamma} \left( \frac{a}{2}, \frac{b}{2} \right). \]

- The joint density has the form \[ p(y, \beta, \gamma, \tau) = f_{*} (y | \beta, \gamma) f_0 (\gamma | \tau) p(\beta) p(\tau). \]
- Specifically, the joint density is \[ \begin{aligned} p(y, & \beta, \gamma, \tau) \\ & \propto \left[ \prod_{i=1}^{n} \left( \frac{ e^{x_{i}' \beta + \gamma_{i}} }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{y_{i}} \left( \frac{ 1 }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{N_i-y_{i}} \right] \left[ \prod_{i=1}^{n} \tau^{1/2} e^{-\tau \gamma_i^2 / 2} \right] \\ & \quad \times \left[ \exp \left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) \right) \right] \left[ \tau^{a/2 - 1} e^{-b \tau /2} \right]. \end{aligned} \]
- To perform Gibbs sampling we need to find \[ p( \beta | y, \gamma, \tau), \ p(\gamma | y, \beta, \tau), \text{ and } p(\tau | y, \beta, \gamma). \]

- From the form of the joint density, \[ \tau | y, \beta, \gamma \sim \text{Gamma} \left( \frac{n+a}{2} , \frac{b+\sum \gamma_i^2}{2} \right) . \]
- We can also see that the \(\gamma_i\)s are conditionally independent with \[ p(\gamma_i | y, \beta, \tau) \propto \left( \frac{ e^{x_{i}' \beta + \gamma_{i}} }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{y_{i}} \left( \frac{ 1 }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{N_i-y_{i}} \tau^{1/2} e^{-\tau \gamma_i^2 / 2} \]
- While this is not a recognizable distribution, it is univariate, which makes it easier to sample from.
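A minimal sketch of these two updates (my own Python code under the priors above, with hypothetical toy values): the \(\tau\) draw is a closed-form Gamma, while each \(\gamma_i\) can be updated with a scalar random-walk Metropolis step on its unrecognizable full conditional.

```python
# Sketch of two Gibbs-sampler updates for the uncentered parameterization.
import numpy as np

rng = np.random.default_rng(3)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_tau(gamma, a, b):
    # tau | y, beta, gamma ~ Gamma((n+a)/2, rate = (b + sum gamma_i^2)/2)
    n = len(gamma)
    return rng.gamma(shape=(n + a) / 2, scale=2.0 / (b + np.sum(gamma**2)))

def log_cond_gamma_i(g, eta_i, y_i, N_i, tau):
    # log of the unnormalized full conditional p(gamma_i | y, beta, tau)
    theta = expit(eta_i + g)
    return y_i * np.log(theta) + (N_i - y_i) * np.log(1 - theta) - tau * g**2 / 2

def mh_step_gamma_i(g, eta_i, y_i, N_i, tau, step=0.5):
    prop = g + rng.normal(0, step)               # random-walk proposal
    log_ratio = (log_cond_gamma_i(prop, eta_i, y_i, N_i, tau)
                 - log_cond_gamma_i(g, eta_i, y_i, N_i, tau))
    return prop if np.log(rng.uniform()) < log_ratio else g

# toy usage: one observation with y_i = 3 of N_i = 5 and x_i'beta = 0.2
g = 0.0
for _ in range(100):
    g = mh_step_gamma_i(g, eta_i=0.2, y_i=3, N_i=5, tau=2.0)
print(g)
```

Because the \(\gamma_i\)s are conditionally independent, these scalar Metropolis steps can be run separately (or vectorized) for each group within one Gibbs sweep.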

- Finally, we could use Metropolis or Acceptance-Rejection to sample from the multivariate distribution \[ \begin{aligned} p( \beta | y, \gamma, \tau) \propto & \left[ \prod_{i=1}^{n} \left( \frac{ e^{x_{i}' \beta + \gamma_{i}} }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{y_{i}} \left( \frac{ 1 }{1 + e^{x_{i}' \beta + \gamma_{i}}} \right)^{N_i-y_{i}} \right] \\ & \times \left[ \exp \left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) \right) \right] . \end{aligned} \]
- Alternatively, we could sample each scalar \(\beta_j\) by sampling \(p( \beta_j | y, \beta_{-j}, \gamma, \tau)\) for \(j = 1, \dots, r\), i.e. a component-wise approach.

- Because there is a separate random effect for each individual, we can modify the parameterization by putting the fixed regression effects into the model for the random effects.
- This redefining of the \(\gamma_i\)s is called the **centering model**.
- Does not affect the substance of the model (the likelihood function is identical), but it affects how we sample from the posterior.

- Take \[ y_i | \gamma \stackrel{ind}{\sim} \text{Bin} (N_i , \theta_i), \quad \text{logit}(\theta_i) = \gamma_i , \] and \[ \gamma_i | \beta, \tau \stackrel{ind}{\sim} N(x_i' \beta , 1/\tau), \quad i = 1, \dots, n. \]
- Now \(\beta\) is a parameter relative to \(\gamma\) rather than \(y_i\).
- Can only do this because the model includes a separate parameter \(\gamma_i\) for each \(y_i\), something that does not generally happen in mixed models for binomial data.
- Use the same prior \[ \beta \sim N (\beta_0, \Sigma_0) \perp \tau \sim \text{Gamma} \left( \frac{a}{2}, \frac{b}{2} \right). \]

- Then the joint density has form \[ p(y, \beta, \gamma, \tau) = f_{*} (y | \gamma) f_0 (\gamma | \beta, \tau) p(\beta) p(\tau). \]
- Specifically, the joint density is \[ \begin{aligned} p(y, & \beta, \gamma, \tau) \\ & \propto \left[ \prod_{i=1}^{n} \left( \frac{ e^{\gamma_{i}} }{1 + e^{\gamma_{i}}} \right)^{y_{i}} \left( \frac{ 1 }{1 + e^{\gamma_{i}}} \right)^{N_i-y_{i}} \right] \left[ \prod_{i=1}^{n} \tau^{1/2} e^{-\tau (\gamma_i - x_{i}' \beta)^2 / 2} \right] \\ & \quad \times \left[ \exp \left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) \right) \right] \left[ \tau^{a/2 - 1} e^{-b \tau /2} \right]. \end{aligned} \]
- To perform Gibbs sampling we need to find \[ p( \beta | y, \gamma, \tau), \ p(\gamma | y, \beta, \tau), \text{ and } p(\tau | y, \beta, \gamma). \]

- Again, from the form of the joint density, \[ \tau | y, \beta, \gamma \sim \text{Gamma} \left( \frac{n+a}{2} , \frac{b+\sum (\gamma_i - x_{i}' \beta)^2}{2} \right) . \]
- We can also see the \(\gamma_i\)s are again conditionally independent with \[ p(\gamma_i | y, \beta, \tau) \propto \left( \frac{ e^{\gamma_{i}} }{1 + e^{\gamma_{i}}} \right)^{y_{i}} \left( \frac{ 1 }{1 + e^{\gamma_{i}}} \right)^{N_i-y_{i}} \tau^{1/2} e^{-\tau (\gamma_i - x_{i}' \beta)^2 / 2} \]
- Still a scalar distribution and not recognizable.

- Mathematically, the big advantage is that the full conditional distribution of \(\beta\) is a multivariate normal distribution.
- The full conditional is based on \[ p( \beta | y, \gamma, \tau) \propto \left[ \prod_{i=1}^{n} \tau^{1/2} e^{-\tau (\gamma_i - x_{i}' \beta)^2 / 2} \right] \left[ \exp \left( -\frac{1}{2} (\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0) \right) \right] . \]
- Does not depend on the \(y_i\)s; it is actually the standard normal-theory linear model posterior (discussed later) in which \(\tau\) is known and the data are the \(\gamma_i\)s.

- With the matrix \(X\) having rows \(x_i'\), the posterior multivariate normal has mean vector \[ \tilde{\beta}(\gamma, \tau) = \left( \tau X'X + \Sigma_0^{-1} \right)^{-1} \left[ \tau X' \gamma + \Sigma_0^{-1} \beta_0 \right] \] and covariance matrix \[ \left( \tau X'X + \Sigma_0^{-1} \right)^{-1}. \]
- This only occurs because we are using a normal prior for \(\beta\), a prior your text does not recommend.
- The centering model is used to reduce correlations among MCMC iterates and improve convergence of the chain.
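As a sketch (my own code, with hypothetical dimensions and values), the conjugate \(\beta\) update in the centered model draws from the multivariate normal with the mean and covariance given above; note it uses the current \(\gamma_i\)s as the "data":

```python
# Sketch of the conjugate beta update in the centered parameterization:
# beta | gamma, tau ~ N( (tau X'X + S0^-1)^-1 [tau X'gamma + S0^-1 beta0],
#                        (tau X'X + S0^-1)^-1 )
import numpy as np

rng = np.random.default_rng(4)

def draw_beta(X, gamma, tau, beta0, Sigma0):
    Sigma0_inv = np.linalg.inv(Sigma0)
    prec = tau * X.T @ X + Sigma0_inv                 # posterior precision
    cov = np.linalg.inv(prec)                         # posterior covariance
    mean = cov @ (tau * X.T @ gamma + Sigma0_inv @ beta0)
    return rng.multivariate_normal(mean, cov)

# toy usage with r = 2 covariates and n = 50 current values of gamma_i
X = rng.normal(size=(50, 2))
true_beta = np.array([1.0, -0.5])
gamma = X @ true_beta + rng.normal(0, 0.5, size=50)   # gamma_i ~ N(x_i'beta, 1/tau)
beta = draw_beta(X, gamma, tau=4.0,
                 beta0=np.zeros(2), Sigma0=100 * np.eye(2))
print(beta)   # should land near (1.0, -0.5)
```

This is the payoff of centering: one of the three Gibbs updates becomes an exact multivariate normal draw instead of a Metropolis step.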

- Obtain a Gibbs sampler for the centering model using the Toenail Fungus data. Use priors noted above.
- Describe, with words and appropriate plots, the mixing behavior of the Gibbs sampler.
- Determine a suitable number of Markov chain iterations to use in your analysis. Describe and justify your selection.
- Obtain posterior summaries similar to the text. Do your numbers agree?
- Recall, we have good reason to believe that \(\beta_1\) should be 0. Is it?
- Further, recall we should see the log-odds decreasing with time. Are they?
- Which treatment seems to be working better? Why?

Be prepared to present your results on 2/27 during class. Each group will have 20 minutes to present. Discussion on 2/22 is open for working time and questions related to the group exercise.