STAT 203: Lecture 6

Agenda

Conjugate Priors
Independence Priors
Two-Sample Normal Model
Diasorin Example

Conjugate Priors

Conjugate priors for normal data with unknown precision are determined by \[ \tau \sim \text{Gamma}\left( \frac{a}{2}, \frac{b}{2} \right) \text{ and } \mu | \tau \sim N\left( \mu_0, 1 / \omega_0 \tau \right). \]
Here \(a\), \(b\), \(\mu_0\), and \(\omega_0\) are hyperparameters (known numbers) chosen to characterize the prior information.
Problem with using this prior in practical data analysis is the difficulty of specifying a distribution for \(\mu\) that is conditional on \(\tau\).

Conjugate Priors

Before presenting the posterior, define a Bayesian point estimate for \(\mu\) as \[ \hat{\mu}_B = \frac{n}{n+\omega_0} \bar{y} + \frac{\omega_0}{n+\omega_0} \mu_0 , \] which is a weighted average of the prior mean and the sample mean.
Also define the Bayesian sum of squares error as \[ BSSE = \sum_{i=1}^n (y_i - \bar{y})^2 +\frac{n\omega_0}{n+\omega_0} (\mu_0 -\bar{y})^2. \]
For ease of exposition, we also consider non-integer versions of the \(\chi^2\) and \(t\) distributions.

Joint Posterior

Joint posterior obtained using \[ \mu | \tau,y \sim N (\hat{\mu}_B, 1 / \tau(n+\omega_0)) \] and \[ \tau | y \sim \text{Gamma} \left( \frac{n+a}{2}, \frac{BSSE + b}{2} \right). \]
Then the marginal posterior distribution of \(\tau\) is determined by \[ (BSSE + b) \tau | y = \frac{BSSE + b}{\sigma^2} |y \sim \chi^2(n+a). \]

Joint Posterior

Notice the mean of a \(\chi^2(n+a)\) is \(n+a\), then \[ E(\tau|y) = 1/\hat{\sigma}_B^2 \text{ where } \hat{\sigma}_B^2 = \frac{BSSE + b}{n+a}. \]
Marginal posterior distribution of μ is determined by \[ \frac{\mu - \hat{\mu}_B}{\sqrt{\hat{\sigma}_B^2 / (n+\omega_0)}} \Bigg| y \sim t(n+a). \]

Probability and Prediction Intervals

This leads to the 95% probability interval \[ 0.95 = P\left[\hat{\mu}_B - t \hat{\sigma}_B / \sqrt{n+\omega_0} \le \mu \le \hat{\mu}_B + t \hat{\sigma}_B / \sqrt{n+\omega_0} | y \right] \] where \(t = t(0.975,n+a)\).
Predictive distribution determined by \[ \frac{y_{n+1} - \hat{\mu}_B}{\hat{\sigma}_B \sqrt{1 + \frac{1}{n+\omega_0}}} \Bigg| y \sim t(n+a). \]
Leads to the 95% prediction interval \[ 0.95 = P\left[\hat{\mu}_B - t \hat{\sigma}_B \sqrt{1 + \frac{1}{n+\omega_0}} \le \mu \le \hat{\mu}_B + t \hat{\sigma}_B \sqrt{1 + \frac{1}{n+\omega_0}} | y \right]. \]

Conjugate Versus SIR Priors

Results on SIR priors nearly agree with these when \(\omega_0 = a = b = 0\).
Difference is that when \(\omega_0 = 0\), as opposed to \(\omega_0 > 0\) but small, the BSSE is the sum of \(n\) terms instead of \(n+1\) terms which causes the first hyperparameter of the gamma distribution to be \(n−1\) rather than \(n+a = n+0\).

Independence Priors

Independence priors assume information about \(\mu\) can be elicited independently of information on \(\tau\) (or \(\sigma\)), so \[ p(\mu,\tau) = p(\mu)p(\tau). \]
Makes elicitation of expert's information relatively easy.
Previously considered Gamma priors for \(\tau\) because they are conjugate.
Once we abandon conjugate priors, there is no reason to restrict attention to Gamma priors.
Any other priors with positive support are fair game, as well as priors defined for \(\sigma\) or \(\sigma^2\).
Independent priors constitutes our first real case in which the posterior cannot be obtained using calculus. Instead we must sample from the posterior.

Example

Need to identify a prior distribution that gives information about the unknown parameters \(\mu\) and \(\tau = 1/\sigma^2\).
Say our expert estimated with 95% certainty that the mean zinc percentage \(\mu\) before pouring should be between 4.5% and 5% and would be centered at 4.75%.
Interpret this information as \(E(\mu) = 4.75\) and \(P(4.5 < \mu < 5) = 0.95\) in the prior distribution.
Further assuming that the prior on \(\mu\) is normal, we find \[ \mu \sim N(4.75, 0.0163). \]
We have no good information on \(\sigma^2\), the variance of an observation \(y_i\), so we specify a reference prior on \(\sigma^2\) independent of \(\mu\).
Use is a gamma distribution on \(\tau = 1/\sigma^2\) with very small parameters. In particular, can consider \[ \tau \sim \text{Gamma}(0.001, 0.001). \]

Prior Elicitation

Typically, independence prior assumes \(\mu \sim N(a,1/b)\) and either \(\tau \sim \text{Gamma}(c,d)\) or \(\sigma \sim \text{Gamma}(e,f)\).
We must elicit information that will allow us to identify the four hyperparameters.
Mean \(\mu\) is easy to think about, but the precision \(\tau\) and the standard deviation \(\sigma\) are not.
Idea is to ask for information that is easier to think about, e.g. percentiles of the sampling distribution or an estimate of uncertainty.
Prior elicitation is hard work.
Common to consider prior sensitivity analysis.

Conditional Posteriors

Recall \[ y_1, \dots, y_n | \mu, \tau \stackrel{iid}{\sim} N(\mu, 1/\tau). \]
Consider independence priors \[ \mu \sim N(a,1/b) \quad \perp \quad \tau \sim \text{Gamma}(c,d). \]
First real example in which the posterior distribution cannot be obtained by calculus; it requires simulation.
Look at the posterior in more detail and find the conditional posteriors, yielding a Gibbs sampler.

Conditional Posteriors

The likelihood remains \[ L(\mu, \tau | y) \propto \tau^{n/2} \exp \left[ -\frac{\tau}{2}(n-1)s^2 - \frac{\tau}{2} n (\bar{y} - \mu)^2 \right]. \]
With the independent normal and Gamma priors on \(\mu\) and \(\tau\), respectively, the joint posterior density becomes \[ p(\mu, \tau | y) \propto \tau^{n/2} \exp \left[ -\frac{\tau}{2} \left( (n-1)s^2 + n (\bar{y} - \mu)^2 \right)\right] \exp \left[ -\frac{b}{2} (\mu-a)^2 \right] \tau^{c-1} e^{-\tau d}. \]

Conditional Posteriors

Set \[ \hat{\mu} = \hat{\mu} (\tau) = \frac{n\tau}{n\tau + b} \bar{y} + \frac{b}{n\tau + b} a. \]
Then the posterior can be rewritten as \[ \begin{aligned} p(\mu, \tau | y) \propto \exp & \left[ -\frac{1}{2} (n\tau + b)(\mu - \hat{\mu})^2 \right] \tau^{(c+n/2)-1} \\ & \exp \left[ -\frac{\tau}{2} \left( 2d + (n-1)s^2 + (n \tau b / n \tau + b) (\bar{y} - a)^2 \right) \right]. \end{aligned} \]
Can't write this in any truly convenient form, certainly not one that would allow us to identify it as a particular distribution.
Can find the conditional posteriors \(\mu | \tau,y\) and \(\tau | \mu, y\) exactly!
Conditional distributions are precisely what we need to sample from the posterior distribution using the Gibbs sampler.

Conditional Posteriors

Density of \(\mu | \tau,y\) obtained by dropping multiplicative terms not involving \(\mu\).
Hence \[ p(\mu | \tau, y) \propto \exp \left[ -\frac{1}{2} (n\tau + b)(\mu - \hat{\mu})^2 \right]. \]
Recognizable as a normal density, i.e. \[ \mu | \tau, y \sim N\left( \hat{\mu}, \frac{1}{n\tau + b} \right). \]

Conditional Posteriors

Density of \(\tau | \mu, y\) obtained by dropping multiplicative terms that involving \(\tau\).
Hence \[ p(\tau | \mu, y) \propto \tau^{(c+n/2)-1} \exp \left[ -\tau \left(d + \left\{ (n-1)s^2 + n (\bar{y} - \mu)^2 \right\} /2 \right) \right]. \]
Recognizable as a Gamma density, i.e. \[ \tau | \mu, y \sim \text{Gamma} \left( c+n/2, d + \left\{ (n-1)s^2 + n (\bar{y} - \mu)^2 \right\} /2 \right), \] or \[ \tau | \mu, y \sim \text{Gamma} \left( c+n/2, d + \sum_{i=1}^n (y_i - \mu)^2 /2 \right). \]

Two-Sample Normal Model

If we observe independent samples from two normal populations, often the goal is to compare the two populations.
For example, compare student satisfaction from two methods of teaching or compare wheat yields for two kinds of fertilizer.
Such comparisons are greatly facilitated when the populations have the same variance.

Two-Sample Normal Model

In general we assume the sampling model \[ y_{11}, \dots, y_{1n_1} | \mu_1, \tau_1 \stackrel{iid}{\sim} N(\mu_1, 1/\tau_1) \perp y_{21}, \dots, y_{2n_1} | \mu_2, \tau_2 \stackrel{iid}{\sim} N(\mu_2, 1/\tau_2). \]
Typically assume independent priors for the two populations \[ \mu_1, \tau_1 \perp \mu_2, \tau_2 . \]
Can use any of the one-sample techniques—reference priors, conjugate priors, or independence priors—to obtain priors for the parameters of each population.
Heteroscedasticity poses no technical problems to Bayesians. Straightforward to obtain posterior inferences by simulation after incorporating unequal variances into a Bayesian model.
But for Bayesians and frequentists alike, the real issue is one of interpretation.

Heteroscedasticity

What does it mean if \(\mu_b\) is larger than \(\mu_r\) but the standard deviation for 'red' population 'blue' population?
If larger responses are preferable, 'blue' population is preferable.
Sizable proportion of individuals from 'red' who have higher responses.

Heteroscedasticity

For example, we are often interested in whether some cut-off value is exceeded: systolic blood pressure over 140, diastolic over 90, cholesterol score over 240. In prior figure, 'red' population has a much higher percentage of people with scores above 7 than 'blue' population, even though \(\mu_r < \mu_b\).
Of course if the difference dwarfs the standard deviations, then we can still achieve separation.
In any particular application, it is not obvious from the means alone which population is preferable.
With equal variances, the population means tell much more of the story.
Plotting the predictive distribution is a most useful tool, especially when analyzing normal data with unequal variances.

Two-Sample Normal Model

Traditional approach to comparing two different normal populations is to look at \(\mu_1 - \mu_2\).
However, when we allow for distinct variances (precisions) in the two populations, a simple comparison of means can be meaningless.
Differences in variances can create huge changes in the practical effects associated with a difference in the means.
Making inferences on means when the variances differ has a long and controversial history with little agreement even today.
A practical method for dealing with this problem is by examining the predictive distributions and carefully interpreting them.
However, if we can assume \(\tau_1 = \tau_2 = \tau\), there is a nice analytical Bayesian solution for finding the posterior distribution of \(\mu_1 - \mu_2\) using conjugate priors and no controversy about \(\mu_1 - \mu_2\) being the correct effect measure to examine.

Diasorin Example

Renal osteodystrophy is a bone disease that occurs when the kidneys fail to maintain proper levels of calcium and phosphorus in the blood. Monitoring patients with loss of kidney function for lower than normal bone turnover aids in managing the disease. A commercially available diagnostic assay, Diasorin, claims to be able to tell patients apart who have low versus normal bone turnover.
Cross-section of 34 kidney patients from the bone registry at the University of Kentucky were identified as low or normal turnover by other means and then given the commercial assay to determine whether it could correctly identify them. From boxplots a normal sampling model appears untenable due to marked skewness, but boxplots and quantile plots of the log transformed data seem reasonably normal.

Diasorin Example

n1 <- 19
n2 <- 15
low <- c(91, 46,  95,  60, 33, 410, 105, 43, 189,1097, 54,178, 114, 137, 233, 101, 25,70,357)
normal <- c(370, 267, 99,157, 75,1281, 48, 298, 268, 62,804,430,171,694,404)
diasorin <- c(low,normal)
group <- c(rep(1,n1),rep(2,n2))
group1 <- factor(group, 1:2, c('Low','Normal'))

Diasorin Example

Priors

We elicit independence priors for each population.
Although we analyze data on the log scale, we elicit priors on the original scale.
To get a prior on a mean \(\mu\), we elicit information on the median \(m = e^{\mu}\) of the sampling distribution.
Specify a best guess \(\tilde{m}\) for the median, and a percentile \(\tilde{u}\) for which we are, say, 95% sure that the median is below (or above).
Percentiles are invariant under the log transformation, thus \(P(\mu \le \log \tilde{m}) = 0.5\) and \(P(\mu \le \log \tilde{u}) = 0.95\).
If we model \(\mu \sim N(a,b)\), then \(a = \log \tilde{m}\) and \[ \frac{\log \tilde{u} - \log \tilde{m}}{\sqrt{b}} = 1.645, \] so \(b = (\log \tilde{u} - \log \tilde{m})^2 / 1.645^2\).

Priors

For the low bone turnover group \((n_1 = 19)\), use \(\tilde{m}_L = \tilde{m}_1 = 130\) and 95% sure that this median was less than \(\tilde{u}_L = \tilde{u}_1 =142\).
For the low bone turnover group \(\mu_L = \mu_1 \sim N(4.87,0.00288)\).
For the normal bone turnover group \((n_2 = 15)\), use \(\tilde{m}_N = \tilde{m}_2 = 220\) and 95% sure that this median was less than \(\tilde{u}_N = \tilde{u}_2 =240\).
For the normal bone turnover group \(\mu_N = \mu_2 \sim N(5.39,0.00280)\).

Priors

The priors on the \(\tau\)s are also obtained as for the one-sample normal case.
Elicit information from our expert about the 90th (or some other) percentile of the sampling distribution \(\gamma_{0.9}\). The log of this is \(\mu + 1.645 \sqrt{1/\tau}\).
The elicitation is now conditional on the best guess for \(\mu\) being \(\log \tilde{m}\).
Suppose our best guess for the 90th percentile is \(\tilde{\gamma}_{0.9}\).
Then our best guess for \(\tau\), say \(\tau_0\), conditional on our best guess for \(\mu\), is obtained by solving \(\log \tilde{\gamma}_{0.9} = \log \tilde{m} + 1.645 \sqrt{1/\tau_0}\).
We obtain \(\tau_0 = 1.645^2 / \left\{ \log (\tilde{\gamma}_{0.9} / \tilde{m}) \right\}^2\).
Since we assume a \(\text{Gamma}(c,d)\) prior for \(\tau\), set \((c−1)/d = \tau_0\).
Can proceed with eliciting an upper limit on \(\gamma_{0.9}\) but often the expert wants to stop, in which case we introduce the same large variability as in a proper gamma reference prior by picking a small value for \(d\).

Priors

Best guess for the 90th percentile of Diasorin values in the low (\(\tilde{\gamma}_{0.9,1} = 170\)) and normal (\(\tilde{\gamma}_{0.9,2} = 280\)) bone turnover groups.
We consider gamma priors with modes \(\tau_{0,1} = 37.60\) and \(\tau_{0,2} = 46.53\).
Consider \(d = 0.001\) so \[ \tau_1 \sim \text{Gamma}(1.0376,0.001) \text{ and } \tau_2 \sim \text{Gamma}(1.04653,0.001). \]
Concern with this choice is that the means of the \(\tau\)s are on the order of 1000, so these priors give lots of vaguely specified probability to the right of the mode. In fact, they give 97% and 96% prior probabilities to being greater than their modes, respectively. For \(b = 0.01\), these probabilities reduce to 0.7 and 0.65, and if \(b = 0.1\), they are 0.025 and 0.01, respectively.
Still used b = 0.001 but did a sensitivity analysis.

Posterior Sampling

Can use the Gibbs sampler presented to obtain correlated samples from the posterior \[ p(\mu_1, \tau_1, \mu_2, \tau_2 | y_1, y_2) = p(\mu_1, \tau_1|y_1) p(\mu_2, \tau_2|y_2). \]
Activity: Obtain 50000 samples from the posterior above, then use these to sample from the predictive densities. You can use R, WinBUGS, or MATLAB.
Present my sampler and posterior summaries during our next class.