• Time to get serious about analyzing data
  • Present commonly used parametric models including the binomial, negative binomial, normal with unknown precision, and Poisson
  • Priors can involve substantive scientific input or they can be chosen as convenient reference priors
  • Scientific input obtained by eliciting information about objects that scientists can easily think about, then applied to the parameters of the model

Inference for Proportions

  • Does smoking increase the probability of lung cancer?
  • Do Fords break down more than Toyotas?
  • Which of two major airlines has a higher proportion of on-time arrivals?
  • Involve comparing probabilities for two groups.
  • Relevant data involve sampling individuals from each group (people, cars, patients, flights) and recording binary outcomes (cancer, breakdown, cure, on-time).
  • Consider more complex problems with binomial data later.

Ball Washer Example

  • We considered data from the St. Paul Brass and Aluminum Foundry.
  • Out of 2,430 push rod eyes poured at their foundry only 2,211 actually shipped.
  • Interest is in the probability of pouring a defective part.
  • Vice President of Operations thinks that a plausible range for the proportion of scrap is 5% to 15%.
  • Assume that the number of defective parts \(y\) has a binomial distribution \[ y|\theta \sim \text{Bin} (2430, \theta). \]

Ball Washer Example

  • Need a prior on \(\theta\). Interpreted the Vice President as specifying \(P(\theta \le 0.05) = 0.025\) and \(P(\theta \ge 0.15) = 0.025\).
  • Assuming that \(\theta\) is modeled with a Beta prior, these two probability statements can be shown to determine the Beta distribution \[ \theta \sim \text{Beta} (12.05,116.06). \]
  • With \(y = 219 = 2430−2211\) the posterior distribution is \(\text{Beta}(219+12.05,2430−219+116.06)\) with median 0.0902 and 95% probability interval \((0.0795,0.1017)\).
  • Estimate the median of the posterior distribution as 0.09 with 95% PI for \(\theta\) of \((0.08,0.10)\).
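The posterior summaries above can be reproduced directly from the Beta posterior; a minimal sketch using SciPy:

```python
# Ball Washer posterior: Beta(12.05, 116.06) prior updated with
# y = 219 defectives out of n = 2430 parts (values from the slides).
import scipy.stats as st

n, shipped = 2430, 2211
y = n - shipped                       # 219 defective parts
a, b = 12.05, 116.06                  # elicited Beta prior

post = st.beta(a + y, b + n - y)      # Beta(231.05, 2327.06) posterior
median = post.ppf(0.5)                # posterior median, about 0.0902
lo, hi = post.ppf(0.025), post.ppf(0.975)  # 95% PI, about (0.0795, 0.1017)
print(f"median {median:.4f}, 95% PI ({lo:.4f}, {hi:.4f})")
```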

Prior Distributions

  • Beta family provides a flexible and convenient class of distributions for modeling uncertainty about probabilities.
  • Can lead to a variety of density shapes including U-shaped, J-shaped, L-shaped, unimodal symmetric, unimodal skewed right and left, and the \(U(0,1)\) density.
  • Finding a prior that agrees with the information of the expert is far more important than the convenience of using a Beta distribution.
  • If the expert's prior is bimodal, then we could use a mixture of Betas.

Reference Priors

  • For \(y|\theta \sim \text{Bin} (n, \theta)\) there is little agreement on how to choose a reference prior.
  • Standard candidates include:
    • improper \(\text{Beta}(0,0)\) distribution
    • Jeffreys' prior, i.e. the \(\text{Beta}(0.5,0.5)\) distribution
    • the \(U(0,1) = \text{Beta}(1,1)\) prior
  • First two put most of their probability on values very near 0 and 1, the improper prior overwhelmingly so. Fortunately, these two priors tend to have little effect on the posterior when sample sizes are moderate and when the data don't concentrate near 0 or 1.
  • All three priors are data augmentation priors in the sense that they can be viewed as adding prior successes and failures to the data.
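A quick sketch comparing the three reference priors on the foundry data: because the data-augmentation amounts to at most one extra success and one extra failure, the posteriors barely differ at this sample size.

```python
# Posterior under the three candidate reference priors for the foundry
# data (y = 219, n = 2430).  Beta(0,0) is improper, but its posterior
# Beta(y, n - y) is proper whenever 0 < y < n.
import scipy.stats as st

n, y = 2430, 219
priors = {"improper Beta(0,0)": (0.0, 0.0),
          "Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
          "uniform Beta(1,1)": (1.0, 1.0)}
medians = {}
for name, (a, b) in priors.items():
    post = st.beta(a + y, b + n - y)
    medians[name] = post.ppf(0.5)
    print(f"{name}: median {medians[name]:.4f}, "
          f"95% PI ({post.ppf(0.025):.4f}, {post.ppf(0.975):.4f})")
```

All three medians agree with \(y/n \approx 0.090\) to three decimal places, illustrating the "little effect on the posterior" claim.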

Informative Beta Priors

  • Most often we obtain a prior on the probability \(\theta\) from an expert by eliciting two pieces of information:
    • best guess for the probability
    • biggest value the probability could reasonably be, say, a value with only a 5% chance that the probability would exceed it
  • More generally, our elicitation involves the expert guessing the value of \(\theta\) and providing a percentile for the distribution of \(\theta\).
  • Values must be elicited independently of the current data being analyzed.
  • Priors are often elicited after the current data have been collected which makes elicitation ignoring the current data more difficult.
  • Inputs could also be based on published values or historical data.
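The two-piece elicitation can be automated: treating the best guess as the prior mode and searching over the prior "weight" \(a+b\) until the stated percentile matches. The numbers 0.10 and 0.15 below are illustrative, not from the slides.

```python
# Hedged sketch of Beta elicitation: best guess (taken as the mode)
# plus an upper percentile determine (a, b).  Parameterize by
# w = a + b with the mode held fixed, then solve for w numerically.
from scipy.optimize import brentq
from scipy.stats import beta

guess, upper, p = 0.10, 0.15, 0.95    # mode, 95th percentile (illustrative)

def percentile_gap(w):
    a = 1 + guess * (w - 2)           # mode of Beta(a,b) is (a-1)/(a+b-2)
    b = 1 + (1 - guess) * (w - 2)
    return beta.ppf(p, a, b) - upper

w = brentq(percentile_gap, 2.01, 1e5)  # percentile shrinks toward the mode as w grows
a = 1 + guess * (w - 2)
b = 1 + (1 - guess) * (w - 2)
print(f"elicited prior: Beta({a:.2f}, {b:.2f})")
```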

Other Settings

  • Rare Events: When dealing with probabilities that are very small (or large), we actually have a great deal of prior information.
  • Non-Beta Priors: Use a mixture of Beta distributions for bimodal priors or a truncated Beta distribution for a proportion restricted to a subset of \([0,1]\).

Effect Measures

  • With binary data the main subject of interest is the probability of "success" \(\theta\).
  • In biosciences \(\theta\) is often referred to as a risk, e.g. proportion of smokers who develop lung cancer denotes the "risk" of smoking.
  • "Risk" denotes both good and bad outcomes, just as we call \(\theta\) the probability of success even though \(\theta\) may be the probability of death or failure.


  • Related measure of risk is the odds.
  • The odds \(O\) are the ratio of the success probability to the failure probability, \[ O = \theta / (1-\theta). \]
  • Can retrieve the probability from the odds because \[ \theta = O / (1+O). \]
  • With \(0 < \theta < 1\), the odds is a positive number.
  • Odds gets larger as \(\theta\) gets larger.
  • Odds of one correspond to \(\theta = 0.5\).
  • Sometimes we look at the log of the odds, with values from negative infinity to positive infinity.
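A minimal check of the probability/odds relationships:

```python
# Probability <-> odds conversions from the slide.
import math

def odds(theta):
    return theta / (1 - theta)

def prob(o):
    return o / (1 + o)        # inverse map: theta = O / (1 + O)

assert odds(0.5) == 1.0       # odds of one correspond to theta = 0.5
assert abs(prob(odds(0.8)) - 0.8) < 1e-12   # round trip
log_odds = math.log(odds(0.5))              # log-odds of 0 at theta = 0.5
print(log_odds)
```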

Two Populations

  • More complicated and more interesting when we have two populations to compare, say the probabilities of lung cancer for smokers, \(\theta_1\), and for non-smokers, \(\theta_2\).
  • One comparison looks at their ratio, known as the relative risk (RR), \(RR =\theta_1 / \theta_2\).
  • Can also look at the odds ratio defined as \[ OR = \frac{\theta_1/(1-\theta_1)}{\theta_2/(1-\theta_2)}. \]
  • Mathematically, odds ratios have some nice properties. Log-odds ratios appear naturally in later models.
  • Third effect measure for two populations is the risk difference, \(RD = \theta_1 - \theta_2\), also called the attributable difference.
  • Point estimates and probability intervals for RD, RR, or OR are readily available using a sample from the posterior.
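A sketch of that last point, with hypothetical counts: draw from the two Beta posteriors and summarize RD, RR, and OR from the joint sample.

```python
# Monte Carlo posterior summaries for RD, RR, and OR from two
# independent binomials with Beta(1,1) priors.  Counts are illustrative.
import numpy as np

rng = np.random.default_rng(0)
y1, n1 = 30, 100                      # "successes" in group 1
y2, n2 = 15, 100                      # group 2

t1 = rng.beta(1 + y1, 1 + n1 - y1, size=100_000)   # posterior draws
t2 = rng.beta(1 + y2, 1 + n2 - y2, size=100_000)

rd = t1 - t2
rr = t1 / t2
odds_ratio = (t1 / (1 - t1)) / (t2 / (1 - t2))
for name, x in [("RD", rd), ("RR", rr), ("OR", odds_ratio)]:
    lo, hi = np.percentile(x, [2.5, 97.5])
    print(f"{name}: median {np.median(x):.3f}, 95% PI ({lo:.3f}, {hi:.3f})")
```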

Case-Control Sampling

  • Often can sample individuals and cross-classify them by an explanatory factor and a response factor. We then condition on the explanatory factor to arrive at data from two independent binomials.
  • Alternatively, if the explanatory factor defines convenient populations, we can take independent samples from each level of the explanatory factor.
  • Both sampling strategies break down when studying rare events.
  • Suppose we wanted to study the relationship between childhood vaccinations and autism. Autism is still relatively rare, so we would need a very large sample of children if we were to include, say, 50 autistic children. This would be true regardless of whether we sample children and cross-classify them, or whether we could conveniently sample the subpopulations of vaccinated children and unvaccinated children (a highly unlikely prospect).

Case-Control Sampling

  • Instead, we might conveniently take one sample of autistic children and one sample of nonautistic children, that is, a sample of cases and a sample of controls.
  • Could then classify children in each group by whether they had received childhood vaccinations.
  • If we have a population of autistic children available, it becomes much easier to ensure that the study includes a reasonable number of them.

  • Problem with case-control sampling for vaccinations and autism is that we want to study the \(\theta'_j\)s, the probabilities that children are autistic given their vaccination status, but what the data allow us to study are the \(\theta_j\)s, the probabilities of being vaccinated given their autism status.
  • Although we cannot study the \(\theta'_j\)s directly, we will exploit the fact, shown later, that the odds ratio based on the \(\theta'_j\)s is the same as the odds ratio based on the \(\theta_j\)s, a value we can estimate.

Case-Control Sampling

  • Assume a disease of interest \(D\) with \(D = 1\) indicating presence and \(D = 2\) indicating absence of the disease and an exposure variable \(E\) with \(E = 1\) indicating exposure and \(E = 2\) indicating no exposure.
  • For example, \(E = 1\) corresponds to “vaccinated” and \(E = 2\) corresponds to “non-vaccinated” while \(D = 1\) corresponds to “autism” and \(D = 2\) corresponds to “no autism.”
  • Interested in determining how \(E\) affects the probability of \(D\), namely \(P(D|E)\).
  • Want to compare \(P(D=1|E=1) = \theta'_1\) and \(P(D=1|E=2) = \theta'_2\).
  • Ideally, we would like to know, say, the risk ratio \[ RR = P(D=1|E=1) / P(D=1|E=2) = \theta'_1 / \theta'_2. \]
  • Unfortunately, our study design does not allow us to estimate the risk ratio, the risk difference, or the \(\theta'_j\)s.

Case-Control Sampling

  • However, the odds ratio of being autistic when vaccinated relative to being autistic when not vaccinated turns out to be estimable.
  • With our study, we can estimate \[ \theta_1 = P(E=1 | D=1) \text{ and } \theta_2 = P(E=1 | D=2). \]
  • Moreover, the odds ratio of being autistic when vaccinated relative to being autistic when not vaccinated turns out to be a function of the \(\theta_j\)s.
  • We will show that the odds ratio we want turns out to equal the odds ratio of being vaccinated when autistic relative to being vaccinated when not autistic.

Odds Ratio

  • Note the probability of being vaccinated when autistic is \[ \theta_1 = P(E=1 | D=1) = \frac{P(D=1|E=1) P(E=1)}{P(D=1)} = \frac{\theta'_1 P(E=1)}{P(D=1)}. \]
  • The odds of being vaccinated when autistic are \[ \frac{\theta_1}{1-\theta_1} = \frac{P(E=1 | D=1)}{P(E=2 | D=1)} = \frac{P(D=1|E=1) P(E=1)}{P(D=1|E=2) P(E=2)} = \frac{\theta'_1 P(E=1)}{\theta'_2 P(E=2)}. \]

Odds Ratio

  • Similarly, the probability of being vaccinated when not autistic is \[ \theta_2 = \frac{(1 - \theta'_1) P(E=1)}{P(D=2)} \] and the odds of being vaccinated when not autistic are \[ \frac{\theta_2}{1-\theta_2} = \frac{(1 - \theta'_1) P(E=1)}{(1 - \theta'_2) P(E=2)}. \]
  • Finally, the odds ratio is \[ OR = \frac{\theta_1 / (1-\theta_1)}{\theta_2 / (1-\theta_2)} = \frac{\theta'_1 P(E=1)}{\theta'_2 P(E=2)} \bigg/ \frac{(1 - \theta'_1) P(E=1)}{(1 - \theta'_2) P(E=2)} = \frac{\theta'_1/(1-\theta'_1)}{\theta'_2/(1-\theta'_2)}. \]
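The identity can be verified numerically; the values for \(\theta'_1\), \(\theta'_2\), and \(P(E=1)\) below are made up for illustration.

```python
# Numeric check that the retrospective OR equals the prospective OR.
t1p, t2p, gamma = 0.01, 0.002, 0.7    # P(D=1|E=1), P(D=1|E=2), P(E=1)

pd1 = t1p * gamma + t2p * (1 - gamma)           # P(D=1) by total probability
t1 = t1p * gamma / pd1                          # P(E=1|D=1) via Bayes
t2 = (1 - t1p) * gamma / (1 - pd1)              # P(E=1|D=2) via Bayes

or_prospective = (t1p / (1 - t1p)) / (t2p / (1 - t2p))
or_retrospective = (t1 / (1 - t1)) / (t2 / (1 - t2))
print(or_prospective, or_retrospective)         # identical
```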

Odds Ratio

  • If we are willing to use the odds ratio as our effect measure, we can examine the relative effect of vaccination on the odds of autism even from a case-control study.
  • Not exactly what we hoped to achieve; we wanted to estimate the risk ratio.
  • When looking at small probabilities, like the probability of autism, the odds ratio is a good approximation to the risk ratio.
  • For example if OR = 10, the odds of autism in the vaccinated group would be 10 times that for the non-vaccinated group.
  • If \(P(D = 1|E = 1) = 0.01\) and \(P(D = 1|E = 2) = 0.001\) we get \(RR = 10 \approx OR\).
  • Unfortunately, when the probabilities are not small, say if \(P(D = 1|E = 1) = 10/19 = 0.53\) and \(P(D = 1|E = 2) = 0.1\), \(OR = 10\) but \(RR = 5.3\), a big difference.
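Both numerical illustrations check out directly:

```python
# OR approximates RR for rare events, but not for common ones.
def rr_or(p1, p2):
    return p1 / p2, (p1 / (1 - p1)) / (p2 / (1 - p2))

rr1, or1 = rr_or(0.01, 0.001)     # rare event: RR = 10, OR close to 10
rr2, or2 = rr_or(10 / 19, 0.1)    # common event: OR = 10 but RR only 5.3
print(rr1, or1, rr2, or2)
```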

Case-Control Sampling

  • General model for our simple case-control study consists of two independent binomials \(y_1\) and \(y_2\) that give the numbers of \(E = 1\) individuals from the cases and controls, respectively.
  • Assume independently \[ y_i | \theta_i \sim \text{Bin}(n_i , \theta_i). \]
  • Bayesian approach to analysis takes on different forms depending on how the prior information is specified.
  • If the experts have prior information on \(\theta_1 = P(E=1 | D=1)\) and \(\theta_2 = P(E=1 | D=2)\), then simple independent Beta priors on these may suffice and the analysis is straightforward.
  • Only caveat is that interest focuses almost exclusively on the odds ratio.

Prior Information on \(OR\)

  • Another situation that can arise is where there is prior information on \(OR\), perhaps based on a previous case-control study.
  • Can place a prior on the parameters by placing a normal prior on \(\delta = \log(OR)\) and an independent Beta prior on \(\theta_2\).
  • This induces a prior on \((\theta_1, \theta_2)\).
  • To apply this, we need to solve for \(\theta_1\) in terms of \(\delta\) and \(\theta_2\), i.e. \[ \theta_1 = \frac{e^{\delta} \theta_2}{1 - \theta_2(1 - e^{\delta})}. \]
  • To elicit a prior on \(\delta\) we think about \(OR\). If our best guess is that \(OR = 3\), then we take the mean of the normal distribution for \(\delta\) to be \(\log (3) = 1.1\). Moreover, if we are, say, 90% sure that the \(OR\) is at least 0.8, then we are also 90% sure that \(\log (OR)\) is at least \(\log (0.8) = −0.22\).
  • We need to find a normal distribution with a mean of 1.1 and a 10th percentile of −0.22. Accomplished by using a \(N(1.1, (1.03)^2)\) on \(\delta\).
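The elicitation above, and the induced \(\theta_1\), can be sketched in a few lines:

```python
# Normal prior on delta = log(OR): mean log(3) and 10th percentile
# log(0.8) pin down the standard deviation (values from the slides).
import math
from scipy.stats import norm

mean = math.log(3)                      # best guess OR = 3
q10 = math.log(0.8)                     # 90% sure OR is at least 0.8
sd = (mean - q10) / (-norm.ppf(0.10))   # z_{0.10} = -1.2816, gives sd near 1.03
print(f"N({mean:.2f}, ({sd:.2f})^2)")

def theta1(delta, theta2):
    # Induced theta_1 from (delta, theta_2), as in the formula above.
    e = math.exp(delta)
    return e * theta2 / (1 - theta2 * (1 - e))
```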

Case-Control Sampling

  • Third scenario involves eliciting prior information on \(\theta'_1 = P(D=1|E=1)\) and \(\theta'_2 = P(D=1|E=2)\), and \(\gamma = P(E = 1)\).
  • With prior information on these three parameters, we can combine it with the information from the case-control data to obtain posterior inferences for all parameters of interest including the risk ratio of the \(\theta'_j\)s.
  • We need to define the relationship between the variables. Using Bayes' Theorem, \[ \theta_1 = \frac{\theta'_1 \gamma}{\theta'_1 \gamma + \theta'_2 (1 - \gamma)} \text{ and } \theta_2 = \frac{(1 - \theta'_1) \gamma}{(1-\theta'_1) \gamma + (1-\theta'_2) (1 - \gamma)}. \]
  • Prior on \((\theta'_1, \theta'_2, \gamma)\) induces a prior on \((\theta_1, \theta_2)\).

Case-Control Sampling

  • However, there is no information in our case-control data about the probability that someone would have autism.
  • More generally case-control data contain no information about \[ P(D=1) = \theta'_1 \gamma + \theta'_2(1-\gamma). \]
  • Thus there exists a function of the parameters used in the prior whose distribution will not change in the posterior.
  • What we can learn about from the case-control data are \(\theta_1\) and \(\theta_2\) and anything that is a function of them, like the \(OR\).
  • If we try to make inferences on anything else, there is no guarantee that more data will get us closer to the "true" answer because aspects of the posterior will depend only on the prior and not the data.

Inference for Normal Populations

  • Normal data are ubiquitous in the scientific world because they often arise from measurements.
  • Let \(\tau = 1 / \sigma^2\) and suppose we have observations \[ y_1, \dots, y_n | \mu, \sigma^2 \stackrel{iid}{\sim} N(\mu, 1/\tau). \]
  • Need to place a prior on \(\theta = (\mu, \tau)'\). Traditionally, this was done using either a reference prior or a conjugate prior.
  • Conjugate prior can have limited practical utility for data analysis because it can be difficult to elicit expert information for the parameters.
  • Most people use priors with independent information on \(\mu\) and \(\tau\).

Reference Priors

  • Most commonly used reference prior puts "independent" flat priors on both \(\mu\) and \(\log(\tau)\), i.e. \[ p(\mu, \tau) \propto \frac{1}{\tau}. \]
  • Refer to this prior as the standard improper reference (SIR) prior.
  • Resulting posterior inference mimics the usual frequentist results for normal data.
  • Can think of this as a justification for using the usual frequentist results.
  • Denote the usual sample mean and sample variance defined as \[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \text{ and } s^2 =\frac{1}{n-1}\sum_{i=1}^n(y_i - \bar{y})^2. \]

SIR Prior

  • Then we can establish that \[ \mu | \tau,y \sim N(\bar{y}, 1/n\tau) \] and \[ \tau | y \sim \text{Gamma} \left[ (n-1) / 2, (n-1)s^2/2 \right] . \]
  • Frequentist confidence intervals for \(\mu\) and \(\sigma^2\) are identical to Bayesian posterior intervals under this prior.
  • Except that it is perfectly proper now to discuss the probability of \(\mu\) and \(\sigma^2\) being in the interval.
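The numerical agreement is easy to demonstrate; the data below are illustrative.

```python
# Under the SIR prior, the 95% posterior interval for mu coincides
# with the frequentist t interval (made-up data for illustration).
import numpy as np
from scipy import stats

y = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

# Bayesian: (mu - ybar)/sqrt(s^2/n) | y ~ t(n-1)
t = stats.t.ppf(0.975, n - 1)
bayes_pi = (ybar - t * s / np.sqrt(n), ybar + t * s / np.sqrt(n))

# Frequentist t interval -- numerically identical
freq_ci = stats.t.interval(0.95, n - 1, loc=ybar, scale=s / np.sqrt(n))
print(bayes_pi, freq_ci)
```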

Posterior Derivation

  • We now derive the posterior using the SIR prior \(p(\mu, \tau) \propto \frac{1}{\tau}\).
  • Let \(N(·|a,b)\) be a normal density and \(Gam(·|c,d)\) be a gamma density.
  • Then we have likelihood \[ \begin{aligned} L(\mu, \tau | y) & \propto \tau^{n/2} \exp \left[ -\frac{\tau}{2}(n-1)s^2 - \frac{\tau}{2} n (\bar{y} - \mu)^2 \right] \\ & = \tau^{1/2} \exp \left[ -\frac{n\tau}{2} (\bar{y} - \mu)^2 \right] \tau^{(n-1)/2} \exp \left[-\frac{(n-1)s^2}{2} \tau \right]. \end{aligned} \]
  • To get the posterior, multiply by the prior: \[ p(\mu, \tau | y) \propto \tau^{1/2} \exp \left[ -\frac{n\tau}{2} (\bar{y} - \mu)^2 \right] \tau^{(n-1)/2-1} \exp \left[-\frac{(n-1)s^2}{2} \tau \right]. \]

Posterior Derivation

  • To obtain the conditional density of \(\mu\) given \(\tau\) and \(y\), drop the last two multiplicative terms, which do not include \(\mu\): \[ p(\mu | \tau, y) \propto \tau^{1/2} \exp \left[ -\frac{n\tau}{2} (\bar{y} - \mu)^2 \right] \propto N(\mu | \bar{y}, 1/n\tau). \]
  • Note that the constant of proportionality relating this expression to the normal density does not depend on \(\tau\). Keeping track of how \(\tau\) enters the normal density is important in the next step.

Posterior Derivation

  • To get the marginal posterior distribution of \(\tau\), integrate \(\mu\) out of the joint posterior. The normal density integrates to 1, so \[ \begin{aligned} p(\tau|y) & = \int p(\mu, \tau | y)\, d\mu \\ & \propto \tau^{(n-1)/2-1} \exp \left[-\frac{(n-1)s^2}{2} \tau \right] \int N(\mu | \bar{y}, 1/n\tau)\, d\mu \\ & \propto Gam\left( \tau \, \Big| \, \frac{n-1}{2} , \frac{(n-1)s^2}{2}\right). \end{aligned} \]
  • We also have
    \[ (n-1)s^2 \tau \, | \, y \sim \text{Gamma} \left[ (n-1) / 2, 1/2 \right], \] which is the \(\chi^2_{n-1}\) distribution.

Posterior Derivation

  • Finally, to get the marginal posterior of \(\mu\), rewrite \[ \begin{aligned} p(\mu, \tau | y) & \propto \tau^{1/2} \exp \left[ -\frac{n\tau}{2} (\bar{y} - \mu)^2 \right] \tau^{(n-1)/2-1} \exp \left[-\frac{(n-1)s^2}{2} \tau \right] \\ & = \tau^{n/2-1} \exp \left[ \frac{-\tau \left( n (\bar{y} - \mu)^2 + (n-1)s^2 \right)}{2} \right] \end{aligned} \] which, as a function of \(\tau\), is proportional to the density of a \(\text{Gamma}\left(n/2, \left( n (\bar{y} - \mu)^2 + (n-1)s^2 \right)/2\right)\) distribution.
  • Then the marginal for \(\mu|y\) is \[ p(\mu | y) = \int p(\mu, \tau | y) d \tau \propto \int_0^{\infty} \tau^{n/2-1} \exp \left[ \frac{-\tau \left( n (\bar{y} - \mu)^2 + (n-1)s^2 \right)}{2} \right] d \tau, \] where we can evaluate the integral because we know that Gamma densities integrate to 1.

Posterior Derivation

  • Evaluating the integral gives \[ \begin{aligned} p(\mu | y) & \propto \frac{\Gamma(n/2)}{\left( n (\bar{y} - \mu)^2 + (n-1)s^2 \right)^{n/2}} \\ & \propto \left( 1 + (\bar{y} - \mu)^2 / (n-1)(s^2/n) \right)^{-((n-1)+1)/2}. \end{aligned} \]
  • As a function of \(\mu\) this is the kernel of a univariate Student density.
  • Standardizing the Student random variable, that is, subtracting the location and dividing by the dispersion, we obtain a standard \(t(n−1)\) distribution, i.e. \[ \frac{\mu - \bar{y}}{\sqrt{s^2/n}} | y \sim t(n-1). \]
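The marginal \(t(n-1)\) result can be checked by Monte Carlo: draw \(\tau\) from its Gamma posterior, then \(\mu \mid \tau\) from its normal, and compare the standardized draws with a \(t(n-1)\) distribution. The sufficient statistics below are made up.

```python
# Simulation check of the marginal posterior of mu under the SIR prior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, ybar, s2 = 12, 10.0, 4.0            # illustrative sufficient statistics

# tau | y ~ Gamma((n-1)/2, rate (n-1)s^2/2)  (numpy uses scale = 1/rate)
tau = rng.gamma((n - 1) / 2, 2 / ((n - 1) * s2), size=200_000)
mu = rng.normal(ybar, 1 / np.sqrt(n * tau))     # mu | tau, y
z = (mu - ybar) / np.sqrt(s2 / n)               # standardized draws

# Kolmogorov-Smirnov distance to t(n-1): should be tiny.
stat, p = stats.kstest(z, stats.t(df=n - 1).cdf)
print(f"KS statistic {stat:.4f}")
```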


EXERCISE 5.22. Derive the posterior distribution for the conjugate prior by completing the square and adapting the arguments illustrated for the SIR prior.

  1. Derive the conditional density \(p(\mu| \tau, y)\).
  2. Use (1) to obtain the marginal densities \(p(\mu|y)\) and \(p(\tau|y)\).
  3. Derive the predictive density for a future observation based on this model.