Agenda

  • The Monty Hall Problem
  • Science, Priors, and Prediction
  • Statistical Models
  • Binomial Data (again)

Bayes' Theorem

  • For two events \(A\) and \(B\), the conditional probability of \(A\) given \(B\) is defined as \[ P(A|B) = \frac{P(A\cap B)}{P(B)}, \] where \(A\cap B\) denotes the intersection of \(A\) and \(B\). Let \(A^c\) denote the complement of \(A\).

  • Then Bayes' Theorem allows us to compute \(P(A|B)\) from \(P(B|A)\), \(P(B|A^c)\), and \(P(A)\) via \[ P(A|B) = \frac{P(B|A) P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)}. \]

  • Direct result of the definition of conditional probability (numerator) and the Law of Total Probability (denominator).
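  • As a quick numerical illustration (with hypothetical probabilities, not taken from the text), a short R sketch applying the formula:

    # Bayes' Theorem with illustrative (made-up) probabilities:
    # P(A) = 0.01, P(B|A) = 0.95, P(B|A^c) = 0.10.
    p_A    <- 0.01
    p_B_A  <- 0.95
    p_B_Ac <- 0.10
    (p_B_A * p_A) / (p_B_A * p_A + p_B_Ac * (1 - p_A))   # P(A|B), about 0.088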

The Monty Hall Problem

On the television show Let's Make a Deal, hosted by Monty Hall, the grand prize was awarded in the following manner. The prize was placed behind one of three doors. The contestant selected a door. Monty then showed the contestant what was behind one of the other two doors, but it was never the grand prize. Finally, the contestant was allowed either to keep their initial choice or switch to the remaining unopened door.

Some people's intuition is that there is a 50/50 chance that the prize is behind either of the two remaining unopened doors, so it would not matter if you switch. In fact, the probability is 2/3 that the prize is behind the other door that Monty did not open. One intuitive way to arrive at this conclusion argues that you already know the prize is not behind one of the two doors you did not select, and the fact that Monty showed you it was not behind one of them gives you no additional information. However, by switching from your initial choice, essentially you are being allowed to get both of the other two doors, and thus have a 2/3 chance of getting the prize. This argument is rather inexact, so we now give a careful argument using Bayes' Theorem.

The Monty Hall Problem

  • There are three variables involved: let \(P\) denote the door that contains the prize, \(C\) the door that you initially chose, and \(S\) the door that Monty shows you. We assume that the prize is randomly placed, so \[ P(P = p) = 1/3 \text{ for } p = 1,2,3. \]
  • The prize is placed prior to your choice of door, so it is independent of \(C\) and \[ P(P = p) = P(P=p | C=c) = 1/3 \text{ for all } p \text{ and } c. \]
  • What Monty shows you depends on where the prize is and what door you have chosen, so the door he shows you is selected according to a conditional probability \(P(S = s|P = p,C = c)\). Assume that you initially chose door number 1, so \(C = 1\).

The Monty Hall Problem

  • Monty never shows you the prize. If the prize is behind door number 1, Monty randomly picks either door 2 or door 3 and shows it to you. If the prize is behind door number 2, Monty shows you door 3. If the prize is behind door number 3, Monty shows you door 2.
  • Write \(f(s|p) = P(S=s | P=p, C=1)\), then
Summary of probabilities f(s|p):

               s = 1   s = 2   s = 3
    f(s|1)       0      0.5     0.5
    f(s|2)       0      0.0     1.0
    f(s|3)       0      1.0     0.0

The Monty Hall Problem

  • Using Bayes' Theorem \[ \begin{aligned} P(P=1|S=2) & = \frac{f(2|1) P(P=1)}{f(2|1) P(P=1)+f(2|2) P(P=2)+f(2|3) P(P=3)} \\ & = \frac{0.5 (1/3)}{0.5 (1/3) + 0 (1/3) + 1 (1/3)} \\ & = 1/3 \end{aligned} \] and \[ \begin{aligned} P(P=3|S=2) & = \frac{f(2|3) P(P=3)}{f(2|1) P(P=1)+f(2|2) P(P=2)+f(2|3) P(P=3)} \\ & = \frac{1 (1/3)}{0.5 (1/3) + 0 (1/3) + 1 (1/3)} \\ & = 2/3 . \end{aligned} \]
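  • The 1/3 versus 2/3 split is also easy to check by simulation. A minimal R sketch (not from the text) that plays the game repeatedly with the initial choice fixed at door 1:

    # Simulate the Monty Hall game with the contestant always choosing door 1.
    set.seed(1)
    n_games <- 10000
    prize <- sample(1:3, n_games, replace = TRUE)   # door hiding the prize
    shown <- sapply(prize, function(p) {
      doors <- setdiff(2:3, p)                      # Monty opens neither door 1 nor the prize door
      if (length(doors) == 1) doors else sample(doors, 1)
    })
    switch_to <- 6 - 1 - shown                      # the remaining unopened door (doors sum to 6)
    mean(prize == 1)           # staying with door 1 wins about 1/3 of the time
    mean(prize == switch_to)   # switching wins about 2/3 of the time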

The Monty Hall Problem

  • Probability of getting the prize is 2/3 if we switch doors.
  • Similarly, if Monty shows us door number 3 the probability of getting the prize is 2/3 if we switch doors.
  • Similar results hold regardless of the initial choice C.
  • The problem appears in the movie 21 and the TV show Numb3rs.
  • There is also an \(n\)-stage Monty Hall problem.

Science, Priors, and Prediction

  • Bayesian statistics starts by using (prior) probabilities to describe your current state of knowledge. It then incorporates information through the collection of data. This results in new (posterior) probabilities to describe your state of knowledge after combining the prior probabilities with the data.
  • All uncertainty and information are incorporated through probability distributions, and all conclusions obey the laws of probability theory.
  • Some form of prior 'information' is always available, but may not translate to a probability distribution. The 'information' also may not directly relate to the parameters of statistical models.
  • Typically, we obtain 'characteristics' of the population under study from an expert, then identify a mathematically convenient prior that agrees with them.
  • After statisticians develop such a prior distribution, they should always return to the expert to validate that the prior is a reasonable approximation to the expert's actual information.

Science, Priors, and Prediction

  • Although parameters are often mere conveniences, frequently the parameter \(\theta\) has some basis in physical reality.
  • Rather than describing where \(\theta\) really is, the prior describes beliefs about where \(\theta\) is.
  • Say \(\theta\) is the mean global temperature change in the last 50 years.
  • Different climate scientists will have different knowledge bases and therefore different probabilities for \(\theta\).
  • If they analyze the same data, they will continue to have different opinions about \(\theta\) until sufficient data are collected so that their beliefs converge and a consensus is reached.
  • This should occur unless one or more of them is unrealistically dogmatic.

Statistical Models

  • Statistical models are useful tools for scientific prediction.
  • Parameters \(\theta\) are often selected for convenience in building models that predict well.
  • Use of parameters is not a fundamental aspect of Bayesian analysis.
  • Relationship between parameters and observations may not be obvious.
  • Instead we can focus on observables (prediction) and parameters that are closely related to observables.
  • Before discussing prediction, we discuss posterior distributions for the parameters of a statistical model.

Statistical Models

  • Statistical models typically involve multiple observations (random variables), say, \(y_1, \dots, y_n\).
  • Observations are collected independently given the parameters of the model, which we denote \(\theta = (\theta_1, \dots, \theta_r)\).
  • Bayesian statistics begins with prior information about the state of nature \(\theta\) embodied in the prior density \(p(\theta)\).
  • Use Bayes' Theorem and the random data \(y\), with sampling density \(f(y|\theta)\), to update this information into a posterior density \(p(\theta|y)\) that incorporates both the prior information and the data.
  • Bayes' Theorem tells us that \[ p(\theta|y) = \frac{f(y|\theta)p(\theta)}{\int f(y|\theta)p(\theta) d \theta}. \]
  • The text illustrates these ideas with Binomial, Bernoulli, and normal data. In these cases the mathematics is relatively simple but by no means trivial.

Binomial Data

  • Suppose we are interested in assessing the proportion of U.S. transportation industry workers who use drugs on the job.
  • Let \(\theta\) denote this proportion and assume that a random sample of \(n\) workers is to be taken.
  • The number of positive tests is denoted \(y\). In particular, we took \(n = 10\) samples and obtained \(y = 2\) positive test results.
  • Data obtained like this follow a Binomial distribution, where we write \[ y | \theta \sim \text{Bin}(n, \theta). \]
  • Recall, the discrete density function for \(y = 0, \dots, n\) is \[ f(y|\theta) = {n \choose y} \theta^y (1 - \theta)^{n-y}. \]
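  • A brief R sketch (illustrative, not from the text) evaluating this density for the observed data at a few candidate values of \(\theta\):

    # Binomial likelihood for n = 10 trials with y = 2 positive tests.
    n <- 10; y <- 2
    theta <- c(0.1, 0.2, 0.3, 0.5)
    dbinom(y, size = n, prob = theta)
    choose(n, y) * theta^y * (1 - theta)^(n - y)   # same values, computed directly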

Beta Prior

  • It is convenient (but not necessary) to use a Beta distribution as a model for the prior.
  • We will later see the Beta distribution is conjugate to the Binomial distribution.
  • We will say \[ \theta \sim \text{Beta}(a,b), \] with density \[ p(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1} I(0 \le \theta \le 1). \]
  • Hyperparameters \(a\) and \(b\) are selected to reflect the researcher's beliefs and uncertainty.
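  • A quick R check (with illustrative values of \(a\) and \(b\), not from the text) that this formula matches R's built-in Beta density:

    # Compare the Beta density formula to dbeta() at one point.
    a <- 2; b <- 5; theta <- 0.3
    gamma(a + b) / (gamma(a) * gamma(b)) * theta^(a - 1) * (1 - theta)^(b - 1)
    dbeta(theta, shape1 = a, shape2 = b)   # same value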

Posterior Distribution

  • The posterior distribution turns out to be another Beta distribution, specifically \[ \theta | y \sim \text{Beta}(y+a,n-y+b), \]
  • For the transportation example, we observed \(n = 10\) and \(y = 2\). Then if \(a = 3.44\) and \(b = 22.99\) (justified later) we have posterior \[ \theta | y \sim \text{Beta}(2+3.44,10-2+22.99) = \text{Beta}(5.44,30.99). \]

Posterior Distribution

(Figure: plot of the posterior density.)
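  • A minimal R sketch (not the text's code) for producing such a plot, with the prior overlaid for comparison:

    # Beta(5.44, 30.99) posterior and Beta(3.44, 22.99) prior for the example.
    theta <- seq(0, 1, length.out = 500)
    plot(theta, dbeta(theta, 5.44, 30.99), type = "l",
         xlab = expression(theta), ylab = "density")
    lines(theta, dbeta(theta, 3.44, 22.99), lty = 2)
    legend("topright", legend = c("posterior", "prior"), lty = c(1, 2))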

Posterior Distribution

  • To obtain the posterior, apply Bayes' Theorem to the densities \[ \begin{aligned} p(\theta|y) & = \frac{f(y|\theta)p(\theta)}{\int f(y|\theta)p(\theta) d \theta} \\ & = \frac{{n \choose y} \theta^y (1 - \theta)^{n-y} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1} I(0 \le \theta \le 1)}{\int_0^1 {n \choose y} \theta^y (1 - \theta)^{n-y} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1} d \theta} \\ & = \frac{ \theta^y (1 - \theta)^{n-y} \theta^{a-1} (1 - \theta)^{b-1} I(0 \le \theta \le 1)}{\int_0^1 \theta^y (1 - \theta)^{n-y} \theta^{a-1} (1 - \theta)^{b-1} d \theta} \\ & = \frac{ \theta^{y+a-1} (1 - \theta)^{n-y+b-1} I(0 \le \theta \le 1)}{\int_0^1 \theta^{y+a-1} (1 - \theta)^{n-y+b-1} d \theta} \\ & = \frac{\Gamma(n+a+b)}{\Gamma(y+a)\Gamma(n-y+b)} \theta^{y+a-1} (1 - \theta)^{n-y+b-1} I(0 \le \theta \le 1) . \end{aligned} \]
  • The last equality is the trickiest and we discuss it next; the result is the density of a \(\text{Beta}(y+a,n-y+b)\) distribution.

Posterior Distribution

  • To obtain the last equality of the posterior density derivation, recall that any density must integrate to 1, so for a \(\text{Beta}(a,b)\) \[ 1 = \int_0^1 \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1} d \theta. \]
  • It follows immediately that \[ \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} = \int_0^1 \theta^{a-1} (1 - \theta)^{b-1} d \theta, \] and correspondingly \[ \frac{\Gamma(y + a)\Gamma(n - y + b)}{\Gamma(n + a+b)} = \int_0^1 \theta^{y+a-1} (1 - \theta)^{n - y + b-1} d \theta, \] from which the last equality of the posterior density derivation follows.
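  • This identity is easy to check numerically; a small R sketch using the posterior's values \(y+a = 5.44\) and \(n-y+b = 30.99\):

    # Check Gamma(a)Gamma(b)/Gamma(a+b) against the integral it equals.
    a <- 5.44; b <- 30.99
    gamma(a) * gamma(b) / gamma(a + b)                              # closed form (this is beta(a, b))
    integrate(function(t) t^(a - 1) * (1 - t)^(b - 1), 0, 1)$value  # numerical integral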

Data Augmentation Priors

  • In the posterior \(\theta|y \sim \text{Beta}(y+a,n-y+b)\), the number of “successes” \(y\) and the hyperparameter from the prior \(a\) play similar roles. Also, the number of “failures” \(n−y\) and \(b\) play similar roles.
  • We can think of the prior as augmenting the data with \(a\) successes and \(b\) failures out of \(a+b\) trials.
  • Priors that allow such an interpretation are called data augmentation priors (DAPs).
  • In our transportation example with \(n = 10\) we have used a rather potent prior with \(a+b = 3.44+22.99 = 26.43\) prior observations.
  • In DAPs, the prior density \(p(\theta)\) has the same functional form as the sampling density \(f(y|\theta)\) when viewed as a function of \(\theta\).
  • That is, \(f(y|\theta) p(\theta) \propto f(y_*|\theta)\) for some new vector \(y_*\) and every vector \(\theta\). However, \(y_*\) may not be subject to the same restrictions as \(y\), e.g., for binomial data \(y\) and \(n\) are integers but \(a\) and \(b\) need not be.

Posterior Distribution

  • To a Bayesian, the best information one can ever have about \(\theta\) is to know the posterior density \(p(\theta|y)\).
  • In our transportation example, we know the posterior is exactly a \(\text{Beta}(5.44,30.99)\). This is rarely the case in more interesting problems.
  • Nonetheless, it is often convenient to summarize the posterior information.
  • Most summaries involve integration, which we typically perform by a computer simulation.

Summarizing Posterior Information

  • For the Beta distribution, we can calculate the mean, variance, and mode analytically (the mode \((a-1)/(a+b-2)\) requires \(a > 1\) and \(b > 1\)).
  • Many other items of interest cannot be computed analytically, so we discuss computer simulations of them (a sketch appears at the end of this slide). These include the median and the probability that \(\theta\) falls into any particular set, e.g., \(P(\theta > 0.5)\).

  • The posterior is a Beta(5.44,30.99) distribution, so for our data \[ E(\theta|y) = \frac{y+a}{n+a+b} = \frac{5.44}{36.43} \approx 0.15. \]
  • With the DAP prior, the posterior mean can be written as a weighted average of the data proportion of successes and the prior proportion of successes with weights being the relative sizes of the actual and prior data sets, that is, \[ E(\theta|y) = \left( \frac{n}{n+a+b} \right) \frac{y}{n} + \left( \frac{a+b}{n+a+b} \right) \frac{a}{a+b}. \]
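  • A minimal R sketch (not from the text) of such simulation-based summaries for the \(\text{Beta}(5.44,30.99)\) posterior:

    # Monte Carlo summaries of the Beta(5.44, 30.99) posterior.
    set.seed(1)
    draws <- rbeta(100000, 5.44, 30.99)
    mean(draws)                        # close to the analytic mean 5.44/36.43
    median(draws)                      # posterior median
    mean(draws > 0.5)                  # estimates P(theta > 0.5 | y), essentially 0
    quantile(draws, c(0.025, 0.975))   # a 95% posterior interval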

Predictive Density

  • Now consider prediction. Suppose we are to observe a future binomial value, \(y'\), assumed to be a \(\text{Bin}(m, \theta)\) random variable that is independent of \(y\) given \(\theta\).
  • The predictive density is \[ f(y' | y) = \int_0^1 f(y' | y, \theta) p(\theta | y) d \theta = \int_0^1 f(y' | \theta) p(\theta | y) d \theta \] due to the conditional independence of \(y'\) and \(y\) given \(\theta\).

Predictive Density

  • With \[ f(y' | \theta) = {m \choose y'} \theta^{y'} (1-\theta)^{m-y'}, \] we obtain for \(y' = 0, \dots, m\), \[ \begin{aligned} f(y' | y) & = {m \choose y'} \frac{\Gamma(n+a+b)}{\Gamma(y+a)\Gamma(n-y+b)} \int_0^1 \theta^{a+y+y'-1} (1-\theta)^{b+n-y+m-y'-1} d \theta \\ & = {m \choose y'} \frac{\Gamma(n+a+b)}{\Gamma(y+a)\Gamma(n-y+b)} \frac{\Gamma(a+y+y') \Gamma(b+n-y+m-y')}{\Gamma(a+b+n+m)}. \end{aligned} \]
  • This is called a Beta-Binomial distribution.
  • Main point is that there is an analytic expression for the predictive density.
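  • A small R sketch (not from the text) that evaluates this predictive density for the example posterior, taking \(m = 5\) future workers as a hypothetical choice, and checks it by simulation:

    # Beta-Binomial predictive density with posterior parameters
    # A = y + a = 5.44 and B = n - y + b = 30.99, for m = 5 future trials.
    A <- 5.44; B <- 30.99; m <- 5
    yprime <- 0:m
    pred <- exp(lchoose(m, yprime) + lbeta(A + yprime, B + m - yprime) - lbeta(A, B))
    sum(pred)                          # the probabilities sum to 1
    # Simulation check: draw theta from the posterior, then y' given theta.
    set.seed(1)
    theta <- rbeta(100000, A, B)
    table(rbinom(100000, size = m, prob = theta)) / 100000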

Predictive Density

  • There are some other analytic results available but they are limited both in number and utility. For example, \[ E(y'|y) = E_{\theta|y}\left[ E(y'|y,\theta) \right]= E_{\theta|y}\left[ E(y'|\theta) \right]= E_{\theta|y}\left[ m \theta \right] = m E(\theta|y) \]
  • Predictive variance can also be found using the iterated variance formula.
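  • Spelling that out (a standard identity, stated here for completeness): \[ \text{Var}(y'|y) = E_{\theta|y}\left[ \text{Var}(y'|\theta) \right] + \text{Var}_{\theta|y}\left[ E(y'|\theta) \right] = E\left[ m\theta(1-\theta) \,|\, y \right] + m^2 \text{Var}(\theta|y). \]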

Prior Calculation

  • How did we determine our prior?
  • Suppose that (i) a researcher has estimated that 10% of transportation workers use drugs on the job, and (ii) the researcher is 95% sure that the actual proportion was no larger than 25%.
  • Mathematically, our best guess for \(\theta\) is 0.10 and we have \(P(\theta < 0.25) = 0.95\).
  • One way to model the prior assumes that the prior is a member of some parametric family of distributions and uses the two pieces of information to identify an appropriate member of the family.
  • Let's say \[ \theta \sim \text{Beta}(a,b). \] We might identify the estimate of 10% with either the mode \((a−1)/(a+b−2)\) or the mean \(a/(a+b)\) of the Beta distribution.

Prior Calculation

  • We prefer using the mode so we set \[ 0.10 = \frac{a-1}{a+b-2}. \]
  • Then we can find \(a\) in terms of \(b\), namely \[ a(b) = \frac{1+0.1(b-2)}{0.9}. \]
  • We then search through \(b\) values until we find a distribution \(\text{Beta}(a(b),b)\) for which \(P(\theta < 0.25) = 0.95\); a sketch of this search appears at the end of this slide. The \(\text{Beta}(3.44,22.99)\) distribution satisfies the constraints.
  • Since the prior is based on just two characteristics of a parametric family, once this prior is found, one should go back to the researcher and verify that the entire distribution is a reasonable representation of the researcher's beliefs.
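  • A minimal R sketch of this search (not the text's actual code), solving for \(b\) with a root finder:

    # Find b so that the Beta(a(b), b) prior with mode 0.10 has P(theta < 0.25) = 0.95.
    a_of_b <- function(b) (1 + 0.1 * (b - 2)) / 0.9        # from the mode constraint
    g <- function(b) pbeta(0.25, a_of_b(b), b) - 0.95      # zero at the desired b
    b <- uniroot(g, interval = c(2, 100))$root
    c(a = a_of_b(b), b = b)                                # approximately (3.44, 22.99)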

Priors for Proportions

  • Beta distributions are typically used to model uncertainty about proportions.
  • A commonly used Beta prior distribution has \(a = b = 1\), the uniform distribution with density \(p(\theta) = I_{(0,1)}(\theta)\).
  • This is a common reference prior and corresponds to a belief that all possible proportions are equally likely, so the expert would be saying that it was equally plausible for the proportion of drug users to be 0.00001 or 0.99999 or 0.5.
  • Such a prior seems inappropriate for drug use since most of us think that \(\theta\) is much more likely to be less than 0.5 than to be above it.
  • The Beta family is rich, encompassing many shapes.
  • There are no bimodal shapes in the Beta family, but bimodal prior distributions for probabilities are rarely needed.

Discussion

  • Assignment: Read Chapter 1 and "Why isn't everyone a Bayesian?" (available on iLearn).
  • Everyone is expected to contribute to the discussion; don't be shy.
  • Which example was most compelling? What do you like or dislike about Bayesian statistics? Are you a Bayesian? What are some lurking complexities? What did you find interesting in reading Chapter 1? What was interesting in Efron's paper?

Questions