Agenda

  • Posterior Analysis
  • Activity
  • Integration Versus Simulation

Posterior Analysis

  • Recall the sampling density of the data \(y\) is \(f(y|\theta)\).
  • Bayesians incorporate prior knowledge about \(\theta\) through a density \(p(\theta)\).
  • Then the joint density of \(\theta\) and \(y\) is \[ p(\theta,y) = f(y|\theta)p(\theta) . \]
  • Integrating out \(\theta\), the marginal density of \(y\) is \[ f(y) = \int f(y|\theta)p(\theta) d \theta . \]
  • The marginal distribution of the data is sometimes called the marginal predictive distribution.
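As a concrete illustration (a conjugate pairing that reappears in the two-binomial example later in these slides), suppose \(y|\theta \sim \text{Bin}(n,\theta)\) and \(\theta \sim \text{Beta}(a,b)\). Then the marginal density is available in closed form: \[ f(y) = \int_0^1 \binom{n}{y}\theta^{y}(1-\theta)^{n-y}\,\frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}\, d\theta = \binom{n}{y}\frac{B(a+y,\, b+n-y)}{B(a,b)}, \] a beta-binomial distribution.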

Posterior Analysis

  • By definition, the conditional density of \(\theta\) given \(y\) is \[ p(\theta|y) = \frac{p(\theta,y)}{f(y)}. \]
  • Note \(p(\theta|y)\) is the posterior distribution.
  • We have alternatively stated this via Bayes' Theorem, i.e. \[ p(\theta|y) = \frac{f(y|\theta)p(\theta)}{\int f(y|\theta)p(\theta) d \theta}. \]
  • Notice the integral in the denominator is \(r\)-dimensional, where \(r\) is the dimension of the parameter vector \(\theta\).
  • If \(\theta\) has a discrete distribution, the integral is replaced by a sum.

Posterior Analysis

  • Posterior density \(p(\theta|y)\) is a function of \(\theta\), so the denominator \(f(y) = \int f(y|\theta)p(\theta) d \theta\) is merely a constant.
  • In other words \[ 1 = \int p(\theta|y) d \theta, \] so the denominator is whatever constant is needed to make \(f(y|\theta)p(\theta)\) integrate to 1.
  • We often write \[ p(\theta|y) \propto f(y|\theta)p(\theta), \] since the function on the right determines the posterior.
  • Right-hand side can sometimes be recognized as having the form of a well-known distribution such as a Beta, normal, or gamma density.
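For the binomial-beta illustration (\(y|\theta \sim \text{Bin}(n,\theta)\), \(\theta \sim \text{Beta}(a,b)\)), dropping every factor that does not involve \(\theta\) gives \[ p(\theta|y) \propto \theta^{y}(1-\theta)^{n-y}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{a+y-1}(1-\theta)^{b+n-y-1}, \] which we recognize as the kernel of a \(\text{Beta}(a+y,\, b+n-y)\) density.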

Posterior Analysis

  • To a Bayesian, the best information one can ever have about \(\theta\) is to know the posterior density \(p(\theta|y)\).
  • Nonetheless, it is often convenient to summarize the posterior information.
  • Most summaries involve integration, which we typically perform by a computer simulation.

Posterior Analysis

  • When \(r=1\), can summarize posterior information with the posterior median, say \(M(y)=M\) such that \[ \frac{1}{2} = \int_{-\infty}^M p(\theta|y) d \theta. \]
  • Or the posterior mode, the value of \(\theta\) that attains \(\max_{\theta} \left\{ p(\theta|y) \right\}\).
  • Or posterior mean \[ E(\theta|y) = \int \theta p(\theta|y) d \theta . \]
  • Or posterior variance \[ \text{Var}(\theta|y) = \int \left[ \theta - E(\theta|y) \right]^2 p(\theta|y) d \theta . \]
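These one-dimensional summaries are easy to compute numerically. Below is a minimal sketch in R, assuming purely for illustration that the posterior is a Beta(3, 7) density.

a <- 3; b <- 7                        # illustrative Beta(3, 7) posterior
post <- function(theta) dbeta(theta, a, b)
post.median <- qbeta(0.5, a, b)                                      # posterior median
post.mode <- optimize(post, interval=c(0, 1), maximum=TRUE)$maximum  # posterior mode
post.mean <- integrate(function(t) t * post(t), 0, 1)$value          # posterior mean
post.var <- integrate(function(t) (t - post.mean)^2 * post(t), 0, 1)$value  # posterior variance
round(c(post.median, post.mode, post.mean, post.var), 4)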

Highest Posterior Density (HPD)

  • Another useful summary is, say, a 95% probability interval \([a(y),b(y)]\) where \(a(y)\) and \(b(y)\) satisfy \[ 0.95 = \int_{a(y)}^{b(y)} p(\theta|y) d \theta. \]
  • There are many choices for \(a\) and \(b\), so one might want to find a “best” interval, typically the shortest interval that contains 95% probability. This is called a highest posterior density (HPD) interval.
  • In practice one more often chooses \(a(y)\) and \(b(y)\) so that \(P[\theta < a(y)|y] = 0.025\) and \(P[\theta < b(y)|y] = 0.975\). This equal-tailed interval, usually called the 95% probability interval or 95% credible interval, need not be the shortest interval containing 95% probability.
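Both intervals are straightforward to compute for a one-parameter posterior. Here is a minimal R sketch, again assuming an illustrative Beta(3, 7) posterior; the HPD interval is found by searching for the shortest interval with 95% posterior probability.

a <- 3; b <- 7
eq.tail <- qbeta(c(0.025, 0.975), a, b)                 # equal-tailed 95% interval
width <- function(p) qbeta(p + 0.95, a, b) - qbeta(p, a, b)
p.low <- optimize(width, interval=c(0, 0.05))$minimum   # lower tail prob. of shortest interval
hpd <- qbeta(c(p.low, p.low + 0.95), a, b)              # 95% HPD interval
round(rbind(equal.tail=eq.tail, hpd=hpd), 4)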

Functions of \(\theta\)

  • We may also be interested in some function of \(\theta\), say \(\gamma = g(\theta)\).
  • The ultimate Bayesian inference about \(\gamma\) is through its posterior density, \(p(\gamma|y)\).
  • Here we can easily express the posterior mean \[ E(\gamma|y) = \int g(\theta) p(\theta|y) d \theta , \] the posterior variance \[ \text{Var}(\gamma|y) = \int \left[ g(\theta) - E(\gamma|y) \right]^2 p(\theta|y) d \theta , \] or other posterior quantities.
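As one illustration, take \(\gamma = g(\theta) = \theta/(1-\theta)\), the posterior odds, with a \(\text{Beta}(a,b)\) posterior and \(b > 1\): \[ E(\gamma|y) = \int_0^1 \frac{\theta}{1-\theta}\, \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}\, d\theta = \frac{B(a+1,\, b-1)}{B(a,b)} = \frac{a}{b-1}. \]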

Predictive Density

  • Consider now prediction of future observations \(y' = (y_1', \dots, y_q')\).
  • Assume \(y'\) is conditionally independent of \(y\) given \(\theta\).
  • The same scientific theory that gives us \(f(y|\theta)\) should also provide a density \(f_p(y'|\theta)\) for the new observations.
  • Predictive density of the future observations given the past observations is \[ f_p(y'|y) = \int f_p (y'|\theta) p(\theta|y) d \theta. \]
  • The predictive distribution is sometimes called the posterior predictive distribution.
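For example, if \(\theta|y \sim \text{Beta}(\alpha,\beta)\) and a future count satisfies \(y'|\theta \sim \text{Bin}(m,\theta)\), the integral has a closed form: \[ f_p(y'|y) = \int_0^1 \binom{m}{y'} \theta^{y'}(1-\theta)^{m-y'}\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\, d\theta = \binom{m}{y'}\frac{B(\alpha+y',\, \beta+m-y')}{B(\alpha,\beta)}, \] a beta-binomial distribution.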

Predictive Density

  • To see this, use the fact that by conditional independence \(f_p (y'|y,\theta) = f_p (y'|\theta)\) and note that for any (measurable) set \(A\), by the Law of Total Probability \[ \begin{aligned} P(y' \in A | y) & = \int P(y' \in A|y,\theta) p(\theta|y) d \theta \\ & = \int \left[ \int_A f_p (y'|y,\theta) dy' \right] p(\theta|y) d \theta \\ & = \int \left[ \int_A f_p (y'|\theta) p(\theta|y) dy' \right] d \theta \\ & = \int_A \left[ \int f_p (y'|\theta) p(\theta|y) d \theta \right] dy' \\ & = \int_A f_p (y'|y) dy' . \end{aligned} \]
  • Frequently, \(f_p(\cdot|\theta)\) has a similar functional form to \(f(\cdot|\theta)\) in which case we omit the subscript \(p\).

Useful Predictions

  • Fundamentally, it does not really matter whether a scientific theory is correct; scientific theories almost never are.
  • What matters is whether the scientific theory allows us to make useful predictions about future observations.
  • Better understanding of scientific processes allows us to build better models, which should lead to better predictions.
  • Closest we ever come to a correct scientific model is when we can make sufficiently accurate predictions by using \(f(y|\theta)\) and \(f_p (y'|\theta)\).
  • We will only ever agree on the state of nature when we have sufficient data so that we essentially agree on \(p(\theta|y)\), and more usefully \(f_p (y'|y)\), even though we did not agree on \(p(\theta)\).

Exercise

Derive the posterior densities for a Poisson sample with a Gamma prior, and for an exponential sample with a Gamma prior. Do you recognize these densities?

Integration Versus Simulation

  • Need to integrate complicated functions when performing Bayesian analysis.
  • Usually accomplished via computational tools rather than difficult mathematical integration.
  • Focus will be on Monte Carlo and Markov chain Monte Carlo (MCMC) methods.

Integration Versus Simulation

  • Virtually all of Bayesian inference involves the calculation of various integrals.
  • Historically, Bayesian methods were limited by one's ability to perform the integration.
  • Modern Bayesian statistics relies on computer simulations to approximate the values of integrals.
  • Most difficult integrations occur when the dimension \(r\) of the vector \(\theta\) becomes large.
  • High dimensional integration is almost always difficult, even with simulations.

Monte Carlo Integration

  • Suppose \(\theta \sim N(0,1)\), then \[ P(\theta \ge 1) = \int_{1}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta. \]
  • This integral cannot be evaluated analytically (at least not by us), but we can simulate an answer.
  • Suppose we randomly sample 10,000 observations with \(\theta_k \sim N(0,1)\).
  • Consider \(w_k = I_{[1, \infty)}(\theta_k)\), then the \(w_k\)s are iid Bernoulli random variables with probability of “success,” \(P(\theta \ge 1)\). Formally we have \[ E(w_k) = \int_{-\infty}^{\infty} I_{[1, \infty)}(\theta) \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta = \int_{1}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta = P(\theta \ge 1). \]

Monte Carlo Integration

  • Then the sample mean \(\bar{w} = \sum_{k=1}^{10000} w_k / 10000\) estimates \(P(\theta \ge 1)\).
  • Since the mean is based on 10,000 samples, it is a very good estimate, i.e. one with a standard deviation of \[ \sqrt{P(\theta \ge 1) \left[1 - P(\theta \ge 1) \right] / 10000} < 0.005, \] where the upper bound on the variance is obtained by replacing \(P(\theta \ge 1)\) with 0.5.
  • If the standard deviation is too large, simply sample more observations.
  • Since \(P(\theta \ge 1) \approx 0.16\), the actual standard deviation is \(\sqrt{0.16(0.84) / 10000} \approx 0.0037\).

Monte Carlo Integration

  • Simple to estimate \(P(\theta \ge 1)\) with its standard deviation in R.
set.seed(1) # For reproducibility
y <- rnorm(10000); w <- as.numeric(y>1)
prob.est <- mean(w); sd.est <- sd(w) / sqrt(length(w))
round(c(prob.est, sd.est), 4)
## [1] 0.1636 0.0037

Monte Carlo Integration

  • Approach in the previous example is very general.
  • Suppose we have a posterior density \(p(\theta|y)\) and that we can generate a random sample from the distribution, say \(\theta_1, \dots, \theta_s\).
  • Then we can approximate the posterior mean using \[ \hat{\theta} = \frac{1}{s} \sum_{k=1}^s \theta_k . \]
  • Similarly, the posterior variance can be approximated using the sample variance.
  • Any posterior percentile, including the median, can also be approximated by the corresponding sample percentile.
  • Posterior density can be approximated as a 'smoothed' histogram of the sample via a kernel smoother (requires selecting a bandwidth).
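All of the bullets above amount to a few lines of R once a posterior sample is available. Here is a minimal sketch, using draws from an illustrative Beta(3, 7) posterior as a stand-in for \(\theta_1, \dots, \theta_s\).

s <- 10000
theta <- rbeta(s, 3, 7)                      # stand-in for a sample from p(theta|y)
post.mean <- mean(theta)                     # approximates E(theta|y)
post.var <- var(theta)                       # approximates Var(theta|y)
cred95 <- quantile(theta, c(0.025, 0.975))   # approximate 95% probability interval
round(c(post.mean, post.var, cred95), 4)
plot(density(theta), main="Approximate Posterior Density")  # kernel-smoothed estimate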

Monte Carlo Integration

  • For a function \(\gamma = g(\theta)\), the posterior mean can be approximated by \[ \hat{\gamma} = \frac{1}{s} \sum_{k=1}^s g(\theta_k). \]
  • Similarly, for a set \(A\), \[ P(\gamma \in A | y) \approx \frac{1}{s} \sum_{k=1}^s I_A \left[g(\theta_k) \right]. \]
  • By making \(s\) large, we can make these approximations as good as we want.
  • Samples from a predictive distribution involve a bit more work.
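A common recipe is composition sampling: for each posterior draw \(\theta_k\), generate \(y'_k \sim f_p(\cdot|\theta_k)\), so the \(y'_k\)s form a sample from \(f_p(y'|y)\). A minimal R sketch, assuming purely for illustration a Beta(3, 7) posterior and a binomial model with \(m = 20\) future trials:

s <- 10000
theta <- rbeta(s, 3, 7)                    # posterior draws (illustrative Beta(3, 7))
y.new <- rbinom(s, size=20, prob=theta)    # one future count per posterior draw
mean(y.new)                                # approximate predictive mean
quantile(y.new, c(0.025, 0.975))           # approximate 95% predictive interval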

Example: Two Binomials

  • Assume \[ y_1 | \theta_1 \sim \text{Bin}(n_1, \theta_1) \perp y_2 | \theta_2 \sim \text{Bin}(n_2, \theta_2). \]
  • We want to estimate \(\gamma = \theta_1 - \theta_2\), and test \(H_0 : \theta_1 \ge \theta_2\) versus \(H_A : \theta_1 < \theta_2\), which we can write as \(H_0 : \gamma \ge 0\) versus \(H_A : \gamma < 0\).
  • Need to model the joint prior density \(p(\theta_1, \theta_2)\). Assuming independence, we have \[ p(\theta_1, \theta_2) = p(\theta_1) p(\theta_2) \] with \[ \theta_1 \sim \text{Beta}(a_1, b_1) \perp \theta_2 \sim \text{Beta}(a_2, b_2) \]

Example: Two Binomials

  • The joint posterior density for our independent binomials is \[ \begin{aligned} p & (\theta_1, \theta_2 | y_1, y_2) \\ & \propto f(y_1, y_2 | \theta_1, \theta_2) p(\theta_1, \theta_2) \\ & \propto \theta_1^{y_1}(1 - \theta_1)^{n_1 - y_1} \theta_2^{y_2}(1 - \theta_2)^{n_2 - y_2} \theta_1^{a_1 - 1} (1-\theta_1)^{b_1 - 1} \theta_2^{a_2 - 1} (1-\theta_2)^{b_2 - 1} \\ & = \theta_1^{a_1 + y_1 - 1} (1-\theta_1)^{b_1 + n_1 - y_1 - 1} \theta_2^{a_2 + y_2 - 1} (1-\theta_2)^{b_2 + n_2 - y_2 - 1}. \end{aligned} \]
  • This can be factored as \(p(\theta_1, \theta_2 | y_1, y_2) = g(\theta_1 | y_1)g(\theta_2 | y_2)\).
  • Implies that \(\theta_1\) and \(\theta_2\) are independent given \(y_1, y_2\). That is, \[ \theta_1 | y_1 \sim \text{Beta}(a_1 + y_1, b_1+n_1-y_1) \] independently of \[ \theta_2|y_2 \sim \text{Beta}(a_2+y_2, b_2+n_2 - y_2). \]

Example: Two Binomials

  • We want to find the distribution of \(\gamma = \theta_1 - \theta_2\).
  • Hard way to do this would be analytically by transforming the random vector \((\theta_1, \theta_2)\) into \((\gamma, \lambda)\) where \(\lambda = \theta_2\). (In other words, do a change of variables.)
  • Easier to get this kind of result via simulation.

Example: Two Binomials

  • Take \(\left\{ (\theta_{1,k} , \theta_{2,k}) : k = 1, \dots, s \right\}\) independently from the posterior distribution.
  • Compute \(\gamma_k = \theta_{1,k} - \theta_{2,k}\) and use this to estimate quantities from the posterior distribution of \(\gamma\).
  • For example, use the sample mean to estimate the posterior mean, use the median to estimate the median of the posterior distribution, use the 2.5 and 97.5 percentiles of the sample to estimate a 95% probability interval.
  • More generally, we can think of the distribution that takes the value \(\gamma_k\) with probability \(1/s\) as a discrete approximation to the (continuous) posterior distribution.

Example: Two Binomials

  • We can approach the testing problem by examining whether \(\theta_1 / \theta_2\) is at least 1.
  • Define \[ v_k = \begin{cases} 1 & \theta_{1,k} / \theta_{2,k} \ge 1 , \\ 0 & \text{otherwise} .\end{cases} \]
  • Notice \[ P(\theta_1 \ge \theta_2 | y_1, y_2) = P(\theta_1 / \theta_2 \ge 1 | y_1, y_2), \] which can be estimated with \[ \bar{v} = \frac{1}{s} \sum_{k=1}^s v_k. \]

Example: Two Binomials

  • Suppose \[ y_1 | \theta_1 \sim \text{Bin}(80, \theta_1) \perp y_2 | \theta_2 \sim \text{Bin}(100, \theta_2) \] and \[ \theta_1 \sim \text{Beta}(1, 1) \perp \theta_2 \sim \text{Beta}(2, 1). \]
  • Further suppose \(y_1 = 32\) and \(y_2 = 35\).
  • Recall \(\gamma = \theta_1 - \theta_2\), then estimate \(P(\gamma \ge 0 | y_1, y_2)\) along with the marginal posterior density of \(\gamma\).

Example: Two Binomials

n <- 10000
t1 <- rbeta(n, (1+32), (1+80-32))   # draws from theta1 | y1 ~ Beta(1+32, 1+80-32)
t2 <- rbeta(n, (2+35), (1+100-35))  # draws from theta2 | y2 ~ Beta(2+35, 1+100-35)
gamma <- t1-t2                      # posterior draws of gamma = theta1 - theta2
w <- as.numeric(gamma>0)
prob.est <- mean(w)                 # estimates P(gamma >= 0 | y1, y2)
sd.est <- sd(w) / sqrt(length(w))   # Monte Carlo standard error
round(c(prob.est, sd.est), 4)
## [1] 0.7185 0.0045

Example: Two Binomials

gamma.density <- density(gamma)
plot(gamma.density, main="Marginal Posterior Distribution")