- Posterior Analysis
- Activity
- Integration Versus Simulation

- Recall the sampling density of the data \(y\) is \(f(y|\theta)\).
- Bayesians incorporate prior knowledge about \(\theta\) through a density \(p(\theta)\).
- Then the joint density of \(\theta\) and \(y\) is \[ p(\theta,y) = f(y|\theta)p(\theta) . \]
- Integrating out \(\theta\), the marginal density of \(y\) is \[ f(y) = \int f(y|\theta)p(\theta) d \theta . \]
- The marginal distribution of the data is sometimes called the **marginal predictive distribution**.

- By definition, the conditional density of \(\theta\) given \(y\) is \[ p(\theta|y) = \frac{p(\theta,y)}{f(y)}. \]
- Note \(p(\theta|y)\) is the **posterior distribution**.
- We have alternatively stated this via Bayes' Theorem, i.e. \[ p(\theta|y) = \frac{f(y|\theta)p(\theta)}{\int f(y|\theta)p(\theta) d \theta}. \]
- Notice the integral in the denominator is \(r\)-dimensional, where \(r\) is the dimension of the parameter vector \(\theta\).
- If \(\theta\) has a discrete distribution, the integral is replaced by a sum.

- The posterior density \(p(\theta|y)\) is a function of \(\theta\), so the denominator \(f(y) = \int f(y|\theta)p(\theta) d \theta\) is merely a constant.
- In other words \[ 1 = \int p(\theta|y) d \theta, \] so the denominator is whatever constant is needed to make \(f(y|\theta)p(\theta)\) integrate to 1.
- We often write \[ p(\theta|y) \propto f(y|\theta)p(\theta), \] since the function on the right determines the posterior.
- Right-hand side can sometimes be recognized as having the form of a well-known distribution such as a Beta, normal, or gamma density.
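The normalizing-constant idea can be checked numerically. The sketch below (in Python, with made-up numbers: \(y=7\) successes in \(n=10\) trials and a flat Beta(1,1) prior) normalizes \(f(y|\theta)p(\theta)\) on a grid and compares it to the Beta density the right-hand side has the form of:

```python
import math

# Assumed illustration: y = 7 successes in n = 10 trials, flat Beta(1, 1)
# prior, so f(y|theta)p(theta) is proportional to theta^y (1-theta)^(n-y).
n, y = 10, 7

def unnorm(theta):
    # Likelihood times prior, keeping only factors involving theta
    return theta**y * (1 - theta)**(n - y)

# The normalizing constant is just the integral of unnorm over (0, 1),
# approximated here by a Riemann sum on a fine grid.
h = 1 / 10000
grid = [i * h for i in range(1, 10000)]
const = sum(unnorm(t) for t in grid) * h

def beta_pdf(theta, a, b):
    # Density of a Beta(a, b) distribution
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta**(a - 1) * (1 - theta)**(b - 1) / B

# The normalized posterior should match the Beta(y + 1, n - y + 1) density.
theta0 = 0.6
print(unnorm(theta0) / const)
print(beta_pdf(theta0, y + 1, n - y + 1))
```

The two printed values agree, confirming that the constant needed to make \(f(y|\theta)p(\theta)\) integrate to 1 is exactly the Beta normalizing constant.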

- **To a Bayesian, the best information one can ever have about \(\theta\) is to know the posterior density \(p(\theta|y)\).**
- Nonetheless, it is often convenient to summarize the posterior information.
- Most summaries involve integration, which we typically perform by a computer simulation.

- When \(r=1\), can summarize posterior information with the posterior median, say \(M(y)=M\) such that \[ \frac{1}{2} = \int_{-\infty}^M p(\theta|y) d \theta. \]
- Or the posterior mode, the value of \(\theta\) that achieves \(\max_{\theta} \left\{ p(\theta|y) \right\}\).
- Or posterior mean \[ E(\theta|y) = \int \theta p(\theta|y) d \theta . \]
- Or posterior variance \[ \text{Var}(\theta|y) = \int \left[ \theta - E(\theta|y) \right]^2 p(\theta|y) d \theta . \]

- Another useful summary is, say, a 95% probability interval \([a(y),b(y)]\) where \(a(y)\) and \(b(y)\) satisfy \[ 0.95 = \int_{a(y)}^{b(y)} p(\theta|y) d \theta. \]
- There are many choices for \(a\) and \(b\), so one might want to find a “best” interval, typically the shortest interval that contains 95% probability. This is called a highest posterior density (HPD) interval.
- Typically choose \(a(y)\) and \(b(y)\) so that \(P[\theta < a(y)|y] = 0.025\) and \(P[\theta < b(y)|y] = 0.975\), usually called the 95% probability interval or 95% credible interval.
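These summaries are easy to read off a posterior sample. A sketch in Python with an assumed posterior \(\theta|y \sim \text{Beta}(33, 49)\) (not from the notes; `betavariate` samples it directly):

```python
import random

random.seed(1)

# Assumed posterior for illustration: theta | y ~ Beta(33, 49).
draws = sorted(random.betavariate(33, 49) for _ in range(10000))

# Posterior mean and median from the sample
post_mean = sum(draws) / len(draws)
post_median = draws[len(draws) // 2]

# Equal-tailed 95% credible interval: a(y) and b(y) are the 2.5 and
# 97.5 sample percentiles.
a = draws[int(0.025 * len(draws))]
b = draws[int(0.975 * len(draws))]
print(round(post_mean, 3), round(post_median, 3), round(a, 3), round(b, 3))
```

With a right-skewed posterior the HPD interval would be shifted left of this equal-tailed interval; for a symmetric posterior the two essentially coincide.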

- We may also be interested in some function of \(\theta\), say \(\gamma = g(\theta)\).
- The ultimate Bayesian inference about \(\gamma\) is through its posterior density, \(p(\gamma|y)\).
- Here we can easily express the posterior mean \[ E(\gamma|y) = \int g(\theta) p(\theta|y) d \theta , \] the posterior variance \[ \text{Var}(\gamma|y) = \int \left[ g(\theta) - E(\gamma|y) \right]^2 p(\theta|y) d \theta , \] or other posterior quantities.

- Consider now prediction of future observations \(y' = (y_1', \dots, y_q')\).
- Assume \(y'\) is conditionally independent of \(y\) given \(\theta\).
- The same scientific theory that gives us \(f(y|\theta)\) should also provide a density \(f_p(y'|\theta)\) for the new observations.
- The **predictive density** of the future observations given the past observations is \[ f_p(y'|y) = \int f_p (y'|\theta) p(\theta|y) d \theta. \]
- The **predictive distribution** is sometimes called the **posterior predictive distribution**.

- To see this, use the fact that by conditional independence \(f_p (y'|y,\theta) = f_p (y'|\theta)\) and note that for any (measurable) set A, by the Law of Total Probability \[ \begin{aligned} P(y' \in A | y) & = \int P(y' \in A|y,\theta) p(\theta|y) d \theta \\ & = \int \left[ \int_A f_p (y'|y,\theta) dy' \right] p(\theta|y) d \theta \\ & = \int \left[ \int_A f_p (y'|y,\theta) p(\theta|y) dy' \right] d \theta \\ & = \int_A \left[ \int f_p (y'|\theta) p(\theta|y) d \theta \right] dy' \\ & = \int_A f_p (y'|y) dy' . \end{aligned} \]
- Frequently, \(f_p(\cdot|\theta)\) has a similar functional form to \(f(\cdot|\theta)\) in which case we omit the subscript \(p\).

- Fundamentally, it does not really matter whether a scientific theory is correct. They almost never are.
- What matters is whether the scientific theory allows us to make useful predictions about future observations.
- Better understanding of scientific processes allows us to build better models, which should lead to better predictions.
- Closest we ever come to a correct scientific model is when we can make sufficiently accurate predictions by using \(f(y|\theta)\) and \(f_p (y'|\theta)\).
- We will only ever agree on the state of nature when we have sufficient data so that we essentially agree on \(p(\theta|y)\), and more usefully \(f_p (y'|y)\), even though we did not agree on \(p(\theta)\).

Derive the posterior densities for a Poisson sample with a Gamma prior, and for an exponential sample with a Gamma prior. Do you recognize these densities?
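One way to check a derivation like this numerically is to normalize likelihood times prior on a grid and compare it to the density you derived. A Python sketch for the Poisson case, with made-up data and prior values, and assuming the rate parameterization of the Gamma:

```python
import math

# Assumed illustration: y_i ~ iid Poisson(lambda), lambda ~ Gamma(a, b)
# with shape a and RATE b (adjust if your derivation uses scale).
data = [3, 1, 4, 2, 5]
a, b = 2.0, 1.0
n, s = len(data), sum(data)

def unnorm(lam):
    # Likelihood times prior, keeping only factors involving lambda
    return lam**s * math.exp(-n * lam) * lam**(a - 1) * math.exp(-b * lam)

# Normalize on a grid over (0, 20]; the density is negligible beyond 20 here.
h = 20 / 100000
grid = [i * h for i in range(1, 100000)]
const = sum(unnorm(l) for l in grid) * h

def gamma_pdf(lam, shape, rate):
    # Density of a Gamma(shape, rate) distribution
    return rate**shape * lam**(shape - 1) * math.exp(-rate * lam) / math.gamma(shape)

# Compare the normalized posterior to a candidate Gamma(a + s, b + n).
lam0 = 3.0
print(unnorm(lam0) / const)
print(gamma_pdf(lam0, a + s, b + n))
```

If your derived posterior is correct, the two printed values match; the same grid check works for the exponential sample by swapping in its likelihood.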

- Need to integrate complicated functions when performing Bayesian analysis.
- Usually accomplished via computational tools rather than difficult mathematical integration.
- Focus will be on Monte Carlo and Markov chain Monte Carlo (MCMC) methods.

- Virtually all of Bayesian inference involves the calculation of various integrals.
- Historically, Bayesian methods were limited by one's ability to perform the integration.
- Modern Bayesian statistics relies on computer simulations to approximate the values of integrals.
- Most difficult integrations occur when the dimension \(r\) of the vector \(\theta\) becomes large.
- High dimensional integration is almost always difficult, even with simulations.

- Suppose \(\theta \sim N(0,1)\), then \[ P(\theta \ge 1) = \int_{1}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta. \]
- This integral cannot be evaluated analytically (at least not by us), but we can simulate an answer.
- Suppose we randomly sample 10,000 observations with \(\theta_k \sim N(0,1)\).
- Consider \(w_k = I_{[1, \infty)}(\theta_k)\), then the \(w_k\)s are iid Bernoulli random variables with probability of “success,” \(P(\theta \ge 1)\). Formally we have \[ E(w_k) = \int_{-\infty}^{\infty} I_{[1, \infty)}(\theta) \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta = \int_{1}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\theta^2/2} d \theta = P(\theta \ge 1). \]

- Then the sample mean \(\bar{w} = \sum_{k=1}^{10000} w_k / 10000\) estimates \(P(\theta \ge 1)\).
- Since the mean is based on 10,000 samples, it is a very good estimate, i.e. one with a standard deviation of \[ \sqrt{P(\theta \ge 1) \left[1 - P(\theta \ge 1) \right] / 10000} < 0.005, \] where the upper bound on the variance is obtained by replacing \(P(\theta \ge 1)\) with 0.5.
- If the standard deviation is too large, simply sample more observations.
- Since \(P(\theta \ge 1) \approx 0.16\), the actual standard deviation is \(\sqrt{0.16(0.84) / 10000} = 0.0037\).

- Simple to estimate \(P(\theta \ge 1)\) with its standard deviation in `R`.

``` r
set.seed(1) # For reproducibility
y <- rnorm(10000)
w <- as.numeric(y > 1)
prob.est <- mean(w)
sd.est <- sd(w) / sqrt(length(w))
round(c(prob.est, sd.est), 4)
```

## [1] 0.1636 0.0037

- Approach in the previous example is very general.
- Suppose we have a posterior density \(p(\theta|y)\) and that we can generate a random sample from the distribution, say \(\theta_1, \dots, \theta_s\).
- Then we can approximate the posterior mean using \[ \hat{\theta} = \frac{1}{s} \sum_{k=1}^s \theta_k . \]
- Similarly, the posterior variance can be approximated using the sample variance.
- Any posterior percentile, including the median, can also be approximated by the corresponding sample percentile.
- Posterior density can be approximated as a 'smoothed' histogram of the sample via a kernel smoother (requires selecting a bandwidth).

- For a function \(\gamma = g(\theta)\), the posterior mean can be approximated by \[ \hat{\gamma} = \frac{1}{s} \sum_{k=1}^s g(\theta_k). \]
- Similarly, for a set \(A\) \[ P(\gamma \in A | y) \approx \frac{1}{s} \sum_{k=1}^s I_A \left[g(\theta_k) \right]. \]
- By making \(s\) large, we can make these approximations as good as we want.
- Samples from a predictive distribution involve a bit more work.
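The extra work is a two-stage draw: sample \(\theta_k\) from the posterior, then sample \(y_k'\) from \(f_p(\cdot|\theta_k)\). A Python sketch with assumed numbers (\(y|\theta \sim \text{Bin}(80,\theta)\), flat Beta(1,1) prior, observed \(y = 32\), predicting a future \(y' \sim \text{Bin}(20,\theta)\)):

```python
import random

random.seed(2)

# Assumed setup: theta | y ~ Beta(33, 49) from a Beta(1, 1) prior with
# 32 successes in 80 trials.  For each k, draw theta_k from the
# posterior FIRST, then y'_k given theta_k -- not y' at a single
# plugged-in point estimate of theta.
s, n_new = 10000, 20

def rbinom(n, p):
    # Bernoulli-sum binomial sampler (the Python stdlib lacks one)
    return sum(random.random() < p for _ in range(n))

y_new = [rbinom(n_new, random.betavariate(33, 49)) for _ in range(s)]

# The y'_k are draws from f_p(y'|y); summarize them like any sample.
pred_mean = sum(y_new) / s
print(round(pred_mean, 2))
```

Plugging in a single estimate of \(\theta\) would understate the spread of \(y'\); the two-stage draw correctly propagates posterior uncertainty about \(\theta\) into the prediction.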

- Assume \[ y_1 | \theta_1 \sim \text{Bin}(n_1, \theta_1) \perp y_2 | \theta_2 \sim \text{Bin}(n_2, \theta_2). \]
- We want to estimate \(\gamma = \theta_1 - \theta_2\), and test \(H_0 : \theta_1 \ge \theta_2\) versus \(H_A : \theta_1 < \theta_2\), which we can write as \(H_0 : \gamma \ge 0\) versus \(H_A : \gamma < 0\).
- Need to model the joint prior density \(p(\theta_1, \theta_2)\). Assuming independence, we have \[ p(\theta_1, \theta_2) = p(\theta_1) p(\theta_2) \] with \[ \theta_1 \sim \text{Beta}(a_1, b_1) \perp \theta_2 \sim \text{Beta}(a_2, b_2) \]

- The joint posterior density for our independent binomials is \[ \begin{aligned} p & (\theta_1, \theta_2 | y_1, y_2) \\ & \propto f(y_1, y_2 | \theta_1, \theta_2) p(\theta_1, \theta_2) \\ & \propto \theta_1^{y_1}(1 - \theta_1)^{n_1 - y_1} \theta_2^{y_2}(1 - \theta_2)^{n_2 - y_2} \theta_1^{a_1 - 1} (1-\theta_1)^{b_1 - 1} \theta_2^{a_2 - 1} (1-\theta_2)^{b_2 - 1} \\ & = \theta_1^{a_1 + y_1 - 1} (1-\theta_1)^{b_1 + n_1 - y_1 - 1} \theta_2^{a_2 + y_2 - 1} (1-\theta_2)^{b_2 + n_2 - y_2 - 1}. \end{aligned} \]
- This can be factored as \(p(\theta_1, \theta_2 | y_1, y_2) = g(\theta_1 | y_1)g(\theta_2 | y_2)\).
- Implies that \(\theta_1\) and \(\theta_2\) are independent given \(y_1, y_2\). That is, \[ \theta_1 | y_1 \sim \text{Beta}(a_1 + y_1, b_1+n_1-y_1) \] independently of \[ \theta_2|y_2 \sim \text{Beta}(a_2+y_2, b_2+n_2 - y_2). \]

- We want to find the distribution of \(\gamma = \theta_1 - \theta_2\).
- Hard way to do this would be analytically by transforming the random vector \((\theta_1, \theta_2)\) into \((\gamma, \lambda)\) where \(\lambda = \theta_2\). (In other words, do a change of variables.)
- Easier to get this kind of result via simulation.

- Take \(\left\{ (\theta_{1,k} , \theta_{2,k}) : k = 1, \dots, s \right\}\) independently from the posterior distribution.
- Compute \(\gamma_k = \theta_{1,k} - \theta_{2,k}\) and use this to estimate quantities from the posterior distribution of \(\gamma\).
- For example, use the sample mean to estimate the posterior mean, use the median to estimate the median of the posterior distribution, use the 2.5 and 97.5 percentiles of the sample to estimate a 95% probability interval.
- More generally, we can think of the distribution that takes the value \(\gamma_k\) with probability \(1/s\) as a discrete approximation to the (continuous) posterior distribution.

- We can approach the testing problem by examining whether \(\theta_1 / \theta_2\) is at least 1.
- Define \[ v_k = \begin{cases} 1 & \theta_{1,k} / \theta_{2,k} \ge 1 , \\ 0 & \text{otherwise} .\end{cases} \]
- Notice \[ P(\theta_1 \ge \theta_2 | y_1, y_2) = P(\theta_1 / \theta_2 \ge 1 | y_1, y_2), \] which can be estimated with \[ \bar{v} = \frac{1}{s} \sum_{k=1}^s v_k. \]

- Suppose \[ y_1 | \theta_1 \sim \text{Bin}(80, \theta_1) \perp y_2 | \theta_2 \sim \text{Bin}(100, \theta_2) \] and \[ \theta_1 \sim \text{Beta}(1, 1) \perp \theta_2 \sim \text{Beta}(2, 1). \]
- Further suppose \(y_1 = 32\) and \(y_2 = 35\).
- Recall \(\gamma = \theta_1 - \theta_2\), then estimate \(P(\gamma \ge 0 | y_1, y_2)\) along with the marginal posterior density of \(\gamma\).

``` r
n <- 10000
t1 <- rbeta(n, (1+32), (1+80-32))
t2 <- rbeta(n, (2+35), (1+100-35))
gamma <- t1 - t2
w <- as.numeric(gamma > 0)
prob.est <- mean(w)
sd.est <- sd(w) / sqrt(length(w))
round(c(prob.est, sd.est), 4)
```

## [1] 0.7185 0.0045

``` r
gamma.density <- density(gamma)
plot(gamma.density, main = "Marginal Posterior Distribution")
```