## Agenda

• Method of Composition
• Monte Carlo Justification
• Conjugate, Improper, and Flat Priors

## Method of Composition

• Simulating from joint distributions specified conditionally.
• Particular useful in sampling predictive distributions.
• Method of composition, involves simulating from a series of conditional distributions.
• Illustrate the method using a combination of normal and gamma distributions.
• Consider a joint distribution defined by an observation $y | \mu, \tau \sim N(\mu, 1/\tau)$ and a conditionally specified prior $\mu | \tau \sim N(\mu_0, 1/\tau) \quad \tau \sim \text{Gamma}(a,b).$

## Method of Composition

• The joint density is $$f(y|\mu, \tau) p(\mu|\tau) p(\tau)$$.
• To sample from the joint distribution, take a random sample $\tau_k \sim \text{Gamma}(a,b), k = 1, \dots, s$ then, conditional on the $$\tau_k$$s, randomly sample $\mu_k \sim N(\mu_0, 1/\tau_k),$ and finally sample $y_k \sim N(\mu_k, 1/\tau_k).$
• For $$k = 1, \dots, s$$, the $$(\tau_k, \mu_k, y_k)$$s form a random sample from the joint distribution.
• Then one could smooth a histogram of the $$y_k$$s to get a picture of the marginal density of $$y$$ or use the Monte Carlo sample to approximate means or variances.

## Predictive Distributions

• Predictive distributions are typically specified via $f_p(y'|y) = \int f_p (y'|\theta) p(\theta|y) d \theta.$
• To sample from the predictive distribution, sample $$\theta_1, \dots, \theta_s$$ from the posterior $$p(\theta|y)$$.
• Then for each $$\theta_k$$ sample a $$y'_k$$ from $$f_p (y'|\theta_k)$$.
• Marginally the $$y'_k$$s are sampled from the predictive distribution.
• Together the $$(\theta_k,y'_k)$$s are sampled from the joint posterior and predictive distribution $$p(\theta,y'|y)$$.

## Example: Two Binomials

• Recall the two-binomial problem with independent Beta priors from our last class.
• Now we want to obtain a numerical approximation to the joint predictive density for a pair of future binomials, one from each population.
• Assume the future values are independent of the data given the parameters.

## Example: Two Binomials

• Suppose $y_1 | \theta_1 \sim \text{Bin}(80, \theta_1) \perp y_2 | \theta_2 \sim \text{Bin}(100, \theta_2)$ and $\theta_1 \sim \text{Beta}(1, 1) \perp \theta_2 \sim \text{Beta}(2, 1).$
• Consider observed data $$y_1 = 32$$ and $$y_2 = 35$$.
• Recall $$\theta_1$$ and $$\theta_2$$ are independent given $$y_1, y_2$$. That is, $\theta_1 | y_1 \sim \text{Beta}(1 + 32, 1+80-32)$ independently of $\theta_2|y_2 \sim \text{Beta}(1+35, 2+100 - 35).$

## Example: Two Binomials

• If $$n'_1 = 50$$ and $$n'_2=50$$, then sampling from the joint predictive density is as follows.
n <- 10000
t1 <- rbeta(n, (1+32), (1+80-32))
t2 <- rbeta(n, (2+35), (1+100-35))
y1.new <- rbinom(n, 50, t1)
y2.new <- rbinom(n, 50, t2)

## Example: Two Binomials

• We can now numerically approximating the predictive probability that $$y'_1 > y'_2 + 5$$.
• The code also includes and estimate of the standard error for the estimated probability.
gamma <- y1.new - y2.new - 5
w <- as.numeric(gamma>0)
prob.est <- mean(w)
sd.est <- sd(w) / sqrt(length(w))
round(c(prob.est, sd.est), 4)
## [1] 0.2901 0.0045

## Monte Carlo Integration

• Discussed Monte Carlo integration without offering any justification that it works.
• Introduce the ideas that justify simulation methods, that is the Law of Large Numbers (LLN) and Central Limit Theorem (CLT).

## Motivating Example

• Think of a study to estimate the average height of all adults living in Kalamazoo, Michigan, say $$\mu$$.
• Standard approach would be to take a random sample of size $$n$$ from the population, say $$y_1, \dots, y_n$$ and estimate $$\mu$$ by the sample mean $$\bar{y}$$.
• Before we actually go out and get the sample, $$\bar{y}$$ is random (because the sampling is random) and $$\bar{y}$$ has a distribution.
• Intuitively, larger $$n$$ gives a better $$\bar{y}$$ for estimating $$\mu$$ since more information goes into the estimate.
• So typically, larger $$n$$ yields $$\bar{y}$$s closer to $$\mu$$. In the extreme case when $$n$$ is as large as possible and the entire population is sampled, then $$\bar{y} = \mu$$.

## Law of Large Numbers

• Let $$\theta_1, \theta_2, \dots$$ be a random sample with density $$p(\theta)$$, then if $$z_i = g(\theta_i)$$ for $$i=1, 2, \dots$$ and $$E[ g(\theta) ] < \infty$$, $\bar{z} = \frac{1}{s} \sum_{k=1}^s g(\theta_k) \stackrel{P}{\rightarrow} E[ g(\theta) ] = \int g(\theta) p(\theta) d \theta.$
• LLN is more flexible than you might think since we can consider any function $$g$$ with finite expectation.
• For example, applies to mean, variances, standard deviations, probabilities, and so on.
• Require a slightly different argument for quantiles such as the median, but such a result is readily available.

## Ergodic Theorem

• Unfortunately Markov chain methods do not give random samples. We usually get a sample $$\theta_1, \dots, \theta_s$$ that are identically distributed but not necessarily independent.
• Fortunately, an alternative version of the LLN called the Ergodic Theorem exists for identically distributed but dependent sequences.
• LLN and Ergodic Theorem give us a basis for believing that our simulations provide good answers for large Monte Carlo samples, but how large do the samples need to be?
• Central Limit Theorem (CLT), provides an answer.

## Central Limit Theorem

• CLT says that sample means computed from random samples closely follow a normal distribution for large sample sizes.
• Let $$\theta_1, \theta_2, \dots$$ be a random sample with density $$p(\theta)$$, then if $$z_i = g(\theta_i)$$ for $$i=1, 2, \dots$$ and $$E[ g(\theta)^2 ] < \infty$$, $\bar{z} = \frac{1}{s} \sum_{k=1}^s g(\theta_k) \stackrel{approx}{\sim} N \left( E[ g(\theta) ] , \text{Var} [ g(\theta) ] / s \right).$
• CLT allows us to compute a standard error (or Monte Carlo standard error or MCSE) for the estimate, namely the standard deviation of $$z_1, \dots, z_s$$ divided by $$s$$.
• WinBUGS uses an alternative method to compute the Monte Carlo error.
• Then we can simply simulate until the MCSE is sufficiently small.
• Fortunately, for Markov chain methods there is a Markov chain CLT and we can proceed similarly. More on this later…

## Prior Selection

• Conjugate Priors
• Improper Priors
• Flat Priors
• Jeffreys Priors

## Conjugate Priors

• Given a data distribution $$f(\theta | y)$$, a family of distributions is said to be conjugate to the given distribution if whenever the prior is in the conjugate family, so is the posterior, regardless of the observed value of the data.
• For example, if the data distribution is binomial, then the conjugate family of distributions is beta.
• Or if the data distribution is normal with known variance, then the conjugate family of distributions is normal.

## Improper Priors

• A subjective Bayesian is a person who really buys the Bayesian philosophy. Probability is the only correct measure of uncertainty, and this means that people have probability distributions in their heads that describe any quantities they are uncertain about. In any situation one must make one’s best effort to get the correct prior distribution out of the head of the relevant user and into Bayes rule.
• Many people, however, are happy to use the Bayesian paradigm while being much less fussy about priors. When the sample size is large, the likelihood outweighs the prior in determining the posterior. So, when the sample size is large, the prior is not crucial.

## Improper Priors

• Such people are willing to use priors chosen for mathematical convenience rather than their accurate representation of uncertainty.
• They often use priors that are very spread out to represent extreme uncertainty. Such priors are called “vague” or “diffuse” even though these terms have no precise mathematical definition.
• In the limit as the priors are spread out more and more one gets so-called improper priors.

## Improper Priors

• There is no guarantee that $\text{likelihood } × \text{ improper prior } = \text{ unnormalized posterior}$ results in anything that can be normalized. If the right-hand side integrates, then we get a proper posterior after normalization. If the right-hand does not integrate, then we get complete nonsense.
• You have to be careful when using improper priors that the answer makes sense. Probability theory doesn’t guarantee that, because improper priors are not probability distributions.

## Improper Priors

• Improper priors are questionable
• Subjective Bayesians think they are nonsense. They do not correctly describe the uncertainty of anyone.
• Everyone has to be careful using them, because they don’t always yield proper posteriors. Everyone agrees improper posteriors are nonsense.
• Because the joint distribution of data and parameters is also improper, paradoxes arise. These can be puzzling.
• However they are widely used and need to be understood.

## Objective Bayesian Inference

• The subjective, personalistic aspect of Bayesian inference bothers many people. Hence many attempts have been made to formulate objective priors, which are supposed to be priors that many people can agree on, at least in certain situations.
• However, none of the proposed objective priors achieve wide agreement.

## Flat Priors

• One obvious default prior is flat (constant), which seems to give no preference to any parameter value over any other.
• One problem with flat priors is that they are only flat for one parameterization.
• Consider $$y \sim \text{Bin}(n, \theta)$$. Often people think that putting a uniform distribution on $$\theta$$ denotes ignorance of the value of $$\theta$$. However, if you are ignorant about $$\theta$$ you should also be ignorant about $$\theta^2$$, and you cannot find a distribution that is uniform on both $$\theta$$ and $$\theta^2$$.

## Flat Priors

• If the parameter space is unbounded, then the flat prior is improper since $$\int p(\theta) d \theta = \infty$$.
• Christensen et al. refer to improper flat priors as "inherently stupid priors".
• Why? Given any bounded set $$A$$ and its complement $$A^c$$, the integral outside the set $$A$$ is $$\int_{A^c} p(\theta) d \theta = \infty$$ whereas the integral inside is $$\int_{A} p(\theta) d \theta < \infty$$.
• Prior belief is that virtually all weight goes to parameter values that are bigger than any number you can think of.
• Flat priors are often approximated by a proper prior with a large variance. This is what caused problems in the Lindley-Jeffreys paradox.
• Moreover, if you have a sampling distribution $$f(y| \theta)$$ and use a flat prior $$p(\theta) \propto 1$$ or another improper prior, the marginal density for the data $$f(y) = \int f(y| \theta) p(\theta) d \theta$$ often does not exist.

## Flat Priors

• Virtue of flat priors is that in many problems, the flat prior is easily overwhelmed by the data.
• Consider $$p(\theta) = K$$ for any old constant $$K$$, Bayes' Theorem gives us \begin{aligned} p(\theta|y) & = \frac{f(y|\theta)p(\theta)}{\int f(y|\theta)p(\theta) d \theta}\\ & = \frac{f(y|\theta)K}{\int f(y|\theta)K d \theta}\\ & = \frac{f(y|\theta)}{\int f(y|\theta) d \theta} , \end{aligned} so the prior appears to play no role in the posterior.

## Flat Priors

• Posterior is simply a renormalization of the likelihood into a density for $$\theta$$.
• Not all likelihoods can be renormalized, because not all have finite integrals with respect to $$\theta$$.
• Using flat priors with the large sample approximations leads to Bayesian inferences that are similar to frequentist large sample maximum likelihood inferences.
• Another alternative is Jeffreys priors.

## Jeffreys Priors

• Jeffreys (1961) proposed a class of priors for Bayesian problems. Many people view them as being "noninformative".
• For a one parameter problem, Fisher's information is defined to be the expected value of the negative of the second derivative of the log-likelihood.
• Jeffreys' prior is defined as being proportional to the square root of the Fisher information.
• Using Jeffreys' priors does not satisfy the stopping rule principle or the likelihood principle.
• For multiple parameters, the Jeffreys' Prior is the square root of the determinant of Fisher's information matrix.

## Normal Data

• Suppose we have independent normal observations with known mean $$\mu_0$$ and unknown variance $$\sigma^2$$ $y_1, \dots, y_n | \sigma^2 \stackrel{iid}{\sim} N(\mu_0, \sigma^2).$
• With precision $$\tau = 1 / \sigma^2$$, up to a constant the likelihood for $$\tau$$ is $L(\tau | y) \propto \prod_{i=1}^n \tau^{1/2} \exp \left[ \frac{\tau}{2} (y_i - \mu_0) \right] = \tau^{n/2} \exp \left[ \frac{\tau}{2} \sum_{i=1}^n (y_i - \mu_0) \right].$
• The log of the likelihood is $\log \left[ L(\tau | y) \right] = \frac{n}{2} \log (\tau) + \left[ \frac{\tau}{2} \sum_{i=1}^n (y_i - \mu_0) \right] + C.$

## Normal Data

• The derivative of the log-likelihood is $\frac{d}{d\tau} \log \left[ L(\tau | y) \right] = \frac{n}{2} \tau^{-1} + \left[ \frac{1}{2} \sum_{i=1}^n (y_i - \mu_0) \right].$
• The second derivative of the log-likelihood is $\frac{d^2}{d\tau^2} \log \left[ L(\tau | y) \right] = - \frac{n}{2} \tau^{-2}.$
• Jeffreys' prior for this problem is defined as $p(\tau) \propto \sqrt{\frac{n}{2} \tau^{-2}} \propto \frac{1}{\tau}.$

## Binomial Data

• Let $$y | \theta \sim \text{Bin}(n, \theta)$$, so $L(\theta | y) \propto \theta^y (1-\theta)^{n-y}.$
• The log-likelihood is $$y \log (\theta) + (n−y) \log (1−\theta)$$. The derivative is $$(y/\theta) − (n−y)/(1-\theta)$$ and the negative of the second derivative is $$(y/ \theta^2) + (n−y)/(1−\theta)^2$$.
• The expected value of this is $\frac{n\theta}{\theta^2} + \frac{n-n\theta}{(1−\theta)^2} = n \frac{1}{\theta} + n \frac{1}{(1-\theta)} = \frac{n}{\theta(1-\theta)}.$
• Then the Jeffreys' prior is $p(\theta) \propto \theta^{-1/2} (1-\theta) ^{-1/2},$ which is a Beta(0.5,0.5) distribution and gives greater plausibility to values near 0 and 1 than to values in between.

## Binomial Data

• Jeffreys' priors for binomial and negative binomial data are different.
• Jeffreys' priors gives different answers for binomial and negative binomial data that have exactly the same number of successes and failures.
• Finding the negative of the second derivative of the log-likelihood is equivalent in the two cases, but Fisher's information involves taking an expected value and the expected values come out differently in the binomial and negative binomial cases.