- Bayesian inference
- Priors
- Point estimates
- Bayesian hypothesis testing
- Bayes factors
To a Bayesian, probability is the only way to describe uncertainty. Things not known for certain – like values of parameters – must be described by a probability distribution, called your prior distribution
The data changes your uncertainty, which is then described by a new probability distribution called your posterior distribution
Most of Bayesian inference is about how to go from prior to posterior
Suppose the prior distribution for \(p\) is Beta(\(\alpha_1, \alpha_2\)) and the conditional distribution of \(x\) given \(p\) is Bin(\(n\), \(p\)). Then \[ f(x|p) = {n \choose x} p^x (1-p)^{n-x} \] and \[ g(p) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1}. \] Then \[ f(x|p) g(p) = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] and this, considered as a function of \(p\) for fixed \(x\), is, except for constants, the PDF of a Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.
Why? \[ \begin{aligned} h (p | x) & = \frac{f( x | p) g(p)}{\int f( x | p) g(p) d p} \\ & \propto f( x | p) g(p) \\ & = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \\ & \propto p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \end{aligned} \] And there is only one PDF with support \([0,1]\) of that form, i.e. a Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.
\[ \text{likelihood } \times \text{ unnormalized prior } = \text{ unnormalized posterior} \]
In our example we could have multiplied likelihood \[ p^x (1-p)^{n-x} \] times unnormalized prior \[ p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1} \] to get unnormalized posterior \[ p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] which, as before, can be recognized as an unnormalized beta PDF.
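As a quick sanity check in R, we can normalize the unnormalized posterior by numerical integration and compare it to the Beta(\(x + \alpha_1, n - x + \alpha_2\)) density. The particular values of x, n, alpha1, alpha2 below are only illustrative.

x <- 3
n <- 10
alpha1 <- 2
alpha2 <- 2
# likelihood times unnormalized prior, as a function of p
unnorm <- function(p) p^(x + alpha1 - 1) * (1 - p)^(n - x + alpha2 - 1)
# normalizing constant by numerical integration over (0, 1)
const <- integrate(unnorm, 0, 1)$value
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
cbind(normalized = unnorm(p) / const,
      beta.pdf = dbeta(p, x + alpha1, n - x + alpha2))

The two columns should agree (up to numerical integration error), as the conjugacy argument says they must.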
Suppose \(X_1 , \dots, X_n\) are i.i.d. \(N(\theta , \sigma^2)\) where \(\sigma^2\) is known. Suppose further we have a prior \(\theta \sim N(\mu, \tau^2)\). Then the posterior can be obtained as follows, \[ \begin{aligned} f (\theta | x) & \propto f(\theta) \prod_{i=1}^n f(x_i | \theta) \\ & \propto \exp \left \{ -\frac{1}{2} \left( \frac{(\theta-\mu)^2}{\tau^2} + \frac{\sum_{i=1}^{n} (x_i - \theta)^2}{\sigma^2} \right) \right\} \\ & \propto \exp \left \{ -\frac{1}{2} \frac{\left( \theta - \displaystyle \frac{\mu / \tau^2 + n\bar{x} / \sigma^2}{1/\tau^2 + n/\sigma^2} \right)^2}{\displaystyle \frac{1}{1/\tau^2 + n/\sigma^2}} \right\}. \end{aligned} \]
That is, \(\theta \mid x \sim N( \mu_n, \tau_n^2)\) where \[ \mu_n = \left( \frac{\mu}{\tau^2} + \frac{n \bar{x}}{\sigma^2} \right) \tau_n ^2 \quad \mbox{and} \quad \tau_n^2 = \frac{1}{1/\tau^2 + n/\sigma^2} . \] We will call this a conjugate Bayes model. Also note a 95% credible interval for \(\theta\) (which is also the HPD, highest posterior density, interval) is given by \[ \left( \mu_n - 1.96 \tau_n, \mu_n + 1.96 \tau_n \right) . \] For large \(n\), the data will overwhelm the prior.
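A small R sketch of this conjugate update; the data and prior settings (sigma, mu, tau, and the simulated x) are made up for illustration.

sigma <- 2                    # known sampling standard deviation
mu <- 0                       # prior mean for theta
tau <- 10                     # prior standard deviation for theta
set.seed(42)
x <- rnorm(25, mean = 1.5, sd = sigma)   # simulated data
n <- length(x)
tau.n.sq <- 1 / (1 / tau^2 + n / sigma^2)                  # posterior variance
mu.n <- (mu / tau^2 + n * mean(x) / sigma^2) * tau.n.sq    # posterior mean
c(mu.n - 1.96 * sqrt(tau.n.sq), mu.n + 1.96 * sqrt(tau.n.sq))   # 95% credible interval

With 25 observations and a vague prior, the posterior mean is already pulled essentially to the sample mean, illustrating how the data overwhelm the prior.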
# posterior median, a natural Bayesian point estimate of p
qbeta(0.5, x + alpha1, n - x + alpha2)
# endpoints of an equal-tailed 80% credible interval for p
qbeta(0.1, x + alpha1, n - x + alpha2)
qbeta(0.9, x + alpha1, n - x + alpha2)
alpha1 <- alpha2 <- 1 / 2   # Beta(1/2, 1/2) prior
x <- 2
n <- 10
(x + alpha1) / (n + alpha1 + alpha2)   # posterior mean of p
## [1] 0.2272727
qbeta(0.5, x + alpha1, n - x + alpha2)   # posterior median
## [1] 0.2103736
cbind(qbeta(0.1, x + alpha1, n - x + alpha2), qbeta(0.9, x + alpha1, n - x + alpha2))
##            [,1]      [,2]
## [1,] 0.08361516 0.3948296
pbeta(0.5, x + alpha1, n - x + alpha2)   # posterior probability that p <= 1/2
## [1] 0.9739634
pbeta(0.5, x + alpha1, n - x + alpha2, lower.tail = FALSE)   # posterior probability that p > 1/2
## [1] 0.02603661
alpha1 <- alpha2 <- 1 / 2
x <- 2
n <- 10
pbeta(0.5, x + alpha1, n - x + alpha2)   # posterior probability that p <= 1/2
## [1] 0.9739634
pbeta(0.5, x + alpha1, n - x + alpha2, lower.tail = FALSE)
## [1] 0.02603661
The unnormalized posterior for everything, models and parameters within models, is \[ f(x | \theta, m) g(\theta | m) h(m) \] To obtain the unnormalized posterior probability of model \(m\) alone, we must integrate out the nuisance parameters \(\theta\) \[ \begin{aligned} q(x | m) & = \int_{\Theta_m} f(x | \theta, m) g(\theta | m) h(m) d \theta \\ & = h(m) \int_{\Theta_m} f(x | \theta, m) g(\theta | m) d \theta \end{aligned} \] These are the unnormalized posterior probabilities of the models.
The normalized probabilities are \[ p(m | x) = \frac{q(x | m)}{\sum_{m'} q(x | m')} \]
It is useful to define \[ b(x | m) = \int_{\Theta_m} f(x | \theta, m) g(\theta | m) d \theta \] so \[ q(x | m) = b(x | m) h(m) \] Then the ratio of posterior probabilities of models \(m_1\) and \(m_2\) is \[ \frac{p(m_1|x)}{p(m_2|x)} = \frac{q(x | m_1)}{q(x | m_2)} = \frac{b(x | m_1) h(m_1)}{b(x | m_2) h(m_2)} \] This ratio is called the posterior odds of these models (a ratio of probabilities is called an odds).
The prior odds is \[ \frac{h(m_1)}{h(m_2)} \] The term we have not yet named in \[ \frac{p(m_1|x)}{p(m_2|x)} = \frac{b(x | m_1) h(m_1)}{b(x | m_2) h(m_2)} \] is called the Bayes factor \[ \frac{b(x | m_1)}{b(x | m_2)} \] which is the ratio of the posterior odds to the prior odds.
The prior odds tells us how the prior compares the probabilities of the models. The Bayes factor tells us how the data shifts that comparison going from prior to posterior via Bayes rule.
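A minimal sketch of this computation in R; posterior_model_probs is just a hypothetical helper, and the b values plugged in are made up for illustration.

# q(x | m) = b(x | m) h(m), then normalize to get p(m | x)
posterior_model_probs <- function(b, h) {
  q <- b * h      # unnormalized posterior probabilities q(x | m)
  q / sum(q)      # p(m | x)
}
posterior_model_probs(b = c(0.04, 0.07), h = c(1 / 2, 1 / 2))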
In the binomial example, take \(m_1\) to be the model with \(p\) fixed at \(p_0\), so that \(b(x | m_1) = f(x | p_0)\), and take \(m_2\) to be the model with the Beta(\(\alpha_1, \alpha_2\)) prior on \(p\). Then \[ \begin{aligned} b(x | m_2) & = \int_0^1 f(x|p) g(p|m_2) dp \\ & = \int_0^1 {n \choose x} \frac{1}{B(\alpha_1, \alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} dp \\ & = {n \choose x} \frac{B(x+ \alpha_1, n - x + \alpha_2)}{B(\alpha_1, \alpha_2)} \end{aligned} \] where \[ B(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)} \] by properties of the Beta distribution.
alpha1 <- alpha2 <- 1 / 2
x <- 2
n <- 10
p0 <- 1 / 2
b1 <- dbinom(x, n, p0)   # b(x | m1): binomial with p fixed at p0
b2 <- choose(n, x) * beta(x + alpha1, n - x + alpha2) / beta(alpha1, alpha2)   # b(x | m2)
BayesFactor <- b1 / b2
BayesFactor
## [1] 0.5967366
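To turn this Bayes factor into a posterior probability for \(m_1\) we also need prior probabilities for the models. Assuming equal prior probabilities \(h(m_1) = h(m_2) = 1/2\) (an assumption, not part of the calculation above), the posterior odds equal the Bayes factor:

# with equal prior model probabilities, posterior odds = Bayes factor,
# so the posterior probability of m1 is
BayesFactor / (1 + BayesFactor)

so under these prior odds the point null \(m_1\) keeps posterior probability of roughly 0.37.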
pvalue <- 2 * pbinom(x, n, p0)   # doubled lower-tail binomial p-value, for comparison
pvalue
## [1] 0.109375