---
title: 'MCMC I'
author: "James M. Flegal"
output:
ioslides_presentation:
smaller: yes
---
## Agenda
- Like Ordinary Monte Carlo (OMC), but better?
- SLLN and Markov chain CLT
- Variance estimation
- AR(1) example
- Metropolis-Hastings algorithm (with an exercise)
## Markov chain Monte Carlo
- A Markov chain is a dependent sequence of random variables $X_1, X_2, \dots$ or or random vectors $X_1, X_2, \dots$ having the property that the future is independent of the past given the present
- Conditional distribution of $X_{n+1}$ given $X_1, \dots, X_n$ depends only on $X_n$
- The Markov chain has **stationary transition probabilities** if the conditional distribution of $X_{n+1}$ given $X_n$ is the same for all $n$
+ Every Markov chain used in MCMC has this property
- The joint distribution of $X_1, \dots, X_n$ is determined by the initial distribution of the Markov chain and the transition probabilities
+ Marginal distribution of $X_1$
+ Conditional distribution of $X_{n+1}$ given $X_n$
## Markov chain Monte Carlo
- A scalar functional of a Markov chain is a time series, but not necessarily a Markov chain
- A Markov chain is **stationary** if its initial distribution is stationary
+ Different from having stationary transition probabilities
+ All chains used in MCMC have stationary transition probabilities, but none are exactly stationary
## Markov chain Monte Carlo
- To be (exactly) stationary, must start the chain with simulation from the equilibrium (invariant, stationary) distribution
- If chain is stationary, then every iterate $X_i$ has the same marginal distribution, which is the equilibrium distribution
- If chain is not stationary but has a unique equilibrium distribution, which includes chains used in MCMC, then the marginal distribution $X_i$ converges to the equilibrium distribution as $i \to \infty$
## Markov chain Monte Carlo
- Let $\pi$ be a probability distribution having support ${\cal X} \subseteq \mathbb{R}^{d}$, $d \ge 1$ we want to explore
- When i.i.d. observations are unavailable, a Markov chain with stationary distribution $\pi$ can be utilized
- Summarize $\pi$ with expectations, quantiles, density plots ...
## Markov chain Monte Carlo
- Suppose $X_1, \dots, X_n$ are simulation from a Markov chain having a unique equilibrium distribution (say $\pi$), and suppose we want to know an expectation
\[
\mu_g = E [ g(X_i) ] = \int_{{\cal X}} g(x) \, \pi(dx)
\]
where the expectation is with respect to unique equilibrium distribution $\pi$
- If $E_{\pi} |g(X_i)| < \infty$, then
\[
\hat{\mu}_{n} = \frac{1}{n} \sum_{i=1}^{n} g(X_i) \stackrel{a.s.}{\rightarrow} \mu_g \quad \text{as } n \rightarrow \infty \; (SLLN).
\]
## Markov chain Monte Carlo
- The central limit theorem (CLT) for Markov chains says
\[
\sqrt{n} (\hat{\mu}_{n} - E_{\pi} g (X_i)) \to \text{N} (0, \sigma^{2}) \; ,
\]
where
\[
\sigma^{2} = \text{Var} g(X_i) + 2 \sum_{k = 1}^{\infty} \text{Cov} \left[ g(X_i), g(X_{i+k})\right]
\]
- CLT holds if $E_{\pi} |g (X_i) |^{2+\epsilon} < \infty$ and the Markov chain is geometrically ergodic
- Can estimate $\sigma^{2}$ in various ways
- Verifying such a mixing condition is generally very challenging
- Nevertheless, we expect the CLT to hold in practice when using a **smart** sampler
## Batch means
- In order to make MCMC practical, need a method to estimate the variance $\sigma^{2}$ in the CLT, then can proceed just like in OMC
- If $\hat{\sigma}^{2}$ is a consistent estimate of $\sigma^{2}$, then an asymptotic 95% confidence interval for $\mu_g$ is
\[
\hat{\mu}_{n} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}
\]
- The method of batch means estimates the asymptotic variance for a stationary time series
## Batch means
- Markov chain CLT says
\[
\hat{\mu}_{n} \approx \text{N} \left( \mu_g, \frac{\sigma^{2}}{n} \right)
\]
- Suppose $b$ evenly divides $n$ and we have the means
\[
\hat{\mu}_{b,k} = \frac{1}{b} \sum_{i=bk+1}^{bk+b} g(X_i)
\]
for $k = 1, \dots, a = n / b$
- Then each of these **batch means** satisfies (if $b$ is sufficiently large)
\[
\hat{\mu}_{b,k} \approx \text{N} \left( \mu_g, \frac{\sigma^{2}}{b} \right)
\]
## Batch means
- Thus empirical variance of the sequence of batch means
\[
\frac{1}{a} \sum_{k=1}^a \left( \hat{\mu}_{b,k} - \hat{\mu}_{n} \right)^2
\]
estimates $\sigma^{2}/b$
- And $b/n$ times this estimates $\sigma^{2} / n$, the asymptotic variance of $\hat{\mu}_{n}$
- Batch means can produce a strongly consistent estimator of $\sigma^{2}$ if $b \rightarrow \infty$ and $a \rightarrow \infty$ as $n \rightarrow \infty$
## Stopping rules
- Suppose $\epsilon>0$, then a **fixed-width stopping rule* terminates the simulation the first time half-width (or width) of a confidence interval is sufficiently small
- That is, simulate until
\[
1.96 \frac{\hat{\sigma}}{\sqrt{n}} < \epsilon .
\]
## Example: AR(1)
- Consider the Markov chain such that
\[
X_i = \rho X_{i-1} + \epsilon_i
\]
where $\epsilon_i \stackrel{iid}{\sim} N(0,1)$
+ Consider $X_1 = 0$, $\rho = .95$, and estimating $E_{\pi} X = 0$
+ Run until
\[
w_{n} = 2 z_{.975} \frac{ \hat{\sigma}}{ \sqrt{n} } \leq 0.2
\]
where $\hat{\sigma}$ is calculated using batch means
## Example: AR(1)
The following will provide an observation from the MC 1 step ahead
```{r}
ar1 <- function(m, rho, tau) {
rho*m + rnorm(1, 0, tau)
}
```
Next, we add to this function so that we can give it a Markov chain and the result will be p observations from the Markov chain
```{r}
ar1.gen <- function(mc, p, rho, tau, q=1) {
loc <- length(mc)
junk <- double(p)
mc <- append(mc, junk)
for(i in 1:p){
j <- i+loc-1
mc[(j+1)] <- ar1(mc[j], rho, tau)
}
return(mc)
}
```
## Example: AR(1)
```{r}
set.seed(20)
library(mcmcse)
tau <- 1
rho <- .95
out <- 0
eps <- 0.1
start <- 1000
r <- 1000
```
## Example: AR(1)
```{r}
out <- ar1.gen(out, start, rho, tau)
MCSE <- mcse(out)$se
N <- length(out)
t <- qt(.975, (floor(sqrt(N) - 1)))
muhat <- mean(out)
check <- MCSE * t
while(eps < check) {
out <- ar1.gen(out, r, rho, tau)
MCSE <- append(MCSE, mcse(out)$se)
N <- length(out)
t <- qt(.975, (floor(sqrt(N) - 1)))
muhat <- append(muhat, mean(out))
check <- MCSE[length(MCSE)] * t
}
N <- seq(start, length(out), r)
t <- qt(.975, (floor(sqrt(N) - 1)))
half <- MCSE * t
sigmahat <- MCSE*sqrt(N)
N <- seq(start, length(out), r) / 1000
```
## Example: AR(1)
```{r}
plot(N, muhat, main="Estimates of the Mean", xlab="Iterations (in 1000's)")
points(N, muhat, type="l", col="red") ; abline(h=0, lwd=3)
legend("bottomright", legend=c("Observed", "Actual"), lty=c(1,1), col=c(2,1), lwd=c(1,3))
```
## Example: AR(1)
```{r}
plot(N, sigmahat, main="Estimates of Sigma", xlab="Iterations (in 1000's)")
points(N, sigmahat, type="l", col="red"); abline(h=20, lwd=3)
legend("bottomright", legend=c("Observed", "Actual"), lty=c(1,1), col=c(2,1), lwd=c(1,3))
```
## Example: AR(1)
```{r}
plot(N, 2*half, main="Calculated Interval Widths", xlab="Iterations (in 1000's)", ylab="Width", ylim=c(0, 1.8))
points(N, 2*half, type="l", col="red"); abline(h=0.2, lwd=3)
legend("topright", legend=c("Observed", "Cut-off"), lty=c(1,1), col=c(2,1), lwd=c(1,3))
```
## Markov chain Monte Carlo
- MCMC methods are used most often in Bayesian inference where the equilibrium (invariant, stationary) distribution is a posterior distribution
- Challenge lies in construction of a suitable Markov chain with $f$ as its stationary distribution
- A key problem is we only get to observe $t$ observations from $\left\{ X_t \right\}$, which are serially **dependent**
- Other questions to consider
+ How good are my MCMC estimators?
+ How long to run my Markov chain simulation?
+ How to compare MCMC samplers?
+ What to do in high-dimensional settings?
+ ...
## Metropolis-Hastings algorithm
Setting $X_0 = x_0$ (somehow), the Metropolis-Hastings algorithm generates $X_{t+1}$ given $X_t = x_t$ as follows:
1. Sample a candidate value $X^* \sim g(\cdot | x_t)$ where $g$ is the proposal distribution
2. Compute the MH ratio $R(x_t, X^*)$, where
\[
R(x_t, X^*) = \frac{f(x^*) g (x_t | x^*)}{f(x_t) g (x^* | x_t)}
\]
3. Set
\[
X_{t+1} = \begin{cases} x^* \mbox{ w.p.\ } \min\{ R(x_t, X^*), 1\} \\ x_t \mbox{ otherwise} \end{cases}
\]
## Metropolis-Hastings algorithm
- Irreducibility and aperiodicity depend on the choice of $g$, these must be checked
- Performance (finite sample) depends on the choice of $g$ also, be careful
## Independence chains
- Suppose $g (x^* | x_t) = g (x^*)$, this yields an **independence** chain since the proposal does not depend on the current state
- In this case, the MH ratio is
\[
R(x_t, X^*) = \frac{f(x^*) g (x_t)}{f(x_t) g (x^*)},
\]
and the resulting Markov chain will be irreducible and aperiodic if $g > 0$ where $f>0$
- A good envelope function $g$ should resemble $f$, but should cover $f$ in the tails
## Random walk chains
- Generate $X^*$ such that $\epsilon\sim h(\cdot)$ and set $X^* = X_t + \epsilon$, then $g(x^* | x_t) = h(x^* - x_t)$
- Common choices of $h(\cdot)$ are symmetric zero mean random variables with a scale parameter, e.g.\ a Uniform($-a,a$), Normal($0, \sigma^2$), $c*T_{\nu}, \dots$
- For symmetric zero mean random variables, the MH ratio is
\[
R(x_t, X^*) = \frac{f(x^*)}{f(x_t)}
\]
- If the support of $f$ is connected and $h$ is positive in a neighborhood of 0, then the chain is irreducible and aperiodic.
## Example: Markov chain basics
***Exercise***: Suppose $f \sim Exp(1)$
1. Write an independence MH sampler with $g \sim Exp(\theta)$
2. Show $R(x_t, X^*) = \exp \left\{ (x_t - x^*)(1-\theta) \right\}$
3. Generate 1000 draws from $f$ with $\theta \in \{ 1/2, 1, 2 \}$
4. Write a random walk MH sampler with $h \sim N(0, \sigma^2)$
5. Show $R(x_t, X^*) = \exp \left\{ x_t - x^* \right \} I(x^* > 0)$
6. Generate 1000 draws from $f$ with $\sigma \in \{ .2, 1, 5 \}$
7. In general, do you prefer an independence chain or a random walk MH sampler? Why?
8. Implement the fixed-width stopping rule for you preferred chain
## Example: Markov chain basics
Independence Metropolis sampler with Exp($\theta$) proposal
```{r}
ind.chain <- function(x, n, theta = 1) {
## if theta = 1, then this is an iid sampler
m <- length(x)
x <- append(x, double(n))
for(i in (m+1):length(x)){
x.prime <- rexp(1, rate=theta)
u <- exp((x[(i-1)]-x.prime)*(1-theta))
if(runif(1) < u)
x[i] <- x.prime
else
x[i] <- x[(i-1)]
}
return(x)
}
```
## Example: Markov chain basics
Random Walk Metropolis sampler with N($0,\sigma$) proposal
```{r}
rw.chain <- function(x, n, sigma = 1) {
m <- length(x)
x <- append(x, double(n))
for(i in (m+1):length(x)){
x.prime <- x[(i-1)] + rnorm(1, sd = sigma)
u <- exp((x[(i-1)]-x.prime))
u
if(runif(1) < u && x.prime > 0)
x[i] <- x.prime
else
x[i] <- x[(i-1)]
}
return(x)
}
```
## Example: Markov chain basics
```{r}
trial0 <- ind.chain(1, 500, 1)
trial1 <- ind.chain(1, 500, 2)
trial2 <- ind.chain(1, 500, 1/2)
rw1 <- rw.chain(1, 500, .2)
rw2 <- rw.chain(1, 500, 1)
rw3 <- rw.chain(1, 500, 5)
```
## Example: Markov chain basics
```{r, echo=FALSE}
par(mfrow=c(2,3))
plot.ts(trial0, ylim=c(0,6), main="IID Draws")
plot.ts(trial1, ylim=c(0,6), main="Independence with 1/2")
plot.ts(trial2, ylim=c(0,6), main="Independence with 2")
plot.ts(rw1, ylim=c(0,6), main="Random Walk with .2")
plot.ts(rw2, ylim=c(0,6), main="Random Walk with 1")
plot.ts(rw3, ylim=c(0,6), main="Random Walk with 5")
par(mfrow=c(1,1))
```