MCMC I

Agenda

Like Ordinary Monte Carlo (OMC), but better?
SLLN and Markov chain CLT
Variance estimation
AR(1) example
Metropolis-Hastings algorithm (with an exercise)

Markov chain Monte Carlo

A Markov chain is a dependent sequence of random variables \(X_1, X_2, \dots\) or or random vectors \(X_1, X_2, \dots\) having the property that the future is independent of the past given the present
Conditional distribution of \(X_{n+1}\) given \(X_1, \dots, X_n\) depends only on \(X_n\)
The Markov chain has stationary transition probabilities if the conditional distribution of \(X_{n+1}\) given \(X_n\) is the same for all \(n\)
- Every Markov chain used in MCMC has this property
The joint distribution of \(X_1, \dots, X_n\) is determined by the initial distribution of the Markov chain and the transition probabilities
- Marginal distribution of \(X_1\)
- Conditional distribution of \(X_{n+1}\) given \(X_n\)

Markov chain Monte Carlo

A scalar functional of a Markov chain is a time series, but not necessarily a Markov chain
A Markov chain is stationary if its initial distribution is stationary
- Different from having stationary transition probabilities
- All chains used in MCMC have stationary transition probabilities, but none are exactly stationary

Markov chain Monte Carlo

To be (exactly) stationary, must start the chain with simulation from the equilibrium (invariant, stationary) distribution
If chain is stationary, then every iterate \(X_i\) has the same marginal distribution, which is the equilibrium distribution
If chain is not stationary but has a unique equilibrium distribution, which includes chains used in MCMC, then the marginal distribution \(X_i\) converges to the equilibrium distribution as \(i \to \infty\)

Markov chain Monte Carlo

Let \(\pi\) be a probability distribution having support \({\cal X} \subseteq \mathbb{R}^{d}\), \(d \ge 1\) we want to explore
When i.i.d. observations are unavailable, a Markov chain with stationary distribution \(\pi\) can be utilized
Summarize \(\pi\) with expectations, quantiles, density plots …

Markov chain Monte Carlo

Suppose \(X_1, \dots, X_n\) are simulation from a Markov chain having a unique equilibrium distribution (say \(\pi\)), and suppose we want to know an expectation \[ \mu_g = E [ g(X_i) ] = \int_{{\cal X}} g(x) \, \pi(dx) \] where the expectation is with respect to unique equilibrium distribution \(\pi\)
If \(E_{\pi} |g(X_i)| < \infty\), then \[ \hat{\mu}_{n} = \frac{1}{n} \sum_{i=1}^{n} g(X_i) \stackrel{a.s.}{\rightarrow} \mu_g \quad \text{as } n \rightarrow \infty \; (SLLN). \]

Markov chain Monte Carlo

The central limit theorem (CLT) for Markov chains says \[ \sqrt{n} (\hat{\mu}_{n} - E_{\pi} g (X_i)) \to \text{N} (0, \sigma^{2}) \; , \] where \[ \sigma^{2} = \text{Var} g(X_i) + 2 \sum_{k = 1}^{\infty} \text{Cov} \left[ g(X_i), g(X_{i+k})\right] \]
CLT holds if \(E_{\pi} |g (X_i) |^{2+\epsilon} < \infty\) and the Markov chain is geometrically ergodic
Can estimate \(\sigma^{2}\) in various ways
Verifying such a mixing condition is generally very challenging
Nevertheless, we expect the CLT to hold in practice when using a smart sampler

Batch means

In order to make MCMC practical, need a method to estimate the variance \(\sigma^{2}\) in the CLT, then can proceed just like in OMC
If \(\hat{\sigma}^{2}\) is a consistent estimate of \(\sigma^{2}\), then an asymptotic 95% confidence interval for \(\mu_g\) is \[ \hat{\mu}_{n} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}} \]
The method of batch means estimates the asymptotic variance for a stationary time series

Batch means

Markov chain CLT says \[ \hat{\mu}_{n} \approx \text{N} \left( \mu_g, \frac{\sigma^{2}}{n} \right) \]
Suppose \(b\) evenly divides \(n\) and we have the means \[ \hat{\mu}_{b,k} = \frac{1}{b} \sum_{i=bk+1}^{bk+b} g(X_i) \] for \(k = 1, \dots, a = n / b\)
Then each of these batch means satisfies (if \(b\) is sufficiently large) \[ \hat{\mu}_{b,k} \approx \text{N} \left( \mu_g, \frac{\sigma^{2}}{b} \right) \]

Batch means

Thus empirical variance of the sequence of batch means \[ \frac{1}{a} \sum_{k=1}^a \left( \hat{\mu}_{b,k} - \hat{\mu}_{n} \right)^2 \] estimates \(\sigma^{2}/b\)
And \(b/n\) times this estimates \(\sigma^{2} / n\), the asymptotic variance of \(\hat{\mu}_{n}\)
Batch means can produce a strongly consistent estimator of \(\sigma^{2}\) if \(b \rightarrow \infty\) and \(a \rightarrow \infty\) as \(n \rightarrow \infty\)

Stopping rules

Suppose \(\epsilon>0\), then a **fixed-width stopping rule* terminates the simulation the first time half-width (or width) of a confidence interval is sufficiently small
That is, simulate until \[ 1.96 \frac{\hat{\sigma}}{\sqrt{n}} < \epsilon . \]

Example: AR(1)

Consider the Markov chain such that \[ X_i = \rho X_{i-1} + \epsilon_i \] where \(\epsilon_i \stackrel{iid}{\sim} N(0,1)\)
- Consider \(X_1 = 0\), \(\rho = .95\), and estimating \(E_{\pi} X = 0\)
- Run until \[ w_{n} = 2 z_{.975} \frac{ \hat{\sigma}}{ \sqrt{n} } \leq 0.2 \] where \(\hat{\sigma}\) is calculated using batch means

Example: AR(1)

The following will provide an observation from the MC 1 step ahead

ar1 <- function(m, rho, tau) {
rho*m + rnorm(1, 0, tau)
}

Next, we add to this function so that we can give it a Markov chain and the result will be p observations from the Markov chain

ar1.gen <- function(mc, p, rho, tau, q=1) {
loc <- length(mc)
junk <- double(p)
mc <- append(mc, junk)

for(i in 1:p){
j <- i+loc-1
mc[(j+1)] <- ar1(mc[j], rho, tau)
}
return(mc)
}

Example: AR(1)

set.seed(20)
library(mcmcse)

## mcmcse: Monte Carlo Standard Errors for MCMC
## Version 1.2-1 created on 2016-03-24.
## copyright (c) 2012, James M. Flegal, University of California,Riverside
##                     John Hughes, University of Minnesota
##                     Dootika Vats, University of Minnesota
##  For citation information, type citation("mcmcse").
##  Type help("mcmcse-package") to get started.

tau <- 1
rho <- .95
out <- 0
eps <- 0.1
start <- 1000
r <- 1000

Example: AR(1)

out <- ar1.gen(out, start, rho, tau)
MCSE <- mcse(out)$se
N <- length(out)
t <- qt(.975, (floor(sqrt(N) - 1)))
muhat <- mean(out)
check <- MCSE * t

while(eps < check) {
out <- ar1.gen(out, r, rho, tau)
MCSE <- append(MCSE, mcse(out)$se)
N <- length(out)
t <- qt(.975, (floor(sqrt(N) - 1)))
muhat <- append(muhat, mean(out))
check <- MCSE[length(MCSE)] * t
}

N <- seq(start, length(out), r) 
t <- qt(.975, (floor(sqrt(N) - 1)))
half <- MCSE * t
sigmahat <- MCSE*sqrt(N)
N <- seq(start, length(out), r) / 1000

Example: AR(1)

plot(N, muhat, main="Estimates of the Mean", xlab="Iterations (in 1000's)")
points(N, muhat, type="l", col="red") ; abline(h=0, lwd=3)
legend("bottomright", legend=c("Observed", "Actual"), lty=c(1,1), col=c(2,1), lwd=c(1,3))

Example: AR(1)

plot(N, sigmahat, main="Estimates of Sigma", xlab="Iterations (in 1000's)")
points(N, sigmahat, type="l", col="red"); abline(h=20, lwd=3)
legend("bottomright", legend=c("Observed", "Actual"), lty=c(1,1), col=c(2,1), lwd=c(1,3))

Example: AR(1)

plot(N, 2*half, main="Calculated Interval Widths", xlab="Iterations (in 1000's)", ylab="Width", ylim=c(0, 1.8))
points(N, 2*half, type="l", col="red"); abline(h=0.2, lwd=3)
legend("topright", legend=c("Observed", "Cut-off"), lty=c(1,1), col=c(2,1), lwd=c(1,3))

Markov chain Monte Carlo

MCMC methods are used most often in Bayesian inference where the equilibrium (invariant, stationary) distribution is a posterior distribution
Challenge lies in construction of a suitable Markov chain with \(f\) as its stationary distribution
A key problem is we only get to observe \(t\) observations from \(\left\{ X_t \right\}\), which are serially dependent
Other questions to consider
- How good are my MCMC estimators?
- How long to run my Markov chain simulation?
- How to compare MCMC samplers?
- What to do in high-dimensional settings?
- …

Metropolis-Hastings algorithm

Setting \(X_0 = x_0\) (somehow), the Metropolis-Hastings algorithm generates \(X_{t+1}\) given \(X_t = x_t\) as follows:

Sample a candidate value \(X^* \sim g(\cdot | x_t)\) where \(g\) is the proposal distribution
Compute the MH ratio \(R(x_t, X^*)\), where \[ R(x_t, X^*) = \frac{f(x^*) g (x_t | x^*)}{f(x_t) g (x^* | x_t)} \]
Set \[ X_{t+1} = \begin{cases} x^* \mbox{ w.p.\ } \min\{ R(x_t, X^*), 1\} \\ x_t \mbox{ otherwise} \end{cases} \]

Metropolis-Hastings algorithm

Irreducibility and aperiodicity depend on the choice of \(g\), these must be checked
Performance (finite sample) depends on the choice of \(g\) also, be careful

Independence chains

Suppose \(g (x^* | x_t) = g (x^*)\), this yields an independence chain since the proposal does not depend on the current state
In this case, the MH ratio is
\[ R(x_t, X^*) = \frac{f(x^*) g (x_t)}{f(x_t) g (x^*)}, \] and the resulting Markov chain will be irreducible and aperiodic if \(g > 0\) where \(f>0\)
A good envelope function \(g\) should resemble \(f\), but should cover \(f\) in the tails

Random walk chains

Generate \(X^*\) such that \(\epsilon\sim h(\cdot)\) and set \(X^* = X_t + \epsilon\), then \(g(x^* | x_t) = h(x^* - x_t)\)
Common choices of \(h(\cdot)\) are symmetric zero mean random variables with a scale parameter, e.g. a Uniform(\(-a,a\)), Normal(\(0, \sigma^2\)), \(c*T_{\nu}, \dots\)
For symmetric zero mean random variables, the MH ratio is
\[ R(x_t, X^*) = \frac{f(x^*)}{f(x_t)} \]
If the support of \(f\) is connected and \(h\) is positive in a neighborhood of 0, then the chain is irreducible and aperiodic.

Example: Markov chain basics

Exercise: Suppose \(f \sim Exp(1)\)

Write an independence MH sampler with \(g \sim Exp(\theta)\)
Show \(R(x_t, X^*) = \exp \left\{ (x_t - x^*)(1-\theta) \right\}\)
Generate 1000 draws from \(f\) with \(\theta \in \{ 1/2, 1, 2 \}\)
Write a random walk MH sampler with \(h \sim N(0, \sigma^2)\)
Show \(R(x_t, X^*) = \exp \left\{ x_t - x^* \right \} I(x^* > 0)\)
Generate 1000 draws from \(f\) with \(\sigma \in \{ .2, 1, 5 \}\)
In general, do you prefer an independence chain or a random walk MH sampler? Why?
Implement the fixed-width stopping rule for you preferred chain

Example: Markov chain basics

Independence Metropolis sampler with Exp(\(\theta\)) proposal

ind.chain <- function(x, n, theta = 1) {
  ## if theta = 1, then this is an iid sampler
  m <- length(x)
  x <- append(x, double(n))
  for(i in (m+1):length(x)){
    x.prime <- rexp(1, rate=theta)
    u <- exp((x[(i-1)]-x.prime)*(1-theta))
    if(runif(1) < u)
      x[i] <- x.prime
    else
      x[i] <- x[(i-1)]
  }
  return(x)
}

Example: Markov chain basics

Random Walk Metropolis sampler with N(\(0,\sigma\)) proposal

rw.chain <- function(x, n, sigma = 1) {
  m <- length(x)
  x <- append(x, double(n))
  for(i in (m+1):length(x)){
    x.prime <- x[(i-1)] + rnorm(1, sd = sigma)
    u <- exp((x[(i-1)]-x.prime))
    u
    if(runif(1) < u && x.prime > 0)
      x[i] <- x.prime
    else
      x[i] <- x[(i-1)]
  }
  return(x)
}

Example: Markov chain basics

trial0 <- ind.chain(1, 500, 1)
trial1 <- ind.chain(1, 500, 2)
trial2 <- ind.chain(1, 500, 1/2)
rw1 <- rw.chain(1, 500, .2)
rw2 <- rw.chain(1, 500, 1)
rw3 <- rw.chain(1, 500, 5)