Agenda

  • Importance Sampling
  • Metropolis Hastings Algorithm
  • Gibbs Sampler

Applied Measure Theory

  • If \(X\) is a random element of some probability space having measure \(P\), the we write \[ E\left\{ g(X) \right\} = \int g(x) P(dx) \] whenever \(g\) is a real-valued function such that the expectation exists.
  • Basically, this is a convenient shorthand that covers discrete, continuous, mixed, and multivariate random variates.
  • Use notation \(E _P \left\{ g(X) \right\}\) to add clarity when necessary.
  • Remember that a probability is a special case of expectation, i.e. \[ P(A) = E _P \left\{ I_A(X) \right\} = \int_A P(dx), \] where \(I_A\) denotes the indicator function of the set \(A\).

Intractable Integrals

  • Consider an integral that is an expectation, say \[ \mu = E\left\{ g(X) \right\} = \int g(x) P(dx) \] where \(X\) is a random variable with probability measure \(P\). We assume this expectation actually exists, i.e. \(\mu < \infty\).
  • Most integrals are impossible to calculate exactly, hence the best we can do is an approximation.
  • Only generally applicable tool for approximating integrals like this is so-called Monte Carlo integration.

SLLN

  • Suppose we can simulate an iid sequence \(X_1, X_2, \dots\) of random variables having the probability measure P. Then \[ Y_i = g(X_i), \quad i = 1,2,\dots \] is an iid sequence of random variables having mean \(\mu\).
  • SLLN says that if \(E|Y| < \infty\) then with probability 1 as \(n \to \infty\) \[ \bar{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i \to \mu . \]

Densities

  • We say a probability measure \(P\) has a density \(f\) with respect to another probability measure \(Q\) if our expectation can be rewritten \[ \begin{aligned} E \left\{ g(X) \right\} & = \int g(x) P(dx) \\ & = \int g(x) f(x) Q(dx) \end{aligned} \] and this holds for all functions \(g\) for which the expectation is defined.

Densities

  • For example, consider the case where both \(P\) and \(Q\) are the probability measures of continuous real-valued random variables.
  • Say \(X\) has measure \(P\) and \(Y\) has measure \(Q\) and \[ E \left\{ g(X) \right\} = \int_{-\infty}^{\infty} g(x) f_X (x) dx = \int g(x) P(dx) \] for any function \(g\) for which the expectation is defined and \[ E \left\{ g(Y) \right\} = \int_{-\infty}^{\infty} g(y) f_Y (y) dy = \int g(x) Q(dx) \] for any function \(g\) for which the expectation is defined, then in our first equation we have \[ f(w) = \frac{f_X(w)}{f_Y(w)}. \]

Importance Sampling

  • One of the most important techniques in Monte Carlo is the importance sampling trick.
  • Uses one distribution to give information about another.
  • Suppose \(P\) has a density \(f\) with respect to \(Q\) and we want to calculate the expectation \[ E \left\{ g(X) \right\} = \int g(x) P(dx) \] by Monte Carlo.
  • If direct simulation from \(P\) is difficult then the relationship \[ \begin{aligned} E \left\{ g(X) \right\} & = \int g(x) P(dx) \\ & = \int g(x) f(x) Q(dx) \end{aligned} \] gives us another way to do it.

Importance Sampling

  • That is, one Monte Carlo approximation of \(E_P \{ g(X) \}\) is \[ \frac{1}{n} \sum_{i=1}^n g(X_i) \quad \quad (A) \] where \(X_1, X_2, \dots\) form an identically \(P\)-distributed sequence obeying the SLLN.
  • Another is \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) f(Y_i) \quad \quad (B) \] where \(Y_1, Y_2, \dots\) form an identically \(Q\)-distributed sequence obeying the SLLN.

Importance Sampling

  • This works because \[ E_P \{ g(X) \} = E_Q \{ g(Y) f(Y) \} \] when \(X\) has measure \(P\) and \(Y\) has measure \(Q\).
  • Just \(f(w)\) rewritten in different notation.
  • As long as we only discuss the SLLN, both estimates are equally good (that is, they both work).
  • But one may be better than the other if we consider the variance of the Monte Carlo estimators.

Importance Sampling

  • In the simplest case using independent sequences, the variance of (A) is \[ \frac{1}{n} \text{Var}_P \{ g(X) \} \quad \quad (VA) \] and the variance of (B) is \[ \frac{1}{n} \text{Var}_Q \{ g(Y) f(Y) \} \quad \quad (VB) \]
  • Two variances will typically not be the same.
  • In the extreme, we might have \[ f(y) \propto \frac{1}{g(y)} \] which would make the random variable \(g(Y )f(Y )\) constant and hence (VB) zero. Then there would be no Monte Carlo error, but only because we knew so much about the problem that we didn’t need to use Monte Carlo.

Importance Sampling

  • Variance reduction spin on importance sampling is perhaps the least interesting aspect of it.
  • Problem with classical importance sampling is that it’s often not worth the trouble.
  • Importance sampling is often used when efficiency considerations go the other way, when (VA) is smaller than (VB).
  • Sometimes the importance sampling estimator (B) is just easier to do, i.e. simulation from \(Q\) is easy.

Example

  • Consider the parametric family \[ \mathcal{P} = \left\{ P_{\theta} : \theta \in \Theta \right\} \] of probability measures (a statistical model) and we want to calculate \[ \mu (\theta) = E_{\theta} \{ g(X) \} = \int g(x) P_{\theta} (dx) \] for all \(\theta \in \Theta\).
  • Naive way to calculate this by Monte Carlo is by a different simulation \(X_1, X_2, \dots\) identically \(P_{\theta}\)-distributed for each \(\theta\) and use the estimator (A) to estimate \(\mu (\theta)\).
  • Trouble with the naive idea is that it takes an infinite amount of time to do an infinite number of simulations.
  • Anyone implementing the naive idea will only actually do a finite number of simulations at a grid of points in \(\Theta\), but still costly especially if \(\Theta\) is not one-dimensional.

Example

  • Suppose all of the measures in the model are dominated by some probability measure \(Q\), say.
  • Then \[ \mu (\theta) = \int g(x) P_{\theta} (dx) = \int g(y) f_{\theta} (y) Q (dy) \] for all functions \(g\) for which the expectation exists.
  • Then if \(Y_1, Y_2, \dots\) form an identically \(Q\)-distributed sequence obeying the SLLN \[ \bar{\mu}_n (\theta) = \frac{1}{n} \sum_{i=1}^n g(Y_i) f_{\theta} (Y_i) \] is a Monte Carlo estimate of \(\mu (\theta)\).

Example

  • We get an estimate for all \(\theta \in \Theta\) with just one Monte Carlo sample!
  • Efficient! But not in terms of the the usual importance sampling notion.
  • The variance of \(\bar{\mu}_n (\theta)\) will vary with \(\theta\) and may be very bad for some \(\theta\).
  • Idea here is not to have the optimal scheme for calculating any one expectation, but a scheme that does an acceptable job of calculating many expectations.

Normalized Importance Sampling

  • Recall \[ f(w) = \frac{f_X(w)}{f_Y(w)}. \]
  • In practice, often know \(f\) only up to a ratio of normalizing constants. That is, if \(h_X\) and \(h_Y\) are nonnegative functions such that \[ f(w) = \frac{h_X(w) / c_X}{h_Y(w) / c_Y} \] where \[ c_X = \int h_X(w) dw \text{ and } c_Y = \int h_Y(w) dw . \]
  • We know \(h_X\) and \(h_Y\), but not \(c_X\) and \(c_Y\).
  • Unnormalized densities slightly complicate importance sampling. The previous formula doesn’t work. However, a slight variant does.

Normalized Importance Sampling

  • Notice that \[ E_P \{ g(X) \} = E_Q \left\{ \frac{g(Y) f_X (Y)}{f_Y (Y)} \right\} = \frac{c_Y}{c_X} E_Q \left\{ \frac{g(Y) h_X (Y)}{h_Y (Y)} \right\} = \frac{c_Y}{c_X} E_Q \left\{ g(Y) w(Y) \right\} \] where the unnormalized importance weights are \[ w(y) = \frac{h_X (y)}{h_Y (y)}. \]
  • Hence if \(Y_1, Y_2, \dots Y_n\) form an identically \(Q\)-distributed sequence obeying the SLLN \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) w(Y_i) \to E_Q \left\{ g(Y) w(Y) \right\} \] with probability 1 as \(n \to \infty\).

Normalized Importance Sampling

  • Now \[ E_Q \left\{ w(Y) \right\} = E_Q \left\{ \frac{h_X (Y)}{h_Y (Y)} \right\} = \frac{c_X}{c_Y} E_Q \left\{ \frac{f_X (Y)}{f_Y (Y)} \right\} = \frac{c_X}{c_Y} . \]
  • Thus, if \(Y_1, Y_2, \dots Y_n\) form an identically \(Q\)-distributed sequence obeying the SLLN \[ \frac{1}{n} \sum_{i=1}^n w(Y_i) = \frac{1}{n} \sum_{i=1}^n \frac{h_X (Y_i)}{h_Y (Y_i)} \to \frac{c_X}{c_Y} \] with probability 1 as \(n \to \infty\). Let \[ w^* (y) = \frac{w(y)}{\sum_{i=1}^n w(y_i)} \] which are called the normalized importance weights.

Normalized Importance Sampling

  • Then if \(Y_1, Y_2, \dots Y_n\) form an identically \(Q\)-distributed sequence obeying the SLLN \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) w^*(Y_i) = \frac{\frac{1}{n} \sum_{i=1}^n g(Y_i) w(Y_i)}{\frac{1}{n} \sum_{i=1}^n w(Y_i)} \to \frac{c_Y}{c_X} E_Q \left\{ g(Y) w(Y) \right\} = E_P \{ g(X) \} \] with probability 1 as \(n \to \infty\).

  • Note that \(w^*\) is a function having the properties of a probability distribution on the sample \(Y_1, \dots ,Y_n\) \[ w^*(Y_i) \ge 0, \; i = 1, \dots, n \text{ and } \sum_{i=1}^n w^*(Y_i) = 1 . \]

Normalized Importance Sampling

  • Suppose that the function \(g\) for which we are calculating a Monte Carlo expectation is an indicator function (so the expectation is a probability), hence we write \[ P(A) = E_P \left\{ I_A(X) \right\} = \int_A P(dx) \] and the Monte Carlo estimate is \[ \bar{P}_n (A) = \sum_{i=1}^n I_A (Y_i) w^* (Y_i). \]
  • Now our intuition about probabilities says these should satisfy the complement rule \[ \bar{P}_n (A^C) = 1 - \bar{P}_n (A) \] and they do, but only because we are using normalized importance sampling.

Example

  • Suppose, as is often the case, that the densities \(f_{\theta}\) defined previously are known only up to a constant of proportionality, that is, we know functions \(h_{\theta}\) such that \[ f_{\theta} (x) \propto h_{\theta} (x) \] but we do not know the constant of proportionality.
  • Of course, its value is determined by the requirement that \(f_{\theta}\) integrate to one. Hence \[ f_{\theta} (x) = \frac{h_{\theta} (x)}{c(\theta)} \] where \[ c(\theta) = \int h_{\theta} (x) Q(dx). \]
  • But we may not know how to do this integral, so we don’t know the “normalizing constant” \(c(\theta)\) in any practical sense.

Normalized Importance Sampling

  • Useful to have terminology to describe this situation. We say that \[ \mathcal{H} = \left\{ h_{\theta} : \theta \in \Theta \right\} \] is a family of unnormalized probability densities with respect to \(Q\).
  • Then \(c(\theta)\) defines the normalizing function of the family, and \(f_{\theta} (x)\) defines the corresponding normalized densities \(f_{\theta} (x)\).

Normalized Importance Sampling

  • The previous SLLN argument for the validity as a Monte Carlo estimate says with probability 1, as \(n \to \infty\) \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) f_{\theta} (Y_i) \to \mu (\theta). \]

Normalized Importance Sampling

  • In the current setting we get, as \(n \to \infty\) \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) \frac{h_{\theta} (Y_i)}{c(\theta)} \to \mu (\theta) \] with probability 1, or \[ \frac{1}{n} \sum_{i=1}^n g(Y_i) h_{\theta} (Y_i) \to c(\theta) \mu (\theta). \]
  • In the special case with \(g=1\), because \(E_{\theta} (1) = 1\), we have \[ \frac{1}{n} \sum_{i=1}^n h_{\theta} (Y_i) \to c(\theta) \] with probability 1 as \(n \to \infty\).

Normalized Importance Sampling

  • Combining the previous two equations gives \[ \frac{\frac{1}{n} \sum_{i=1}^n g(Y_i) h_{\theta} (Y_i)}{\frac{1}{n} \sum_{i=1}^n h_{\theta} (Y_i)} \to \mu (\theta) \]
  • The left hand side is the desired Monte Carlo estimator of \(\mu(\theta)\) in the case where densities are unnormalized.
  • The \(n\) numbers \(h_{\theta} (Y_i)\) are the unnormalized importance weights.
  • When divided by their sum, they are the normalized importance weights \[ w_{\theta}(y) = \frac{h_{\theta} (y)}{\sum_{i=1}^n h_{\theta} (Y_i)} . \]
  • Then the left hand side can be written as a weighted average \[ \bar{\mu}_n (\theta) = \frac{\frac{1}{n} \sum_{i=1}^n g(Y_i) h_{\theta} (Y_i)}{\frac{1}{n} \sum_{i=1}^n h_{\theta} (Y_i)} = \sum_{i=1}^n g(Y_i) w_{\theta}(Y_i). \]

Markov chain Monte Carlo

  • MCMC methods are used most often in Bayesian inference where the equilibrium (invariant, stationary) distribution is a posterior distribution.
  • Challenge lies in construction of a suitable Markov chain with \(f\) as its stationary distribution.
  • Two primary algorithms to construct Markov chains in MCMC, Metropolis Hastings and Gibbs.

Metropolis-Hastings algorithm

Setting \(X_0 = x_0\) (somehow), the Metropolis-Hastings algorithm generates \(X_{t+1}\) given \(X_t = x_t\) as follows:

  1. Sample a candidate value \(X^* \sim g(\cdot | x_t)\) where \(g\) is the proposal distribution
  2. Compute the MH ratio \(R(x_t, X^*)\), where \[ R(x_t, X^*) = \frac{f(x^*) g (x_t | x^*)}{f(x_t) g (x^* | x_t)} \]
  3. Set \[ X_{t+1} = \begin{cases} x^* \mbox{ w.p.\ } \min\{ R(x_t, X^*), 1\} \\ x_t \mbox{ otherwise} \end{cases} \]

Metropolis-Hastings algorithm

  • Irreducibility and aperiodicity depend on the choice of \(g\), these must be checked
  • Performance (finite sample) depends on the choice of \(g\) also, be careful

Independence chains

  • Suppose \(g (x^* | x_t) = g (x^*)\), this yields an independence chain since the proposal does not depend on the current state
  • In this case, the MH ratio is
    \[ R(x_t, X^*) = \frac{f(x^*) g (x_t)}{f(x_t) g (x^*)}, \] and the resulting Markov chain will be irreducible and aperiodic if \(g > 0\) where \(f>0\)

  • A good envelope function \(g\) should resemble \(f\), but should cover \(f\) in the tails

Random walk chains

  • Generate \(X^*\) such that \(\epsilon\sim h(\cdot)\) and set \(X^* = X_t + \epsilon\), then \(g(x^* | x_t) = h(x^* - x_t)\)
  • Common choices of \(h(\cdot)\) are symmetric zero mean random variables with a scale parameter, e.g. a Uniform(\(-a,a\)), Normal(\(0, \sigma^2\)), \(c*T_{\nu}, \dots\)

  • For symmetric zero mean random variables, the MH ratio is
    \[ R(x_t, X^*) = \frac{f(x^*)}{f(x_t)} \]
  • If the support of \(f\) is connected and \(h\) is positive in a neighborhood of 0, then the chain is irreducible and aperiodic.

Example: Markov chain basics

Exercise: Suppose \(f \sim Exp(1)\)

  1. Write an independence MH sampler with \(g \sim Exp(\theta)\)
  2. Show \(R(x_t, X^*) = \exp \left\{ (x_t - x^*)(1-\theta) \right\}\)
  3. Generate 1000 draws from \(f\) with \(\theta \in \{ 1/2, 1, 2 \}\)
  4. Write a random walk MH sampler with \(h \sim N(0, \sigma^2)\)
  5. Show \(R(x_t, X^*) = \exp \left\{ x_t - x^* \right \} I(x^* > 0)\)
  6. Generate 1000 draws from \(f\) with \(\sigma \in \{ .2, 1, 5 \}\)
  7. In general, do you prefer an independence chain or a random walk MH sampler? Why?
  8. Implement the fixed-width stopping rule for you preferred chain

Example: Markov chain basics

Independence Metropolis sampler with Exp(\(\theta\)) proposal

ind.chain <- function(x, n, theta = 1) {
  ## if theta = 1, then this is an iid sampler
  m <- length(x)
  x <- append(x, double(n))
  for(i in (m+1):length(x)){
    x.prime <- rexp(1, rate=theta)
    u <- exp((x[(i-1)]-x.prime)*(1-theta))
    if(runif(1) < u)
      x[i] <- x.prime
    else
      x[i] <- x[(i-1)]
  }
  return(x)
}

Example: Markov chain basics

Random Walk Metropolis sampler with N(\(0,\sigma\)) proposal

rw.chain <- function(x, n, sigma = 1) {
  m <- length(x)
  x <- append(x, double(n))
  for(i in (m+1):length(x)){
    x.prime <- x[(i-1)] + rnorm(1, sd = sigma)
    u <- exp((x[(i-1)]-x.prime))
    u
    if(runif(1) < u && x.prime > 0)
      x[i] <- x.prime
    else
      x[i] <- x[(i-1)]
  }
  return(x)
}

Example: Markov chain basics

trial0 <- ind.chain(1, 500, 1)
trial1 <- ind.chain(1, 500, 2)
trial2 <- ind.chain(1, 500, 1/2)
rw1 <- rw.chain(1, 500, .2)
rw2 <- rw.chain(1, 500, 1)
rw3 <- rw.chain(1, 500, 5)

Example: Markov chain basics

Gibbs Sampling

  1. Select starting values \(x_0\) and set \(t=0\)
  2. Generate in turn (deterministic scan Gibbs sampler)
    • \(x^{(1)}_{t+1} \sim f( x^{(1)} | x^{(-1)}_t)\)
    • \(x^{(2)}_{t+1} \sim f( x^{(2)} | x^{(1)}_{t+1}, x^{(3)}_t, \dots, x^{(p)}_t)\)
    • \(x^{(3)}_{t+1} \sim f( x^{(3)} | x^{(1)}_{t+1}, x^{(2)}_{t+1}, x^{(4)}_t, \dots, x^{(p)}_t)\)
    • \(x^{(p)}_{t+1} \sim f( x^{(p)} | x^{(-p)}_{t+1})\)
  3. Increment \(t\) and go to Step 2

Gibbs Sampling

  • Common to have one or more components not available in closed form
  • Then one can just use a MH sampler for those components known as a Metropolis within Gibbs or Hybrid Gibbs sampling
  • Common to ``block'' groups of random variables

Example: Capture-recapture

  • Data from a fur seal pup capture-recapture study for \(i=7\) census attempts
  • Goal is to estimate the number of pups in a fur seal colony using a capture-recapture study
Count Parameter 1 2 3 4 5 6 7
Captured \(c_i\) 30 22 29 26 31 32 35
Newly Caught \(m_i\) 30 8 17 7 9 8 5

Example: Capture-recapture

  • Let \(N\) be the population size, \(I\) be the number of census attempts where \(c_i\) were captured (\(I=7\) in our case), and \(r\) be the total number captured (\(r = \sum_{i=1}^I m_i = 84\))

  • We consider a separate unknown capture probability for each census \((\alpha_1, \dots, \alpha_I)\) where the animals are equally ``catchable''
  • Then \[ L(N,\alpha | c,r) \propto \frac{N!}{(N-r)!} \prod_{i=1}^{I} \alpha_i ^{c_i} (1-\alpha_i) ^ {N-c_i} \]

Example: Capture-recapture

  • Assume \(N\) and \(\alpha\) are apriori independent with \[ f(N) \propto 1 \mbox{ and } f(\alpha_i | \theta_1, \theta_2) \stackrel{i.i.d.}{\sim} \mbox{Beta} (\theta_1, \theta_2) \]
  • We use \(\theta_1 = \theta_2 = 1/2\), which is the Jeffrey's Prior
  • The resulting posterior is proper when \(I>2\) and recommended when \(I>5\)

Example: Capture-recapture

  • Then it is easy to show the posterior is \[ f(N,\alpha | c,r) \propto \frac{N!}{(N-r)!} \prod_{i=1}^{I} \alpha_i ^{c_i} (1-\alpha_i) ^ {N-c_i} \prod_{i=1}^{I} \alpha_i^{-1/2} (1-\alpha_i)^{-1/2} . \]
  • Further, one can show \[ \begin{aligned} N - 84 | \alpha & \sim \mbox{NB} \left( 85, 1 - \prod_{i=1}^{I} (1-\alpha_i) \right) \mbox{ and }\\ \alpha_i | N & \sim \mbox{Beta} \left( c_i + 1/2 , N -c_i + 1/2 \right) \mbox{ for all }i \end{aligned} \]

Example: Capture-recapture

Then we can consider the chain \[ \left(N , \alpha\right) \rightarrow \left(N' , \alpha\right) \rightarrow \left(N' , \alpha'\right) \] or \[ \left(N , \alpha\right) \rightarrow \left(N , \alpha'\right) \rightarrow \left(N' , \alpha'\right) , \] where both involve a ``block'' update of \(\alpha\)

Example: Capture-recapture

First, we can write the data into R

captured <- c(30, 22, 29, 26, 31, 32, 35)
new.captures <- c(30, 8, 17, 7, 9, 8, 5)
total.r <- sum(new.captures)

Example: Capture-recapture

The following R code implements the Gibbs sampler

gibbs.chain <- function(n, N.start = 94, alpha.start = rep(.5,7)) {
    output <- matrix(0, nrow=n, ncol=8)
    for(i in 1:n){
        neg.binom.prob <- 1 - prod(1-alpha.start)
        N.new <- rnbinom(1, 85, neg.binom.prob) + total.r
        beta1 <- captured + .5
        beta2 <- N.new - captured + .5
        alpha.new <- rbeta(7, beta1, beta2)
        output[i,] <- c(N.new, alpha.new)
        N.start <- N.new
        alpha.start <- alpha.new    
    }
    return(output)
}

MCMC output analysis

How can we tell if the chain is mixing well?

  • Trace plots or time-series plots
  • Autocorrelation plots
  • Plot of estimate versus Markov chain sample size
  • Effective sample size (ESS) \[ \text{ESS} (n) = \frac{n}{1+2 \sum_{k=1}^{\infty} \rho_{k}(g)}, \] where \(\rho_{k}(g)\) is the autocorrelation of lag \(k\) for \(g\)
  • Alternative, ESS can be written as \[ \text{ESS} (n) = \frac{n}{\sigma ^2 / \text{Var} g} \] where \(\sigma ^2\) is the asymptotic variance from a Markov chain CLT

Example: Capture-recapture

Then we consider some preliminary simulations to ensure the chain is mixing well

trial <- gibbs.chain(1000)
plot.ts(trial[,1], main = "Trace Plot for N")
for(i in 1:7){
    plot.ts(trial[,(i+1)], main = paste("Alpha", i))
    }

acf(trial[,1], main = "Lag Plot for N")
for(i in 1:7){
    acf(trial[,(i+1)], main = paste("Lag Alpha", i))
    }

Example: Capture-recapture

Example: Capture-recapture

Example: Capture-recapture

Now for a more complete simulation to estimate posterior means and a 90% Bayesian credible region

sim <- gibbs.chain(10000)
N <- sim[,1]
alpha1 <- sim[,2]

Example: Capture-recapture

par(mfrow=c(1,2))
hist(N, freq=F, main="Estimated Marginal Posterior for N")
hist(alpha1, freq=F, main ="Estimating Marginal Posterior for Alpha 1")