parallelMCMCcombine Software Tutorial

Agenda

Overview of parallelMCMCcombine Package
Package
Example
Evaluation

Overview

parallelMCMCcombine provides methods for combining independent subset MCMC posterior samples to estimate a posterior density given the full data set.
Written by Alexey Miroshnikov(UCLA Math) and Erin Conlon(UMASS Math&Stat)
Useful when performing Bayesian MCMC analyses on larger data sets.
Flexible with four different subset-based parallel communication-free MCMC methods.

Overview

Effective for Bayesian models including logistic regression models, Guassian mixture models, hiearchical models.
Best suited to models with unknown parameters of fixed dimension in continuous parameter spaces.
Also provides options to remove potential correlations between MCMC samples from different machines.

Motivation

Big data sets are too large and complex for classic analysis tools to be used.
The restriction on file sizes that can be read into computer memory (RAM).
It may be necessary to store and process data sets on more than one machine due to their large sizes.

Methods

Bayesian and Markov chain Monte Carlo (MCMC) methods have been developed to address these issues.
One approach partitions large data sets into smaller subsets, and parallelizes the MCMC computation by analyzing the subsets on separate machines (Langford et al, 2009).
Alternative methods have been introduced that do not require communication between machines (Neiswanger et al, 2014)

Methods

Neiswange(2014) introduced several kernel density estimators that approximate the posterior density for each data subset; the full data posterior is then estimated by multiplying the subset posterior densities together.
Scott(2013) developed methods that combine subset posteriors into the approximated full data posterior using weighted averages over the subset MCMC samples.

Methods

Communication-free parallel computing for MCMC methods involved copying the full data set to each machine and computing independent, parallel Markov chains on each machine (Wilkinson et al, 2006)
Integrated Nested Laplace Approximation (INLA) by Rue et al, 2009.
Importance resampling methods by Huang and Gelman, 2005.

Methods in the Package

"parallelMCMCcombine" package contains four methods:

Average of subposterior samples method
Consensus Monte Carlo method, assuming independence of individual model parameters
Consensus Monte Carlo method, assuming covariance among model parameters
Semiparametric density product estimator method

Base Setting

\(\theta\) is the d-dimension unknown parameter. y is the full data set.
M machines, T iterations.
The subposterior draws of \(\theta_{t}^{m}=(\theta_{t1}^{m},\theta_{t2}^{m},...,\theta_{td}^{m})\) for machine m and MCMC interation t from non-overlapping subset \(y_{m}\).
Sampling outside the "parallelMCMCcombine" package.

Average of subposterior samples

\[\theta_{t}=\frac{1}{M}\sum_{m=1}^{M}\theta_{t}^{m}\] - The covariance between individual model parameters is assumed to be zero

Function

sampleAvg(subchain=NULL, shuff=FALSE)

subchain: Input full data, an array with dimensions=(d, T, M).
shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

Consensus Monte Carlo method (assuming independence)

Assuming independence of individual model parameters

\[\theta_{ti}=\frac{\sum_{m=1}^{M}W_{mi}\theta_{ti}^{m}}{\sum_{m=1}^{M}W_{mi}}\] - \(W_{mi}\) are the weights defined by \((\Sigma_{i}^{m})^{-1}\), the inverse of the estimated sample variance of the T MCMC samples \(\Sigma_{i}^{m}=Var(\theta_i|y_m)\).

The covariance between individual model parameters is assumed to be zero.

Function

consensusMCindep(subchain=NULL, shuff=FALSE)

subchain: Input full data, an array with dimensions=(d, T, M).
shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

Consensus Monte Carlo method (assuming covariance)

Assuming model parameters are correlated.

\[\theta_{ti}=(\sum_{m=1}^{M}W_{m})^{-1}(\sum_{m=1}^{M}W_{m}\theta_{t}^{m})\] - \(W_{m}=\Sigma_{m}^{-1}\), where \(\Sigma_{m}=Var(\theta|y_m)\) is the variance-covariance matrix for the d unknown model parameters. It is estimated by the sample variance-covariance matrix of the T MCMC subposterior samples.

Function

consensusMCcov(subchain=NULL, shuff=FALSE)

subchain: Input full data, an array with dimensions=(d, T, M).
shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

Semiparametric density product estimator method

Subposterior densities for each data subset are estimated using kernel smoothing techniques.
The subposterior densities are then multiplied together to approximate the posterior density based on the full data set.

Function

semiparamDPE(subchain=NULL, bandw=rep(1.0, dim(subchain) [1]), anneal=TRUE, shuff=FALSE)

subchain: Input full data, an array with dimensions=(d, T, M).
bandw: the vector of bandwidths h of length d to be specified by the user.
anneal: a logical value. If TRUE, the bandwidth bandw (instead of being fixed) is decreased for each iteration of the algorithm.
shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

A metric for comparing densities

\(d_2(p,\hat{p})\) is the \(L_2\) distance between the full data posterior \(p\) and the combined estimated posterior \(\hat{p}\)

\[d_2(p,\hat{p})=||p=\hat{p}||_{L_2}=\bigg(\int_{R^d}(p(\theta-\hat{p}(\theta))^2d\theta\bigg)^{1/2}\]

Example: Bayesian logistic regression model

We simulated \(100,000\) observations from a logistic regression model with five covariates:

The sample size of 100,000 was chosen so that a full data analysis was still feasible.
The covariates \(X\) and model parameters \(\beta\) were simulated from standard normal distributions.
The resulting simulated values of were \[\beta=(0.47, -1.70, 0.54, -0.90, 0.86)^T\] which are the parameters estimated in our analysis.

Example: Bayesian logistic regression model

The Bayesian logistic regression model with five covariates:

\[y_i|p_i\sim Bernoulli(p_i),i=1,...,n\] \[logit(p_i)=\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_4x_{i4}+\beta_5x_{i5}\] where

\[logit(p_i)=\log(\frac{p_i}{1-p_i})\] - We assigned uninformative priors to the \(\beta\) parameters.

Example: Bayesian logistic regression model

The outcome values \(y_i\), \(i=1,...,100,000\) were then simulated from the following:

\[y_i\sim Bernoulli(p_i)\]

\[p_{i}=\frac{\exp(X_i\beta)}{1+\exp(X_i\beta)}\] where \(X_i\) denotes the ith row of \(X\)

Example: Bayesian logistic regression model

The full data set was divided into 10 subsets of size \(10,000\) each.
We sampled \(50,000\) iterations after burnin of \(2,000\) iterations for each(in R).
We also implemented the Bayesian model for the full data set for the model parameters for comparison, again for \(50,000\) iterations each after burnin of \(2,000\) iterations.

Codes

library(parallelMCMCcombine)
library(MCMCpack)

Codes

ptm<-proc.time()
d=5
#logistic.sampleavg.combine <- sampleAvg()
sampT  <- 50000  # number of subset posterior samples
M      <- 10    # total number of subsets
simu <- 100000/M
## simulate Gaussian subposterior samples

theta <- array(NA,c(d,sampT,M)) 

beta.vec=c(0.47,-1.7,0.54,-0.9,0.86)
set.seed(1)

for(s in 1:M){
set.seed(s)
y.vec=replicate(simu,0)
p.vec=replicate(simu,0)
x.m=matrix(replicate(simu*d,0),nrow=simu,ncol=d)

beta.vec=c(0.47,-1.7,0.54,-0.9,0.86)
for(i in 1:simu){
  x.m[i,]=rnorm(d)
  e.t=exp(t(beta.vec)%*%x.m[i,])
  p.vec[i]=e.t/(e.t+1)
  y.vec[i]=rbinom(1,1,p.vec[i])
}

Codes

x.1=x.m[,1]
x.2=x.m[,2]
x.3=x.m[,3]
x.4=x.m[,4]
x.5=x.m[,5]
#logistic mcmc
logmcmc = MCMClogit(y.vec~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc[,1:5]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
  for(jj in 1:d){
    tem[ii,jj]=a[ii,jj]
    theta[jj,ii,s]=tem[ii,jj]
  }
}
if(s==1){
  x.whole=x.m
  y.whole=y.vec
}else{
  x.whole=rbind(x.whole,x.m)
  y.whole=c(y.whole,y.vec)
}
}
proc.time()-ptm

Codes

#full data analysis
ptm<-proc.time()
x.1=x.whole[,1]
x.2=x.whole[,2]
x.3=x.whole[,3]
x.4=x.whole[,4]
x.5=x.whole[,5]
logmcmc = MCMClogit(y.whole~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc[,1:d]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
  for(jj in 1:d){
    tem[ii,jj]=a[ii,jj]
  }
}
tem.full=tem
proc.time()-ptm

Codes

#10 subsets methods comparisons
#graph 1
d.full <- density(tem.full[,1]) # returns the density data 
plot(d.full,xlim=c(0.35,0.6),ylim=c(0,45),xlab="parameter",main="") # plots the results
for(i in 1:M){
  temp=x.whole[((1+(i-1)*simu):(i*simu)),]
  y.temp=y.whole[((1+(i-1)*simu):(i*simu))]
  x.1=temp[,1]
x.2=temp[,2]
x.3=temp[,3]
x.4=temp[,4]
x.5=temp[,5]
logmcmc.temp = MCMClogit(y.temp~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc.temp[,1:d]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
  for(jj in 1:d){
    tem[ii,jj]=a[ii,jj]
  }
}
d.temp <- density(tem[,1])
lines(d.temp$x,d.temp$y,col=i+1,lty=i+1)
}

Codes

#graph 2
avg.theta <- sampleAvg(subchain=theta, shuff=TRUE)
d.avg <- density(avg.theta[1,]) # returns the density data 
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.avg$x,d.avg$y,lty=2,col="red")
#graph 3
cin.theta <- consensusMCindep(subchain=theta, shuff=TRUE)
d.cin <- density(cin.theta[1,]) # returns the density data 
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.cin$x,d.cin$y,lty=2,col="red")
#graph 4
ccov.theta <- consensusMCcov(subchain=theta, shuff=TRUE)
d.cov <- density(ccov.theta[1,]) # returns the density data 
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.cov$x,d.cov$y,lty=2,col="red")
#graph 5
semi.anneal.theta <- semiparamDPE(theta, bandw = rep(1.0, dim(theta)[1]), 
                                  anneal = TRUE,shuff=TRUE)
d.semi.anneal <- density(semi.anneal.theta[1,]) # returns the density data 
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.semi.anneal$x,d.semi.anneal$y,lty=2,col="red")
#graph 6
semi.noanneal.theta <- semiparamDPE(theta, bandw = rep(1.0, dim(theta)[1]), 
                                    anneal = FALSE,shuff=TRUE)
d.semi.noanneal <- density(semi.noanneal.theta[1,]) # returns the density data 
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.semi.noanneal$x,d.semi.noanneal$y,lty=2,col="red")

Results

The estimated combined posterior density and full data posterior density for \(\beta_1\)
Different numbers of subsets: 5,10,20 and 50.

Results: 5 subsets

Results: 10 subsets

Results: 20 subsets

Results: 50 subsets

Codes

#distance between estimated and true density
d0=d.full
vec=list(d.avg,d.cin,d.cov,d.semi.anneal,d.semi.noanneal)
for(k in 1:5){
d1=vec[[k]]

f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)


ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))


i <- seq(min(ovrng), max(ovrng), length.out=500)


h <- f0(i)-f1(i)


area<-sum( (h[-1]+h[-length(h)]) /2 *diff(i) *(h[-1]>=0+0))
print(round(area,4))
}

Results: Distance

Estimated relative \(L_2\) distance for \(\beta_1\) parameter


Method	5 subsets	10 subsets	20 subsets	50 subsets
Sample Average	0.034	0.0764	0.1169	0.194
Consensus MC Independence	0.0144	0.0384	0.0539	0.1429
Consensus MC Covariance	0.028	0.0188	0.0219	0.0449
Semiparametric anneal=TRUE	0.0308	0.0208	0.0238	0.1004
Semiparametric anneal=FALSE	0.0264	0.0194	0.0217	0.0574

Results: Computing Time

Computational times in seconds (rounded unless than 1 second).


Method	5 subsets	10 subsets	20 subsets	50 subsets
Sample Average	0.07	0.08	0.12	0.33
Consensus MC Independence	2	2	2	3
Consensus MC Covariance	2	3	5	16
Semiparametric	411	803	1511	3890

Overall Evaluation

Advantages:

Help investigators explore the various methods for specific applications
Dealing well with large datasets
Improve the analysis speed with parallel computing

Drawbacks:

Some methods are not working well with special priors(spike and slab)
When the number of parameters is large, the computing time is also huge.

Reference

Package description:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0108425

MCMClogit:

https://www.rdocumentation.org/packages/MCMCpack/versions/1.4-2/topics/MCMClogit