## Agenda

• Overview of parallelMCMCcombine Package
• Package
• Example
• Evaluation

## Overview

• parallelMCMCcombine provides methods for combining independent subset MCMC posterior samples to estimate a posterior density given the full data set.
• Written by Alexey Miroshnikov(UCLA Math) and Erin Conlon(UMASS Math&Stat)
• Useful when performing Bayesian MCMC analyses on larger data sets.
• Flexible with four different subset-based parallel communication-free MCMC methods.

## Overview

• Effective for Bayesian models including logistic regression models, Guassian mixture models, hiearchical models.
• Best suited to models with unknown parameters of fixed dimension in continuous parameter spaces.
• Also provides options to remove potential correlations between MCMC samples from different machines.

## Motivation

• Big data sets are too large and complex for classic analysis tools to be used.
• The restriction on file sizes that can be read into computer memory (RAM).
• It may be necessary to store and process data sets on more than one machine due to their large sizes.

## Methods

• Bayesian and Markov chain Monte Carlo (MCMC) methods have been developed to address these issues.
• One approach partitions large data sets into smaller subsets, and parallelizes the MCMC computation by analyzing the subsets on separate machines (Langford et al, 2009).
• Alternative methods have been introduced that do not require communication between machines (Neiswanger et al, 2014)

## Methods

• Neiswange(2014) introduced several kernel density estimators that approximate the posterior density for each data subset; the full data posterior is then estimated by multiplying the subset posterior densities together.
• Scott(2013) developed methods that combine subset posteriors into the approximated full data posterior using weighted averages over the subset MCMC samples.

## Methods

• Communication-free parallel computing for MCMC methods involved copying the full data set to each machine and computing independent, parallel Markov chains on each machine (Wilkinson et al, 2006)
• Integrated Nested Laplace Approximation (INLA) by Rue et al, 2009.
• Importance resampling methods by Huang and Gelman, 2005.

## Methods in the Package

"parallelMCMCcombine" package contains four methods:

• Average of subposterior samples method
• Consensus Monte Carlo method, assuming independence of individual model parameters
• Consensus Monte Carlo method, assuming covariance among model parameters
• Semiparametric density product estimator method

## Base Setting

• $$\theta$$ is the d-dimension unknown parameter. y is the full data set.
• M machines, T iterations.
• The subposterior draws of $$\theta_{t}^{m}=(\theta_{t1}^{m},\theta_{t2}^{m},...,\theta_{td}^{m})$$ for machine m and MCMC interation t from non-overlapping subset $$y_{m}$$.
• Sampling outside the "parallelMCMCcombine" package.

## Average of subposterior samples

$\theta_{t}=\frac{1}{M}\sum_{m=1}^{M}\theta_{t}^{m}$ - The covariance between individual model parameters is assumed to be zero

## Function

sampleAvg(subchain=NULL, shuff=FALSE)

• subchain: Input full data, an array with dimensions=(d, T, M).
• shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

## Consensus Monte Carlo method (assuming independence)

• Assuming independence of individual model parameters

$\theta_{ti}=\frac{\sum_{m=1}^{M}W_{mi}\theta_{ti}^{m}}{\sum_{m=1}^{M}W_{mi}}$ - $$W_{mi}$$ are the weights defined by $$(\Sigma_{i}^{m})^{-1}$$, the inverse of the estimated sample variance of the T MCMC samples $$\Sigma_{i}^{m}=Var(\theta_i|y_m)$$.

• The covariance between individual model parameters is assumed to be zero.

## Function

consensusMCindep(subchain=NULL, shuff=FALSE)

• subchain: Input full data, an array with dimensions=(d, T, M).
• shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

## Consensus Monte Carlo method (assuming covariance)

• Assuming model parameters are correlated.

$\theta_{ti}=(\sum_{m=1}^{M}W_{m})^{-1}(\sum_{m=1}^{M}W_{m}\theta_{t}^{m})$ - $$W_{m}=\Sigma_{m}^{-1}$$, where $$\Sigma_{m}=Var(\theta|y_m)$$ is the variance-covariance matrix for the d unknown model parameters. It is estimated by the sample variance-covariance matrix of the T MCMC subposterior samples.

## Function

consensusMCcov(subchain=NULL, shuff=FALSE)

• subchain: Input full data, an array with dimensions=(d, T, M).
• shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

## Semiparametric density product estimator method

• Subposterior densities for each data subset are estimated using kernel smoothing techniques.

• The subposterior densities are then multiplied together to approximate the posterior density based on the full data set.

## Function

semiparamDPE(subchain=NULL, bandw=rep(1.0, dim(subchain) [1]), anneal=TRUE, shuff=FALSE)

• subchain: Input full data, an array with dimensions=(d, T, M).

• bandw: the vector of bandwidths h of length d to be specified by the user.

• anneal: a logical value. If TRUE, the bandwidth bandw (instead of being fixed) is decreased for each iteration of the algorithm.

• shuff: A logical value indicating whether the d-dimensional samples within a machine should be randomly permuted. Remove potential correlations between MCMC samples from different machines.

## A metric for comparing densities

• $$d_2(p,\hat{p})$$ is the $$L_2$$ distance between the full data posterior $$p$$ and the combined estimated posterior $$\hat{p}$$

$d_2(p,\hat{p})=||p=\hat{p}||_{L_2}=\bigg(\int_{R^d}(p(\theta-\hat{p}(\theta))^2d\theta\bigg)^{1/2}$

## Example: Bayesian logistic regression model

We simulated $$100,000$$ observations from a logistic regression model with five covariates:

• The sample size of 100,000 was chosen so that a full data analysis was still feasible.

• The covariates $$X$$ and model parameters $$\beta$$ were simulated from standard normal distributions.

• The resulting simulated values of were $\beta=(0.47, -1.70, 0.54, -0.90, 0.86)^T$ which are the parameters estimated in our analysis.

## Example: Bayesian logistic regression model

• The Bayesian logistic regression model with five covariates:

$y_i|p_i\sim Bernoulli(p_i),i=1,...,n$ $logit(p_i)=\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_4x_{i4}+\beta_5x_{i5}$ where

$logit(p_i)=\log(\frac{p_i}{1-p_i})$ - We assigned uninformative priors to the $$\beta$$ parameters.

## Example: Bayesian logistic regression model

• The outcome values $$y_i$$, $$i=1,...,100,000$$ were then simulated from the following:

$y_i\sim Bernoulli(p_i)$

$p_{i}=\frac{\exp(X_i\beta)}{1+\exp(X_i\beta)}$ where $$X_i$$ denotes the ith row of $$X$$

## Example: Bayesian logistic regression model

• The full data set was divided into 10 subsets of size $$10,000$$ each.

• We sampled $$50,000$$ iterations after burnin of $$2,000$$ iterations for each(in R).

• We also implemented the Bayesian model for the full data set for the model parameters for comparison, again for $$50,000$$ iterations each after burnin of $$2,000$$ iterations.

## Codes

library(parallelMCMCcombine)
library(MCMCpack)

## Codes

ptm<-proc.time()
d=5
#logistic.sampleavg.combine <- sampleAvg()
sampT  <- 50000  # number of subset posterior samples
M      <- 10    # total number of subsets
simu <- 100000/M
## simulate Gaussian subposterior samples

theta <- array(NA,c(d,sampT,M))

beta.vec=c(0.47,-1.7,0.54,-0.9,0.86)
set.seed(1)

for(s in 1:M){
set.seed(s)
y.vec=replicate(simu,0)
p.vec=replicate(simu,0)
x.m=matrix(replicate(simu*d,0),nrow=simu,ncol=d)

beta.vec=c(0.47,-1.7,0.54,-0.9,0.86)
for(i in 1:simu){
x.m[i,]=rnorm(d)
e.t=exp(t(beta.vec)%*%x.m[i,])
p.vec[i]=e.t/(e.t+1)
y.vec[i]=rbinom(1,1,p.vec[i])
}

## Codes

x.1=x.m[,1]
x.2=x.m[,2]
x.3=x.m[,3]
x.4=x.m[,4]
x.5=x.m[,5]
#logistic mcmc
logmcmc = MCMClogit(y.vec~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc[,1:5]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
for(jj in 1:d){
tem[ii,jj]=a[ii,jj]
theta[jj,ii,s]=tem[ii,jj]
}
}
if(s==1){
x.whole=x.m
y.whole=y.vec
}else{
x.whole=rbind(x.whole,x.m)
y.whole=c(y.whole,y.vec)
}
}
proc.time()-ptm

## Codes

#full data analysis
ptm<-proc.time()
x.1=x.whole[,1]
x.2=x.whole[,2]
x.3=x.whole[,3]
x.4=x.whole[,4]
x.5=x.whole[,5]
logmcmc = MCMClogit(y.whole~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc[,1:d]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
for(jj in 1:d){
tem[ii,jj]=a[ii,jj]
}
}
tem.full=tem
proc.time()-ptm

## Codes

#10 subsets methods comparisons
#graph 1
d.full <- density(tem.full[,1]) # returns the density data
plot(d.full,xlim=c(0.35,0.6),ylim=c(0,45),xlab="parameter",main="") # plots the results
for(i in 1:M){
temp=x.whole[((1+(i-1)*simu):(i*simu)),]
y.temp=y.whole[((1+(i-1)*simu):(i*simu))]
x.1=temp[,1]
x.2=temp[,2]
x.3=temp[,3]
x.4=temp[,4]
x.5=temp[,5]
logmcmc.temp = MCMClogit(y.temp~x.1+x.2+x.3+x.4+x.5+0, burnin=2000, mcmc=sampT)
a=logmcmc.temp[,1:d]
tem=matrix(replicate(d*sampT,0),nrow=sampT,ncol=d)
for(ii in 1:sampT){
for(jj in 1:d){
tem[ii,jj]=a[ii,jj]
}
}
d.temp <- density(tem[,1])
lines(d.temp$x,d.temp$y,col=i+1,lty=i+1)
}

## Codes

#graph 2
avg.theta <- sampleAvg(subchain=theta, shuff=TRUE)
d.avg <- density(avg.theta[1,]) # returns the density data
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.avg$x,d.avg$y,lty=2,col="red")
#graph 3
cin.theta <- consensusMCindep(subchain=theta, shuff=TRUE)
d.cin <- density(cin.theta[1,]) # returns the density data
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.cin$x,d.cin$y,lty=2,col="red")
#graph 4
ccov.theta <- consensusMCcov(subchain=theta, shuff=TRUE)
d.cov <- density(ccov.theta[1,]) # returns the density data
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.cov$x,d.cov$y,lty=2,col="red")
#graph 5
semi.anneal.theta <- semiparamDPE(theta, bandw = rep(1.0, dim(theta)[1]),
anneal = TRUE,shuff=TRUE)
d.semi.anneal <- density(semi.anneal.theta[1,]) # returns the density data
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.semi.anneal$x,d.semi.anneal$y,lty=2,col="red")
#graph 6
semi.noanneal.theta <- semiparamDPE(theta, bandw = rep(1.0, dim(theta)[1]),
anneal = FALSE,shuff=TRUE)
d.semi.noanneal <- density(semi.noanneal.theta[1,]) # returns the density data
plot(d.full,xlab="parameter",main="") # plots the results
lines(d.semi.noanneal$x,d.semi.noanneal$y,lty=2,col="red")

## Results

• The estimated combined posterior density and full data posterior density for $$\beta_1$$
• Different numbers of subsets: 5,10,20 and 50.