Agenda

  • Course overview
  • Class objectives
  • First 'presentations' on Thursday!
  • Motivating example
  • Review from undergraduate Mathematical Statistics
  • Software tutorials

STAT203: BAYESIAN STATISTICS

  • Lecture: TR 9:40 - 11:00 Olmsted 1429
  • Discussion: R 1:10 - 2:00 Olmsted 1431 (No Discussion Week 1)
  • Instructor: James M. Flegal, jflegal@ucr.edu
  • Office Hours: R 11:00 - 12:00 Olmsted 1428
  • Zoom Hours: M 9:00 - 10:00 and R 11:10 - 12:00
  • Internet: iLearn STAT 203

Description

This class is an introduction to Bayesian statistics including "subjective probability, Renyi axiom system, Savage axioms, coherence, Bayes theorem, credibility intervals, Lindley paradox, empirical Bayes estimation, natural conjugate priors, de Finetti's theorem, approximation methods, Bayesian bootstrap, Bayesian computer programs".

The class will be taught in the R language.

Course Mechanics and Grading

There will be approximately eight participation/presentation exercises, five homework assignments, and a final exam. Grades will be calculated as follows:

  • Participation and weekly presentations: 30%
  • Software tutorial: 20%
  • Homework: 20%
  • Final exam: 30%

Textbooks

Some R Resources

There are many online resources for learning R and working with it, in addition to the textbooks:

  • The official intro, "An Introduction to R", available online in HTML and PDF
  • John Verzani, "simpleR", in PDF
  • Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."

Collaboration, Copying and Plagiarism

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students, from books, or from websites (1) does nothing to help you learn, (2) is easy to detect, and (3) has serious negative consequences.

Calendar and Topics

Class Objectives

  • Describe Bayesian statistics, where one's inferences about parameters or hypotheses are updated as evidence accumulates. Learn to use Bayes’ rule to transform prior probabilities into posterior probabilities, and introduce the underlying theory and perspective of the Bayesian paradigm.
  • Apply Bayesian methods to several practical problems, showing end-to-end Bayesian analyses that move from framing the question to building models, eliciting prior probabilities, and implementing the final posterior distribution in R.
  • Develop a 'Software Tutorial' …
  • Have fun! Class is an elective!

Weekly Participation

  • I will attempt to have a 'weekly' participation or presentation activity.
  • This will be based on readings, homework, data analysis, and so on. I anticipate some of these will be group activities.
  • Goal is to 'flip' the classroom … "Instructional strategy that reverses the traditional learning environment by delivering instructional content, often online, outside of the classroom. It moves activities, including those that may have traditionally been considered homework, into the classroom."

Weekly Participation

  • Everyone must be prepared to present, but we don't have time for everyone to present every week.
  • Use 'The Wheel' implemented at Amazon Web Services.
  • We will use the sample() function in R and weight away from participants recently chosen.
names <- c("VJ", "Lauren", "Cangao", "Yuting", "Ying", "Huiling", 
           "Samantha", "Jinhui", "Song", "Mi")
count <- rep(1, length(names))              # selection counts, all equal at first
chosen <- sample(names, 1, prob = 1/count)  # weight away from recently chosen names
chosen
## [1] "Lauren"
count[chosen == names] <- count[chosen == names] + 1 # +1 subject to change
count
##  [1] 1 2 1 1 1 1 1 1 1 1

First 'presentations' on Thursday!

  • Assignment - Read Chapter 1 and "Why isn't everyone a Bayesian?" available on iLearn.
  • Everyone is expected to contribute to Thursday's discussion, this will be part of your grade.
  • Be ready to answer questions such as: Which example was most compelling? What do you like or dislike about Bayesian statistics? Are you a Bayesian? What are some lurking complexities? What did you find interesting in reading Chapter 1? What was interesting in Efron's paper?
  • Prepare one question for the class or instructor.

Bayesian Statistics

  • Bayesian statistical analysis is based on the premise that all uncertainty should be modeled using probabilities and that all statistical inferences should be based on the laws of probability.
  • Probability models have long been embraced for data.
  • Given a probability model for the data, Bayesian methods mandate an additional probability model for the unknown parameters in the data model.
  • Uncertainty about parameters is modeled using scientific expert information, called 'prior' information because it is acquired 'a priori', before obtaining the data.

Probability of a Defective

  • Binomial data example from Section 1.1 of Christensen et al.
  • The Par-Aide Corporation in Lino Lakes, Minnesota, makes ball washers for golf courses. St. Paul Brass and Aluminum Foundry makes a part called a “push rod eye,” an integral component of the golf ball washer. Out of 2,430 push rod eyes poured over seven days in May, 2003, only 2,211 actually shipped. It is of interest to estimate the probability of pouring a defective part. The Vice President of Operations thinks that for this particular part a plausible range for the proportion of scrap is 5% to 15%.

Probability of a Defective

  • Assume the production process determines a proportion of defective parts that we call \(\theta\). Treat the available data as a random sample from the production process, so that the probability of any part being scrapped is \(\theta\); further, we assume that all 2,430 parts examined are independent. With these assumptions \[ y | \theta \sim \text{Bin}(2430, \theta). \]
  • The proportion defective, \(\theta\), is unknown. Use probability to reflect the Vice President's knowledge about it. We interpreted the Vice President as specifying \(Pr(\theta \le 0.05) = 0.025\) and \(Pr(\theta \ge 0.15) = 0.025\). Although these statements do not completely determine a probability distribution for \(\theta\), we restricted our attention to the beta probability distribution for \(\theta\). Then we have \[ \theta \sim \text{Beta}(12.05, 116.06). \]
  • This is the prior distribution; a quick numerical check appears below.
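
As a quick sanity check (a sketch; the hyperparameters 12.05 and 116.06 come from the example above), R's pbeta() confirms that this prior puts roughly 2.5% probability in each tail:

pbeta(0.05, 12.05, 116.06)       # Pr(theta <= 0.05), approximately 0.025
1 - pbeta(0.15, 12.05, 116.06)   # Pr(theta >= 0.15), approximately 0.025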

Probability of a Defective

  • A Bayesian analysis uses Bayes Theorem to combine the data with the prior distribution to update the Vice President's probability distribution about \(\theta\).
  • This new probability distribution, called the posterior distribution, describes knowledge about \(\theta\) and is the fundamental tool in Bayesian statistical analysis. Typically, we use computer simulations to approximate the posterior distribution. Occasionally, we can find it mathematically, as in this example.

  • With \(y = 219\), the estimated median of the posterior distribution is 0.09.
  • A 95% equal-tailed probability interval for \(\theta\) is then (0.08, 0.10); the short R computation below reproduces both numbers.
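
Using the conjugate beta-binomial update derived later in these notes, the posterior here is Beta(12.05 + 219, 116.06 + 2211), so:

a <- 12.05 + 219     # prior alpha_1 + number of defectives
b <- 116.06 + 2211   # prior alpha_2 + number of non-defectives
qbeta(0.5, a, b)                # posterior median, approximately 0.09
qbeta(c(0.025, 0.975), a, b)    # 95% equal-tailed interval, approximately (0.08, 0.10)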

Bayesian Inference (Short Version)

  • Suppose you are uncertain about something, which is described by a probability distribution called your prior distribution
  • Suppose you obtain some data relevant to that thing
  • The data changes your uncertainty, which is then described by a new probability distribution called your posterior distribution

  • Posterior distribution reflects the information both in the prior distribution and the data
  • Most of Bayesian inference is about how to go from prior to posterior

Bayesian Inference

  • The way Bayesians go from prior to posterior is to use the laws of conditional probability, sometimes called in this context Bayes rule or Bayes theorem
  • Suppose we have a PDF \(g\) for the prior distribution of the parameter \(\theta\), and suppose we obtain data \(x\) whose conditional PDF given \(\theta\) is \(f\)
  • Then the joint distribution of data and parameters is conditional times marginal \[ f( x | \theta) g(\theta) \]
  • May look strange because most of your training considers the frequentist paradigm
  • Here both \(x\) and \(\theta\) are random variables

Bayesian Inference

  • The correct posterior distribution, according to the Bayesian paradigm, is the conditional distribution of \(\theta\) given \(x\), which is joint divided by marginal \[ h (\theta | x) = \frac{f( x | \theta) g(\theta)}{\int f( x | \theta) g(\theta) d \theta} \]
  • Often we do not need to do the integral: if we recognize that \[ \theta \mapsto f( x | \theta) g(\theta) \] is, except for constants, the PDF of a brand-name distribution, then that distribution must be the posterior

Binomial Data, Beta Prior

Suppose the prior distribution for \(p\) is Beta(\(\alpha_1, \alpha_2\)) and the conditional distribution of \(x\) given \(p\) is Bin(\(n\), \(p\)). Then \[ f(x|p) = {n \choose x} p^x (1-p)^{n-x} \] and \[ g(p) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1}. \] Then \[ f(x|p) g(p) = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] and this, considered as a function of \(p\) for fixed \(x\), is, except for constants, the PDF of a Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.

Binomial Data, Beta Prior

Why? \[ \begin{aligned} h (p | x) & = \frac{f( x | p) g(p)}{\int f( x | p) g(p) d p} \\ & \propto f( x | p) g(p) \\ & = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \\ & \propto p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \end{aligned} \] And there is only one PDF with support \([0,1]\) of that form, namely the Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.
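
This can also be checked numerically in R: the ratio of the unnormalized posterior \(f(x|p)g(p)\) to the Beta(\(x + \alpha_1, n - x + \alpha_2\)) density is constant in \(p\) (the values of \(n\), \(x\), \(\alpha_1\), \(\alpha_2\) below are arbitrary illustrations):

n <- 20; x <- 7; a1 <- 2; a2 <- 3               # arbitrary illustrative values
p <- seq(0.01, 0.99, by = 0.01)
unnorm <- dbinom(x, n, p) * dbeta(p, a1, a2)    # f(x|p) g(p)
range(unnorm / dbeta(p, x + a1, n - x + a2))    # constant ratio across p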

Bayesian Inference

  • In Bayes rule, constants, meaning anything that doesn’t depend on the parameter, are irrelevant
  • We can drop multiplicative constants that do not depend on the parameter from \(f( x | \theta)\) obtaining the likelihood \(L(\theta)\)
  • We can also drop multiplicative constants that do not depend on the parameter from \(g(\theta)\) obtaining the unnormalized prior
  • Multiplying them together gives the unnormalized posterior

\[ \text{likelihood } × \text{ unnormalized prior } = \text{ unnormalized posterior} \]

Bayesian Inference

In our example we could have multiplied likelihood \[ p^x (1-p)^{n-x} \] times unnormalized prior \[ p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1} \] to get unnormalized posterior \[ p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] which, as before, can be recognized as an unnormalized beta PDF.

Bayesian Inference

  • It is convenient to have a name for the parameters of the prior and posterior. If we call them parameters, then we get confused because they play a different role from the parameters of the distribution of the data.
  • The parameters of the distribution of the data, \(p\) in our example, the Bayesian treats as random variables. They are the random variables whose distributions are the prior and posterior.
  • The parameters of the prior, \(\alpha_1\) and \(\alpha_2\) in our example, the Bayesian treats as known constants. They determine the particular prior distribution used for a particular problem. To avoid confusion we call them hyperparameters.

Bayesian Inference

  • Parameters, meaning the parameters of the distribution of the data and the variables of the prior and posterior, are unknown constants. The Bayesian treats them as random variables because probability theory is the correct description of uncertainty.
  • Hyperparameters, meaning the parameters of the prior and posterior, are known constants. The Bayesian treats them as nonrandom variables because there is no uncertainty about their values.
  • In our example, the hyperparameters of the prior are \(\alpha_1\) and \(\alpha_2\), and the hyperparameters of the posterior are \(x + \alpha_1\) and \(n - x + \alpha_2\).

Example: Normal

Suppose \(X_1 , \dots, X_n\) are i.i.d. \(N(\theta , \sigma^2)\) where \(\sigma^2\) is known. Suppose further we have a prior \(\theta \sim N(\mu, \tau^2)\). Then the posterior can be obtained as follows, \[ \begin{aligned} f (\theta | x) & \propto f(\theta) \prod_{i=1}^n f(x_i | \theta) \\ & \propto \exp \left \{ -\frac{1}{2} \left( \frac{(\theta-\mu)^2}{\tau^2} + \frac{\sum_{i=1}^{n} (x_i - \theta)^2}{\sigma^2} \right) \right\} \\ & \propto \exp \left \{ -\frac{1}{2} \frac{\left( \theta - \displaystyle \frac{\mu / \tau^2 + n\bar{x} / \sigma^2}{1/\tau^2 + n/\sigma^2} \right)^2}{\displaystyle \frac{1}{1/\tau^2 + n/\sigma^2}} \right\}. \end{aligned} \]
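
The last step is a completing-the-square calculation: collecting the terms in \(\theta\) gives \[ \frac{(\theta-\mu)^2}{\tau^2} + \frac{\sum_{i=1}^{n} (x_i - \theta)^2}{\sigma^2} = \left( \frac{1}{\tau^2} + \frac{n}{\sigma^2} \right) \theta^2 - 2 \theta \left( \frac{\mu}{\tau^2} + \frac{n \bar{x}}{\sigma^2} \right) + \text{const}, \] and completing the square in \(\theta\), with all terms free of \(\theta\) absorbed into the proportionality constant, yields the final expression.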

Example: Normal

Or \(\theta | x \sim N( \mu_n, \tau_n^2)\) where \[ \mu_n = \left( \frac{\mu}{\tau^2} + \frac{n \bar{x}}{\sigma^2} \right) \tau_n ^2 \quad \mbox{and} \quad \tau_n^2 = \frac{1}{1/\tau^2 + n/\sigma^2} . \] We will call this a conjugate Bayes model. Also note a 95% credible region for \(\theta\) (this is also the HPD, the highest posterior density region) is given by \[ \left( \mu_n - 1.96 \tau_n, \mu_n + 1.96 \tau_n \right) . \] For large \(n\), the data will overwhelm the prior.
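
A minimal R sketch of this update on simulated data (the true mean, \(\sigma\), and the prior hyperparameters below are illustrative assumptions):

set.seed(1)
n <- 50; sigma <- 2                    # known data standard deviation
mu0 <- 0; tau0 <- 10                   # prior: theta ~ N(mu0, tau0^2)
x <- rnorm(n, mean = 1.5, sd = sigma)  # simulated data
tau_n2 <- 1 / (1 / tau0^2 + n / sigma^2)                   # posterior variance
mu_n <- (mu0 / tau0^2 + n * mean(x) / sigma^2) * tau_n2    # posterior mean
c(mu_n - 1.96 * sqrt(tau_n2), mu_n + 1.96 * sqrt(tau_n2))  # 95% credible interval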

Example: Normal

  • If \(f(\theta) \propto 1\), an improper prior, then a 95% credible region for \(\theta\) is the same as a 95% confidence interval since \(f(\theta | x) \sim N( \bar{x}, \sigma^2 / n)\) (try to show this at home).
  • Usually, we specify a prior and likelihood that result in a posterior that is intractable. That is, we can't work with it analytically or even calculate the appropriate normalizing constant \(c\).
  • However, it is often easy to simulate a Markov chain with \(f(\theta|x)\) as its stationary distribution, as the sketch below illustrates.
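
A minimal random-walk Metropolis sketch (the proposal standard deviation and starting value are hypothetical tuning choices), targeting the beta posterior from the defectives example so the draws can be checked against the exact answer:

log_post <- function(theta) {          # log unnormalized posterior
  if (theta <= 0 || theta >= 1) return(-Inf)
  219 * log(theta) + 2211 * log(1 - theta) +
    dbeta(theta, 12.05, 116.06, log = TRUE)
}
set.seed(1)
m <- 10000; draws <- numeric(m); cur <- 0.1
for (i in 1:m) {
  prop <- cur + rnorm(1, sd = 0.01)    # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(cur)) cur <- prop
  draws[i] <- cur
}
quantile(draws, c(0.025, 0.5, 0.975))  # compare with the exact beta posterior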

Applied Bayesian Statistics

  • Bayesian analysis of complex statistical models usually requires Markov chain Monte Carlo (MCMC) methods.
  • Lots of tools available … Each of you will present a 'software tutorial' for a tool of your choice.
  • Presentations will be approximately 30 minutes, preliminarily scheduled for Weeks 6, 8, and 10.
  • Tutorials can cover R or an R package, JAGS, MATLAB, OpenBUGS, or Stan.

Software Tutorials

  • An example tutorial using the mcmc and mcmcse R packages will be presented during Week 5 (potentially Week 4); my tutorial will be made using Markdown. A small preview of these packages appears after this list.
  • Assume the audience has not used the software before, but is obviously a student in this class.
  • Example tutorials for statistical software can be found here.
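
As a small preview (a sketch assuming the mcmcse package is installed; the input is stand-in output rather than real MCMC draws):

library(mcmcse)
out <- rnorm(1e4)   # stand-in for MCMC output
mcse(out)           # Monte Carlo standard error of the mean estimate
ess(out)            # effective sample size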