- Course overview
- Class objectives
- First 'presentations' on Thursday!
- Motivating example
- Review from undergraduate Mathematical Statistics
- Software tutorials

- Lecture: TR 9:40 - 11:00 Olmsted 1429
- Discussion: R 1:10 - 2:00 Olmsted 1431 (No Discussion Week 1)
- Instructor: James M. Flegal, jflegal@ucr.edu
- Office Hours: R 11:00 - 12:00 Olmsted 1428
- Zoom Hours: M 9:00 - 10:00 and R 11:10 - 12:00
- Internet: iLearn STAT 203

This class is an introduction to Bayesian statistics including "subjective probability, Renyi axiom system, Savage axioms, coherence, Bayes theorem, credibility intervals, Lindley paradox, empirical Bayes estimation, natural conjugate priors, de Finetti's theorem, approximation methods, Bayesian bootstrap, Bayesian computer programs".

The class will be taught in the R language.

There will be approximately eight participation / presentation exercises, five homework assignments, and a final exam. Grades will be calculated as follows:

- Participation and weekly presentations: 30%
- Software tutorial: 20%
- Homework: 20%
- Final exam: 30%

Strongly recommended books:

- Ronald Christensen, Wesley O. Johnson, Adam J. Branscum, and Timothy E. Hanson, Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians

Other books that may be helpful:

- Peter M. Lee, Bayesian Statistics: An Introduction

There are many online resources for learning about R and working with it, in addition to the textbooks:

- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."

You are encouraged to discuss course material, including assignments, with your classmates. All work you turn in, however, must be your own. This includes both writing and code. Copying from other students, from books, or from websites (1) does nothing to help you learn, (2) is easy to detect, and (3) has serious negative consequences.

- I will try to cover Chapters 1-3 and 5-9 from Ronald Christensen, Wesley O. Johnson, Adam J. Branscum, and Timothy E. Hanson, Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians.
- Topics include Fundamental Ideas, Integration Versus Simulation, Comparing Populations, Simulations, Binomial Regression, and Linear Regression.

- Describe Bayesian statistics, where one's inferences about parameters or hypotheses are updated as evidence accumulates. Learn to use Bayes’ rule to transform prior probabilities into posterior probabilities, and introduce the underlying theory and perspective of the Bayesian paradigm.
- Apply Bayesian methods to several practical problems, to show end-to-end Bayesian analyses that move from framing the question to building models to eliciting prior probabilities to implementing in R the final posterior distribution.
- Develop a 'Software Tutorial' …
- Have fun! Class is an elective!

- I will attempt to have a 'weekly' participation or presentation activity.
- This will be based on readings, homework, data analysis, and so on. I anticipate some of these will be group activities.
- Goal is to 'flip' the classroom … "Instructional strategy that reverses the traditional learning environment by delivering instructional content, often online, outside of the classroom. It moves activities, including those that may have traditionally been considered homework, into the classroom."

- Everyone must be prepared to present, but we don't have time for everyone to present every week.
- Use 'The Wheel' implemented at Amazon Web Services.
- We will use the `sample()` function in `R` and weight away from participants recently chosen.

    names <- c("VJ", "Lauren", "Cangao", "Yuting", "Ying", "Huiling",
               "Samantha", "Jinhui", "Song", "Mi")
    count <- rep(1, length(names))
    chosen <- sample(names, 1, prob=1/count)
    chosen

    ## [1] "Lauren"

    count[chosen==names] <- count[chosen==names] + 1 # +1 subject to change
    count

    ## [1] 1 2 1 1 1 1 1 1 1 1

- Assignment - Read Chapter 1 and "Why isn't everyone a Bayesian?" available on iLearn.
- Everyone is expected to contribute to Thursday's discussion; this will be part of your grade.
- Be ready to answer questions such as: Which example was most compelling? What do you `like` or `dislike` about Bayesian statistics? Are you a Bayesian? What are some lurking complexities? What did you find interesting in reading Chapter 1? What was interesting in Efron's paper?
- Prepare one question for the class or instructor.

- Bayesian statistical analysis is based on the premise that all uncertainty should be modeled using probabilities and that all statistical inferences should be based on the laws of probabilities.
- Probability models have long been embraced for data.
- Given a probability model for the data, Bayesian methods mandate an additional probability model for the unknown parameters in the data model.
- Uncertainty about parameters is modeled using scientific expert information, called 'prior' information acquired 'a priori' to obtaining the data.

- Binomial data example from Section 1.1 of Christensen et al.
- The Par-Aide Corporation in Lino Lakes, Minnesota, makes ball washers for golf courses. St. Paul Brass and Aluminum Foundry makes a part called a “push rod eye,” an integral component of the golf ball washer. Out of 2,430 push rod eyes poured over seven days in May, 2003, only 2,211 actually shipped. It is of interest to estimate the probability of pouring a defective part. The Vice President of Operations thinks that for this particular part a plausible range for the proportion of scrap is 5% to 15%.

- Assume production process determines a proportion of defective parts that we call \(\theta\). Treat the available data as a random sample from the production process, so that the probability of any part being scrapped is \(\theta\) and further we assume that all 2,430 parts being examined are independent. With these assumptions \[ y | \theta \sim \text{Bin}(2430, \theta). \]
- The proportion defective, \(\theta\), is unknown. Use probability to reflect the Vice President's knowledge about it. We interpreted the Vice President as specifying \(Pr(\theta \le 0.05) = 0.025\) and \(Pr(\theta \ge 0.15) = 0.025\). Although these statements do not completely determine a probability distribution for \(\theta\), we restricted our attention to the beta probability distribution for \(\theta\). Then we have \[ \theta \sim \text{Beta}(12.05, 116.06). \]
- This is the **prior distribution**.

- A Bayesian analysis uses **Bayes Theorem** to combine the data with the prior distribution to update the Vice President's probability distribution about \(\theta\). This new probability distribution, called the **posterior distribution**, describes knowledge about \(\theta\) and is the fundamental tool in Bayesian statistical analysis. Typically, we use computer simulations to approximate the posterior distribution. Occasionally, we can find it mathematically, such as in this example.
- With \(y = 219\), the estimated median of the posterior distribution is 0.09. A 95% equal-tailed probability interval for \(\theta\) is then (0.08, 0.10).
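
The numbers above can be reproduced with a short sketch in base R. It relies on the beta-binomial conjugacy result, under which a Beta(\(a, b\)) prior and \(y\) successes in \(n\) trials give a Beta(\(a + y, b + n - y\)) posterior. The variable names are chosen here for illustration only; the data and prior values are those of the example.

```r
# Data and prior from the push rod eye example
y <- 219      # scrapped parts: 2430 poured minus 2211 shipped
n <- 2430     # parts poured
a <- 12.05    # prior hyperparameters, theta ~ Beta(a, b)
b <- 116.06

# Sanity check on the prior: the central 95% interval should be
# close to the Vice President's plausible range of 5% to 15%
prior_interval <- qbeta(c(0.025, 0.975), a, b)

# Conjugate update: posterior is Beta(a + y, b + n - y)
post_a <- a + y
post_b <- b + n - y

# Posterior median and 95% equal-tailed probability interval
post_median   <- qbeta(0.5, post_a, post_b)
post_interval <- qbeta(c(0.025, 0.975), post_a, post_b)

round(post_median, 2)
round(post_interval, 2)
```

Rounded to two decimals, the median and interval should agree with the values quoted above.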

- Suppose you are uncertain about something, which is described by a probability distribution called your prior distribution
- Suppose you obtain some data relevant to that thing
- The data changes your uncertainty, which is then described by a new probability distribution called your posterior distribution

- Posterior distribution reflects the information both in the prior distribution and the data
- Most of Bayesian inference is about how to go from prior to posterior

- The way Bayesians go from prior to posterior is to use the laws of conditional probability, sometimes called in this context Bayes rule or Bayes theorem
- Suppose we have a PDF \(g\) for the prior distribution of the parameter \(\theta\), and suppose we obtain data \(x\) whose conditional PDF given \(\theta\) is \(f\)
- Then the joint distribution of data and parameters is conditional times marginal \[ f( x | \theta) g(\theta) \]
- This may look strange because most of your training so far considers the frequentist paradigm
- Here both \(x\) and \(\theta\) are random variables

- The correct posterior distribution, according to the Bayesian paradigm, is the conditional distribution of \(\theta\) given \(x\), which is joint divided by marginal \[ h (\theta | x) = \frac{f( x | \theta) g(\theta)}{\int f( x | \theta) g(\theta) d \theta} \]
- Often we do not need to do the integral: if we recognize that \[ \theta \mapsto f( x | \theta) g(\theta) \] is, except for constants, the PDF of a brand name distribution, then that distribution must be the posterior

Suppose the prior distribution for \(p\) is Beta(\(\alpha_1, \alpha_2\)) and the conditional distribution of \(x\) given \(p\) is Bin(\(n\), \(p\)). Then \[ f(x|p) = {n \choose x} p^x (1-p)^{n-x} \] and \[ g(p) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1}. \] Then \[ f(x|p) g(p) = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] and this, considered as a function of \(p\) for fixed \(x\), is, except for constants, the PDF of a Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.

Why? \[ \begin{aligned} h (p | x) & = \frac{f( x | p) g(p)}{\int f( x | p) g(p) d p} \\ & \propto f( x | p) g(p) \\ & = {n \choose x} \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \\ & \propto p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \end{aligned} \] And there is only one PDF with support \([0,1]\) of that form, namely that of a Beta(\(x + \alpha_1, n - x + \alpha_2\)) distribution. So that is the posterior.
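
The argument can also be checked numerically. The sketch below evaluates Bayes rule directly, doing the integral in the denominator with `integrate()`, and compares the result with the Beta(\(x + \alpha_1, n - x + \alpha_2\)) density. The values of `n`, `x`, `alpha1`, and `alpha2` are hypothetical and chosen only for illustration.

```r
# Hypothetical data and hyperparameters, for illustration only
n <- 20
x <- 6
alpha1 <- 2
alpha2 <- 3

# Joint density f(x | p) g(p), viewed as a function of p for fixed x
joint <- function(p) dbinom(x, n, p) * dbeta(p, alpha1, alpha2)

# Marginal density of x: the integral in the denominator of Bayes rule
marginal <- integrate(joint, lower = 0, upper = 1)$value

# Posterior by brute force versus the recognized beta distribution
p_grid      <- seq(0.01, 0.99, by = 0.01)
h_numeric   <- joint(p_grid) / marginal
h_conjugate <- dbeta(p_grid, x + alpha1, n - x + alpha2)

# Largest discrepancy over the grid; should be at rounding-error level
max_abs_diff <- max(abs(h_numeric - h_conjugate))
max_abs_diff
```

The two curves coincide up to numerical error, which is why recognizing the kernel lets us skip the integral.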

- In Bayes rule, **constants**, meaning anything that doesn’t depend on the parameter, are irrelevant
- We can drop multiplicative constants that do not depend on the parameter from \(f( x | \theta)\) obtaining the likelihood \(L(\theta)\)
- We can also drop multiplicative constants that do not depend on the parameter from \(g(\theta)\) obtaining the unnormalized prior
- Multiplying them together gives the unnormalized posterior

\[ \text{likelihood } \times \text{ unnormalized prior } = \text{ unnormalized posterior} \]

In our example we could have multiplied likelihood \[ p^x (1-p)^{n-x} \] times unnormalized prior \[ p^{\alpha_1 -1} (1-p)^{\alpha_2 - 1} \] to get unnormalized posterior \[ p^{x + \alpha_1 -1} (1-p)^{n - x + \alpha_2 - 1} \] which, as before, can be recognized as an unnormalized beta PDF.

- It is convenient to have a name for the parameters of the prior and posterior. If we call them parameters, then we get confused because they play a different role from the parameters of the distribution of the data.
- The parameters of the distribution of the data, \(p\) in our example, the Bayesian treats as random variables. They are the random variables whose distributions are the prior and posterior.
- The parameters of the prior, \(\alpha_1\) and \(\alpha_2\) in our example, the Bayesian treats as known constants. They determine the particular prior distribution used for a particular problem. To avoid confusion we call them
**hyperparameters**.

- Parameters, meaning the parameters of the distribution of the data and the variables of the prior and posterior, are unknown constants. The Bayesian treats them as random variables because probability theory is the correct description of uncertainty.
- Hyperparameters, meaning the parameters of the prior and posterior, are known constants. The Bayesian treats them as non-random variables because there is no uncertainty about their values.
- In our example, the hyperparameters of the prior are \(\alpha_1\) and \(\alpha_2\), and the hyperparameters of the posterior are \(x + \alpha_1\) and \(n - x + \alpha_2\).
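
Since the posterior hyperparameters are just the prior hyperparameters plus counts from the data, the update can be written as a small function. The helper `update_beta()` below is hypothetical (it is not part of any package), and the split of the push rod eye data into two batches is invented purely to illustrate that updating in stages gives the same posterior as updating all at once.

```r
# Hypothetical helper: maps prior hyperparameters and binomial data
# (x successes in n trials) to posterior hyperparameters
update_beta <- function(hyper, x, n) {
  c(alpha1 = hyper[["alpha1"]] + x,
    alpha2 = hyper[["alpha2"]] + n - x)
}

prior <- c(alpha1 = 12.05, alpha2 = 116.06)

# All of the data at once
batch <- update_beta(prior, x = 219, n = 2430)

# Same data in two invented batches: yesterday's posterior
# serves as today's prior
step1 <- update_beta(prior, x = 100, n = 1200)
step2 <- update_beta(step1, x = 119, n = 1230)

batch
step2
```

Both routes end at the same pair of hyperparameters, since only the total counts enter the update.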

Suppose \(X_1 , \dots, X_n\) are i.i.d. \(N(\theta , \sigma^2)\) where \(\sigma^2\) is known. Suppose further we have a prior \(\theta \sim N(\mu, \tau^2)\). Then the posterior can be obtained as follows, \[ \begin{aligned} f (\theta | x) & \propto f(\theta) \prod_{i=1}^n f(x_i | \theta) \\ & \propto \exp \left \{ -\frac{1}{2} \left( \frac{(\theta-\mu)^2}{\tau^2} + \frac{\sum_{i=1}^{n} (x_i - \theta)^2}{\sigma^2} \right) \right\} \\ & \propto \exp \left \{ -\frac{1}{2} \frac{\left( \theta - \displaystyle \frac{\mu / \tau^2 + n\bar{x} / \sigma^2}{1/\tau^2 + n/\sigma^2} \right)^2}{\displaystyle \frac{1}{1/\tau^2 + n/\sigma^2}} \right\}. \end{aligned} \]

Or \(\theta | x \sim N( \mu_n, \tau_n^2)\) where \[
\mu_n = \left( \frac{\mu}{\tau^2} + \frac{n \bar{x}}{\sigma^2} \right) \tau_n ^2 \quad \mbox{and} \quad \tau_n^2 = \frac{1}{1/\tau^2 + n/\sigma^2} .
\] We will call this a **conjugate** Bayes model. Also note that a 95% credible region for \(\theta\) (which is also the HPD, highest posterior density, region) is given by \[
\left( \mu_n - 1.96 \tau_n, \mu_n + 1.96 \tau_n \right) .
\] For large \(n\), the data will overwhelm the prior.
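
A minimal sketch of these formulas in R follows. The data are simulated, and the values of `sigma`, `mu`, `tau`, the true mean, and the sample size are all assumptions made for illustration. The quantity `w` is the weight the posterior mean places on \(\bar{x}\); it approaches 1 as \(n\) grows, which is one way to see the data overwhelming the prior.

```r
set.seed(1)

# Assumed values, for illustration only
sigma <- 2    # known standard deviation of the data
mu    <- 0    # prior mean
tau   <- 1    # prior standard deviation
x     <- rnorm(25, mean = 3, sd = sigma)   # simulated data

n    <- length(x)
xbar <- mean(x)

# Posterior variance and mean from the conjugate formulas
tau_n2 <- 1 / (1 / tau^2 + n / sigma^2)
mu_n   <- (mu / tau^2 + n * xbar / sigma^2) * tau_n2

# 95% credible (and HPD) interval
cred_int <- mu_n + c(-1, 1) * qnorm(0.975) * sqrt(tau_n2)

# Weight on the sample mean: mu_n = (1 - w) * mu + w * xbar
w <- (n / sigma^2) / (1 / tau^2 + n / sigma^2)

c(mu_n = mu_n, tau_n2 = tau_n2, w = w)
cred_int
```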

- If \(f(\theta) \propto 1\), an improper prior, then a 95% credible region for \(\theta\) is the same as a 95% confidence interval since \(\theta | x \sim N( \bar{x}, \sigma^2 / n)\) (try to show this at home).
- Usually, we specify a prior and likelihood that result in a posterior that is intractable. That is, we can't work with it analytically or even calculate the appropriate normalizing constant \(c\).
- However, it is often easy to **simulate a Markov chain** with \(f(\theta|x)\) as its stationary distribution.
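
To make the last point concrete, here is a minimal sketch of a random-walk Metropolis sampler in base R. It is one simple way to build such a Markov chain, not the method of any particular package. It targets the posterior from the push rod eye example, where the exact answer is known, so the chain can be checked against it. The starting value, proposal standard deviation, chain length, and seed are arbitrary choices made for illustration.

```r
set.seed(203)

# Data and prior hyperparameters from the push rod eye example
y <- 219; n <- 2430
a <- 12.05; b <- 116.06

# Log of the unnormalized posterior; the normalizing constant is never needed
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  (y + a - 1) * log(theta) + (n - y + b - 1) * log(1 - theta)
}

n_iter <- 10000
chain  <- numeric(n_iter)
theta  <- 0.1                      # arbitrary starting value

for (i in seq_len(n_iter)) {
  # Propose a move from a symmetric normal random walk
  proposal  <- theta + rnorm(1, mean = 0, sd = 0.01)
  log_ratio <- log_post(proposal) - log_post(theta)
  # Accept with probability min(1, ratio); otherwise stay put
  if (log(runif(1)) < log_ratio) theta <- proposal
  chain[i] <- theta
}

mcmc_mean  <- mean(chain)
exact_mean <- (y + a) / (n + a + b)   # mean of Beta(a + y, b + n - y)
c(mcmc = mcmc_mean, exact = exact_mean)
```

The sampler only ever uses ratios of the unnormalized posterior, which is why an intractable normalizing constant is not an obstacle.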

- Bayesian analysis of complex statistical models usually requires Markov chain Monte Carlo (MCMC) methods.
- Lots of tools available … Each of you will present a 'software tutorial' for a tool of your choice.
- Presentations will be approximately 30 minutes, preliminarily scheduled for Weeks 6, 8, and 10.
- Tutorials can be in `R` or a package in `R`, JAGS, MATLAB, OpenBUGS, or Stan.

- Example tutorial using `mcmc` and `mcmcse` `R` packages will be presented during Week 5 (potentially Week 4). My tutorial will be made using Markdown.
- Assume the audience has not used the software before, but is obviously a student in this class.
- Example tutorials for statistical software can be found here.