---
title: 'Bootstrap'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---

## Agenda

- Toy collector solution
- Plug-In and the Bootstrap
- Nonparametric and Parametric Bootstraps
- Examples

## Exercise: Toy Collector

Children (and some adults) are frequently enticed to buy breakfast cereal in an effort to collect all the action figures. Assume there are 15 action figures and each cereal box contains exactly one, with each figure being equally likely.

- Find the expected number of boxes needed to collect all 15 action figures.
- Find the standard deviation of the number of boxes needed to collect all 15 action figures.
- Now suppose we no longer have equal probabilities; instead let

Figure | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O
--- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | -
Probability | .2 | .1 | .1 | .1 | .1 | .1 | .05 | .05 | .05 | .05 | .02 | .02 | .02 | .02 | .02

- Estimate the expected number of boxes needed to collect all 15 action figures.
- What is the uncertainty of your estimate?
- What is the probability you bought more than 300 boxes? 500 boxes? 800 boxes?

## Exercise: Toy Collector

- Consider the probability of a "new toy" given we already have $i$ toys; then
\[
P(\text{New Toy} \mid i) = \frac{15-i}{15}
\]
- Since each box is independent, the waiting time until a "new toy" is a geometric random variable
- The mean is
\[
\frac{15}{15} + \frac{15}{14} + \dots + \frac{15}{1} \approx 49.77
\]
- The variance of a geometric random variable with success probability $p$ is $(1-p)/p^2$, so the variance is
\[
\frac{15^2(1-15/15)}{15^2} + \frac{15^2(1-14/15)}{14^2} + \dots + \frac{15^2(1-1/15)}{1^2} \approx 305.82
\]
with standard deviation 17.49

## Exercise: Toy Collector

```{r}
prob.table <- c(.2, .1, .1, .1, .1, .1, .05, .05, .05, .05,
                .02, .02, .02, .02, .02)
boxes <- seq(1, 15)
box.count <- function(prob = prob.table){
  check <- double(length(prob))
  i <- 0
  # buy boxes until every figure has been collected at least once
  while(sum(check) < length(prob)){
    figure <- sample(boxes, 1, prob = prob)
    check[figure] <- 1
    i <- i + 1
  }
  return(i)
}
```
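## Exercise: Toy Collector

- A minimal simulation sketch for the unequal-probability questions; the 1000 replications and the name `sim` are illustrative choices

```{r}
sim <- replicate(1000, box.count())
mean(sim)                    # estimated expected number of boxes
sd(sim) / sqrt(length(sim))  # Monte Carlo standard error of the estimate
mean(sim > 300)              # estimated P(more than 300 boxes)
mean(sim > 500)              # estimated P(more than 500 boxes)
mean(sim > 800)              # estimated P(more than 800 boxes)
```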
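## Exercise: Toy Collector

- For comparison, the closed-form mean and standard deviation in the equal-probability case can be checked numerically; this quick verification is an addition, not part of the original exercise

```{r}
15 * sum(1 / (1:15))                          # mean, approx 49.77
sqrt(sum(15^2 * (1 - (1:15)/15) / (1:15)^2))  # sd, approx 17.49
```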
## Example: Newcomb's speed of light

- Approximate the p-value by the proportion of the 1000 bootstrap test statistics exceeding the observed value of 44.29; `boot.stats` below is an assumed name for the bootstrap replicates

```{r, eval=FALSE}
# boot.stats: bootstrap replicates of the test statistic (assumed name)
(sum(boot.stats > 44.29))/1000
```

- Since our significance level is 5%, we reject $H_0$ and conclude that Newcomb's measurements were not consistent with the currently accepted figure

## Example: Sleep study

- The two-sample $t$-test checks for differences in means according to a known null distribution
- Similar to permutation tests
- Let's resample and generate the sampling distribution under the bootstrap assumption

```{r}
bootstrap.resample <- function(object) sample(object, length(object), replace = TRUE)
diff.in.means <- function(df) {
  mean(df[df$group == 1, "extra"]) - mean(df[df$group == 2, "extra"])
}
resample.diffs <- replicate(2000,
  diff.in.means(sleep[bootstrap.resample(1:nrow(sleep)), ]))
```

## Example: Sleep study

```{r}
hist(resample.diffs, main = "Bootstrap Sampling Distribution")
abline(v = diff.in.means(sleep), col = 2, lwd = 3)
```

## Bootstrapping functions

- R has numerous built-in bootstrapping functions, too many to mention; see the `boot` package
- Example using the function `boot`
- Bootstrap of the **ratio of means** using the `city` data included in the `boot` package

```{r}
library(boot)
data(city)
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)
results <- boot(city, ratio, R = 1000, stype = "w")
```

## Bootstrapping functions

```{r}
results
```

## Bootstrapping functions

```{r}
boot.ci(results, type = "bca")
```

## Bootstrapping a single statistic

- Can use the bootstrap to generate a 95% confidence interval for R-squared
- Linear regression of miles per gallon (`mpg`) on car weight (`wt`) and displacement (`disp`)
- Data source is `mtcars`
- The bootstrapped confidence interval is based on 1000 replications

```{r}
# R-squared of the regression fit to a bootstrap sample of rows
rsq <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  return(summary(fit)$r.square)
}
results <- boot(data = mtcars, statistic = rsq, R = 1000,
                formula = mpg ~ wt + disp)
```

## Bootstrapping a single statistic

```{r}
results
```

## Bootstrapping a single statistic

```{r}
boot.ci(results, type = "bca")
```

## Bootstrapping a single statistic

```{r}
plot(results)
```

## Summary

- Bootstrapping provides a nonparametric approach to statistical inference when distributional assumptions may not be met
- Enables calculation of standard errors and confidence intervals in a variety of situations, e.g. medians, correlation coefficients, regression parameters, ...
- Hypothesis tests are a little more challenging
- The bootstrap is large sample, approximate, and asymptotic!
- Works when the empirical distribution $\hat{F}_n$ is close to the true unknown distribution $F$
- This is usually the case when the sample size $n$ is large, and not otherwise; no method can save bad data!
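## Bootstrapping a median

- As a closing sketch, a bootstrap standard error and percentile interval for a median; the choice of `mtcars$mpg` and `R = 1000` here is illustrative only

```{r}
# uses the boot package loaded earlier
med <- function(d, i) median(d[i])
median.boot <- boot(mtcars$mpg, med, R = 1000)
median.boot                          # reports the bootstrap standard error
boot.ci(median.boot, type = "perc")  # percentile confidence interval
```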