---
title: 'Density Estimation'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---

## Agenda

- Histograms
- Glivenko-Cantelli theorem
- Error for density estimates
- Kernel density estimates
- Bivariate density estimates

## Histograms

- Histograms are one of the first things learned in "Introduction to Statistics"
- Simple way of estimating a distribution
    + Split the sample space up into bins
    + Count how many samples fall into each bin
- If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin's probability

## Example: Old Faithful Geyser Data

```{r}
data(faithful); x0 <- 0; x1 <- 8; h <- .5
my.breaks <- seq(from=x0, to=x1, by=h)
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Histogram of Eruption Lengths")
```

## Histograms

- What about a density function?
- Could take our histogram estimate and say that the probability density is uniform within each bin
- Gives us a piecewise-constant estimate of the density
- ***Problem***: Will not converge on the true density -- unless we shrink the bins as we get more and more data
- Bias-variance trade-off
    + With a large number of very small bins, the minimum bias in our estimate of any density becomes small
    + But the variance grows for very small bins

## Example: Old Faithful Geyser Data

```{r, echo=FALSE}
par(mfrow = c(1, 2))
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Large Bin Width")
h <- .1
my.breaks <- seq(from=x0, to=x1, by=h)
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Small Bin Width")
par(mfrow = c(1, 1))
```

## Histograms

Bin width primarily controls the amount of smoothing; lots of guidance is available (the three rules below are compared on the next slide)

1. **Sturges' rule**: Optimal width of class intervals is given by
    \[
    \frac{R}{1+\log_2 n}
    \]
    where $R$ is the sample range -- designed for data sampled from symmetric, unimodal populations
2. **Scott's normal reference rule**: Specifies a bin width of
    \[
    3.49 \hat\sigma n^{-1/3}
    \]
    where $\hat \sigma$ is an estimate of the population standard deviation $\sigma$
3. **Freedman-Diaconis rule**: Specifies the bin width to be
    \[
    2 (IQR) n^{-1/3}
    \]
    where $IQR$ is the sample inter-quartile range
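
## Example: Comparing Bin Width Rules

As a quick sanity check, the three rules can be computed directly for the eruption lengths; note that each value is a bin *width*, not a bin count, and the object names below are ad hoc.

```{r}
x <- faithful$eruptions
n <- length(x)
h.sturges <- diff(range(x)) / (1 + log2(n))  # Sturges: range / (1 + log2(n))
h.scott   <- 3.49 * sd(x) * n^(-1/3)         # Scott's normal reference rule
h.fd      <- 2 * IQR(x) * n^(-1/3)           # Freedman-Diaconis
c(Sturges = h.sturges, Scott = h.scott, FD = h.fd)
```

All three give bin widths in the same ballpark as the $h = 0.5$ used in the first histogram above.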

## Histograms

- Is learning the whole distribution non-parametrically even feasible?
- How can we measure error to deal with the bias-variance trade-off?

## Empirical CDF

- Learning the whole distribution is ***feasible***
- Something even dumber than shrinking histograms will work
- Suppose we have one-dimensional samples $x_1, \dots, x_n$ with CDF $F$
- Define the empirical cumulative distribution function on $n$ samples as
\[
\hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a)
\]
- Just the fraction of the samples which are less than or equal to $a$

## Glivenko-Cantelli theorem

- Then the ***Glivenko-Cantelli theorem*** says
\[
\max_{a} | \hat{F}_n(a) - F(a) | \to 0
\]
- So the empirical CDF converges to the true CDF ***everywhere***, i.e. the maximum gap between the two of them goes to zero
- Pitman (1979) calls this the "fundamental theorem of statistics"
- Can learn distributions just by collecting enough data

## Glivenko-Cantelli theorem

- Can we use the empirical CDF to estimate a density?
- Yes, but it corresponds to a discrete distribution, so it makes a poor density estimate
- Usually we can expect to find some new samples between our old ones
- So we want a non-zero density between our observations
- The uniform distribution within each bin of a histogram doesn't have this issue
- Can we do better?

## Error for density estimates

- Yes, but what do we mean by "better" density estimates?
- Three ideas:

1. Squared deviation from the true density should be small
    \[
    \int \left( f(x) - \hat{f}(x) \right)^2 dx
    \]
2. Absolute deviation from the true density should be small
    \[
    \int \left| f(x) - \hat{f}(x) \right| dx
    \]
3. Average log-likelihood ratio should be kept low
    \[
    \int f(x) \log \frac{f(x)}{\hat{f}(x)} dx
    \]

## Error for density estimates

- Squared deviation is similar to the MSE criterion used in regression
    + Used most frequently since it's mathematically tractable
- Absolute deviation is the $L_1$ or total variation distance between the true and the estimated density
    + Nice property that $\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx$ is exactly the maximum error in our estimate of the probability of any set
    + But it's tricky to work with, so we'll skip it
- Minimizing the log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing entropy
    + Called the Kullback-Leibler divergence or relative entropy

## Error for density estimates

- Notice that
\[
\int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx
\]
- First term doesn't depend on the estimate, so we can ignore it for purposes of optimization
- Third term only involves $\hat{f}(x)$, and is just an integral, which we can do numerically
- Second term involves both the true and the estimated density; we can approximate it using Monte Carlo by
\[
\frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]

## Error for density estimates

- Then our error measure is
\[
\int \hat{f}^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]
- In fact, this error measure does not depend on having one-dimensional data
- For purposes of cross-validation, we can estimate $\hat{f}(x)$ on the training set and then restrict the sum to points in the testing set

## Naive estimator

- If a random variable $X$ has probability density $f$, then
\[
f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h)
\]
- Thus, a naive estimator would be
\[
\widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right]
\]

## Naive estimator

- Or, equivalently
\[
\widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right)
\]
where $w$ is a weight function defined as
\[
w(x)=\begin{cases} 1/2 & |x| < 1 \\ 0 & \text{otherwise} \end{cases}
\]
- In short, the naive estimate is constructed by placing a box of width $2h$ and height $\frac{1}{2nh}$ on each observation, then summing to obtain the estimate

## Example: Old Faithful Geyser Data

```{r}
my.w <- function(x){
  if (abs(x) < 1) w <- 1/2
  else w <- 0
  w
}
x <- seq(0, 6, 0.2)
m <- length(x)
n <- length(faithful$eruptions)
h <- .5
fhat <- rep(0, m)
# fhat[i] = (1/(2nh)) * #{observations within h of the grid point x[i]}
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

## Example: Old Faithful Geyser Data

```{r}
plot(x, fhat, type="s", xlab="Eruption length", ylab="Density Estimate",
     main="Naive Density Estimator")
```
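
## Example: Evaluating the Error

Recall the squared-error criterion $\int \hat{f}^2(x) dx - \frac{2}{m}\sum \hat{f}(x_i)$ from a few slides back -- a minimal sketch for the naive estimator, fitting on a random half of the data and summing over the $m$ held-out points (the split, the grid, and the object names here are arbitrary choices).

```{r}
set.seed(1)
eruptions <- faithful$eruptions
train <- sample(length(eruptions), length(eruptions) %/% 2)
# naive estimate at a point x0 from the training data
fhat.naive <- function(x0, data, h) mean(sapply((x0 - data) / h, my.w)) / h
sq.err <- function(h){
  grid <- seq(0, 8, by = 0.05)
  f.grid <- sapply(grid, fhat.naive, data = eruptions[train], h = h)
  f.test <- sapply(eruptions[-train], fhat.naive, data = eruptions[train], h = h)
  sum(f.grid^2) * 0.05 - 2 * mean(f.test)  # integral of fhat^2 minus (2/m) * sum
}
sapply(c(0.25, 0.5, 1), sq.err)
```

Smaller values are better; the dropped constant $\int f^2(x) dx$ is the same for every bandwidth, so it does not affect the comparison.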

## Naive estimator

- Not wholly satisfactory, from the point of view of using density estimates for presentation
- Estimate $\widehat f$ is a step function
- In the formula for the naive estimate, we can replace the weight function $w$ by another function $K$ with more desirable properties
- Function $K$ is called a **kernel**

## Kernel density estimates

- Resulting estimate is a **kernel estimator**:
$$
\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right).
$$
- $h$ is the **window width**, **smoothing parameter**, or **bandwidth**
- Usually the kernel $K$ is taken to be a probability density function itself (e.g., a normal density)
- Resulting estimate will inherit all the smoothness properties of $K$

## Kernel density estimates

Most popular choices for the kernel $K$ are

Family | Kernel
:------:|:------:
Gaussian | $K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$
Rectangular | $K(t) = 1/2$ for $|t|<1$
Triangular | $K(t) = 1-|t|$ for $|t|<1$
Epanechnikov | $K(t)=\frac{3}{4\sqrt{5}}\left(1-\frac{t^2}{5}\right)$ for $|t|<\sqrt{5}$

## Kernel density estimates

```{r}
# redefine my.w to allow either a Gaussian kernel or the naive (boxcar) weight
my.w <- function(x, type="gaussian"){
  if(type == "gaussian"){
    w <- dnorm(x)
    return(w)
  }
  if(type == "naive"){
    if (abs(x) < 1) w <- 1/2
    else w <- 0
    return(w)
  }
  print("You have asked for an undefined kernel.")
  return(NULL)
}
```

## Example: Old Faithful Geyser Data

```{r}
x <- seq(0, 6, 0.02)
m <- length(x)
n <- length(faithful$eruptions)
h <- .1
fhat <- rep(0, m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

## Example: Old Faithful Geyser Data

```{r}
plot(x, fhat, type="l", xlab="Eruption length", ylab="Density Estimate",
     main="Gaussian Kernel Density Estimator")
```

## Bandwidth selection

- Cross-validation, which could be time consuming
- Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is $1.06\sigma / n^{1/5}$
- Called the **Gaussian reference rule** or the **rule-of-thumb** bandwidth
- When you call `density` in R, the default bandwidth is a close variant of this rule

## Kernel density estimate samples

- There are times when one wants to draw a random sample from the estimated distribution
- Easy with kernel density estimates, because each kernel is itself a probability density
- Suppose the kernel is Gaussian, that we have scalar observations $x_1, \dots, x_n$, and bandwidth $h$

1. Pick an integer $i$ uniformly at random from $1$ to $n$
2. Use `rnorm(1,x[i],h)`, or `rnorm(q,sample(x,q,replace=TRUE),h)` for $q$ draws

- Using a different kernel, we'd just need to use the random number generator function for the corresponding distribution

## Other Approaches

- Histograms and kernels are not the only possible ways of estimating densities
- Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc.
- For some of these, negative probability density estimates are avoided by estimating the log density instead

## Density estimation in R

- The `density()` function is the most common

```{r, eval=FALSE}
density(x, ...)

## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight", "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE, n = 512, from, to, cut = 3, na.rm = FALSE, ...)
```

- `ASH` and `KernSmooth` are both fast, accurate, and well-maintained (Deng and Wickham, 2011)
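
## Example: Rule-of-Thumb Bandwidth

As a short usage example, the rule-of-thumb bandwidth from the bandwidth-selection slide can be passed straight to `density()`; `bw.nrd0()` is the default selector, shown only for comparison (object names here are ad hoc).

```{r}
x <- faithful$eruptions
h.rot <- 1.06 * sd(x) / length(x)^(1/5)   # Gaussian reference rule
c(rule.of.thumb = h.rot, default.nrd0 = bw.nrd0(x))
plot(density(x, bw = h.rot), main = "Gaussian kernel, rule-of-thumb bandwidth")
```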

## Bivariate density estimation

- To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
- The `bin2d` function below bins a bivariate data set

```{r}
bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
  histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
  histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
  brx <- histg1$breaks
  bry <- histg2$breaks
  freq <- table(cut(x[,1], brx), cut(x[,2], bry))
  return(list(call = match.call(), freq = freq, breaks1 = brx,
              breaks2 = bry, mids1 = histg1$mids, mids2 = histg2$mids))
}
```

## Bivariate density estimation

- Following example computes the bivariate frequency table
- After binning the data, the `persp` function plots the density histogram

```{r}
data(iris)
fit1 <- bin2d(iris[1:50, 1:2])
persp(x=fit1$mids1, y=fit1$mids2, z=fit1$freq,
      shade=T, theta=45, phi=30, ltheta=60)
```

## Bivariate kernel methods

- Suppose the data are $X_1, \dots, X_n$, where each $X_i\in \mathbb{R}^2$
- Kernel density estimates can be extended to a multivariate (here bivariate, $d=2$) setting
- Let $K(\cdot)$ be a bivariate kernel (typically a bivariate density function); then the bivariate kernel density estimate is
\[
\hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right)
\]

## Example: Bivariate normal

- Estimate the bivariate density when the data is generated from a mixture model with three components having common covariance $\Sigma=I_2$ and different means
\[
\mu_1=(0,0), \quad \mu_2=(1,3), \quad \mu_3=(4,-1)
\]
- Mixture probabilities are $p=(0.2, 0.3, 0.5)$

## Example: Bivariate normal

```{r}
library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)
k <- table(i)
x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <- rbind(x1, x2, x3)   # the mixture data
x <- X[,1]
y <- X[,2]
fhat <- kde2d(x, y, h=c(1.87, 1.84))
```

## Example: Bivariate normal

```{r}
par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")
par(mfrow = c(1, 1))
```

## Summary

- Over 20 R packages perform density estimation (Deng and Wickham, 2011)
- Kernel density estimation is the most common approach
- Density estimation can also be parametric, where the data is from a known family
- Bayesian approaches are also available

## Discussion

- How to use cross-validation to choose the bandwidth? (one possible sketch follows)
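
## Discussion: A Cross-Validation Sketch

One possible sketch, assuming a Gaussian kernel: for each candidate bandwidth, combine $\int \hat{f}^2(x) dx$ with the leave-one-out estimates $\hat{f}_{-i}(x_i)$, as in the squared-error criterion from earlier, and pick the minimizer (the grid and the name `lscv` below are arbitrary).

```{r}
x <- faithful$eruptions; n <- length(x)
# least-squares CV score: integral of fhat^2 minus (2/n) * sum of the
# leave-one-out estimates evaluated at the held-out points
lscv <- function(h){
  grid <- seq(0, 8, length.out = 512)
  fhat <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
  loo  <- sapply(seq_len(n), function(i) mean(dnorm((x[i] - x[-i]) / h)) / h)
  sum(fhat^2) * diff(grid[1:2]) - 2 * mean(loo)
}
hs <- seq(0.05, 0.6, by = 0.05)
hs[which.min(sapply(hs, lscv))]
```

`bw.ucv()` in the stats package automates this kind of least-squares cross-validation.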