---
title: 'Density Estimation'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---
## Agenda
- Histograms
- Glivenko-Cantelli theorem
- Error for density estimates
- Kernel density estimates
- Bivariate density estimates
## Histograms
- Histograms are one of the first things learned in "Introduction to Statistics"
- Simple way of estimating a distribution
    + Split the sample space up into bins
    + Count how many samples fall into each bin
- If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin’s probability
## Example: Old Faithful Geyser Data
```{r}
data(faithful); x0 <- 0; x1 <- 8; h <- .5
my.breaks<-seq(from=x0, to=x1, by=h)
myhist<-hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Histogram of Eruption Lengths")
```
## Histograms
- What about a density function?
- Could take our histogram estimate and say that the probability density is uniform within each bin
- Gives us a piecewise-constant estimate of the density
- ***Problem***: Will not converge on the true density -- unless we shrink the bins as we get more and more data
- Bias-variance trade-off
    + With a large number of very small bins, the bias in our density estimate becomes small
    + But the variance grows as the bins shrink
## Example: Old Faithful Geyser Data
```{r, echo=FALSE}
par(mfrow = c(1, 2))
myhist<-hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Large Bin Width")
h <- .1
my.breaks<-seq(from=x0, to=x1, by=h)
myhist<-hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Small Bin Width")
par(mfrow = c(1, 1))
```
## Histograms
Bin width primarily controls the amount of smoothing, and there is plenty of guidance available:
1. **Sturges' rule**: Optimal width of class intervals is given by
\[
\frac{R}{1+\log_2 n}
\]
where $R$ is the sample range; designed for data sampled from symmetric, unimodal populations
2. **Scott's Normal reference rule**: Specifies a bin width
\[
3.49\hat\sigma n^{-1/3}
\]
where $\hat \sigma$ is an estimate of the population standard deviation $\sigma$
3. **Freedman-Diaconis rule**: Specifies the bin width to be
\[
2(IQR)n^{-1/3}
\]
where the $IQR$ is the sample inter-quartile range
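## Example: Bin Width Rules

- As a sketch, the three rules applied to the eruption data (values computed on the fly, not quoted from a reference):

```{r}
data(faithful)
x <- faithful$eruptions
n <- length(x)
R <- diff(range(x))                    # sample range
w.sturges <- R / (1 + log2(n))         # Sturges' rule
w.scott   <- 3.49 * sd(x) * n^(-1/3)   # Scott's normal reference rule
w.fd      <- 2 * IQR(x) * n^(-1/3)     # Freedman-Diaconis rule
c(sturges = w.sturges, scott = w.scott, fd = w.fd)
```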
## Histograms
- Is learning the whole distribution non-parametrically even feasible?
- How can we measure error to deal with the bias-variance trade-off?
## Empirical CDF
- Learning the whole distribution is ***feasible***
- Something even dumber than shrinking histograms will work
- Suppose we have one-dimensional samples $x_1, \dots, x_n$ with CDF $F$
- Define the empirical cumulative distribution function on $n$ samples as
\[
\hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a)
\]
- Just the fraction of the samples which are less than or equal to $a$
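## Example: Empirical CDF

- R computes the empirical CDF directly via `ecdf()`; a minimal sketch on the eruption data:

```{r}
data(faithful)
Fn <- ecdf(faithful$eruptions)   # step function: fraction of samples <= a
Fn(3)                            # proportion of eruptions lasting at most 3 minutes
plot(Fn, main = "Empirical CDF of Eruption Lengths")
```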
## Glivenko-Cantelli theorem
- Then the ***Glivenko-Cantelli theorem*** says
\[
\sup_{a} | \hat{F}_n(a) - F(a) | \to 0 \quad \text{almost surely}
\]
- So the empirical CDF converges to the true CDF ***uniformly***, i.e. the largest gap between the two goes to zero
- Pitman (1979) calls this the “fundamental theorem of statistics”
- Can learn distributions just by collecting enough data
## Glivenko-Cantelli theorem
- Can we use the empirical CDF to estimate a density?
- Yes, but it's discrete and doesn't estimate a density well
- Usually we can expect to find some new samples between our old ones
- So we want a non-zero density between our observations
- Uniform distribution within each bin of a histogram doesn’t have this issue
- Can we do better?
## Error for density estimates
- Yes, but what do we mean by “better” density estimates?
- Three ideas:
1. Squared deviation from the true density should be small
\[
\int \left( f(x) - \hat{f}(x) \right)^2 dx
\]
2. Absolute deviation from the true density should be small
\[
\int \left| f(x) - \hat{f}(x) \right| dx
\]
3. Average log-likelihood ratio should be kept low
\[
\int f(x) \log \frac{f(x)}{\hat{f}(x)} dx
\]
## Error for density estimates
- Squared deviation is similar to the MSE criterion used in regression
    + Used most frequently since it's mathematically tractable
- Absolute deviation considers the $L_1$ or total variation distance between the true and the estimated density
    + Nice property that $\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx$ is exactly the maximum error in our estimate of the probability of any set
    + But it's tricky to work with, so we'll skip it
- Minimizing the log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing entropy
    + Called the Kullback-Leibler divergence or relative entropy
## Error for density estimates
- Notice that
\[
\int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx
\]
- First term doesn’t depend on the estimate, so we can ignore it for purposes of optimization
- Third term only involves $\hat{f}(x)$, and is just an integral, which we can do numerically
- Second term involves both the true and the estimated density; we can approximate it using Monte Carlo by
\[
- \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]
## Error for density estimates
- Then our error measure is
\[
\int \hat{f}^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]
- In fact, this error measure does not depend on having one-dimensional data
- For purposes of cross-validation, we can estimate $\hat{f}(x)$ on the training set and then restrict the sum to points in the testing set
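## Example: Cross-Validation Error

- A sketch of this criterion on the eruption data; `cv.score` is an illustrative helper (not a built-in), scoring $\int \hat{f}^2(x) dx - \frac{2}{m} \sum \hat{f}(x_i)$ over the $m$ test points:

```{r}
cv.score <- function(train, test, h) {
  d <- density(train, bw = h, from = min(train) - 3*h, to = max(train) + 3*h)
  fhat <- approxfun(d$x, d$y, yleft = 0, yright = 0)  # interpolated estimate
  int.f2 <- sum(fhat(d$x)^2) * diff(d$x[1:2])         # numerical integral of fhat^2
  int.f2 - (2 / length(test)) * sum(fhat(test))       # smaller is better
}
set.seed(1)
x <- faithful$eruptions
idx <- sample(length(x), length(x) / 2)
sapply(c(0.05, 0.1, 0.3, 1), function(h) cv.score(x[idx], x[-idx], h))
```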
## Naive estimator
- If a random variable $X$ has probability density $f$, then
\[
f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h)
\]
- Thus, a naive estimator would be
\[
\widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right]
\]
## Naive estimator
- Or, equivalently
\[
\widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right)
\]
where $w$ is a weight function defined as
\[
w(x)=\begin{cases} 1/2 & |x| < 1 \\ 0 & \text{otherwise} \end{cases}
\]
- In short, a naive estimate is constructed by placing a box of width $2h$ and height $\frac{1}{2nh}$ on each observation, then summing to obtain the estimate
## Example: Old Faithful Geyser Data
```{r}
my.w <- function(x){
  if (abs(x) < 1)
    w <- 1/2
  else
    w <- 0
  w
}
x <- seq(0, 6, 0.2)
m <- length(x)
n <- length(faithful$eruptions)
h <- .5
fhat <- rep(0, m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h)*my.w((faithful$eruptions[j]-x[i])/h)
  }
  fhat[i] <- (1/n)*S
}
```
## Example: Old Faithful Geyser Data
```{r}
plot(x,fhat, type="s", xlab="Eruption length", ylab="Density Estimate",
main="Naive Density Estimator")
```
## Naive estimator
- Not wholly satisfactory, from the point of view of using density estimates for presentation
- Estimate $\widehat f$ is a step function
- In the formula for the naive estimate, we can replace the weight function $w$ by another function $K$ with more desirable properties
- Function $K$ is called a **kernel**
## Kernel density estimates
- Resulting estimate is a **kernel estimator**:
$$
\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right).
$$
- $h$ is the **window width**, **smoothing parameter**, or **bandwidth**
- Usually the kernel $K$ is taken to be a probability density function itself (i.e., normal density)
- Resulting estimate will inherit all the smoothness properties of $K$
## Kernel density estimates
Most popular choices for the kernel $K$ are
Family | Kernel
:------:|:------:
Gaussian | $K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$
Rectangular | $K(t) = 1/2$ for $|t|<1$
Triangular | $K(t) = 1-|t|$ for $|t|<1$
Epanechnikov | $K(t)=\frac{3}{4\sqrt{5}}\left(1-\frac{t^2}{5}\right)$ for $|t|<\sqrt{5}$
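## Kernel density estimates

- Each kernel above is a probability density in its own right; a quick numerical check that each integrates to one:

```{r}
k.gauss <- function(t) dnorm(t)
k.rect  <- function(t) ifelse(abs(t) < 1, 1/2, 0)
k.tri   <- function(t) ifelse(abs(t) < 1, 1 - abs(t), 0)
k.epan  <- function(t) ifelse(abs(t) < sqrt(5), 3/(4*sqrt(5)) * (1 - t^2/5), 0)
sapply(list(k.gauss, k.rect, k.tri, k.epan),
       function(K) integrate(K, -5, 5)$value)   # each approximately 1
```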
## Kernel density estimates
```{r}
my.w <- function(x, type="gaussian"){
  if(type == "gaussian"){
    w <- dnorm(x)
    return(w)
  }
  if(type == "naive"){
    if (abs(x) < 1)
      w <- 1/2
    else
      w <- 0
    return(w)
  }
  print("You have asked for an undefined kernel.")
  return(NULL)
}
```
## Example: Old Faithful Geyser Data
```{r}
x <- seq(0,6,0.02)
m <- length(x)
n <- length(faithful$eruptions)
h <- .1
fhat <- rep(0,m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h)*my.w((faithful$eruptions[j]-x[i])/h)
  }
  fhat[i] <- (1/n)*S
}
```
## Example: Old Faithful Geyser Data
```{r}
plot(x, fhat, type="l", xlab="Eruption length", ylab="Density Estimate",
     main="Gaussian Kernel Density Estimator")
```
## Bandwidth selection
- Cross-validation, which could be time consuming
- Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is $1.06\sigma / n^{1/5}$
- Called the **Gaussian reference rule** or the **rule-of-thumb** bandwidth
- When you call `density` in R, its default bandwidth (`bw = "nrd0"`, Silverman's rule of thumb) is a close variant of this
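## Example: Rule-of-Thumb Bandwidth

- A sketch computing the rule-of-thumb bandwidth by hand for the eruption data, compared with `bw.nrd()` (R's version, which replaces $\hat\sigma$ by $\min(\hat\sigma, IQR/1.34)$ to guard against heavy tails):

```{r}
x <- faithful$eruptions
n <- length(x)
h.rot <- 1.06 * sd(x) * n^(-1/5)   # Gaussian reference rule
c(by.hand = h.rot, bw.nrd = bw.nrd(x))
plot(density(x, bw = h.rot), main = "KDE with Rule-of-Thumb Bandwidth")
```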
## Kernel density estimate samples
- There are times when one wants to draw a random sample from the estimated distribution
- Easy with kernel density estimates, because each kernel is itself a probability density
- Suppose the kernel is Gaussian, that we have scalar observations $x_1, \dots, x_n$ and bandwidth $h$
1. Pick an integer uniformly at random from $1$ to $n$
2. Use `rnorm(1,x[i],h)`, or `rnorm(q,sample(x,q,replace=TRUE),h)` for $q$ draws
- Using a different kernel, we’d just need to use the random number generator function for the corresponding distribution
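## Example: Sampling from a KDE

- A minimal sketch of the two-step sampler with a Gaussian kernel on the eruption data:

```{r}
set.seed(42)
x <- faithful$eruptions
h <- 1.06 * sd(x) * length(x)^(-1/5)   # rule-of-thumb bandwidth
q <- 1000
# step 1: pick kernel centers at random; step 2: draw from each Gaussian kernel
draws <- rnorm(q, mean = sample(x, q, replace = TRUE), sd = h)
hist(draws, breaks = 30, main = "Samples from the KDE")
```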
## Other Approaches
- Histograms and kernels are not the only possible way of estimating densities
- Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc
- For some of these, negative density estimates can be avoided by estimating the log density instead
## Density estimation in R
- `density()` function is the most common
```{r, eval=FALSE}
density(x, ...)
## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
kernel = c("gaussian", "epanechnikov", "rectangular",
"triangular", "biweight",
"cosine", "optcosine"),
weights = NULL, window = kernel, width,
give.Rkern = FALSE,
n = 512, from, to, cut = 3, na.rm = FALSE, ...)
```
- `ASH` and `KernSmooth` are both fast, accurate, and well-maintained (Deng and Wickham, 2011)
## Bivariate density estimation
- To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
- The `bin2d` function below will bin a bivariate data set
```{r}
bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
  histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
  histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
  brx <- histg1$breaks
  bry <- histg2$breaks
  freq <- table(cut(x[,1], brx), cut(x[,2], bry))
  return(list(call = match.call(), freq = freq,
              breaks1 = brx, breaks2 = bry,
              mids1 = histg1$mids, mids2 = histg2$mids))
}
```
## Bivariate density estimation
- The following example computes the bivariate frequency table
- After binning the data, the `persp` function plots the density histogram
```{r}
data(iris)
fit1 <- bin2d(iris[1:50, 1:2])
persp(x=fit1$mids1, y=fit1$mids2, z=fit1$freq, shade=TRUE,
      theta=45, phi=30, ltheta=60)
```
## Bivariate kernel methods
- Suppose the data are $X_1, \dots, X_n$, where each $X_i\in \mathbb{R}^2$
- Kernel density estimates can be extended to a multivariate (here bivariate, $d=2$) setting
- Let $K(\cdot)$ be a bivariate kernel (typically a bivariate density function); then the bivariate kernel density estimate is
\[
\hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right)
\]
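## Bivariate kernel methods

- A direct sketch of the formula with $d=2$ and a product Gaussian kernel (`kde2.point` is an illustrative helper; `MASS::kde2d` evaluates the estimate on a grid, more efficiently):

```{r}
# evaluate the bivariate KDE at a single point pt, with bandwidth h
kde2.point <- function(pt, X, h) {
  u1 <- (pt[1] - X[, 1]) / h
  u2 <- (pt[2] - X[, 2]) / h
  mean(dnorm(u1) * dnorm(u2)) / h^2   # (1/(n h^d)) sum K((x - X_i)/h), d = 2
}
X <- cbind(faithful$eruptions, faithful$waiting)
kde2.point(c(4, 80), X, h = 1)
```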
## Example: Bivariate normal
- Estimate the bivariate density when the data are generated from a three-component mixture model with common covariance $\Sigma=I_2$ and different means
\[
\mu_1=(0,0), \quad \mu_2=(1,3), \quad \mu_3=(4,-1).
\]
- Mixture probabilities are $p=(0.2, 0.3, 0.5)$
## Example: Bivariate normal
```{r}
library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)
k <- table(i)
x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <- rbind(x1, x2, x3) #the mixture data
x <- X[,1]
y <- X[,2]
fhat <- kde2d(x, y, h=c(1.87, 1.84))
```
## Example: Bivariate normal
```{r}
par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")
par(mfrow = c(1, 1))
```
## Summary
- There are over 20 R packages that perform density estimation (Deng and Wickham, 2011)
- Kernel density estimation is the most common approach
- Density estimation can also be parametric, where the data are assumed to come from a known family of distributions
- Bayesian approaches exist as well
## Discussion
- How can we use cross-validation to choose the bandwidth?