---
title: 'Density Estimation'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---

## Agenda

- Histograms
- Glivenko-Cantelli theorem
- Error for density estimates
- Kernel density estimates
- Bivariate density estimates

## Histograms

- Histograms are one of the first things learned in "Introduction to Statistics"
- Simple way of estimating a distribution
    + Split the sample space up into bins
    + Count how many samples fall into each bin
- If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin's probability

## Example: Old Faithful Geyser Data

```{r}
data(faithful); x0 <- 0; x1 <- 8; h <- .5
my.breaks <- seq(from=x0, to=x1, by=h)
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Histogram of Eruption Lengths")
```

## Histograms

- What about a density function?
- Could take our histogram estimate and say that the probability density is uniform within each bin
- Gives us a piecewise-constant estimate of the density
- ***Problem***: Will not converge on the true density -- unless we shrink the bins as we get more and more data
- Bias-variance trade-off
    + With a large number of very small bins, the minimum bias in our estimate of any density becomes small
    + But the variance grows for very small bins

## Example: Old Faithful Geyser Data

```{r, echo=FALSE}
par(mfrow = c(1, 2))
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Large Bin Width")
h <- .1
my.breaks <- seq(from=x0, to=x1, by=h)
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Small Bin Width")
par(mfrow = c(1, 1))
```

## Histograms

Bin width primarily controls the amount of smoothing; lots of guidance is available (the three rules below are compared on the next slide)

1. **Sturges' rule**: Optimal width of class intervals is given by
    \[
    \frac{R}{1+\log_2 n}
    \]
    where $R$ is the sample range -- designed for data sampled from symmetric, unimodal populations
2. **Scott's normal reference rule**: Specifies a bin width of
    \[
    3.49 \hat\sigma n^{-1/3}
    \]
    where $\hat \sigma$ is an estimate of the population standard deviation $\sigma$
3. **Freedman-Diaconis rule**: Specifies the bin width to be
    \[
    2 (IQR) n^{-1/3}
    \]
    where $IQR$ is the sample inter-quartile range
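
## Example: Comparing Bin Width Rules

As a quick sanity check, the three rules can be computed directly for the eruption lengths; note that each value is a bin *width*, not a bin count, and the object names below are ad hoc.

```{r}
x <- faithful$eruptions
n <- length(x)
h.sturges <- diff(range(x)) / (1 + log2(n))  # Sturges: range / (1 + log2(n))
h.scott   <- 3.49 * sd(x) * n^(-1/3)         # Scott's normal reference rule
h.fd      <- 2 * IQR(x) * n^(-1/3)           # Freedman-Diaconis
c(Sturges = h.sturges, Scott = h.scott, FD = h.fd)
```

All three give bin widths in the same ballpark as the $h = 0.5$ used in the first histogram above.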

## Histograms

- Is learning the whole distribution non-parametrically even feasible?
- How can we measure error to deal with the bias-variance trade-off?

## Empirical CDF

- Learning the whole distribution is ***feasible***
- Something even dumber than shrinking histograms will work
- Suppose we have one-dimensional samples $x_1, \dots, x_n$ with CDF $F$
- Define the empirical cumulative distribution function on $n$ samples as
\[
\hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a)
\]
- Just the fraction of the samples which are less than or equal to $a$

## Glivenko-Cantelli theorem

- Then the ***Glivenko-Cantelli theorem*** says
\[
\max_{a} | \hat{F}_n(a) - F(a) | \to 0
\]
- So the empirical CDF converges to the true CDF ***everywhere***, i.e. the maximum gap between the two of them goes to zero
- Pitman (1979) calls this the "fundamental theorem of statistics"
- Can learn distributions just by collecting enough data

## Glivenko-Cantelli theorem

- Can we use the empirical CDF to estimate a density?
- Yes, but it corresponds to a discrete distribution, so it makes a poor density estimate
- Usually we can expect to find some new samples between our old ones
- So we want a non-zero density between our observations
- The uniform distribution within each bin of a histogram doesn't have this issue
- Can we do better?

## Error for density estimates

- Yes, but what do we mean by "better" density estimates?
- Three ideas:

1. Squared deviation from the true density should be small
    \[
    \int \left( f(x) - \hat{f}(x) \right)^2 dx
    \]
2. Absolute deviation from the true density should be small
    \[
    \int \left| f(x) - \hat{f}(x) \right| dx
    \]
3. Average log-likelihood ratio should be kept low
    \[
    \int f(x) \log \frac{f(x)}{\hat{f}(x)} dx
    \]

## Error for density estimates

- Squared deviation is similar to the MSE criterion used in regression
    + Used most frequently since it's mathematically tractable
- Absolute deviation is the $L_1$ or total variation distance between the true and the estimated density
    + Nice property that $\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx$ is exactly the maximum error in our estimate of the probability of any set
    + But it's tricky to work with, so we'll skip it
- Minimizing the log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing entropy
    + Called the Kullback-Leibler divergence or relative entropy

## Error for density estimates

- Notice that
\[
\int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx
\]
- First term doesn't depend on the estimate, so we can ignore it for purposes of optimization
- Third term only involves $\hat{f}(x)$, and is just an integral, which we can do numerically
- Second term involves both the true and the estimated density; we can approximate it using Monte Carlo by
\[
\frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]

## Error for density estimates

- Then our error measure is
\[
\int \hat{f}^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i)
\]
- In fact, this error measure does not depend on having one-dimensional data
- For purposes of cross-validation, we can estimate $\hat{f}(x)$ on the training set and then restrict the sum to points in the testing set

## Naive estimator

- If a random variable $X$ has probability density $f$, then
\[
f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h)
\]
- Thus, a naive estimator would be
\[
\widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right]
\]

## Naive estimator

- Or, equivalently
\[
\widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right)
\]
where $w$ is a weight function defined as
\[
w(x)=\begin{cases} 1/2 & |x| < 1 \\ 0 & \text{otherwise} \end{cases}
\]
- In short, the naive estimate is constructed by placing a box of width $2h$ and height $\frac{1}{2nh}$ on each observation, then summing to obtain the estimate

## Example: Old Faithful Geyser Data

```{r}
my.w <- function(x){
  if (abs(x) < 1) w <- 1/2
  else w <- 0
  w
}
x <- seq(0, 6, 0.2)
m <- length(x)
n <- length(faithful$eruptions)
h <- .5
fhat <- rep(0, m)
# fhat[i] = (1/(2nh)) * #{observations within h of the grid point x[i]}
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

## Example: Old Faithful Geyser Data

```{r}
plot(x, fhat, type="s", xlab="Eruption length", ylab="Density Estimate",
     main="Naive Density Estimator")
```
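
## Example: Evaluating the Error

Recall the squared-error criterion $\int \hat{f}^2(x) dx - \frac{2}{m}\sum \hat{f}(x_i)$ from a few slides back -- a minimal sketch for the naive estimator, fitting on a random half of the data and summing over the $m$ held-out points (the split, the grid, and the object names here are arbitrary choices).

```{r}
set.seed(1)
eruptions <- faithful$eruptions
train <- sample(length(eruptions), length(eruptions) %/% 2)
# naive estimate at a point x0 from the training data
fhat.naive <- function(x0, data, h) mean(sapply((x0 - data) / h, my.w)) / h
sq.err <- function(h){
  grid <- seq(0, 8, by = 0.05)
  f.grid <- sapply(grid, fhat.naive, data = eruptions[train], h = h)
  f.test <- sapply(eruptions[-train], fhat.naive, data = eruptions[train], h = h)
  sum(f.grid^2) * 0.05 - 2 * mean(f.test)  # integral of fhat^2 minus (2/m) * sum
}
sapply(c(0.25, 0.5, 1), sq.err)
```

Smaller values are better; the dropped constant $\int f^2(x) dx$ is the same for every bandwidth, so it does not affect the comparison.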

## Naive estimator

- Not wholly satisfactory, from the point of view of using density estimates for presentation
- Estimate $\widehat f$ is a step function
- In the formula for the naive estimate, we can replace the weight function $w$ by another function $K$ with more desirable properties
- Function $K$ is called a **kernel**

## Kernel density estimates

- Resulting estimate is a **kernel estimator**:
$$
\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right).
$$
- $h$ is the **window width**, **smoothing parameter**, or **bandwidth**
- Usually the kernel $K$ is taken to be a probability density function itself (e.g., a normal density)
- Resulting estimate will inherit all the smoothness properties of $K$

## Kernel density estimates

Most popular choices for the kernel $K$ are

Family | Kernel
:------:|:------:
Gaussian | $K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$
Rectangular | $K(t) = 1/2$ for $|t|<1$
Triangular | $K(t) = 1-|t|$ for $|t|<1$
Epanechnikov | $K(t)=\frac{3}{4\sqrt{5}}\left(1-\frac{t^2}{5}\right)$ for $|t|<\sqrt{5}$

## Kernel density estimates

```{r}
# redefine my.w to allow either a Gaussian kernel or the naive (boxcar) weight
my.w <- function(x, type="gaussian"){
  if(type == "gaussian"){
    w <- dnorm(x)
    return(w)
  }
  if(type == "naive"){
    if (abs(x) < 1) w <- 1/2
    else w <- 0
    return(w)
  }
  print("You have asked for an undefined kernel.")
  return(NULL)
}
```

## Example: Old Faithful Geyser Data

```{r}
x <- seq(0, 6, 0.02)
m <- length(x)
n <- length(faithful$eruptions)
h <- .1
fhat <- rep(0, m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

## Example: Old Faithful Geyser Data

```{r}
plot(x, fhat, type="l", xlab="Eruption length", ylab="Density Estimate",
     main="Gaussian Kernel Density Estimator")
```

## Bandwidth selection

- Cross-validation, which could be time consuming
- Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is $1.06\sigma / n^{1/5}$
- Called the **Gaussian reference rule** or the **rule-of-thumb** bandwidth
- When you call `density` in R, the default bandwidth is a close variant of this rule

## Kernel density estimate samples

- There are times when one wants to draw a random sample from the estimated distribution
- Easy with kernel density estimates, because each kernel is itself a probability density
- Suppose the kernel is Gaussian, that we have scalar observations $x_1, \dots, x_n$, and bandwidth $h$

1. Pick an integer $i$ uniformly at random from $1$ to $n$
2. Use `rnorm(1,x[i],h)`, or `rnorm(q,sample(x,q,replace=TRUE),h)` for $q$ draws

- Using a different kernel, we'd just need to use the random number generator function for the corresponding distribution

## Other Approaches

- Histograms and kernels are not the only possible ways of estimating densities
- Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc.
- For some of these, negative probability density estimates are avoided by estimating the log density instead

## Density estimation in R

- The `density()` function is the most common

```{r, eval=FALSE}
density(x, ...)

## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight", "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE, n = 512, from, to, cut = 3, na.rm = FALSE, ...)
```

- `ASH` and `KernSmooth` are both fast, accurate, and well-maintained (Deng and Wickham, 2011)
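
## Example: Rule-of-Thumb Bandwidth

As a short usage example, the rule-of-thumb bandwidth from the bandwidth-selection slide can be passed straight to `density()`; `bw.nrd0()` is the default selector, shown only for comparison (object names here are ad hoc).

```{r}
x <- faithful$eruptions
h.rot <- 1.06 * sd(x) / length(x)^(1/5)   # Gaussian reference rule
c(rule.of.thumb = h.rot, default.nrd0 = bw.nrd0(x))
plot(density(x, bw = h.rot), main = "Gaussian kernel, rule-of-thumb bandwidth")
```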

## Bivariate density estimation

- To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
- The `bin2d` function below bins a bivariate data set

```{r}
bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
  histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
  histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
  brx <- histg1$breaks
  bry <- histg2$breaks
  freq <- table(cut(x[,1], brx), cut(x[,2], bry))
  return(list(call = match.call(), freq = freq, breaks1 = brx,
              breaks2 = bry, mids1 = histg1$mids, mids2 = histg2$mids))
}
```

## Bivariate density estimation

- Following example computes the bivariate frequency table
- After binning the data, the `persp` function plots the density histogram

```{r}
data(iris)
fit1 <- bin2d(iris[1:50, 1:2])
persp(x=fit1$mids1, y=fit1$mids2, z=fit1$freq,
      shade=T, theta=45, phi=30, ltheta=60)
```

## Bivariate kernel methods

- Suppose the data are $X_1, \dots, X_n$, where each $X_i\in \mathbb{R}^2$
- Kernel density estimates can be extended to a multivariate (here bivariate, $d=2$) setting
- Let $K(\cdot)$ be a bivariate kernel (typically a bivariate density function); then the bivariate kernel density estimate is
\[
\hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right)
\]

## Example: Bivariate normal

- Estimate the bivariate density when the data is generated from a mixture model with three components having common covariance $\Sigma=I_2$ and different means
\[
\mu_1=(0,0), \quad \mu_2=(1,3), \quad \mu_3=(4,-1)
\]
- Mixture probabilities are $p=(0.2, 0.3, 0.5)$

## Example: Bivariate normal

```{r}
library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)
k <- table(i)
x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <- rbind(x1, x2, x3)   # the mixture data
x <- X[,1]
y <- X[,2]
fhat <- kde2d(x, y, h=c(1.87, 1.84))
```

## Example: Bivariate normal

```{r}
par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")
par(mfrow = c(1, 1))
```

## Summary

- Over 20 R packages perform density estimation (Deng and Wickham, 2011)
- Kernel density estimation is the most common approach
- Density estimation can also be parametric, where the data is from a known family
- Bayesian approaches are also available

## Discussion

- How to use cross-validation to choose the bandwidth? (one possible sketch follows)
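
## Discussion: A Cross-Validation Sketch

One possible sketch, assuming a Gaussian kernel: for each candidate bandwidth, combine $\int \hat{f}^2(x) dx$ with the leave-one-out estimates $\hat{f}_{-i}(x_i)$, as in the squared-error criterion from earlier, and pick the minimizer (the grid and the name `lscv` below are arbitrary).

```{r}
x <- faithful$eruptions; n <- length(x)
# least-squares CV score: integral of fhat^2 minus (2/n) * sum of the
# leave-one-out estimates evaluated at the held-out points
lscv <- function(h){
  grid <- seq(0, 8, length.out = 512)
  fhat <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
  loo  <- sapply(seq_len(n), function(i) mean(dnorm((x[i] - x[-i]) / h)) / h)
  sum(fhat^2) * diff(grid[1:2]) - 2 * mean(loo)
}
hs <- seq(0.05, 0.6, by = 0.05)
hs[which.min(sapply(hs, lscv))]
```

`bw.ucv()` in the stats package automates this kind of least-squares cross-validation.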