Density Estimation

Agenda

Histograms
Glivenko-Cantelli theorem
Error for density estimates
Kernel density estimates
Bivariate density estimates

Histograms

Histograms are one of the first things learned in "Introduction to Statistics"
Simple way of estimating a distribution
- Split the sample space up into bins
- Count how many samples fall into each bin
If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin’s probability

Example: Old Faithful Geyser Data

data(faithful); x0 <- 0; x1 <- 8; h <- .5
my.breaks<-seq(from=x0, to=x1, by=h)
myhist<-hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length", 
     ylab="Frequency", main="Histogram of Eruption Lengths")

Histograms

What about a density function?
Could take our histogram estimate and say that the probability density is uniform within each bin
Gives us a piecewise-constant estimate of the density
Problem: Will not converge on the true density – unless we shrink the bins as we get more and more data
Bias-variance trade-off
- A large number of very small bins, the minimum bias in our estimate of any density becomes small
- But the variance grows for very small bins

Example: Old Faithful Geyser Data

Histograms

Bin width primarily controls the amount of smoothing, lots of guidance available

Sturges' rule: Optimal width of class intervals is given by \[ \frac{R}{1+\log_2 n} \] where $R$ is the sample range – Designed for data sampled from symmetric, unimodal populations
Scott's Normal reference rule: Specifies a bin width \[ 3.49\hat\sigma n^{-1/3} \] where $\hat \sigma$ is an estimate of the population standard deviation $\sigma$
Freedman-Diaconis rule: Specifies the bin width to be \[ 2(IQR)n^{-1/3} \] where the $IQR$ is the sample inter-quartile range

Histograms

Is learning the whole distribution non-parametrically even feasible?
How can we measure error to deal with the bias-variance trade-off?

Empirical CDF

Learning the whole distribution is feasible
Something even dumber than shrinking histograms will work
Suppose we have one-dimensional samples $x_1, \dots, x_n$ with CDF $F$
Define the empirical cumulative distribution function on $n$ samples as \[ \hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a) \]
Just the fraction of the samples which are less than or equal to $a$

Glivenko-Cantelli theorem

Then the Glivenko-Cantelli theorem says \[ \max_{a} | \hat{F}_n(a) - F(a) | \to 0 \]
So the empirical CDF converges to the true CDF everywhere, i.e. the maximum gap between the two of them goes to zero
Pitman (1979) calls this the “fundamental theorem of statistics”
Can learn distributions just by collecting enough data

Glivenko-Cantelli theorem

Can we use the empirical CDF to estimate a density?
Yes, but it's discrete and doesn't estimate a density well
Usually we can expect to find some new samples between our old ones
So we want a non-zero density between our observations
Uniform distribution within each bin of a histogram doesn’t have this issue
Can we do better?

Error for density estimates

Yes, but what do we mean by “better” density estimates?
Three ideas:
1. Squared deviation from the true density should be small \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx \]
2. Absolute deviation from the true density should be small \[ \int \left| f(x) - \hat{f}(x) \right| dx \]
3. Average log-likelihood ratio should be kept low \[ \int f(x) \log \frac{f(x)}{\hat{f}(x)} dx \]

Error for density estimates

Squared deviation is similar to MSE criterion used in regression
- Used most frequently since it's mathematically tractable
Absolute deviation considers $L_1$ or total variation distance between the true and the estimated density
- Nice property that $\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx$ is exactly the maximum error in our estimate of the probability of any set
- But it’s tricky to work with, so we’ll skip it
Minimizing the log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing entropy
- Called Kullback-Leibler divergence or relative entropy

Error for density estimates

Notice that \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx \]
First term doesn’t depend on the estimate, so we can ignore it for purposes of optimization
Third term only involves $\hat{f}(x)$, and is just an integral, which we can do numerically
Second term involves both the true and the estimated density; we can approximate it using Monte Carlo by \[ \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) \]

Error for density estimates

Then our error measure is \[ \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) + \int \hat{f}^2(x) dx \]
In fact, this error measure does not depend on having one-dimension data
For purposes of cross-validation, we can estimate $\hat{f}(x)$ on the training set and then restrict the sum to points in the testing set

Naive estimator

If a random variable $X$ has probability density $f$, then \[ f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h) \]
Thus, a naive estimator would be \[ \widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right] \]

Naive estimator

Or, equivalently \[ \widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right) \] where $w$ is a weight function defined as \[ w(x)=\begin{cases} 1/2 \quad |x| <1 \\ 0 \quad \text{ otherwise} \end{cases} \]
In short, a naive estimate is constructed by placing a box of width $2h$ and height $\frac{1}{2nh}$ on each observation, then summing to obtain the estimate

Example: Old Faithful Geyser Data

my.w<-function(x){
    if (abs(x) < 1) 
        w <- 1/2
    else
        w <- 0
    w
    }
x <- seq(0,6,0.2)
m <- length(x)
n <- length(faithful$eruptions)
h <- .5
fhat <- rep(0,m)
for (i in 1:m){
    S <- 0
    for (j in 1:n){
        S <- S+(1/h)*my.w((faithful$eruptions[j]-x[i])/h)
        }
    fhat[i] <- (1/n)*S
    }

Example: Old Faithful Geyser Data

plot(x,fhat, type="s", xlab="Eruption length", ylab="Density Estimate", 
     main="Naive Density Estimator")

Naive estimator

Not wholly satisfactory, from the point of view of using density estimates for presentation
Estimate $\widehat f$ is a step function
In the formula for the naive estimate, we can replace the weight function $w$ by another function $K$ with more desirable properties
Function $K$ is called a kernel

Kernel density estimates

Resulting estimate is a kernel estimator: \[ \hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right). \]
$h$ is the window width, smoothing parameter, or bandwidth
Usually the kernel $K$ is taken to be a probability density function itself (i.e., normal density)
Resulting estimate will inherit all the smoothness properties of $K$

Kernel density estimates

Most popular choices for the kernel $K$ are

Family	Kernel
Gaussian	$K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$
Rectangular	$K(t) = 1/2$ for $
Triangular	$K(t) = 1-
Epanechnikov	$K(t)=\frac{3}{4}(1-(1/5) t^2)$ for $

Kernel density estimates

my.w<-function(x, type="gaussian"){
if(type=="gaussian"){
  w <- dnorm(x)
return(w)
}
if(type=="naive"){
  if (abs(x) < 1) 
        w <- 1/2
    else
        w <- 0
return(w)
}
print("You have asked for an undefined kernel.")
return(NULL)
}

Example: Old Faithful Geyser Data

x <- seq(0,6,0.02)
m <- length(x)
n <- length(faithful$eruptions)
h <- .1
fhat <- rep(0,m)
for (i in 1:m){
    S <- 0
    for (j in 1:n){
        S <- S+(1/h)*my.w((faithful$eruptions[j]-x[i])/h)
        }
    fhat[i] <- (1/n)*S
    }

Example: Old Faithful Geyser Data

plot(x,fhat, type="l", xlab="Eruption length", ylab="Density Estimate", 
     main="Naive Density Estimator")

Bandwidth selection

Cross-validation, which could be time consuming
Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is $1.06\sigma / n^{1/5}$
Called the Gaussian reference rule or the rule-of-thumb bandwidth
When you call density in R, this is basically what it does

Kernel density estimate samples

There are times when one wants to draw a random sample from the estimated distribution
Easy with kernel density estimates, because each kernel is itself a probability density
Suppose the kernel is Gaussian, that we have scalar observations $x_1, \dots, x_n$ and bandwidth $h$
1. Pick an integer uniformly at random from $1$ to $n$
2. Use rnorm(1,x[i],h), or rnorm(q,sample(x,q,replace=TRUE),h) for $q$ draws
Using a different kernel, we’d just need to use the random number generator function for the corresponding distribution

Other Approaches

Histograms and kernels are not the only possible way of estimating densities
Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc
For some of these, avoid negative probability density estimates using the log density

Density estimation in R

density() function is the most common

density(x, ...)
## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight",
                   "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE,
        n = 512, from, to, cut = 3, na.rm = FALSE, ...)

ASH and KernSmooth are both fast, accurate, and well-maintained (Deng and Wickham, 2011)

Bivariate density estimation

To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
Can use bin2d function in R will bin a bivariate data set

bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
      histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
      histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
      brx <- histg1$breaks
      bry <- histg2$breaks
      freq <- table(cut(x[,1], brx),  cut(x[,2], bry))

      return(list(call = match.call(), freq = freq,
                breaks1 = brx, breaks2 = bry,
                mids1 = histg1$mids, mids2 = histg2$mids))
      }

Bivariate density estimation

Following example computes the bivariate frequency table
After binning the data, the persp function plots the density histogram

data(iris)
fit1=bin2d(iris[1:50, 1:2])
persp(x=fit1$mids1, y=fit1$mids2, z=fit1$freq, shade=T, theta=45, phi=30, ltheta=60)

Bivariate kernel methods

Suppose the data is $X_1, \dots, X_n$, where each $X_i\in \mathbb{R}^2$
Kernel density estimates can be extended to a multivariate (bivariate) setting
Let $K(\cdot)$ be a bivariate kernel (typically a bivariate density function), then bivariate kernel density estimate is \[ \hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right) \]

Example: Bivariate normal

Estimate the bivariate density when the data is generated from a mixture model with three components with identical covariance $\Sigma=I_2$ and different means \[ \mu_1=(0,0)\quad\mu_2=(1,3),\quad\mu_3=(4,-1). \]
Mixture probabilities are $p=(0.2, 0.3, 0.5)$

Example: Bivariate normal

library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)
k <- table(i)

x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <-  rbind(x1, x2, x3)   #the mixture data
x <- X[,1]
y <- X[,2]
    
fhat <- kde2d(x, y, h=c(1.87, 1.84))

Example: Bivariate normal

par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")

par(mfrow = c(1, 1))

Summary

Over 20 packages that perform density estimation (Deng and Wickham, 2011)
Kernel density estimation is the most common approach
Density estimation can be parametric, where the data is from a known family
Bayesian approaches are also available

Discussion

How to use cross validation to choose the bandwidth?

Family	Kernel
Gaussian	\(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\)
Rectangular	\(K(t) = 1/2\) for $
Triangular	$K(t) = 1-
Epanechnikov	\(K(t)=\frac{3}{4}(1-(1/5) t^2)\) for $