Agenda

  • Histograms
  • Glivenko-Cantelli theorem
  • Error for density estimates
  • Kernel density estimates
  • Bivariate density estimates

Histograms

  • Histograms are one of the first things learned in "Introduction to Statistics"
  • Simple way of estimating a distribution
    • Split the sample space up into bins
    • Count how many samples fall into each bin
  • If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin’s probability

Example: Old Faithful Geyser Data

data(faithful); x0 <- 0; x1 <- 8; h <- .5
my.breaks <- seq(from=x0, to=x1, by=h)
myhist <- hist(faithful$eruptions, breaks=my.breaks, right=F, plot=F)
plot(myhist$mids, myhist$density, type="s", xlab="Eruption length",
     ylab="Density", main="Histogram of Eruption Lengths")

Histograms

  • What about a density function?
  • Could take our histogram estimate and say that the probability density is uniform within each bin
  • Gives us a piecewise-constant estimate of the density
  • Problem: Will not converge on the true density – unless we shrink the bins as we get more and more data
  • Bias-variance trade-off
    • With a large number of very small bins, the bias in our estimate of the density becomes small
    • But the variance of the estimate grows as the bins shrink

Example: Old Faithful Geyser Data

Histograms

Bin width primarily controls the amount of smoothing, and there is lots of guidance available; a short R sketch applying these rules to the Old Faithful data follows the list

  1. Sturges' rule: Optimal width of class intervals is given by \[ \frac{R}{1+\log_2 n} \] where \(R\) is the sample range – Designed for data sampled from symmetric, unimodal populations
  2. Scott's Normal reference rule: Specifies a bin width \[ 3.49\hat\sigma n^{-1/3} \] where \(\hat \sigma\) is an estimate of the population standard deviation \(\sigma\)
  3. Freedman-Diaconis rule: Specifies the bin width to be \[ 2(IQR)n^{-1/3} \] where the \(IQR\) is the sample inter-quartile range
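
As promised above, here is a short sketch applying the three rules to the Old Faithful eruption lengths. Base R's nclass.Sturges, nclass.scott, and nclass.FD return suggested bin counts rather than widths, so the widths are computed directly from the formulas here.

# Sketch: the three rule-of-thumb bin widths for the eruption data
x <- faithful$eruptions
n <- length(x)
R <- diff(range(x))                       # sample range
c(Sturges = R / (1 + log2(n)),
  Scott   = 3.49 * sd(x) * n^(-1/3),
  FD      = 2 * IQR(x) * n^(-1/3))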

Histograms

  • Is learning the whole distribution non-parametrically even feasible?
  • How can we measure error to deal with the bias-variance trade-off?

Empirical CDF

  • Learning the whole distribution is feasible
  • Something even dumber than shrinking histograms will work
  • Suppose we have one-dimensional samples \(x_1, \dots, x_n\) with CDF \(F\)
  • Define the empirical cumulative distribution function on \(n\) samples as \[ \hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a) \]
  • Just the fraction of the samples which are less than or equal to \(a\)
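
A minimal sketch of this definition, checked against base R's ecdf():

# Sketch: the empirical CDF by hand, compared with the built-in ecdf()
x <- faithful$eruptions
Fhat <- function(a) mean(x <= a)   # fraction of samples <= a
Fhat(3)                            # by-hand value at a = 3
ecdf(x)(3)                         # ecdf() returns a function; same value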

Glivenko-Cantelli theorem

  • Then the Glivenko-Cantelli theorem says \[ \sup_{a} | \hat{F}_n(a) - F(a) | \to 0 \quad \text{almost surely} \]
  • So the empirical CDF converges to the true CDF uniformly, i.e., the largest gap between the two goes to zero
  • Pitman (1979) calls this the “fundamental theorem of statistics”
  • Can learn distributions just by collecting enough data

Glivenko-Cantelli theorem

  • Can we use the empirical CDF to estimate a density?
  • Yes, but the empirical CDF is a step function, so it does not give a good density estimate
  • Usually we can expect to find some new samples between our old ones
  • So we want a non-zero density between our observations
  • Uniform distribution within each bin of a histogram doesn’t have this issue
  • Can we do better?

Error for density estimates

  • Yes, but what do we mean by “better” density estimates?
  • Three ideas:
    1. Squared deviation from the true density should be small \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx \]
    2. Absolute deviation from the true density should be small \[ \int \left| f(x) - \hat{f}(x) \right| dx \]
    3. Average log-likelihood ratio should be kept low \[ \int f(x) \log \frac{f(x)}{\hat{f}(x)} dx \]

Error for density estimates

  • Squared deviation is similar to MSE criterion used in regression
    • Used most frequently since it's mathematically tractable
  • Absolute deviation considers \(L_1\) or total variation distance between the true and the estimated density
    • Nice property that \(\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx\) is exactly the maximum error in our estimate of the probability of any set
    • But it’s tricky to work with, so we’ll skip it
  • Minimizing the average log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing the cross-entropy
    • This quantity is called the Kullback-Leibler divergence or relative entropy
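
As a small numerical illustration, the three measures can be computed by quadrature; the two normal densities here are arbitrary stand-ins for a true and an estimated density.

# Sketch: the three discrepancy measures for two normal densities
# (the range is kept finite so both densities stay strictly positive)
f    <- function(x) dnorm(x, mean = 0,   sd = 1)
fhat <- function(x) dnorm(x, mean = 0.5, sd = 1.2)
integrate(function(x) (f(x) - fhat(x))^2,        -10, 10)$value  # squared deviation
integrate(function(x) abs(f(x) - fhat(x)),       -10, 10)$value  # absolute (L1) deviation
integrate(function(x) f(x) * log(f(x)/fhat(x)),  -10, 10)$value  # Kullback-Leibler divergence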

Error for density estimates

  • Notice that \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx \]
  • First term doesn’t depend on the estimate, so we can ignore it for purposes of optimization
  • Third term only involves \(\hat{f}(x)\), and is just an integral, which we can do numerically
  • Second term involves both the true and the estimated density; since the \(x_i\) are samples from \(f\), we can approximate \(2 \int f(x) \hat{f}(x) dx\) by Monte Carlo as \[ \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) \]

Error for density estimates

  • Then our error measure, dropping the constant first term, is \[ \int \hat{f}^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) \]
  • In fact, this error measure does not depend on having one-dimensional data
  • For purposes of cross-validation, we can estimate \(\hat{f}(x)\) on the training set and then restrict the sum to points in the testing set
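
A minimal sketch of this idea for the histogram estimates above, scoring a few bin widths with a simple 50/50 train/test split; hist.score and the particular bin widths are illustrative choices, not a standard function.

# Sketch: the error score applied to histogram density estimates of the
# eruption data (smaller is better, up to the ignored constant term)
hist.score <- function(train, test, h, x0 = 0, x1 = 8) {
    brks <- seq(x0, x1, by = h)
    hg <- hist(train, breaks = brks, plot = FALSE)
    fhat <- function(x) hg$density[findInterval(x, brks, all.inside = TRUE)]
    sum(h * hg$density^2) - 2 * mean(fhat(test))   # int fhat^2 - (2/n) sum fhat
}

set.seed(1)
x <- faithful$eruptions
idx <- sample(length(x), size = length(x)/2)
sapply(c(0.1, 0.25, 0.5, 1), function(h) hist.score(x[idx], x[-idx], h))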

Naive estimator

  • If a random variable \(X\) has probability density \(f\), then \[ f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h) \]
  • Thus, a naive estimator would be \[ \widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right] \]

Naive estimator

  • Or, equivalently \[ \widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right) \] where \(w\) is a weight function defined as \[ w(x)=\begin{cases} 1/2 \quad |x| <1 \\ 0 \quad \text{ otherwise} \end{cases} \]
  • In short, a naive estimate is constructed by placing a box of width \(2h\) and height \(\frac{1}{2nh}\) on each observation, then summing to obtain the estimate

Example: Old Faithful Geyser Data

# weight function for the naive estimator: 1/2 on (-1, 1), 0 otherwise
my.w <- function(x){
    if (abs(x) < 1)
        w <- 1/2
    else
        w <- 0
    w
}
x <- seq(0, 6, 0.2)                # grid at which to evaluate the estimate
m <- length(x)
n <- length(faithful$eruptions)
h <- .5                            # window half-width
fhat <- rep(0, m)
for (i in 1:m){                    # naive estimate at each grid point
    S <- 0
    for (j in 1:n){
        S <- S + (1/h)*my.w((faithful$eruptions[j] - x[i])/h)
    }
    fhat[i] <- (1/n)*S
}

Example: Old Faithful Geyser Data

plot(x,fhat, type="s", xlab="Eruption length", ylab="Density Estimate", 
     main="Naive Density Estimator")

Naive estimator

  • Not wholly satisfactory, from the point of view of using density estimates for presentation
  • Estimate \(\widehat f\) is a step function
  • In the formula for the naive estimate, we can replace the weight function \(w\) by another function \(K\) with more desirable properties
  • Function \(K\) is called a kernel

Kernel density estimates

  • Resulting estimate is a kernel estimator: \[ \hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right). \]
  • \(h\) is the window width, smoothing parameter, or bandwidth
  • Usually the kernel \(K\) is taken to be a probability density function itself (e.g., the normal density)
  • Resulting estimate will inherit all the smoothness properties of \(K\)

Kernel density estimates

Most popular choices for the kernel \(K\) are

Family         Kernel
Gaussian       \(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\)
Rectangular    \(K(t)=\frac{1}{2}\) for \(|t|<1\), \(0\) otherwise
Triangular     \(K(t)=1-|t|\) for \(|t|<1\), \(0\) otherwise
Epanechnikov   \(K(t)=\frac{3}{4\sqrt{5}}\left(1-\frac{t^2}{5}\right)\) for \(|t|<\sqrt{5}\), \(0\) otherwise

Kernel density estimates

# weight function: Gaussian kernel by default, or the naive (rectangular) one
my.w <- function(x, type="gaussian"){
    if (type == "gaussian") {
        w <- dnorm(x)              # standard normal density as the kernel
        return(w)
    }
    if (type == "naive") {
        if (abs(x) < 1)
            w <- 1/2
        else
            w <- 0
        return(w)
    }
    print("You have asked for an undefined kernel.")
    return(NULL)
}

Example: Old Faithful Geyser Data

x <- seq(0, 6, 0.02)               # finer grid than before
m <- length(x)
n <- length(faithful$eruptions)
h <- .1                            # bandwidth
fhat <- rep(0, m)
for (i in 1:m){                    # Gaussian-kernel estimate at each grid point
    S <- 0
    for (j in 1:n){
        S <- S + (1/h)*my.w((faithful$eruptions[j] - x[i])/h)
    }
    fhat[i] <- (1/n)*S
}

Example: Old Faithful Geyser Data

plot(x, fhat, type="l", xlab="Eruption length", ylab="Density Estimate",
     main="Gaussian Kernel Density Estimate")

Bandwidth selection

  • Cross-validation, which could be time consuming
  • Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is \(1.06\sigma / n^{1/5}\)
  • Called the Gaussian reference rule or the rule-of-thumb bandwidth
  • When you call density() in R, this is essentially what the default bandwidth rule does (see the sketch below)
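
The sketch below computes the reference rule by hand for the eruption data, next to R's built-in reference-rule bandwidths; bw.nrd uses the 1.06 factor with a robust scale estimate, while bw.nrd0, the density() default, uses a 0.9 factor.

# Sketch: Gaussian reference-rule bandwidths for the eruption data
x <- faithful$eruptions
n <- length(x)
1.06 * sd(x) * n^(-1/5)   # rule of thumb exactly as stated above
bw.nrd(x)                 # R's version of the 1.06 rule
bw.nrd0(x)                # Silverman's variant, the default for density()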

Kernel density estimate samples

  • There are times when one wants to draw a random sample from the estimated distribution
  • Easy with kernel density estimates, because each kernel is itself a probability density
  • Suppose the kernel is Gaussian, that we have scalar observations \(x_1, \dots, x_n\) and bandwidth \(h\)
    1. Pick an index \(i\) uniformly at random from \(1\) to \(n\)
    2. Draw from the kernel centered at \(x_i\): rnorm(1, x[i], h), or rnorm(q, sample(x, q, replace=TRUE), h) for \(q\) draws at once
  • Using a different kernel, we’d just need to use the random number generator function for the corresponding distribution
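
A minimal sketch for the eruption data, using the default bandwidth rule; the variable names are just for illustration.

# Sketch: drawing q new values from a Gaussian KDE of the eruption lengths
x <- faithful$eruptions
h <- bw.nrd0(x)                     # same default bandwidth density() uses
q <- 1000
new.x <- rnorm(q, mean = sample(x, q, replace = TRUE), sd = h)
hist(new.x, breaks = 30, main = "Draws from the kernel density estimate")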

Other Approaches

  • Histograms and kernels are not the only possible way of estimating densities
  • Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc
  • For some of these, negative density estimates can be avoided by working on the log-density scale

Density estimation in R

  • The density() function is the most common; a short usage sketch follows the signature below
density(x, ...)
## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight",
                   "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE,
        n = 512, from, to, cut = 3, na.rm = FALSE, ...)
  • ASH and KernSmooth are both fast, accurate, and well-maintained (Deng and Wickham, 2011)
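
A minimal usage sketch on the eruption data, with the defaults shown in the signature above:

# Sketch: density() on the eruption lengths with its default settings
d <- density(faithful$eruptions)    # Gaussian kernel, bandwidth from bw.nrd0
plot(d, main = "density() estimate of eruption lengths")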

Bivariate density estimation

  • To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
  • The bin2d function defined below will bin a bivariate data set
bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
    # marginal histograms give the break points for each dimension
    histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
    histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
    brx <- histg1$breaks
    bry <- histg2$breaks
    # two-way frequency table over the grid of bins
    freq <- table(cut(x[,1], brx), cut(x[,2], bry))

    return(list(call = match.call(), freq = freq,
                breaks1 = brx, breaks2 = bry,
                mids1 = histg1$mids, mids2 = histg2$mids))
}

Bivariate density estimation

  • Following example computes the bivariate frequency table
  • After binning the data, the persp function plots the density histogram
data(iris)
fit1 <- bin2d(iris[1:50, 1:2])
persp(x=fit1$mids1, y=fit1$mids2, z=fit1$freq, shade=TRUE, theta=45, phi=30, ltheta=60)

Bivariate kernel methods

  • Suppose the data is \(X_1, \dots, X_n\), where each \(X_i\in \mathbb{R}^2\)
  • Kernel density estimates can be extended to a multivariate (bivariate) setting
  • Let \(K(\cdot)\) be a bivariate kernel (typically a bivariate density function); then the bivariate kernel density estimate is \[ \hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right) \] where \(d\) is the dimension of the data (here \(d=2\))

Example: Bivariate normal

  • Estimate the bivariate density when the data is generated from a mixture model with three components with identical covariance \(\Sigma=I_2\) and different means \[ \mu_1=(0,0),\quad\mu_2=(1,3),\quad\mu_3=(4,-1). \]
  • Mixture probabilities are \(p=(0.2, 0.3, 0.5)\)

Example: Bivariate normal

library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)   # component label for each draw
k <- table(i)                                          # draws per component

x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <- rbind(x1, x2, x3)    # the mixture data
x <- X[,1]
y <- X[,2]

fhat <- kde2d(x, y, h = c(1.87, 1.84))    # bivariate KDE from MASS

Example: Bivariate normal

par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")

par(mfrow = c(1, 1))

Summary

  • There are over 20 R packages that perform density estimation (Deng and Wickham, 2011)
  • Kernel density estimation is the most common approach
  • Density estimation can also be parametric, where the data is assumed to come from a known family (see the sketch below)
  • Bayesian approaches are also available
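
For contrast with the nonparametric estimates above, here is a minimal parametric sketch using MASS::fitdistr; fitting a single normal to the (bimodal) waiting times is deliberately crude.

# Sketch: parametric density estimation by fitting a single named family
library(MASS)
fit <- fitdistr(faithful$waiting, "normal")   # a poor family for bimodal data
fit$estimate                                  # fitted mean and sd
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), from = 40, to = 100,
      xlab = "Waiting time", ylab = "Fitted normal density")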

Discussion

  • How can we use cross-validation to choose the bandwidth?
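
One concrete answer, sketched here: base R's bw.ucv() implements least-squares (unbiased) cross-validation, which targets essentially the integrated squared-error criterion discussed earlier.

# Sketch: cross-validated bandwidth via least-squares (unbiased) CV
x <- faithful$eruptions
h.ucv <- bw.ucv(x)                  # bandwidth minimizing the UCV criterion
plot(density(x, bw = h.ucv),
     main = "KDE with cross-validated bandwidth")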