- Histograms
- Glivenko-Cantelli theorem
- Error for density estimates
- Kernel density estimates
- Bivariate density estimates

- Histograms are one of the first things learned in "Introduction to Statistics"
- Simple way of estimating a distribution
- Split the sample space up into bins
- Count how many samples fall into each bin

- If we hold the bins fixed and take more and more data, then the relative frequency for each bin will converge on the bin’s probability

```r
data(faithful)
x0 <- 0; x1 <- 8; h <- .5
my.breaks <- seq(from = x0, to = x1, by = h)
myhist <- hist(faithful$eruptions, breaks = my.breaks, right = FALSE, plot = FALSE)
plot(myhist$mids, myhist$density, type = "s", xlab = "Eruption length",
     ylab = "Density", main = "Histogram of Eruption Lengths")
```

- What about a density function?
- Could take our histogram estimate and say that the probability density is uniform within each bin
- Gives us a piecewise-constant estimate of the density
- *Problem*: will not converge on the true density unless we shrink the bins as we get more and more data
- Bias-variance trade-off
- With a large number of very small bins, the minimum bias in our estimate of any density becomes small
- But the variance grows for very small bins

Bin width primarily controls the amount of smoothing; lots of guidance is available

- **Sturges' rule**: optimal width of class intervals is given by \[ \frac{R}{1+\log_2 n} \] where \(R\) is the sample range. Designed for data sampled from symmetric, unimodal populations
- **Scott's normal reference rule**: specifies a bin width \[ 3.49\hat\sigma n^{-1/3} \] where \(\hat \sigma\) is an estimate of the population standard deviation \(\sigma\)
- **Freedman-Diaconis rule**: specifies the bin width to be \[ 2(\mathrm{IQR})n^{-1/3} \] where the \(\mathrm{IQR}\) is the sample inter-quartile range
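
As a quick illustration (my own sketch, not part of the original notes), the following computes the three candidate bin widths for the `faithful` eruption data:

```r
x <- faithful$eruptions
n <- length(x)
R <- diff(range(x))                       # sample range

h.sturges <- R / (1 + log2(n))            # Sturges' rule
h.scott   <- 3.49 * sd(x) * n^(-1/3)      # Scott's normal reference rule
h.fd      <- 2 * IQR(x) * n^(-1/3)        # Freedman-Diaconis rule

c(Sturges = h.sturges, Scott = h.scott, FD = h.fd)
```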

- Is learning the whole distribution non-parametrically even feasible?
- How can we measure error to deal with the bias-variance trade-off?

- Learning the whole distribution is *feasible*
- Something even dumber than shrinking histograms will work
- Suppose we have one-dimensional samples \(x_1, \dots, x_n\) with CDF \(F\)
- Define the empirical cumulative distribution function on \(n\) samples as \[ \hat{F}_n(a) = \frac{1}{n} \sum_{i=1}^{n} I(-\infty < x_i \le a) \]
- Just the fraction of the samples which are less than or equal to \(a\)
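
As a sanity check (a minimal sketch, not from the notes), \(\hat{F}_n(a)\) can be computed directly or via R's built-in `ecdf` function:

```r
x <- faithful$eruptions
a <- 3                                    # an arbitrary evaluation point
Fhat.manual  <- mean(x <= a)              # fraction of samples <= a
Fhat.builtin <- ecdf(x)(a)                # same quantity via the ecdf() function
c(Fhat.manual, Fhat.builtin)
```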

- Then the *Glivenko-Cantelli theorem* says \[ \max_{a} | \hat{F}_n(a) - F(a) | \to 0 \]
- So the empirical CDF converges to the true CDF *everywhere*, i.e. the maximum gap between the two of them goes to zero
- Pitman (1979) calls this the "fundamental theorem of statistics"
- Can learn distributions just by collecting enough data
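
A small simulation sketch (my own illustration, assuming standard normal data) shows the maximum gap shrinking as \(n\) grows:

```r
set.seed(1)
max.gap <- function(n){
  x <- rnorm(n)                              # samples with known CDF F = pnorm
  grid <- seq(-4, 4, length.out = 1000)
  max(abs(ecdf(x)(grid) - pnorm(grid)))      # approximate sup-norm gap on a grid
}
sapply(c(10, 100, 1000, 10000), max.gap)     # gaps shrink toward zero
```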

- Can we use the empirical CDF to estimate a density?
- Yes, but it's discrete and doesn't estimate a density well
- Usually we can expect to find some new samples between our old ones
- So we want a non-zero density between our observations
- Uniform distribution within each bin of a histogram doesn’t have this issue
- Can we do better?

- Yes, but what do we mean by “better” density estimates?
- Three ideas:
- Squared deviation from the true density should be small \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx \]
- Absolute deviation from the true density should be small \[ \int \left| f(x) - \hat{f}(x) \right| dx \]
- Average log-likelihood ratio should be kept low \[ \int f(x) \log \frac{f(x)}{\hat{f}(x)} dx \]

- Squared deviation is similar to MSE criterion used in regression
- Used most frequently since it's mathematically tractable

- Absolute deviation considers \(L_1\) or total variation distance between the true and the estimated density
- Nice property that \(\frac{1}{2} \int \left| f(x) - \hat{f}(x) \right| dx\) is exactly the maximum error in our estimate of the probability of any set
- But it’s tricky to work with, so we’ll skip it

- Minimizing the log-likelihood ratio is intimately connected both to maximizing the likelihood and to minimizing entropy
- Called Kullback-Leibler divergence or relative entropy
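
For concreteness, a tiny numeric sketch (my own example) of the KL divergence between a true \(N(0,1)\) density and a deliberately mis-specified estimate:

```r
f    <- function(x) dnorm(x, 0, 1)        # true density
fhat <- function(x) dnorm(x, 0.2, 1.1)    # a deliberately wrong estimate
# Kullback-Leibler divergence: integral of f * log(f / fhat)
integrate(function(x) f(x) * log(f(x) / fhat(x)), -Inf, Inf)$value
```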

- Notice that \[ \int \left( f(x) - \hat{f}(x) \right)^2 dx = \int f^2(x) dx - 2 \int f(x) \hat{f}(x) dx + \int \hat{f}^2(x) dx \]
- First term doesn’t depend on the estimate, so we can ignore it for purposes of optimization
- Third term only involves \(\hat{f}(x)\), and is just an integral, which we can do numerically
- Second term involves both the true and the estimated density; treating the samples \(x_i\) as draws from \(f\), we can approximate it using Monte Carlo by \[ \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) \]

- Then our error measure is \[ \int \hat{f}^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}(x_i) \]
- In fact, this error measure does not depend on having one-dimensional data
- For purposes of cross-validation, we can estimate \(\hat{f}(x)\) on the training set and then restrict the sum to points in the testing set
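
A rough sketch of this cross-validated score for a Gaussian-kernel estimate (the train/test split, grid of bandwidths, and integration limits are my own choices for illustration):

```r
set.seed(2)
x <- faithful$eruptions
train <- sample(length(x), floor(length(x) / 2))
x.tr <- x[train]
x.te <- x[-train]

cv.score <- function(h){
  # Gaussian kernel density estimate fit on the training points
  fhat <- function(t) sapply(t, function(ti) mean(dnorm((ti - x.tr) / h)) / h)
  penalty <- integrate(function(t) fhat(t)^2,
                       lower = min(x) - 3, upper = max(x) + 3)$value
  fit <- mean(fhat(x.te))            # Monte Carlo term from the test points
  penalty - 2 * fit                  # estimated squared error (up to a constant)
}

hs <- seq(0.05, 0.6, by = 0.05)
scores <- sapply(hs, cv.score)
hs[which.min(scores)]                # bandwidth with the smallest estimated error
```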

- If a random variable \(X\) has probability density \(f\), then \[ f(x)=\lim_{h\rightarrow 0} \frac{1}{2h}P(x-h < X < x+h) \]
- Thus, a naive estimator would be \[ \widehat f(x) = \frac{1}{2nh} \left[ \# \textrm{ of } x_i \textrm{ falling in } (x-h, x+h) \right] \]

- Or, equivalently \[ \widehat f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}w\left( \frac{x-x_i}{h} \right) \] where \(w\) is a weight function defined as \[ w(x)=\begin{cases} 1/2 \quad |x| <1 \\ 0 \quad \text{ otherwise} \end{cases} \]
- In short, a naive estimate is constructed by placing a box of width \(2h\) and height \(\frac{1}{2nh}\) on each observation, then summing to obtain the estimate

```r
my.w <- function(x){
  if (abs(x) < 1) w <- 1/2 else w <- 0
  w
}
x <- seq(0, 6, 0.2)
m <- length(x)
n <- length(faithful$eruptions)
h <- .5
fhat <- rep(0, m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

plot(x,fhat, type="s", xlab="Eruption length", ylab="Density Estimate", main="Naive Density Estimator")

- Not wholly satisfactory, from the point of view of using density estimates for presentation
- Estimate \(\widehat f\) is a step function
- In the formula for the naive estimate, we can replace the weight function \(w\) by another function \(K\) with more desirable properties
- The function \(K\) is called a **kernel**

- The resulting estimate is a **kernel estimator**: \[ \hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-x_i}{h} \right) \]
- \(h\) is the **window width**, **smoothing parameter**, or **bandwidth**
- Usually the kernel \(K\) is taken to be a probability density function itself (e.g., the normal density)
- The resulting estimate will inherit all the smoothness properties of \(K\)

Most popular choices for the kernel \(K\) are

| Family | Kernel |
|---|---|
| Gaussian | \(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\) |
| Rectangular | \(K(t)=\frac{1}{2}\) for \(\lvert t\rvert < 1\), \(0\) otherwise |
| Triangular | \(K(t)=1-\lvert t\rvert\) for \(\lvert t\rvert < 1\), \(0\) otherwise |
| Epanechnikov | \(K(t)=\frac{3}{4\sqrt{5}}\left(1-\frac{1}{5}t^2\right)\) for \(\lvert t\rvert < \sqrt{5}\), \(0\) otherwise |
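
As a quick check (illustrative sketch only), each kernel in the table should integrate to one:

```r
kernels <- list(
  gaussian     = function(t) dnorm(t),
  rectangular  = function(t) ifelse(abs(t) < 1, 1/2, 0),
  triangular   = function(t) ifelse(abs(t) < 1, 1 - abs(t), 0),
  epanechnikov = function(t) ifelse(abs(t) < sqrt(5), 3/(4*sqrt(5)) * (1 - t^2/5), 0)
)
t <- seq(-5, 5, by = 0.001)
sapply(kernels, function(K) sum(K(t)) * 0.001)   # each should be (approximately) 1
```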

```r
my.w <- function(x, type = "gaussian"){
  if (type == "gaussian"){
    w <- dnorm(x)
    return(w)
  }
  if (type == "naive"){
    if (abs(x) < 1) w <- 1/2 else w <- 0
    return(w)
  }
  print("You have asked for an undefined kernel.")
  return(NULL)
}
```

```r
x <- seq(0, 6, 0.02)
m <- length(x)
n <- length(faithful$eruptions)
h <- .1
fhat <- rep(0, m)
for (i in 1:m){
  S <- 0
  for (j in 1:n){
    S <- S + (1/h) * my.w((faithful$eruptions[j] - x[i]) / h)
  }
  fhat[i] <- (1/n) * S
}
```

plot(x,fhat, type="l", xlab="Eruption length", ylab="Density Estimate", main="Naive Density Estimator")

- How should we choose the bandwidth \(h\)?
- Cross-validation, which could be time-consuming
- Optimal bandwidth for a Gaussian kernel to estimate a Gaussian distribution is \(1.06\sigma / n^{1/5}\)
- Called the **Gaussian reference rule** or the **rule-of-thumb** bandwidth
- When you call `density` in R, this is basically what it does
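
A minimal sketch of the rule-of-thumb bandwidth for the eruption data (R's default `bw.nrd0` is a closely related variant):

```r
x <- faithful$eruptions
h.rot <- 1.06 * sd(x) * length(x)^(-1/5)   # Gaussian reference rule
h.rot
bw.nrd0(x)                                 # R's default "nrd0" rule, a close variant
plot(density(x, bw = h.rot),
     main = "Gaussian kernel, rule-of-thumb bandwidth")
```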

- There are times when one wants to draw a random sample from the estimated distribution
- Easy with kernel density estimates, because each kernel is itself a probability density
- Suppose the kernel is Gaussian, that we have scalar observations \(x_1, \dots, x_n\) and bandwidth \(h\)
- Pick an integer \(i\) uniformly at random from \(1\) to \(n\)
- Use `rnorm(1, x[i], h)` for one draw, or `rnorm(q, sample(x, q, replace=TRUE), h)` for \(q\) draws

- Using a different kernel, we’d just need to use the random number generator function for the corresponding distribution
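
A short sketch (assuming a Gaussian kernel and the rule-of-thumb bandwidth) of drawing new samples from the estimated density for the eruption data:

```r
x <- faithful$eruptions
h <- 1.06 * sd(x) * length(x)^(-1/5)   # rule-of-thumb bandwidth
q <- 500
# pick kernel centres at random, then add Gaussian noise with sd = h
new.draws <- rnorm(q, mean = sample(x, q, replace = TRUE), sd = h)
hist(new.draws, breaks = 30, main = "Draws from the kernel density estimate")
```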

- Histograms and kernels are not the only possible way of estimating densities
- Can try the local polynomial trick, series expansions, splines, penalized likelihood approaches, etc
- For some of these, negative probability density estimates can be avoided by estimating the log density

- In R, the `density()` function is the most commonly used

```r
density(x, ...)

## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
        kernel = c("gaussian", "epanechnikov", "rectangular",
                   "triangular", "biweight", "cosine", "optcosine"),
        weights = NULL, window = kernel, width,
        give.Rkern = FALSE, n = 512, from, to, cut = 3, na.rm = FALSE, ...)
```

- The `ASH` and `KernSmooth` packages are both fast, accurate, and well-maintained (Deng and Wickham, 2011)
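
For example, a minimal sketch using `KernSmooth::bkde` (the bandwidth and plot labels here are my own choices):

```r
library(KernSmooth)
x <- faithful$eruptions
h <- 1.06 * sd(x) * length(x)^(-1/5)             # reuse the rule-of-thumb bandwidth
est <- bkde(x, kernel = "normal", bandwidth = h) # binned kernel density estimate
plot(est, type = "l", xlab = "Eruption length", ylab = "Density Estimate")
```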

- To construct a bivariate density histogram, it is necessary to define two-dimensional bins and count the number of observations in each bin
- The `bin2d` function defined below will bin a bivariate data set

```r
bin2d <- function(x, breaks1 = "Sturges", breaks2 = "Sturges"){
  histg1 <- hist(x[,1], breaks = breaks1, plot = FALSE)
  histg2 <- hist(x[,2], breaks = breaks2, plot = FALSE)
  brx <- histg1$breaks
  bry <- histg2$breaks
  freq <- table(cut(x[,1], brx), cut(x[,2], bry))
  return(list(call = match.call(), freq = freq,
              breaks1 = brx, breaks2 = bry,
              mids1 = histg1$mids, mids2 = histg2$mids))
}
```

- The following example computes the bivariate frequency table
- After binning the data, the `persp` function plots the density histogram

```r
data(iris)
fit1 <- bin2d(iris[1:50, 1:2])
persp(x = fit1$mids1, y = fit1$mids2, z = fit1$freq,
      shade = TRUE, theta = 45, phi = 30, ltheta = 60)
```

- Suppose the data is \(X_1, \dots, X_n\), where each \(X_i\in \mathbb{R}^2\)
- Kernel density estimates can be extended to a multivariate (bivariate) setting
- Let \(K(\cdot)\) be a bivariate kernel (typically a bivariate density function); then the bivariate kernel density estimate is \[ \hat f(X) = \frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{X-X_i}{h}\right) \] where \(d=2\) here

- Estimate the bivariate density when the data is generated from a mixture model with three components with identical covariance \(\Sigma=I_2\) and different means \[ \mu_1=(0,0),\quad\mu_2=(1,3),\quad\mu_3=(4,-1). \]
- Mixture probabilities are \(p=(0.2, 0.3, 0.5)\)

```r
library(MASS)
n <- 2000
p <- c(.2, .3, .5)
mu <- matrix(c(0, 1, 4, 0, 3, -1), 3, 2)
Sigma <- diag(2)
i <- sample(1:3, replace = TRUE, prob = p, size = n)
k <- table(i)
x1 <- mvrnorm(k[1], mu = mu[1,], Sigma)
x2 <- mvrnorm(k[2], mu = mu[2,], Sigma)
x3 <- mvrnorm(k[3], mu = mu[3,], Sigma)
X <- rbind(x1, x2, x3)   # the mixture data
x <- X[,1]
y <- X[,2]
fhat <- kde2d(x, y, h = c(1.87, 1.84))
```

```r
par(mfrow = c(1, 2))
contour(fhat)
persp(fhat, phi = 30, theta = 20, d = 5, xlab = "x")
```