- Functions are objects: can be arguments for or returned by other functions
- Example: curve()
- Optimization via gradient descent, Newton's method, Nelder-Mead, …
- Curve-fitting by optimizing

Functions can be arguments to other functions:

- apply(), sapply(), etc.: Take this function and use it on all of these objects
- nlm(): Take this function and try to make it small, starting from here
- ks.test(): Compare these data to this cumulative distribution function
- curve(): Evaluate this function over that range, and plot the results

class(sin)
## [1] "function"
class(sample)
## [1] "function"
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)
## [1] "function"
- function returns a function object
- We can inspect its pieces with body(foo), formals(foo), and environment(foo)
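For example, applied to the resample function defined above (the values noted in comments are what I'd expect, shown for orientation):

body(resample)        # the code: { sample(x, size = length(x), replace = TRUE) }
formals(resample)     # the argument list: just x here
environment(resample) # where it was defined, e.g. <environment: R_GlobalEnv>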
R has separate types for built-in functions and for those written in R:
typeof(resample)
## [1] "closure"
typeof(sample)
## [1] "closure"
typeof(sin)
## [1] "builtin"
Why closure for functions written in R? Because their expressions are "closed" by referring to the parent environment, the one where the function was defined

There is also a second class of built-in functions, called primitive functions
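A quick way to check this distinction, using base R's is.primitive():

is.primitive(sin)      # TRUE: sin is a primitive built-in
is.primitive(resample) # FALSE: resample is an ordinary (closure) function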
function() returns an object of class function
sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})
## [1] 0.1192029 0.2689414 0.5000000 0.7310586 0.8807971
apply and sapply take functions as arguments; so does grad, from the package numDeriv:
grad(func, x, ...)
- func is a function which returns a single floating-point value
- x is a vector of arguments to func
- If x is a vector and func(x) is also a vector, then it's assumed func is vectorized and we get a vector of derivatives
- ... get passed along to func
require("numDeriv")
## Loading required package: numDeriv
just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))
## [1] TRUE
phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))
## [1] TRUE
grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))
## [1] 2 3
Note: grad is perfectly happy with func being an anonymous function!
Now we can use this as a piece of a larger machine:
gradient.descent <- function(f, x, max.iterations, step.scale,
                             stopping.deriv, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)
    if (all(abs(gradient) < stopping.deriv)) { break() }
    x <- x - step.scale*gradient
  }
  fit <- list(argmin=x, final.gradient=gradient, final.value=f(x, ...),
              iterations=iteration)
  return(fit)
}
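As a quick sanity check (the test function and tuning constants here are my own, for illustration):

# Minimize a simple function with a known minimum at (1, -2)
f.test <- function(x) { (x[1]-1)^2 + (x[2]+2)^2 }
gradient.descent(f=f.test, x=c(0,0), max.iterations=1e4,
                 step.scale=0.1, stopping.deriv=1e-6)$argmin # close to c(1,-2)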
Works equally well whether f is mean squared error of a regression, \(\psi\) error of a regression, (negative log) likelihood, cost of a production plan, …
- f takes values for all names which aren't its arguments from the environment where it was defined, not the one where it is called (e.g., not from inside grad or gradient.descent)
- If f and g are both complicated, avoid debugging g(f) as a block; divide the work by writing a very simple f.dummy to debug/test g, and debug/test the real f separately

Functions can also be returned by other functions:

make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
    return(predict(object=linear.fit, newdata=data.frame(x=x)))
  }
  return(predictor)
}
library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt, y=cats$Hwt)
rm(cats) # Data set goes away
vet_predictor(3.5) # My cat's body mass in kilograms
##        1
## 13.76256
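Why does vet_predictor still work after rm(cats)? Because linear.fit lives on in the environment where predictor was defined, and the returned function carries that environment with it. We can peek (the value noted in the comment is what I'd expect):

ls(environment(vet_predictor)) # should include "linear.fit" (plus "predictor", "x", "y")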
curve looks like this:

curve(expr, from = a, to = b, ...)

- expr is some expression involving a variable called x, which is swept from the value a to the value b
- ... are other plot-control arguments

curve feeds the expression a vector x and expects a numeric vector back (so this is fine)
curve(x^2 * sin(x))
We can use our own function in curve:
psi <- function(x,c=1) {ifelse(abs(x)>c,2*c*abs(x)-c^2,x^2)}
curve(psi(x,c=10),from=-20,to=20)
Or this:
curve(psi(x=10,c=x),from=-20,to=20)
If our function doesn't take vectors to vectors, curve becomes unhappy:
mse <- function(y0,a,Y=gmp$pcgmp,N=gmp$pop) { mean((Y - y0*(N^a))^2) }
> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) :
  'expr' did not evaluate to an object of length 'n'
In addition: Warning message:
In N^a : longer object length is not a multiple of shorter object length
How do we solve this?
Define a new, vectorized function, say with sapply
:
sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)
## [1] 154701953 102322974 68755654 64529166 104079527 207057513
mse(6611,0.10)
## [1] 154701953
mse.plottable <- function(a,...){ return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974 68755654 64529166 104079527 207057513
curve(mse.plottable(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.plottable(a=x,y0=5100),add=TRUE,col="blue")
Alternate strategy: Vectorize() returns a new, vectorized function
mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974 68755654 64529166 104079527 207057513
mse.vec(a=1/8,y0=c(5000,6000,7000))
## [1] 134617132 74693733 63732256
curve(mse.vec(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.vec(a=x,y0=5100),add=TRUE,col="blue")
Given an objective function \(f: \mathcal{D} \mapsto \mathbb{R}\), find \[ \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \theta^* = \argmin_{\theta}{f(\theta)} \]
Basics: maximizing \(f\) is minimizing \(-f\): \[ \argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)} \]
If \(h\) is strictly increasing (e.g., \(\log\)), then \[ \argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))} \]
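For example, combining these two facts: maximizing a likelihood \(L(\theta)\) is the same as minimizing the negative log-likelihood, the form we usually hand to a minimizer:
\[ \argmax_{\theta}{L(\theta)} = \argmin_{\theta}{-\log{L(\theta)}} \]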
Suppose \(x\) is one dimensional and \(f\) is smooth. If \(x^*\) is an interior minimum / maximum / extremum point \[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]
If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]
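A quick worked instance of both conditions (the function is my own pick):
\[ f(x) = x^2 - 2x: \qquad f^{\prime}(x) = 2x - 2 = 0 \Rightarrow x^* = 1, \qquad f^{\prime\prime}(x^*) = 2 > 0 \]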
This all carries over to multiple dimensions: At an interior extremum, \[ \nabla f(\theta^*) = 0 \]
At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \]

- \(\nabla^2 f =\) the Hessian, \(\mathbf{H}\)
- \(\theta^*\) might just be a local minimum
\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \] \[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \] Locally, the function looks linear; to minimize a linear function, move down the slope
Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \] \(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)
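So the gradient-descent update is a small step against the gradient, which is exactly what the gradient.descent function above does (with \(\eta\) played by step.scale):
\[ \theta \leftarrow \theta - \eta \, \nabla f(\theta) \]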
Pro:
Cons:
Big-\(O\) notation: \[ h(x) = O(g(x)) \] means \[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \] for some \(c \neq 0\)
e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)
e.g., \(e^{x}/(1+e^{x}) = O(1)\)
Useful to look at over-all scaling, hiding details. Also done when the limit is \(x\rightarrow 0\)
Pro:
Con:
What if we do a quadratic approximation to \(f\)? \[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
Special cases of general idea of Taylor approximation
Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\): \[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
Near a minimum, smooth functions look like parabolas
Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
If we know \[ f(x) = ax^2 + bx + c \] we minimize exactly: \[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \] If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \] then \[ x^* = x_0 - a^{-1}b \]
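A quick numerical check of the \(x^* = -b/(2a)\) formula, using base R's optimize() (the coefficients are made up for illustration):

a <- 3; b <- -4; c <- 7
f.quad <- function(x) { a*x^2 + b*x + c }
optimize(f.quad, interval=c(-10,10))$minimum # should be close to -b/(2a) = 2/3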
Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \] Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \] Works exactly if \(f\) is quadratic and \(\mathbf{H}^{-1}\) exists, etc.
If \(f\) isn't quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true
The Algorithm
Like gradient descent, but with inverse Hessian giving the step-size
"This is about how far you can go with that gradient"
Pros:
Cons:
Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step
Approaches:
Try to cage \(\theta^*\) with a simplex of \(p+1\) points
Order the trial points, \(f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})\)
\(\theta_{p+1}\) is the worst guess — try to improve it
Center of the not-worst points = \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)
Try to improve the worst guess \(\theta_{p+1}\)
The Moves:
Pros:
Con:
Gradient descent, Newton's method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows
Coordinate descent: never do more than 1D optimization
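A minimal sketch of this idea (the function name, bounds, and loop structure are my own, not from the slides), using base R's optimize() for each one-dimensional problem:

coordinate.descent <- function(f, x, max.sweeps, lower, upper, ...) {
  for (sweep in 1:max.sweeps) {
    for (j in 1:length(x)) {
      # hold all coordinates but j fixed, and optimize over coordinate j alone
      f.1d <- function(xj) { x.try <- x; x.try[j] <- xj; f(x.try, ...) }
      x[j] <- optimize(f.1d, lower=lower, upper=upper)$minimum
    }
  }
  list(argmin=x, final.value=f(x, ...), sweeps=sweep)
}
# e.g., coordinate.descent(f.test, x=c(0,0), max.sweeps=10, lower=-10, upper=10)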
Cons:
Pros:
We have data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), and possible curves \(r(x;\theta)\), e.g., the power law \(r(x;\theta) = y_0 x^{a}\) with \(\theta = (y_0, a)\), as in the mse example above
Least-squares curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]
"Robust" curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]
In R, optim provides general-purpose optimizers (including Nelder-Mead and quasi-Newton methods), and nls does nonlinear least-squares curve fitting
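A minimal sketch of both routes, reusing mse from above (gmp is the data frame with columns pcgmp and pop that mse assumes; the starting values are just ones that appeared earlier):

# Least squares by general-purpose optimization: optim() minimizes over (y0, a)
fit.optim <- optim(par=c(y0=6611, a=0.125),
                   fn=function(p) { mse(y0=p["y0"], a=p["a"]) })
fit.optim$par
# Nonlinear least squares directly, via nls()
fit.nls <- nls(pcgmp ~ y0*pop^a, data=gmp, start=list(y0=6611, a=0.125))
coef(fit.nls)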