Agenda

  • Functions are objects: can be arguments for or returned by other functions
  • Example: curve()
  • Optimization via gradient descent, Newton's method, Nelder-Mead, …
  • Curve-fitting by optimizing

Functions as Objects

  • In R, functions are objects, just like everything else
  • This means that they can be passed to functions as arguments and returned by functions as outputs as well
  • We often want to do very similar things to many different functions
  • The procedure is the same, only the function we're working with changes
  • Write one function to do the job, and pass the function as an argument
  • Because R treats a function like any other object, we can do this simply: invoke the function by its argument name in the body

Functions as Objects

  • We have already seen examples
  • apply(), sapply(), etc.: Take this function and use it on all of these objects
  • nlm(): Take this function and try to make it small, starting from here
  • ks.test(): Compare these data to this cumulative distribution function
  • curve(): Evaluate this function over that range, and plot the results

Syntax Facts About Functions

  • Typing a function's name, without parentheses, in the terminal gives you its source code
  • Functions are their own class in R
class(sin)
## [1] "function"
class(sample)
## [1] "function"
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)
## [1] "function"

Syntax Facts About Functions

  • Functions can be put into lists or even arrays
  • A call to function() returns a function object, with three parts:
  • the body to be executed: access it with body(foo)
  • the formal arguments required: access them with formals(foo)
  • the parent environment: access it with environment(foo)
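
For example, with resample() from above (a quick sketch; what each accessor returns is noted in comments):

formals(resample)       # the formal arguments: a list with the single element x
body(resample)          # the code: sample(x, size=length(x), replace=TRUE)
environment(resample)   # here, the global environment
funcs <- list(resample, sin, sample)   # functions stored in a list
class(funcs[[2]])
## [1] "function"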

Syntax Facts About Functions

R has separate types for built-in functions and for those written in R:

typeof(resample)
## [1] "closure"
typeof(sample)
## [1] "closure"
typeof(sin)
## [1] "builtin"

Why closure for written-in-R functions? Because expressions are "closed" by referring to the parent environment

There's also a 2nd type of built-in function, special, for built-ins (like if and quote) that don't evaluate all their arguments; both kinds are "primitive" functions

Anonymous Functions

  • function() returns an object of class function
  • So far we've assigned that object to a name
  • If we don't have an assignment, we get an anonymous function
  • Usually part of some larger expression:
sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})
## [1] 0.1192029 0.2689414 0.5000000 0.7310586 0.8807971

Anonymous Functions

  • Often handy when connecting other pieces of code
    • especially in things like apply and sapply
  • Don't clutter the workspace
  • Can't be examined or re-used later

Example: grad()

  • Many problems in statistics come down to optimization
  • So do lots of problems in economics, physics, CS, biology, …
  • Lots of optimization problems require the gradient of the objective function
  • Gradient of \(f\) at \(x\): \[ \nabla f(x) = \left[\left.\frac{\partial f}{\partial x_1}\right|_{x} \ldots \left.\frac{\partial f}{\partial x_p}\right|_{x}\right] \]

Example: grad()

  • We do the same thing to get the gradient of \(f\) at \(x\) no matter what \(f\) is
  • Find the partial derivative of \(f\) with respect to each component of \(x\) and return the vector of partial derivatives
  • It makes no sense to re-write this every time we change \(f\)!
  • Write code to calculate the gradient of an arbitrary function
  • We could write our own, but there are lots of tricky issues
    • Best way to calculate partial derivative
    • What if \(x\) is at the edge of the domain of \(f\)?
  • Fortunately, someone has already done this

Example: grad()

From the package numDeriv

grad(func, x, ...)
  • Assumes func is a function which returns a single floating-point value
  • Assumes x is a vector of arguments to func
    • If x is a vector and func(x) is also a vector, then it's assumed func is vectorized and we get a vector of derivatives
  • Extra arguments in ... get passed along to func
  • Other functions in the package for the Jacobian of a vector-valued function, and the matrix of 2nd partials (Hessian)

Example: grad()

require("numDeriv")
## Loading required package: numDeriv
just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))
## [1] TRUE
phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))
## [1] TRUE
grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))
## [1] 2 3

Note: grad is perfectly happy with func being an anonymous function!

gradient.descent()

Now we can use this as a piece of a larger machine:

gradient.descent <- function(f,x,max.iterations,step.scale,
  stopping.deriv,...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f,x,...)     # numerical gradient at the current point
    if(all(abs(gradient) < stopping.deriv)) { break }  # converged: gradient ~ 0
    x <- x - step.scale*gradient  # take a step downhill
  }
  fit <- list(argmin=x,final.gradient=gradient,final.value=f(x,...),
    iterations=iteration)
  return(fit)
}

Works equally well whether f is mean squared error of a regression, \(\psi\) error of a regression, (negative log) likelihood, cost of a production plan, …
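
A quick sanity check (an illustrative example, not from the course data): minimize a bowl-shaped function whose minimum we know is at \((1, -2)\):

quad <- function(x) { (x[1]-1)^2 + (x[2]+2)^2 }
fit <- gradient.descent(f=quad, x=c(0,0), max.iterations=1e4,
  step.scale=0.1, stopping.deriv=1e-6)
fit$argmin   # close to c(1, -2)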

Cautions

  • Scoping: f takes values for all names which aren't its arguments from the environment where it was defined, not the one where it is called (e.g., not from inside grad or gradient.descent)
  • Debugging: If f and g are both complicated, avoid debugging g(f) as a block; divide the work by writing very simple f.dummy to debug/test g, and debug/test the real f separately
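
A tiny illustration of the scoping point (made-up f and g):

a <- 7
f <- function(x) { a*x }               # a is not an argument, so R looks it up
g <- function(h) { a <- 1000; h(2) }   # this a is invisible to f
g(f)   # f uses the a from where f was defined
## [1] 14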

Returning Functions

  • Functions can be return values like anything else
  • Create a linear predictor, based on sample values of two variables
make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
   return(predict(object=linear.fit,newdata=data.frame(x=x)))
  }
  return(predictor)
}
  • The predictor function persists and works, even when the data we used to create it is gone

Returning Functions

library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt,y=cats$Hwt)
rm(cats)            # Data set goes away
vet_predictor(3.5)  # My cat's body mass in kilograms
##        1 
## 13.76256

Example: curve()

  • A call to curve looks like this:
curve(expr, from = a, to = b, ...)
  • expr is some expression involving a variable called x which is swept from the value a to the value b
  • ... are other plot-control arguments
  • curve feeds the expression a vector x and expects a numeric vector back (so this is fine)

    curve(x^2 * sin(x))

Example: curve()

We can use our own function in curve:

psi <- function(x,c=1) {ifelse(abs(x)>c,2*c*abs(x)-c^2,x^2)}
curve(psi(x,c=10),from=-20,to=20)

Or this

curve(psi(x=10,c=x),from=-20,to=20)

Example: curve()

If our function doesn't take vectors to vectors, curve becomes unhappy

mse <- function(y0,a,Y=gmp$pcgmp,N=gmp$pop) {
   mean((Y - y0*(N^a))^2)
}
> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) : 
  'expr' did not evaluate to an object of length 'n'
In addition: Warning message:
In N^a : longer object length is not a multiple of shorter object length

How do we solve this?

Example: curve()

Define a new, vectorized function, say with sapply:

sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse(6611,0.10)
## [1] 154701953
mse.plottable <- function(a,...){ return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513

Example: curve()

curve(mse.plottable(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.plottable(a=x,y0=5100),add=TRUE,col="blue")

Example: curve()

Alternate strategy: Vectorize() returns a new, vectorized function

mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse.vec(a=1/8,y0=c(5000,6000,7000))
## [1] 134617132  74693733  63732256

Example: curve()

curve(mse.vec(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.vec(a=x,y0=5100),add=TRUE,col="blue")

Optimization


Optimization Problems

Given an objective function \(f: \mathcal{D} \rightarrow \mathbb{R}\), find \[ \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \theta^* = \argmin_{\theta}{f(\theta)} \]

Basics: maximizing \(f\) is minimizing \(-f\): \[ \argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)} \]

If \(h\) is strictly increasing (e.g., \(\log\)), then \[ \argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))} \]
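
In R (a small illustration with a made-up \(f\)): the maximum of \(f\) is the minimum of \(-f\)

f <- function(x) { 5 - (x-2)^2 }
optimize(function(x) { -f(x) }, interval=c(0,5))$minimum   # ~ 2, the argmax of f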

Considerations

  • Approximation: How close can we get to \(\theta^*\), and/or \(f(\theta^*)\)?
  • Time complexity: How many computer steps does that take? Varies with precision of approximation, niceness of \(f\), size of \(\mathcal{D}\), size of data, method…
  • Most optimization algorithms use successive approximation, so distinguish number of iterations from cost of each iteration

Use calculus

Suppose \(x\) is one dimensional and \(f\) is smooth. If \(x^*\) is an interior minimum / maximum / extremum point \[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]

If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]

Use calculus

This all carries over to multiple dimensions: At an interior extremum, \[ \nabla f(\theta^*) = 0 \]

At an interior minimum, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \] \(\nabla^2 f\) is the Hessian, \(\mathbf{H}\); a point satisfying these conditions might just be a local minimum
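
We can check both conditions numerically with numDeriv's grad() and hessian() (a sketch with a known minimum at \((1,-1)\)):

f <- function(theta) { (theta[1]-1)^2 + 2*(theta[2]+1)^2 }
grad(f, c(1,-1))                    # ~ c(0, 0)
eigen(hessian(f, c(1,-1)))$values   # both positive: H is positive definite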

Gradient Descent

\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \] \[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \] Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \] \(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)

Gradient Descent

  1. Start with initial guess for \(\theta\), step-size \(\eta\)
  2. While ((not too tired) and (making adequate progress))
  • Find gradient \(\nabla f(\theta)\)
  • Set \(\theta \leftarrow \theta - \eta \nabla f(\theta)\)
  3. Return final \(\theta\) as approximate \(\theta^*\)

Variations: adaptively adjust \(\eta\) to make sure of improvement, or search along the gradient direction for the minimum

Gradient Descent

Pro:

  • Moves in direction of greatest immediate improvement
  • If \(\eta\) is small enough, gets to a local minimum eventually, and then stops

Cons:

  • "small enough" \(\eta\) can be really, really small
  • Slowness or zig-zagging if components of \(\nabla f\) are very different sizes

How much work do we need?

Scaling

Big-\(O\) notation: \[ h(x) = O(g(x)) \] means \[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \] for some \(c \neq 0\)

e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)

e.g., \(e^{x}/(1+e^{x}) = O(1)\)

Useful to look at over-all scaling, hiding details

Also done when the limit is \(x\rightarrow 0\)

Gradient Descent

Pro:

  • For nice \(f\), \(f(\theta) \leq f(\theta^*)+\epsilon\) in \(O(\epsilon^{-2})\) iterations
    • For very nice \(f\), only \(O(\log{\epsilon^{-1}})\) iterations
  • To get \(\nabla f(\theta)\), take \(p\) derivatives, \(\therefore\) each iteration costs \(O(p)\)

Con:

  • Taking derivatives can slow down as data grows — each iteration might really be \(O(np)\)

Taylor Series

What if we do a quadratic approximation to \(f\)? \[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

These are special cases of the general idea of Taylor approximation

Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\): \[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
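
A quick numerical check of the 1D version, using the exact derivatives of \(\cos\):

x0 <- 0.5; dx <- 0.01
cos(x0 + dx)                                    # exact value
cos(x0) + dx*(-sin(x0)) + 0.5*dx^2*(-cos(x0))   # quadratic approximation
# the two agree to about 1e-7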

Minimizing a Quadratic

If we know \[ f(x) = ax^2 + bx + c \] we minimize exactly: \[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \] If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \] then \[ x^* = x_0 - a^{-1}b \]

Newton's Method

Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \] Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \] Works exactly if \(f\) is quadratic and \(\mathbf{H}^{-1}\) exists, etc.

If \(f\) isn't quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true

Newton's Method

The Algorithm

  1. Start with guess for \(\theta\)
  2. While ((not too tired) and (making adequate progress))
  • Find gradient \(\nabla f(\theta)\) and Hessian \(\mathbf{H}(\theta)\)
  • Set \(\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)\)
  3. Return final \(\theta\) as approximation to \(\theta^*\)

Like gradient descent, but with inverse Hessian giving the step-size

"This is about how far you can go with that gradient"

Newton's Method

Pros:

  • Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
  • Also guaranteed to need \(O(\epsilon^{-2})\) steps to get within \(\epsilon\) of optimum
  • Only \(O(\log\log{\epsilon^{-1}})\) for very nice functions
  • Typically many fewer iterations than gradient descent

Newton's Method

Cons:

  • Hopeless if \(\mathbf{H}\) doesn't exist or isn't invertible
  • Need to take \(O(p^2)\) second derivatives plus \(p\) first derivatives
  • Need to solve \(\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})\) for \(\theta_{\mathrm{new}}\)
  • Inverting \(\mathbf{H}\) is \(O(p^3)\), but cleverness gives \(O(p^2)\) for solving for \(\theta_{\mathrm{new}}\)
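
Putting the pieces together with numDeriv's grad() and hessian(): a minimal Newton sketch (a hypothetical helper, parallel to gradient.descent() above), using solve() on the linear system rather than inverting \(\mathbf{H}\):

newton.method <- function(f, x, max.iterations, stopping.deriv, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)
    if (all(abs(gradient) < stopping.deriv)) { break }
    # solve H %*% step = gradient instead of forming H^{-1}
    x <- x - solve(hessian(f, x, ...), gradient)
  }
  list(argmin=x, final.gradient=gradient, iterations=iteration)
}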

Getting Around the Hessian

Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step

Approaches:

  • Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives ("Fisher scoring")
  • Use only diagonal entries (\(p\) unmixed 2nd derivatives)
  • Use \(\mathbf{H}(\theta)\) at initial guess, hope \(\mathbf{H}\) changes very slowly with \(\theta\)
  • Re-compute \(\mathbf{H}(\theta)\) every \(k\) steps, \(k > 1\)
  • Fast, approximate updates to the Hessian at each step (BFGS)
  • Lots of other methods!
  • Nelder-Mead, a.k.a. "the simplex method", which doesn't need any derivatives
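
Several of these strategies are available through R's optim() (previewed here; more next time), e.g. BFGS on a made-up function:

f <- function(theta) { (theta[1]-1)^2 + 2*(theta[2]+1)^2 }
optim(par=c(0,0), fn=f, method="BFGS")$par   # ~ c(1, -1)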

Nelder-Mead

Try to cage \(\theta^*\) with a simplex of \(p+1\) points

Order the trial points, \(f(\theta_1) \leq f(\theta_2) \ldots \leq f(\theta_{p+1})\)

\(\theta_{p+1}\) is the worst guess — try to improve it

Center of the not-worst points: \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)

Nelder-Mead

Try to improve the worst guess \(\theta_{p+1}\)

  1. Reflection: Try \(\theta_0 - (\theta_{p+1}-\theta_0)\), across the center from \(\theta_{p+1}\)
  • if it's better than \(\theta_p\) but not better than \(\theta_1\), replace the old \(\theta_{p+1}\) with it
  2. Expansion: if the reflected point is the new best, try \(\theta_0 - 2(\theta_{p+1}-\theta_0)\); replace the old \(\theta_{p+1}\) with the better of the reflected and the expanded point
  3. Contraction: If the reflected point is worse than \(\theta_p\), try \(\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}\); if the contracted point is better, replace \(\theta_{p+1}\) with it
  4. Reduction: If all else fails, shrink every point towards the best: \(\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}\)
  5. Go back to (1) until we stop improving or run out of time

Making Sense of Nelder-Mead

The Moves:

  • Reflection: try the opposite of the worst point
  • Expansion: if that really helps, try it some more
  • Contraction: see if we overshot when trying the opposite
  • Reduction: if all else fails, try making each point more like the best point

Making Sense of Nelder-Mead

Pros:

  • Each iteration \(\leq 4\) values of \(f\), plus sorting (and sorting is at most \(O(p\log{p})\), usually much better)
  • No derivatives used; can even work for discontinuous \(f\)

Con:

  • Can need many more iterations than gradient methods
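
For comparison, optim()'s default method is Nelder-Mead (same made-up f as before):

f <- function(theta) { (theta[1]-1)^2 + 2*(theta[2]+1)^2 }
optim(par=c(0,0), fn=f, method="Nelder-Mead")$par   # ~ c(1, -1), no derivatives needed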

Coordinate Descent

Gradient descent, Newton's method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows

Coordinate descent: never do more than 1D optimization at a time (a code sketch follows the outline below)

  1. Start with initial guess \(\theta\)
  2. While ((not too tired) and (making adequate progress))
  • For \(i \in (1:p)\)
    • do 1D optimization over the \(i^{\mathrm{th}}\) coordinate of \(\theta\), holding the others fixed
    • Update the \(i^{\mathrm{th}}\) coordinate to this optimal value
  3. Return final value of \(\theta\)
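
A minimal sketch, using optimize() for each 1D problem (a hypothetical helper with a fixed search interval):

coordinate.descent <- function(f, x, max.sweeps=100, lower=-10, upper=10, ...) {
  for (sweep in 1:max.sweeps) {
    old.x <- x
    for (i in seq_along(x)) {
      f.1d <- function(z) { x[i] <- z; f(x, ...) }   # vary coordinate i only
      x[i] <- optimize(f.1d, lower=lower, upper=upper)$minimum
    }
    if (all(abs(x - old.x) < 1e-8)) { break }        # no coordinate moved much
  }
  list(argmin=x, sweeps=sweep)
}
coordinate.descent(function(theta) { (theta[1]-1)^2 + 2*(theta[2]+1)^2 }, x=c(0,0))$argmin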

Coordinate Descent

Cons:

  • Needs a good 1D optimizer
  • Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

  • Can be extremely fast and simple

Curve-Fitting by Optimizing

We have data \((x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)\), and a family of possible curves \(r(x;\theta)\), e.g.

  • \(r(x) = x \cdot \theta\)
  • \(r(x) = \theta_1 x^{\theta_2}\)
  • \(r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}\) for fixed "basis" functions \(b_j\)

Curve-Fitting by Optimizing

Least-squares curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]
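
For instance (a sketch; assumes the gmp data from the mse() example above is loaded): fit the power law \(r(N;\theta) = \theta_1 N^{\theta_2}\) by least squares with optim()

mse.theta <- function(theta, Y=gmp$pcgmp, N=gmp$pop) {
  mean((Y - theta[1]*N^theta[2])^2)   # mean squared error as a function of theta
}
optim(par=c(6611, 0.15), fn=mse.theta)$par   # Nelder-Mead, from the earlier guesses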

"Robust" curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]

Summary

  • In R, functions are objects, and can be arguments to other functions
  • Functions can also be returned by other functions
  • Optimization: trade-offs among methods
  • Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
  • Gradient descent: less complex iterations, more guarantees, less adaptive
  • Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
  • Next time: pre-built code like optim() and nls()