## Agenda

• Functions are objects: can be arguments for or returned by other functions
• Example: curve()
• Curve-fitting by optimizing

## Functions as Objects

• In R, functions are objects, just like everything else
• This means that they can be passed to functions as arguments and returned by functions as outputs as well
• We often want to do very similar things to many different functions
• The procedure is the same, only the function we're working with changes
• Write one function to do the job, and pass the function as an argument
• Because R treats a function like any other object, we can do this simply: inside the body, invoke the function by its argument name — as in the sketch below
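A minimal sketch of the idea; the names apply.twice and f here are ours, purely for illustration:

apply.twice <- function(f, x) { f(f(x)) }   # f is whatever function the caller passes in
apply.twice(sqrt, 16)           # sqrt(sqrt(16)), i.e. 2
apply.twice(log, exp(exp(1)))   # log(log(e^e)), i.e. 1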

## Functions as Objects

• We have already seen examples
• apply(), sapply(), etc.: Take this function and use it on all of these objects
• nlm(): Take this function and try to make it small, starting from here
• ks.test(): Compare these data to this cumulative distribution function
• curve(): Evaluate this function over that range, and plot the results

• Typing a function's name, without parentheses, in the terminal gives you its source code
• Functions are their own class in R
class(sin)
## [1] "function"
class(sample)
## [1] "function"
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)
## [1] "function"

• Functions can be put into lists or even arrays
• A call to function() returns a function object, which has three parts (illustrated below)
• the body to be executed: access it with body(foo)
• the arguments required: access them with formals(foo)
• the parent environment: access it with environment(foo)
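For example, with the resample() function defined above (plus a quick check that functions really can live in lists):

funs <- list(sin, sample, resample)   # a list holding three functions
sapply(funs, class)                   # all "function"
body(resample)          # the (unevaluated) code to be run
formals(resample)       # the argument list: here just x
environment(resample)   # where resample was defined: the global environment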

R has separate types for built-in functions and for those written in R:

typeof(resample)
## [1] "closure"
typeof(sample)
## [1] "closure"
typeof(sin)
## [1] "builtin"

Why "closure" for functions written in R? Because their expressions are "closed" by referring to the parent environment for any names that aren't arguments or local variables

There's also a second type of built-in function, "special"; both kinds are primitive functions, implemented directly in C
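base R's is.primitive() checks for this; a small illustration:

is.primitive(sin)      # TRUE: implemented in C, no R-level body
is.primitive(sample)   # FALSE: sample is an ordinary closure
body(sin)              # NULL for primitive functions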

## Anonymous Functions

• function() returns an object of class function
• So far we've assigned that object to a name
• If we don't have an assignment, we get an anonymous function
• Usually part of some larger expression:
sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})
## [1] 0.1192029 0.2689414 0.5000000 0.7310586 0.8807971

## Anonymous Functions

• Often handy when connecting other pieces of code
• especially in things like apply and sapply
• Don't clutter the workspace
• Can't be examined or re-used later

## Gradients

• Many problems in statistics come down to optimization
• So do lots of problems in economics, physics, CS, biology, …
• Lots of optimization problems require the gradient of the objective function
• Gradient of $$f$$ at $$x$$: $\nabla f(x) = \left[\left.\frac{\partial f}{\partial x_1}\right|_{x} \ldots \left.\frac{\partial f}{\partial x_p}\right|_{x}\right]$

• We do the same thing to get the gradient of $$f$$ at $$x$$ no matter what $$f$$ is
• Find the partial derivative of $$f$$ with respect to each component of $$x$$ and return the vector of partial derivatives
• It makes no sense to re-write this every time we change $$f$$!
• Write code to calculate the gradient of an arbitrary function
• We could write our own, but there are lots of tricky issues
• Best way to calculate partial derivative
• What if $$x$$ is at the edge of the domain of $$f$$?
• Fortunately, someone has already done this

From the package numDeriv

grad(func, x, ...)
• Assumes func is a function which returns a single floating-point value
• Assumes x is a vector of arguments to func
• If x is a vector and func(x) is also a vector, then it's assumed func is vectorized and we get a vector of derivatives
• Extra arguments in ... get passed along to func
• Other functions in the package for the Jacobian of a vector-valued function, and the matrix of 2nd partials (Hessian)

require("numDeriv")
## Loading required package: numDeriv
just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))
## [1] TRUE
phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))
## [1] TRUE
grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))
## [1] 2 3

Note: grad is perfectly happy with func being an anonymous function!
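For instance, a quick check on a made-up example (since the derivative of $$\sin^2{x}$$ is $$\sin{2x}$$):

all.equal(grad(func=function(x){sin(x)^2}, x=pi/4), sin(pi/2))
# expected: TRUE, up to numerical error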

Now we can use this as a piece of a larger machine:

gradient.descent <- function(f,x,max.iterations,step.scale,
                             stopping.deriv,...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f,x,...)        # numerical gradient, from numDeriv
    if(all(abs(gradient) < stopping.deriv)) { break() }
    x <- x - step.scale*gradient     # take a step downhill
  }
  fit <- list(argmin=x, final.gradient=gradient, final.value=f(x,...),
              iterations=iteration)
  return(fit)
}

Works equally well whether f is mean squared error of a regression, $$\psi$$ error of a regression, (negative log) likelihood, cost of a production plan, …
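A minimal sanity check on a simple quadratic (the function quad and the particular settings here are made up for illustration):

quad <- function(x) { sum((x - c(1,3))^2) }
fit <- gradient.descent(f=quad, x=c(0,0), max.iterations=1e4,
                        step.scale=0.1, stopping.deriv=1e-6)
fit$argmin   # should be close to c(1, 3)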

## Cautions

• Scoping: f takes values for all names which aren't its arguments from the environment where it was defined, not the one where it is called (e.g., not from inside grad or gradient.descent) — see the sketch below
• Debugging: If f and g are both complicated, avoid debugging g(f) as a block; divide the work by writing very simple f.dummy to debug/test g, and debug/test the real f separately
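A tiny illustration of the scoping point; all names here are made up:

a <- 2
f <- function(x) { a * x^2 }             # a is not an argument of f
g <- function(fun) { a <- 100; fun(1) }  # a local a inside g
g(f)   # returns 2, not 100: f finds a where f was defined, not where it is called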

## Returning Functions

• Functions can be return values like anything else
• Create a linear predictor, based on sample values of two variables
make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
    return(predict(object=linear.fit,newdata=data.frame(x=x)))
  }
  return(predictor)
}
• The predictor function persists and works, even when the data we used to create it is gone

## Returning Functions

library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt,y=cats$Hwt)
rm(cats)            # Data set goes away
vet_predictor(3.5)  # My cat's body mass in kilograms
##        1
## 13.76256

## Example: curve()

• A call to curve looks like this:
curve(expr, from = a, to = b, ...)
• expr is some expression involving a variable called x which is swept from the value a to the value b
• ... are other plot-control arguments
• curve feeds the expression a vector x and expects a numeric vector back (so the call below is fine)

curve(x^2 * sin(x))

## Example: curve()

We can use our own function in curve:

psi <- function(x,c=1) {ifelse(abs(x)>c,2*c*abs(x)-c^2,x^2)}
curve(psi(x,c=10),from=-20,to=20)

Or this, holding psi's first argument fixed and sweeping c instead:

curve(psi(x=10,c=x),from=-20,to=20)

## Example: curve()

If our function doesn't take vectors to vectors, curve becomes unhappy

mse <- function(y0,a,Y=gmp$pcgmp,N=gmp$pop) {
mean((Y - y0*(N^a))^2)
}
> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) :
'expr' did not evaluate to an object of length 'n'
In N^a : longer object length is not a multiple of shorter object length

How do we solve this?

## Example: curve()

Define a new, vectorized function, say with sapply:

sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse(6611,0.10)
## [1] 154701953
mse.plottable <- function(a,...){ return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513

## Example: curve()

curve(mse.plottable(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.plottable(a=x,y0=5100),add=TRUE,col="blue")

## Example: curve()

Alternate strategy: Vectorize() returns a new, vectorized function

mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse.vec(a=1/8,y0=c(5000,6000,7000))
## [1] 134617132  74693733  63732256

## Example: curve()

curve(mse.vec(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.vec(a=x,y0=5100),add=TRUE,col="blue")

## Optimization Problems

Given an objective function $$f: \mathcal{D} \mapsto \mathbb{R}$$, find $\DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \theta^* = \argmin_{\theta}{f(\theta)}$

Basics: maximizing $$f$$ is minimizing $$-f$$: $\argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)}$

If $$h$$ is strictly increasing (e.g., $$\log$$), then $\argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))}$

## Considerations

• Approximation: How close can we get to $$\theta^*$$, and/or $$f(\theta^*)$$?
• Time complexity: How many computer steps does that take? Varies with precision of approximation, niceness of $$f$$, size of $$\mathcal{D}$$, size of data, method…
• Most optimization algorithms use successive approximation, so distinguish number of iterations from cost of each iteration

## Use calculus

Suppose $$x$$ is one-dimensional and $$f$$ is smooth. If $$x^*$$ is an interior minimum / maximum / extremum point, then ${\left. \frac{df}{dx} \right|}_{x=x^*} = 0$

If $$x^*$$ is a minimum, ${\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0$

## Use calculus

This all carries over to multiple dimensions: At an interior extremum, $\nabla f(\theta^*) = 0$

At an interior minimum, $\nabla^2 f(\theta^*) \geq 0$, meaning that for any vector $$v$$, $v^T \nabla^2 f(\theta^*) v \geq 0$; here $$\nabla^2 f$$ is the Hessian, $$\mathbf{H}$$. Note that $$\theta^*$$ might only be a local minimum.

## Gradient Descent

$f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}}$ $f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0)$

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: $f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0)$ $$\nabla f(\theta_0)$$ points in the direction of fastest ascent at $$\theta_0$$

1. Start with initial guess for $$\theta$$, step-size $$\eta$$
2. While ((not too tired) and (making adequate progress))
• Find gradient $$\nabla f(\theta)$$
• Set $$\theta \leftarrow \theta - \eta \nabla f(\theta)$$
3. Return final $$\theta$$ as approximate $$\theta^*$$

Variations: adaptively adjust $$\eta$$ to make sure of improvement, or search along the gradient direction for the minimum

Pro:

• Moves in direction of greatest immediate improvement
• If $$\eta$$ is small enough, gets to a local minimum eventually, and then stops

Cons:

• "small enough" $$\eta$$ can be really, really small
• Slowness or zig-zagging if components of $$\nabla f$$ are very different sizes

How much work do we need?

## Scaling

Big-$$O$$ notation: $h(x) = O(g(x))$ means $\lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c$ for some $$c \neq 0$$

e.g., $$x^2 - 5000x + 123456778 = O(x^2)$$

e.g., $$e^{x}/(1+e^{x}) = O(1)$$

Useful to look at over-all scaling, hiding details

Also done when the limit is $$x\rightarrow 0$$

Pro (for gradient descent):

• For nice $$f$$, $$f(\theta) \leq f(\theta^*)+\epsilon$$ in $$O(\epsilon^{-2})$$ iterations
• For very nice $$f$$, only $$O(\log{\epsilon^{-1}})$$ iterations
• To get $$\nabla f(\theta)$$, take $$p$$ derivatives, $$\therefore$$ each iteration costs $$O(p)$$

Con:

• Taking derivatives can slow down as data grows — each iteration might really be $$O(np)$$

## Taylor Series

What if we do a quadratic approximation to $$f$$? $f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

Special cases of general idea of Taylor approximation

Simplifies if $$x_0$$ is a minimum since then $$f^{\prime}(x_0) = 0$$: $f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: $f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0)$

## Minimizing a Quadratic

If we know $f(x) = ax^2 + bx + c$ we can minimize exactly: $\begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*}$

If $f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c$ then $x^* = x_0 - a^{-1}b$
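A quick numerical check of $$x^* = -b/2a$$, using made-up coefficients and R's built-in one-dimensional optimizer optimize():

a <- 2; b <- -3; c0 <- 1   # c0 rather than c, to avoid masking the c() function
optimize(function(x) { a*x^2 + b*x + c0 }, lower=-10, upper=10)$minimum
# should be close to -b/(2*a) = 0.75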

## Newton's Method

Taylor-expand for the value at the minimum $$\theta^*$$: $f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta)$

Take the gradient with respect to $$\theta^*$$, set it to zero, and solve: $\begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*}$

Works exactly if $$f$$ is quadratic and $$\mathbf{H}^{-1}$$ exists, etc.

If $$f$$ isn't quadratic, keep pretending it is until we get close to $$\theta^*$$, when it will be nearly true

## Newton's Method

The Algorithm

1. Start with guess for $$\theta$$
2. While ((not too tired) and (making adequate progress))
• Find gradient $$\nabla f(\theta)$$ and Hessian $$\mathbf{H}(\theta)$$
• Set $$\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)$$
3. Return final $$\theta$$ as approximation to $$\theta^*$$

Like gradient descent, but with inverse Hessian giving the step-size
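A minimal sketch in R, leaning on numDeriv's grad() and hessian() for the derivatives (the name newton.method and its defaults are ours, for illustration only):

newton.method <- function(f, x, max.iterations=100, stopping.deriv=1e-8, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)
    if (all(abs(gradient) < stopping.deriv)) { break() }
    # solve H(theta) %*% step = gradient rather than explicitly inverting H
    x <- x - solve(hessian(f, x, ...), gradient)
  }
  return(list(argmin=x, final.gradient=gradient, iterations=iteration))
}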

## Newton's Method

Pros:

• Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
• Also guaranteed to need $$O(\epsilon^{-2})$$ steps to get within $$\epsilon$$ of optimum
• Only $$O(\log\log{\epsilon^{-1}})$$ for very nice functions
• Typically many fewer iterations than gradient descent

## Newton's Method

Cons:

• Hopeless if $$\mathbf{H}$$ doesn't exist or isn't invertible
• Need to take $$O(p^2)$$ second derivatives plus $$p$$ first derivatives
• Need to solve $$\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})$$ for $$\theta_{\mathrm{new}}$$
• Inverting $$\mathbf{H}$$ is $$O(p^3)$$, but cleverness gives $$O(p^2)$$ for solving for $$\theta_{\mathrm{new}}$$

## Getting Around the Hessian

Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step

Approaches:

• Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives ("Fisher scoring")
• Use only diagonal entries ($$p$$ unmixed 2nd derivatives)
• Use $$\mathbf{H}(\theta)$$ at initial guess, hope $$\mathbf{H}$$ changes very slowly with $$\theta$$
• Re-compute $$\mathbf{H}(\theta)$$ every $$k$$ steps, $$k > 1$$
• Fast, approximate updates to the Hessian at each step (BFGS)
• Lots of other methods!
• Nelder-Mead, a.k.a. "the simplex method", which doesn't need any derivatives

## Nelder-Mead

Try to cage $$\theta^*$$ with a simplex of $$p+1$$ points

Order the trial points, $$f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})$$

$$\theta_{p+1}$$ is the worst guess — try to improve it

Center of the not-worst points: $$\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}$$

To improve the worst guess $$\theta_{p+1}$$:

1. Reflection: Try $$\theta_0 - (\theta_{p+1}-\theta_0)$$, across the center from $$\theta_{p+1}$$
• if it's better than $$\theta_p$$ but not better than $$\theta_1$$, replace the old $$\theta_{p+1}$$ with it
2. Expansion: if the reflected point is the new best, try $$\theta_0 - 2(\theta_{p+1}-\theta_0)$$; replace the old $$\theta_{p+1}$$ with the better of the reflected and the expanded point
3. Contraction: If the reflected point is worse than $$\theta_p$$, try $$\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}$$; if the contracted point is better, replace $$\theta_{p+1}$$ with it
4. Reduction: If all else fails, $$\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}$$
5. Go back to (1) until we stop improving or run out of time

The Moves:

• Reflection: try the opposite of the worst point
• Expansion: if that really helps, try it some more
• Contraction: see if we overshot when trying the opposite
• Reduction: if all else fails, try making each point more like the best point

Pros:

• Each iteration $$\leq 4$$ values of $$f$$, plus sorting (and sorting is at most $$O(p\log{p})$$, usually much better)
• No derivatives used; can even work for discontinuous $$f$$

Con:

• Can need many more iterations than gradient methods
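In R, optim() uses Nelder-Mead as its default method; a tiny example on a made-up objective:

fit <- optim(par=c(0,0), fn=function(theta) { (theta[1]-1)^2 + (theta[2]+2)^2 })
fit$par      # should be close to c(1, -2)
fit$counts   # how many times fn was evaluated (no gradient calls needed)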

## Coordinate Descent

Gradient descent, Newton's method, simplex, etc., adjust all coordinates of $$\theta$$ at once — gets harder as the number of dimensions $$p$$ grows

Coordinate descent: never do more than 1D optimization

• Start with initial guess $$\theta$$
• While ((not too tired) and (making adequate progress))
• For $$i \in (1:p)$$
• do 1D optimization over $$i^{\mathrm{th}}$$ coordinate of $$\theta$$, holding the others fixed
• Update $$i^{\mathrm{th}}$$ coordinate to this optimal value
• Return final value of $$\theta$$
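A minimal sketch of this loop, using optimize() as the one-dimensional optimizer; the function name and the fixed search interval are our own simplifications:

coordinate.descent <- function(f, x, sweeps=20, lower=-10, upper=10, ...) {
  for (sweep in 1:sweeps) {
    for (i in 1:length(x)) {
      f.1d <- function(z) { x[i] <- z; f(x, ...) }   # vary coordinate i only
      x[i] <- optimize(f.1d, lower=lower, upper=upper)$minimum
    }
  }
  return(x)
}
coordinate.descent(function(x) { (x[1]-1)^2 + (x[2]+2)^2 + x[1]*x[2] }, x=c(0,0))
# should be close to the exact minimizer, c(8/3, -10/3)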

## Coordinate Descent

Cons:

• Needs a good 1D optimizer
• Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

• Can be extremely fast and simple

## Curve-Fitting by Optimizing

We have data $$(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$$, and possible curves, $$r(x;\theta)$$ e.g.

• $$r(x) = x \cdot \theta$$
• $$r(x) = \theta_1 x^{\theta_2}$$
• $$r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}$$ for fixed "basis" functions $$b_j$$

## Curve-Fitting by Optimizing

Least-squares curve fitting: $\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}}$
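A hedged sketch of least-squares fitting for $$r(x;\theta) = \theta_1 x^{\theta_2}$$, using simulated data and optim(); the data and starting values here are entirely made up:

mse.curve <- function(theta, x, y) { mean((y - theta[1]*x^theta[2])^2) }
x <- runif(100, min=1, max=10)
y <- 2*x^0.5 + rnorm(100, sd=0.1)
optim(par=c(1,1), fn=mse.curve, x=x, y=y)$par
# should be close to c(2, 0.5)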

"Robust" curve fitting: $\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}}$

## Summary

• In R, functions are objects, and can be arguments to other functions
• Functions can also be returned by other functions
• Optimization
• Trade-offs: complexity of each iteration vs. number of iterations vs. precision of approximation
• Next time: pre-built code like optim() and nls()