---
title: 'Optimization I'
author: "James M. Flegal"
output:
  ioslides_presentation:
    smaller: yes
---
## Agenda
- Functions are objects: can be arguments for or returned by other functions
- Example: `curve()`
- Optimization via gradient descent, Newton's method, Nelder-Mead, ...
- Curve-fitting by optimizing
## Functions as Objects
- In R, functions are objects, just like everything else
- This means that they can be passed to functions as arguments and returned by functions as outputs as well
- We often want to do very similar things to many different functions
- The procedure is the same, only the function we're working with changes
- Write one function to do the job, and pass the function as an argument
- Because R treats a function like any other object, we can do this simply: invoke
the function by its argument name in the body
## Functions as Objects
- We have already seen examples
- `apply()`, `sapply()`, etc.: Take _this_ function and use it on all of _these_ objects
- `nlm()`: Take _this_ function and try to make it small, starting from _here_
- `ks.test()`: Compare _these_ data to _this_ cumulative distribution function
- `curve()`: Evaluate _this_ function over _that_ range, and plot the results
## Syntax Facts About Functions
- Typing a function's name, without parentheses, in the terminal gives you its source code
- Functions are their own **class** in R
```{r}
class(sin)
class(sample)
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)
```
## Syntax Facts About Functions
- Functions can be put into lists or even arrays
- A call to `function` returns a function object
+ body executed: access with `body(foo)`
+ arguments required: access with `formals(foo)`
+ parent environment: access with `environment(foo)`
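For example, with the `resample()` function defined above:
```{r}
body(resample)
formals(resample)
environment(resample)
```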
## Syntax Facts About Functions
R has separate **types** for built-in functions and for those written in R:
```{r}
typeof(resample)
typeof(sample)
typeof(sin)
```
Why `closure` for written-in-R functions? Because they "close over" the environment where they were defined
There's also a 2nd class of built-in functions called `primitive`
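`is.primitive()` checks for these:
```{r}
is.primitive(sin)
is.primitive(resample)
```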
## Anonymous Functions
- `function()` returns an object of class `function`
- So far we've assigned that object to a name
- If we don't have an assignment, we get an **anonymous function**
- Usually part of some larger expression:
```{r}
sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})
```
## Anonymous Functions
- Often handy when connecting other pieces of code
+ especially in things like `apply` and `sapply`
- Don't clutter the workspace
- Can't be examined or re-used later
## Example: grad()
- Many problems in statistics come down to optimization
- So do lots of problems in economics, physics, CS, biology, ...
- Lots of optimization problems require the gradient of the **objective function**
- Gradient of $f$ at $x$:
\[
\nabla f(x) = \left[\left.\frac{\partial f}{\partial x_1}\right|_{x} \ldots \left.\frac{\partial f}{\partial x_p}\right|_{x}\right]
\]
## Example: grad()
- We do the same thing to get the gradient of $f$ at $x$ no matter what $f$ is
- Find the partial derivative of $f$ with respect to each component of $x$ and return the vector of partial derivatives
- It makes no sense to re-write this every time we change $f$!
- Write code to calculate the gradient of an arbitrary function
- We _could_ write our own, but there are lots of tricky issues
+ Best way to calculate partial derivative
+ What if $x$ is at the edge of the domain of $f$?
- Fortunately, someone has already done this
## Example: grad()
From the package `numDeriv`
```{r,eval=FALSE}
grad(func, x, ...)
```
- Assumes `func` is a function which returns a single floating-point value
- Assumes `x` is a vector of arguments to `func`
+ If `x` is a vector and `func(x)` is also a vector, then it's assumed `func` is vectorized and we get a vector of derivatives
- Extra arguments in `...` get passed along to `func`
- Other functions in the package for the Jacobian of a vector-valued function, and the matrix of 2nd partials (Hessian)
## Example: grad()
```{r}
require("numDeriv")
just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))
phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))
grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))
```
Note: `grad` is perfectly happy with `func` being an anonymous function!
## gradient.descent()
Now we can use this as a piece of a larger machine:
```
gradient.descent <- function(f, x, max.iterations, step.scale,
                             stopping.deriv, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)   # numerical gradient via numDeriv
    if (all(abs(gradient) < stopping.deriv)) { break }  # flat enough: stop
    x <- x - step.scale * gradient  # step downhill
  }
  fit <- list(argmin=x, final.gradient=gradient, final.value=f(x, ...),
              iterations=iteration)
  return(fit)
}
```
Works equally well whether `f` is mean squared error of a regression, $\psi$ error of a regression, (negative log) likelihood, cost of a production plan, ...
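## gradient.descent()
A quick sanity check (a made-up test function, not from the original data): minimize a quadratic bowl whose minimum we know is at $(1, -2)$
```{r,eval=FALSE}
# hypothetical test function: quadratic bowl with minimum at c(1,-2)
bowl <- function(theta) { (theta[1]-1)^2 + (theta[2]+2)^2 }
fit <- gradient.descent(bowl, x=c(0,0), max.iterations=1000,
                        step.scale=0.1, stopping.deriv=1e-6)
fit$argmin   # should be very close to c(1,-2)
```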
## Cautions
- _Scoping_: `f` takes values for all names which aren't its arguments
from the environment where it was defined, not the one where it is called
(e.g., not from inside `grad` or `gradient.descent`)
- _Debugging_: If `f` and `g` are both complicated, avoid
debugging `g(f)` as a block; divide the work by writing _very
simple_ `f.dummy` to debug/test `g`, and debug/test the real
`f` separately
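## Cautions
A small illustration of the scoping point (toy functions of our own):
```{r}
y <- 7
f.demo <- function(x) { x + y }        # y comes from where f.demo is DEFINED
g.demo <- function(f) { y <- 1000; f(1) }  # this local y is invisible to f.demo
g.demo(f.demo)                         # 8, not 1001
```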
## Returning Functions
- Functions can be return values like anything else
- Create a linear predictor, based on sample values of two variables
```{r}
make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
    return(predict(object=linear.fit, newdata=data.frame(x=x)))
  }
  return(predictor)
}
```
- The predictor function persists and works, even when the data we used to create it is gone
## Returning Functions
```{r}
library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt,y=cats$Hwt)
rm(cats) # Data set goes away
vet_predictor(3.5) # My cat's body mass in kilograms
```
## Example: curve()
- A call to `curve` looks like this:
```{r,eval=FALSE}
curve(expr, from = a, to = b, ...)
```
- `expr` is some expression involving a variable called `x` which is swept `from` the value `a` `to` the value `b`
- `...` are other plot-control arguments
- `curve` feeds the expression a vector `x` and expects a numeric vector back (so this is fine)
```
curve(x^2 * sin(x))
```
## Example: curve()
We can use our own function in `curve`:
```{r,eval=FALSE}
psi <- function(x,c=1) {ifelse(abs(x)>c,2*c*abs(x)-c^2,x^2)}
curve(psi(x,c=10),from=-20,to=20)
```
Or this
```{r,eval=FALSE}
curve(psi(x=10,c=x),from=-20,to=20)
```
## Example: curve()
If our function doesn't take vectors to vectors, `curve` becomes unhappy
```{r,echo=FALSE}
gmp <- read.table("gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
```
```{r}
mse <- function(y0, a, Y=gmp$pcgmp, N=gmp$pop) {
  mean((Y - y0*(N^a))^2)
}
```
```{r,eval=FALSE}
> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) :
'expr' did not evaluate to an object of length 'n'
In addition: Warning message:
In N^a : longer object length is not a multiple of shorter object length
```
How do we solve this?
## Example: curve()
Define a new, vectorized function, say with `sapply`:
```{r}
sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)
mse(6611,0.10)
mse.plottable <- function(a,...){ return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01),y0=6611)
```
## Example: curve()
```{r}
curve(mse.plottable(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.plottable(a=x,y0=5100),add=TRUE,col="blue")
```
## Example: curve()
Alternate strategy: `Vectorize()` returns a new, vectorized function
```{r}
mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01),y0=6611)
mse.vec(a=1/8,y0=c(5000,6000,7000))
```
## Example: curve()
```{r}
curve(mse.vec(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.vec(a=x,y0=5100),add=TRUE,col="blue")
```
# Optimization
## Examples of Optimization Problems
- Minimize mean-squared error of regression surface
- Maximize likelihood of distribution
- Maximize output of tanks from given supplies and factories
- Maximize return of portfolio for given volatility
- Minimize cost of airline flight schedule
- Maximize reproductive fitness of an organism
- http://www.benfrederickson.com/numerical-optimization/
- https://www.youtube.com/watch?v=x2KbdoxrQ6o
## Optimization Problems
Given an **objective function** $f: \mathcal{D} \to \mathbb{R}$, find
\[ \DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\theta^* = \argmin_{\theta}{f(\theta)}
\]
Basics: maximizing $f$ is minimizing $-f$:
\[ \argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)} \]
If $h$ is strictly increasing (e.g., $\log$), then
\[ \argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))} \]
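## Optimization Problems
A one-line check of the max/min identity, using base R's 1D optimizer `optimize()` (a simple illustration):
```{r}
optimize(function(x) { -dnorm(x) }, interval=c(-5,5))$minimum  # argmin of -f
optimize(dnorm, interval=c(-5,5), maximum=TRUE)$maximum        # argmax of f
```
Both land at (essentially) the same point: the mode of the standard normal, at 0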
## Considerations
- Approximation: How close can we get to $\theta^*$, and/or $f(\theta^*)$?
- Time complexity: How many computer steps does that take? Varies with precision of approximation, niceness of $f$, size of $\mathcal{D}$, size of data, method...
- Most optimization algorithms use **successive approximation**, so distinguish number of iterations from cost of each iteration
## Use calculus
Suppose $x$ is one dimensional and $f$ is smooth. If $x^*$ is an **interior** extremum (minimum or maximum), then
\[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]
If $x^*$ is a minimum,
\[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]
## Use calculus
This all carries over to multiple dimensions: At an **interior extremum**,
\[ \nabla f(\theta^*) = 0 \]
At an **interior minimum**, \[ \nabla^2 f(\theta^*) \geq 0 \]
meaning for any vector $v$,
\[ v^T \nabla^2 f(\theta^*) v \geq 0 \]
$\nabla^2 f$ is the **Hessian**, $\mathbf{H}$

Caution: $\theta^*$ might just be a **local** minimum
## Gradient Descent
\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \]
\[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \]
Locally, the function looks linear; to minimize a linear function, move down the slope
Multivariate version:
\[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \]
$\nabla f(\theta_0)$ points in the direction of fastest ascent at $\theta_0$
## Gradient Descent
1. Start with initial guess for $\theta$, step-size $\eta$
2. While ((not too tired) and (making adequate progress))
- Find gradient $\nabla f(\theta)$
- Set $\theta \leftarrow \theta - \eta \nabla f(\theta)$
3. Return final $\theta$ as approximate $\theta^*$

Variations: adaptively adjust $\eta$ to make sure of improvement, or search along the gradient direction for the minimum
## Gradient Descent
Pros:

- Moves in direction of greatest immediate improvement
- If $\eta$ is small enough, gets to a local minimum eventually, and then stops

Cons:

- "small enough" $\eta$ can be really, really small
- Slowness or zig-zagging if components of $\nabla f$ are very different sizes

How much work do we need?
## Scaling
Big-$O$ notation:
\[ h(x) = O(g(x)) \]
means
\[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \]
for some $c \neq 0$
e.g., $x^2 - 5000x + 123456778 = O(x^2)$
e.g., $e^{x}/(1+e^{x}) = O(1)$
Useful to look at overall scaling, hiding details

Also done when the limit is $x\rightarrow 0$
## Gradient Descent
Pros:

- For nice $f$, $f(\theta) \leq f(\theta^*)+\epsilon$ in $O(\epsilon^{-2})$ iterations
    + For _very_ nice $f$, only $O(\log{\epsilon^{-1}})$ iterations
- To get $\nabla f(\theta)$, take $p$ derivatives, $\therefore$ each iteration costs $O(p)$
Con:
- Taking derivatives can slow down as data grows --- each iteration might really be $O(np)$
## Taylor Series
What if we do a quadratic approximation to $f$?
\[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
A special case of the general idea of Taylor approximation
Simplifies if $x_0$ is a minimum since then $f^{\prime}(x_0) = 0$:
\[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]
Near a minimum, smooth functions look like parabolas
Carries over to the multivariate case:
\[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]
## Minimizing a Quadratic
If we know
\[ f(x) = ax^2 + bx + c \]
we minimize exactly:
\[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \]
If
\[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \]
then
\[ x^* = x_0 - a^{-1}b \]
## Newton's Method
Taylor-expand for the value *at the minimum* $\theta^*$
\[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \]
Take gradient, set to zero, solve for $\theta^*$:
\[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \]
Works *exactly* if $f$ is quadratic and $\mathbf{H}^{-1}$ exists, etc.
If $f$ isn't quadratic, keep pretending it is until we get close to $\theta^*$, when it will be nearly true
## Newton's Method
The Algorithm
1. Start with guess for $\theta$
2. While ((not too tired) and (making adequate progress))
- Find gradient $\nabla f(\theta)$ and Hessian $\mathbf{H}(\theta)$
- Set $\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)$
3. Return final $\theta$ as approximation to $\theta^*$
Like gradient descent, but with inverse Hessian giving the step-size
"This is about how far you can go with that gradient"
## Newton's Method
Pros:
- Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
- Also guaranteed to need $O(\epsilon^{-2})$ steps to get within $\epsilon$ of optimum
- Only $O(\log\log{\epsilon^{-1}})$ for very nice functions
- Typically many fewer iterations than gradient descent
## Newton's Method
Cons:
- Hopeless if $\mathbf{H}$ doesn't exist or isn't invertible
- Need to take $O(p^2)$ second derivatives *plus* $p$ first derivatives
- Need to solve $\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})$ for $\theta_{\mathrm{new}}$
- Inverting $\mathbf{H}$ is $O(p^3)$, but cleverness gives $O(p^2)$ for solving for $\theta_{\mathrm{new}}$
## Getting Around the Hessian
Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step
Approaches:
- Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives ("Fisher scoring")
- Use only diagonal entries ($p$ unmixed 2nd derivatives)
- Use $\mathbf{H}(\theta)$ at initial guess, hope $\mathbf{H}$ changes *very* slowly with $\theta$
- Re-compute $\mathbf{H}(\theta)$ every $k$ steps, $k > 1$
- Fast, approximate updates to the Hessian at each step ([BFGS](https://en.wikipedia.org/wiki/Broyden–Fletcher–Goldfarb–Shanno_algorithm)) --- see the `optim()` example on the next slide
- Lots of other methods!
- Nelder-Mead, a.k.a. "the simplex method", which doesn't need any derivatives
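## Getting Around the Hessian
R's built-in `optim()` implements BFGS via `method="BFGS"`; for instance, on the Rosenbrock banana function (the standard test case from `?optim`):
```{r,eval=FALSE}
rosenbrock <- function(th) { (1-th[1])^2 + 100*(th[2]-th[1]^2)^2 }
optim(par=c(-1.2,1), fn=rosenbrock, method="BFGS")$par  # true minimum at c(1,1)
```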
## Nelder-Mead
Try to cage $\theta^*$ with a **simplex** of $p+1$ points
Order the trial points, $f(\theta_1) \leq f(\theta_2) \ldots \leq f(\theta_{p+1})$
$\theta_{p+1}$ is the worst guess --- try to improve it
Center of the not-worst = $\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}$
## Nelder-Mead
Try to improve the worst guess $\theta_{p+1}$
1. **Reflection**: Try $\theta_0 - (\theta_{p+1}-\theta_0)$, across the center from $\theta_{p+1}$
+ if it's better than $\theta_p$ but not better than $\theta_1$, replace the old $\theta_{p+1}$ with it
+ **Expansion**: if the reflected point is the new best, try $\theta_0 - 2(\theta_{p+1}-\theta_0)$; replace the old $\theta_{p+1}$ with the better of the reflected and the expanded point
2. **Contraction**: If the reflected point is worse than $\theta_p$, try $\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}$; if the contracted point is better, replace $\theta_{p+1}$ with it
3. **Reduction**: If all else fails, $\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}$
4. Go back to (1) until we stop improving or run out of time
## Making Sense of Nelder-Mead
The Moves:
- Reflection: try the opposite of the worst point
- Expansion: if that really helps, try it some more
- Contraction: see if we overshot when trying the opposite
- Reduction: if all else fails, try making each point more like the best point
## Making Sense of Nelder-Mead
Pros:
- Each iteration needs $\leq 4$ evaluations of $f$, plus sorting (and sorting is at most $O(p\log{p})$, usually much better)
- No derivatives used, can even work for discontinuous $f$
Con:
- Can need _many_ more iterations than gradient methods
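## Nelder-Mead in R
Nelder-Mead is the default method of R's `optim()`, so no derivatives are needed (same toy quadratic bowl as before):
```{r,eval=FALSE}
optim(par=c(0,0), fn=function(theta) { (theta[1]-1)^2 + (theta[2]+2)^2 })$par
```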
## Coordinate Descent
Gradient descent, Newton's method, simplex, etc., adjust all coordinates of $\theta$ at once --- gets harder as the number of dimensions $p$ grows
**Coordinate descent**: never do more than 1D optimization
- Start with initial guess $\theta$
- While ((not too tired) and (making adequate progress))
+ For $i \in (1:p)$
* do 1D optimization over $i^{\mathrm{th}}$ coordinate of $\theta$, holding the others fixed
* Update $i^{\mathrm{th}}$ coordinate to this optimal value
- Return final value of $\theta$
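## Coordinate Descent
A minimal sketch (our own toy implementation, with a hypothetical fixed search interval), using base R's 1D optimizer `optimize()`:
```{r,eval=FALSE}
coordinate.descent <- function(f, x, max.iterations, interval=c(-10,10)) {
  for (iteration in 1:max.iterations) {
    for (i in seq_along(x)) {
      f.1d <- function(z) { x[i] <- z; f(x) }  # vary coordinate i, others fixed
      x[i] <- optimize(f.1d, interval=interval)$minimum
    }
  }
  return(x)  # no convergence check here, just a fixed number of sweeps
}
coordinate.descent(function(theta) { (theta[1]-1)^2 + (theta[2]+2)^2 },
                   x=c(0,0), max.iterations=10)  # close to c(1,-2)
```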
## Coordinate Descent
Cons:
- Needs a good 1D optimizer
- Can bog down for very tricky functions, especially with lots of interactions among variables
Pros:
- Can be extremely fast and simple
## Curve-Fitting by Optimizing
We have data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and a family of possible curves $r(x;\theta)$, e.g.
- $r(x) = x \cdot \theta$
- $r(x) = \theta_1 x^{\theta_2}$
- $r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}$ for fixed "basis" functions $b_j$
## Curve-Fitting by Optimizing
Least-squares curve fitting:
\[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]
"Robust" curve fitting:
\[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]
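## Curve-Fitting by Optimizing
For instance, a quick sketch (reusing `mse()` from the `curve()` example) of least-squares fitting of $r(x) = \theta_1 x^{\theta_2}$ to the `gmp` data:
```{r,eval=FALSE}
# theta = c(y0, a); optim() minimizes over both parameters at once
fit <- optim(par=c(6611, 0.125),
             fn=function(theta) { mse(y0=theta[1], a=theta[2]) })
fit$par
```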
## Summary
- In R, functions are objects, and can be arguments to other functions
- Functions can also be returned by other functions
- Optimization
+ Trade-offs: complexity of iteration vs. number of iterations vs. precision of approximation
+ Gradient descent: less complex iterations, more guarantees, less adaptive
+ Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
+ Next time: pre-built code like `optim()` and `nls()`