- Functions are objects: can be arguments for or returned by other functions
- Example: `curve()`
- Optimization via gradient descent, Newton's method, Nelder-Mead, …
- Curve-fitting by optimizing


- In R, functions are objects, just like everything else
- This means they can be passed to functions as arguments and returned by functions as outputs
- We often want to do very similar things to many different functions
- The procedure is the same, only the function we're working with changes
- Write one function to do the job, and pass the function as an argument
- Because R treats a function like any other object, we can do this simply: invoke the function by its argument name in the body

- We have already seen examples:
  - `apply()`, `sapply()`, etc.: take *this* function and use it on all of *these* objects
  - `nlm()`: take *this* function and try to make it small, starting from *here*
  - `ks.test()`: compare *these* data to *this* cumulative distribution function
  - `curve()`: evaluate *this* function over *that* range, and plot the results

- Typing a function's name, without parentheses, in the terminal gives you its source code
- Functions are their own **class** in R:

class(sin)

## [1] "function"

class(sample)

## [1] "function"

resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)

## [1] "function"

- Functions can be put into lists or even arrays
- A call to `function` returns a function object, with three accessible parts:
  - the body, executed when the function is called: access with `body(foo)`
  - the arguments required: access with `formals(foo)`
  - the parent environment: access with `environment(foo)`
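For instance, applying these accessors to the `resample` function defined above (output as from a top-level session, where `resample` was defined in the global environment):

body(resample)

## {
##     sample(x, size = length(x), replace = TRUE)
## }

formals(resample)

## $x

environment(resample)

## <environment: R_GlobalEnv>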

R has separate **types** for built-in functions and for those written in R:

typeof(resample)

## [1] "closure"

typeof(sample)

## [1] "closure"

typeof(sin)

## [1] "builtin"

Why `closure` for functions written in R? Because their expressions are "closed" by referring to the parent environment

There's also a 2nd class of built-in functions called `primitive`

- `function()` returns an object of class `function`
- So far we've assigned that object to a name
- If we don't make an assignment, we get an **anonymous function**
- Usually part of some larger expression:

sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})

## [1] 0.1192029 0.2689414 0.5000000 0.7310586 0.8807971

- Often handy when connecting other pieces of code
  - especially in things like `apply` and `sapply`
- Won't clutter the workspace
- Can't be examined or re-used later

- Many problems in statistics come down to optimization
- So do lots of problems in economics, physics, CS, biology, …
- Lots of optimization problems require the gradient of the **objective function**
- Gradient of \(f\) at \(x\): \[ \nabla f(x) = \left[\left.\frac{\partial f}{\partial x_1}\right|_{x} \ldots \left.\frac{\partial f}{\partial x_p}\right|_{x}\right] \]

- We do the same thing to get the gradient of \(f\) at \(x\) no matter what \(f\) is
- Find the partial derivative of \(f\) with respect to each component of \(x\) and return the vector of partial derivatives
- It makes no sense to re-write this every time we change \(f\)!
- Write code to calculate the gradient of an arbitrary function
- We *could* write our own, but there are lots of tricky issues:
  - Best way to calculate each partial derivative
  - What if \(x\) is at the edge of the domain of \(f\)?

- Fortunately, someone has already done this

From the package `numDeriv`

grad(func, x, ...)

- Assumes `func` is a function which returns a single floating-point value
- Assumes `x` is a vector of arguments to `func`
  - If `x` is a vector and `func(x)` is also a vector, then `func` is assumed to be vectorized and we get a vector of derivatives
- Extra arguments in `...` get passed along to `func`
- Other functions in the package handle the Jacobian of a vector-valued function, and the matrix of 2nd partials (Hessian)

require("numDeriv")

## Loading required package: numDeriv

just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))

## [1] TRUE

phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))

## [1] TRUE

grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))

## [1] 2 3

Note: `grad` is perfectly happy with `func` being an anonymous function!

Now we can use this as a piece of a larger machine:

gradient.descent <- function(f, x, max.iterations, step.scale,
                             stopping.deriv, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)
    if (all(abs(gradient) < stopping.deriv)) { break() }
    x <- x - step.scale * gradient
  }
  fit <- list(argmin=x, final.gradient=gradient,
              final.value=f(x,...), iterations=iteration)
  return(fit)
}

Works equally well whether `f` is the mean squared error of a regression, a \(\psi\) error of a regression, a (negative log) likelihood, the cost of a production plan, …
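As a quick check of the machine, a minimal sketch: the quadratic `f.quad`, its known minimum at \((1,-2)\), and all the tuning values here are invented for illustration.

f.quad <- function(x) { (x[1]-1)^2 + (x[2]+2)^2 } # minimized at c(1, -2)
fit <- gradient.descent(f=f.quad, x=c(0,0), max.iterations=1e4,
                        step.scale=0.1, stopping.deriv=1e-6)
fit$argmin # should be close to c(1, -2)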

*Scoping*: `f` takes values for all names which aren't its arguments from the environment where it was defined, not the one where it is called (e.g., not from inside `grad` or `gradient.descent`)

*Debugging*: If `f` and `g` are both complicated, avoid debugging `g(f)` as a block; divide the work by writing a *very simple* `f.dummy` to debug/test `g`, and debug/test the real `f` separately

- Functions can be return values like anything else
- Create a linear predictor, based on sample values of two variables

make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
    return(predict(object=linear.fit, newdata=data.frame(x=x)))
  }
  return(predictor)
}

- The predictor function persists and works, even when the data we used to create it is gone

library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt, y=cats$Hwt)
rm(cats) # Data set goes away
vet_predictor(3.5) # My cat's body mass in kilograms

##        1
## 13.76256

- A call to `curve` looks like this:

curve(expr, from = a, to = b, ...)

- `expr` is some expression involving a variable called `x`, which is swept `from` the value `a` `to` the value `b`
- `...` are other plot-control arguments
- `curve` feeds the expression a vector `x` and expects a numeric vector back, so this is fine:

curve(x^2 * sin(x))

We can use our own function in `curve`:

psi <- function(x,c=1) { ifelse(abs(x)>c, 2*c*abs(x)-c^2, x^2) }
curve(psi(x,c=10), from=-20, to=20)

Or this, sweeping `c` with the input fixed at `x=10`:

curve(psi(x=10,c=x),from=-20,to=20)

If our function doesn't take vectors to vectors, `curve` becomes unhappy:

mse <- function(y0,a,Y=gmp$pcgmp,N=gmp$pop) { mean((Y - y0*(N^a))^2) }

> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) :
  'expr' did not evaluate to an object of length 'n'
In addition: Warning message:
In N^a : longer object length is not a multiple of shorter object length

How do we solve this?

Define a new, vectorized function, say with `sapply`:

sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)

## [1] 154701953 102322974 68755654 64529166 104079527 207057513

mse(6611,0.10)

## [1] 154701953

mse.plottable <- function(a,...) { return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01), y0=6611)

## [1] 154701953 102322974 68755654 64529166 104079527 207057513

curve(mse.plottable(a=x,y0=6611), from=0.10, to=0.20, xlab="a", ylab="MSE")
curve(mse.plottable(a=x,y0=5100), add=TRUE, col="blue")

Alternate strategy: `Vectorize()` returns a new, vectorized function:

mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01), y0=6611)

## [1] 154701953 102322974 68755654 64529166 104079527 207057513

mse.vec(a=1/8,y0=c(5000,6000,7000))

## [1] 134617132 74693733 63732256

curve(mse.vec(a=x,y0=6611), from=0.10, to=0.20, xlab="a", ylab="MSE")
curve(mse.vec(a=x,y0=5100), add=TRUE, col="blue")

- Minimize mean-squared error of regression surface
- Maximize likelihood of distribution
- Maximize output of tanks from given supplies and factories
- Maximize return of portfolio for given volatility
- Minimize cost of airline flight schedule
- Maximize reproductive fitness of an organism

Given an **objective function** \(f: \mathcal{D} \mapsto \mathbb{R}\), find \[ \DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\theta^* = \argmin_{\theta}{f(\theta)}
\]

Basics: maximizing \(f\) is minimizing \(-f\): \[ \argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)} \]

If \(h\) is strictly increasing (e.g., \(\log\)), then \[ \argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))} \]

- Approximation: How close can we get to \(\theta^*\), and/or \(f(\theta^*)\)?
- Time complexity: How many computer steps does that take? Varies with precision of approximation, niceness of \(f\), size of \(\mathcal{D}\), size of data, method…
- Most optimization algorithms use **successive approximation**, so distinguish the number of iterations from the cost of each iteration

Suppose \(x\) is one dimensional and \(f\) is smooth. If \(x^*\) is an **interior** minimum / maximum / extremum point \[ {\left. \frac{df}{dx} \right|}_{x=x^*} = 0 \]

If \(x^*\) is a minimum, \[ {\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0 \]

This all carries over to multiple dimensions: At an **interior extremum**, \[ \nabla f(\theta^*) = 0 \]

At an **interior minimum**, \[ \nabla^2 f(\theta^*) \geq 0 \] meaning that for any vector \(v\), \[ v^T \nabla^2 f(\theta^*) v \geq 0 \]

\(\nabla^2 f =\) the **Hessian**, \(\mathbf{H}\)

\(\theta^*\) might just be a **local** minimum
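We can check the second-order condition numerically. A minimal sketch using `hessian()` from the `numDeriv` package loaded earlier (the function `f2` and the point are invented for illustration): at an interior minimum, all eigenvalues of \(\mathbf{H}\) should be non-negative.

f2 <- function(x) { x[1]^2 + 3*x[2]^2 } # minimized at the origin
eigen(hessian(func=f2, x=c(0,0)))$values # both positive (approximately 6 and 2)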

\[ f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \] \[ f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0) \] Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) \] \(\nabla f(\theta_0)\) points in the direction of fastest ascent at \(\theta_0\)

- Start with initial guess for \(\theta\), step-size \(\eta\)
- While ((not too tired) and (making adequate progress))

- Find gradient \(\nabla f(\theta)\)
- Set \(\theta \leftarrow \theta - \eta \nabla f(\theta)\)

- Return final \(\theta\) as approximate \(\theta^*\)

Variations: adaptively adjust \(\eta\) to make sure of improvement, or search along the gradient direction for the minimum

Pro:

- Moves in direction of greatest immediate improvement
- If \(\eta\) is small enough, gets to a local minimum eventually, and then stops

Cons:

- "small enough" \(\eta\) can be really, really small
- Slowness or zig-zagging if components of \(\nabla f\) are very different sizes

How much work do we need?

Big-\(O\) notation: \[ h(x) = O(g(x)) \] means \[ \lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c \] for some \(c \neq 0\)

e.g., \(x^2 - 5000x + 123456778 = O(x^2)\)

e.g., \(e^{x}/(1+e^{x}) = O(1)\)

Useful to look at over-all scaling, hiding details

Also done when the limit is \(x\rightarrow 0\)

Pro:

- For nice \(f\), \(f(\theta) \leq f(\theta^*)+\epsilon\) in \(O(\epsilon^{-2})\) iterations
  - For *very* nice \(f\), only \(O(\log{\epsilon^{-1}})\) iterations
- To get \(\nabla f(\theta)\), take \(p\) derivatives, \(\therefore\) each iteration costs \(O(p)\)

Con:

- Taking derivatives can slow down as data grows — each iteration might really be \(O(np)\)

What if we do a quadratic approximation to \(f\)? \[ f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Special cases of the general idea of Taylor approximation

Simplifies if \(x_0\) is a minimum since then \(f^{\prime}(x_0) = 0\): \[ f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0) \]

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0) \]

If we know \[ f(x) = ax^2 + bx + c \] we minimize exactly: \[ \begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*} \] If \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \] then \[ x^* = x_0 - a^{-1}b \]
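For instance (coefficients invented for illustration): with \(f(x) = 3x^2 - 12x + 5\), \(a = 3\) and \(b = -12\), so \(x^* = -b/2a = 12/6 = 2\), and indeed \(f^{\prime}(2) = 6 \cdot 2 - 12 = 0\).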

Taylor-expand for the value *at the minimum* \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta) \] Take gradient, set to zero, solve for \(\theta^*\): \[ \begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*} \] Works *exactly* if \(f\) is quadratic and \(\mathbf{H}^{-1}\) exists, etc.

If \(f\) isn't quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true

The Algorithm

- Start with guess for \(\theta\)
- While ((not too tired) and (making adequate progress))

- Find gradient \(\nabla f(\theta)\) and Hessian \(\mathbf{H}(\theta)\)
- Set \(\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)\)

- Return final \(\theta\) as approximation to \(\theta^*\)

Like gradient descent, but with inverse Hessian giving the step-size

"This is about how far you can go with that gradient"

Pros:

- Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
- Also guaranteed to need \(O(\epsilon^{-2})\) steps to get within \(\epsilon\) of optimum
- Only \(O(\log\log{\epsilon^{-1}})\) for very nice functions
- Typically many fewer iterations than gradient descent

Cons:

- Hopeless if \(\mathbf{H}\) doesn't exist or isn't invertible
- Need to take \(O(p^2)\) second derivatives, *plus* \(p\) first derivatives
- Need to solve \(\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})\) for \(\theta_{\mathrm{new}}\)
- Inverting \(\mathbf{H}\) is \(O(p^3)\), but cleverness gives \(O(p^2)\) for solving for \(\theta_{\mathrm{new}}\)

Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step

Approaches:

- Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives ("Fisher scoring")
- Use only diagonal entries (\(p\) unmixed 2nd derivatives)
- Use \(\mathbf{H}(\theta)\) at initial guess, hope \(\mathbf{H}\) changes
*very*slowly with \(\theta\) - Re-compute \(\mathbf{H}(\theta)\) every \(k\) steps, \(k > 1\)
- Fast, approximate updates to the Hessian at each step (BFGS)
- Lots of other methods!
- Nelder-Mead, a.k.a. "the simplex method", which doesn't need any derivatives

Try to cage \(\theta^*\) with a **simplex** of \(p+1\) points

Order the trial points, \(f(\theta_1) \leq f(\theta_2) \ldots \leq f(\theta_{p+1})\)

\(\theta_{p+1}\) is the worst guess — try to improve it

Center of the not-worst: \(\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}\)

Try to improve the worst guess \(\theta_{p+1}\)

**Reflection**: Try \(\theta_0 - (\theta_{p+1}-\theta_0)\), across the center from \(\theta_{p+1}\)

- if it's better than \(\theta_p\) but not better than \(\theta_1\), replace the old \(\theta_{p+1}\) with it

**Expansion**: If the reflected point is the new best, try \(\theta_0 - 2(\theta_{p+1}-\theta_0)\); replace the old \(\theta_{p+1}\) with the better of the reflected and the expanded point

**Contraction**: If the reflected point is worse than \(\theta_p\), try \(\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}\); if the contracted value is better, replace \(\theta_{p+1}\) with it

**Reduction**: If all else fails, \(\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}\)

- Go back to (1) until we stop improving or run out of time

The Moves:

- Reflection: try the opposite of the worst point
- Expansion: if that really helps, try it some more
- Contraction: see if we overshot when trying the opposite
- Reduction: if all else fails, try making each point more like the best point

Pros:

- Each iteration \(\leq 4\) values of \(f\), plus sorting (and sorting is at most \(O(p\log{p})\), usually much better)
- No derivatives used, can even work for discontinuous \(f\)

Con:

- Can need *many* more iterations than gradient methods
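In practice we rarely code the moves by hand: R's built-in `optim()` uses Nelder-Mead as its default method. A quick sketch (the objective `f3` and the starting point are invented for illustration; the `abs` term makes \(f\) non-differentiable, which Nelder-Mead doesn't mind):

f3 <- function(theta) { (theta[1]-1)^2 + abs(theta[2]+2) } # kink at theta[2] = -2
fit <- optim(par=c(0,0), fn=f3) # method="Nelder-Mead" is the default
fit$par # should be close to c(1, -2)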

Gradient descent, Newton's method, simplex, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows

**Coordinate descent**: never do more than 1D optimization

- Start with initial guess \(\theta\)
- While ((not too tired) and (making adequate progress))
- For \(i \in (1:p)\)
- do 1D optimization over \(i^{\mathrm{th}}\) coordinate of \(\theta\), holding the others fixed
- Update \(i^{\mathrm{th}}\) coordinate to this optimal value

- Return final value of \(\theta\)
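A minimal sketch of this loop, under stated assumptions: `coordinate.descent` is our own invented name, `optimize()` is R's built-in 1D optimizer, and the fixed search interval `lower`/`upper` would need to be chosen to suit the problem.

coordinate.descent <- function(f, x, max.sweeps, lower=-10, upper=10) {
  for (sweep in 1:max.sweeps) {
    for (i in 1:length(x)) {
      f1d <- function(z) { x[i] <- z; f(x) } # vary coordinate i, others fixed
      x[i] <- optimize(f1d, lower=lower, upper=upper)$minimum
    }
  }
  return(x)
}

coordinate.descent(function(x) { (x[1]-1)^2 + (x[2]+2)^2 }, x=c(0,0),
                   max.sweeps=5) # should be close to c(1, -2)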

Cons:

- Needs a good 1D optimizer
- Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

- Can be extremely fast and simple

We have data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), and possible curves \(r(x;\theta)\), e.g.

- \(r(x) = x \cdot \theta\)
- \(r(x) = \theta_1 x^{\theta_2}\)
- \(r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}\) for fixed "basis" functions \(b_j\)

Least-squares curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}} \]

"Robust" curve fitting: \[ \hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}} \]

- In R, functions are objects, and can be arguments to other functions
- Functions can also be returned by other functions
- Optimization
- Trade-offs: complexity of each iteration vs. number of iterations vs. precision of approximation
- Gradient descent: less complex iterations, more guarantees, less adaptive
- Newton: more complex iterations, but few of them for good functions, more adaptive, less robust
- Next time: pre-built code like `optim` and `nls`