## Agenda

• Functions are objects: can be arguments for or returned by other functions
• Example: curve()
• Curve-fitting by optimizing

## Functions as Objects

• In R, functions are objects, just like everything else
• This means that they can be passed to functions as arguments and returned by functions as outputs as well
• We often want to do very similar things to many different functions
• The procedure is the same, only the function we're working with changes
• Write one function to do the job, and pass the function as an argument
• Because R treats a function like any other object, we can do this simply: inside the body, invoke the function by its argument name — as in the sketch below
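A minimal sketch of the idea; the names apply.twice and f here are ours, purely for illustration:

apply.twice <- function(f, x) { f(f(x)) }   # f is whatever function the caller passes in
apply.twice(sqrt, 16)           # sqrt(sqrt(16)), i.e. 2
apply.twice(log, exp(exp(1)))   # log(log(e^e)), i.e. 1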

## Functions as Objects

• We have already seen examples
• apply(), sapply(), etc.: Take this function and use it on all of these objects
• nlm(): Take this function and try to make it small, starting from here
• ks.test(): Compare these data to this cumulative distribution function
• curve(): Evaluate this function over that range, and plot the results

• Typing a function's name, without parentheses, in the terminal gives you its source code
• Functions are their own class in R
class(sin)
## [1] "function"
class(sample)
## [1] "function"
resample <- function(x) { sample(x, size=length(x), replace=TRUE) }
class(resample)
## [1] "function"

• Functions can be put into lists or even arrays
• A call to function() returns a function object, which has three parts (illustrated below)
• the body to be executed: access it with body(foo)
• the arguments required: access them with formals(foo)
• the parent environment: access it with environment(foo)
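For example, with the resample() function defined above (plus a quick check that functions really can live in lists):

funs <- list(sin, sample, resample)   # a list holding three functions
sapply(funs, class)                   # all "function"
body(resample)          # the (unevaluated) code to be run
formals(resample)       # the argument list: here just x
environment(resample)   # where resample was defined: the global environment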

R has separate types for built-in functions and for those written in R:

typeof(resample)
## [1] "closure"
typeof(sample)
## [1] "closure"
typeof(sin)
## [1] "builtin"

Why "closure" for functions written in R? Because their expressions are "closed" by referring to the parent environment for any names that aren't arguments or local variables

There's also a second type of built-in function, "special"; both kinds are primitive functions, implemented directly in C
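base R's is.primitive() checks for this; a small illustration:

is.primitive(sin)      # TRUE: implemented in C, no R-level body
is.primitive(sample)   # FALSE: sample is an ordinary closure
body(sin)              # NULL for primitive functions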

## Anonymous Functions

• function() returns an object of class function
• So far we've assigned that object to a name
• If we don't have an assignment, we get an anonymous function
• Usually part of some larger expression:
sapply((-2):2,function(log.ratio){exp(log.ratio)/(1+exp(log.ratio))})
## [1] 0.1192029 0.2689414 0.5000000 0.7310586 0.8807971

## Anonymous Functions

• Often handy when connecting other pieces of code
• especially in things like apply and sapply
• Don't clutter the workspace
• Can't be examined or re-used later

## Gradients

• Many problems in statistics come down to optimization
• So do lots of problems in economics, physics, CS, biology, …
• Lots of optimization problems require the gradient of the objective function
• Gradient of $$f$$ at $$x$$: $\nabla f(x) = \left[\left.\frac{\partial f}{\partial x_1}\right|_{x} \ldots \left.\frac{\partial f}{\partial x_p}\right|_{x}\right]$

• We do the same thing to get the gradient of $$f$$ at $$x$$ no matter what $$f$$ is
• Find the partial derivative of $$f$$ with respect to each component of $$x$$ and return the vector of partial derivatives
• It makes no sense to re-write this every time we change $$f$$!
• Write code to calculate the gradient of an arbitrary function
• We could write our own, but there are lots of tricky issues
• Best way to calculate partial derivative
• What if $$x$$ is at the edge of the domain of $$f$$?
• Fortunately, someone has already done this

From the package numDeriv

grad(func, x, ...)
• Assumes func is a function which returns a single floating-point value
• Assumes x is a vector of arguments to func
• If x is a vector and func(x) is also a vector, then it's assumed func is vectorized and we get a vector of derivatives
• Extra arguments in ... get passed along to func
• Other functions in the package for the Jacobian of a vector-valued function, and the matrix of 2nd partials (Hessian)

require("numDeriv")
## Loading required package: numDeriv
just_a_phase <- runif(n=1,min=-pi,max=pi)
all.equal(grad(func=cos,x=just_a_phase),-sin(just_a_phase))
## [1] TRUE
phases <- runif(n=10,min=-pi,max=pi)
all.equal(grad(func=cos,x=phases),-sin(phases))
## [1] TRUE
grad(func=function(x){x[1]^2+x[2]^3}, x=c(1,-1))
## [1] 2 3

Note: grad is perfectly happy with func being an anonymous function!
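For instance, a quick check on a made-up example (since the derivative of $$\sin^2{x}$$ is $$\sin{2x}$$):

all.equal(grad(func=function(x){sin(x)^2}, x=pi/4), sin(pi/2))
# expected: TRUE, up to numerical error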

Now we can use this as a piece of a larger machine:

gradient.descent <- function(f,x,max.iterations,step.scale,
                             stopping.deriv,...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f,x,...)        # numerical gradient, from numDeriv
    if(all(abs(gradient) < stopping.deriv)) { break() }
    x <- x - step.scale*gradient     # take a step downhill
  }
  fit <- list(argmin=x, final.gradient=gradient, final.value=f(x,...),
              iterations=iteration)
  return(fit)
}

Works equally well whether f is mean squared error of a regression, $$\psi$$ error of a regression, (negative log) likelihood, cost of a production plan, …
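A minimal sanity check on a simple quadratic (the function quad and the particular settings here are made up for illustration):

quad <- function(x) { sum((x - c(1,3))^2) }
fit <- gradient.descent(f=quad, x=c(0,0), max.iterations=1e4,
                        step.scale=0.1, stopping.deriv=1e-6)
fit$argmin   # should be close to c(1, 3)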

## Cautions

• Scoping: f takes values for all names which aren't its arguments from the environment where it was defined, not the one where it is called (e.g., not from inside grad or gradient.descent) — see the sketch below
• Debugging: If f and g are both complicated, avoid debugging g(f) as a block; divide the work by writing very simple f.dummy to debug/test g, and debug/test the real f separately
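A tiny illustration of the scoping point; all names here are made up:

a <- 2
f <- function(x) { a * x^2 }             # a is not an argument of f
g <- function(fun) { a <- 100; fun(1) }  # a local a inside g
g(f)   # returns 2, not 100: f finds a where f was defined, not where it is called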

## Returning Functions

• Functions can be return values like anything else
• Create a linear predictor, based on sample values of two variables
make.linear.predictor <- function(x,y) {
  linear.fit <- lm(y~x)
  predictor <- function(x) {
    return(predict(object=linear.fit,newdata=data.frame(x=x)))
  }
  return(predictor)
}
• The predictor function persists and works, even when the data we used to create it is gone

## Returning Functions

library(MASS); data(cats)
vet_predictor <- make.linear.predictor(x=cats$Bwt,y=cats$Hwt)
rm(cats)            # Data set goes away
vet_predictor(3.5)  # My cat's body mass in kilograms
##        1
## 13.76256

## Example: curve()

• A call to curve looks like this:
curve(expr, from = a, to = b, ...)
• expr is some expression involving a variable called x which is swept from the value a to the value b
• ... are other plot-control arguments
• curve feeds the expression a vector x and expects a numeric vector back (so the call below is fine)

curve(x^2 * sin(x))

## Example: curve()

We can use our own function in curve:

psi <- function(x,c=1) {ifelse(abs(x)>c,2*c*abs(x)-c^2,x^2)}
curve(psi(x,c=10),from=-20,to=20)

Or this, holding psi's first argument fixed and sweeping c instead:

curve(psi(x=10,c=x),from=-20,to=20)

## Example: curve()

If our function doesn't take vectors to vectors, curve becomes unhappy

mse <- function(y0,a,Y=gmp$pcgmp,N=gmp$pop) {
mean((Y - y0*(N^a))^2)
}
> curve(mse(a=x,y0=6611),from=0.10,to=0.15)
Error in curve(mse(a = x, y0 = 6611), from = 0.1, to = 0.15) :
'expr' did not evaluate to an object of length 'n'
In N^a : longer object length is not a multiple of shorter object length

How do we solve this?

## Example: curve()

Define a new, vectorized function, say with sapply:

sapply(seq(from=0.10,to=0.15,by=0.01),mse,y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse(6611,0.10)
## [1] 154701953
mse.plottable <- function(a,...){ return(sapply(a,mse,...)) }
mse.plottable(seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513

## Example: curve()

curve(mse.plottable(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.plottable(a=x,y0=5100),add=TRUE,col="blue")

## Example: curve()

Alternate strategy: Vectorize() returns a new, vectorized function

mse.vec <- Vectorize(mse, vectorize.args=c("y0","a"))
mse.vec(a=seq(from=0.10,to=0.15,by=0.01),y0=6611)
## [1] 154701953 102322974  68755654  64529166 104079527 207057513
mse.vec(a=1/8,y0=c(5000,6000,7000))
## [1] 134617132  74693733  63732256

## Example: curve()

curve(mse.vec(a=x,y0=6611),from=0.10,to=0.20,xlab="a",ylab="MSE")
curve(mse.vec(a=x,y0=5100),add=TRUE,col="blue")

## Optimization Problems

Given an objective function $$f: \mathcal{D} \mapsto \mathbb{R}$$, find $\DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \theta^* = \argmin_{\theta}{f(\theta)}$

Basics: maximizing $$f$$ is minimizing $$-f$$: $\argmax_{\theta}{f(\theta)}= \argmin_{\theta}{-f(\theta)}$

If $$h$$ is strictly increasing (e.g., $$\log$$), then $\argmin_{\theta}{f(\theta)} = \argmin_{\theta}{h(f(\theta))}$

## Considerations

• Approximation: How close can we get to $$\theta^*$$, and/or $$f(\theta^*)$$?
• Time complexity: How many computer steps does that take? Varies with precision of approximation, niceness of $$f$$, size of $$\mathcal{D}$$, size of data, method…
• Most optimization algorithms use successive approximation, so distinguish number of iterations from cost of each iteration

## Use calculus

Suppose $$x$$ is one-dimensional and $$f$$ is smooth. If $$x^*$$ is an interior minimum / maximum / extremum point, then ${\left. \frac{df}{dx} \right|}_{x=x^*} = 0$

If $$x^*$$ is a minimum, ${\left. \frac{d^2f}{dx^2}\right|}_{x=x^*} > 0$

## Use calculus

This all carries over to multiple dimensions: At an interior extremum, $\nabla f(\theta^*) = 0$

At an interior minimum, $\nabla^2 f(\theta^*) \geq 0$, meaning that for any vector $$v$$, $v^T \nabla^2 f(\theta^*) v \geq 0$; here $$\nabla^2 f$$ is the Hessian, $$\mathbf{H}$$. Note that $$\theta^*$$ might only be a local minimum.

## Gradient Descent

$f^{\prime}(x_0) = {\left. \frac{df}{dx}\right|}_{x=x_0} = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}}$ $f(x) \approx f(x_0) +(x-x_0)f^{\prime}(x_0)$

Locally, the function looks linear; to minimize a linear function, move down the slope

Multivariate version: $f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0)$ $$\nabla f(\theta_0)$$ points in the direction of fastest ascent at $$\theta_0$$

1. Start with initial guess for $$\theta$$, step-size $$\eta$$
2. While ((not too tired) and (making adequate progress))
• Find gradient $$\nabla f(\theta)$$
• Set $$\theta \leftarrow \theta - \eta \nabla f(\theta)$$
3. Return final $$\theta$$ as approximate $$\theta^*$$

Variations: adaptively adjust $$\eta$$ to make sure of improvement, or search along the gradient direction for the minimum

Pro:

• Moves in direction of greatest immediate improvement
• If $$\eta$$ is small enough, gets to a local minimum eventually, and then stops

Cons:

• "small enough" $$\eta$$ can be really, really small
• Slowness or zig-zagging if components of $$\nabla f$$ are very different sizes

How much work do we need?

## Scaling

Big-$$O$$ notation: $h(x) = O(g(x))$ means $\lim_{x\rightarrow\infty}{\frac{h(x)}{g(x)}} = c$ for some $$c \neq 0$$

e.g., $$x^2 - 5000x + 123456778 = O(x^2)$$

e.g., $$e^{x}/(1+e^{x}) = O(1)$$

Useful to look at over-all scaling, hiding details

Also done when the limit is $$x\rightarrow 0$$

Pro (for gradient descent):

• For nice $$f$$, $$f(\theta) \leq f(\theta^*)+\epsilon$$ in $$O(\epsilon^{-2})$$ iterations
• For very nice $$f$$, only $$O(\log{\epsilon^{-1}})$$ iterations
• To get $$\nabla f(\theta)$$, take $$p$$ derivatives, $$\therefore$$ each iteration costs $$O(p)$$

Con:

• Taking derivatives can slow down as data grows — each iteration might really be $$O(np)$$

## Taylor Series

What if we do a quadratic approximation to $$f$$? $f(x) \approx f(x_0) + (x-x_0)f^{\prime}(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

Special cases of general idea of Taylor approximation

Simplifies if $$x_0$$ is a minimum since then $$f^{\prime}(x_0) = 0$$: $f(x) \approx f(x_0) + \frac{1}{2}(x-x_0)^2 f^{\prime\prime}(x_0)$

Near a minimum, smooth functions look like parabolas

Carries over to the multivariate case: $f(\theta) \approx f(\theta_0) + (\theta-\theta_0) \cdot \nabla f(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \mathbf{H}(\theta_0) (\theta-\theta_0)$

## Minimizing a Quadratic

If we know $f(x) = ax^2 + bx + c$ we can minimize exactly: $\begin{eqnarray*} 2ax^* + b & = & 0\\ x^* & = & \frac{-b}{2a} \end{eqnarray*}$

If $f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c$ then $x^* = x_0 - a^{-1}b$
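A quick numerical check of $$x^* = -b/2a$$, using made-up coefficients and R's built-in one-dimensional optimizer optimize():

a <- 2; b <- -3; c0 <- 1   # c0 rather than c, to avoid masking the c() function
optimize(function(x) { a*x^2 + b*x + c0 }, lower=-10, upper=10)$minimum
# should be close to -b/(2*a) = 0.75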

## Newton's Method

Taylor-expand for the value at the minimum $$\theta^*$$: $f(\theta^*) \approx f(\theta) + (\theta^*-\theta) \cdot \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^T \mathbf{H}(\theta) (\theta^*-\theta)$

Take the gradient with respect to $$\theta^*$$, set it to zero, and solve: $\begin{eqnarray*} 0 & = & \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* & = & \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \end{eqnarray*}$

Works exactly if $$f$$ is quadratic and $$\mathbf{H}^{-1}$$ exists, etc.

If $$f$$ isn't quadratic, keep pretending it is until we get close to $$\theta^*$$, when it will be nearly true

## Newton's Method

The Algorithm

1. Start with guess for $$\theta$$
2. While ((not too tired) and (making adequate progress))
• Find gradient $$\nabla f(\theta)$$ and Hessian $$\mathbf{H}(\theta)$$
• Set $$\theta \leftarrow \theta - \mathbf{H}(\theta)^{-1} \nabla f(\theta)$$
3. Return final $$\theta$$ as approximation to $$\theta^*$$

Like gradient descent, but with inverse Hessian giving the step-size
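A minimal sketch in R, leaning on numDeriv's grad() and hessian() for the derivatives (the name newton.method and its defaults are ours, for illustration only):

newton.method <- function(f, x, max.iterations=100, stopping.deriv=1e-8, ...) {
  for (iteration in 1:max.iterations) {
    gradient <- grad(f, x, ...)
    if (all(abs(gradient) < stopping.deriv)) { break() }
    # solve H(theta) %*% step = gradient rather than explicitly inverting H
    x <- x - solve(hessian(f, x, ...), gradient)
  }
  return(list(argmin=x, final.gradient=gradient, iterations=iteration))
}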

## Newton's Method

Pros:

• Step-sizes chosen adaptively through 2nd derivatives, much harder to get zig-zagging, over-shooting, etc.
• Also guaranteed to need $$O(\epsilon^{-2})$$ steps to get within $$\epsilon$$ of optimum
• Only $$O(\log\log{\epsilon^{-1}})$$ for very nice functions
• Typically many fewer iterations than gradient descent

## Newton's Method

Cons:

• Hopeless if $$\mathbf{H}$$ doesn't exist or isn't invertible
• Need to take $$O(p^2)$$ second derivatives plus $$p$$ first derivatives
• Need to solve $$\mathbf{H} \theta_{\mathrm{new}} = \mathbf{H} \theta_{\mathrm{old}} - \nabla f(\theta_{\mathrm{old}})$$ for $$\theta_{\mathrm{new}}$$
• Inverting $$\mathbf{H}$$ is $$O(p^3)$$, but cleverness gives $$O(p^2)$$ for solving for $$\theta_{\mathrm{new}}$$

## Getting Around the Hessian

Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step

Approaches:

• Use knowledge of the system to get some approximation to the Hessian, use that instead of taking derivatives ("Fisher scoring")
• Use only diagonal entries ($$p$$ unmixed 2nd derivatives)
• Use $$\mathbf{H}(\theta)$$ at initial guess, hope $$\mathbf{H}$$ changes very slowly with $$\theta$$
• Re-compute $$\mathbf{H}(\theta)$$ every $$k$$ steps, $$k > 1$$
• Fast, approximate updates to the Hessian at each step (BFGS)
• Lots of other methods!
• Nelder-Mead, a.k.a. "the simplex method", which doesn't need any derivatives

## Nelder-Mead

Try to cage $$\theta^*$$ with a simplex of $$p+1$$ points

Order the trial points, $$f(\theta_1) \leq f(\theta_2) \leq \ldots \leq f(\theta_{p+1})$$

$$\theta_{p+1}$$ is the worst guess — try to improve it

Center of the not-worst points: $$\theta_0 = \frac{1}{p}\sum_{i=1}^{p}{\theta_i}$$

To improve the worst guess $$\theta_{p+1}$$:

1. Reflection: Try $$\theta_0 - (\theta_{p+1}-\theta_0)$$, across the center from $$\theta_{p+1}$$
• if it's better than $$\theta_p$$ but not better than $$\theta_1$$, replace the old $$\theta_{p+1}$$ with it
2. Expansion: if the reflected point is the new best, try $$\theta_0 - 2(\theta_{p+1}-\theta_0)$$; replace the old $$\theta_{p+1}$$ with the better of the reflected and the expanded point
3. Contraction: If the reflected point is worse than $$\theta_p$$, try $$\theta_0 + \frac{\theta_{p+1}-\theta_0}{2}$$; if the contracted point is better, replace $$\theta_{p+1}$$ with it
4. Reduction: If all else fails, $$\theta_i \leftarrow \frac{\theta_1 + \theta_i}{2}$$
5. Go back to (1) until we stop improving or run out of time

The Moves:

• Reflection: try the opposite of the worst point
• Expansion: if that really helps, try it some more
• Contraction: see if we overshot when trying the opposite
• Reduction: if all else fails, try making each point more like the best point

Pros:

• Each iteration $$\leq 4$$ values of $$f$$, plus sorting (and sorting is at most $$O(p\log{p})$$, usually much better)
• No derivatives used; can even work for discontinuous $$f$$

Con:

• Can need many more iterations than gradient methods
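In R, optim() uses Nelder-Mead as its default method; a tiny example on a made-up objective:

fit <- optim(par=c(0,0), fn=function(theta) { (theta[1]-1)^2 + (theta[2]+2)^2 })
fit$par      # should be close to c(1, -2)
fit$counts   # how many times fn was evaluated (no gradient calls needed)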

## Coordinate Descent

Gradient descent, Newton's method, simplex, etc., adjust all coordinates of $$\theta$$ at once — gets harder as the number of dimensions $$p$$ grows

Coordinate descent: never do more than 1D optimization

• Start with initial guess $$\theta$$
• While ((not too tired) and (making adequate progress))
• For $$i \in (1:p)$$
• do 1D optimization over $$i^{\mathrm{th}}$$ coordinate of $$\theta$$, holding the others fixed
• Update $$i^{\mathrm{th}}$$ coordinate to this optimal value
• Return final value of $$\theta$$
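A minimal sketch of this loop, using optimize() as the one-dimensional optimizer; the function name and the fixed search interval are our own simplifications:

coordinate.descent <- function(f, x, sweeps=20, lower=-10, upper=10, ...) {
  for (sweep in 1:sweeps) {
    for (i in 1:length(x)) {
      f.1d <- function(z) { x[i] <- z; f(x, ...) }   # vary coordinate i only
      x[i] <- optimize(f.1d, lower=lower, upper=upper)$minimum
    }
  }
  return(x)
}
coordinate.descent(function(x) { (x[1]-1)^2 + (x[2]+2)^2 + x[1]*x[2] }, x=c(0,0))
# should be close to the exact minimizer, c(8/3, -10/3)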

## Coordinate Descent

Cons:

• Needs a good 1D optimizer
• Can bog down for very tricky functions, especially with lots of interactions among variables

Pros:

• Can be extremely fast and simple

## Curve-Fitting by Optimizing

We have data $$(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$$, and possible curves, $$r(x;\theta)$$ e.g.

• $$r(x) = x \cdot \theta$$
• $$r(x) = \theta_1 x^{\theta_2}$$
• $$r(x) = \sum_{j=1}^{q}{\theta_j b_j(x)}$$ for fixed "basis" functions $$b_j$$

## Curve-Fitting by Optimizing

Least-squares curve fitting: $\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^n{(y_i - r(x_i;\theta))^2}}$
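A hedged sketch of least-squares fitting for $$r(x;\theta) = \theta_1 x^{\theta_2}$$, using simulated data and optim(); the data and starting values here are entirely made up:

mse.curve <- function(theta, x, y) { mean((y - theta[1]*x^theta[2])^2) }
x <- runif(100, min=1, max=10)
y <- 2*x^0.5 + rnorm(100, sd=0.1)
optim(par=c(1,1), fn=mse.curve, x=x, y=y)$par
# should be close to c(2, 0.5)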

"Robust" curve fitting: $\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{\psi(y_i - r(x_i;\theta))}}$

## Summary

• In R, functions are objects, and can be arguments to other functions
• Functions can also be returned by other functions
• Optimization
• Trade-offs: complexity of each iteration vs. number of iterations vs. precision of approximation
• Next time: pre-built code like optim() and nls()