## Agenda

• Optimization with optim() and nls()
• Optimization under constraints
• Lagrange multipliers
• Penalized optimization
• Statistical uses of penalized optimization

## Optimization in R: optim()

optim(par, fn, gr, method, control, hessian)

• fn: function to be minimized; mandatory
• par: initial parameter guess; mandatory
• gr: gradient function; only needed for some methods
• method: defaults to a gradient-free method ("Nelder-Mead"); could be BFGS (Newton-ish)
• control: optional list of control settings
• (maximum iterations, scaling, tolerance for convergence, etc.)
• hessian: should the final Hessian be returned? default FALSE

Return contains the location ($par) and the value ($value) of the optimum, diagnostics, possibly $hessian

## Optimization in R: optim()

gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)
mse <- function(theta) {
mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2)
}
grad.mse <- function(theta) { grad(func=mse, x=theta) }  # numerical gradient via numDeriv
theta0 <- c(5000, 0.15)
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)

## Optimization in R: optim()

fit1: Newton-ish BFGS method

fit1[1:3]
## $par
## [1] 6493.2563739    0.1276921
##
## $value
## [1] 61853983
##
## $counts
## function gradient
##       63       11

## Optimization in R: optim()

fit1: Newton-ish BFGS method

fit1[4:6]
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
##              [,1]         [,2]
## [1,]      52.5021      4422071
## [2,] 4422071.3594 375729087979

## Optimization in R: nls()

• optim is a general-purpose optimizer
• nlm is another general-purpose optimizer; try them both if one doesn't work (a minimal nlm sketch follows below)
• nls is specialized for nonlinear least squares
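
For instance, a minimal nlm sketch (assuming the gmp data and the mse function from the earlier optim() slides are already in the workspace; not part of the original example):

fit1.nlm <- nlm(mse, c(5000, 0.15))   # objective first, starting values second
fit1.nlm$estimate                     # location of the minimum
fit1.nlm$minimum                      # value of the objective at the minimum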

## Optimization in R: nls()

nls(formula, data, start, control, [[many other options]])
• formula: Mathematical expression with response variable, predictor variable(s), and unknown parameter(s)
• data: Data frame with variable names matching formula
• start: Guess at parameters (optional)
• control: Like with optim (optional)

Returns an nls object, with fitted values, prediction methods, etc. The default optimization algorithm is Gauss-Newton, a Newton-style method specialized for least squares.

## Optimization in R: nls()

fit2: Fitting the Same Model with nls()

fit2 <- nls(pcgmp~y0*pop^a,data=gmp,start=list(y0=5000,a=0.1))
summary(fit2) 
##
## Formula: pcgmp ~ y0 * pop^a
##
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)
## y0 6.494e+03  8.565e+02   7.582 2.87e-13 ***
## a  1.277e-01  1.012e-02  12.612  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7886 on 364 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 1.819e-07

## Optimization in R: nls()

fit2: Fitting the Same Model with nls()

plot(pcgmp~pop,data=gmp)
pop.order <- order(gmp$pop)
lines(gmp$pop[pop.order],fitted(fit2)[pop.order])
curve(fit1$par[1]*x^fit1$par[2],add=TRUE,lty="dashed",col="blue") 

## Example: Multinomial

Roll dice $$n$$ times; $$n_1, \ldots n_6$$ count the outcomes

Likelihood and log-likelihood:

$\begin{eqnarray*} L(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\theta_6) & = & \frac{n!}{n_1! n_2! n_3! n_4! n_5! n_6!}\prod_{i=1}^{6}{\theta_i^{n_i}}\\ \ell(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\theta_6) & = & \log{\frac{n!}{n_1! n_2! n_3! n_4! n_5! n_6!}} + \sum_{i=1}^{6}{n_i\log{\theta_i}} \end{eqnarray*}$

Optimize by taking the derivative and setting to zero:

$\begin{eqnarray*} \frac{\partial \ell}{\partial \theta_1} & = & \frac{n_1}{\theta_1} = 0\\ \therefore \theta_1 & =& \infty \end{eqnarray*}$

## Example: Multinomial

We forgot that $$\sum_{i=1}^{6}{\theta_i}=1$$

We could use the constraint to eliminate one of the variables $\theta_6 = 1 - \sum_{i=1}^{5}{\theta_i}$

Then solve the equations $\frac{\partial \ell}{\partial \theta_i} = \frac{n_i}{\theta_i} -\frac{n_6}{1-\sum_{j=1}^{5}{\theta_j}} = 0$

BUT eliminating a variable with the constraint is usually messy

## Lagrange multipliers

$g(\theta) = c ~ \Leftrightarrow ~ g(\theta)-c=0$

Lagrangian:

$\mathcal{L}(\theta,\lambda) = f(\theta) - \lambda(g(\theta)-c)$

$$= f$$ when the constraint is satisfied

Now do unconstrained minimization over $$\theta$$ and $$\lambda$$:

$\begin{eqnarray*} {\left.\nabla_{\theta}\mathcal{L}\right|}_{\theta^*,\lambda^*} & = & \nabla f(\theta^*) - \lambda^*\nabla g(\theta^*) =0\\ {\left. \frac{\partial \mathcal{L}}{\partial \lambda}\right|}_{\theta^*,\lambda^*} & = & g(\theta^*) - c = 0 \end{eqnarray*}$

Optimizing over the Lagrange multiplier $$\lambda$$ enforces the constraint

More constraints, more multipliers

## Lagrange multipliers

Try the dice again:

$\begin{eqnarray*} \mathcal{L} & = & \log{\frac{n!}{\prod_i{n_i!}}} + \sum_{i=1}^{6}{n_i\log{(\theta_i)}} - \lambda\left(\sum_{i=1}^{6}{\theta_i} - 1\right)\\ {\left.\frac{\partial\mathcal{L}}{\partial \theta_i}\right|}_{\theta_i=\theta^*_i} & = & \frac{n_i}{\theta^*_i} - \lambda^* = 0\\ \frac{n_i}{\lambda^*} & = & \theta^*_i\\ \sum_{i=1}^{6}{\frac{n_i}{\lambda^*}} & = & \sum_{i=1}^{6}{\theta^*_i} = 1\\ \lambda^* & =& \sum_{i=1}^{6}{n_i} ~ \Rightarrow \theta^*_i = \frac{n_i}{\sum_{i=1}^{6}{n_i}} \end{eqnarray*}$
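
A quick numerical check of this result (made-up counts, not from the slides): eliminate $$\theta_6$$ with the constraint, minimize the negative log-likelihood with optim, and compare with the analytic answer $$n_i / n$$.

n.i <- c(18, 22, 16, 25, 19, 20)               # hypothetical counts from n = 120 rolls
neg.loglik <- function(theta5) {               # work with theta_1..theta_5 only
  theta <- c(theta5, 1 - sum(theta5))          # theta_6 fixed by the constraint
  if (any(theta <= 0)) return(Inf)             # stay inside the simplex
  -sum(n.i * log(theta))
}
fit <- optim(rep(1/6, 5), neg.loglik)
rbind(numerical = c(fit$par, 1 - sum(fit$par)),
      analytic  = n.i/sum(n.i))                # the two rows should (nearly) agree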

## Lagrange multipliers

Constrained minimum value is generally higher than the unconstrained

Changing the constraint level $$c$$ changes $$\theta^*$$, $$f(\theta^*)$$

$\begin{eqnarray*} \frac{\partial f(\theta^*)}{\partial c} & = & \frac{\partial \mathcal{L}(\theta^*,\lambda^*)}{\partial c}\\ & = & \left[\nabla f(\theta^*)-\lambda^*\nabla g(\theta^*)\right]\frac{\partial \theta^*}{\partial c} - \left[g(\theta^*)-c\right]\frac{\partial \lambda^*}{\partial c} + \lambda^* = \lambda^* \end{eqnarray*}$

$$\lambda^* =$$ Rate of change in optimal value as the constraint is relaxed

$$\lambda^* =$$ "Shadow price": how much would you pay for a marginal change in the level of the constraint

## Inequality Constraints

$h(\theta) \leq d ~ \Leftrightarrow ~ h(\theta) - d \leq 0$

The region where the constraint is satisfied is the feasible set

Roughly two cases:

1. Unconstrained optimum is inside the feasible set $$\Rightarrow$$ constraint is inactive
2. Optimum is outside feasible set; constraint is active, binds or bites; constrained optimum is usually on the boundary

Add a Lagrange multiplier; $$\lambda \neq 0$$ $$\Leftrightarrow$$ constraint binds

## Mathematical Programming

Older than computer programming…

Optimize $$f(\theta)$$ subject to $$g(\theta) = c$$ and $$h(\theta) \leq d$$

"Give us the best deal on $$f$$, keeping in mind that we've only got $$d$$ to spend, and the books have to balance"

Linear programming

1. $$f$$, $$h$$ both linear in $$\theta$$
2. $$\theta^*$$ always at a corner of the feasible set

## Example: Factory

Revenue: 13k per car, 27k per truck

Constraints:

$\begin{eqnarray*} 40 \cdot \mathrm{cars} + 60 \cdot \mathrm{trucks} & \leq & 1600~\mathrm{hours} \\ 1 \cdot \mathrm{cars} + 3 \cdot \mathrm{trucks} & \leq & 70~\mathrm{tons} \end{eqnarray*}$

Find the revenue-maximizing number of cars and trucks to produce
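
As an aside, this LP can also be solved directly with the lpSolve package (a sketch assuming lpSolve is installed; the lecture itself uses constrOptim on a later slide):

library(lpSolve)
factory.lp <- lp(direction="max",
                 objective.in=c(13, 27),           # revenue per car, per truck
                 const.mat=rbind(c(40, 60),        # labor hours per car, per truck
                                 c(1, 3)),         # tons of steel per car, per truck
                 const.dir=c("<=", "<="),
                 const.rhs=c(1600, 70))
factory.lp$solution                                # revenue-maximizing plan, at a corner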

## Example: Factory

The feasible region:

## Example: Factory

The feasible region, plus lines of equal profit

## More Complex Financial Problem

Given: expected returns $$r_1, \ldots r_p$$ among $$p$$ financial assets, their $$p\times p$$ matrix of variances and covariances $$\Sigma$$

Find: the portfolio shares $$\theta_1, \ldots \theta_p$$ which maximize expected returns

Such that: total variance is below some limit, covariances with specific other stocks or portfolios are below some limit
e.g., pension fund should not be too correlated with parent company

Expected returns $$f(\theta) = r\cdot\theta$$

Constraints: $$\sum_{i=1}^{p}{\theta_i}=1$$, $$\theta_i \geq 0$$ (unless you can short)
Covariance constraints are linear in $$\theta$$
Variance constraint is quadratic, over-all variance is $$\theta^T \Sigma \theta$$

## Barrier Methods

(a.k.a. "interior point", "central path", etc.)

Having constraints switch on and off abruptly is annoying, especially with gradient methods

Fix $$\mu > 0$$ and try minimizing $f(\theta) - \mu\log{\left(d-h(\theta)\right)}$ The log term "pushes away" from the barrier, more and more weakly as $$\mu \rightarrow 0$$

## Barrier Methods

1. Start with an initial $$\theta$$ in the feasible set and an initial $$\mu$$
2. While ((not too tired) and (making adequate progress)):
    1. Minimize $$f(\theta) - \mu\log{\left(d-h(\theta)\right)}$$
    2. Reduce $$\mu$$
3. Return final $$\theta$$
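
A minimal hand-rolled version of this loop, applied to the factory problem from earlier (illustration only; the next slide uses the built-in constrOptim instead):

neg.revenue <- function(x) { -(13*x[1] + 27*x[2]) }       # minimize the negative revenue
slack <- function(x) { c(1600 - 40*x[1] - 60*x[2],        # labor hours left over
                         70 - x[1] - 3*x[2]) }            # tons of steel left over
barrier.obj <- function(x, mu) {
  s <- slack(x)
  if (any(s <= 0)) return(Inf)                            # outside the feasible set
  neg.revenue(x) - mu*sum(log(s))                         # objective plus log-barrier
}
theta <- c(5, 5); mu <- 10                                # feasible start, initial barrier weight
for (i in 1:30) {
  theta <- optim(theta, barrier.obj, mu=mu)$par           # inner minimization
  mu <- mu/2                                              # weaken the barrier
}
theta                                                     # approaches the corner c(10, 20)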

## R implementation

constrOptim implements the barrier method

Try this:

factory <- matrix(c(40,1,60,3),nrow=2,
  dimnames=list(c("labor","steel"),c("car","truck")))
available <- c(1600,70); names(available) <- rownames(factory)
prices <- c(car=13,truck=27)
revenue <- function(output) { return(-output %*% prices) }  # negate: constrOptim minimizes
plan <- constrOptim(theta=c(5,5),f=revenue,grad=NULL,
  ui=-factory,ci=-available)
plan$par
## [1]  9.999896 20.000035

constrOptim only handles constraints of the form $$\mathbf{u}\theta \geq c$$, hence the minus signs

## Constraints vs. Penalties

$\DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\argmin}{argmin} \argmin_{\theta : h(\theta) \leq d}{f(\theta)} ~ \Leftrightarrow \argmin_{\theta,\lambda}{f(\theta) - \lambda(h(\theta)-d)}$

$$d$$ doesn't matter for doing the second minimization over $$\theta$$

We could just as well minimize

$f(\theta) - \lambda h(\theta)$

| Constrained optimization | Penalized optimization     |
|--------------------------|----------------------------|
| Constraint level $$d$$   | Penalty factor $$\lambda$$ |

## Statistical applications of penalization

Mostly you've seen unpenalized estimates (least squares, maximum likelihood)

Lots of modern advanced methods rely on penalties

• For when the direct estimate is too unstable
• For handling high-dimensional cases
• For handling non-parametrics

## Ordinary least squares

No penalization; minimize MSE of linear function $$\beta \cdot x$$:

$\hat{\beta} = \argmin_{\beta}{\frac{1}{n}\sum_{i=1}^{n}{(y_i - \beta\cdot x_i)^2}} = \argmin_{\beta}{MSE(\beta)}$

Closed-form solution if we can invert matrices:

$\hat{\beta} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}$

where $$\mathbf{x}$$ is the $$n\times p$$ matrix of $$x$$ vectors, and $$\mathbf{y}$$ is the $$n\times 1$$ matrix of $$y$$ values.
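
A quick sketch checking this closed form against lm() on simulated data (the data and coefficients here are made up for illustration):

set.seed(1)
x <- matrix(rnorm(200), nrow=100)              # 100 observations, 2 predictors
y <- x %*% c(2, 1) + rnorm(100, sd=0.5)        # linear truth plus noise
beta.hat <- solve(t(x) %*% x, t(x) %*% y)      # (x^T x)^{-1} x^T y
cbind(beta.hat, coefficients(lm(y ~ 0 + x)))   # the two columns should agree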

## Ridge regression

Now put a penalty on the magnitude of the coefficient vector: $\tilde{\beta} = \argmin_{\beta}{MSE(\beta) + \mu \sum_{j=1}^{p}{\beta_j^2}} = \argmin_{\beta}{MSE(\beta) + \mu \|\beta\|_2^2}$

Penalizing $$\beta$$ this way makes the estimate more stable; especially useful for

• Lots of noise
• Collinear data ($$\mathbf{x}$$ not of "full rank")
• High-dimensional, $$p > n$$ data (which implies collinearity)

This is called ridge regression, or Tikhonov regularization

Closed form: $\tilde{\beta} = (\mathbf{x}^T\mathbf{x} + \mu I)^{-1}\mathbf{x}^T\mathbf{y}$
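
A minimal sketch of the ridge closed form on the same kind of simulated data (mu here is an arbitrary illustrative value):

set.seed(1)
x <- matrix(rnorm(200), nrow=100)
y <- x %*% c(2, 1) + rnorm(100, sd=0.5)
mu <- 0.1                                                        # penalty factor, chosen arbitrarily
beta.ridge <- solve(t(x) %*% x + mu*diag(ncol(x)), t(x) %*% y)   # (x^T x + mu I)^{-1} x^T y
beta.ridge                                                       # shrunk slightly toward zero vs. OLS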

## The Lasso

Put a penalty on the sum of coefficient's absolute values: $\beta^{\dagger} = \argmin_{\beta}{MSE(\beta) + \lambda \sum_{j=1}^{p}{|\beta_j|}} = \argmin_{\beta}{MSE(\beta) + \lambda\|\beta\|_1}$

This is called the lasso

• Also stabilizes (like ridge)
• Also handles high-dimensional data (like ridge)
• Enforces sparsity: it likes to drive small coefficients exactly to 0

No closed form, but very efficient special-purpose algorithms exist (e.g., the lars and glmnet packages)
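
A hedged usage sketch with the glmnet package, one common lasso implementation (assumed installed; not part of the original slides):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(200), nrow=100)
y <- x %*% c(2, 1) + rnorm(100, sd=0.5)
fit.lasso <- glmnet(x, y, alpha=1)     # alpha = 1 gives the lasso penalty
coef(fit.lasso, s=0.1)                 # coefficients at penalty factor lambda = 0.1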

## Spline smoothing

"Spline smoothing": minimize MSE of a smooth, nonlinear function, plus a penalty on curvature:

$\hat{f} = \argmin_{f}{\frac{1}{n}\sum_{i=1}^{n}{(y_i-f(x_i))^2} + \lambda\int{(f^{\prime\prime}(x))^2 dx}}$

This fits smooth regressions without assuming any specific functional form

• Lets you check linear models
• Makes you wonder why you bother with linear models

Many different R implementations, starting with smooth.spline
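
A minimal smooth.spline sketch on made-up data:

set.seed(2)
x <- runif(200, min=-3, max=3)
y <- sin(x) + rnorm(200, sd=0.2)      # smooth, nonlinear truth plus noise
fit.ss <- smooth.spline(x, y)         # curvature penalty picked by (generalized) cross-validation
plot(x, y, col="grey")
lines(fit.ss, lwd=2)                  # the fitted smooth curve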

## How Big a Penalty?

Rarely know the constraint level or the penalty factor $$\lambda$$ from on high

Lots of ways of picking, but cross-validation often works well (see the sketch after this list):

• Divide the data into parts
• For each value of $$\lambda$$, estimate the model on one part of the data
• See how well the models fit the other part of the data
• Use the $$\lambda$$ which extrapolates best on average
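
For example, a hedged sketch using cv.glmnet from the glmnet package to pick the lasso's $$\lambda$$ (assumes glmnet is installed):

library(glmnet)
set.seed(3)
x <- matrix(rnorm(200), nrow=100)
y <- x %*% c(2, 1) + rnorm(100, sd=0.5)
cv.fit <- cv.glmnet(x, y)        # 10-fold cross-validation over a grid of lambda values
cv.fit$lambda.min                # the lambda with the smallest average out-of-sample error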

## Summary

• Start with pre-built code like optim or nls, implement your own as needed
• Use Lagrange multipliers to turn constrained optimization problems into unconstrained but penalized ones
• Optimal multiplier values are the prices we'd pay to weaken the constraints
• The nature of the penalty term reflects the sort of constraint we put on the problem
  • Shrinkage
  • Sparsity
  • Smoothness

## Example: Lasso

x <- matrix(rnorm(200),nrow=100)                      # 100 observations, 2 predictors
y <- (x %*% c(2,1)) + rnorm(100,sd=0.05)              # true coefficients are (2,1)
mse <- function(b1,b2) {mean((y - x %*% c(b1,b2))^2)}
coef.seq <- seq(from=-1,to=5,length.out=200)          # grid of candidate coefficient values
m <- outer(coef.seq,coef.seq,Vectorize(mse))          # MSE surface over the grid
l1 <- function(b1,b2) {abs(b1)+abs(b2)}               # L1 penalty
l1.levels <- outer(coef.seq,coef.seq,l1)
ols.coefs <- coefficients(lm(y~0+x))                  # unpenalized (OLS) estimate

## Example: Lasso

contour(x=coef.seq,y=coef.seq,z=m,drawlabels=FALSE,nlevels=30,col="grey",
        main="Contours of MSE vs. Contours of L1")
contour(x=coef.seq,y=coef.seq,z=l1.levels,nlevels=30,add=TRUE)   # overlay the L1 contours
points(x=ols.coefs[1],y=ols.coefs[2],pch=19)                     # unpenalized OLS estimate
points(0,0)                                                      # origin, favored by the L1 penalty