Optimization

In an optimization problem we want to find \mathrm{argmin}_x f(x).

An iterative optimization method produces estimates x_{n+1} = x_n + d_n where d_n is a direction vector.

Unconstrained nonlinear

Newton’s method

x_{n+1} = x_n + d, where d is a step that we need to choose.

The method takes a locally quadratic approximation of f around x_n, f(x_n + d) \approx f(x_n) + f'(x_n)d + \frac{1}{2} f''(x_n)d^2, which reaches its minimum at d = -\frac{f'(x_n)}{f''(x_n)}.

The iteration, generalized to a multivariate setting, is given by x_{n+1} = x_n - H_n^{-1}G_n, where H_n is the Hessian and G_n is the gradient, both evaluated at x_n.

Newton’s method for root-finding has a similar update step, x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}; the form is the same because minimizing f amounts to finding a root of f'(x).
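
A minimal sketch of the multivariate iteration, assuming callables grad and hess (hypothetical names) that return G_n and H_n at a point:

    import numpy as np

    def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=100):
        # Iterate x_{n+1} = x_n - H_n^{-1} G_n until the gradient is small.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            G = grad(x)
            if np.linalg.norm(G) < tol:
                break
            # Solve H d = -G rather than forming the inverse explicitly.
            d = np.linalg.solve(hess(x), -G)
            x = x + d
        return x

    # Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2.
    grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
    hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton_minimize(grad, hess, [0.0, 0.0]))  # -> approximately [1, -2]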

Gauss–Newton method

Used for non-linear least squares problems. Approximates the Hessian as follows:

S = S(x) = \sum_i (y_i - f_i(x))^2 = \sum_i r_i^2, where r_i = y_i - f_i(x) are the residuals.

G_j = \frac{\partial S}{\partial x_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial x_j}, \quad \text{i.e. } G = 2J_r^\top r

H_{jk} = \frac{\partial^2 S}{\partial x_j \partial x_k} = 2 \sum_i \left(\frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k} + r_i \frac{\partial^2 r_i}{\partial x_j \partial x_k} \right) \approx 2 \sum_i \frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k}, \quad \text{i.e. } H \approx 2J_r^\top J_r

The update step is x_{n+1} = x_n - H^{-1}G = x_n - (J_r^\top J_r)^{-1} (J_r^\top r) (the factors of 2 cancel).
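
A sketch of the iteration, assuming callables residuals and jacobian (hypothetical names) returning r and J_r:

    import numpy as np

    def gauss_newton(residuals, jacobian, x0, n_iter=50):
        # x_{n+1} = x_n - (J^T J)^{-1} J^T r, computed by solving a linear system.
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            r = residuals(x)
            J = jacobian(x)
            x = x - np.linalg.solve(J.T @ J, J.T @ r)
        return x

    # Example: fit y = exp(a t) to data, with x = [a].
    t = np.linspace(0, 1, 20)
    y = np.exp(0.7 * t)
    residuals = lambda x: y - np.exp(x[0] * t)
    jacobian = lambda x: (-t * np.exp(x[0] * t)).reshape(-1, 1)
    print(gauss_newton(residuals, jacobian, [0.0]))  # -> approximately [0.7]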

Levenberg–Marquardt

Used for non-linear least squares problems: \mathrm{argmin}_x \|f(x)\|^2. If f is affine, the problem reduces to linear least squares.

Modifies the Gauss–Newton method to handle possible divergence when the steps are too big: the normal equations are damped, (J_r^\top J_r + \lambda I)\,d = -J_r^\top r, and \lambda is adjusted at each iteration (a trust-region-style strategy).
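
In practice one usually calls a library implementation, e.g. SciPy's least_squares (a sketch using the same made-up exponential-fit problem as above):

    import numpy as np
    from scipy.optimize import least_squares

    # Fit y = exp(a t) by minimizing the sum of squared residuals with LM.
    t = np.linspace(0, 1, 20)
    y = np.exp(0.7 * t)
    result = least_squares(lambda x: y - np.exp(x[0] * t), x0=[0.0], method='lm')
    print(result.x)  # -> approximately [0.7]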

Quasi-Newton method

Used for minimizing general real-valued functions. In Quasi-Newton methods, the Hessian matrix of second derivatives is not computed. Instead, the Hessian matrix is approximated using updates specified by gradient evaluations.

Quasi-Newton methods involve approximating the Hessian matrix (H); they iteratively minimize a quadratic approximation of f. Around a point a, the function is approximated as f(x) \approx f(a) + (x-a)^T \nabla f(a) + \frac{1}{2}(x-a)^T H (x-a).

Writing G = \nabla f(a), this is f(x) \approx f(a) + (x-a)^T G + \frac{1}{2}(x-a)^T H (x-a).

Positive semi-definiteness:

Gives an idea of curvature. The Hessian matrix plays the same role as the second derivative in 1D, except there are now infinitely many directions in which to look for curvature. Positive definiteness says that all the eigenvalues are positive, which means that along any eigenvector the function is curving up. At a stationary point x (where \nabla f(x) = 0), a positive definite H guarantees that f has a local minimum at x. Negative definite would give a local maximum, and mixed signs would indicate a saddle.

Challenges with the Hessian (H)

There are several issues associated with using the exact Hessian:

  1. It is expensive: there are on the order of n^2 second derivatives to compute, and solving with H costs O(n^3) per step.
  2. Far from a minimum, H may not be positive definite, so the Newton step may not even be a descent direction.

Hessian Approximation (B)

To address these issues, we choose B \approx H. The matrix B must satisfy specific properties:

  1. B is symmetric and positive definite, so that the step -B^{-1}G is a descent direction.
  2. B satisfies the secant condition B(x_{n+1} - x_n) = \nabla f(x_{n+1}) - \nabla f(x_n), i.e. it reproduces the observed change in the gradient.

Update Strategies

The main idea of Quasi-Newton methods is to update the previous approximation B' to a new approximation B such that B is “close” to B'. Common update methods include SR1, DFP, and BFGS (with L-BFGS as its limited-memory variant).
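
For example, SciPy's BFGS implementation builds up the approximation (in fact, an approximation to the inverse Hessian) purely from gradient evaluations; a minimal sketch on the Rosenbrock function:

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
    # No Hessian is supplied: BFGS updates its own approximation from gradients.
    result = minimize(rosen, x0, method='BFGS', jac=rosen_der)
    print(result.x)  # -> approximately all ones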

Line search

Line search is a strategy used to find a “good enough” step size \alpha along a chosen direction. Instead of finding the exact minimum over \alpha, it often uses the minimum of a quadratic approximation.

The Process

  1. Choose a search direction p.
  2. Take a step of size \alpha along this direction.
  3. Evaluate f(x_{curr} + \alpha p) and adjust \alpha until the decrease is “good enough”.

Wolfe Conditions

To ensure convergence, the step size \alpha must satisfy the Wolfe conditions (with constants 0 < c_1 < c_2 < 1):

  1. Sufficient decrease (Armijo): f(x + \alpha p) \leq f(x) + c_1 \alpha \nabla f(x)^\top p.
  2. Curvature: \nabla f(x + \alpha p)^\top p \geq c_2 \nabla f(x)^\top p.
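
A minimal backtracking sketch that enforces the first (sufficient-decrease) condition, assuming callables f and grad_f (hypothetical names):

    import numpy as np

    def backtracking_line_search(f, grad_f, x, p, alpha=1.0, c1=1e-4, rho=0.5):
        # Shrink alpha until f(x + alpha p) <= f(x) + c1 * alpha * grad_f(x)^T p.
        fx = f(x)
        slope = grad_f(x) @ p  # directional derivative; negative for a descent direction
        while f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho
        return alpha

    # Example: steepest-descent direction for f(x) = x^T x at x = (3, 4).
    f = lambda x: x @ x
    grad_f = lambda x: 2 * x
    x = np.array([3.0, 4.0])
    print(backtracking_line_search(f, grad_f, x, -grad_f(x)))  # -> 0.5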

Line Search Algorithm

Steps

Methods

Stochastic gradient descent

Want to minimize Q(w) = \frac{1}{n} \sum_{i = 1}^{n} Q_i(w). Gradient descent does w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i = 1}^{n} \nabla Q_i(w), where \eta is the step size (learning rate). In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) is approximated by the gradient at a single sample: w := w - \eta \nabla Q_i(w).

Algo:

  1. Choose an initial parameter vector w and learning rate \eta.
  2. Randomly shuffle the samples.
  3. For i = 1, \dots, n: w := w - \eta \nabla Q_i(w).
  4. Repeat steps 2–3 until an approximate minimum is reached.

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (a “mini-batch”) at each step. This can perform significantly better than single-sample updates because the per-step computation can be vectorized.
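
A mini-batch SGD sketch for a least-squares objective, Q_i(w) = (x_i^\top w - y_i)^2 (an assumed toy problem, just to make the update concrete):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    w = np.zeros(3)
    eta, batch_size = 0.1, 32
    for epoch in range(20):
        perm = rng.permutation(len(y))                  # reshuffle the samples each epoch
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient over the mini-batch
            w -= eta * grad
    print(w)  # -> approximately w_true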

When the learning rate decreases at an appropriate rate, SGD converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.

Constrained least squares

Non-negative least squares

Given a matrix A and a (column) vector of response variables y, the goal is to find \mathrm{argmin}_x \| Ax - y \|_2^2 subject to x \geq 0.

Can be reformulated as a quadratic programming problem. A generalization is bounded-variable least squares with \alpha_i \leq x_i \leq \beta_i.

Available in SciPy as scipy.optimize.nnls; the bounded-variable generalization is handled by scipy.optimize.lsq_linear.
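
For example:

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    y = np.array([2.0, 1.0, -1.0])
    x, residual_norm = nnls(A, y)  # minimizes ||Ax - y||_2 subject to x >= 0
    print(x)  # -> [1.5, 0.0]; the second component is clamped at zero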

Constrained nonlinear

Lagrange Multipliers

Minimize the function f(P) subject to the constraint that g(P) = 0. Imagine the level sets of f ballooning out until one of them reaches the constraint surface. At the point where f meets the constraint g, the level set of f and the constraint surface are tangential (for smooth f, g). That means that their normal vectors are parallel, so we have \nabla f(P) = \lambda \nabla g(P).

In D dimensions, we now have D+1 equations in D+1 unknowns. D of the unknowns are the coordinates of P (e.g. x, y, and z for D = 3), and the other is the new unknown constant \lambda. The equation for the gradients derived above provides D equations of constraint. The original constraint equation g(P) = 0 is the final equation in the system. Thus, in general, a unique solution exists.

Alternatively, the problem can be formulated via the Lagrangian F(P, \lambda) = f(P) - \lambda g(P).

We next set \nabla F(P, \lambda) = 0, but keep in mind that the gradient is now D + 1 dimensional: one of its components is a partial derivative with respect to \lambda. If you set this new component of the gradient equal to zero, you get the constraint equation g(P) = 0. Meanwhile, the old components of the gradient treat \lambda as a constant, so it just pulls through. Thus, the other D equations are precisely the D equations found in the graphical approach above.
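
For example, minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0. The condition \nabla f = \lambda \nabla g gives 2x = \lambda and 2y = \lambda, so x = y; substituting into the constraint gives x = y = 1/2 (and \lambda = 1).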

Convex optimization

Linear programming

Linear programming is a technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.

Its feasible region is a convex polytope, which is a set defined as the intersection of finitely many half spaces, each of which is defined by a linear inequality.

Its objective function is a real-valued affine (linear) function defined on this polyhedron. A linear programming algorithm finds a point in the polytope where this function has the smallest (or largest) value if such a point exists.

Linear programs are problems that can be expressed in canonical form as

\begin{align} & \text{Optimize} && c^\mathrm{T} x\\ & \text{subject to} && Ax \leq b \\ & \text{and} && x \ge \mathbf{0} \end{align}

where x is the unknown vector, and A is a known matrix and b,c are known vectors.

Typically solved using the simplex algorithm, or with some variant of the interior-point method.
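
For example, with SciPy's linprog (which minimizes c^\mathrm{T} x; the data below are made up):

    import numpy as np
    from scipy.optimize import linprog

    # minimize -x1 - 2 x2  subject to  x1 + x2 <= 4,  x1 + 3 x2 <= 6,  x >= 0
    c = np.array([-1.0, -2.0])
    A_ub = np.array([[1.0, 1.0], [1.0, 3.0]])
    b_ub = np.array([4.0, 6.0])
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(result.x, result.fun)  # -> [3, 1], -5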

The problem can be reformulated to solve variants of the canonical form. For example, to minimize c^\mathrm{T} |x| (with c \geq \mathbf{0}), introduce auxiliary variables t_1, t_2 \geq 0 with x = t_1 - t_2 and minimize c^\mathrm{T}(t_1 + t_2) instead. At the optimum, one of t_{1,i}, t_{2,i} is 0 in each component, so t_1 + t_2 = |x|.
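
A sketch of that reformulation, using linprog to minimize \|x\|_1 (i.e. c = \mathbf{1}) subject to a made-up equality constraint x_1 + 2x_2 = 4:

    import numpy as np
    from scipy.optimize import linprog

    A_eq = np.array([[1.0, 2.0]])
    b_eq = np.array([4.0])
    c = np.ones(4)                                # objective: sum(t1) + sum(t2) = ||x||_1
    A_split = np.hstack([A_eq, -A_eq])            # the constraint acts on x = t1 - t2
    result = linprog(c, A_eq=A_split, b_eq=b_eq)  # default bounds already give t >= 0
    t1, t2 = result.x[:2], result.x[2:]
    print(t1 - t2)  # recovered x -> approximately [0, 2]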
