Optimization

In an optimization problem we want to find \mathrm{argmin}_x f(x).

An iterative optimization method produces estimates x_{n+1} = x_n + d_n where d_n is a direction vector.

Unconstrained nonlinear

Newton’s method

x_{n+1} = x_n + d, where d is a step that we need to choose.

The method takes a locally quadratic approximation of f around x_n, f(x_n + d) \approx f(x_n) + f'(x_n)d + \frac{1}{2} f''(x_n)d^2, which reaches its minimum at d = -\frac{f'(x_n)}{f''(x_n)}.

The iteration, generalized to a multivariate setting, is given by x_{n+1} = x_n - H_n^{-1}G_n, where H_n is the Hessian and G_n is the gradient, both evaluated at x_n.

Newton’s method for root-finding has a similar update step, x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}; the form is the same because minimizing f amounts to finding a root of f'(x).
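
A minimal sketch of the multivariate iteration, assuming callables grad and hess (hypothetical names) that return G_n and H_n at a point:

    import numpy as np

    def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=100):
        # Iterate x_{n+1} = x_n - H_n^{-1} G_n until the gradient is small.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            G = grad(x)
            if np.linalg.norm(G) < tol:
                break
            # Solve H d = -G rather than forming the inverse explicitly.
            d = np.linalg.solve(hess(x), -G)
            x = x + d
        return x

    # Example: minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2.
    grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
    hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton_minimize(grad, hess, [0.0, 0.0]))  # -> approximately [1, -2]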

Gauss–Newton method

Used for non-linear least squares problems. Approximates the Hessian as follows:

S = S(x) = \sum_i (y_i - f_i(x))^2 = \sum_i r_i^2, where r_i = y_i - f_i(x) are the residuals.

G_j = \frac{\partial S}{\partial x_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial x_j}, \quad \text{i.e. } G = 2J_r^\top r

H_{jk} = \frac{\partial^2 S}{\partial x_j \partial x_k} = 2 \sum_i \left(\frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k} + r_i \frac{\partial^2 r_i}{\partial x_j \partial x_k} \right) \approx 2 \sum_i \frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k}, \quad \text{i.e. } H \approx 2J_r^\top J_r

The update step is x_{n+1} = x_n - H^{-1}G = x_n - (J_r^\top J_r)^{-1} (J_r^\top r) (the factors of 2 cancel).
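
A sketch of the iteration, assuming callables residuals and jacobian (hypothetical names) returning r and J_r:

    import numpy as np

    def gauss_newton(residuals, jacobian, x0, n_iter=50):
        # x_{n+1} = x_n - (J^T J)^{-1} J^T r, computed by solving a linear system.
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            r = residuals(x)
            J = jacobian(x)
            x = x - np.linalg.solve(J.T @ J, J.T @ r)
        return x

    # Example: fit y = exp(a t) to data, with x = [a].
    t = np.linspace(0, 1, 20)
    y = np.exp(0.7 * t)
    residuals = lambda x: y - np.exp(x[0] * t)
    jacobian = lambda x: (-t * np.exp(x[0] * t)).reshape(-1, 1)
    print(gauss_newton(residuals, jacobian, [0.0]))  # -> approximately [0.7]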

Levenberg–Marquardt

Used for non-linear least squares problems: \mathrm{argmin}_x \|f(x)\|^2. If f is affine, the problem reduces to linear least squares.

Modifies the Gauss–Newton method to handle possible divergence when the steps are too big: the normal equations are damped, (J_r^\top J_r + \lambda I)\,d = -J_r^\top r, and \lambda is adjusted at each iteration (a trust-region-style strategy).
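
In practice one usually calls a library implementation, e.g. SciPy's least_squares (a sketch using the same made-up exponential-fit problem as above):

    import numpy as np
    from scipy.optimize import least_squares

    # Fit y = exp(a t) by minimizing the sum of squared residuals with LM.
    t = np.linspace(0, 1, 20)
    y = np.exp(0.7 * t)
    result = least_squares(lambda x: y - np.exp(x[0] * t), x0=[0.0], method='lm')
    print(result.x)  # -> approximately [0.7]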

Quasi-Newton method

Used for minimizing general real-valued functions. In Quasi-Newton methods, the Hessian matrix of second derivatives is not computed. Instead, the Hessian matrix is approximated using updates specified by gradient evaluations.

Quasi-Newton methods involve approximating the Hessian matrix (H); they iteratively minimize a quadratic approximation of f. Around a point a, the function is approximated as f(x) \approx f(a) + (x-a)^T \nabla f(a) + \frac{1}{2}(x-a)^T H (x-a).

Writing G = \nabla f(a), this is f(x) \approx f(a) + (x-a)^T G + \frac{1}{2}(x-a)^T H (x-a).

Positive semi-definiteness:

Gives an idea of curvature. The Hessian matrix plays the same role as the second derivative in 1D, except there are now infinitely many directions in which to look for curvature. Positive definiteness says that all the eigenvalues are positive, which means that along any eigenvector the function is curving up. At a stationary point x (where \nabla f(x) = 0), a positive definite H guarantees that f has a local minimum at x. Negative definite would give a local maximum, and mixed signs would indicate a saddle.

Challenges with the Hessian (H)

There are several issues associated with using the exact Hessian:

  1. It is expensive: there are on the order of n^2 second derivatives to compute, and solving with H costs O(n^3) per step.
  2. Far from a minimum, H may not be positive definite, so the Newton step may not even be a descent direction.

Hessian Approximation (B)

To address these issues, we choose B \approx H. The matrix B must satisfy specific properties:

  1. B is symmetric and positive definite, so that the step -B^{-1}G is a descent direction.
  2. B satisfies the secant condition B(x_{n+1} - x_n) = \nabla f(x_{n+1}) - \nabla f(x_n), i.e. it reproduces the observed change in the gradient.

Update Strategies

The main idea of Quasi-Newton methods is to update the previous approximation B' to a new approximation B such that B is “close” to B'. Common update methods include SR1, DFP, and BFGS (with L-BFGS as its limited-memory variant).
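
For example, SciPy's BFGS implementation builds up the approximation (in fact, an approximation to the inverse Hessian) purely from gradient evaluations; a minimal sketch on the Rosenbrock function:

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
    # No Hessian is supplied: BFGS updates its own approximation from gradients.
    result = minimize(rosen, x0, method='BFGS', jac=rosen_der)
    print(result.x)  # -> approximately all ones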

Line search

Line search is a strategy used to find a “good enough” step size \alpha along a chosen direction. Instead of finding the exact minimum over \alpha, it often uses the minimum of a quadratic approximation.

The Process

  1. Choose a search direction p.
  2. Take a step of size \alpha along this direction.
  3. Evaluate f(x_{curr} + \alpha p) and adjust \alpha until the decrease is “good enough”.

Wolfe Conditions

To ensure convergence, the step size \alpha must satisfy the Wolfe conditions (with constants 0 < c_1 < c_2 < 1):

  1. Sufficient decrease (Armijo): f(x + \alpha p) \leq f(x) + c_1 \alpha \nabla f(x)^\top p.
  2. Curvature: \nabla f(x + \alpha p)^\top p \geq c_2 \nabla f(x)^\top p.
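
A minimal backtracking sketch that enforces the first (sufficient-decrease) condition, assuming callables f and grad_f (hypothetical names):

    import numpy as np

    def backtracking_line_search(f, grad_f, x, p, alpha=1.0, c1=1e-4, rho=0.5):
        # Shrink alpha until f(x + alpha p) <= f(x) + c1 * alpha * grad_f(x)^T p.
        fx = f(x)
        slope = grad_f(x) @ p  # directional derivative; negative for a descent direction
        while f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho
        return alpha

    # Example: steepest-descent direction for f(x) = x^T x at x = (3, 4).
    f = lambda x: x @ x
    grad_f = lambda x: 2 * x
    x = np.array([3.0, 4.0])
    print(backtracking_line_search(f, grad_f, x, -grad_f(x)))  # -> 0.5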

Line Search Algorithm

Steps

Methods

Stochastic gradient descent

Want to minimize Q(w) = \frac{1}{n} \sum_{i = 1}^{n} Q_i(w). Gradient descent does w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i = 1}^{n} \nabla Q_i(w), where \eta is the step size (learning rate). In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) is approximated by the gradient at a single sample: w := w - \eta \nabla Q_i(w).

Algo:

  1. Choose an initial parameter vector w and learning rate \eta.
  2. Randomly shuffle the samples.
  3. For i = 1, \dots, n: w := w - \eta \nabla Q_i(w).
  4. Repeat steps 2–3 until an approximate minimum is reached.

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (a “mini-batch”) at each step. This can perform significantly better than single-sample updates because the per-step computation can be vectorized.
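
A mini-batch SGD sketch for a least-squares objective, Q_i(w) = (x_i^\top w - y_i)^2 (an assumed toy problem, just to make the update concrete):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    w = np.zeros(3)
    eta, batch_size = 0.1, 32
    for epoch in range(20):
        perm = rng.permutation(len(y))                  # reshuffle the samples each epoch
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient over the mini-batch
            w -= eta * grad
    print(w)  # -> approximately w_true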

When the learning rate decreases at an appropriate rate, SGD converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.

Constrained least squares

Non-negative least squares

Given a matrix A and a (column) vector of response variables y, the goal is to find \mathrm{argmin}_x \| Ax - y \|_2^2 subject to x \geq 0.

Can be reformulated as a quadratic programming problem. A generalization is bounded-variable least squares with \alpha_i \leq x_i \leq \beta_i.

Available in SciPy as scipy.optimize.nnls; the bounded-variable generalization is handled by scipy.optimize.lsq_linear.
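
For example:

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    y = np.array([2.0, 1.0, -1.0])
    x, residual_norm = nnls(A, y)  # minimizes ||Ax - y||_2 subject to x >= 0
    print(x)  # -> [1.5, 0.0]; the second component is clamped at zero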

Constrained nonlinear

Lagrange Multipliers

Minimize the function f(P) subject to the constraint that g(P) = 0. Imagine the level sets of f ballooning out until one of them reaches the constraint surface. At the point where f meets the constraint g, the level set of f and the constraint surface are tangential (for smooth f, g). That means that their normal vectors are parallel, so we have \nabla f(P) = \lambda \nabla g(P).

In D dimensions, we now have D+1 equations in D+1 unknowns. D of the unknowns are the coordinates of P (e.g. x, y, and z for D = 3), and the other is the new unknown constant \lambda. The equation for the gradients derived above provides D equations of constraint. The original constraint equation g(P) = 0 is the final equation in the system. Thus, in general, a unique solution exists.

Alternatively, the problem can be formulated via the Lagrangian F(P, \lambda) = f(P) - \lambda g(P).

We next set \nabla F(P, \lambda) = 0, but keep in mind that the gradient is now D + 1 dimensional: one of its components is a partial derivative with respect to \lambda. If you set this new component of the gradient equal to zero, you get the constraint equation g(P) = 0. Meanwhile, the old components of the gradient treat \lambda as a constant, so it just pulls through. Thus, the other D equations are precisely the D equations found in the graphical approach above.
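
For example, minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0. The condition \nabla f = \lambda \nabla g gives 2x = \lambda and 2y = \lambda, so x = y; substituting into the constraint gives x = y = 1/2 (and \lambda = 1).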

Convex optimization

Linear programming

Linear programming is a technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.

Its feasible region is a convex polytope, which is a set defined as the intersection of finitely many half spaces, each of which is defined by a linear inequality.

Its objective function is a real-valued affine (linear) function defined on this polyhedron. A linear programming algorithm finds a point in the polytope where this function has the smallest (or largest) value if such a point exists.

Linear programs are problems that can be expressed in canonical form as

\begin{align} & \text{Optimize} && c^\mathrm{T} x\\ & \text{subject to} && Ax \leq b \\ & \text{and} && x \ge \mathbf{0} \end{align}

where x is the unknown vector, and A is a known matrix and b,c are known vectors.

Typically solved using the simplex algorithm, or with some variant of the interior-point method.
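
For example, with SciPy's linprog (which minimizes c^\mathrm{T} x; the data below are made up):

    import numpy as np
    from scipy.optimize import linprog

    # minimize -x1 - 2 x2  subject to  x1 + x2 <= 4,  x1 + 3 x2 <= 6,  x >= 0
    c = np.array([-1.0, -2.0])
    A_ub = np.array([[1.0, 1.0], [1.0, 3.0]])
    b_ub = np.array([4.0, 6.0])
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(result.x, result.fun)  # -> [3, 1], -5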

The problem can be reformulated to solve variants of the canonical form. For example, to minimize c^\mathrm{T} |x| (with c \geq \mathbf{0}), introduce auxiliary variables t_1, t_2 \geq 0 with x = t_1 - t_2 and minimize c^\mathrm{T}(t_1 + t_2) instead. At the optimum, one of t_{1,i}, t_{2,i} is 0 in each component, so t_1 + t_2 = |x|.
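
A sketch of that reformulation, using linprog to minimize \|x\|_1 (i.e. c = \mathbf{1}) subject to a made-up equality constraint x_1 + 2x_2 = 4:

    import numpy as np
    from scipy.optimize import linprog

    A_eq = np.array([[1.0, 2.0]])
    b_eq = np.array([4.0])
    c = np.ones(4)                                # objective: sum(t1) + sum(t2) = ||x||_1
    A_split = np.hstack([A_eq, -A_eq])            # the constraint acts on x = t1 - t2
    result = linprog(c, A_eq=A_split, b_eq=b_eq)  # default bounds already give t >= 0
    t1, t2 = result.x[:2], result.x[2:]
    print(t1 - t2)  # recovered x -> approximately [0, 2]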
