Optimization

In an optimization problem we want to find \argmin_x f(x).

An iterative optimization method produces estimates x_{n+1} = x_n + d_n where d_n is a direction vector.

Unconstrained nonlinear

Newton’s method

x_{n+1} = x_n + d, where d is a step that we need to choose.

The method takes a locally quadratic approximation of f around x_n: f(x_n+d)\approx f(x_n)+f'(x_n)d+{\frac {1}{2}}f''(x_n) d^{2}, which (for f''(x_n) > 0) reaches its minimum at d = -\frac{f'(x_n)}{f''(x_n)}.

The iteration, generalized to the multivariate setting, is given by x_{n+1} = x_n - H_n^{-1}G_n, where H_n is the Hessian and G_n is the gradient, both evaluated at x_n.

Newton’s method for root-finding has the analogous update step x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}; the two coincide because minimizing f amounts to finding a root of f'.
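A minimal sketch of the multivariate iteration in NumPy; the test function, its gradient, and its Hessian below are made up for illustration:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize f via x_{n+1} = x_n - H_n^{-1} G_n."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H d = -g instead of forming the inverse explicitly.
        d = np.linalg.solve(hess(x), -g)
        x = x + d
    return x

# Assumed test function: f(x) = x_0^4 + x_0*x_1 + (1 + x_1)^2
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])
print(newton_minimize(grad, hess, np.array([1.0, 1.0])))
```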

Gauss–Newton method

Used for non-linear least squares problems. Approximates the Hessian as follows:

S = S(x) = \sum_i (y_i - f_i(x))^2 = \sum_i r_i^2, where f_i(x) is the model's prediction for the i-th observation and r_i is the residual.

G_j = \frac{\partial S}{\partial x_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial x_j}, \quad \text{i.e. } G = 2J_r^\top r

H_{jk} = \frac{\partial^2 S}{\partial x_j \partial x_k} = 2 \sum_i \left(\frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k} + r_i \frac{\partial^2 r_i}{\partial x_j \partial x_k} \right) \approx 2 \sum_i \frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k}, \quad \text{i.e. } H \approx 2J_r^\top J_r

The update step is x_{n+1} = x_n - H^{-1}G = x_n - (J_r^\top J_r)^{-1} J_r^\top r.
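A minimal sketch in Python; the exponential model and the synthetic data below are made up for illustration:

```python
import numpy as np

def gauss_newton(residual, jac, x0, n_iter=20):
    """x_{n+1} = x_n - (J_r^T J_r)^{-1} J_r^T r (the factor 2 cancels)."""
    x = x0.astype(float)
    for _ in range(n_iter):
        r, J = residual(x), jac(x)
        x = x - np.linalg.solve(J.T @ J, J.T @ r)
    return x

# Assumed toy problem: fit y ~ a * exp(b * t) to noisy data.
t = np.linspace(0, 1, 20)
rng = np.random.default_rng(0)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.standard_normal(t.size)

def residual(x):  # r_i = y_i - a exp(b t_i)
    a, b = x
    return y - a * np.exp(b * t)

def jac(x):  # J[i, j] = dr_i / dx_j
    a, b = x
    e = np.exp(b * t)
    return np.column_stack([-e, -a * t * e])

print(gauss_newton(residual, jac, np.array([1.0, 1.0])))  # ~ [2.0, 1.5]
```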

Levenberg–Marquardt

Used for non-linear least squares problems.

Modifies the Gauss–Newton method to guard against divergence when the steps are too large by damping the update (a trust-region approach): the step solves (J_r^\top J_r + \lambda I)\, d = -J_r^\top r, with the damping parameter \lambda adjusted at each iteration.
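In SciPy this is available through scipy.optimize.least_squares; a sketch reusing the residual and jac from the Gauss–Newton example above:

```python
from scipy.optimize import least_squares

# method="lm" selects MINPACK's Levenberg–Marquardt implementation.
result = least_squares(residual, x0=[1.0, 1.0], jac=jac, method="lm")
print(result.x)  # ~ [2.0, 1.5]
```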

Quasi-Newton method

Used for minimizing general real-valued functions. In quasi-Newton methods, the Hessian matrix of second derivatives is not computed. Instead, the Hessian is approximated using updates specified by gradient evaluations. BFGS is the best-known method of this family.
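A sketch using SciPy's BFGS implementation on its built-in Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# BFGS builds its Hessian approximation from successive gradient evaluations.
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="BFGS")
print(result.x)  # ~ np.ones(5), the known minimum of the Rosenbrock function
```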

Stochastic gradient descent

Want to minimize Q(w) = \frac{1}{n} \sum_{i = 1}^{n} Q_i(w). Gradient descent does w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i = 1}^{n} \nabla Q_i(w), where \eta is the step size (learning rate). In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) is approximated by a gradient at a single sample: w := w - \eta \nabla Q_i(w).

Algo: choose an initial w and a learning rate \eta; then repeat until an approximate minimum is reached: randomly shuffle the training set, and for each i = 1, \dots, n do w := w - \eta \nabla Q_i(w).

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a “mini-batch”) at each step. This can perform significantly better than single-sample updates because the computation over a mini-batch can be vectorized, as sketched below.
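A minimal NumPy sketch for a least-squares objective; the data, learning rate, and batch size are made up for illustration (batch size 1 recovers plain SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(1000)

# Q_i(w) = (x_i^T w - y_i)^2 / 2, so grad Q_i(w) = x_i (x_i^T w - y_i).
w, eta, batch = np.zeros(3), 0.05, 32
for epoch in range(20):
    perm = rng.permutation(len(y))  # shuffle each epoch
    for idx in np.array_split(perm, len(y) // batch):
        Xb, yb = X[idx], y[idx]
        w -= eta * Xb.T @ (Xb @ w - yb) / len(idx)  # averaged mini-batch gradient
print(w)  # ~ w_true
```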

When the learning rate \eta decreases at an appropriate rate, converges almost surely to a global minimum if the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.

Constrained least squares

Non-negative least squares

Given a matrix A and a (column) vector of response variables y, the goal is to find \operatorname {arg\,min} \limits_{ x }\| Ax - y \|_2^2 subject to x \geq 0.

Can be reformulated as a quadratic programming problem. A generalization is bounded-variable least squares with \alpha_i \leq x_i \leq \beta_i.

Available in SciPy as scipy.optimize.nnls; the bounded-variable generalization is scipy.optimize.lsq_linear.
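A short sketch with made-up data:

```python
import numpy as np
from scipy.optimize import nnls, lsq_linear

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
y = A @ np.array([1.0, 0.0, 2.0, 0.0, 0.5]) + 0.01 * rng.standard_normal(20)

x, rnorm = nnls(A, y)  # argmin ||Ax - y||_2 subject to x >= 0
print(x)

# Bounded-variable generalization: alpha_i <= x_i <= beta_i for all i.
res = lsq_linear(A, y, bounds=(0.0, 1.5))
print(res.x)
```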

Constrained nonlinear

Lagrange Multipliers

Minimize the function f(P) subject to the constraint that g(P) = 0. Imagine the level sets of f ballooning out until one touches the constraint surface. At the point where f meets the constraint g, the level set and the constraint surface are tangential (for smooth f, g). That means their normal vectors are parallel. So we have \nabla f(P) = \lambda \nabla g(P).

In D dimensions, we now have D+1 equations in D+1 unknowns. D of the unknowns are the coordinates of P (e.g. x, y, and z for D = 3), and the other is the new unknown constant \lambda. The equation for the gradients derived above provides D equations of constraint. The original constraint equation g(P) = 0 is the final equation in the system. Thus the system can generally be solved, yielding the candidate constrained extrema.

Alternatively, the problem can be formulated via the Lagrangian F(P, \lambda) = f(P) - \lambda g(P).

We next set \nabla F(P, \lambda) = 0, keeping in mind that the gradient is now D + 1 dimensional: one of its components is the partial derivative with respect to \lambda. Setting this new component equal to zero recovers the constraint equation g(P) = 0. Meanwhile, the old components of the gradient treat \lambda as a constant, so it just pulls through, and the other D equations are precisely the D equations found in the graphical approach above.
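A worked example (not from the text): minimize f(x, y) = x + y on the unit circle g(x, y) = x^2 + y^2 - 1 = 0, solving \nabla F = 0 with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
f = x + y            # objective
g = x**2 + y**2 - 1  # constraint surface g = 0
F = f - lam * g      # the Lagrangian F(P, lambda)

# grad F = 0: D equations from the spatial components, plus the constraint.
eqs = [sp.diff(F, v) for v in (x, y, lam)]
print(sp.solve(eqs, (x, y, lam)))
# Two critical points (+-1/sqrt(2), +-1/sqrt(2)); the negative one is the minimum.
```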

Convex optimization

Linear programming

Linear programming is a technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.

Its feasible region is a convex polytope, which is a set defined as the intersection of finitely many half spaces, each of which is defined by a linear inequality.

Its objective function is a real-valued affine (linear) function defined on this polyhedron. A linear programming algorithm finds a point in the polytope where this function has the smallest (or largest) value if such a point exists.

Linear programs are problems that can be expressed in canonical form as

\begin{align} & \text{Optimize} && c^\mathrm{T} x\\ & \text{subject to} && Ax \leq b \\ & \text{and} && x \ge \mathbf{0} \end{align}

where x is the vector of unknowns, A is a known matrix, and b and c are known vectors.

Typically solved using the simplex algorithm or a variant of the interior-point method.
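A sketch using scipy.optimize.linprog; the toy problem is made up for illustration (linprog minimizes, so a maximization negates c):

```python
import numpy as np
from scipy.optimize import linprog

# Assumed toy problem: maximize x_1 + 2 x_2 subject to
# x_1 + x_2 <= 4, x_1 + 3 x_2 <= 6, x >= 0.
c = np.array([-1.0, -2.0])  # negated for maximization
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))  # HiGHS solver by default
print(res.x, -res.fun)  # optimum at (3, 1) with objective value 5
```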

The problem can be reformulated to solve variants of the canonical form. For example, to minimize c^\top |x| (with c \geq 0 and |x| taken elementwise), introduce auxiliary variables t_1, t_2 \geq 0 with x = t_1 - t_2 and minimize c^\top (t_1 + t_2). At the optimum, one of t_{1,i}, t_{2,i} will be 0 for each i, so that t_1 + t_2 = |x|; a sketch follows below.
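A sketch of this reformulation (data made up), minimizing \|x\|_1 = \mathbf{1}^\top |x| subject to A x = b; the stacked variable is [t_1; t_2], and t \geq 0 is linprog's default bound:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
b = A @ np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.0])  # a sparse solution exists

# Variables z = [t1; t2] >= 0, x = t1 - t2, objective 1^T (t1 + t2).
n = A.shape[1]
res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=b)
t1, t2 = res.x[:n], res.x[n:]
print(t1 - t2)  # a minimum-1-norm solution; one of t1_i, t2_i is 0 per index
```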
