In an optimization problem we want to find \operatorname{arg\,min}_x f(x).
An iterative optimization method produces estimates x_{n+1} = x_n + d_n, where d_n is a step (a direction together with a size) which we need to choose.
Newton's method takes a locally quadratic approximation of f around x_n, f(x_n+d)\approx f(x_n)+f'(x_n)d+{\frac {1}{2}}f''(x_n) d^{2}, which reaches its minimum at d = -\frac{f'(x_n)}{f''(x_n)}.
The iteration, generalized to a multivariate setting, is given by x_{n+1} = x_n - H_n^{-1}G_n, where H_n is the Hessian and G_n is the gradient at x_n.
Newton’s method for root-finding has a similar update step, x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}: the optimization update is exactly this iteration applied to f', since a minimum of f is a root of f'(x).
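A minimal sketch of the multivariate update in Python (the function newton_minimize and the quadratic test function are illustrative, not from any library):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Iterate x_{n+1} = x_n - H_n^{-1} G_n until the step is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # solve H d = G rather than inverting H
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# f(x) = x_0^2 + 3 x_1^2: the quadratic model is exact, so one step reaches the minimum.
grad = lambda x: np.array([2 * x[0], 6 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(newton_minimize(grad, hess, [5.0, -3.0]))  # ~[0, 0]
```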
The Gauss–Newton method is used for non-linear least squares problems. It approximates the Hessian as follows:
S = S(x) = \sum_i (y_i - f_i(x))^2 = \sum_i r_i^2, where x is the parameter vector, f_i(x) is the model prediction for the i-th observation, and r_i is the residual.
G_j = \frac{\partial S}{\partial x_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial x_j}, i.e. G = 2J_r^⊤r
H_{jk} = \frac{\partial^2 S}{\partial x_j \partial x_k} = 2 \sum_i \left(\frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k} + r_i \frac{\partial^2 r_i}{\partial x_j \partial x_k} \right) \approx 2 \sum_i \frac{\partial r_i}{\partial x_j}\frac{\partial r_i}{\partial x_k}, i.e. H \approx 2J_r^⊤ J_r. The dropped second-derivative term is negligible when the residuals are small or the model is close to linear, which is what justifies the approximation.
The update step is x_{n+1} = x_n - H^{-1}G = x_n - (J_r^⊤J_r)^{-1} (J_r^⊤r)
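A sketch of the full Gauss–Newton loop, fitting y \approx a e^{bt} to noiseless synthetic data (the model, data, and helper names are made up for illustration):

```python
import numpy as np

def gauss_newton(residual, jac, x0, n_iter=20):
    """x_{n+1} = x_n - (J^T J)^{-1} J^T r, with J the Jacobian of the residuals."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r, J = residual(x), jac(x)
        x = x - np.linalg.solve(J.T @ J, J.T @ r)
    return x

# Residuals r_i = y_i - a * exp(b * t_i) for parameters x = (a, b).
t = np.linspace(0, 2, 30)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda x: y - x[0] * np.exp(x[1] * t)
jac = lambda x: np.column_stack([-np.exp(x[1] * t),              # dr_i/da
                                 -x[0] * t * np.exp(x[1] * t)])  # dr_i/db
print(gauss_newton(residual, jac, [1.0, -1.0]))  # ~[2.0, -1.5]
```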
The Levenberg–Marquardt algorithm is also used for non-linear least squares problems. It modifies the Gauss–Newton method to handle possible divergence when the steps are too big, damping the update in the spirit of a trust region method: the step solves (J_r^⊤J_r + \lambda I)\,d = -J_r^⊤ r, with the damping parameter \lambda \geq 0 adjusted at each iteration.
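In practice one rarely hand-rolls this; SciPy exposes a Levenberg–Marquardt solver through scipy.optimize.least_squares (the toy exponential-fit data here is illustrative):

```python
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0, 2, 30)
y = 2.0 * np.exp(-1.5 * t)

# method='lm' selects the Levenberg-Marquardt solver.
res = least_squares(lambda x: y - x[0] * np.exp(x[1] * t), x0=[1.0, -1.0], method='lm')
print(res.x)  # ~[2.0, -1.5]
```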
Quasi-Newton methods are used for minimizing general real-valued functions. The Hessian matrix of second derivatives is not computed directly; instead, it is approximated using updates specified by successive gradient evaluations.
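The best-known example is BFGS. A quick sketch using SciPy's implementation (the Rosenbrock function is just a standard test problem; with no gradient supplied, SciPy falls back to finite differences):

```python
from scipy.optimize import minimize

# Rosenbrock function, minimum at (1, 1).
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
res = minimize(f, x0=[-1.0, 2.0], method='BFGS')
print(res.x)  # ~[1, 1]
```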
Want to minimize Q(w) = \frac{1}{n} \sum_{i = 1}^{n} Q_i(w). Gradient descent does w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i = 1}^{n} \nabla Q_i(w), where \eta is the step size (learning rate). In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) is approximated by a gradient at a single sample: w := w - \eta \nabla Q_i(w).
Algo: for each epoch, randomly shuffle the training samples, then for each i in turn update w := w - \eta \nabla Q_i(w); repeat until (approximate) convergence.
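A minimal sketch, assuming a linear least-squares objective Q_i(w) = (x_i^⊤ w - y_i)^2 (the data and hyperparameters are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=100):
    """Q_i(w) = (x_i . w - y_i)^2, so grad Q_i(w) = 2 (x_i . w - y_i) x_i."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # shuffle the samples each epoch
            w -= eta * 2 * (X[i] @ w - y[i]) * X[i]  # step on a single sample's gradient
    return w

X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, y))  # ~[1.0, -2.0, 0.5]
```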
A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a “mini-batch”) at each step. This can perform significantly better because the per-batch computation can be vectorized.
With an appropriately decreasing step size, it converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.
Given a matrix A and a (column) vector of response variables y, the goal is to find \operatorname {arg\,min} \limits_{ x }\| Ax - y \|_2^2 subject to x \geq 0.
Can be reformulated as a quadratic programming problem. A generalization is bounded-variable least squares with \alpha_i \leq x_i \leq \beta_i.
Available in SciPy as scipy.optimize.nnls; the bounded-variable generalization is covered by scipy.optimize.lsq_linear.
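A quick example of both calls on a random problem (the data is illustrative):

```python
import numpy as np
from scipy.optimize import nnls, lsq_linear

rng = np.random.default_rng(0)
A, y = rng.normal(size=(20, 5)), rng.normal(size=20)

x, rnorm = nnls(A, y)                  # argmin ||Ax - y||_2 subject to x >= 0
res = lsq_linear(A, y, bounds=(0, 1))  # bounded-variable variant: 0 <= x_i <= 1
print(x, res.x)
```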
Minimize the function f(P) subject to the constraint that g(P) = 0. Imagine the level sets of f ballooning out until one reaches the constraint surface. At the point where f meets the constraint g, the two surfaces will be tangential (for smooth f, g). That means that their normal vectors will be parallel, so we have \nabla f(P) = \lambda \nabla g(P).
In D dimensions, we now have D+1 equations in D+1 unknowns. D of the unknowns are the coordinates of P (e.g. x, y, and z for D = 3), and the other is the new unknown constant \lambda. The equation for the gradients derived above provides D equations of constraint. The original constraint equation g(P) = 0 is the final equation in the system. Thus, in general, a unique solution exists.
Alternatively, the problem can be formulated via the Lagrangian F(P, \lambda) = f(P) - \lambda g(P).
We next set \nabla F(P, \lambda) = 0, but keep in mind that the gradient is now D + 1 dimensional: one of its components is a partial derivative with respect to \lambda. If you set this new component of the gradient equal to zero, you get the constraint equation g(P) = 0. Meanwhile, the old components of the gradient treat \lambda as a constant, so it just pulls through. Thus, the other D equations are precisely the D equations found in the graphical approach above.
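For example (a standard worked example): minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0. Setting \nabla f = \lambda \nabla g gives 2x = \lambda and 2y = \lambda, so x = y; the constraint then forces x = y = \frac{1}{2} (with \lambda = 1), i.e. the closest point to the origin on the line x + y = 1.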
Linear programming is a technique for the optimization of a linear objective function, subject to linear equality and linear inequality constraints.
Its feasible region is a convex polytope, which is a set defined as the intersection of finitely many half spaces, each of which is defined by a linear inequality.
Its objective function is a real-valued affine (linear) function defined on this polyhedron. A linear programming algorithm finds a point in the polytope where this function has the smallest (or largest) value if such a point exists.
Linear programs are problems that can be expressed in canonical form as
\begin{align} & \text{Optimize} && c^\mathrm{T} x\\ & \text{subject to} && Ax \leq b \\ & \text{and} && x \ge \mathbf{0} \end{align}
where x is the unknown vector, and A is a known matrix and b,c are known vectors.
Typically solved using the simplex algorithm, or with some variant of the interior-point method.
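For example, with SciPy's linprog, which minimizes by default and whose default bounds already impose x \geq 0 (the numbers here are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Maximize x_0 + 2 x_1 subject to x_0 + x_1 <= 4, x_0 + 3 x_1 <= 6, x >= 0,
# by minimizing the negated objective.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])
res = linprog(c, A_ub=A, b_ub=b)
print(res.x, -res.fun)  # ~[3, 1], optimal value 5
```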
The problem can be reformulated to solve variants of the canonical form. For example, to minimize c^\mathrm{T} |x| with c \geq \mathbf{0}, introduce auxiliary variables t_1, t_2 \geq 0 into the constraints so that x = t_1 - t_2. The minimization will then set the t_i such that t_1 + t_2 = |x|, since componentwise at least one of the t_i will be 0 at the optimum.
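Explicitly, for c \geq \mathbf{0} the reformulated program is
\begin{align} & \text{minimize} && c^\mathrm{T} (t_1 + t_2)\\ & \text{subject to} && A(t_1 - t_2) \leq b \\ & \text{and} && t_1, t_2 \ge \mathbf{0}, \end{align}
after which x is recovered as x = t_1 - t_2.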