Gradient-based optimization
On a basic level, gradient-based algorithms find the direction of steepest descent and follow that path. This is done in three main steps:
- Search Direction
- Step Size
- Convergence Check
The first step is to pick a direction to go. The solver evaluates the slope by differentiating the function in a small neighbourhood of the current point. In one dimension this derivative is the slope; in multiple dimensions it is called the gradient. Consequently, the search direction is the direction of steepest descent at the examined point.
Next, the algorithm needs to choose how far to go in the chosen direction; this distance is called the step size. Once a direction and a step size are chosen, the solver takes the step and checks whether it has reached the bottom. If not, it uses the slope again to pick a new direction and step size. This continues until the solver reaches the minimum, a process we call convergence.
The process of repeatedly stepping in the direction of steepest descent, i.e. against the gradient, is called gradient descent. Conversely, stepping in the direction of the gradient itself, toward a maximum, is known as gradient ascent.
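As a minimal sketch of this loop in R (the example function, starting point, and fixed step size below are our own choices for illustration, not from any particular solver):

```r
# Fixed-step gradient descent: direction, step, convergence check
f    <- function(x) (x[1] - 3)^2 + (x[2] + 1)^2          # example function, minimum at (3, -1)
grad <- function(x) c(2 * (x[1] - 3), 2 * (x[2] + 1))    # its gradient

x    <- c(0, 0)    # starting point
step <- 0.1        # step size
for (i in 1:1000) {
  g <- grad(x)                       # search direction: steepest descent is -g
  x <- x - step * g                  # move in the chosen direction
  if (sqrt(sum(g^2)) < 1e-8) break   # convergence check on the gradient norm
}
x  # close to c(3, -1)
```

Practical solvers choose the step size adaptively (e.g. with a line search) rather than fixing it, but the three-step structure is the same.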
How to calculate derivatives
Gradient-based algorithms need to calculate local gradients to work. There are three main methods of obtaining gradients:
- Symbolic Differentiation
- Numeric Differentiation
- Automatic Differentiation
Symbolic differentiation is in essence what we learn to do in calculus: an algorithm applies the rules of differentiation to a function and thus obtains its derivative as a new expression. That expression can then be evaluated to get the gradient at a specific point. However, symbolic differentiation outputs the whole derivative as a single expression, which can grow very large and become difficult to interpret. Furthermore, it is not a very fast method.
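Base R can do this itself with `D()`, which differentiates an expression symbolically; a small illustration (the example function is made up):

```r
# Symbolic differentiation with base R's D()
f_expr <- expression(x^2 + sin(x))
df     <- D(f_expr, "x")     # returns the expression 2 * x + cos(x)
df
eval(df, list(x = 1.5))      # evaluate the derivative at a specific point
```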
The classic form of numeric differentiation is finite differencing: we approximate the slope or gradient of a function by evaluating the function at points close to the point of interest. The major problem with finite differencing is that floating-point arithmetic limits how closely two points can be placed together, so we can only ever get an approximation of the derivative with this method. Furthermore, numerical differentiation becomes very slow for high-dimensional problems, since each component of the gradient requires extra function evaluations.
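A central-difference gradient can be sketched in a few lines of R; the step-size heuristic below is one common choice, not a universal rule:

```r
# Central-difference approximation of the gradient
fd_gradient <- function(f, x, h = .Machine$double.eps^(1/3)) {
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)   # perturb only the i-th coordinate
    (f(x + e) - f(x - e)) / (2 * h)
  })
}

f <- function(x) sum(x^2)     # true gradient is 2 * x
fd_gradient(f, c(1, 2, 3))    # approximately c(2, 4, 6)
```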
Automatic differentiation is like symbolic differentiation in that it applies a series of rules to obtain a derivative from a function. The difference is that automatic differentiation is specifically designed to work on computer code. It uses the chain rule to break the function down into its elementary operations, differentiates each part, and combines them again. This process is more difficult to implement than numerical methods, but it delivers exact derivatives (up to machine precision) instead of approximations, and it remains fast for high-dimensional problems.
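Base R has no built-in automatic differentiation, but the core idea of forward-mode AD can be sketched with "dual numbers" that carry a value and its derivative through every operation; the helper names below are ours, purely for illustration:

```r
# Tiny forward-mode AD sketch: each value carries its derivative along
dual  <- function(value, deriv) list(v = value, d = deriv)
d_mul <- function(a, b) dual(a$v * b$v, a$d * b$v + a$v * b$d)  # product rule
d_sin <- function(a) dual(sin(a$v), cos(a$v) * a$d)             # chain rule

# Differentiate f(x) = x * sin(x) at x = 1.5; seed dx/dx = 1
x  <- dual(1.5, 1)
fx <- d_mul(x, d_sin(x))
fx$v  # f(1.5)
fx$d  # exact f'(1.5) = sin(1.5) + 1.5 * cos(1.5)
```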
Strengths and Weaknesses of Using Gradients
Gradient-based algorithms are widely used, have fast performance, and scale well to large problems. However, they require smooth, continuous functions with well-defined gradients, and computing these gradients can be expensive. Many gradient-based optimizers are also susceptible to getting stuck in a local minimum rather than finding the global optimum. Gradient-based optimizers are a powerful tool, but as with any optimization problem, it takes experience and practice to know which method is the right one.
Newton’s Method
Newton’s method can be applied to the derivative $f'$ of a twice-differentiable function $f$ to find a stationary point. Starting from a current estimate $x_n$, the update

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

is a better approximation of the root of $f'$ than $x_n$, and iterating it converges to a stationary point of $f$.
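As a small sketch in R (the test function and starting point are our own choices):

```r
# Newton's method to minimize f(x) = cos(x), starting near its minimum at pi
f1 <- function(x) -sin(x)   # f'
f2 <- function(x) -cos(x)   # f''
x <- 3
for (i in 1:50) {
  step <- f1(x) / f2(x)
  x <- x - step
  if (abs(step) < 1e-10) break   # stop when the updates become tiny
}
x  # approximately pi
```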
Quasi-Newton Methods
Newton’s method assumes that the function can be locally approximated as a quadratic in the region around the optimum and uses the first and second derivatives to find the stationary point; this requires the Hessian matrix. In quasi-Newton methods, the Hessian matrix does not need to be computed directly: an approximation to it is updated by analyzing successive gradient vectors instead. Quasi-Newton methods generalize the secant method of finding the root of the first derivative to multidimensional problems. The `optim()` method "BFGS" is a quasi-Newton method, specifically the one published simultaneously in 1970 by Broyden, Fletcher, Goldfarb and Shanno (Broyden 1970, Fletcher 1970, Goldfarb 1970, Shanno 1970). It uses function values and gradients to build up a picture of the surface to be optimized. The method "L-BFGS-B" of Byrd et al. (1995) is another quasi-Newton method which allows box constraints: each variable can be given a lower and/or upper bound.
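For example, BFGS on the Rosenbrock function, a standard test problem; the analytic gradient is standard, and the starting point below is just a common choice:

```r
# Rosenbrock "banana" function and its analytic gradient
rosenbrock <- function(p) {
  x <- p[1]; y <- p[2]
  100 * (y - x^2)^2 + (1 - x)^2
}
rosen_grad <- function(p) {
  x <- p[1]; y <- p[2]
  c(-400 * x * (y - x^2) - 2 * (1 - x),   # d/dx
     200 * (y - x^2))                     # d/dy
}

fit <- optim(c(-1.2, 1), rosenbrock, rosen_grad, method = "BFGS")
fit$par  # approximately c(1, 1), the global minimum
```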
Conjugate Gradient Algorithm
The conjugate gradient algorithm is an efficient numerical method for solving a linear system or, equivalently, for optimizing a quadratic convex function. Conjugate gradient methods will generally be more fragile than the BFGS method, but as they do not store a matrix, they may be successful in much larger optimization problems. The `optim()` method "CG" is a conjugate gradients method based on that by Fletcher and Reeves (1964).
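A minimal call might look like this; the quadratic test function is made up for illustration:

```r
# Minimize a convex quadratic with optim()'s conjugate gradient method;
# with no gr argument, optim() falls back to finite-difference gradients
quad <- function(p) sum((p - c(1, 2, 3))^2)   # minimum at (1, 2, 3)
optim(rep(0, 3), quad, method = "CG")$par     # approximately c(1, 2, 3)
```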