The Gradient Descent Algorithm and the Intuition Behind It | by Antonieta Mastrogiuseppe | Aug, 2023

A technical description of the Gradient Descent method, complemented with a graphical representation of the algorithm at work

1. INTRODUCING SOME KEY DEFINITIONS

Within optimization methods, and in the first order algorithm type, one has certainly heard of the one known as Gradient Descent. It is of the first-order optimization type as it requires the first-order derivative, namely the gradient. By optimizing, gradient descent aims to minimize the difference between the “actual” output and the predicted output of the model as measured by the objective function, namely a cost function. The gradient, or slope, is defined as the direction of the line drawn by such function (curved or straight) at a given point of such line. Iteratively, gradient descent aims to differentiate the cost function at different points of the line so to derive the degree of change in direction at these points and hence take steps towards the steepest descent, namely the local minimum. As its name indicates, the gradient is used as the direction towards the steepest descent in search for the local minimum where the parameters’ values of the cost function being optimized are minimized hence at its lowest values.

Gradient Descent is mostly used (among others) to train machine learning models and deep learning models, the latter based on a neural network architectural type. From linear regression and logistic regression to neural networks, gradient descent aims to calculate the function’s best parameters values. In its simplest form, gradient descent aims to minimize the error term of the below linear regression by deriving the optimal values of the parameters for the independent variables. This is,

y = β0 + β1 * X1 + … βk * Xk + Ɛ

where,

y is the dependent variable

k number of independent variables

X independent variable

β parameter

Ɛ error term component

In its more complex form, gradient descent is most frequently defined as the optimizer when training a deep learning model, specifically on the compiling phase. Deep learning is based on an interconnected network to learn and improve continuously, namely a neural network. Inspired by the human brain, a neural network is a highly complex network represented by artificial neurons, known as nodes. At the top level, the nodes have the important role to process and analyse the data from a node in the previous layer and pass it to another node in the next layer. In a neural network, a weight, namely the parameters of interest for optimization, are the link between nodes. They are the link between inputs/features and hidden layers hence they represent the importance of a given feature in predicting the final output. Finding the optimal value of a single weight depends on the value of many weights. And this optimization is taking placing for many weights at once, which in a deep neural network can be substantially large even in millions. Here is where gradient descent is found to perform highly efficiently on the large number of calculations involved, these based on the three main layers of a neural network: 1) Input Layer 2) Hidden Layer and 3) Output Layer.

There are numerous papers that properly detail and expand on methods such deep learning and methods that estimates the value of a function’s parameters hence expand on the difference between gradient descent and Ordinary Least Square (OLS), in the case of linear regression for example. As this is not the focus of this paper, the reader is prompted to investigate further and expand for a good understanding on such methodologies.

2. TIME FOR SOME CALCULUS!

For a good understanding of gradient descent, we need to expand on the definition of a differentiable function. A function, explicitly ƒ(x), is differentiable when the derivative can be defined at any point of the curved line derived by such function. This is, for all points in the domain of the function ƒ(x). Here, two concepts reinforce this definition: first-order derivative and second order derivative. The first-order derivative formula is defined as follow:

Strictly speaking, the first-order derivative of a function, denoted ƒ’(x) or df(x)/dx, is the slope of the function ƒ(x) at a given point value of x. If the slope is positive (negative), it indicates the function is increasing (decreasing) and by how much. A positive slope is an indication that as the value of x increases, so is the function ƒ(x). A negative slope, on the other hand, indicates that as the value of x increases, ƒ(x) decreases. The second-order derivative is the derivative of the derivative of the function ƒ(x). Denoted ƒ’’(x) or d2f(x)/dx2, the second derivative indicates the shape of the function ƒ(x). This is, whether such function is concave or convex. Mathematically speaking, (and this is important!!!) a second derivative will distinguish a relative maximum from a relative minimum.

where,

ƒ’’(x) > 0 then ƒ(x) is convex at x = a

and if ƒ’(a) = 0 then a is a critical point hence a relative minimum

where,

ƒ’’(x) < 0 then ƒ(x) is concave at x = a

and if ƒ’(a) = 0 then a is a critical point hence a relative maximum

or if the second derivative is equal to zero, then either 1) the function ƒ(x) is in a turning point, known as Inflection point, where it changes from concave to convex or vice versa or 2) the function at that point is undefined (i.e., non-continuous). In the case of the former:

ƒ’’(x) = 0 then ƒ(x) is at an inflection point at x = 2

The above has focused on a function with a single independent variable, namely a univariate function, y = ƒ(x). In the real world, one will be studying and modelling multivariable functions, where the variable under study is impacted by multiple factors, this is two or more independent variables, y = ƒ(x, z). To measure the impact of a change of the independent variable x in the dependent variable y, keeping z constant, the partial derivative of the function with respect to x is taken. Thus, partial derivatives calculate the rate of change in the cost function caused by a change in each of their inputs. Gradient descent iteratively calculates these changes in the cost function and at each different step updates the values of the parameters of such functions till reaching the minimum point where the value of such parameters is optimized hence the cost function is minimized.

3. THE GRADIENT DESCENT ALGORITHM AT WORK

The larger the absolute value of the slope, the further we can step, and/or we can keep taking steps towards the steepest descent, namely the local minimum. As we approach the lowest/minimum point, the slope diminishes so one can take smaller steps until reaching a flat surface where the slope is equal to zero (0), ƒ’(x) = 0, this is lowest value of βi as pointed by the red arrow in the graph below. This is where the local minimum of the curved line is, and optimum values of the function’s parameters are derived.

Thus, if a function is strictly convex (concave), there will only be one critical point. Now, there is also the case where there are multiple local minima. In this case, the search is for the single lowest value the function can achieve. This is known as Global Minimum.

The following two key questions arise:

1) In which direction to step?

2) How big the steps should be?

Let us recap what have we have said so far. Gradient descent is an algorithm, that while in the training phase of a model, iteratively adjusts hence optimizes the values of the function’s parameters by taking the partial derivative of the function with respect to each of its inputs at every step it takes towards the steepest descent, defined as the local minimum. If the derivative is positive, the function is increasing. Thus, steps should be taken opposite direction. The gradient indicates in which direction the step should be taken. If gradient is large, namely large slope absolute value, larger steps should be taken towards the local minimum. In fact, gradient descent takes increasingly smaller steps on the direction of the local minima within each iteration as shown by the blue arrows in the graph above.

How big the step to take relates to the learning rate. This is how fast the algorithm learns/moves towards the steepest descent. At the highest gradient, large absolute value of the slope, the fastest the algorithm learns. As it approaches the local minimum, smaller the steps to take. Thus, learning rate value as any hyperparameter is set after trying different values so that the cost function decreases across iterations. If too big, the local minimum can be missed. A small learning rate might lead to small updates to the weights causing the model not to improve significantly. If too small, it will take time to reach convergence. Convergence is reached when the cost function does not longer decrease. Thus, the cost function is the indicator of the algorithm performance. In a multivariate function world, this is denoted:

where,

df/dβ partial derivative of the cost function with respect to the parameter β

m number of data points

yi actual values of the dependent/target variable for the i-th data point

ŷi predicted values by the model of the dependent/target variable for the i-th data point

xi represents the i-th input associated with the data point.

where,

▽f represents the gradient vector of the function f(x) with respect to the parameters β

df/dβk represents the partial derivative of the function f(x) with respect to the k-th parameter β

were,

New β represents the current value of the i-th parameter β

Old β represents the updated value of the i-th parameter β

n is the learning rate: length of the step to take!

▽f is the gradient vector pointing in the direction of the steepest descent of the function f(x) with respect to changes in the parameters β to minimize f(x)

One of the limitations of gradient descent is related to one of the criteria mentioned above where the function must be differentiable at every point of its domain. When this is not the case and the algorithm finds a point that is undefined, (i.e., non-continuous) then it fails.

Another limitation is related to the size of the steps, namely the learning rate (n), taken towards the steepest descent. If too large it is likely to miss the local minimum or even not reach converge at all. If too small, it will take much longer to converge. If the number of inputs is large, this becomes even more problematic.

Finally, gradient descent might never find the global minimum. The algorithm is not able to distinguish between a local and global minimum. As it descent in search of the local minimum, once it converges it will then stop. The local minimum will corner the algorithm in the local minimum own valley preventing the step to be large enough to exit.

5. CONCLUSIONS