Machine learning: Optimizing loss functions II.

Loss functions are used to train neural networks, but really they arise throughout machine learning. The last post talked about the MSE loss function for linear regression.

The basic idea: gradient descent

As mentioned in the previous post, loss functions in a neural network are generally not convex, so simply setting the gradient equal to zero does not lead to a solution that minimizes the function (it could lead to a local minimum, a maximum, or even a saddle point). To get around this problem, we initialize the parameters of the model randomly at some weight vector $\vec\theta_0$ and improve them iteratively. At each step we compute the gradient of the loss function $L(\vec\theta)$ at $\vec\theta=\vec\theta_i$, which gives the direction of greatest increase in the loss given the weights $\vec\theta_i$. Then we set

\[\vec\theta_{i+1} = \vec\theta_i - \alpha\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i},\]

where $\alpha$, the learning rate, is a constant that scales the step taken along the gradient. The step size also tracks the size of the gradient itself: large gradients give large steps (we are presumably far from a minimum), while small gradients give small steps (we are close to one). Since we step in the direction opposite to the steepest ascent, the loss decreases for a small enough step, and repeating the process drives us toward a minimum. Choosing the learning rate is a trade-off between overshooting the minimum with steps that are too large (when $\alpha$ is too large) and the algorithm taking too long to converge (when $\alpha$ is too small). It takes experimentation to find the right value for $\alpha$, but typically it’s between $10^{-6}$ and $10^{-2}$. The algorithm stops when the change in the $\vec\theta_i$s from one iteration to the next falls below some predetermined threshold, and the optimized weight vector is the final $\vec\theta_{i+1}$ computed.
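As a concrete illustration, here is a minimal NumPy sketch of this loop for the MSE loss of a linear regression (tying back to the last post). The function names, the stopping threshold `tol`, and the iteration cap `max_iters` are choices I made for the example, not part of any library.

```python
import numpy as np

def mse_gradient(theta, X, y):
    """Gradient of the MSE loss (1/n) * ||X @ theta - y||^2 with respect to theta."""
    n = len(y)
    return (2.0 / n) * X.T @ (X @ theta - y)

def gradient_descent(X, y, alpha=1e-2, tol=1e-6, max_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=X.shape[1])       # random initialization of the weight vector
    for _ in range(max_iters):
        theta_next = theta - alpha * mse_gradient(theta, X, y)  # step against the gradient
        if np.linalg.norm(theta_next - theta) < tol:            # stop when updates are tiny
            return theta_next
        theta = theta_next
    return theta
```

For a design matrix `X` and target vector `y`, calling `gradient_descent(X, y)` returns the fitted weights.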

Computational costs and stochastic gradient descent (SGD)

Batch gradient descent computes the weight vector $\vec\theta_{i+1}$ at each iteration using the entire training data set. Some sources describe this process as computing a gradient for each individual observation and then averaging; this amounts to the same thing, because loss functions are typically written as sums in which each summand corresponds to one observation. But neural networks generally call for large data sets, so every iteration can be computationally expensive. The algorithm may also converge to a shallow local minimum, which may not be a good solution.

Stochastic gradient descent is an alternative that avoids these problems. At each iteration, $\vec\theta_{i+1}$ is computed using a single randomly chosen observation from the training data. A full pass through all of the samples is called an epoch. Since each computation involves only one data point, an individual update is far cheaper than in batch gradient descent. However, the gradient varies more from step to step than in batch gradient descent. This is what helps the algorithm escape shallow local minima, but as a result the loss will not necessarily decrease monotonically with each iteration. A compromise between the two techniques is mini-batch gradient descent, which uses a predetermined number of samples per batch rather than the entire training set.
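Here is a minimal sketch of the mini-batch variant. The signature is my own: `grad_fn(theta, Xb, yb)` stands for any function that returns the gradient of the loss on a mini-batch, for example the `mse_gradient` helper from the sketch above.

```python
import numpy as np

def minibatch_sgd(grad_fn, theta0, X, y, alpha=1e-2, batch_size=32, epochs=100, seed=0):
    """grad_fn(theta, Xb, yb) returns the gradient of the loss on the mini-batch (Xb, yb)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for _ in range(epochs):                       # one epoch = one full pass through the data
        perm = rng.permutation(n)                 # visit the samples in a fresh random order
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # the current mini-batch
            theta = theta - alpha * grad_fn(theta, X[idx], y[idx])
    return theta
```

Setting `batch_size=1` recovers plain stochastic gradient descent, while `batch_size=len(y)` recovers batch gradient descent.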

Other optimizers

A full list of optimizers for use with PyTorch is given here. I am just going to describe the most common ones, and you can read more about them here.
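For reference, switching between these optimizers in PyTorch is a one-line change; the tiny model and the learning rates below are just placeholders for illustration.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# Each optimizer plugs into the usual training loop:
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```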

Adagrad

AdaGrad is short for adaptive gradient, and the idea is to adapt the effective learning rate at each step. The formula for $\vec\theta_{i+1}$ is modified with a sum of squared gradients factor $G_i$:

\[\vec\theta_{i+1} = \vec\theta_i - \left(\frac{\alpha}{\sqrt{G_i+\epsilon}}\right)\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i}\]

The sum of squared gradients factor is initialized at $G_0=0$ and updated as

\[G_i = G_{i-1} + \left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i}\right)^2,\quad i\geq 1\]

and $\epsilon$ is set to a small number like $10^{-8}$ to avoid division by $0$. The effect is that while the gradients have been small, $G_i$ grows slowly and the effective learning rate stays large, speeding up convergence, whereas large gradients inflate $G_i$ and shrink the effective learning rate to keep from overshooting the minimum. The disadvantage is that over a large number of iterations the learning rate factor becomes tiny (because $G_i$ never decreases), which slows convergence.
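A minimal sketch of the update, using the same hypothetical `grad_fn(theta)` convention as before (any function returning the gradient of the loss at `theta`). Note that the squaring, square root, and division are elementwise, so each weight effectively gets its own learning rate.

```python
import numpy as np

def adagrad(grad_fn, theta0, alpha=0.1, eps=1e-8, n_steps=1000):
    """grad_fn(theta) returns the gradient of the loss at theta."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                  # running sum of squared gradients, G_0 = 0
    for _ in range(n_steps):
        grad = grad_fn(theta)
        G += grad ** 2                        # G_i = G_{i-1} + grad**2 (elementwise)
        theta = theta - (alpha / np.sqrt(G + eps)) * grad
    return theta
```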

RMSprop

RMSProp, short for root mean square propagation, seeks to address the slow convergence of AdaGrad. A decay factor $\gamma\in(0,1)$ is introduced so that the contributions of older gradients to the accumulated squared gradients decay exponentially rather than piling up. In the formula for $\vec\theta_{i+1}$, the sum of squared gradients factor from AdaGrad is replaced with a moving average of squared gradients factor, $E_i$ (often denoted $E[g^2]_i$):

\[\vec\theta_{i+1} = \vec\theta_i - \left(\frac{\alpha}{\sqrt{E_i+\epsilon}}\right)\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i},\]

where $E_0=0$ and

\[E_i = \gamma E_{i-1} + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i}\right)^2,\quad i\geq 1.\]

As an example of how this controls the size of the effective learning rate, expanding terms for $E_3$ we get

\[\begin{align*} E_3 &= \gamma E_2 + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_3}\right)^2 \\ &= \gamma\left(\gamma E_1 + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_2}\right)^2\right) + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_3}\right)^2 \\ &= \gamma\left(\gamma\left(\gamma E_0 + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_1}\right)^2\right) + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_2}\right)^2\right) + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_3}\right)^2 \\ &= \gamma^2(1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_1}\right)^2 + \gamma(1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_2}\right)^2 + (1-\gamma)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_3}\right)^2. \end{align*}\]

A high value of $\gamma$, say $\gamma=0.9$, keeps each individual contribution small (every term carries a factor of $1-\gamma$), and since the weights $\gamma^k(1-\gamma)$ sum to less than $1$, $E_i$ is a weighted average of squared gradients rather than a growing sum. Overall, $E_i$ cannot grow the way $G_i$ does, so the effective learning rate does not shrink toward zero as it can in AdaGrad.
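The corresponding sketch changes only one line relative to the AdaGrad sketch: the running sum becomes the exponentially weighted moving average $E_i$ (same hypothetical `grad_fn` convention).

```python
import numpy as np

def rmsprop(grad_fn, theta0, alpha=1e-3, gamma=0.9, eps=1e-8, n_steps=1000):
    """grad_fn(theta) returns the gradient of the loss at theta."""
    theta = np.asarray(theta0, dtype=float)
    E = np.zeros_like(theta)                     # moving average of squared gradients, E_0 = 0
    for _ in range(n_steps):
        grad = grad_fn(theta)
        E = gamma * E + (1 - gamma) * grad ** 2  # older contributions decay by a factor of gamma
        theta = theta - (alpha / np.sqrt(E + eps)) * grad
    return theta
```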

Adam

Adam, short for adaptive moment estimation, combines the advantages of RMSProp and another method known as SGD with momentum. SGD with momentum is not discussed in detail here, but the idea is that in the formula for $\vec\theta_{i+1}$ the raw gradient is replaced with a running weighted average of the current and past gradients (the momentum term). This average places more weight on recent gradients, and because it smooths over past steps the direction of descent doesn’t vary as much as it does in plain SGD.
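A minimal sketch of the momentum idea, in the same style as the other sketches; `beta` is the momentum coefficient, and this particular formulation (an exponentially weighted average of gradients) is one common variant, not the only one.

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, alpha=1e-2, beta=0.9, n_steps=1000):
    """grad_fn(theta) returns the gradient of the loss at theta (or on a mini-batch)."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                         # running weighted average of gradients
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad_fn(theta)   # recent gradients carry the most weight
        theta = theta - alpha * v                    # step along the smoothed direction
    return theta
```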

Here is the formula for Adam:

\[\vec\theta_{i+1} = \vec\theta_i - \left(\frac{\alpha}{\sqrt{\hat v_i+\epsilon}}\right)\hat m_i;\]

the factors $\hat m_i$ and $\hat v_i$ are bias-corrected moving averages of the gradients and the squared gradients, respectively. We have:

\[\begin{align*} m_0 &= v_0=0 \\ m_i &= \beta_1m_{i-1} + (1-\beta_1)\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i},\quad i\geq 1,\quad& \hat m_i &= \frac{m_i}{1-\beta_1^i} \\ v_i &= \beta_2v_{i-1} + (1-\beta_2)\left(\nabla L(\vec\theta)\vert_{\vec\theta=\vec\theta_i}\right)^2,\quad i\geq 1,\quad & \hat v_i &= \frac{v_i}{1-\beta_2^i}. \end{align*}\]

The decay rates $\beta_1,\ \beta_2$ are between $0$ and $1$; typically $\beta_1=0.9$ and $\beta_2=0.999$. Because both are close to $1$, the averages $m_i$ and $v_i$ start out biased toward $0$ for small $i$, which is why the “bias-corrected” factors $\hat m_i$ and $\hat v_i$ (the divisions by $1-\beta_1^i$ and $1-\beta_2^i$) are used in the update.
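Putting the pieces together, here is a minimal Adam sketch under the same assumptions as the earlier sketches (hypothetical `grad_fn`, parameter names my own):

```python
import numpy as np

def adam(grad_fn, theta0, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """grad_fn(theta) returns the gradient of the loss at theta (or on a mini-batch)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)              # moving average of gradients (momentum term)
    v = np.zeros_like(theta)              # moving average of squared gradients (RMSProp term)
    for i in range(1, n_steps + 1):
        grad = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** i)      # bias correction: m is pulled toward 0 for small i
        v_hat = v / (1 - beta2 ** i)
        theta = theta - (alpha / np.sqrt(v_hat + eps)) * m_hat
    return theta
```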

Any of the optimizers described above can be a reasonable choice, depending on how you weigh simplicity against computational cost, but because of its performance and fast convergence Adam has become the most popular. It’s what I used in my neural network for the Kaggle competition.

Written on February 3, 2025