Stochastic gradient descent with momentum

As discussed in the previous chapter, at each iteration stochastic gradient descent (SGD) finds the direction where the objective function can be reduced fastest on a given example. Thus, gradient descent is also known as the method of steepest descent. Essentially, SGD is a myopic algorithm. It doesn’t look very far into the past and it doesn’t think much about the future. At each step, SGD just does whatever looks right just at that moment.

You might wonder, can do something smarter? It turns out that we can. One class of methods use an idea called momentum. The idea of momentum-based optimizers is to remember the previous gradients from recent optimzation steps and to use them to help to do a better job of choosing the direction to move next. In this chapter we’ll motivate and explain SGD with momentum.

Motivating example

In order to motivate the method, let’s start by visualizing a simple quadratic objective function \(f: \mathbb{R}^2 \rightarrow \mathbb{R}\) taking a two-dimensional vector \(\mathbf{x} = [x_1, x_2]^\top\) as the input. In the following figure, each contour lines indicates points of equivalent value \(f(\mathbf{x})\). The objective function is mimized in the center and the outer rings have progressively worse values.

The red triangle indicates the starting point for our stochastic gradient descent optimizer. The lines and arrows that follow indicate each step of SGD. You might wonder why the lines don’t just point directly towards the center. That’s because the gradient estimates in SGD are noisy, due to the small sample size. So the gradient steps are noisy even if they are correct on average (unbiased). As you can see, SGD wastes too much time swinging back and forth along the direction in parallel with the \(x_2\)-axis while advancing too slowly along the direction of the \(x_1\)-axis.

Curvature and Hessian matrix

Even if we just did plain old gradient descent, we’d expect our function to bounce around quite a lot. That’s because our gradient is changing as we move around in parameter space due to the curvature of the function.

We can reason about the curvature of objective function by considering their second derivative. The second derivative says how much the gradient changes as we move in parameter space. In one dimension, a second derivative of a function indicates how fast the first derivative changes when the input changes. Thus, it is often considered as a measure of the curvature of a function. It is the rate of change of the rate of change. If you’ve never done calculus before, that might sound rather meta, but you’ll get over it.

Consider the objective function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) that takes a multi-dimensional vector \(\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top\) as the input. Its Hessian matrix \(\mathbf{H} \in \mathbb{R}^{d \times d}\) collects its second derivatives. Each entry \((i, j)\) says how much the gradient of the objective with respect to parameter \(i\) changes, with a small change in parameter \(j\).

\[\mathbf{H}_{i,j} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}\]

for all \(i, j = 1, \ldots, d\). Since \(\mathbf{H}\) is a real symmetric matrix, by spectral theorem, it is orthogonally diagonalizable as

\[\mathbf{S}^\top \mathbf{H} \mathbf{S} = \mathbf{\Lambda},\]

where \(\mathbf{S}\) is an orthornomal eigenbasis composed of eigenvectors of \(\mathbf{H}\) with corresponding eigenvalues in a diagonal matrix \(\mathbf{\Lambda}\): the eigenvalue \(\mathbf{\Lambda}_{i, i}\) corresponds to the eigenvector in the \(i^{\text{th}}\) column of \(\mathbf{S}\). The second derivative (curvature) of the objective function \(f\) in any direction \(\mathbf{d}\) (unit vector) is a quadractic form \(\mathbf{d}^\top \mathbf{H} \mathbf{d}\). Specifically, if the direction \(\mathbf{d}\) is an eigenvector of \(\mathbf{H}\), the curvature of \(f\) in that direction is equal to the corresponding eigenvalue of \(\mathbf{d}\). Since the curvature of the objective function in any direction is a weighted average of all the eigenvalues of the Hesssian matrix, the curvature is bounded by the minimum and maximum eigenvalues of the Hesssian matrix \(\mathbf{H}\). The ratio of the maximum to the minimum eigenvalue is the condition number of the Hessian matrix \(\mathbf{H}\).

Gradient descent in ill-conditioned problems

How does the condition number of the Hessian matrix of the objective function affect the performance of gradient descent? Let us revisit the problem in the motivating example.

Recall that gradient descent is a greedy approach that selects the steepest gradient at the current point as the direction of advancement. At the starting point, the search by gradient descent advances more aggressively in the direction of the \(x_2\)-axis than that of the \(x_1\)-axis.

In the plotted problem of the motivating example, the curvature in the direction of the \(x_2\)-axis is much larger than that of the \(x_1\)-axis. Thus, gradient descent tends to overshoot the bottom of the function that is projected to the plane in parallel with the \(x_2\)-axis. At the next iteration, if the gradient along the direction in parallel with the \(x_2\)-axis remains larger, the search continues to advance more aggresively along the direction in parallel with the \(x_2\)-axis and the overshooting continues to take place. As a result, gradient descent wastes too much time swinging back and forth in parallel with the \(x_2\)-axis due to overshooting while the advancement in the direction of the \(x_1\)-axis is too slow.

To generalize, the problem in the motivating example is an ill-conditioned problem. In an ill-conditioned problem, the condition number of the Hessian matrix of the objective function is large. In other words, the ratio of the largest curvature to the smallest is high.

The momentum algorithm

The aforementioned ill-conditioned problems are challenging for gradient descent. By treating gradient descent as a special form of stochastic gradient descent, we can address the challenge with the following momentum algorithm for stochastic gradient descent.

\[\begin{split}\begin{align*} \mathbf{v} &:= \gamma \mathbf{v} + \eta \nabla f_\mathcal{B}(\mathbf{x}),\\ \mathbf{x} &:= \mathbf{x} - \mathbf{v}, \end{align*}\end{split}\]

where \(\mathbf{v}\) is the current velocity and \(\gamma\) is the momentum parameter. The learning rate \(\eta\) and the stochastic gradient \(\nabla f_\mathcal{B}(\mathbf{x})\) with respect to the sampled mini-batch \(\mathcal{B}\) are both defined in the previous chapter.

It is important to highlight that, the scale of advancement at each iteration now also depends on how aligned the directions of the past gradients are. This scale is the largest when all the past gradients are perfectly aligned to the same direction.

To better understand the momentum parameter \(\gamma\), let us simplify the scenario by assuming the stochastic gradients \(\nabla f_\mathcal{B}(\mathbf{x})\) are the same as \(\mathbf{g}\) throughout the iterations. Since all the gradients are perfectly aligned to the same direction, the momentum algorithm accelerates the advancement along the same direction of \(\mathbf{g}\) as

\[\begin{split}\begin{align*} \mathbf{v}_1 &:= \eta\mathbf{g},\\ \mathbf{v}_2 &:= \gamma \mathbf{v}_1 + \eta\mathbf{g} = \eta\mathbf{g} (\gamma + 1),\\ \mathbf{v}_3 &:= \gamma \mathbf{v}_2 + \eta\mathbf{g} = \eta\mathbf{g} (\gamma^2 + \gamma + 1),\\ &\ldots\\ \mathbf{v}_\inf &:= \frac{\eta\mathbf{g}}{1 - \gamma}. \end{align*}\end{split}\]

Thus, if \(\gamma = 0.99\), the final velocity is 100 times faster than that of the corresponding gradient descent where the gradient is \(\mathbf{g}\).

Now with the momentum algorithm, a sample search path can be improved as illustrated in the following figure.


For demonstrating the momentum algorith,, we still use the regression problem in the linear regression chapter as a case study. Specifically, we investigate stochastic gradient descent with momentum.

As usual, we import related libraries, generate the synthetic data, and construct the model.

In [1]:
from __future__ import print_function
import mxnet as mx
from mxnet import autograd
from mxnet import gluon
import numpy as np

X = np.random.randn(10000, 2)
Y = 2 * X[:,0] - 3.4 * X[:,1] + 4.2 + .01 * np.random.normal(size=10000)

ctx = mx.cpu()
net = gluon.nn.Sequential()
loss = gluon.loss.L2Loss()

Then we specify the batch sizes and learning rates for stochastic gradient descent algorithms with momentum. Specifically, the momentum parameter is set to 0.5.

In [2]:
batch_sizes = [1, 10, 100, 1000]
learning_rates = [0.1, 0.1, 0.5, 0.5]
momentum_param = 0.5

epochs = 3

Now we are ready to train the models and observe the inferred parameter values after the model training.

In [3]:
for batch_size, learning_rate in zip(batch_sizes, learning_rates):
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': learning_rate,
                            'momentum': momentum_param})
                                    ctx=ctx, force_reinit=True)
    train_data =, Y, batch_size=batch_size, shuffle=True)

    for e in range(epochs):
        for i, batch in enumerate(train_data):
            data =[0].as_in_context(ctx)
            label = batch.label[0].as_in_context(ctx).reshape((-1, 1))
            with autograd.record():
                output = net(data)
                mse = loss(output, label)

    for para_name, para_value in net.collect_params().items():
        print("Batch size:", batch_size, para_name,
Batch size: 1 dense0_weight [ 2.00212836 -3.39304566]
Batch size: 1 dense0_bias 4.20125
Batch size: 10 dense0_weight [ 2.00067949 -3.40136981]
Batch size: 10 dense0_bias 4.20053
Batch size: 100 dense0_weight [ 2.00005388 -3.40086675]
Batch size: 100 dense0_bias 4.19988
Batch size: 1000 dense0_weight [ 2.00004411 -3.39969349]
Batch size: 1000 dense0_bias 4.20037

As expected, all the investigated algorithms find the weight vector to be close to [2, -3.4] and the bias term to be close to 4.2 as specified in the synthetic data generation.