In the previous tutorials, we decided which direction to move each parameter and and how much to move each parameter by taking the gradient of the loss with respect to each parameter. We also scaled each gradient by some learning rate, although we never really explained where this number comes from. We then updated the parameters by performing a gradient step $$\theta_{t+1} \gets \eta \nabla_{\theta}\mathcal{L}_t$$. Each update is called a gradient step and the process is called gradient descent.

The hope is that if we just take a whole lot of gradient steps, we’ll wind up with an awesome model that gets very low loss on our training data, and that this performance might generalize to our hold-out data. But as a sharp reader, you might have any number of doubts. You might wonder, for instance:

• Why does gradient descent work?
• Why doesn’t the gradient descent algorithm get stuck on the way to a low loss?
• How should we choose a learning rate?
• Do all the parameters need to share the same learning rate?
• Is there anything we can do to speed up the process?
• Why does the solution of gradient descent over training data generalize well to test data?

Some answers to these questions are known. For other questions, we have some answers but only for simple models like logistic regression that are easy to analyze. And for some of these questions, we know of best practices that seem to work even they’re not supported by any conclusive mathematical analysis. Optimization is a rich area of ongoing research. In this chapter, we’ll address the parts that are most relevant for training neural networks. To begin, let’s take a more formal look at gradient descent.

## Gradient descent in one dimension¶

To get going, consider a simple scenario in which we have one parameter to manipulate. Let’s also assume that our objective associates every value of this parameter with a value. Formally, we can say that this objective function has the signature $$f: \mathbb{R} \rightarrow \mathbb{R}$$. It maps from one real number to another.

Note that the domain of $$f$$ is in one-dimensional. According to its Taylor series expansion as shown in the introduction chapter, we have

$f(x + \epsilon) \approx f(x) + f'(x) \epsilon.$

Substituting $$\epsilon$$ with $$-\eta f'(x)$$ where $$\eta$$ is a constant, we have

$f(x - \eta f'(x)) \approx f(x) - \eta f'(x)^2.$

If $$\eta$$ is set as a small positive value, we obtain

$f(x - \eta f'(x)) \leq f(x).$

In other words, updating $$x$$ as

$x := x - \eta f'(x)$

may reduce the value of $$f(x)$$ if its current derivative value $$f'(x) \neq 0$$. Since the derivative $$f'(x)$$ is a special case of gradient in one-dimensional domain, the above update of $$x$$ is gradient descent in one-dimensional domain.

The positive scalar $$\eta$$ is called the learning rate or step size. Note that a larger learning rate increases the chance of overshooting the global minimum and oscillating. However, if the learning rate is too small, the convergence can be very slow. In practice, a proper learning rate is usually selected with experiments.

## Gradient descent over multi-dimensional parameters¶

Consider the objective function $$f: \mathbb{R}^d \rightarrow \mathbb{R}$$ that takes any multi-dimensional vector $$\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$$ as its input. The gradient of $$f(\mathbf{x})$$ with respect to $$\mathbf{x}$$ is defined by the vector of partial derivatives:

$\nabla_\mathbf{x} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top.$

To keep our notation compact we may use the notation $$\nabla f(\mathbf{x})$$ and $$\nabla_\mathbf{x} f(\mathbf{x})$$ interchangeably when there is no ambiguity about which parameters we are optimizing over. In plain English, each element $$\partial f(\mathbf{x})/\partial x_i$$ of the gradient indicates the rate of change for $$f$$ at the point $$\mathbf{x}$$ with respect to the input $$x_i$$ only. To measure the rate of change of $$f$$ in any direction that is represented by a unit vector $$\mathbf{u}$$, in multivariate calculus, we define the directional derivative of $$f$$ at $$\mathbf{x}$$ in the direction of $$\mathbf{u}$$ as

$D_\mathbf{u} f(\mathbf{x}) = \lim_{h \rightarrow 0} \frac{f(\mathbf{x} + h \mathbf{u}) - f(\mathbf{x})}{h},$

which can be rewritten according to the chain rule as

$D_\mathbf{u} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}.$

Since $$D_\mathbf{u} f(\mathbf{x})$$ gives the rates of change of $$f$$ at the point $$\mathbf{x}$$ in all possible directions, to minimize $$f$$, we are interested in finding the direction where $$f$$ can be reduced fastest. Thus, we can minimize the directional derivative $$D_\mathbf{u} f(\mathbf{x})$$ with respect to $$\mathbf{u}$$. Since $$D_\mathbf{u} f(\mathbf{x}) = \|\nabla f(\mathbf{x})\| \cdot \|\mathbf{u}\| \cdot \text{cos} (\theta) = \|\nabla f(\mathbf{x})\| \cdot \text{cos} (\theta)$$, where $$\theta$$ is the angle between $$\nabla f(\mathbf{x})$$ and $$\mathbf{u}$$, the minimum value of $$\text{cos}(\theta)$$ is -1 when $$\theta = \pi$$. Therefore, $$D_\mathbf{u} f(\mathbf{x})$$ is minimized when $$\mathbf{u}$$ is at the opposite direction of the gradient $$\nabla f(\mathbf{x})$$. Now we can iteratively reduce the value of $$f$$ with the following gradient descent update:

$\mathbf{x} := \mathbf{x} - \eta \nabla f(\mathbf{x}),$

where the positive scalar $$\eta$$ is called the learning rate or step size.

However, the gradient descent algorithm may be infeasible when the training data size is huge. Thus, a stochastic version of the algorithm is often used instead.

To motivate the use of stochastic optimization algorithms, note that when training deep learning models, we often consider the objective function as a sum of a finite number of functions:

$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}),$

where $$f_i(\mathbf{x})$$ is a loss function based on the training data instance indexed by $$i$$. It is important to highlight that the per-iteration computational cost in gradient descent scales linearly with the training data set size $$n$$. Hence, when $$n$$ is huge, the per-iteration computational cost of gradient descent is very high.

In view of this, stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the gradient $$\nabla f(\mathbf{x})$$, stochastic gradient descent randomly samples $$i$$ at uniform and computes $$\nabla f_i(\mathbf{x})$$ instead. The insight is, stochastic gradient descent uses $$\nabla f_i(\mathbf{x})$$ as an unbiased estimator of $$\nabla f(\mathbf{x})$$ since

$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$

In a generalized case, at each iteration a mini-batch $$\mathcal{B}$$ that consists of indices for training data instances may be sampled at uniform with replacement. Similarly, we can use

$\nabla f_\mathcal{B}(\mathbf{x}) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}}\nabla f_i(\mathbf{x})$

to update $$\mathbf{x}$$ as

$\mathbf{x} := \mathbf{x} - \eta \nabla f_\mathcal{B}(\mathbf{x}),$

where $$|\mathcal{B}|$$ denotes the cardinality of the mini-batch and the positive scalar $$\eta$$ is the learning rate or step size. Likewise, the mini-batch stochastic gradient $$\nabla f_\mathcal{B}(\mathbf{x})$$ is an unbiased estimator for the gradient $$\nabla f(\mathbf{x})$$:

$\mathbb{E}_\mathcal{B} \nabla f_\mathcal{B}(\mathbf{x}) = \nabla f(\mathbf{x}).$

This generalized stochastic algorithm is also called mini-batch stochastic gradient descent and we simply refer to them as stochastic gradient descent (as generalized). The per-iteration computational cost is $$\mathcal{O}(|\mathcal{B}|)$$. Thus, when the mini-batch size is small, the computational cost at each iteration is light.

There are other practical reasons that may make stochastic gradient descent more appealing than gradient descent. If the training data set has many redundant data instances, stochastic gradients may be so close to the true gradient $$\nabla f(\mathbf{x})$$ that a small number of iterations will find useful solutions to the optimization problem. In fact, when the training data set is large enough, stochastic gradient descent only requires a small number of iterations to find useful solutions such that the total computational cost is lower than that of gradient descent even for just one iteration. Besides, stochastic gradient descent can be considered as offering a regularization effect especially when the mini-batch size is small due to the randomness and noise in the mini-batch sampling. Moreover, certain hardware processes mini-batches of specific sizes more efficiently.

## Experiments¶

For demonstrating the aforementioned gradient-based optimization algorithms, we use the regression problem in the linear regression chapter as a case study.

First, We import related libraries, generate the synthetic data, and construct the model.

In [1]:

from __future__ import print_function
import mxnet as mx
from mxnet import gluon
import numpy as np

X = np.random.randn(10000, 2)
Y = 2 * X[:,0] - 3.4 * X[:,1] + 4.2 + .01 * np.random.normal(size=10000)

ctx = mx.cpu()
net = gluon.nn.Sequential()
net.collect_params().initialize()
loss = gluon.loss.L2Loss()


Then we specify the batch sizes and learning rates for stochastic gradient descent algorithms. Since the number of samples is 10,000, when the batch size is 10,000, the algorithm is essentially gradient descent.

In [2]:

batch_sizes = [1, 10, 100, 1000, 10000]
learning_rates = [0.1, 0.1, 0.5, 0.5, 1.0]
epochs = 3


Now we are ready to train the models and observe the inferred parameter values after the model training.

In [3]:

for batch_size, learning_rate in zip(batch_sizes, learning_rates):
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': learning_rate})
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24),
ctx=ctx, force_reinit=True)
train_data = mx.io.NDArrayIter(X, Y, batch_size=batch_size, shuffle=True)

for e in range(epochs):
train_data.reset()
for i, batch in enumerate(train_data):
data = batch.data[0].as_in_context(ctx)
label = batch.label[0].as_in_context(ctx).reshape((-1, 1))
output = net(data)
mse = loss(output, label)
mse.backward()
trainer.step(data.shape[0])

for para_name, para_value in net.collect_params().items():
print("Batch size:", batch_size, para_name,
para_value.data().asnumpy()[0])

Batch size: 1 dense0_weight [ 2.00191855 -3.40352154]
Batch size: 1 dense0_bias 4.19731
Batch size: 10 dense0_weight [ 1.99962676 -3.40070438]
Batch size: 10 dense0_bias 4.20087
Batch size: 100 dense0_weight [ 1.99993229 -3.40007687]
Batch size: 100 dense0_bias 4.20093
Batch size: 1000 dense0_weight [ 2.0000422  -3.40001583]
Batch size: 1000 dense0_bias 4.20009
Batch size: 10000 dense0_weight [ 2.00000167 -3.40008521]
Batch size: 10000 dense0_bias 4.20016


As expected, all the investigated algorithms find the weight vector to be close to [2, -3.4] and the bias term to be close to 4.2 as specified in the synthetic data generation.

Although the above demonstration uses a fixed learning rate for stochastic gradient descent, in practice a decaying learning rate is often needed. This is because the noise in random sampling does not vanish throughout the iterations. For gradient descent and for some objective functions, the true gradient tends to get smaller, approaching the zero vector. In these cases it may be ok to use a fixed learning rate because the steps will naturally get smaller as the gradient gets smaller.