Linear regression with gluon

Now that we’ve implemented a whole neural network from scratch, using nothing but mx.ndarray and mxnet.autograd, let’s see how we can make the same model while doing a lot less work.

Again, let’s import some packages, this time adding mxnet.gluon to the list of dependencies.

In [32]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd, gluon

Set the context

We’ll also want to set a context to tell gluon where to do most of the computation.

In [33]:
data_ctx = mx.cpu()
model_ctx = mx.cpu()

Build the dataset

Again we’ll look at the problem of linear regression and stick with the same synthetic data.

In [34]:
num_inputs = 2
num_outputs = 1
num_examples = 10000

def real_fn(X):
    return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2

X = nd.random_normal(shape=(num_examples, num_inputs))
noise = 0.01 * nd.random_normal(shape=(num_examples,))
y = real_fn(X) + noise
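
Before training anything, it can help to know what answer to expect. The NumPy sketch below is not part of the original notebook: it draws data the same way (the seed and the names X_np, y_np and coef are illustrative assumptions) and solves ordinary least squares in closed form, which gives a reference point for the parameters that gradient descent should approach.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
num_examples, num_inputs = 10000, 2

def real_fn(X):
    return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2

X_np = rng.normal(size=(num_examples, num_inputs))
noise = 0.01 * rng.normal(size=num_examples)
y_np = real_fn(X_np) + noise

# Append a column of ones so the bias is estimated alongside the weights
A = np.hstack([X_np, np.ones((num_examples, 1))])
coef, *_ = np.linalg.lstsq(A, y_np, rcond=None)
print(coef)  # two weights followed by the bias, close to 2, -3.4 and 4.2
```

Because the noise is small, the closed-form estimate should land very close to the generating parameters.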

Load the data iterator

We’ll stick with the DataLoader for handling our data batching.

In [35]:
batch_size = 4
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
                                      batch_size=batch_size, shuffle=True)
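
To make the batching concrete, here is a rough NumPy sketch of what a shuffled mini-batch iterator does. The function iterate_minibatches is a hypothetical stand-in written for illustration; it is not how DataLoader is implemented, and it assumes that a final, smaller batch is kept rather than dropped.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, shuffle=True, seed=0):
    """Yield (data, label) mini-batches; an illustrative stand-in for DataLoader."""
    indices = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4))
print([data.shape for data, _ in batches])  # [(4, 2), (4, 2), (2, 2)]
```

With 10,000 examples and a batch size of 4, as in this notebook, every batch is full and one epoch consists of 2,500 batches.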

Define the model

When we implemented things from scratch, we had to individually allocate parameters and then compose them together as a model. While it’s good to know how to do things from scratch, with gluon, we can just compose a network from predefined layers. For a linear model, the appropriate layer is called Dense. It’s called a dense layer because every node in the input is connected to every node in the subsequent layer. That description seems excessive because we only have one (non-input) layer here, and that layer only contains one node! But in subsequent chapters we’ll typically work with networks that have multiple outputs, so we might as well start thinking in terms of layers of nodes. Because a linear model consists of just a single Dense layer, we can instantiate it with one line.

As in the previous notebook, we have an input dimension of 2 and an output dimension of 1. The most direct way to instantiate a Dense layer with these dimensions is to specify the number of inputs and the number of outputs.

In [36]:
net = gluon.nn.Dense(1, in_units=2)

That’s it! We’ve already got a neural network. Like our hand-crafted model in the previous notebook, this model has a weight matrix and bias vector.

In [37]:
print(net.weight)
print(net.bias)
Parameter dense4_weight (shape=(1, 2), dtype=None)
Parameter dense4_bias (shape=(1,), dtype=None)

Here, net.weight and net.bias are not actually NDArrays. They are instances of the Parameter class. We use Parameter instead of directly accessing NDArrays for several reasons. For example, Parameters provide convenient abstractions for initializing values. Unlike NDArrays, Parameters can be associated with multiple contexts simultaneously. This will come in handy in future chapters when we start thinking about distributed learning across multiple GPUs.

In gluon, all neural networks are made out of Blocks (gluon.Block). Blocks are just units that take inputs and generate outputs. Blocks also contain parameters that we can update. Here, our network consists of only one layer, so it’s convenient to access our parameters directly. When our networks consist of tens of layers, this won’t be so fun. No matter how complex our network, we can grab all its parameters by calling collect_params() as follows:

In [38]:
net.collect_params()
Out[38]:
dense4_ (
  Parameter dense4_weight (shape=(1, 2), dtype=None)
  Parameter dense4_bias (shape=(1,), dtype=None)
)

The returned object is a gluon.parameter.ParameterDict. This is a convenient abstraction for retrieving and manipulating groups of Parameter objects. Most often, we’ll want to retrieve all of the parameters in a neural network:

In [39]:
type(net.collect_params())
Out[39]:
mxnet.gluon.parameter.ParameterDict

Initialize parameters

Once we initialize our Parameters, we can access their underlying data and context(s), and we can also feed data through the neural network to generate output. However, we can’t get going just yet. If we try invoking our model by calling net(nd.array([[0,1]])), we’ll confront the following hideous error message:

RuntimeError: Parameter dense1_weight has not been initialized...

That’s because we haven’t yet told gluon what the initial values for our parameters should be! We initialize parameters by calling the .initialize() method of a ParameterDict. We’ll need to pass in two arguments.

  • An initializer, many of which live in the mx.init module.
  • A context where the parameters should live. In this case we’ll pass in the model_ctx. Most often this will either be a GPU or a list of GPUs.

MXNet provides a variety of common initializers in mxnet.init. To keep things consistent with the model we built by hand, we’ll initialize each parameter by sampling from a standard normal distribution, using mx.init.Normal(sigma=1.).

In [40]:
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)

Deferred Initialization

When we call initialize, gluon associates each parameter with an initializer. However, the actual initialization is deferred until we make a first forward pass. In other words, the parameters are only initialized when they’re needed. If we try to call net.weight.data() we’ll get the following error:

DeferredInitializationError: Parameter dense2_weight has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters.

Passing data through a gluon model is easy. We just sample a batch of the appropriate shape and call net just as if it were a function. This will invoke net’s forward() method.

In [41]:
example_data = nd.array([[4,7]])
net(example_data)
Out[41]:

[[-1.33219385]]
<NDArray 1x1 @cpu(0)>

Now that net is initialized, we can access each of its parameters.

In [42]:
print(net.weight.data())
print(net.bias.data())

[[-0.25217363 -0.04621419]]
<NDArray 1x2 @cpu(0)>

[ 0.]
<NDArray 1 @cpu(0)>
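
As a check, the output of In [41] can be reproduced by hand from these values. The NumPy sketch below assumes that a Dense layer computes the affine map X Wᵀ + b, which is consistent with the (1, 2) weight shape printed earlier. The helper name dense_forward is illustrative and not part of gluon.

```python
import numpy as np

# Parameter values copied from the printout above
W = np.array([[-0.25217363, -0.04621419]])  # shape (1, 2): (outputs, inputs)
b = np.array([0.0])                          # shape (1,)

def dense_forward(X, W, b):
    # Affine map: each row of X produces one row of outputs
    return X @ W.T + b

example_data = np.array([[4.0, 7.0]])
out = dense_forward(example_data, W, b)
print(out)  # approximately [[-1.33219385]], matching Out[41]
```

Spelled out, the single output is 4 × (-0.25217363) + 7 × (-0.04621419) + 0, which is about -1.33219385.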

Shape inference

Recall that previously, we instantiated our network with gluon.nn.Dense(1, in_units=2). One slick feature that we can take advantage of in gluon is shape inference on parameters. Because our parameters never come into action until we pass data through the network, we don’t actually have to declare the input dimension (in_units). Let’s try this again, but letting gluon do more of the work:

In [43]:
net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)

We’ll elaborate on this and more of gluon’s internal workings in subsequent chapters.
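
To build some intuition in the meantime, the toy class below mimics the idea in plain NumPy. It is a hypothetical illustration, not gluon’s implementation: LazyDense and its attributes are invented names, and the only point is that the weight shape can be read off the first batch of data.

```python
import numpy as np

class LazyDense:
    """Toy layer that allocates its parameters on the first call (not gluon code)."""

    def __init__(self, units, sigma=1.0, seed=0):
        self.units = units
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.weight = None  # unknown until the input width has been seen
        self.bias = None

    def __call__(self, X):
        if self.weight is None:
            in_units = X.shape[1]  # infer the input dimension from the data
            self.weight = self.rng.normal(0.0, self.sigma,
                                          size=(self.units, in_units))
            self.bias = np.zeros(self.units)
        return X @ self.weight.T + self.bias

layer = LazyDense(1)
print(layer.weight)                  # None: nothing has been allocated yet
out = layer(np.array([[4.0, 7.0]]))  # first forward pass fixes the shapes
print(layer.weight.shape)            # (1, 2)
```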

Define loss

Instead of writing our own loss function, we’re just going to access squared error by instantiating gluon.loss.L2Loss. Just like layers and whole networks, a loss in gluon is just a Block.

In [44]:
square_loss = gluon.loss.L2Loss()
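
For reference, here is a NumPy sketch of the quantity this loss computes. It assumes, based on our reading of the gluon documentation, that L2Loss returns half the squared difference for each example and that labels are reshaped to match the predictions. The function l2_loss is an illustrative name, not a gluon API.

```python
import numpy as np

def l2_loss(pred, label):
    # Reshape labels to the prediction shape so that (n, 1) and (n,)
    # do not silently broadcast to an (n, n) matrix
    label = label.reshape(pred.shape)
    per_element = 0.5 * (pred - label) ** 2
    # One loss value per example: average over every axis except the batch axis
    return per_element.reshape(len(pred), -1).mean(axis=1)

pred = np.array([[1.0], [3.0]])   # network output, shape (2, 1)
label = np.array([0.0, 5.0])      # ground truth, shape (2,)
print(l2_loss(pred, label))       # [0.5 2. ]
```

The factor of one half is a convention: it cancels the 2 that appears when differentiating the square.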

Optimizer

Instead of writing stochastic gradient descent from scratch every time, we can instantiate a gluon.Trainer, passing it a dictionary of parameters. Note that the sgd optimizer in gluon also supports momentum and gradient clipping (both can be switched off if needed), since these modifications often make it converge considerably better. We will discuss this later when we go over a range of optimization algorithms in detail.

In [45]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.0001})
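
The update that the trainer applies can be sketched in a few lines of NumPy. This is a simplified illustration under two assumptions: plain SGD with no momentum or clipping, and gradients that are summed over the batch and then divided by the batch size passed to the step. The function sgd_step is a hypothetical helper, not the Trainer implementation.

```python
import numpy as np

def sgd_step(params, grads, learning_rate, batch_size):
    # Gradients are assumed to be summed over the batch,
    # so dividing by batch_size turns them into an average
    for p, g in zip(params, grads):
        p -= learning_rate * g / batch_size  # in-place update of each array

w = np.array([[1.0, -1.0]])
b = np.array([0.0])
grads = [np.array([[4.0, 8.0]]), np.array([2.0])]
sgd_step([w, b], grads, learning_rate=0.1, batch_size=4)
print(w, b)
```

Here the weights move from (1, -1) to (0.9, -1.2) and the bias from 0 to -0.05.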

Execute training loop

You might have noticed that it was a bit more concise to express our model in gluon. For example, we didn’t have to individually allocate parameters, define our loss function, or implement stochastic gradient descent. The benefits of relying on gluon’s abstractions will grow substantially once we start working with much more complex models. But once we have all the basic pieces in place, the training loop itself is quite similar to what we would do if implementing everything from scratch.

To refresh your memory: for some number of epochs, we’ll make a complete pass over the dataset (train_data), grabbing one mini-batch of inputs and the corresponding ground-truth labels at a time.

Then, for each batch, we’ll go through the following ritual. So that this process becomes maximally ritualistic, we’ll repeat it verbatim:

  • Generate predictions (output) and the loss (loss) by executing a forward pass through the network.
  • Calculate gradients by making a backward pass through the network via loss.backward().
  • Update the model parameters by invoking our SGD optimizer. Note that we need not tell trainer.step which parameters to update, only the amount of data (batch_size), since we already handed the parameters over when we constructed trainer.

In [46]:
epochs = 10
loss_sequence = []
num_batches = num_examples / batch_size

for e in range(epochs):
    cumulative_loss = 0
    # inner loop
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx)
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
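        # accumulate the mean loss of this mini-batch
        cumulative_loss += nd.mean(loss).asscalar()
    # cumulative_loss is a sum of per-batch means, so dividing by num_examples
    # (rather than num_batches) reports the average loss scaled by 1 / batch_size
    print("Epoch %s, loss: %s" % (e, cumulative_loss / num_examples))
    loss_sequence.append(cumulative_loss)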

Epoch 0, loss: 3.44980202263
Epoch 1, loss: 2.10364257665
Epoch 2, loss: 1.28279426137
Epoch 3, loss: 0.782256319318
Epoch 4, loss: 0.477034088909
Epoch 5, loss: 0.290909814427
Epoch 6, loss: 0.177411796283
Epoch 7, loss: 0.108197494675
Epoch 8, loss: 0.0659899789031
Epoch 9, loss: 0.040249745576
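
For comparison, the three steps of the ritual can be written out in NumPy with the gradient derived by hand. This is a sketch under stated assumptions rather than a reproduction of what gluon does internally: it uses plain SGD, a loss of half the squared error, an arbitrary seed, and freshly generated data, so the printed losses will not match the numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
num_examples, batch_size, learning_rate, epochs = 10000, 4, 0.0001, 10

X = rng.normal(size=(num_examples, 2))
y = 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2 + 0.01 * rng.normal(size=num_examples)

w = rng.normal(size=2)  # weights drawn from a standard normal
b = 0.0                 # bias starts at zero
epoch_losses = []

for e in range(epochs):
    order = rng.permutation(num_examples)  # reshuffle every epoch
    total = 0.0
    for start in range(0, num_examples, batch_size):
        idx = order[start:start + batch_size]
        # 1. forward pass: predictions and per-example errors
        err = X[idx] @ w + b - y[idx]
        total += 0.5 * np.sum(err ** 2)
        # 2. backward pass: gradient of the summed loss 0.5 * err^2
        grad_w = X[idx].T @ err
        grad_b = err.sum()
        # 3. update: average the gradient over the batch and step
        w -= learning_rate * grad_w / len(idx)
        b -= learning_rate * grad_b / len(idx)
    epoch_losses.append(total / num_examples)
    print("Epoch %s, loss: %s" % (e, epoch_losses[-1]))
```

The only part that gluon’s autograd removes from this picture is step 2: we never have to work out grad_w and grad_b ourselves.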

Visualizing the learning curve

Now let’s check how quickly SGD learns the linear regression model by plotting the learning curve.

In [47]:
# plot the convergence of the estimated loss function
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)

# Adding some bells and whistles to the plot
plt.grid(True, which="both")
plt.xlabel('epoch',fontsize=14)
plt.ylabel('average loss',fontsize=14)
Out[47]:
<matplotlib.text.Text at 0x7efc87a7f0f0>
[Figure: learning curve, average loss (y-axis) plotted against epoch (x-axis)]

As we can see, the loss decreases quickly over the first few epochs and continues to approach its minimum.

Getting the learned model parameters

As an additional sanity check, since we generated the data from a Gaussian linear regression model, we want to make sure that the learner managed to recover the model parameters, which were set to weights \(2, -3.4\) with an offset of \(4.2\).

In [48]:
params = net.collect_params() # this returns a ParameterDict

print('The type of "params" is a ',type(params))

# A ParameterDict is a dictionary of Parameter class objects
# therefore, here is how we can read off the parameters from it.

for param in params.values():
    print(param.name,param.data())
The type of "params" is a  <class 'mxnet.gluon.parameter.ParameterDict'>
dense5_weight
[[ 1.7913872  -3.10427046]]
<NDArray 1x2 @cpu(0)>
dense5_bias
[ 3.85259581]
<NDArray 1 @cpu(0)>
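
To quantify how close these values are, we can compare them with the generating parameters. The short NumPy sketch below simply copies the numbers printed above; the variable names are illustrative.

```python
import numpy as np

true_params = np.array([2.0, -3.4, 4.2])                  # weights, then bias
learned = np.array([1.7913872, -3.10427046, 3.85259581])  # copied from the printout

abs_err = np.abs(learned - true_params)
rel_err = abs_err / np.abs(true_params)
print(abs_err)
print(rel_err)
```

Each parameter is within roughly 10% of its target after ten epochs. Since the loss was still falling at the last epoch, further training would be expected to shrink the gap.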

Conclusion

As you can see, even for a simple example like linear regression, gluon can help you to write quick, clean code. Next, we’ll repeat this exercise for multi-layer perceptrons, extending these lessons to deep neural networks and (comparatively) real datasets.