Linear regression with gluon

Now that we’ve implemented a whole neural network from scratch, using nothing but mx.ndarray and mxnet.autograd, let’s see how we can make the same model while doing a lot less work.

Again, let’s import some packages, this time adding mxnet.gluon to the list of dependencies.

In [1]:
from __future__ import print_function
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import autograd
from mxnet import gluon

Set the context

We’ll also want to set a context to tell gluon where to do most of the computation.

In [2]:
ctx = mx.cpu()

Build the dataset

Again we’ll look at the problem of linear regression and stick with the same synthetic data.

In [3]:
num_inputs = 2
num_outputs = 1
num_examples = 10000

X = nd.random_normal(shape=(num_examples, num_inputs))
y = 2* X[:,0] - 3.4 * X[:,1] + 4.2 + .01 * nd.random_normal(shape=(num_examples,))

Load the data iterator

We’ll stick with the DataLoader for handling out data batching.

In [4]:
batch_size = 4
train_data =, y),
                                      batch_size=batch_size, shuffle=True)

Define the model

When we implemented things from scratch, we had to individually allocate parameters and then compose them together as a model. While it’s good to know how to do things from scratch, with gluon, we can just compose a network from predefined layers. For a linear model, the appropriate layer is called Dense. It’s called a dense layer because every node in the input is connected to every node in the subsequent layer. That description seems excessive because we only have one output here. But in most subsequent chapters we’ll work with networks that have multiple outputs.

Unless we’re planning to make some wild decisions (and at some point, we will!), the easiest way to throw together a neural network is to rely on the gluon.nn.Sequential. Once instantiated, a Sequential just stores a chain of layers. Presented with data, the Sequential executes each layer in turn, using the output of one layer as the input for the next. We’ll delve deeper into these details later when we actually have more than one layer to work with (we could have multiple parallel branches, long chains, etc.). For now let’s just instantiate the Sequential.

In [5]:
net = gluon.nn.Sequential()

Recall that in our linear regression example, the number of inputs is 2 and the number of outputs is 1. We can then add on a single Dense layer. The most direct way to do this is to specify the number of inputs and the number of outpus.

In [6]:
with net.name_scope():
    net.add(gluon.nn.Dense(1, in_units=2))

This tells gluon all that it needs in order to allocate memory for the weights. The net.name_scope tells gluon that it should name all parameters in a consistent way within net, such that we could reference individual weights explicitly at a later stage. Now all we need to do is initialize the weights, instantiate a loss and an optimizer, and we can start training.

Shape inference

One slick feature that we can take advantage of in gluon is shape inference on parameters. Instead of explicitly declaring the number of inputs to a layer, we can simply state the number of outputs.

In [7]:
net = gluon.nn.Sequential()
with net.name_scope():

You might wonder, how can gluon allocate our parameters if it doesn’t know what shape they should take? We’ll elaborate on this and more of gluon’s internal workings in our chapter on plumbing, but here’s the short version: In fact, gluon doesn’t allocate our parameters at that very moment. Instead it defers allocation to the first time we actually make a forward pass through the model with real data. Then, when gluon sees the shape of our data, it can infer the shapes of all of the parameters.

Initialize parameters

This all we need to do to define our network. However, we’re not ready to pass it data just yet. If you try calling net(nd.array([[0,1]])), you’ll find the following hideous error message:

RuntimeError: Parameter dense1_weight has not been initialized.  Note that you should initialize parameters and create Trainer  with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks.

That’s because we haven’t yet told gluon what the initial values for our parameters should be. Also note that we need not tell our network about the input dimensionality and it still works. This is because the dimensions are bound the first time net(x) is called. This is a common theme in MxNet - stuff is evaluated only when needed (called lazy evaluation), using all the information available at the time when the results is needed.

Before we can do anything with this model, we must initialize its parameters. MXNet provides a variety of common initializers in mxnet.init. To keep things consistent with the model we built by hand, we’ll choose to initialize each parameter by sampling from a standard normal distribution. Note that we pass the initializer a context. This is how we tell gluon model where to store our parameters. Once we start training deep nets, we’ll generally want to keep parameters on one or more GPUs (and on more than one computer).

In [8]:
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=ctx)

Deferred Initialization

Since gluon doesn’t know the shape of our net’s parameters, and we haven’t even allcoated memory for them yet, it might seem bizarre that we can initialize them. This is where gluon does a little more magic to make our lives easier. When we call initialize, gluon associates each parameter with an initializer. However, the actual initialization is deferred until the shapes have been inferred.

Define loss

Instead of writing our own loss function wer’e just going to call down to gluon.loss.L2Loss

In [9]:
square_loss = gluon.loss.L2Loss()


Instead of writing stochastic gradient descent from scratch every time, we can instantiate a gluon.Trainer, passing it a dictionary of parameters. Note that the sgd optimizer in gluon actually uses SGD with momentum and clipping (both can be switched off if needed), since these modifications make it converge rather much better. We will discuss this later when we go over a range of optimization algorithms in detail.

In [10]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

Execute training loop

You might have notived that it was a bit more concise to express our model in gluon. For example, we didn’t have to individually allocate parameters, define our loss function, or implement stochastic gradient descent. The benefits of relying on gluon’s abstractions will grow substantially once we start working with much more complext models. But once we have all the basic pieces in place, the training loop itself is quite similar to what we would do if implementing everything from scratch.

To refresh your memory. For some number of epochs, we’ll make a complete pass over the dataset (train_data), grabbing one mini-batch of inputs and the corresponding ground-truth labels at a time.

Then, for each batch, we’ll go through the following ritual. So that this process becomes maximally ritualistic, we’ll repeat it verbatim: * Generate predictions (yhat) and the loss (loss) by executing a forward pass through the network. * Calculate gradients by making a backwards pass through the network via loss.backward(). * Update the model parameters by invoking our SGD optimizer (note that we need not tell trainer.step about which parameters but rather just the amount of data, since we already performed that in the initialization of trainer).

In [11]:
epochs = 2
smoothing_constant = .01

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)

        #  Keep a moving average of the losses
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)

    print("Epoch %s. Moving avg of MSE: %s" % (e, moving_loss))
Epoch 0. Moving avg of MSE: 5.36794120756e-05
Epoch 1. Moving avg of MSE: 4.9173797211e-05


As you can see, even for a simple eample like linear regression, gluon can help you to write quick, clean, clode. Next, we’ll repeat this exercise for multilayer perceptrons, extending these lessons to deep neural networks and (comparatively) real datasets.