Multilayer perceptrons from scratch

In the previous chapters we showed how you could implement multiclass logistic regression (also called softmax regression) for classifying images of handwritten digits into the 10 possible categories (from scratch and with gluon). This is where things start to get fun. We understand how to wrangle data, how to coerce our outputs into a valid probability distribution, how to apply an appropriate loss function, and how to optimize over our parameters. Now that we’ve covered these preliminaries, we can extend our toolbox to include deep neural networks.

Recall that before, we mapped our inputs directly onto our outputs through a single linear transformation.

\[\hat{y} = \mbox{softmax}(W \boldsymbol{x} + b)\]

Graphically, we could depict the model as a single fully connected layer mapping the inputs directly to the softmax output.

If our labels really were related to our input data by an approximately linear function, then this approach might be adequate. But linearity is a strong assumption. It means that, for any output of interest, increasing the value of any one input should either always drive that output up or always drive it down, irrespective of the values of the other inputs.

Imagine the case of classifying cats and dogs based on black and white images. That’s like saying that for each pixel, increasing its value either always increases the probability that the image depicts a dog or always decreases it. That’s not reasonable. After all, the world contains both black dogs and black cats, and both white dogs and white cats.

Teasing out what is depicted in an image generally requires allowing more complex relationships between our inputs and outputs, considering the possibility that our pattern might be characterized by interactions among the many features. In these cases, linear models will have low accuracy. We can model a more general class of functions by incorporating one or more hidden layers. The easiest way to do this is to stack layers of neurons on top of each other: each layer feeds into the layer above it, until we generate an output. This architecture is commonly called a “multilayer perceptron” (MLP). With an MLP, we stack layers like this:

\[h_1 = \phi(W_1\boldsymbol{x} + b_1)\]
\[h_2 = \phi(W_2\boldsymbol{h_1} + b_2)\]
\[...\]
\[h_n = \phi(W_n\boldsymbol{h_{n-1}} + b_n)\]

Note that each layer requires its own set of parameters. For each hidden layer, we calculate its value by first applying a linear function to the activations of the layer below, and then applying an element-wise nonlinear activation function. Here, we’ve denoted the activation function for the hidden layers as \(\phi\). Finally, given the topmost hidden layer, we’ll generate an output. Because we’re still focusing on multiclass classification, we’ll stick with the softmax activation in the output layer.

\[\hat{y} = \mbox{softmax}(W_y \boldsymbol{h}_n + b_y)\]

Graphically, a multilayer perceptron can be depicted as a stack of fully connected layers, each feeding into the one above it, with the softmax output on top.

Multilayer perceptrons can account for complex interactions in the inputs because the hidden neurons depend on the values of each of the inputs. It’s easy to design a hidden node that does arbitrary computation, say logical operations on its inputs. It’s even widely known that multilayer perceptrons are universal approximators: even a single-hidden-layer neural network, with enough nodes and the right set of weights, can model any function at all! Actually learning that function is the hard part. It also turns out that we can approximate many functions much more compactly with deeper (versus wider) neural networks. We’ll get more into the math in subsequent chapters. But for now, let’s actually build an MLP. In this example, we’ll implement a multilayer perceptron with two hidden layers and one output layer.
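Before we build one, here is a small illustrative sketch, using hand-picked weights rather than anything learned, of how a single hidden layer of ReLU units can compute the XOR of two binary inputs, something no purely linear model can represent (the function name tiny_mlp and the particular weights are made up for this illustration):

import numpy as np

def tiny_mlp(x):
    # hand-picked (not learned) weights, purely for illustration
    W1 = np.array([[1., 1.], [1., 1.]])     # hidden layer weights
    b1 = np.array([0., -1.])                # hidden layer biases
    w2 = np.array([1., -2.])                # output layer weights
    h = np.maximum(np.dot(W1, x) + b1, 0.)  # ReLU hidden layer
    return np.dot(w2, h)                    # linear output

for x in [(0., 0.), (0., 1.), (1., 0.), (1., 1.)]:
    print(x, tiny_mlp(np.array(x)))         # prints the XOR of the two inputs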

Imports

In [27]:
from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd, gluon

Set contexts

In [28]:
data_ctx = mx.cpu()
model_ctx = mx.cpu()
# model_ctx = mx.gpu(1)
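If you are unsure whether a GPU is available, the following optional sketch falls back to the CPU automatically; it assumes your MXNet version provides mx.context.num_gpus():

# optional: use a GPU if one is present, otherwise stay on the CPU
# (assumes mx.context.num_gpus() exists in your MXNet version)
model_ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()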

Load MNIST data

Let’s go ahead and grab our data.

In [29]:
num_inputs = 784
num_outputs = 10
batch_size = 64
num_examples = 60000
def transform(data, label):
    return data.astype(np.float32)/255, label.astype(np.float32)
train_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
                                      batch_size, shuffle=True)
test_data = gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                                     batch_size, shuffle=False)

Allocate parameters

In [30]:
#######################
#  Set some constants so it's easy to modify the network later
#######################
num_hidden = 256
weight_scale = .01

#######################
#  Allocate parameters for the first hidden layer
#######################
W1 = nd.random_normal(shape=(num_inputs, num_hidden), scale=weight_scale, ctx=model_ctx)
b1 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)

#######################
#  Allocate parameters for the second hidden layer
#######################
W2 = nd.random_normal(shape=(num_hidden, num_hidden), scale=weight_scale, ctx=model_ctx)
b2 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=model_ctx)

#######################
#  Allocate parameters for the output layer
#######################
W3 = nd.random_normal(shape=(num_hidden, num_outputs), scale=weight_scale, ctx=model_ctx)
b3 = nd.random_normal(shape=num_outputs, scale=weight_scale, ctx=model_ctx)

params = [W1, b1, W2, b2, W3, b3]
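As a quick sanity check, we can count the learnable scalars. With the shapes above there are 784*256 + 256*256 + 256*10 weights plus 256 + 256 + 10 biases, i.e. 269,322 parameters in total.

# total number of scalar parameters in the model
print(sum([param.size for param in params]))   # 269322 for the shapes above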

Again, let’s allocate space for each parameter’s gradients.

In [31]:
for param in params:
    param.attach_grad()

Activation functions

If we compose a multi-layer network but use only linear operations, then our entire network will still be a linear function. That’s because (ignoring the bias terms) \(\hat{y} = X \cdot W_1 \cdot W_2 \cdot W_3 = X \cdot W_4\) for \(W_4 = W_1 \cdot W_2 \cdot W_3\). To give our model the capacity to capture nonlinear functions, we’ll need to interleave our linear operations with activation functions. In this case, we’ll use the rectified linear unit (ReLU):

In [32]:
def relu(X):
    return nd.maximum(X, nd.zeros_like(X))
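A one-line check confirms that relu clips negative entries to zero and leaves non-negative ones alone:

print(relu(nd.array([-2., 0., 3.], ctx=model_ctx)))   # expect [0. 0. 3.]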

Softmax output

As with multiclass logistic regression, we’ll want the outputs to constitute a valid probability distribution. We’ll use the same softmax activation function on our output to make sure that our outputs sum to one and are non-negative.

In [33]:
def softmax(y_linear):
    exp = nd.exp(y_linear-nd.max(y_linear))
    partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
    return exp / partition
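A quick check on some random scores confirms that each row of the result is non-negative and sums to one:

sample_scores = nd.random_normal(shape=(2, 10), ctx=model_ctx)
sample_probs = softmax(sample_scores)
print(sample_probs.sum(axis=1))   # each row should sum to (approximately) 1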

The softmax cross-entropy loss function

In the previous example, we calculated our model’s output and then ran this output through the cross-entropy loss function:

In [34]:
def cross_entropy(yhat, y):
    return - nd.nansum(y * nd.log(yhat), axis=0, exclude=True)

Mathematically, that’s a perfectly reasonable thing to do. However, computationally, things can get hairy. We’ll revisit the issue at length in a chapter more dedicated to implementation and less interested in statistical modeling. But since we’re going to make a change here, we want to give you the gist of why.

When we calculate the softmax partition function, we take a sum of exponential functions: \(\sum_{i=1}^{n} e^{z_i}\). When we also calculate our numerators as exponential functions, this can give rise to some big numbers in our intermediate calculations. The pairing of big numbers and low-precision arithmetic tends to make things go crazy. As a result, if we use our naive softmax implementation, we might get horrific not-a-number (nan) results printed to the screen.
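You can see the problem directly: single-precision floats top out around 3.4e38, so even a moderately large score overflows to infinity once exponentiated, and a fully naive softmax (one that doesn’t even subtract the maximum first) then divides infinities and produces nan:

print(nd.exp(nd.array([10., 100.])))   # exp(100.) overflows float32 and becomes inf

big_scores = nd.array([[100., 101.]])
naive_probs = nd.exp(big_scores) / nd.exp(big_scores).sum(axis=1).reshape((-1, 1))
print(naive_probs)   # inf / inf gives nan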

Our salvation is that even though we’re computing these exponential functions, we ultimately plan to take their log in the cross-entropy function. It turns out that by combining the softmax and cross_entropy operations, we can elude the numerical stability issues that might otherwise plague us during backpropagation. We’ll still want to keep the conventional softmax function handy in case we ever want to evaluate the probabilities output by our model.

But instead of passing softmax probabilities into our loss function, we’ll just pass in yhat_linear and compute the softmax and its log all at once inside the softmax_cross_entropy loss function, which does smart things like the log-sum-exp trick (see Wikipedia).
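Concretely, the log-sum-exp trick rewrites the log of the partition function so that the largest exponent it ever has to handle is zero:

\[\log \sum_{i=1}^{n} e^{z_i} = \max_j z_j + \log \sum_{i=1}^{n} e^{z_i - \max_j z_j}\]

Every term \(e^{z_i - \max_j z_j}\) is at most 1, so nothing overflows and no precision is lost to giant intermediate values.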

In [35]:
def softmax_cross_entropy(yhat_linear, y):
    return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
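On moderate inputs, the fused loss agrees with running softmax and cross_entropy separately; here is a quick informal check (the value in the comment is approximate):

scores = nd.array([[2., 1., 0.]])
targets = nd.array([[1., 0., 0.]])
print(cross_entropy(softmax(scores), targets))   # roughly [0.41]
print(softmax_cross_entropy(scores, targets))    # should agree with the line above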

Define the model

Now we’re ready to define our model.

In [36]:
def net(X):
    #######################
    #  Compute the first hidden layer
    #######################
    h1_linear = nd.dot(X, W1) + b1
    h1 = relu(h1_linear)

    #######################
    #  Compute the second hidden layer
    #######################
    h2_linear = nd.dot(h1, W2) + b2
    h2 = relu(h2_linear)

    #######################
    #  Compute the output layer.
    #  We will omit the softmax function here
    #  because it will be applied
    #  in the softmax_cross_entropy loss
    #######################
    yhat_linear = nd.dot(h2, W3) + b3
    return yhat_linear
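Before training, it’s worth pushing one batch through the untrained network to confirm that the shapes line up; each row of the output holds the ten un-normalized scores for one image.

for data, _ in train_data:
    data = data.as_in_context(model_ctx).reshape((-1, 784))
    print(net(data).shape)   # expect (64, 10) for a full batch
    break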

Optimizer

In [37]:
def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad

Evaluation metric

In [38]:
def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()
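Since the weights are still random at this point, accuracy on the test set before training should hover around chance, roughly one in ten:

print(evaluate_accuracy(test_data, net))   # expect roughly 0.1 with untrained weights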

Execute the training loop

In [39]:
epochs = 10
learning_rate = .001
smoothing_constant = .01

for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        cumulative_loss += nd.sum(loss).asscalar()


    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, cumulative_loss/num_examples, train_accuracy, test_accuracy))
Epoch 0. Loss: 1.20780775437, Train_acc 0.883633, Test_acc 0.8877
Epoch 1. Loss: 0.328388882202, Train_acc 0.924017, Test_acc 0.9244
Epoch 2. Loss: 0.22106400394, Train_acc 0.949033, Test_acc 0.9464
Epoch 3. Loss: 0.162594895309, Train_acc 0.957433, Test_acc 0.9535
Epoch 4. Loss: 0.129279144899, Train_acc 0.96935, Test_acc 0.9637
Epoch 5. Loss: 0.105187748659, Train_acc 0.9739, Test_acc 0.9703
Epoch 6. Loss: 0.0890154179106, Train_acc 0.979033, Test_acc 0.9728
Epoch 7. Loss: 0.076162833334, Train_acc 0.982283, Test_acc 0.9723
Epoch 8. Loss: 0.0654618650412, Train_acc 0.984717, Test_acc 0.9728
Epoch 9. Loss: 0.0572528594335, Train_acc 0.987433, Test_acc 0.9751

Conclusion

Nice! With just two hidden layers of 256 hidden nodes each, we can achieve over 97% accuracy on this task.