# Multilayer perceptrons from scratch¶

Now that we’ve covered all the preliminaries, extending to deep neural networks is actually quite easy.

```
In [1]:
```

```
from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd
ctx = mx.gpu()
```

## MNIST data (surprise!)¶

Let’s go ahead and grab our data.

```
In [2]:
```

```
mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = 64
def transform(data, label):
return data.astype(np.float32)/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
batch_size, shuffle=False)
```

## Multilayer perceptrons¶

Here’s where things start to get interesting. Before, we mapped our inputs directly onto our outputs through a single linear transformation.

This model is perfectly adequate when the underlying relationship
between our data points and labels is approximately linear. When our
data points and targets are characterized by a more complex
relationship, a linear model will produce results with insufficient
accuracy. We can model a more general class of functions by
incorporating one or more *hidden layers*.

Here, each layer will require it’s own set of parameters. To make things simple here, we’ll assume two hidden layers of computation.

```
In [3]:
```

```
#######################
# Set some constants so it's easy to modify the network later
#######################
num_hidden = 256
weight_scale = .01
#######################
# Allocate parameters for the first hidden layer
#######################
W1 = nd.random_normal(shape=(num_inputs, num_hidden), scale=weight_scale, ctx=ctx)
b1 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=ctx)
#######################
# Allocate parameters for the second hidden layer
#######################
W2 = nd.random_normal(shape=(num_hidden, num_hidden), scale=weight_scale, ctx=ctx)
b2 = nd.random_normal(shape=num_hidden, scale=weight_scale, ctx=ctx)
#######################
# Allocate parameters for the output layer
#######################
W3 = nd.random_normal(shape=(num_hidden, num_outputs), scale=weight_scale, ctx=ctx)
b3 = nd.random_normal(shape=num_outputs, scale=weight_scale, ctx=ctx)
params = [W1, b1, W2, b2, W3, b3]
```

Again, let’s allocate space for each parameter’s gradients.

```
In [4]:
```

```
for param in params:
param.attach_grad()
```

## Activation functions¶

If we compose a multi-layer network but use only linear operations, then our entire network will still be a linear function. That’s because $:raw-latex:hat{y} = X :raw-latex:`\cdot `W\_1 :raw-latex:`cdot W_2 :raw-latex:cdot W_2 = X :raw-latex:cdot W_4 $ for :math:`W_4 = W_1 cdot W_2 cdot W3. To give our model the capacity to capture nonlinear functions, we’ll need to interleave our linear operations with activation functions. In this case, we’ll use the rectified linear unit (ReLU):

```
In [5]:
```

```
def relu(X):
return nd.maximum(X, nd.zeros_like(X))
```

## Softmax output¶

As with multiclass logistic regression, we’ll want out outputs to be
*stochastic*, meaning that they constitute a valid probability
distribution. We’ll use the same softmax activation function on our
output to make sure that our outputs sum to one and are non-negative.

```
In [6]:
```

```
def softmax(y_linear):
exp = nd.exp(y_linear-nd.max(y_linear))
partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
return exp / partition
```

## The *softmax* cross-entropy loss function¶

In the previous example, we calculate our model’s output and then ran this output through the cross-entropy loss function:

```
In [7]:
```

```
def cross_entropy(yhat, y):
return - nd.nansum(y * nd.log(yhat), axis=0, exclude=True)
```

Mathematically, that’s a perfectly reasonable thing to do. However, computationally, things can get hairy. We’ll revist the issue at length in a chapter more dedicated to implementation and less interested in statistical modeling. But we’re going to make a change here so we want to give you the gist of why.

When we calculate the softmax partition function, we take a sum of
exponential functions: \(\sum_{i=1}^{n} e^{z_i}\). When we also
calculate our numerators as exponential functions, then this can give
rise to some big numbers in our intermediate calculations. The pairing
of big numbers and low precision mathematics tends to make things go
crazy. As a result, if we use our naive softmax implemenation, we might
get horrific not a number (`nan`

) results printed to secreen.

Our salvation is that even though we’re computing these exponential
functions, we ultimately plan to take their log in the cross-entropy
functions. It turns out that by combining these two operators
`softmax`

and `cross_entropy`

together, we can elude the numerical
stability issues that might otherwise plague us during backpropagation.
We’ll want to keep the conventional softmax function handy in case we
ever want to evaluate the probabilities output by our model.

But instead of passing softmax probabilities into our loss function -
we’ll just pass our `yhat_linear`

and compute the softmax and its log
all at once inside the softmax_cross_entropy loss function
simultaneously, which does smart things like the log-sum-exp trick (see
on Wikipedia).

```
In [8]:
```

```
def softmax_cross_entropy(yhat_linear, y):
return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
```

## Define the model¶

Now we’re ready to define our model

```
In [9]:
```

```
def net(X):
#######################
# Compute the first hidden layer
#######################
h1_linear = nd.dot(X, W1) + b1
h1 = relu(h1_linear)
#######################
# Compute the second hidden layer
#######################
h2_linear = nd.dot(h1, W2) + b2
h2 = relu(h2_linear)
#######################
# Compute the output layer.
# We will omit the softmax function here
# because it will be applied
# in the softmax_cross_entropy loss
#######################
yhat_linear = nd.dot(h2, W3) + b3
return yhat_linear
```

## Optimizer¶

```
In [10]:
```

```
def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
```

## Evaluation metric¶

```
In [11]:
```

```
def evaluate_accuracy(data_iterator, net):
numerator = 0.
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
output = net(data)
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
```

## Execute the training loop¶

```
In [12]:
```

```
epochs = 10
learning_rate = .001
smoothing_constant = .01
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1, 784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
test_accuracy = evaluate_accuracy(test_data, net)
train_accuracy = evaluate_accuracy(train_data, net)
print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
(e, moving_loss, train_accuracy, test_accuracy))
```

```
Epoch 0. Loss: 0.455451145229, Train_acc 0.885817, Test_acc 0.8897
Epoch 1. Loss: 0.297250172348, Train_acc 0.921383, Test_acc 0.9205
Epoch 2. Loss: 0.202016335186, Train_acc 0.946467, Test_acc 0.9451
Epoch 3. Loss: 0.151867129294, Train_acc 0.960667, Test_acc 0.9584
Epoch 4. Loss: 0.113816030109, Train_acc 0.9688, Test_acc 0.9637
Epoch 5. Loss: 0.100374131216, Train_acc 0.97185, Test_acc 0.9658
Epoch 6. Loss: 0.0873043180085, Train_acc 0.9779, Test_acc 0.9713
Epoch 7. Loss: 0.0730908748383, Train_acc 0.98085, Test_acc 0.972
Epoch 8. Loss: 0.068088298137, Train_acc 0.984883, Test_acc 0.9735
Epoch 9. Loss: 0.0573755351742, Train_acc 0.986133, Test_acc 0.9731
```

## Conclusion¶

Nice! With just two hidden layers containing 256 hidden nodes, repsectively, we can achieve over 95% accuracy on this task.