# Convolutional neural networks from scratch¶

Now let’s take a look at *convolutional neural networks* (CNNs), the
models people really use for classifying images.

```
In [1]:
```

```
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
import numpy as np
ctx = mx.gpu()
mx.random.seed(1)
```

## MNIST data (last one, we promise!)¶

```
In [2]:
```

```
batch_size = 64
num_inputs = 784
num_outputs = 10
def transform(data, label):
return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
batch_size, shuffle=False)
```

## Convolutional neural networks (CNNs)¶

In the previous example, we connected the nodes our neural networks in what seems like the simplest possible way. Every node in each layer was connected to every node in the subsequent layers.

This can require a lot of parameters! If our input were a 256x256 color image (still quite small for a photograph), and our network had 1,000 nodes in the first hidden layer, then our first weight matrix would require (256x256x3)x1000 parameters. That’s nearly 200 million. Moreover the hidden layer would ignore all the spatial structure in the input image even though we know the local structure represents and powerful source of prior knowledge.

Convolutional neural networks incorporate convolutional layers. These
layers associate each of their nodes with a small window, called a
*receptive field*, in the previous layer. This allows us to first learn
local features via transformations that are applied in the same way for
the top right corner as for the bottom left. Then we collect all this
local information to predict global qualities of the image (like whether
or not it depicts a dog).

(Image credit: Stanford cs231n http://cs231n.github.io/assets/cnn/depthcol.jpeg)

In short, there are two new concepts you need to grep here. First, we’ll
be introducting *convolutional* layers. Second, we’ll be interleaving
them with *pooling* layers.

## Parameters¶

Each node in convolutional layer is associated associated with a 3D
block (height x width x channel) in the input tensor. Moreover, the
convolutional layer itself has multiple output chanels. So the layer is
parameterized by a 4 dimensional weight tensor, commonly called a
*convolutional kernel*.

The output tensor is produced by sliding the kernel across the input
image skiping locations according to a pre-defined *stride* (but we’ll
just assume that to be 1 in this tutorial). Let’s initialize some such
kernels from scratch.

```
In [3]:
```

```
#######################
# Set the scale for weight initialization and choose
# the number of hidden units in the fully-connected layer
#######################
weight_scale = .01
num_fc = 128
W1 = nd.random_normal(shape=(20, 1, 3,3), scale=weight_scale, ctx=ctx)
b1 = nd.random_normal(shape=20, scale=weight_scale, ctx=ctx)
W2 = nd.random_normal(shape=(50, 20, 5, 5), scale=weight_scale, ctx=ctx)
b2 = nd.random_normal(shape=50, scale=weight_scale, ctx=ctx)
W3 = nd.random_normal(shape=(800, num_fc), scale=weight_scale, ctx=ctx)
b3 = nd.random_normal(shape=128, scale=weight_scale, ctx=ctx)
W4 = nd.random_normal(shape=(num_fc, num_outputs), scale=weight_scale, ctx=ctx)
b4 = nd.random_normal(shape=10, scale=weight_scale, ctx=ctx)
params = [W1, b1, W2, b2, W3, b3, W4, b4]
```

And assign space for gradients

```
In [4]:
```

```
for param in params:
param.attach_grad()
```

## Convolving with MXNet’s NDArrray¶

To write a convolution when using *raw MXNet*, we use the function
`nd.Convolution()`

. This function takes a few important arguments:
inputs (`data`

), a 4D weight matrix (`weight`

), a bias (`bias`

),
the shape of the kernel (`kernel`

), and a number of filters
(`num_filter`

).

```
In [5]:
```

```
for data, _ in train_data:
data = data.as_in_context(ctx)
break
conv = nd.Convolution(data=data, weight=W1, bias=b1, kernel=(3,3), num_filter=20)
print(conv.shape)
```

```
(64L, 20L, 26L, 26L)
```

Note the shape. The number of examples (64) remains unchanged. The
number of channels (also called *filters*) has increased to 20. And
because the (3,3) kernel can only be applied in 26 different heights and
widths (without the kernel busting over the image border), our output is
26,26. There are some weird padding tricks we can use when we want the
input and output to have the same height and width dimensions, but we
won’t get into that now.

## Average pooling¶

The other new component of this model is the pooling layer. Pooling gives us a way to downsample in the spatial dimensions. Early convnets typically used average pooling, but max pooling tends to give better results.

```
In [6]:
```

```
pool = nd.Pooling(data=conv, pool_type="max", kernel=(2,2), stride=(2,2))
print(pool.shape)
```

```
(64L, 20L, 13L, 13L)
```

Note that the batch and channel components of the shape are unchanged but that the height and width have been downsampled from (26,26) to (13,13).

## Activation function¶

```
In [7]:
```

```
def relu(X):
return nd.maximum(X,nd.zeros_like(X))
```

## Softmax output¶

```
In [8]:
```

```
def softmax(y_linear):
exp = nd.exp(y_linear-nd.max(y_linear))
partition =nd.sum(exp, axis=0, exclude=True).reshape((-1,1))
return exp / partition
```

## Softmax cross-entropy loss¶

```
In [9]:
```

```
def softmax_cross_entropy(yhat_linear, y):
return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
```

## Define the model¶

Now we’re ready to define our model

```
In [10]:
```

```
def net(X, debug=False):
########################
# Define the computation of the first convolutional layer
########################
h1_conv = nd.Convolution(data=X, weight=W1, bias=b1, kernel=(3,3), num_filter=20)
h1_activation = relu(h1_conv)
h1 = nd.Pooling(data=h1_activation, pool_type="avg", kernel=(2,2), stride=(2,2))
if debug:
print("h1 shape: %s" % (np.array(h1.shape)))
########################
# Define the computation of the second convolutional layer
########################
h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5), num_filter=50)
h2_activation = relu(h2_conv)
h2 = nd.Pooling(data=h2_activation, pool_type="avg", kernel=(2,2), stride=(2,2))
if debug:
print("h2 shape: %s" % (np.array(h2.shape)))
########################
# Flattening h2 so that we can feed it into a fully-connected layer
########################
h2 = nd.flatten(h2)
if debug:
print("Flat h2 shape: %s" % (np.array(h2.shape)))
########################
# Define the computation of the third (fully-connected) layer
########################
h3_linear = nd.dot(h2, W3) + b3
h3 = relu(h3_linear)
if debug:
print("h3 shape: %s" % (np.array(h3.shape)))
########################
# Define the computation of the output layer
########################
yhat_linear = nd.dot(h3, W4) + b4
if debug:
print("yhat_linear shape: %s" % (np.array(yhat_linear.shape)))
return yhat_linear
```

## Test run¶

We can now print out the shapes of the activations at each layer by using the debug flag.

```
In [11]:
```

```
output = net(data, debug=True)
```

```
h1 shape: [64 20 13 13]
h2 shape: [64 50 4 4]
Flat h2 shape: [ 64 800]
h3 shape: [ 64 128]
yhat_linear shape: [64 10]
```

## Optimizer¶

```
In [12]:
```

```
def SGD(params, lr):
for param in params:
param[:] = param - lr * param.grad
```

## Evaluation metric¶

```
In [13]:
```

```
def evaluate_accuracy(data_iterator, net):
numerator = 0.
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
output = net(data)
predictions = nd.argmax(output, axis=1)
numerator += nd.sum(predictions == label)
denominator += data.shape[0]
return (numerator / denominator).asscalar()
```

## The training loop¶

```
In [14]:
```

```
epochs = 10
learning_rate = .01
smoothing_constant = .01
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx)
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, num_outputs)
with autograd.record():
output = net(data)
loss = softmax_cross_entropy(output, label_one_hot)
loss.backward()
SGD(params, learning_rate)
##########################
# Keep a moving average of the losses
##########################
curr_loss = nd.mean(loss).asscalar()
moving_loss = (curr_loss if ((i == 0) and (e == 0))
else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
test_accuracy = evaluate_accuracy(test_data, net)
train_accuracy = evaluate_accuracy(train_data, net)
print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy))
```

```
Epoch 0. Loss: 0.181231728112, Train_acc 0.955433, Test_acc 0.9581
Epoch 1. Loss: 0.10250000031, Train_acc 0.975283, Test_acc 0.9733
Epoch 2. Loss: 0.0707769147591, Train_acc 0.976317, Test_acc 0.9742
Epoch 3. Loss: 0.064788929063, Train_acc 0.97945, Test_acc 0.9784
Epoch 4. Loss: 0.0560155621438, Train_acc 0.985917, Test_acc 0.9838
Epoch 5. Loss: 0.0484381217348, Train_acc 0.986733, Test_acc 0.983
Epoch 6. Loss: 0.0363726502, Train_acc 0.988833, Test_acc 0.9836
Epoch 7. Loss: 0.0360761842955, Train_acc 0.991117, Test_acc 0.9864
Epoch 8. Loss: 0.0349220247952, Train_acc 0.991767, Test_acc 0.9855
Epoch 9. Loss: 0.0227754070213, Train_acc 0.992433, Test_acc 0.9851
```

## Conclusion¶

Contained in this example are nearly all the important ideas you’ll need to start attacking problem in computer vision. While state of the art vision systems incorporate few more bells and whistles, they’re all built on this foundation. Believe it or not, if you knew just the content in this tutorial 5 years ago, you could probably have sold a startup to a Fortune 500 company for millions of dollars. Fortunately (or unfortunately?), the world has gotten marginally more sophisticated, so we’ll have to come up with some more sophisticated tutorials to follow.

For whinges or inquiries, open an issue on GitHub.

```
In [ ]:
```

```
```