# Designing a custom layer with `gluon`

¶

Now that we’ve peeled back some of the syntactic sugar conferred by
`nn.Sequential()`

and given you a feeling for how `gluon`

works
under the hood, you might feel more comfortable when writing your
high-level code. But the real reason to get to know `gluon`

more
intimately is so that we can mess around with it and write our own
Blocks.

Up until now, we’ve presented two versions of each tutorial. One from
scratch and one in `gluon`

. Empowered with such independence, you
might be wondering, “if I wanted to write my own layer, why wouldn’t I
just do it from scratch?”

In reality, writing every model completely from scratch can be cumbersome. Just like there’s only so many times a developer can code up a blog from scratch without hating life, there’s only so many times that you’ll want to write out a convolutional layer, or define the stochastic gradient descent updates. Even in pure research environment, we usually want to customize one part of the model. For example, we might want to implement a new layer, but still rely on other common layers, loss functions, optimizers, etc. In some cases it might be nontrivial to compute the gradient efficiently and the automatic differentiation subsystem might need some help: When was the last time you performed backprop through a log-determinant, a Cholesky factorization, or a matrix exponential? In other cases things might not be numerically very stable when calculated straightforwardly (e.g. taking logs of exponentials of some arguments).

By hacking `gluon`

, we can get the desired flexibility in one part of
our model, without screwing up everything else that makes our life easy.

```
In [1]:
```

```
from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn, Block
mx.random.seed(1)
###########################
# Speficy the context we'll be using
###########################
ctx = mx.cpu()
###########################
# Load up our dataset
###########################
batch_size = 64
def transform(data, label):
return data.astype(np.float32)/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
batch_size, shuffle=False)
```

## Defining a (toy) custom layer¶

To start, let’s pretend that we want to use `gluon`

for its optimizer,
serialization, etc, but that we need a new layer. Specifically, we want
a layer that centers its input about 0 by subtracting its mean. We’ll go
ahead and define the simplest possible `Block`

. Remember from the last
tutorial that in `gluon`

a layer is called a `Block`

(after all, we
might compose multiple blocks into a larger block, etc.).

```
In [2]:
```

```
class CenteredLayer(Block):
def __init__(self, **kwargs):
super(CenteredLayer, self).__init__(**kwargs)
def forward(self, x):
return x - nd.mean(x)
```

That’s it. We can just call instantiate this block and make a forward pass. Note that this layer doesn’t actually care what it’s input dimension or output dimensions are. So we can just feed in an arbitrary array and should expect appropriately transformed output. Whenever we are happy with whatever the automatic differentiation generates, this is all we need.

```
In [3]:
```

```
net = CenteredLayer()
net(nd.array([1,2,3,4,5]))
```

```
Out[3]:
```

```
[-2. -1. 0. 1. 2.]
<NDArray 5 @cpu(0)>
```

We can also incorporate this layer into a more complicated network, as
by using `nn.Sequential()`

.

```
In [4]:
```

```
net2 = nn.Sequential()
net2.add(nn.Dense(128))
net2.add(nn.Dense(10))
net2.add(CenteredLayer())
```

This network contains Blocks (Dense) that contain parameters and thus require initialization

```
In [5]:
```

```
net2.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
```

Now we can pass some data through it, say the first image from our MNIST dataset.

```
In [6]:
```

```
for data, _ in train_data:
data = data.as_in_context(ctx)
break
output = net2(data[0:1])
print(output)
```

```
[[ 0.47172713 -0.27308482 -0.95690644 0.17887072 0.03658688 0.30082721
0.15491416 -0.09305321 0.02917496 0.15094344]]
<NDArray 1x10 @cpu(0)>
```

And we can verify that as expected, the resulting vector has mean 0.

```
In [7]:
```

```
nd.mean(output)
```

```
Out[7]:
```

```
[ 4.47034854e-09]
<NDArray 1 @cpu(0)>
```

There’s a good chance you’ll see something other than 0. When I ran this
code, I got `2.68220894e-08`

. That’s roughly `.000000027`

. This is
due to the fact that MXNet often uses low precision arithmetics. For
deep learning research, this is often a compromise that we make. In
exchange for giving up a few significant digits, we get tremendous
speedups on modern hardware. And it turns out that most deep learning
algorihtms don’t suffer too much from the loss of precision. This is
probably due to the fact that work and worse on account of the loss of
precision.

## Custom layers with parameters¶

While `CenteredLayer`

should give you some sense of how to implement a
custom layer, it’s missing a few important pieces. Most importantly,
`CenteredLayer`

doesn’t care about the dimensions of its input or
output, and it doesn’t contain any trainable parameters. Since you
already know how to implement a fully-connected layer from scratch,
let’s learn how to make parameteric `Block`

by implementing MyDense,
our own version of a fully-connected (Dense) layer.

## Parameters¶

Before we can add parameters to our custom `Block`

, we should get to
know how `gluon`

deals with parmaeters generally. Instead of working
with NDArrays directly, each `Block`

is associated with some number
(as few as zero) of `Parameter`

(groups).

At a high level, you can think of a `Parameter`

as a wrapper on an
`NDArray`

. However, the `Parameter`

can be instantiated before the
corresponding NDArray is. For example, when we instantiate a `Block`

but the shapes of each parameter still need to be inferred, the
Parameter will wait for the shape to be inferred before allocating
memory.

To get a hands-on feel for mxnet.Parameter, let’s just instantiate one
outside of a `Block`

:

```
In [8]:
```

```
my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5,5))
print(my_param)
```

```
Parameter exciting_parameter_yay (shape=(5, 5), dtype=<type 'numpy.float32'>)
```

Here we’ve instantiated a parameter, giving it the name
“exciting_parameter_yay”. We’ve also specified that we’ll want to
capture gradients for this Parameter. Under the hood, that lets
`gluon`

know that it has to call `.attach_grad()`

on the underlying
NDArray. We also specified the shape. Now that we have a Parameter, we
can initialize its values via `.initialize()`

and print extract its
data by calling `.data()`

.

```
In [9]:
```

```
my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
print(my_param.data())
```

```
[[ 0.50735062 -0.65750605 -0.56013602 0.46934015 0.1596154 ]
[-0.65080845 -0.11559016 0.31085443 -0.49285054 0.57047993]
[ 0.35613006 0.29938424 0.61431509 0.13020623 0.21408975]
[-0.38888294 0.65209502 -0.08793807 -0.03835624 0.63372332]
[-0.42945772 -0.36274379 -0.06317961 -0.58671117 0.2023437 ]]
<NDArray 5x5 @cpu(0)>
```

For data parallelism, a Parameter can also be initialzied on multiple
contexts. The Parameter will then keep a copy of its value on each
context. Keep in mind that you need to maintain consistency amount the
copies when updating the Parameter (usually `gluon.Trainer`

does this
for you).

Note that you need at least two GPUs to run this section.

```
In [10]:
```

```
# my_param = gluon.Parameter("exciting_parameter_yay", grad_req='write', shape=(5,5))
# my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=[mx.gpu(0), mx.gpu(1)])
# print(my_param.data(mx.gpu(0)), my_param.data(mx.gpu(1)))
```

## Parameter dictionaries (introducing `ParameterDict`

)¶

Rather than directly store references to each of its `Parameters`

,
`Block`

s typicaly contain a parameter dictionary
(`ParameterDict`

). In practice, we’ll rarely instantiate our own
`ParamterDict`

. That’s because whenever we call the `Block`

constructor it’s generated automatically. For pedagogical purposes,
we’ll do it from scratch this one time.

```
In [11]:
```

```
pd = gluon.ParameterDict(prefix="block1_")
```

MXNet’s `ParameterDict`

does a few cool things for us. First, we can
instantiate a new Parameter by calling `pd.get()`

```
In [12]:
```

```
pd.get("exciting_parameter_yay", grad_req='write', shape=(5,5))
```

```
Out[12]:
```

```
Parameter block1_exciting_parameter_yay (shape=(5, 5), dtype=<type 'numpy.float32'>)
```

Note that the new parameter is (i) contained in the ParameterDict and
(ii) appends the prefix to it’s name. This naming convention helps us to
know which parameters belong to which `Block`

or sub-`Block`

. It’s
especially useful when we want to write parameters to disc (i.e.
serialize), or read them from disc (i.e. deserialize).

Like a regular Python dictionary, we can get the names of all parameters
with `.keys()`

and can

```
In [13]:
```

```
pd["block1_exciting_parameter_yay"]
```

```
Out[13]:
```

```
Parameter block1_exciting_parameter_yay (shape=(5, 5), dtype=<type 'numpy.float32'>)
```

## Craft a bespoke fully-connected `gluon`

layer¶

Now that we know how parameters work, we’re ready to create our very own fully-connected layer. We’ll use the familiar relu activation from previous tutorials.

```
In [14]:
```

```
def relu(X):
return nd.maximum(X, 0)
```

Now we can define our `Block`

.

```
In [15]:
```

```
class MyDense(Block):
####################
# We add arguments to our constructor (__init__)
# to indicate the number of input units (``in_units``)
# and output units (``units``)
####################
def __init__(self, units, in_units=0, **kwargs):
super(MyDense, self).__init__(**kwargs)
with self.name_scope():
self.units = units
self._in_units = in_units
#################
# We add the required parameters to the ``Block``'s ParameterDict ,
# indicating the desired shape
#################
self.weight = self.params.get(
'weight', init=mx.init.Xavier(magnitude=2.24),
shape=(in_units, units))
self.bias = self.params.get('bias', shape=(units,))
#################
# Now we just have to write the forward pass.
# We could rely upong the FullyConnected primitative in NDArray,
# but it's better to get our hands dirty and write it out
# so you'll know how to compose arbitrary functions
#################
def forward(self, x):
with x.context:
linear = nd.dot(x, self.weight.data()) + self.bias.data()
activation = relu(linear)
return activation
```

Recall that every Block can be run just as if it were an entire network. In fact, linear models are nothing more than neural networks consisting of a single layer as a network.

So let’s go ahead and run some data though our bespoke layer. We’ll want to first instantiate the layer and initialize its parameters.

```
In [16]:
```

```
dense = MyDense(20, in_units=10)
dense.collect_params().initialize(ctx=ctx)
```

```
In [17]:
```

```
dense.params
```

```
Out[17]:
```

```
mydense0_ (
Parameter mydense0_weight (shape=(10, 20), dtype=<type 'numpy.float32'>)
Parameter mydense0_bias (shape=(20,), dtype=<type 'numpy.float32'>)
)
```

Now we can run through some dummy data.

```
In [18]:
```

```
dense(nd.ones(shape=(2,10)))
```

```
Out[18]:
```

```
[[ 0. 0.59868848 0. 1.08994353 0. 0.
0.02280135 0.26122352 0.15244918 0. 0. 1.23705149
0.53500706 0. 0. 0.61897928 0.09488952 0. 0.
0.46094608]
[ 0. 0.59868848 0. 1.08994353 0. 0.
0.02280135 0.26122352 0.15244918 0. 0. 1.23705149
0.53500706 0. 0. 0.61897928 0.09488952 0. 0.
0.46094608]]
<NDArray 2x20 @cpu(0)>
```

## Using our layer to build an MLP¶

While it’s a good sanity check to run some data though the layer, the
real proof that it works will be if we can compose a network entirely
out of `MyDense`

layers and achieve respectable accuracy on a real
task. So we’ll revist the MNIST digit classification task, and use the
familiar `nn.Sequential()`

syntax to build our net.

```
In [19]:
```

```
net = gluon.nn.Sequential()
with net.name_scope():
net.add(MyDense(128, in_units=784))
net.add(MyDense(64, in_units=128))
net.add(MyDense(10, in_units=64))
```

## Initialize Parameters¶

```
In [20]:
```

```
net.collect_params().initialize(ctx=ctx)
```

## Instantiate a loss¶

```
In [21]:
```

```
loss = gluon.loss.SoftmaxCrossEntropyLoss()
```

## Optimizer¶

```
In [22]:
```

```
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
```

## Evaluation Metric¶

```
In [23]:
```

```
metric = mx.metric.Accuracy()
def evaluate_accuracy(data_iterator, net):
numerator = 0.
denominator = 0.
for i, (data, label) in enumerate(data_iterator):
with autograd.record():
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
label_one_hot = nd.one_hot(label, 10)
output = net(data)
metric.update([label], [output])
return metric.get()[1]
```

## Training loop¶

```
In [24]:
```

```
epochs = 10
moving_loss = 0.
for e in range(epochs):
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(ctx).reshape((-1,784))
label = label.as_in_context(ctx)
with autograd.record():
output = net(data)
cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data.shape[0])
##########################
# Keep a moving average of the losses
##########################
if i == 0:
moving_loss = nd.mean(cross_entropy).asscalar()
else:
moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
test_accuracy = evaluate_accuracy(test_data, net)
train_accuracy = evaluate_accuracy(train_data, net)
print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy))
```

```
Epoch 0. Loss: 0.662074091477, Train_acc 0.756814285714, Test_acc 0.7559
Epoch 1. Loss: 0.588702041132, Train_acc 0.763771428571, Test_acc 0.758025
Epoch 2. Loss: 0.572036757495, Train_acc 0.769761904762, Test_acc 0.764713333333
Epoch 3. Loss: 0.538418785097, Train_acc 0.773721428571, Test_acc 0.770240909091
Epoch 4. Loss: 0.538585911311, Train_acc 0.776274285714, Test_acc 0.773951724138
Epoch 5. Loss: 0.309081547325, Train_acc 0.793864285714, Test_acc 0.778911111111
Epoch 6. Loss: 0.291082330554, Train_acc 0.806795918367, Test_acc 0.795658139535
Epoch 7. Loss: 0.291905886004, Train_acc 0.816110714286, Test_acc 0.808068
Epoch 8. Loss: 0.261846415965, Train_acc 0.824328571429, Test_acc 0.817166666667
Epoch 9. Loss: 0.273562002079, Train_acc 0.830831428571, Test_acc 0.8251140625
```

## Conclusion¶

It works! There’s a lot of other cool things you can do. In more advanced chapters, we’ll show how you can make a layer that takes in multiple inputs, or one that cleverly calls down to MXNet’s symbolic API to squeeze out extra performance without screwing up your convenient imperative workflow.

## Next¶

Serialization: saving your models and parameters for later re-use

For whinges or inquiries, open an issue on GitHub.

```
In [ ]:
```

```
```