# Overfitting and regularization (with gluon)

Now that we’ve built a regularized logistic regression model from scratch, let’s make this more efficient with gluon. We recommend that you read that section for an explanation of why regularization is a good idea. As always, we begin by loading libraries and some data.

In [ ]:

from __future__ import print_function
import mxnet as mx
from mxnet import gluon
import mxnet.ndarray as nd
import numpy as np
ctx = mx.cpu()

# for plotting purposes
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt


## The MNIST Dataset

In [ ]:

mnist = mx.test_utils.get_mnist()
# use only a small subset so that the model overfits easily
num_examples = 1000
batch_size = 64
train_data = mx.gluon.data.DataLoader(
    mx.gluon.data.ArrayDataset(mnist["train_data"][:num_examples],
                               mnist["train_label"][:num_examples].astype(np.float32)),
    batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(
    mx.gluon.data.ArrayDataset(mnist["test_data"][:num_examples],
                               mnist["test_label"][:num_examples].astype(np.float32)),
    batch_size, shuffle=False)


## Multiclass Logistic Regression

In [ ]:

net = gluon.nn.Sequential()
with net.name_scope():
    # a single dense layer with one output per digit class
    net.add(gluon.nn.Dense(10))


## Parameter initialization

In [ ]:

net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)


## Softmax Cross Entropy Loss

In [ ]:

loss = gluon.loss.SoftmaxCrossEntropyLoss()


## Optimizer

A weight decay penalty keeps the coefficients from growing too large. To get the real overfitting experience we make sure it is switched off. We do this by passing 'wd': 0.0 when we instantiate the trainer.

In [ ]:

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd': 0.0})


## Evaluation Metric

In [ ]:

def evaluate_accuracy(data_iterator, net, loss_fun):
    acc = mx.metric.Accuracy()
    loss_avg = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        output = net(data)
        loss = loss_fun(output, label)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
        # running mean of the per-batch mean losses
        loss_avg = loss_avg*i/(i+1) + nd.mean(loss).asscalar()/(i+1)
    return acc.get()[1], loss_avg

def plot_learningcurves(loss_tr,loss_ts, acc_tr,acc_ts):
    xs = list(range(len(loss_tr)))

    f = plt.figure(figsize=(12,6))
    # left panel: losses, right panel: accuracies
    fg1 = f.add_subplot(121)
    fg2 = f.add_subplot(122)

    fg1.set_xlabel('epoch',fontsize=14)
    fg1.set_title('Comparing loss functions')
    fg1.semilogy(xs, loss_tr)
    fg1.semilogy(xs, loss_ts)
    fg1.grid(True,which="both")

    fg1.legend(['training loss', 'testing loss'],fontsize=14)

    fg2.set_title('Comparing accuracy')
    fg2.set_xlabel('epoch',fontsize=14)
    fg2.plot(xs, acc_tr)
    fg2.plot(xs, acc_ts)
    fg2.grid(True,which="both")
    fg2.legend(['training accuracy', 'testing accuracy'],fontsize=14)


## Execute training loop

In [ ]:

epochs = 700
moving_loss = 0.
niter=0

loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        # record the forward pass so that backward() can compute gradients
        with mx.autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
        cross_entropy.backward()
        trainer.step(data.shape[0])

        ##########################
        #  Keep a moving average of the losses
        ##########################
        niter +=1
        moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
        est_loss = moving_loss/(1-0.99**niter)

    test_accuracy, test_loss = evaluate_accuracy(test_data, net, loss)
    train_accuracy, train_loss = evaluate_accuracy(train_data, net, loss)

    # save them for later
    loss_seq_train.append(train_loss)
    loss_seq_test.append(test_loss)
    acc_seq_train.append(train_accuracy)
    acc_seq_test.append(test_accuracy)

    if e % 20 == 0:
        print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test_acc %s" %
              (e+1, train_loss, test_loss, train_accuracy, test_accuracy))

## Plotting the learning curves
plot_learningcurves(loss_seq_train,loss_seq_test,acc_seq_train,acc_seq_test)
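
The training loop above keeps a moving average of the loss and divides it by 1-0.99**niter. The following is a minimal sketch in plain Python of why that division helps, not part of the original notebook: because the average starts at zero, the first few raw values are far too small, and the division corrects for this. The helper name debiased_moving_average and the constant loss of 2.0 are made up for illustration.

```python
def debiased_moving_average(values, decay=0.99):
    """Return the bias-corrected exponential moving average after each value.

    Hypothetical helper that mirrors the moving_loss / est_loss lines
    of the training loop.
    """
    moving = 0.0
    estimates = []
    for n, v in enumerate(values, start=1):
        moving = decay * moving + (1 - decay) * v
        # divide by the total weight 1 - decay**n given to the values seen so far
        estimates.append(moving / (1 - decay ** n))
    return estimates

# A made-up, constant loss of 2.0: the raw average after one step is only
# 0.01 * 2.0, whereas the corrected estimate recovers the constant.
raw_first = (1 - 0.99) * 2.0
estimates = debiased_moving_average([2.0] * 5)
print(raw_first, estimates)
```

For a constant input the corrected estimate equals that constant at every step, up to floating point error, while the uncorrected average would need hundreds of steps to get close.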


## Regularization

Now let’s see what this mysterious weight decay is all about. We begin with a bit of math. When we add an L2 penalty to the weights we are effectively adding $$\frac{\lambda}{2} \|w\|^2$$ to the loss. Hence, every time we compute the gradient it gets an additional $$\lambda w$$ term that is added to $$g_t$$, since this is the very derivative of the L2 penalty. As a result we end up taking a descent step not in the direction $$-\eta g_t$$ but rather in the direction $$-\eta (g_t + \lambda w)$$. This effectively shrinks $$w$$ at each step by $$\eta \lambda w$$, thus the name weight decay. To make this work in practice we just need to set the weight decay to something nonzero.
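
To make the algebra concrete, here is a small NumPy sketch (an illustration, not gluon's implementation) of a single update. It checks that stepping along $$-\eta (g_t + \lambda w)$$ is the same as first shrinking $$w$$ by the factor $$1 - \eta \lambda$$ and then taking the usual gradient step. The function name and the random vectors are assumptions made for this example.

```python
import numpy as np

def sgd_step_with_weight_decay(w, g, lr, wd):
    """One SGD step on the loss plus the penalty (wd / 2) * ||w||^2.

    Hypothetical helper; w are the weights, g is the gradient of the
    unpenalized loss, lr is eta and wd is lambda.
    """
    return w - lr * (g + wd * w)

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # made-up weights
g = rng.normal(size=5)   # made-up gradient of the data loss
lr, wd = 0.01, 0.001

w_new = sgd_step_with_weight_decay(w, g, lr, wd)
# Equivalent view: decay the weights first, then apply the plain gradient step
w_shrunk = (1 - lr * wd) * w - lr * g

print(np.allclose(w_new, w_shrunk))
```

Seen this way, weight decay pulls every coefficient slightly towards zero at each step, independently of what the data gradient says.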

In [ ]:

net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_reinit=True)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01, 'wd': 0.001})

moving_loss = 0.
niter=0
loss_seq_train = []
loss_seq_test = []
acc_seq_train = []
acc_seq_test = []

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        # record the forward pass so that backward() can compute gradients
        with mx.autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
        cross_entropy.backward()
        trainer.step(data.shape[0])

        ##########################
        #  Keep a moving average of the losses
        ##########################
        niter +=1
        moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()
        est_loss = moving_loss/(1-0.99**niter)

    test_accuracy, test_loss = evaluate_accuracy(test_data, net, loss)
    train_accuracy, train_loss = evaluate_accuracy(train_data, net, loss)

    # save them for later
    loss_seq_train.append(train_loss)
    loss_seq_test.append(test_loss)
    acc_seq_train.append(train_accuracy)
    acc_seq_test.append(test_accuracy)

    if e % 20 == 0:
        print("Completed epoch %s. Train Loss: %s, Test Loss %s, Train_acc %s, Test_acc %s" %
              (e+1, train_loss, test_loss, train_accuracy, test_accuracy))

## Plotting the learning curves
plot_learningcurves(loss_seq_train,loss_seq_test,acc_seq_train,acc_seq_test)


As we can see, the test accuracy improves a bit. Note that the amount by which it improves depends on the amount of weight decay. We recommend that you experiment with different amounts of weight decay. For instance, a larger weight decay (e.g. $$0.01$$) will lead to inferior performance, and one that’s larger still ($$0.1$$) will lead to terrible results. This is one of the reasons why tuning hyperparameters is so important for getting good experimental results in practice.
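
One rough way to build intuition for these values is to look only at the shrinkage and ignore the data gradient. The sketch below is a back-of-the-envelope calculation, not an experiment. It assumes the settings used above (learning rate 0.01, 1000 examples, batch size 64, 700 epochs) and that the data loader keeps the final partial batch, which gives 16 updates per epoch. It then computes how much of a weight would be left after all updates if nothing but the decay acted on it.

```python
import math

learning_rate = 0.01
num_examples, batch_size, epochs = 1000, 64, 700
# assumption: the last, smaller batch is kept, so 16 updates per epoch
steps = math.ceil(num_examples / batch_size) * epochs

def remaining_fraction(wd, lr=learning_rate, num_steps=steps):
    """Fraction of a weight left after num_steps pure decay steps.

    Hypothetical helper: each step multiplies the weight by (1 - lr * wd).
    """
    return (1.0 - lr * wd) ** num_steps

fractions = {wd: remaining_fraction(wd) for wd in (0.001, 0.01, 0.1)}
for wd, frac in fractions.items():
    print("wd=%g: fraction of the initial weight left = %.3g" % (wd, frac))
```

In a real run the data gradient pushes back against the decay, so the weights do not actually vanish. Still, the gap of several orders of magnitude between the smallest and the largest setting hints at why the larger values overwhelm the signal from the data.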