Transferring knowledge through finetuning

In previous chapters, we demonstrated how to train a neural network to recognize the categories corresponding to objects in images. We looked at toy datasets like hand-written digits and thumbnail-sized pictures of animals. And we talked about the ImageNet dataset, the default academic benchmark, which contains 1 million images, 1000 each from 1000 separate classes.

The ImageNet dataset categorically changed what was possible in computer vision. It turns out some things are possible (these days, even easy) on gigantic datasets that simply aren’t with smaller datasets. In fact, we don’t know of any technique that can train a comparably powerful model on a similar photograph dataset containing only, say, 10k images.

And that’s a problem. Because however impressive the results of CNNs on ImageNet may be, most people aren’t interested in ImageNet itself. They’re interested in their own problems. Recognize people based on pictures of their faces. Distinguish between photographs of \(10\) different types of coral on the ocean floor. Usually when individuals (and not Amazon, Google, or inter-institutional big science initiatives) are interested in solving a computer vision problem, they come to the table with modestly sized datasets. A few hundred examples may be common and a few thousand examples may be as much as you can reasonably ask for.

So one natural question emerges: can we somehow take the powerful models trained on millions of examples from one dataset and use them to improve performance on a new problem with a much smaller dataset? This kind of problem (learning on a source dataset and bringing that knowledge to a target dataset) is appropriately called transfer learning. Fortunately, we have some effective tools for solving this problem.

For deep neural networks, the most popular approach is called finetuning and the idea is both simple and effective:

  • Train a neural network on the source task \(S\).

  • Decapitate it, replacing its output layer with one appropriate to the target task \(T\).

  • Initialize the weights on the new output layer randomly, keeping all other (pretrained) weights the same.

  • Begin training on the new dataset.
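
As a rough illustration of the four steps, the following pure-Python sketch uses a toy network whose layers are plain lists of weights. The layer sizes, the `decapitate` helper, and the random "pretrained" weights are all hypothetical stand-ins. Real fine-tuning operates on a trained network such as the one used later in this section.

```python
import random

rng = random.Random(0)

def random_layer(n_in, n_out, scale):
    """Return an n_in x n_out weight matrix drawn from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]

# Step 1 (stand-in): pretend these weights came from training on the source task S.
source_net = {
    "features": random_layer(8, 16, scale=1.0),   # feature extractor
    "output": random_layer(16, 1000, scale=1.0),  # 1000 source classes
}

def decapitate(net, num_target_classes, scale=0.02):
    """Steps 2 and 3: keep the features, attach a randomly initialized output layer."""
    hidden_size = len(net["output"])
    return {
        "features": net["features"],  # pretrained weights are reused unchanged
        "output": random_layer(hidden_size, num_target_classes, scale),
    }

target_net = decapitate(source_net, num_target_classes=2)

# Step 4 would now train target_net on the (small) target dataset.
print(len(target_net["output"]), len(target_net["output"][0]))
```

The feature weights are shared rather than copied here, which mirrors the idea that only the output layer starts from scratch.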

This might be clearer if we visualize the algorithm:


In this section, we’ll demonstrate fine-tuning using the popular and compact SqueezeNet architecture. Since we don’t want to saddle you with the burden of downloading ImageNet, or of training on ImageNet from scratch, we’ll pull the weights of a pretrained SqueezeNet from the internet. Specifically, we’ll start from a SqueezeNet 1.1 that was pre-trained on imagenet-12 and fine-tune it to recognize hotdogs.

hot dog

We’ll start with the obligatory ritual of importing a bunch of stuff that you’ll need later.

[ ]:
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)


We’ll set a few settings up here that you can configure later to manipulate the behavior of the algorithm. These are mostly familiar. Hybrid mode uses the just-in-time compiler described in our chapter on high performance training to make the network much faster to train. Since we’re not working with any crazy dynamic graphs that can’t be compiled, there’s no reason not to hybridize. The batch size, number of training epochs, weight decay, and learning rate should all be familiar by now. The positive class weight says how much more we should upweight the importance of positive instances (photos of hot dogs) in the objective function. We use this to combat the extreme class imbalance (not surprisingly, most pictures do not depict hot dogs).

[ ]:
# Demo mode uses the validation dataset for training, which is smaller and faster to train.
demo = True
log_interval = 100

# Options are imperative or hybrid. Use hybrid for better performance.
mode = 'hybrid'

# training hyperparameters
batch_size = 256
if demo:
    epochs = 5
    learning_rate = 0.02
    wd = 0.002
else:
    epochs = 40
    learning_rate = 0.05
    wd = 0.002

# the class weight for hotdog class to help the imbalance problem.
positive_class_weight = 5
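
To get a feel for what `positive_class_weight` does, here is a small sketch of the per-example scale factor that the training loop later in this section applies to the loss, `(1 + y * positive_class_weight) / positive_class_weight`. The helper name `loss_scale` is ours and is introduced only for illustration. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

def loss_scale(label, positive_class_weight):
    """Scale factor applied to one example's loss; label is 1 for hot dog, 0 otherwise."""
    return Fraction(1 + label * positive_class_weight, positive_class_weight)

positive_class_weight = 5
negative_scale = loss_scale(0, positive_class_weight)  # 1/5
positive_scale = loss_scale(1, positive_class_weight)  # 6/5

# How many times more a positive example counts than a negative one.
relative_weight = positive_scale / negative_scale
print(negative_scale, positive_scale, relative_weight)
```

With the setting of 5 used here, a hot dog photo contributes six times as much to the objective as a non-hot-dog photo, because of the `1 +` term in the numerator.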
[ ]:
from __future__ import print_function
import logging
import os
import time
from collections import OrderedDict
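import skimage.io as io

import mxnet as mx
from mxnet.test_utils import download

# make logging.info messages visible in the notebook
logging.basicConfig(level=logging.INFO)

# setup the contexts; will use gpus if available, otherwise cpu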
gpus = mx.test_utils.list_gpus()
contexts = [mx.gpu(i) for i in gpus] if len(gpus) > 0 else [mx.cpu()]


Formally, hot dog recognition is a binary classification problem. We’ll use \(1\) to represent the hotdog class, and \(0\) for the not hotdog class. Our hot dog dataset (the target dataset which we’ll fine-tune the model to) contains 18,141 sample images, 2091 of which are hotdogs. Because the dataset is imbalanced (for instance, the hotdog class makes up only 1% of the mscoco dataset), sampling interesting negative samples can help to improve the performance of our algorithm. Thus, in the negative class in our dataset, two thirds are images from food categories other than hotdogs (e.g. pizza), and one third are images from all other categories.


We prepare the dataset in the format of MXRecord using the im2rec tool. As of the current draft, rec files are not yet explained in the book, but if you’re reading after November or December 2017 and you still see this note, open an issue on GitHub and let us know to stop slacking off.

  • not_hotdog_train.rec 641M (1882 positive, 10000 interesting negative, and 5000 random negative)

  • not_hotdog_validation.rec 49M (209 positive, 700 interesting negative, and 350 random negative)
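
The counts above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers from the two bullet points; the dictionary layout and variable names are ours.

```python
from fractions import Fraction

splits = {
    "train": {"positive": 1882, "interesting_negative": 10000, "random_negative": 5000},
    "validation": {"positive": 209, "interesting_negative": 700, "random_negative": 350},
}

def total(kind):
    """Sum one kind of example over the train and validation splits."""
    return sum(split[kind] for split in splits.values())

num_images = sum(sum(split.values()) for split in splits.values())
num_positive = total("positive")
num_negative = total("interesting_negative") + total("random_negative")

# Share of negatives drawn from other food categories.
interesting_share = Fraction(total("interesting_negative"), num_negative)

print(num_images, num_positive, interesting_share)
```

The totals agree with the 18,141 images and 2091 hot dogs quoted earlier, which means roughly 11.5% of all images are positives.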

[ ]:
dataset_files = {'train': ('not_hotdog_train-e6ef27b4.rec', '0aad7e1f16f5fb109b719a414a867bbee6ef27b4'),
                 'validation': ('not_hotdog_validation-c0201740.rec', '723ae5f8a433ed2e2bf729baec6b878ac0201740')}

To demo the model here, we’re just going to use the smaller validation set. But if you’re interested in training on the full set, set ‘demo’ to False in the settings at the beginning. Now we’re ready to download and verify the dataset.

[ ]:
if demo:
    training_dataset, training_data_hash = dataset_files['validation']
else:
    training_dataset, training_data_hash = dataset_files['train']

validation_dataset, validation_data_hash = dataset_files['validation']

def verified(file_path, sha1hash):
    import hashlib
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            # read in 1 MiB chunks (the chunk size is an arbitrary choice)
            data = f.read(1048576)
            if not data:
                break
            sha1.update(data)
    matched = sha1.hexdigest() == sha1hash
    if not matched:
        logging.warning('Found hash mismatch in file {}, possibly due to incomplete download.'
                        .format(file_path))
    return matched

url_format = '{}'
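if not os.path.exists(training_dataset) or not verified(training_dataset, training_data_hash):
    logging.info('Downloading training dataset.')
    download(url_format.format(training_dataset), overwrite=True)
if not os.path.exists(validation_dataset) or not verified(validation_dataset, validation_data_hash):
    logging.info('Downloading validation dataset.')
    download(url_format.format(validation_dataset), overwrite=True)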


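The record files can be read using mx.io.ImageRecordIter.

[ ]:
# load dataset
train_iter = mx.io.ImageRecordIter(path_imgrec=training_dataset,
                                   data_shape=(3, 224, 224),
                                   shuffle=True,
                                   batch_size=batch_size)
val_iter = mx.io.ImageRecordIter(path_imgrec=validation_dataset,
                                 data_shape=(3, 224, 224),
                                 batch_size=batch_size)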


The model we are finetuning is SqueezeNet. The Gluon model zoo offers SqueezeNet v1.0 and v1.1, both pretrained on ImageNet. SqueezeNet is just a convolutional neural network, with an architecture chosen to have a small number of parameters and to require a minimal amount of computation. It’s especially popular with folks who need to run CNNs on low-powered devices like cell phones and other internet-of-things devices.

Pulling the pre-trained model

Fortunately, MXNet has a model zoo that gives us convenient access to a number of popular models, both their architectures and their pretrained parameters. Let’s download SqueezeNet right now with just a few lines of code.

[ ]:
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision as models

# get pretrained squeezenet
net = models.squeezenet1_1(pretrained=True, prefix='deep_dog_', ctx=contexts)
# hot dog happens to be a class in imagenet.
# we can reuse the weight for that class for better performance
# here's the index for that class for later use
imagenet_hotdog_index = 713

DeepDog net

We can now use the feature extractor part from the pretrained squeezenet to build our own network. The model zoo even handles the decapitation for us. All we have to do is specify the number of output classes in our new task, which we do via the keyword argument classes=2.

[ ]:
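deep_dog_net = models.squeezenet1_1(prefix='deep_dog_', classes=2)
# initialize all parameters, then swap in the pretrained feature extractor
deep_dog_net.collect_params().initialize(ctx=contexts)
deep_dog_net.features = net.features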

The network can already be used for prediction. However, since its new output layer is randomly initialized and hasn’t been fine-tuned yet, its predictions are likely to be poor.

[ ]:
from skimage.color import rgba2rgb

def classify_hotdog(net, url, contexts):
    I = io.imread(url)
    if I.shape[2] == 4:
        # rgba2rgb returns floats in [0, 1], so rescale back to 0..255
        I = (rgba2rgb(I) * 255).astype(np.uint8)
    image = mx.nd.array(I).astype(np.uint8)
    # show the original image on the left
    plt.subplot(1, 2, 1)
    plt.imshow(image.asnumpy())
    image = mx.image.resize_short(image, 256)
    image, _ = mx.image.center_crop(image, (224, 224))
    # show the resized and cropped network input on the right
    plt.subplot(1, 2, 2)
    plt.imshow(image.asnumpy())
    image = mx.image.color_normalize(image.astype(np.float32)/255,
                                     mean=mx.nd.array([0.485, 0.456, 0.406]),
                                     std=mx.nd.array([0.229, 0.224, 0.225]))
    # height x width x channel -> channel x height x width
    image = mx.nd.transpose(image.astype('float32'), (2,0,1))
    image = mx.nd.expand_dims(image, axis=0)
    out = mx.nd.SoftmaxActivation(net(image.as_in_context(contexts[0])))
    print('Probabilities are: '+str(out[0].asnumpy()))
    result = np.argmax(out.asnumpy())
    outstring = ['Not hotdog!', 'Hotdog!']
    print(outstring[result])
[ ]:
classify_hotdog(deep_dog_net, '../img/real_hotdog.jpg', contexts)
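
The preprocessing inside `classify_hotdog` boils down to three steps: resize so the shorter side is 256 pixels, take a 224 by 224 center crop, and normalize each channel with the ImageNet mean and standard deviation. The sketch below reproduces that arithmetic in plain Python for a single image size and a single pixel. The helper names are hypothetical, and the integer rounding is an assumption that may differ slightly from what `mx.image` does internally.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def resize_short_shape(width, height, size=256):
    """(width, height) after scaling so the shorter side equals `size`."""
    if height > width:
        return size, size * height // width
    return size * width // height, size

def center_crop_box(width, height, crop=224):
    """Top-left corner and size (x0, y0, w, h) of a centered crop."""
    return (width - crop) // 2, (height - crop) // 2, crop, crop

def normalize_pixel(rgb):
    """Scale 0..255 channel values to 0..1, then subtract the mean and divide by the std."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))

resized = resize_short_shape(640, 480)
box = center_crop_box(*resized)
white = normalize_pixel((255, 255, 255))
print(resized, box, [round(v, 3) for v in white])
```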

Reuse class weights

As mentioned earlier, in addition to the feature extractor, we can reuse the class weights for hot dog from the pretrained model, since hot dog was already a class in ImageNet. To do that, we need to get the weights from the classifier layers of the pretrained model, find the right slice, and put it into our two-class classifier.

[ ]:
# let's examine the output layer and find the last conv layer
print(net.output)
[ ]:
# the last conv layer is the first layer (index 0) of the output block
pretrained_conv_params = net.output[0].params

# weights can then be found from the above parameter dict
pretrained_weight_param = pretrained_conv_params.get('weight')
pretrained_bias_param = pretrained_conv_params.get('bias')

# next, we locate the right slice that we're interested in.
hotdog_w = mx.nd.split(pretrained_weight_param.data(contexts[0]),
                       1000, axis=0)[imagenet_hotdog_index]
hotdog_b = mx.nd.split(pretrained_bias_param.data(contexts[0]),
                       1000, axis=0)[imagenet_hotdog_index]

# our classifier is for two classes. here, we reuse the hotdog class weight,
# and randomly initialize the 'not hotdog' class.
new_classifier_w = mx.nd.concat(mx.nd.random_normal(shape=hotdog_w.shape, scale=0.02, ctx=contexts[0]),
                                hotdog_w, dim=0)
new_classifier_b = mx.nd.concat(mx.nd.random_normal(shape=hotdog_b.shape, scale=0.02, ctx=contexts[0]),
                                hotdog_b, dim=0)

# finally, we set the values of the parameter buffers.
# since the output block is a HybridSequential/Sequential, the following
# takes its zero-indexed 0-th layer
final_conv_layer_params = deep_dog_net.output[0].params
final_conv_layer_params.get('weight').set_data(new_classifier_w)
final_conv_layer_params.get('bias').set_data(new_classifier_b)
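
The slicing and concatenation can be easier to follow on a toy example. The sketch below uses nested Python lists in place of NDArrays, a made-up 4-class "pretrained" classifier with 3 input features, and a hypothetical class index. Only the idea mirrors the cell above: row 0 is random for the not hotdog class, and row 1 is the copied hot dog row.

```python
import random

rng = random.Random(0)

# Toy stand-in for the pretrained classifier: one weight row and one bias per class.
pretrained_w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
pretrained_b = [0.01, 0.02, 0.03, 0.04]
source_class_index = 2  # hypothetical index of the class we want to reuse

def build_binary_classifier(weights, biases, index, scale=0.02):
    """Class 0 gets small random parameters; class 1 reuses the pretrained row."""
    reused_w = list(weights[index])
    random_w = [rng.gauss(0.0, scale) for _ in reused_w]
    new_w = [random_w, reused_w]
    new_b = [rng.gauss(0.0, scale), biases[index]]
    return new_w, new_b

new_w, new_b = build_binary_classifier(pretrained_w, pretrained_b, source_class_index)
print(new_w[1], new_b[1])
```

The order matters: the random row comes first because label 0 denotes the not hotdog class, and the reused row comes second because label 1 denotes hot dogs.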


Our task is a binary classification problem with imbalanced classes. So we’ll monitor performance using both accuracy and F1 score, a metric favored in settings with extreme class imbalance. The F1 score is the harmonic mean of precision and recall on the positive class, so a classifier that never predicts a hot dog scores poorly no matter how high its accuracy is.

[ ]:
# return metrics string representation
def metric_str(names, accs):
    return ', '.join(['%s=%f'%(name, acc) for name, acc in zip(names, accs)])
metric = mx.metric.create(['acc', 'f1'])
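
As a reminder of why accuracy alone can mislead on imbalanced data, the following sketch computes accuracy and F1 by hand for two sets of predictions on ten examples, two of which are positive. The function names and the toy labels are ours; during training, `mx.metric` computes these quantities for us.

```python
def accuracy(labels, preds):
    """Fraction of examples whose prediction matches the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def f1_score(labels, preds):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0  # convention used here: no true positives means an F1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(labels, always_negative), f1_score(labels, always_negative))
print(accuracy(labels, one_hit), f1_score(labels, one_hit))
```

Both predictors reach the same accuracy, but only the F1 score reveals that the first one never finds a hot dog.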

The following snippet performs inferences on evaluation dataset, and updates the metrics. Once the evaluation data iterator is exhausted, it returns the values of each of the metrics.

[ ]:
import mxnet.gluon as gluon
from mxnet.image import color_normalize

def evaluate(net, data_iter, ctx):
    # rewind the iterator so that every call sees the full dataset
    data_iter.reset()
    for batch in data_iter:
        data = color_normalize(batch.data[0]/255,
                               mean=mx.nd.array([0.485, 0.456, 0.406]).reshape((1,3,1,1)),
                               std=mx.nd.array([0.229, 0.224, 0.225]).reshape((1,3,1,1)))
        data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
        outputs = []
        for x in data:
            outputs.append(net(x))
        metric.update(label, outputs)
    out = metric.get()
    # clear the shared metric so evaluation results don't leak into training metrics
    metric.reset()
    return out


We can now train the model just as we would any supervised model. In this example, we set up the training loop for multi-GPU use, as described elsewhere in the book both from first principles and in the context of gluon.

[ ]:
import mxnet.autograd as autograd

def train(net, train_iter, val_iter, epochs, ctx):
    if isinstance(ctx, mx.Context):
        ctx = [ctx]
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': learning_rate, 'wd': wd})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()

    best_f1 = 0
    val_names, val_accs = evaluate(net, val_iter, ctx)
    logging.info('[Initial] validation: %s'%(metric_str(val_names, val_accs)))
    for epoch in range(epochs):
        tic = time.time()
        # rewind the training iterator at the start of every epoch
        train_iter.reset()
        btic = time.time()
        for i, batch in enumerate(train_iter):
            # the model zoo models expect normalized images
            data = color_normalize(batch.data[0]/255,
                                   mean=mx.nd.array([0.485, 0.456, 0.406]).reshape((1,3,1,1)),
                                   std=mx.nd.array([0.229, 0.224, 0.225]).reshape((1,3,1,1)))
            data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
            label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
            outputs = []
            Ls = []
            with autograd.record():
                for x, y in zip(data, label):
                    z = net(x)
                    # rescale the loss based on class to counter the imbalance problem
                    L = loss(z, y) * (1+y*positive_class_weight)/positive_class_weight
                    # store the loss and do backward after we have done forward
                    # on all GPUs for better speed on multiple GPUs.
                    Ls.append(L)
                    outputs.append(z)
                for L in Ls:
                    L.backward()
            # update the weights, normalizing by the number of samples in the batch
            trainer.step(batch.data[0].shape[0])
            metric.update(label, outputs)
            if log_interval and not (i+1)%log_interval:
                names, accs = metric.get()
                logging.info('[Epoch %d Batch %d] speed: %f samples/s, training: %s'%(
                             epoch, i, batch_size/(time.time()-btic), metric_str(names, accs)))
            btic = time.time()

        names, accs = metric.get()
        metric.reset()
        logging.info('[Epoch %d] training: %s'%(epoch, metric_str(names, accs)))
        logging.info('[Epoch %d] time cost: %f'%(epoch, time.time()-tic))
        val_names, val_accs = evaluate(net, val_iter, ctx)
        logging.info('[Epoch %d] validation: %s'%(epoch, metric_str(val_names, val_accs)))

        if val_accs[1] > best_f1:
            best_f1 = val_accs[1]
            logging.info('Best validation f1 found. Checkpointing...')
            net.save_parameters('deep-dog-%d.params'%(epoch))

if mode == 'hybrid':
    deep_dog_net.hybridize()
if epochs > 0:
    train(deep_dog_net, train_iter, val_iter, epochs, contexts)

Try it out!

Once our model is trained, we can either use the deep_dog_net model in the notebook kernel, or load it from the best checkpoint.

[ ]:
# Uncomment below line and replace the file name with the last checkpoint.
# deep_dog_net.load_parameters('deep-dog-3.params', contexts)
# Alternatively, load the model that we finetuned, which reached a
# validation F1 score of 0.74. The file must already be in the working directory.
deep_dog_net.load_parameters('deep-dog-5a342a6f.params', contexts)
[ ]:
classify_hotdog(deep_dog_net, '../img/real_hotdog.jpg', contexts)
[ ]:
classify_hotdog(deep_dog_net, '../img/leg_hotdog.jpg', contexts)
[ ]:
classify_hotdog(deep_dog_net, '../img/dog_hotdog.jpg', contexts)


As you can see, given a pretrained model, we can get a great classifier, even for tasks where we simply don’t have enough data to train from scratch. That’s because the representations necessary to perform both tasks have a lot in common. Since they both address natural images, they both require recognizing textures, shapes, edges, etc. Whenever you have a small enough dataset that you fear impoverishing your model, try thinking about what larger datasets you might be able to pre-train your model on, so that you can just perform fine-tuning on the task at hand.