Training on multiple GPUs with gluon

Gluon makes it easy to implement data-parallel training. In this notebook, we’ll implement data-parallel training for a convolutional neural network. If you’d like a finer-grained view of the concepts, you might want to first read the previous notebook, multi gpu from scratch with gluon.

To get started, let’s first define a simple convolutional neural network and loss function.

In [1]:
from mxnet import gluon, gpu
net = gluon.nn.Sequential(prefix='cnn_')
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
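    net.add(gluon.nn.Flatten())  # flatten feature maps before the dense layers
    net.add(gluon.nn.Dense(128, activation="relu"))
    net.add(gluon.nn.Dense(10))  # one output per MNIST class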

loss = gluon.loss.SoftmaxCrossEntropyLoss()

Initialize on multiple devices

Gluon supports initialization of network parameters over multiple devices. We accomplish this by passing in an array of device contexts, instead of the single contexts we’ve used in earlier notebooks. When we pass in an array of contexts, the parameters are initialized to be identical across all of our devices.

In [2]:
ctx = [gpu(0), gpu(1)]
# initialize the parameters on every device in the list
net.collect_params().initialize(ctx=ctx)

Given a batch of input data, we can split it into parts (one per context) by calling gluon.utils.split_and_load(batch, ctx). The split_and_load function doesn’t just split the data; it also loads each part onto the appropriate device context.

So now when we call the forward pass on the two separate parts, each one is computed on its corresponding device, using the copy of the parameters stored there.
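
Before running this on real devices, the splitting step on its own can be pictured with a minimal pure-Python sketch. The helper split_batch is hypothetical: it only mimics even splitting on plain lists, it is not Gluon’s implementation, and it does not move anything onto a device.

```python
def split_batch(batch, num_parts):
    """Split a list of examples into num_parts equal, contiguous slices.

    Hypothetical helper for illustration only; it works on plain Python
    lists and does not copy data to any device.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    if len(batch) % num_parts != 0:
        raise ValueError(
            "batch of size {} cannot be split evenly into {} parts".format(
                len(batch), num_parts))
    step = len(batch) // num_parts
    return [batch[i * step:(i + 1) * step] for i in range(num_parts)]


# Four examples shared between two devices.
parts = split_batch(["x0", "x1", "x2", "x3"], 2)
print(parts)
```

In the real setting, each slice would additionally be copied to its own context, so that the forward pass on that slice runs where the data lives.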

In [3]:
from mxnet.test_utils import get_mnist
mnist = get_mnist()
batch = mnist['train_data'][0:4, :]
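data = gluon.utils.split_and_load(batch, ctx)
# run the forward pass on each part; each runs on its own device
print(net(data[0]))
print(net(data[1]))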

[[-0.01017658  0.03012515  0.02999702  0.01175333 -0.01746453  0.00707828
   0.02404996  0.00616632 -0.02094562  0.0136827 ]
 [-0.01249129  0.0305641   0.02823936 -0.00159418 -0.00722831  0.00538148
   0.01476716  0.0225275  -0.02458289  0.0246105 ]]
<NDArray 2x10 @gpu(0)>

[[-0.00349744  0.01896121  0.02959755  0.00261514  0.00015916 -0.00355723
   0.0040103   0.03075583 -0.00761715  0.00599077]
 [-0.00557119  0.02766508  0.02406837 -0.0007478  -0.00511122  0.00538528
   0.00292899  0.01488838 -0.00191687  0.01074106]]
<NDArray 2x10 @gpu(1)>

At any time, we can access the version of the parameters stored on each device. Recall from the first chapter that our weights may not actually be initialized when we call initialize, because the parameter shapes may not yet be known. In these cases, initialization is deferred pending shape inference.
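
The idea of deferred initialization can be illustrated with a toy sketch. The class LazyWeight below is hypothetical and is not part of Gluon; it simply postpones allocating its values until the first input reveals the shape, and the initialization range is an arbitrary choice.

```python
import random


class LazyWeight:
    """Toy stand-in for a parameter whose shape is inferred at first use.

    Hypothetical class for illustration; it is not part of Gluon.
    """

    def __init__(self, num_outputs):
        self.num_outputs = num_outputs
        self.values = None           # no storage until the input size is known
        self.init_requested = False

    def initialize(self):
        # Only record the request; the input size is still unknown here.
        self.init_requested = True

    def forward(self, x):
        if self.values is None:
            if not self.init_requested:
                raise RuntimeError("call initialize() before forward()")
            # Shape inference: the input length fixes the weight shape.
            self.values = [[random.uniform(-0.1, 0.1) for _ in x]
                           for _ in range(self.num_outputs)]
        # Plain matrix-vector product, one output per weight row.
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.values]


w = LazyWeight(num_outputs=2)
w.initialize()
print(w.values is None)                   # nothing allocated yet
out = w.forward([1.0, 2.0, 3.0])
print(len(w.values), len(w.values[0]))    # shape is known after the first pass
```

This is why reading a parameter is only safe once at least one forward pass has run.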

In [4]:
weight = net.collect_params()['cnn_conv2d0_weight']

for c in ctx:
    print('=== channel 0 of the first conv2d on {} ==={}'.format(
        c, weight.data(ctx=c)[0]))

Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the batch (a different subset of examples), the gradients on each GPU vary.
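
A small worked example may help show why the per-device gradients differ and how they relate to the full-batch gradient. The sketch below uses a hypothetical one-parameter model and a summed squared-error loss, chosen only to keep the arithmetic readable; it is not the network or the loss used in this notebook.

```python
def grad_sum_squared_error(w, xs, ys):
    """Gradient with respect to w of sum_i (w * x_i - y_i) ** 2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Each "device" sees a different half of the batch.
grad_dev0 = grad_sum_squared_error(w, xs[:2], ys[:2])
grad_dev1 = grad_sum_squared_error(w, xs[2:], ys[2:])
grad_full = grad_sum_squared_error(w, xs, ys)

print(grad_dev0, grad_dev1)               # the two shards disagree
print(grad_dev0 + grad_dev1, grad_full)   # their sum matches the full batch
```

Because the loss is a sum over examples, adding the per-shard gradients recovers the gradient of the whole batch, which is what makes data parallelism work.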

In [5]:
def forward_backward(net, data, label):
    with gluon.autograd.record():
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]
    for l in losses:
        l.backward()

label = gluon.utils.split_and_load(mnist['train_label'][0:4], ctx)
forward_backward(net, data, label)
for c in ctx:
    print('=== grad of channel 0 of the first conv2d on {} ==={}'.format(
        c, weight.grad(ctx=c)[0]))
=== grad of channel 0 of the first conv2d on gpu(0) ===
[[[-0.00481181  0.02549154  0.05066926]
  [ 0.01503928  0.04740802  0.04111018]
  [ 0.04527877  0.06305876  0.04087965]]]
<NDArray 1x3x3 @gpu(0)>
=== grad of channel 0 of the first conv2d on gpu(1) ===
[[[-0.01102538 -0.02251887 -0.02211753]
  [-0.01587106 -0.03848278 -0.03960424]
  [-0.03371563 -0.06092874 -0.064744  ]]]
<NDArray 1x3x3 @gpu(1)>

Putting it all together

Now we can implement the remaining functions. Most of them are the same as when we did everything by hand. One notable difference is that a gluon Trainer recognizes multiple devices: it automatically aggregates the gradients and synchronizes the parameters.
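
What "aggregate and synchronize" means can be sketched on a single scalar weight. The helper data_parallel_sgd_step is hypothetical and is not Gluon’s implementation; dividing the summed gradient by the batch size is an assumption of this sketch, and the gradient values are made up for the example.

```python
def data_parallel_sgd_step(replicas, grads, lr, batch_size):
    """One toy data-parallel SGD update on a single scalar weight.

    replicas: per-device copies of the weight (expected to be identical)
    grads: per-device gradients, each summed over that device's shard
    Hypothetical helper for illustration; not Gluon's implementation.
    """
    total_grad = sum(grads)                        # aggregate across devices
    new_weight = replicas[0] - lr * total_grad / batch_size
    return [new_weight for _ in replicas]          # broadcast to every device


replicas = [0.5, 0.5]
replicas = data_parallel_sgd_step(replicas, grads=[-15.0, -75.0],
                                  lr=0.01, batch_size=4)
print(replicas)
```

After the step every replica holds the same value again, so the next forward pass starts from identical parameters on all devices.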

In [6]:
from mxnet import nd
from mxnet.io import NDArrayIter
from time import time

def train_batch(batch, ctx, net, trainer):
    # split the data batch and load them on GPUs
    data = gluon.utils.split_and_load(batch.data[0], ctx)
    label = gluon.utils.split_and_load(batch.label[0], ctx)
    # compute gradient
    forward_backward(net, data, label)
    # update parameters, passing the batch size so the gradient is normalized
    trainer.step(batch.data[0].shape[0])

def valid_batch(batch, ctx, net):
    data = batch.data[0].as_in_context(ctx[0])
    pred = nd.argmax(net(data), axis=1)
    return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()

def run(num_gpus, batch_size, lr):
    # the list of GPUs to use
    ctx = [gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))

    # data iterator
    mnist = get_mnist()
    train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
    valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
    print('Batch size is {}'.format(batch_size))

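    # (re)initialize the parameters on all devices so each run starts fresh
    net.collect_params().initialize(force_reinit=True, ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})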
    for epoch in range(5):
        # train
        start = time()
        train_data.reset()  # rewind the iterator at the start of each epoch
        for batch in train_data:
            train_batch(batch, ctx, net, trainer)
        nd.waitall()  # wait until all computations are finished to benchmark the time
        print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))

        # validating
        valid_data.reset()  # rewind the iterator before each validation pass
        correct, num = 0.0, 0.0
        for batch in valid_data:
            correct += valid_batch(batch, ctx, net)
            num += batch.data[0].shape[0]
        print('         validation accuracy = %.4f'%(correct/num))

run(1, 64, .3)
run(2, 128, .6)
Running on [gpu(0)]
Batch size is 64
Epoch 0, training time = 4.9 sec
         validation accuracy = 0.9764
Epoch 1, training time = 4.7 sec
         validation accuracy = 0.9839
Epoch 2, training time = 4.6 sec
         validation accuracy = 0.9836
Epoch 3, training time = 4.7 sec
         validation accuracy = 0.9861
Epoch 4, training time = 4.7 sec
         validation accuracy = 0.9862
Running on [gpu(0), gpu(1)]
Batch size is 128
Epoch 0, training time = 2.8 sec
         validation accuracy = 0.8951
Epoch 1, training time = 2.8 sec
         validation accuracy = 0.9687
Epoch 2, training time = 2.8 sec
         validation accuracy = 0.9759
Epoch 3, training time = 2.8 sec
         validation accuracy = 0.9785
Epoch 4, training time = 2.9 sec
         validation accuracy = 0.9813
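
The two calls to run double both the batch size and the learning rate when going from one GPU to two, which keeps 64 examples per GPU. The sketch below spells out that arithmetic and computes the average per-epoch speedup from the timings logged above. The helper scale_for_gpus is hypothetical, and scaling the learning rate linearly with the batch size is a common heuristic rather than a rule.

```python
def scale_for_gpus(base_batch_size, base_lr, num_gpus):
    """Grow batch size and learning rate in proportion to the GPU count.

    Hypothetical helper; linear scaling is a common heuristic, not a rule.
    """
    return base_batch_size * num_gpus, base_lr * num_gpus


def mean(values):
    return sum(values) / len(values)


print(scale_for_gpus(64, 0.3, 1))
print(scale_for_gpus(64, 0.3, 2))

# Epoch times in seconds, copied from the log above.
one_gpu = [4.9, 4.7, 4.6, 4.7, 4.7]
two_gpus = [2.8, 2.8, 2.8, 2.8, 2.9]
speedup = mean(one_gpu) / mean(two_gpus)
print('speedup with 2 GPUs: %.2fx' % speedup)
```

On these numbers, two GPUs are roughly 1.7 times faster per epoch, short of the ideal 2x, while validation accuracy after five epochs is slightly lower (0.9813 versus 0.9862).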


Both parameters and trainers in gluon support multiple devices, so moving from one device to several is straightforward.