# Training on multiple GPUs with gluon

Gluon makes it easy to implement data-parallel training. In this notebook, we’ll implement data-parallel training for a convolutional neural network. If you’d like a finer-grained view of the concepts, you might want to first read the previous notebook, multi gpu from scratch with gluon.

To get started, let’s first define a simple convolutional neural network and loss function.

In [1]:

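import mxnet as mx
from mxnet import nd, gluon, autograd
net = gluon.nn.Sequential(prefix='cnn_')
with net.name_scope():
    # Layers are missing from this copy; assumed stand-in matching the outputs below.
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(10))

loss = gluon.loss.SoftmaxCrossEntropyLoss()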


## Initialize on multiple devices

Gluon supports initialization of network parameters over multiple devices. We accomplish this by passing in a list of device contexts instead of the single context we’ve used in earlier notebooks. When we pass in a list of contexts, the parameters are initialized to be identical across all of our devices.

In [2]:

GPU_COUNT = 2 # increase if you have more
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
net.collect_params().initialize(ctx=ctx)
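To make "identical across all of our devices" concrete, here is a toy sketch in plain Python that needs neither MXNet nor a GPU. The `init_on_devices` helper and the device-name strings are hypothetical stand-ins. The sketch draws the initial values once and hands each device its own copy, which is one simple way to obtain identical replicas and is not necessarily how gluon does it internally.

```python
import random

def init_on_devices(shape, devices, seed=0):
    """Draw initial weights once, then give every device its own copy."""
    rng = random.Random(seed)
    rows, cols = shape
    # Small random values; the range is an arbitrary choice for this sketch.
    master = [[rng.uniform(-0.07, 0.07) for _ in range(cols)] for _ in range(rows)]
    # Copy row by row so that each device owns independent storage.
    return {dev: [row[:] for row in master] for dev in devices}

replicas = init_on_devices((3, 3), ["gpu(0)", "gpu(1)"])
print(replicas["gpu(0)"] == replicas["gpu(1)"])  # True: same values
print(replicas["gpu(0)"] is replicas["gpu(1)"])  # False: separate copies
```

The replicas start out equal but are stored separately, so they only remain equal if every later update is applied to all of them.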


Given a batch of input data, we can split it into parts (one per context) by calling gluon.utils.split_and_load(batch, ctx). The split_and_load function doesn’t just split the data; it also loads each part onto the appropriate device context.

Now, when we run the forward pass on the two parts, each one is computed on its own device using the copy of the parameters stored there.

In [3]:

from mxnet.test_utils import get_mnist
mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT*2, :]
# split the batch and load each part onto its device
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))
print(net(data[1]))


[[-0.01876061 -0.02165037 -0.01293943  0.03837404 -0.00821797 -0.00911531
0.00416799 -0.00729158 -0.00232711 -0.00155549]
[ 0.00441474 -0.01953595 -0.00128483  0.02768224  0.01389615 -0.01320441
-0.01166505 -0.00637776  0.0135425  -0.00611765]]
<NDArray 2x10 @gpu(0)>

[[ -6.78736670e-03  -8.86893831e-03  -1.04004676e-02   1.72976423e-02
2.26115398e-02  -6.36630831e-03  -1.54974898e-02  -1.22633884e-02
1.19591374e-02  -6.60043515e-05]
[ -1.17358668e-02  -2.16879714e-02   1.71219767e-03   2.49827504e-02
1.16810966e-02  -9.52543691e-03  -1.03610428e-02   5.08510228e-03
7.06662657e-03  -9.25292261e-03]]
<NDArray 2x10 @gpu(1)>
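The splitting step can be pictured with a small plain-Python sketch. The `split_batch` helper is hypothetical and only mimics the slicing along the batch axis. It does not move anything onto a device, which gluon.utils.split_and_load does in addition. The sketch assumes the batch size is divisible by the number of parts.

```python
def split_batch(batch, num_parts):
    """Split a list of examples into num_parts equal, contiguous slices."""
    if len(batch) % num_parts != 0:
        raise ValueError("batch size must be divisible by the number of parts")
    step = len(batch) // num_parts
    return [batch[i * step:(i + 1) * step] for i in range(num_parts)]

batch = ["x0", "x1", "x2", "x3"]         # 4 examples, like GPU_COUNT * 2 above
parts = split_batch(batch, num_parts=2)  # one slice per device
print(parts)  # [['x0', 'x1'], ['x2', 'x3']]
```

With four examples and two devices, each device receives two examples, which matches the 2x10 outputs printed for gpu(0) and gpu(1).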


At any time, we can access the version of the parameters stored on each device. Recall from the first chapter that our weights may not actually be initialized when we call initialize because the parameter shapes may not yet be known. In these cases, initialization is deferred pending shape inference. Because we have already run a forward pass above, the shapes are now known and the weights are available.

In [4]:

weight = net.collect_params()['cnn_conv0_weight']

for c in ctx:
    print('=== channel 0 of the first conv on {} ==={}'.format(
        c, weight.data(ctx=c)[0]))

=== channel 0 of the first conv on gpu(0) ===
[[[ 0.04118239  0.05352169 -0.04762455]
[ 0.06035256 -0.01528978  0.04946674]
[ 0.06110793 -0.00081179  0.02191102]]]
<NDArray 1x3x3 @gpu(0)>
=== channel 0 of the first conv on gpu(1) ===
[[[ 0.04118239  0.05352169 -0.04762455]
[ 0.06035256 -0.01528978  0.04946674]
[ 0.06110793 -0.00081179  0.02191102]]]
<NDArray 1x3x3 @gpu(1)>


Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the batch (a different subset of examples), the gradients on each GPU vary.

In [5]:

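def forward_backward(net, data, label):
    # record the forward pass so that gradients can be computed
    with autograd.record():
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]
    for l in losses:
        l.backward()

# split the matching labels across the devices, as we did for the data
label = gluon.utils.split_and_load(mnist['train_label'][0:GPU_COUNT*2], ctx)
forward_backward(net, data, label)
for c in ctx:
    print('=== grad of channel 0 of the first conv2d on {} ==={}'.format(
        c, weight.grad(ctx=c)[0]))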

=== grad of channel 0 of the first conv2d on gpu(0) ===
[[[-0.02078936 -0.00562428  0.01711007]
[ 0.01138539  0.0280002   0.04094725]
[ 0.00993335  0.01218192  0.02122578]]]
<NDArray 1x3x3 @gpu(0)>
=== grad of channel 0 of the first conv2d on gpu(1) ===
[[[-0.02543036 -0.02789939 -0.00302115]
[-0.04816786 -0.03347274 -0.00403483]
[-0.03178394 -0.01254033  0.00855637]]]
<NDArray 1x3x3 @gpu(1)>
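Although the per-device gradients differ, they add up to the gradient of the whole batch when the loss is a sum over examples. The following plain-Python sketch checks this for a hypothetical one-parameter model with a squared-error loss. The model, the data values, and the `grad_w` helper are illustrative assumptions and are not part of the network above.

```python
def grad_w(w, xs, ys):
    """Gradient of sum_i 0.5 * (w * x_i - y_i)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Each "device" sees half of the batch and computes its own gradient.
g0 = grad_w(w, xs[:2], ys[:2])
g1 = grad_w(w, xs[2:], ys[2:])
g_full = grad_w(w, xs, ys)

print(g0, g1)             # -7.5 -37.5: the two halves give different gradients
print(g0 + g1 == g_full)  # True: their sum is the full-batch gradient
```

This is why data parallelism works: each device can process its own examples independently, and only the resulting gradients need to be combined.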


## Putting it all together

Now we can implement the remaining functions. Most of them are the same as when we did everything by hand. One notable difference is that when a gluon trainer sees parameters that live on multiple devices, it automatically aggregates the gradients and synchronizes the parameters.
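As a rough picture of what "aggregate the gradients and synchronize the parameters" means, here is a toy update step in plain Python. The `data_parallel_sgd_step` helper is hypothetical and is not the Trainer implementation. It assumes a single scalar parameter, gradients that are summed across devices, and a scaling by 1/batch_size, in line with the batch_size argument passed to trainer.step.

```python
def data_parallel_sgd_step(replicas, grads, lr, batch_size):
    """Apply one SGD update using gradients summed across devices.

    replicas: one copy of the (scalar) parameter per device
    grads:    one gradient per device, summed over that device's examples
    """
    total = sum(grads)  # aggregate across devices
    new_value = replicas[0] - lr * total / batch_size
    return [new_value for _ in replicas]  # every device gets the same value

replicas = [0.5, 0.5]    # identical copies on two devices
grads = [-7.5, -37.5]    # per-device gradients differ
replicas = data_parallel_sgd_step(replicas, grads, lr=0.1, batch_size=4)
print(replicas[0] == replicas[1])  # True: the copies stay in sync
```

Because every replica receives the same aggregated update, the copies that started out identical remain identical after each step.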

In [ ]:

from mxnet.io import NDArrayIter
from time import time

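def train_batch(batch, ctx, net, trainer):
    # split the data batch and load them on GPUs
    data = gluon.utils.split_and_load(batch.data[0], ctx)
    label = gluon.utils.split_and_load(batch.label[0], ctx)
    # compute gradients
    forward_backward(net, data, label)
    # update parameters
    trainer.step(batch.data[0].shape[0])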

def valid_batch(batch, ctx, net):
    data = batch.data[0].as_in_context(ctx[0])
    pred = nd.argmax(net(data), axis=1)
    return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()

def run(num_gpus, batch_size, lr):
    # the list of GPUs that will be used
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))

    # data iterator
    mnist = get_mnist()
    train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
    valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
    print('Batch size is {}'.format(batch_size))

    net.collect_params().initialize(force_reinit=True, ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
    for epoch in range(5):
        # train
        start = time()
        train_data.reset()
        for batch in train_data:
            train_batch(batch, ctx, net, trainer)
        nd.waitall()  # wait until all computations are finished to benchmark the time
        print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))

        # validating
        valid_data.reset()
        correct, num = 0.0, 0.0
        for batch in valid_data:
            correct += valid_batch(batch, ctx, net)
            num += batch.data[0].shape[0]
        print('         validation accuracy = %.4f'%(correct/num))

run(1, 64, .3)
run(GPU_COUNT, 64*GPU_COUNT, .3)

Running on [gpu(0)]
Batch size is 64
Epoch 0, training time = 5.0 sec
validation accuracy = 0.9738
Epoch 1, training time = 4.8 sec
validation accuracy = 0.9841
Epoch 2, training time = 4.7 sec
validation accuracy = 0.9863
Epoch 3, training time = 4.7 sec
validation accuracy = 0.9868
Epoch 4, training time = 4.7 sec
validation accuracy = 0.9877
Running on [gpu(0), gpu(1)]
Batch size is 128


## Conclusion

Both parameters and trainers in gluon support multiple devices, so moving from one device to several is straightforward.