Fast, portable neural networks with Gluon HybridBlocks

The tutorials we saw so far adopt the imperative, or define-by-run, programming paradigm. It might not even occur to you to give a name to this style of programming because it’s how we always write Python programs.

Take for example a prototypical program written below in pseudo-Python. We grab some input arrays, we compute upon them to produce some intermediate values, and finally we produce the result that we actually care about.

def our_function(A, B, C, D):
    # Compute some intermediate values
    E = basic_function1(A, B)
    F = basic_function2(C, D)

    # Finally, produce the thing you really care about
    G = basic_function3(E, F)
    return G

# Load up some data
W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()

result = our_function(W, X, Y, Z)

As you might expect, when we compute E, we’re actually performing some numerical operation, like multiplication, and returning an array that we assign to the variable E. The same goes for F. And if we want to do a similar computation many times by putting these lines in a function, our program will have to step through these three lines of Python each time.

The advantage of this approach is that it’s so natural that it might not even occur to some people that there is another way. But the disadvantage is that it’s slow. That’s because we are constantly engaging the Python execution environment (which is slow) even though our entire function performs the same three low-level operations in the same sequence every time. It’s also holding on to the intermediate values E and F until the function returns, even though we can see that they’re not needed once G has been computed. We might have made this program more efficient by re-using memory from either E or F to store the result G.

There actually is a different way to do things. It’s called symbolic programming, and most of the early deep learning libraries, including Theano and TensorFlow, embraced this approach exclusively. You might have also heard this approach referred to as declarative programming or define-then-run programming. These all mean the same thing. The approach consists of three basic steps:

  • Define a computation workflow, like a pass through a neural network, using placeholder data
  • Compile the program into a format that is independent of the front-end language (e.g. Python)
  • Invoke the compiled function, feeding it real data

Revisiting our previous pseudo-Python example, a symbolic version of the same program might look something like this:

# Create some placeholders to stand in for real data that might be supplied to the compiled function.
A = placeholder()
B = placeholder()
C = placeholder()
D = placeholder()

# Compute some intermediate values
E = symbolic_function1(A, B)
F = symbolic_function2(C, D)

# Finally, produce the thing you really care about
G = symbolic_function3(E, F)

our_function = library.compile(inputs=[A, B, C, D], outputs=[G])

# Load up some data
W = some_stuff()
X = some_stuff()
Y = some_stuff()
Z = some_stuff()

result = our_function(W, X, Y, Z)

Here, when we run the line E = symbolic_function1(A, B), no numerical computation actually happens. Instead, the symbolic library notes the way that E is related to A and B and records this information. We don’t do actual computation, we just make a roadmap for how to go from inputs to outputs. Because we can draw all of the variables and operations (both inputs and intermediate values) as nodes, and the relationships between nodes as edges, we call the resulting roadmap a computational graph. In the symbolic approach, we first define the entire graph, and then compile it.
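
To make the record-then-run idea concrete, here is a minimal sketch of a symbolic library in plain Python. The names Node, placeholder, symbolic and compile_graph are hypothetical and exist only for this illustration; real libraries are far more elaborate and would also optimize the graph before running it.

```python
import operator


class Node:
    """One vertex of the computational graph: a placeholder or an operation."""

    def __init__(self, op=None, inputs=(), name=None):
        self.op = op            # None marks a placeholder
        self.inputs = inputs    # edges to the nodes this one depends on
        self.name = name


def placeholder(name):
    return Node(name=name)


def symbolic(op):
    """Wrap a plain function so that calling it records a node instead of computing."""
    def record(*inputs):
        return Node(op=op, inputs=inputs)
    return record


def compile_graph(inputs, output):
    """Return a function that binds real data to the placeholders and runs the graph."""
    def run(*values):
        cache = dict(zip(inputs, values))   # placeholder node -> real value

        def evaluate(node):
            if node not in cache:
                cache[node] = node.op(*(evaluate(i) for i in node.inputs))
            return cache[node]

        return evaluate(output)
    return run


A, B, C, D = (placeholder(n) for n in "ABCD")
E = symbolic(operator.mul)(A, B)   # nothing is multiplied here
F = symbolic(operator.add)(C, D)   # nothing is added here
G = symbolic(operator.sub)(E, F)

our_function = compile_graph(inputs=[A, B, C, D], output=G)
result = our_function(2, 3, 4, 5)   # (2 * 3) - (4 + 5)
print(result)                       # -3
```

Until our_function is called, E, F and G are only descriptions of work to be done; the arithmetic happens inside run.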

Imperative Programs Tend to be More Flexible

When you’re using an imperative-style library from Python, you are writing in Python. Nearly anything that would be intuitive to write in Python, you could accelerate by calling down in the appropriate places to the imperative deep learning library. On the other hand, when you write a symbolic program, you may not have access to all the familiar Python constructs, like iteration. Imperative programs are also much easier to debug. For one, because all the intermediate values hang around, it’s easy to introspect the program later. For another, we can just stick print statements in between operations.

In short, from a developer’s standpoint, imperative programs are just better. They’re a joy to work with. You don’t have the tricky indirection of working with placeholders. You can do anything that you can do with native Python. And faster debugging means you get to try out more ideas. But the catch is that imperative programs are comparatively slow.

Symbolic Programs Tend to be More Efficient

The main reason to write symbolic programs is efficiency, both in terms of memory and speed. Consider the following toy program:

import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
...

Assume that each cell in the array occupies 8 bytes of memory. How much memory do we need to execute this program in the Python console? As an imperative program, we need to allocate memory at each line. That leaves us allocating 4 arrays of size 10, so we’ll need \(4 * 10 * 8 = 320\) bytes. On the other hand, if we built a computation graph and knew in advance that we only needed d, we could reuse the memory originally allocated for intermediate values. For example, by performing computations in-place, we might recycle the memory allocated for b to store c, and then recycle that same memory again to store d. In the end we could cut our memory requirement in half, requiring just \(2 * 10 * 8 = 160\) bytes.
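
The arithmetic above can be checked with NumPy. The sketch below does by hand what a graph optimizer might do automatically: it uses the out= argument of np.multiply and np.add so that c and d overwrite the buffer that held b. The variable buf and the two byte counters are names introduced for this illustration.

```python
import numpy as np

a = np.ones(10)
b = np.ones(10) * 2

# Imperative style: every line allocates a fresh array.
c = b * a
d = c + 1
imperative_bytes = a.nbytes + b.nbytes + c.nbytes + d.nbytes

# If only the final result is needed, the intermediates can share one buffer.
buf = np.ones(10) * 2            # plays the role of b
np.multiply(buf, a, out=buf)     # c overwrites b
np.add(buf, 1, out=buf)          # d overwrites c
in_place_bytes = a.nbytes + buf.nbytes

print(imperative_bytes, in_place_bytes)   # 320 160
```

The in-place version is only safe because nothing reads b or c afterwards, which is exactly the knowledge a complete graph provides.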

Symbolic programs can also perform another kind of optimization, called operation folding. Returning to our toy example, the multiplication and addition operations can be folded into one operation. If the computation runs on a GPU, one GPU kernel will be executed instead of two. In fact, this is one way we hand-craft operations in optimized libraries, such as CXXNet and Caffe. Operation folding improves computation efficiency. Note that you can’t perform operation folding in imperative programs, because the intermediate values might be referenced in the future. Operation folding is possible in symbolic programs because we get the entire computation graph in advance, before actually doing any calculation, giving us a clear specification of which values will be needed and which will not.
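
As a rough illustration of folding, the sketch below uses plain Python lists to stand in for device arrays and list comprehensions to stand in for kernels. The function names unfused and fused are hypothetical; a real library would fold the operations at the kernel level rather than in Python.

```python
def unfused(a, b):
    """Two passes over the data; the intermediate c is fully materialized."""
    c = [x * y for x, y in zip(b, a)]   # first "kernel"
    d = [x + 1 for x in c]              # second "kernel"
    return d


def fused(a, b):
    """One pass over the data; c is never stored."""
    return [x * y + 1 for x, y in zip(b, a)]


a = [1.0] * 10
b = [2.0] * 10
print(unfused(a, b) == fused(a, b))   # True
```

Both functions return the same d, but the fused version is only valid if no one else needs c.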

Getting the best of both worlds with MXNet Gluon’s HybridBlocks

Most libraries deal with the imperative / symbolic design problem by simply choosing a side. Theano and the frameworks it inspired, like TensorFlow, went the symbolic way. And because the first versions of MXNet were optimized for performance, they also went symbolic. Chainer and its descendants, like PyTorch, are fully imperative. In designing MXNet Gluon, we asked the following question: is it possible to get all of the benefits of imperative programming but still exploit, whenever possible, the speed and memory efficiency of symbolic programming? In other words, a user should be able to use Gluon fully imperatively. And if they never want their lives to be more complicated, then they can get on just fine imagining that the story ends there. But when a user needs production-level performance, it should be easy to compile the entire compute graph, or at least to compile large subsets of it.

MXNet accomplishes this through the use of HybridBlocks. Each HybridBlock can run fully imperatively, defining its computation with real functions acting on real inputs. But it is also capable of running symbolically, acting on placeholders. Gluon hides most of this under the hood, so you’ll only need to know how it works when you want to write your own layers. Given a HybridBlock whose forward computation consists of going through other HybridBlocks, you can compile that section of the network by calling the HybridBlock’s .hybridize() method.

All of MXNet’s predefined layers are HybridBlocks. This means that any network consisting entirely of predefined MXNet layers can be compiled and run at much faster speeds by calling .hybridize().

HybridSequential

We already learned how to use Sequential to stack layers. The regular Sequential can be built from regular Blocks, and so it too has to be a regular Block. However, when you want to build a network using Sequential and run it at crazy speeds, you can construct your network using HybridSequential instead. The functionality is the same as Sequential:

In [1]:
import mxnet as mx
from mxnet.gluon import nn
from mxnet import nd

def get_net():
    # construct a MLP
    net = nn.HybridSequential()
    with net.name_scope():
        net.add(nn.Dense(256, activation="relu"))
        net.add(nn.Dense(128, activation="relu"))
        net.add(nn.Dense(2))
    # initialize the parameters
    net.collect_params().initialize()
    return net

# forward
x = nd.random_normal(shape=(1, 512))
net = get_net()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.16526183 -0.14005636]]
<NDArray 1x2 @cpu(0)>

To compile and optimize the HybridSequential, we can then call its hybridize method. Only HybridBlocks, e.g. HybridSequential, can be compiled. But you can still call hybridize on a normal Block, and its HybridBlock children will be compiled instead. We will talk more about HybridBlocks later.

In [2]:
net.hybridize()
print('=== net(x) ==={}'.format(net(x)))
=== net(x) ===
[[ 0.16526183 -0.14005636]]
<NDArray 1x2 @cpu(0)>

Performance

To get a sense of the speedup from hybridizing, we can compare the performance before and after hybridizing by measuring, in each case, the time it takes to make 1000 forward passes through the network.

In [3]:
from time import time
def bench(net, x):
    mx.nd.waitall()
    start = time()
    for i in range(1000):
        y = net(x)
    mx.nd.waitall()
    return time() - start

net = get_net()
print('Before hybridizing: %.4f sec'%(bench(net, x)))
net.hybridize()
print('After hybridizing: %.4f sec'%(bench(net, x)))
Before hybridizing: 0.4646 sec
After hybridizing: 0.2424 sec

As you can see, hybridizing gives a significant performance boost, almost 2x the speed.

Get the symbolic program

Previously, we fed net with NDArray data x, and net(x) returned the forward results. Now if we feed it a Symbol placeholder instead, the corresponding symbolic program will be returned.

In [4]:
from mxnet import sym
x = sym.var('data')
print('=== input data holder ===')
print(x)

y = net(x)
print('\n=== the symbolic program of net===')
print(y)

y_json = y.tojson()
print('\n=== the according json definition===')
print(y_json)
=== input data holder ===
<Symbol data>

=== the symbolic program of net===
<Symbol hybridsequential1_dense2_fwd>

=== the according json definition===
{
  "nodes": [
    {
      "op": "null",
      "name": "data",
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense0_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(256, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense0_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(256,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense0_fwd",
      "attr": {"num_hidden": "256"},
      "inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
    },
    {
      "op": "Activation",
      "name": "hybridsequential1_dense0_relu_fwd",
      "attr": {"act_type": "relu"},
      "inputs": [[3, 0, 0]]
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense1_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(128, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense1_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(128,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense1_fwd",
      "attr": {"num_hidden": "128"},
      "inputs": [[4, 0, 0], [5, 0, 0], [6, 0, 0]]
    },
    {
      "op": "Activation",
      "name": "hybridsequential1_dense1_relu_fwd",
      "attr": {"act_type": "relu"},
      "inputs": [[7, 0, 0]]
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense2_weight",
      "attr": {
        "__dtype__": "0",
        "__lr_mult__": "1.0",
        "__shape__": "(2, 0)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "null",
      "name": "hybridsequential1_dense2_bias",
      "attr": {
        "__dtype__": "0",
        "__init__": "zeros",
        "__lr_mult__": "1.0",
        "__shape__": "(2,)",
        "__wd_mult__": "1.0"
      },
      "inputs": []
    },
    {
      "op": "FullyConnected",
      "name": "hybridsequential1_dense2_fwd",
      "attr": {"num_hidden": "2"},
      "inputs": [[8, 0, 0], [9, 0, 0], [10, 0, 0]]
    }
  ],
  "arg_nodes": [0, 1, 2, 5, 6, 9, 10],
  "node_row_ptr": [
    0,
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
    10,
    11,
    12
  ],
  "heads": [[11, 0, 0]],
  "attrs": {"mxnet_version": ["int", 1001]}
}
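
The JSON above is just a serialized list of graph nodes, so it can be inspected with the standard json module. The sketch below walks a trimmed, hand-written graph in the same layout (the shortened node names are made up for brevity) and prints each operator with the names of its inputs. It relies only on what the dump itself shows: "null" ops are variables, and the first number in each input entry is the index of the producing node. The helper describe_operators is hypothetical.

```python
import json

graph_json = """
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "dense0_weight", "inputs": []},
    {"op": "null", "name": "dense0_bias", "inputs": []},
    {"op": "FullyConnected", "name": "dense0_fwd",
     "attr": {"num_hidden": "256"},
     "inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]},
    {"op": "Activation", "name": "dense0_relu_fwd",
     "attr": {"act_type": "relu"},
     "inputs": [[3, 0, 0]]}
  ],
  "arg_nodes": [0, 1, 2],
  "heads": [[4, 0, 0]]
}
"""


def describe_operators(graph):
    """List every operator node together with the names of its inputs."""
    nodes = graph["nodes"]
    lines = []
    for node in nodes:
        if node["op"] == "null":      # variables: the data and the parameters
            continue
        input_names = [nodes[entry[0]]["name"] for entry in node["inputs"]]
        lines.append("{}({}) -> {}".format(
            node["op"], ", ".join(input_names), node["name"]))
    return lines


graph = json.loads(graph_json)
operators = describe_operators(graph)
output_names = [graph["nodes"][head[0]]["name"] for head in graph["heads"]]

for line in operators:
    print(line)
print("outputs:", output_names)
```

Running the same helper on the full y_json string printed above should list the three FullyConnected layers and the two activations in order.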

Now we can save both the program and the parameters to disk, so that they can be loaded later not only in Python, but also in all other supported languages, such as C++, R, and Scala.

In [5]:
y.save('model.json')
net.save_params('model.params')

HybridBlock

Now let’s dive deeper into how hybridize works. Remember that Gluon networks are composed of Blocks, each of which subclasses gluon.Block. With normal Blocks, we just need to define a forward function that takes an input x and computes the result of the forward pass through the network. MXNet can figure out the backward pass for us automatically with autograd.

To define a HybridBlock, we instead have a hybrid_forward function:

In [6]:
from mxnet import gluon

class Net(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.fc1 = nn.Dense(256)
            self.fc2 = nn.Dense(128)
            self.fc3 = nn.Dense(2)

    def hybrid_forward(self, F, x):
        # F is a function space that depends on the type of x
        # If x's type is NDArray, then F will be mxnet.nd
        # If x's type is Symbol, then F will be mxnet.sym
        print('type(x): {}, F: {}'.format(
                type(x).__name__, F.__name__))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

The hybrid_forward function takes an additional input, F, which stands for a backend. This exploits one awesome feature of MXNet. MXNet has both a symbolic API (mxnet.symbol) and an imperative API (mxnet.ndarray). In this book, so far, we’ve only focused on the latter. Owing to fortuitous historical reasons, the imperative and symbolic interfaces both support roughly the same API. They have many of the same functions (currently about 90% overlap) and when they do, they support the same arguments in the same order. When we define hybrid_forward, we pass in F. When running in imperative mode, hybrid_forward is called with F as mxnet.ndarray and x as some NDArray input. When we compile with hybridize, F will be mxnet.symbol and x will be some placeholder or intermediate symbolic value. After we call hybridize, the net is compiled on the next forward pass, and from then on hybrid_forward does not need to be called again.
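
The role of F can be sketched without MXNet at all. Below, two hypothetical backends (eager and symbolic, both invented for this illustration) expose a relu function under the same name, and a single hybrid_forward works with either. Strings stand in for symbolic placeholders, and the call helper picks the backend from the input type, in the spirit of the comments in the Net class above.

```python
from types import SimpleNamespace

# Two hypothetical backends exposing the same function under the same name.
eager = SimpleNamespace(__name__="eager",
                        relu=lambda x: max(x, 0.0))
symbolic = SimpleNamespace(__name__="symbolic",
                           relu=lambda x: "relu({})".format(x))


def hybrid_forward(F, x):
    # Written once; what F.relu does depends on the backend passed in.
    return F.relu(x)


def call(x):
    # Choose the backend from the type of the input.
    F = symbolic if isinstance(x, str) else eager
    return hybrid_forward(F, x)


print(call(-2.0))     # 0.0, computed immediately
print(call("data"))   # relu(data), a description of the computation
```

Because both backends agree on names and argument order, the body of hybrid_forward never has to know which one it was given.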

Let’s demonstrate how this all works by feeding some data through the network twice. We’ll do this for both a regular network and a hybridized net. You’ll see that in the first case, hybrid_forward is actually called twice.

In [7]:
net = Net()
net.collect_params().initialize()
x = nd.random_normal(shape=(1, 512))
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): NDArray, F: mxnet.ndarray
=== 2nd forward ===
type(x): NDArray, F: mxnet.ndarray

Now run it again after hybridizing.

In [8]:
net.hybridize()
print('=== 1st forward ===')
y = net(x)
print('=== 2nd forward ===')
y = net(x)
=== 1st forward ===
type(x): Symbol, F: mxnet.symbol
=== 2nd forward ===

It differs from the previous execution in two aspects:

  1. The input data type is now Symbol even though we fed an NDArray into net, because Gluon implicitly constructed a symbolic data placeholder.
  2. hybrid_forward is called only once, the first time we run net(x). This is because Gluon constructs the symbolic program on the first forward pass and then keeps it for reuse.
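
The trace-once-then-reuse behavior can be mimicked in a few lines of plain Python. This is a toy model under simplifying assumptions, not how Gluon is implemented: ToyHybridBlock, Eager and Recorder are hypothetical names, the recorded program is a simple chain of steps rather than a graph, and the string "data" stands in for a symbolic placeholder.

```python
class Eager:
    """Backend that computes immediately on real numbers."""

    @staticmethod
    def scale(x, factor):
        return x * factor

    @staticmethod
    def relu(x):
        return max(x, 0.0)


class Recorder:
    """Backend that records each operation as a step instead of computing it."""

    def __init__(self):
        self.steps = []

    def scale(self, x, factor):
        self.steps.append(lambda value: Eager.scale(value, factor))
        return x   # the placeholder flows through unchanged

    def relu(self, x):
        self.steps.append(Eager.relu)
        return x


class ToyHybridBlock:
    def __init__(self):
        self.active = False
        self.cached_program = None
        self.trace_count = 0   # how many times hybrid_forward has run

    def hybridize(self):
        self.active = True

    def hybrid_forward(self, F, x):
        self.trace_count += 1
        return F.relu(F.scale(x, 2.0))

    def __call__(self, x):
        if not self.active:
            return self.hybrid_forward(Eager, x)
        if self.cached_program is None:
            recorder = Recorder()
            self.hybrid_forward(recorder, "data")   # runs once, on a placeholder
            self.cached_program = recorder.steps
        for step in self.cached_program:            # no Python forward call here
            x = step(x)
        return x


net = ToyHybridBlock()
net(3.0)
net(-1.0)
imperative_calls = net.trace_count                  # 2

net.hybridize()
first = net(3.0)
second = net(-1.0)
hybrid_calls = net.trace_count - imperative_calls   # 1
print(imperative_calls, hybrid_calls, first, second)   # 2 1 6.0 0.0
```

Any print statement placed inside hybrid_forward would fire twice before hybridize and only once afterwards, matching the output of the cells above.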

One main reason the network is faster after hybridizing is that we no longer need to repeatedly invoke the Python forward function; all computation stays within the highly efficient C++ backend engine.

But the potential drawback is the loss of flexibility in writing the forward function. In other words, inserting print statements for debugging, or control logic such as if and for, into the forward function no longer works the way it does in imperative mode.

Conclusion

Through HybridSequential and HybridBlock, we can convert an imperative program into a symbolic program by calling hybridize.