Bayes by Backprop with gluon (NN, classification)

After discussing Bayes by Backprop from scratch in a previous notebook, we can now look at the corresponding implementation as gluon components.

We start off with the usual set of imports.

[ ]:
from __future__ import print_function
import collections
import mxnet as mx
import numpy as np
from mxnet import nd, autograd
from matplotlib import pyplot as plt
from mxnet import gluon

For easy tuning and experimentation, we define a dictionary holding the hyper-parameters of our model.

[ ]:
config = {
    "num_hidden_layers": 2,
    "num_hidden_units": 400,
    "batch_size": 128,
    "epochs": 10,
    "learning_rate": 0.001,
    "num_samples": 1,
    "pi": 0.25,
    "sigma_p": 1.0,
    "sigma_p1": 0.75,
    "sigma_p2": 0.01,

Also, we specify the device context for MXNet.

[ ]:
ctx = mx.cpu()

Load MNIST data

We will again train and evaluate the algorithm on the MNIST data set and therefore load the data set as follows:

[ ]:
def transform(data, label):
    return data.astype(np.float32)/126.0, label.astype(np.float32)

mnist = mx.test_utils.get_mnist()
num_inputs = 784
num_outputs = 10
batch_size = config['batch_size']

train_data =, transform=transform),
                                      batch_size, shuffle=True)
test_data =, transform=transform),
                                     batch_size, shuffle=False)

num_train = sum([batch_size for i in train_data])
num_batches = num_train / batch_size

In order to reproduce and compare the results from the paper, we preprocess the pixels by dividing by 126.

Model definition

Neural net modeling

As our model we are using a straightforward MLP and we are wiring up our network just as we are used to in gluon. Note that we are not using any special layers during the definition of our network, as we believe that Bayes by Backprop should be thought of as a training method, rather than a special architecture.

[ ]:
num_layers = config['num_hidden_layers']
num_hidden = config['num_hidden_units']

net = gluon.nn.Sequential()
with net.name_scope():
    for i in range(num_layers):
        net.add(gluon.nn.Dense(num_hidden, activation="relu"))

Build objective/loss

Again, we define our loss function as described in Bayes by Backprop from scratch. Note that we are bundling all of this functionality as part of a gluon.loss.Loss subclass, where the loss computation is performed in the hybrid_forward function.

[ ]:
class BBBLoss(gluon.loss.Loss):
    def __init__(self, log_prior="gaussian", log_likelihood="softmax_cross_entropy",
                 sigma_p1=1.0, sigma_p2=0.1, pi=0.5, weight=None, batch_axis=0, **kwargs):
        super(BBBLoss, self).__init__(weight, batch_axis, **kwargs)
        self.log_prior = log_prior
        self.log_likelihood = log_likelihood
        self.sigma_p1 = sigma_p1
        self.sigma_p2 = sigma_p2
        self.pi = pi

    def log_softmax_likelihood(self, yhat_linear, y):
        return nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)

    def log_gaussian(self, x, mu, sigma):
        return -0.5 * np.log(2.0 * np.pi) - nd.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

    def gaussian_prior(self, x):
        sigma_p = nd.array([self.sigma_p1], ctx=ctx)
        return nd.sum(self.log_gaussian(x, 0., sigma_p))

    def gaussian(self, x, mu, sigma):
        scaling = 1.0 / nd.sqrt(2.0 * np.pi * (sigma ** 2))
        bell = nd.exp(- (x - mu) ** 2 / (2.0 * sigma ** 2))

        return scaling * bell

    def scale_mixture_prior(self, x):
        sigma_p1 = nd.array([self.sigma_p1], ctx=ctx)
        sigma_p2 = nd.array([self.sigma_p2], ctx=ctx)
        pi = self.pi

        first_gaussian = pi * self.gaussian(x, 0., sigma_p1)
        second_gaussian = (1 - pi) * self.gaussian(x, 0., sigma_p2)

        return nd.log(first_gaussian + second_gaussian)

    def hybrid_forward(self, F, output, label, params, mus, sigmas, sample_weight=None):
        log_likelihood_sum = nd.sum(self.log_softmax_likelihood(output, label))
        prior = None
        if self.log_prior == "gaussian":
            prior = self.gaussian_prior
        elif self.log_prior == "scale_mixture":
            prior = self.scale_mixture_prior
        log_prior_sum = sum([nd.sum(prior(param)) for param in params])
        log_var_posterior_sum = sum([nd.sum(self.log_gaussian(params[i], mus[i], sigmas[i])) for i in range(len(params))])
        return 1.0 / num_batches * (log_var_posterior_sum - log_prior_sum) - log_likelihood_sum

bbb_loss = BBBLoss(log_prior="scale_mixture", sigma_p1=config['sigma_p1'], sigma_p2=config['sigma_p2'])

Parameter initialization

First, we need to initialize all the network’s parameters, which are only point estimates of the weights at this point. We will soon see, how we can still train the network in a Bayesian fashion, without interfering with the network’s architecture.

[ ]:
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

Then we have to forward-propagate a single data set entry once to set up all network parameters (weights and biases) with the desired initliaizer specified above.

[ ]:
for i, (data, label) in enumerate(train_data):
    data = data.as_in_context(ctx).reshape((-1, 784))
[ ]:
weight_scale = .1
rho_offset = -3

# initialize variational parameters; mean and variance for each weight
mus = []
rhos = []

shapes = list(map(lambda x: x.shape, net.collect_params().values()))

for shape in shapes:
    mu = gluon.Parameter('mu', shape=shape, init=mx.init.Normal(weight_scale))
    rho = gluon.Parameter('rho',shape=shape, init=mx.init.Constant(rho_offset))

variational_params = mus + rhos

raw_mus = list(map(lambda x:, mus))
raw_rhos = list(map(lambda x:, rhos))


Now, we still have to choose the optimizer we wish to use for training. This time, we are using the adam optimizer.

[ ]:
trainer = gluon.Trainer(variational_params, 'adam', {'learning_rate': config['learning_rate']})

Main training loop


Recall the 3-step process for the variational parameters:

  1. Sample \(\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}^{d})\)

[ ]:
def sample_epsilons(param_shapes):
    epsilons = [nd.random_normal(shape=shape, loc=0., scale=1.0, ctx=ctx) for shape in param_shapes]
    return epsilons
  1. Transform \(\mathbf{\rho}\) to a positive vector via the softplus function: \(\mathbf{\sigma} = \text{softplus}(\mathbf{\rho}) = \log(1 + \exp(\mathbf{\rho}))\)

[ ]:
def softplus(x):
    return nd.log(1. + nd.exp(x))

def transform_rhos(rhos):
    return [softplus(rho) for rho in rhos]
  1. Compute \(\mathbf{w}\): \(\mathbf{w} = \mathbf{\mu} + \mathbf{\sigma} \circ \mathbf{\epsilon}\), where the \(\circ\) operator represents the element-wise multiplication. This is the “reparametrization trick” for separating the randomness from the parameters of \(q\).

[ ]:
def transform_gaussian_samples(mus, sigmas, epsilons):
    samples = []
    for j in range(len(mus)):
        samples.append(mus[j] + sigmas[j] * epsilons[j])
    return samples

Putting these three steps together we get:

[ ]:
def generate_weight_sample(layer_param_shapes, mus, rhos):
    # sample epsilons from standard normal
    epsilons = sample_epsilons(layer_param_shapes)

    # compute softplus for variance
    sigmas = transform_rhos(rhos)

    # obtain a sample from q(w|theta) by transforming the epsilons
    layer_params = transform_gaussian_samples(mus, sigmas, epsilons)

    return layer_params, sigmas

Evaluation metric

In order to being able to assess our model performance we define a helper function which evaluates our accuracy on an ongoing basis.

[ ]:
def evaluate_accuracy(data_iterator, net, layer_params):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)

        for l_param, param in zip(layer_params, net.collect_params().values()):
            param._data[0] = l_param

        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()

Complete loop

The complete training loop is given below.

[ ]:
epochs = config['epochs']
learning_rate = config['learning_rate']
smoothing_constant = .01
train_acc = []
test_acc = []

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)

        with autograd.record():
            # generate sample
            layer_params, sigmas = generate_weight_sample(shapes, raw_mus, raw_rhos)

            # overwrite network parameters with sampled parameters
            for sample, param in zip(layer_params, net.collect_params().values()):
                param._data[0] = sample

            # forward-propagate the batch
            output = net(data)

            # calculate the loss
            loss = bbb_loss(output, label_one_hot, layer_params, raw_mus, sigmas)

            # backpropagate for gradient calculation


        # calculate moving loss for monitoring convergence
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0))
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)

    test_accuracy = evaluate_accuracy(test_data, net, raw_mus)
    train_accuracy = evaluate_accuracy(train_data, net, raw_mus)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, moving_loss, train_accuracy, test_accuracy))


For demonstration purposes, we can now take a look at one particular weight by plotting its distribution.

[ ]:
def gaussian(x, mu, sigma):
    scaling = 1.0 / nd.sqrt(2.0 * np.pi * (sigma ** 2))
    bell = nd.exp(- (x - mu) ** 2 / (2.0 * sigma ** 2))

    return scaling * bell

def show_weight_dist(mean, variance):
    sigma = nd.sqrt(variance)
    x = np.linspace(mean.asscalar() - 4*sigma.asscalar(), mean.asscalar() + 4*sigma.asscalar(), 100)
    plt.plot(x, gaussian(nd.array(x, ctx=ctx), mean, sigma).asnumpy())

mu = raw_mus[0][0][0]
var = softplus(raw_rhos[0][0][0]) ** 2

show_weight_dist(mu, var)

Weight pruning

To measure the degree of redundancy present in the trained network and to reduce the model’s parameter count, we now want to examine the effect of setting some of the weights to \(0\) and evaluate the test accuracy afterwards. We can achieve this by ordering the weights according to their signal-to-noise-ratio, \(\frac{|\mu_i|}{\sigma_i}\), and setting a certain percentage of the weights with the lowest ratios to \(0\).

We can calculate the signal-to-noise-ratio as follows:

[ ]:
def signal_to_noise_ratio(mus, sigmas):
    sign_to_noise = []
    for j in range(len(mus)):
        sign_to_noise.extend([nd.abs(mus[j]) / sigmas[j]])
    return sign_to_noise

We further introduce a few helper methods which turn our list of weights into a single vector containing all weights. This will make our subsequent actions easier.

[ ]:
def vectorize_matrices_in_vector(vec):
    for i in range(0, (num_layers + 1) * 2, 2):
        if i == 0:
            vec[i] = nd.reshape(vec[i], num_inputs * num_hidden)
        elif i == num_layers * 2:
            vec[i] = nd.reshape(vec[i], num_hidden * num_outputs)
            vec[i] = nd.reshape(vec[i], num_hidden * num_hidden)

    return vec

def concact_vectors_in_vector(vec):
    concat_vec = vec[0]
    for i in range(1, len(vec)):
        concat_vec = nd.concat(concat_vec, vec[i], dim=0)

    return concat_vec

def transform_vector_structure(vec):
    vec = vectorize_matrices_in_vector(vec)
    vec = concact_vectors_in_vector(vec)

    return vec

In addition, we also have a helper method which transforms the pruned weight vector back to the original layered structure.

[ ]:
from functools import reduce
import operator

def prod(iterable):
    return reduce(operator.mul, iterable, 1)

def restore_weight_structure(vec):
    pruned_weights = []

    index = 0

    for shape in shapes:
        incr = prod(shape)
        pruned_weights.extend([nd.reshape(vec[index : index + incr], shape)])
        index += incr

    return pruned_weights

The actual pruning of the vector happens in the following function. Note that this function accepts an ordered list of percentages to evaluate the performance at different pruning rates. In this setting, pruning at each iteration means extracting the index of the lowest signal-to-noise-ratio weight and setting the weight at this index to \(0\).

[ ]:
def prune_weights(sign_to_noise_vec, prediction_vector, percentages):
    pruning_indices = nd.argsort(sign_to_noise_vec, axis=0)

    for percentage in percentages:
        prediction_vector = mus_copy_vec.copy()
        pruning_indices_percent = pruning_indices[0:int(len(pruning_indices)*percentage)]
        for pr_ind in pruning_indices_percent:
            prediction_vector[int(pr_ind.asscalar())] = 0
        pruned_weights = restore_weight_structure(prediction_vector)
        test_accuracy = evaluate_accuracy(test_data, net, pruned_weights)
        print("%s --> %s" % (percentage, test_accuracy))

Putting the above function together:

[ ]:
sign_to_noise = signal_to_noise_ratio(raw_mus, sigmas)
sign_to_noise_vec = transform_vector_structure(sign_to_noise)

mus_copy = raw_mus.copy()
mus_copy_vec = transform_vector_structure(mus_copy)

prune_weights(sign_to_noise_vec, mus_copy_vec, [0.1, 0.25, 0.5, 0.75, 0.95, 0.98, 1.0])

Depending on the number of units used in the original network, the highest achievable pruning percentages (without significantly reducing the predictive performance) can vary. The paper, for example, reports almost no change in the test accuracy when pruning 95% of the weights in a 1200 unit Bayesian neural network, which creates a significantly sparser network, leading to faster predictions and reduced memory requirements.


We have taken a look at an efficient Bayesian treatment for neural networks using variational inference via the “Bayes by Backprop” algorithm (introduced by the “Weight Uncertainity in Neural Networks” paper). We have implemented a stochastic version of the variational lower bound and optimized it in order to find an approximation to the posterior distribution over the weights of a MLP network on the MNIST data set. As a result, we achieve regularization on the network’s parameters and can quantify our uncertainty about the weights accurately. Finally, we saw that it is possible to significantly reduce the number of weights in the neural network after training while still keeping a high accuracy on the test set.

We also note that, given this model implementation, we were able to reproduce the paper’s results on the MNIST data set, achieving a comparable test accuracy for all documented instances of the MNIST classification problem.

For whinges or inquiries, open an issue on GitHub.