Binary classification with logistic regression

Over the last two tutorials we worked through how to implement a linear regression model, both *from scratch* and using Gluon to automate most of the repetitive work like allocating and initializing parameters, defining loss functions, and implementing optimizers.

Regression is the hammer we reach for when we want to answer *how much?* or *how many?* questions. If you want to predict the price in dollars at which a house will sell, or the number of wins a baseball team might have, or the number of days a patient will remain hospitalized before being discharged, then you’re probably looking for a regression model.

In our experience, industry applications more often call for making categorical assignments. Does this email belong in the spam folder or the inbox? How likely is this customer to sign up for a subscription service? When we’re interested either in assigning datapoints to categories or in assessing the probability that a category applies, we call this task classification.

The simplest kind of classification problem is binary classification, where there are only two categories, so let’s start there. We’ll call our two categories the positive class \(y_i=1\) and the negative class \(y_i = 0\). Even with just two categories, and even confining ourselves to linear models, there are many ways we might approach the problem. For example, we might try to draw a line that best separates the points.

A whole family of algorithms called support vector machines pursue this approach. The main idea here is to choose a line that maximizes the margin to the closest data points on either side of the decision boundary. In these approaches, only the points closest to the decision boundary (the support vectors) actually influence the choice of the linear separator.

With neural networks, we usually approach the problem differently. Instead of just trying to separate the points, we train a probabilistic classifier that estimates, for each data point, the conditional probability that it belongs to the positive class.

Recall that in linear regression, we made predictions of the form

\[\hat{y} = \boldsymbol{w}^T \boldsymbol{x} + b.\]

We are interested in asking the question “what is the probability that example \(\boldsymbol{x}\) belongs to the positive class?” A regular linear model is a poor choice here because it can output values greater than \(1\) or less than \(0\). To coerce reasonable answers from our model, we’re going to modify it slightly, by running the linear function through a sigmoid activation function \(\sigma\):

\[\hat{y} =\sigma(\boldsymbol{w}^T \boldsymbol{x} + b).\]

The sigmoid function \(\sigma\), sometimes called a squashing function or the logistic function (hence the name logistic regression), maps a real-valued input into the range \((0, 1)\). Specifically, it has the functional form:

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Let’s get our imports out of the way and visualize the logistic function using mxnet and matplotlib.

In [ ]:
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt

def logistic(z):
    return 1. / (1. + nd.exp(-z))

x = nd.arange(-5, 5, .1)
y = logistic(x)

plt.plot(x.asnumpy(), y.asnumpy())
plt.show()

Because the sigmoid outputs a value between \(0\) and \(1\), it’s reasonable to interpret it as a probability. Note that an input of \(0\) gives a value of \(0.5\). So in the common case, where we want to predict positive whenever the probability is greater than \(0.5\) and negative whenever the probability is less than \(0.5\), we can just look at the sign of \(\boldsymbol{w}^T \boldsymbol{x} + b\).
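
As a quick sanity check, here’s a minimal sketch (reusing the `logistic` function defined above) confirming that thresholding the probability at \(0.5\) gives the same predictions as taking the sign of the raw score:

In [ ]:
# Thresholding the probability at 0.5 matches the sign of the raw score
z = nd.array([-2., -.5, .5, 2.])
p = logistic(z)
print(p)                     # probabilities: ~[0.12, 0.38, 0.62, 0.88]
print(p > 0.5)               # [0, 0, 1, 1]
print((nd.sign(z) + 1) / 2)  # [0, 0, 1, 1]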

Binary cross-entropy loss

Now that we’ve got a model that outputs probabilities, we need to choose a loss function. When we wanted to predict *how much*, we used the squared error \((y-\hat{y})^2\) as the measure of our model’s performance.

Since we’re now thinking about outputting probabilities, one natural objective is to choose the weights that give the actual labels in the training data the highest probability:

\[\max_{\theta} P_{\theta}( (y_1, ..., y_n) | \boldsymbol{x}_1,...,\boldsymbol{x}_n )\]

Because each example is independent of the others, and each label depends only on the features of the corresponding example, we can rewrite the above as

\[\max_{\theta} P_{\theta}(y_1|\boldsymbol{x}_1)P_{\theta}(y_2|\boldsymbol{x}_2) ... P_{\theta}(y_n|\boldsymbol{x}_n)\]

This function is a product over the examples, but because we want to train by stochastic gradient descent, it’s much easier to work with a loss function that breaks down as a sum over the training examples. Since the logarithm is monotonically increasing, maximizing the product is equivalent to maximizing its logarithm, which is the sum of the logarithms:

\[\max_{\theta} \log P_{\theta}(y_1|\boldsymbol{x}_1) + ... + \log P_{\theta}(y_n|\boldsymbol{x}_n)\]

Because we typically express our objective as a loss, we can just flip the sign, giving us the negative log probability:

\[\min_{\theta} \left(- \sum_{i=1}^n \log P_{\theta}(y_i|\boldsymbol{x}_i)\right)\]

If we interpret \(\hat{y}_i\) as the probability that the \(i\)-th example belongs to the positive class (i.e., \(y_i=1\)), then \(1 - \hat{y}_i\) is the probability that the \(i\)-th example belongs to the negative class (i.e., \(y_i=0\)). This is equivalent to saying

\[\begin{split} P_{\theta}(y_i|\boldsymbol{x}_i) = \begin{cases} \hat{y}_i, & \text{if } y_i = 1\\ 1-\hat{y}_i, & \text{if } y_i = 0 \end{cases}\end{split}\]

which can be written in a more compact form

\[P_{\theta}(y_i|\boldsymbol{x}_i) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}\]

Thus we can express our learning objective as:

\[\ell (\boldsymbol{y}, \boldsymbol{\hat{y}}) = - \sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right].\]

If you’re learning machine learning for the first time, that might have been too much information too quickly. Let’s slow down and break down what’s going on in this loss function. It consists of two terms, \(y_i \log \hat{y}_i\) and \((1-y_i) \log (1-\hat{y}_i)\). Because \(y_i\) only takes values \(0\) or \(1\), one of these terms disappears for any given data point. When \(y_i\) is \(1\), minimizing the loss means maximizing \(\log \hat{y}_i\), giving higher probability to the correct answer. When \(y_i\) is \(0\), it means maximizing \(\log (1-\hat{y}_i)\), where \(1-\hat{y}_i\) is, as we noted above, the probability assigned to \(\boldsymbol{x}_i\) belonging to the negative class.
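
To make this concrete, here’s a small numeric sketch of the per-example loss \(-\left[ y \log \hat{y} + (1-y) \log (1-\hat{y}) \right]\): a confident prediction costs little when it’s correct and a lot when it’s wrong.

In [ ]:
# Per-example loss at a few illustrative (label, probability) pairs
for y, yhat in [(1, 0.9), (1, 0.1), (0, 0.9), (0, 0.1)]:
    loss = -(y * nd.log(nd.array([yhat])) + (1 - y) * nd.log(nd.array([1 - yhat])))
    print("y=%d, yhat=%.1f, loss=%.3f" % (y, yhat, loss.asscalar()))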

Note that this loss function is commonly called *log loss* or *binary cross-entropy*. It is a special case of negative log-likelihood, and also of cross-entropy, which applies in the multi-class (\(>2\)) setting as well.

For linear regression, we demonstrated two entirely separate implementations: one from scratch and one with `gluon`. Here, we’re going to demonstrate how we can mix and match the two. We’ll use `gluon` for our modeling, but we’ll write our loss function from scratch.

Data

As usual, we’ll want to work out these concepts using a real dataset. This time around, we’ll use the Adult dataset taken from the UCI repository. The dataset was constructed by Barry Becker from 1994 census data. In its original form, the dataset contained \(14\) features, including age, education, occupation, sex, and native country. In this version, hosted by National Taiwan University, the data have been re-processed into \(123\) binary features, each representing quantiles among the original features. The label is a binary indicator of whether the person corresponding to each row made more (\(y_i = 1\)) or less (\(y_i = 0\)) than $50,000 of income in 1994. The dataset we’re working with contains 30,956 training examples and 1,605 examples set aside for testing. We can read the datasets into main memory like so:

In [ ]:
data_ctx = mx.cpu()
# Change this to `mx.gpu(0)` if you would like to train on an NVIDIA GPU
model_ctx = mx.cpu()

with open("../data/adult/a1a.train") as f:
    train_raw = f.read()

with open("../data/adult/a1a.test") as f:
    test_raw = f.read()

The data consists of lines like the following:

-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1 \n

The first entry in each row is the value of the label. The tokens that follow are the indices of the non-zero features. The number \(1\) here is redundant, since every non-zero feature has the value \(1\). But we don’t always have control over where our data comes from, so we might as well get used to mucking around with weird file formats. Let’s write a simple script to process our dataset.

In [ ]:
def process_data(raw_data):
    lines = raw_data.splitlines()
    num_examples = len(lines)
    num_features = 123
    X = nd.zeros((num_examples, num_features), ctx=data_ctx)
    Y = nd.zeros((num_examples, 1), ctx=data_ctx)
    for i, line in enumerate(lines):
        tokens = line.split()
        label = (int(tokens[0]) + 1) / 2  # Map the label from {-1, 1} to {0, 1}
        Y[i] = label
        for token in tokens[1:]:
            # Each remaining token has the form "index:1"; strip the ":1"
            # and convert the 1-based index to a 0-based one
            index = int(token[:-2]) - 1
            X[i, index] = 1
    return X, Y
In [ ]:
Xtrain, Ytrain = process_data(train_raw)
Xtest, Ytest = process_data(test_raw)

We can now verify that our data arrays have the right shapes.

In [ ]:
print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)

We can also check the fraction of positive examples in our training and test sets. This gives us one nice (necessary but insufficient) sanity check that our training and test data really are drawn from the same distribution.

In [ ]:
print(nd.sum(Ytrain)/len(Ytrain))
print(nd.sum(Ytest)/len(Ytest))

Instantiate a DataLoader

In [ ]:
batch_size = 64

train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtrain, Ytrain),
                                   batch_size=batch_size, shuffle=True)

# There's no need to shuffle the held-out test data
test_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtest, Ytest),
                                  batch_size=batch_size, shuffle=False)
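
As a quick check (a sketch, not strictly necessary), we can pull a single batch from the loader and confirm that it yields feature and label arrays of the expected shapes:

In [ ]:
for data, label in train_data:
    print(data.shape)   # (64, 123)
    print(label.shape)  # (64, 1)
    break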

Define the model

In [ ]:
# A single Dense layer with one output gives us the raw score w^T x + b
net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)

Instantiate an optimizer

In [ ]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

Define log loss

In [ ]:
def log_loss(output, y):
    # Squash the raw scores into probabilities, then compute
    # binary cross-entropy, summed over the batch
    yhat = logistic(output)
    return -nd.sum(y * nd.log(yhat) + (1 - y) * nd.log(1 - yhat))
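
Before training, we can sanity-check the loss on hand-picked scores; this is just an illustrative check, not part of the training pipeline. A confidently positive score should incur a near-zero loss when the label is \(1\) and a large loss when the label is \(0\):

In [ ]:
print(log_loss(nd.array([[5.]]), nd.array([[1.]])))  # confident and correct: ~0.007
print(log_loss(nd.array([[5.]]), nd.array([[0.]])))  # confident and wrong: ~5.007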
In [ ]:
epochs = 30
loss_sequence = []
num_examples = len(Xtrain)

for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx)
        label = label.as_in_context(model_ctx)
        with autograd.record():
            output = net(data)
            loss = log_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
        cumulative_loss += nd.sum(loss).asscalar()
    print("Epoch %s, loss: %s" % (e, cumulative_loss ))
    loss_sequence.append(cumulative_loss)

Visualize the learning curve

In [ ]:
# Plot the convergence of the estimated loss function
%matplotlib inline

import matplotlib.pyplot as plt

plt.figure(num=None, figsize=(8, 6))
plt.plot(loss_sequence)

# Add some bells and whistles to the plot
plt.grid(True, which="both")
plt.xlabel('epoch', fontsize=14)
plt.ylabel('average loss', fontsize=14)

Calculating accuracy

While the negative log-likelihood gives us a sense of how well the predicted probabilities agree with the observed labels, it’s not the only way to assess the performance of our classifier. At the end of the day, we’ll often want to apply a threshold to the predicted probabilities in order to make hard predictions: a spam filter, for example, must either send an email to the spam folder or to the inbox. In these cases, we may care less about the negative log-likelihood and instead want to know how many errors our classifier makes. Let’s code up a simple script that calculates the accuracy of our classifier.

In [ ]:
num_correct = 0.0
num_total = len(Xtest)
for i, (data, label) in enumerate(test_data):
    data = data.as_in_context(model_ctx)
    label = label.as_in_context(model_ctx)
    output = net(data)
    # The sign of the raw score tells us whether the predicted probability
    # is above or below 0.5; map {-1, 1} to {0, 1} to match the labels
    prediction = (nd.sign(output) + 1) / 2
    num_correct += nd.sum(prediction == label)
print("Accuracy: %0.3f (%s/%s)" % (num_correct.asscalar() / num_total, num_correct.asscalar(), num_total))

This isn’t too bad! A naive classifier that predicted nobody had an income greater than $50k (the majority class) would achieve an accuracy of roughly 75%. By contrast, our classifier gets an accuracy of about .84 (results may vary a small amount from run to run owing to random initialization and the random sampling of batches).
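
We can verify that baseline directly from the test labels. Since earning more than $50k is the minority class, always predicting \(0\) scores whatever fraction of the test set is negative:

In [ ]:
# Accuracy of a classifier that always predicts the negative (majority) class
print("Majority-class baseline: %0.3f" % (1 - nd.mean(Ytest).asscalar()))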

By now you should have some feeling for the two most fundamental tasks in supervised learning: regression and classification. In the following chapters we’ll go deeper into these problems, exploring more complex models, loss functions, optimizers, and training schemes. We’ll also look at more interesting datasets, and, finally, at more advanced problems where we want, for example, to predict more structured objects.