# Binary classification with logistic regression¶

Over the last two tutorials we worked through how to implement a linear regression model, both *from scratch* and using Gluon to automate most of the repetitive work like allocating and initializing parameters, defining loss functions, and implementing optimizers.

Regression is the hammer we reach for when we want to answer *how much?*
or *how many?* questions. If you want to predict the number of dollars
(the *price*) at which a house will be sold, or the number of wins a
baseball team might have, or the number of days that a patient will
remain hospitalized before being discharged, then you’re probably
looking for a regression model.

Based on our experience, in industry, we’re more often interested in
making categorical assignments. *Does this email belong in the spam
folder or the inbox*? *How likely is this custromer to sign up for
subscription service?* When we’re interested in either assigning
datapoints to categories or assessing the *probability* that a category
applies, we call this task *classification*.

The simplest kind of classification problem is *binary classification*,
when there are only two categories, so let’s start there. Let’s call our
two categories the positive class \(y_i=1\) and the negative class
\(y_i = 0\). Even with just two categories, and even confining
ourselves to linear models, there are many ways we might approach the
problem. For example, we might try to draw a line that best separates
the points.

A whole family of algorithms called support vector machines pursue this approach. The main idea here is choose a line that maximizes the margin to the closest data points on either side of the decision boundary. In these approaches, only the points closest to the decision boundary (the support vectors) actually influence the choice of the linear separator.

With neural networks, we usually approach the problem differently. Instead of just trying to separate the points, we train a probabilistic classifier which estimates, for each data point, the conditional probability that it belongs to the positive class.

Recall that in linear regression, we made predictions of the form

We are interested in asking the question *“what is the probability that
example :math:`x` belongs to the positive class?”* A regular linear
model is a poor choice here because it can output values greater than
\(1\) or less than \(0\). To coerce reasonable answers from our
model, we’re going to modify it slightly, by running the linear function
through a sigmoid activation function \(\sigma\):

The sigmoid function \(\sigma\), sometimes called a squashing
function or a *logistic* function - thus the name logistic regression -
maps a real-valued input to the range 0 to 1. Specifically, it has the
functional form:

Let’s get our imports out of the way and visualize the logistic function
using `mxnet`

and `matplotlib`

.

```
In [ ]:
```

```
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
def logistic(z):
return 1. / (1. + nd.exp(-z))
x = nd.arange(-5, 5, .1)
y = logistic(x)
plt.plot(x.asnumpy(),y.asnumpy())
plt.show()
```

Because the sigmoid outputs a value between \(0\) and \(1\), it’s more reasonable to think of it as a probability. Note that an input of \(0\) gives a value of \(.5\). So in the common case, where we want to predict positive whenever the probability is greater than \(.5\) and negative whenever the probability is less than \(.5\), we can just look at the sign of \(\boldsymbol{w}^T \boldsymbol{x} + b\).

## Binary cross-entropy loss¶

Now that we’ve got a model that outputs probabilities, we need to choose
a loss function. When we wanted to predict *how much* we used squared
error \(y-\hat{y}^2\), as our measure our model’s performance.

Since now we’re thinking about outputing probabilities, one natural objective is to say that we should choose the weights that give the actual labels in the training data the highest probability.

Because each example is independent of the others, and each label depends only on the features of the corresponding examples, we can rewrite the above as

This function is a product over the examples, but in general, because we want to train by stochastic gradient descent, it’s a lot easier to work with a loss function that breaks down as a sum over the training examples.

Because we typically express our objective as a *loss* we can just flip
the sign, giving us the *negative log probability:*

If we interpret \(\hat{y_i}\) as the probability that the \(i\)-th example belongs to the positive class (i.e \(y_i=1\)), then \(1 - \hat{y_i}\) is the probability that the \(i\)-th example belongs to the negative class (i.e \(y_i=0\)). This is equivalent to saying

which can be written in a more compact form

Thus we can express our learning objective as:

If you’re learning machine learning for the first time, that might have
been too much information too quickly. Let’s take a look at this loss
function and break down what’s going on more slowly. The loss function
consists of two terms, \(y_i \log \hat{y}_i\) and
\((1-y_i) \log (1-\hat{y}_i)\). Because \(y_i\) only takes
values \(0\) or \(1\), for a given data point, one of these
terms disappears. When \(y_i\) is \(1\), this loss says that we
should maximize \(\log \hat{y}_i\), giving higher probability to the
*correct* answer. When \(y_i\) is \(0\), this loss function
takes value \(\log (1-\hat{y}_i)\). That says that we should
maximize the value \(1-\hat{y}\) which we already know is the
probability assigned to \(\boldsymbol{x}_i\) belonging to the
negative class.

Note that this loss function is commonly called *log loss* and is also
commonly referred to as *binary cross entropy*. It is a special case of
negative log likelihood. And it is a special case of cross-entropy,
which can apply to the multi-class (\(>2\)) setting.

While for linear regression, we demonstrated a completely different
implementation *from scratch* and *with ``gluon``*, here we’re going to
demonstrate how we can mix and match the two. We’ll use `gluon`

for
our modeling, but we’ll write our loss function from scratch.

## Data¶

As usual, we’ll want to work out these concepts using a real dataset.
This time around, we’ll use the *Adult* dataset taken from the UCI
repository. The dataset was
constructed by Barry Becker from 1994 census data. In its original form,
the dataset contained \(14\) features, including age, education,
occupation, sex, native-country, among others. In this version, hosted
by National Taiwan
University,
the data have been re-processed to \(123\) binary features each
representing quantiles among the original features. The label is a
binary indicator indicating whether the person corresponding to each row
made more (\(y_i = 1\)) or less (\(y_i = 0\)) than $50,000 of
income in 1994. The dataset we’re working with contains 30,956 training
examples and 1,605 examples set aside for testing. We can read the
datasets into main memory like so:

```
In [ ]:
```

```
data_ctx = mx.cpu()
# Change this to `mx.gpu(0) if you would like to train on an NVIDIA GPU
model_ctx = mx.cpu()
with open("../data/adult/a1a.train") as f:
train_raw = f.read()
with open("../data/adult/a1a.test") as f:
test_raw = f.read()
```

The data consists of lines like the following:

`-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1 \n`

The first entry in each row is the value of the label. The following tokens are the indices of the non-zero features. The number \(1\) here is redundant. But we don’t always have control over where our data comes from, so we might as well get used to mucking around with weird file formats. Let’s write a simple script to process our dataset.

```
In [ ]:
```

```
def process_data(raw_data):
train_lines = raw_data.splitlines()
num_examples = len(train_lines)
num_features = 123
X = nd.zeros((num_examples, num_features), ctx=data_ctx)
Y = nd.zeros((num_examples, 1), ctx=data_ctx)
for i, line in enumerate(train_lines):
tokens = line.split()
label = (int(tokens[0]) + 1) / 2 # Change label from {-1,1} to {0,1}
Y[i] = label
for token in tokens[1:]:
index = int(token[:-2]) - 1
X[i, index] = 1
return X, Y
```

```
In [ ]:
```

```
Xtrain, Ytrain = process_data(train_raw)
Xtest, Ytest = process_data(test_raw)
```

We can now verify that our data arrays have the right shapes.

```
In [ ]:
```

```
print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)
```

We can also check the fraction of positive examples in our training and test sets. This will give us one nice (necessay but insufficient) sanity check that our training and test data really are drawn from the same distribution.

```
In [ ]:
```

```
print(nd.sum(Ytrain)/len(Ytrain))
print(nd.sum(Ytest)/len(Ytest))
```

## Instantiate a dataloader¶

```
In [ ]:
```

```
batch_size = 64
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtrain, Ytrain),
batch_size=batch_size, shuffle=True)
test_data = gluon.data.DataLoader(gluon.data.ArrayDataset(Xtest, Ytest),
batch_size=batch_size, shuffle=True)
```

## Define the model¶

```
In [ ]:
```

```
net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
```

## Instantiate an optimizer¶

```
In [ ]:
```

```
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
```

## Define log loss¶

```
In [ ]:
```

```
def log_loss(output, y):
yhat = logistic(output)
return - nd.nansum( y * nd.log(yhat) + (1-y) * nd.log(1-yhat))
```

```
In [ ]:
```

```
epochs = 30
loss_sequence = []
num_examples = len(Xtrain)
for e in range(epochs):
cumulative_loss = 0
for i, (data, label) in enumerate(train_data):
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx)
with autograd.record():
output = net(data)
loss = log_loss(output, label)
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.sum(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss ))
loss_sequence.append(cumulative_loss)
```

## Visualize the learning curve¶

```
In [ ]:
```

```
# plot the convergence of the estimated loss function
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.figure(num=None,figsize=(8, 6))
plt.plot(loss_sequence)
# Adding some bells and whistles to the plot
plt.grid(True, which="both")
plt.xlabel('epoch',fontsize=14)
plt.ylabel('average loss',fontsize=14)
```

## Calculating accuracy¶

While the negative log likelihood gives us a sense of how well the
predicted probabilities agree with the observed labels, it’s not the
only way to assess the performance of our classifiers. For example, at
the end of the day, we’ll often want to apply a threshold to the
predicted probabilities in order to make hard predictions. For example,
if we were building a spam filter, we’ll need to either send the email
to the spam folder or to the inbox. In these cases, we might not care
about negative log likelihood, but instead we want know *how many
errors* our classifier makes. Let’s code up a simple script that
calculates the accuracy of our classifier.

```
In [ ]:
```

```
num_correct = 0.0
num_total = len(Xtest)
for i, (data, label) in enumerate(test_data):
data = data.as_in_context(model_ctx)
label = label.as_in_context(model_ctx)
output = net(data)
prediction = (nd.sign(output) + 1) / 2
num_correct += nd.sum(prediction == label)
print("Accuracy: %0.3f (%s/%s)" % (num_correct.asscalar()/num_total, num_correct.asscalar(), num_total))
```

This isn’t too bad! A naive classifier would predict that nobody had an income greater than $50k (the majority class). This classifier would achieve an accuracy of roughly 75%. By contrast, our classifier gets an accuracy of .84 (results may vary a small amount on each run owing to random initializations and random sampling of the batches).

By now you should have some feeling for the two most fundamental tasks in supervised learning: regression and classification. In the following chapters we’ll go deeper into these problems, exploring more complex models, loss functions, optimizers, and training schemes. We’ll also look at more interesting datasets. And finally, in the following chapters we’ll also look more advanced problems where we want, for example, to predict more structured objects.