Before we could begin writing, the authors of this book, like much of the work force, had to become caffeinated. We hopped in the car and started driving. Having an Android, Alex called out “Okay Google”, awakening the phone’s voice recognition system. Then Mu commanded “directions to Blue Bottle coffee shop”. The phone quickly displayed the transcription of his command. It also recognized that we were asking for directions and launched the Maps application to fulfill our request. Once launched, the Maps app identified a number of routes. Next to each route, the phone displayed a predicted transit time. While we fabricated this story for pedagogical convenience, it demonstrates that in the span of just a few seconds, our everyday interactions with a smartphone can engage several machine learning models.

If you’ve never worked with machine learning before, you might be wondering what the hell we’re talking about. You might ask, “Isn’t that just programming?” Or, “What does machine learning even mean?” First, to be clear, we implement all machine learning algorithms by writing computer programs. And we use many of the same languages and hardware as used in other fields of computer science. But not all computer programs involve machine learning. In response to the second question, precisely defining a field of study as vast as machine learning is hard. It’s a bit like answering, “what is math?”. But we’ll try to give you enough intuition to get started.

A motivating example

Most of the computer programs we interact with every day can be coded up from first principles. When you add an item to your shopping cart, you trigger an e-commerce application to store an entry in a shopping cart database table, associating your user ID with the product’s ID. We can write such a program from first principles and launch it without ever having seen a real customer. And when it’s this easy to write an application, you should not be using machine learning.

But fortunately (for the community of ML scientists), for many problems, solutions aren’t so easy. Returning to our fake story about going to get coffee, imagine just writing a program to respond to a wake word like “Alexa”, “Okay, Google” or “Siri”. Try coding it up in a room by yourself with nothing but a computer and a code editor. How would you write such a program from first principles? Think about it… the problem is hard. Every second, the microphone will collect roughly 44,000 samples. What rule could map reliably from a snippet of raw audio to confident predictions {yes, no} on whether the snippet contains the wake word? If you’re stuck, don’t worry. We don’t know how to write such a program from scratch either. That’s why we use machine learning.

Here’s the trick. Often, even when we don’t know how to tell a computer explicitly how to map from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words, even if you don’t know how to program a computer to recognize the word “Alexa”, you yourself are able to recognize the word “Alexa”. Armed with this ability, we can collect a huge dataset containing examples of audio and label those that do and that do not contain the wake word. In the machine learning approach, we do not design a system explicitly to recognize wake words right away. Instead, we define a flexible program with a number of parameters. These are knobs that we can tune to change the behavior of the program. We call this program a model. Generally, our model is just a machine that transforms its input into some output. In this case, the model receives as input a snippet of audio, and it generates as output an answer {yes, no}, which we hope reflects whether (or not) the snippet contains the wake word.

If we choose the right kind of model, then there should exist one setting of the knobs such that the model fires yes every time it hears the word “Alexa”. There should also be another setting of the knobs that might fire yes on the word “Apricot”. We expect that the same model should apply to “Alexa” recognition and “Apricot” recognition because these are similar tasks. However, we might need a different model to deal with fundamentally different inputs or outputs. For example, we might choose a different sort of machine to map from images to captions, or from English sentences to Chinese sentences.

As you might guess, if we just set the knobs randomly, the model will probably recognize neither “Alexa”, “Apricot”, nor any other word in the English language. In most deep learning, the learning refers precisely to updating the model’s behavior (by twisting the knobs) over the course of a training period.

The training process usually looks like this:

  1. Start off with a randomly initialized model that can’t do anything useful
  2. Grab some of your labeled data (e.g. audio snippets and corresponding {yes,no} labels)
  3. Tweak the knobs so the model sucks less with respect to those examples
  4. Repeat until the model is dope.
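The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration under our own assumptions, not a wake word recognizer: the data (points on the line y = 2x + 1), the two knobs `w` and `b`, the squared-error measure, and the learning rate are all hypothetical choices made for the example.

```python
import random

random.seed(0)

# Toy labeled data (hypothetical): inputs x with targets y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# 1. Start off with randomly initialized knobs (parameters).
w, b = random.uniform(-1, 1), random.uniform(-1, 1)


def mean_squared_error(w, b, examples):
    """Average squared gap between predictions w*x + b and targets."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w, b, data)
learning_rate = 0.05  # assumed step size for this toy problem

for step in range(500):            # 4. Repeat.
    x, y = random.choice(data)     # 2. Grab some labeled data.
    error = (w * x + b) - y
    # 3. Tweak the knobs in the direction that shrinks the squared error.
    w -= learning_rate * 2 * error * x
    b -= learning_rate * 2 * error

final_loss = mean_squared_error(w, b, data)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
print(f"learned knobs: w={w:.3f}, b={b:.3f}")
```

After training, the knobs should sit close to the values that generated the data (w near 2, b near 1), even though we never wrote those numbers into the model.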

To summarize, rather than code up a wake word recognizer, we code up a program that, when presented with a large labeled dataset, can learn to recognize wake words. You can think of this act, of determining a program’s behavior by presenting it with a dataset, as programming with data.

The dizzying versatility of machine learning

This is the core idea behind machine learning: Rather than code programs with fixed behavior, we design programs with the ability to improve as they acquire more experience. This basic idea can take many forms. Machine learning can address many different application domains, involve many different types of models, and update them according to many different learning algorithms. In this particular case, we described an instance of supervised learning applied to a problem in automated speech recognition.

Machine learning is a versatile set of tools that lets you work with data in many different situations where simple rule-based systems would fail or might be very difficult to build. Due to its versatility, machine learning can be quite confusing to newcomers. For example, machine learning techniques are already widely used in applications as diverse as search engines, self-driving cars, machine translation, medical diagnosis, spam filtering, game playing (chess, Go), face recognition, data matching, calculating insurance premiums, and adding filters to photos.

Despite the superficial differences between these problems, many of them share a common structure and are addressable with deep learning tools. They’re mostly similar because they are problems where we wouldn’t be able to program their behavior directly in code, but we can program them with data. Oftentimes the most direct language for communicating these kinds of programs is math. In this book, we’ll introduce a minimal amount of mathematical notation, but unlike other books on machine learning and neural networks, we’ll always keep the conversation grounded in real examples and real code.

To make this conversation more concrete, let’s consider a few examples and start writing some code.

Basics of machine learning

When we considered the task of recognizing wake-words, we put together a dataset consisting of snippets and labels. We then described (albeit abstractly) how you might train a machine learning model to predict the label given a snippet. This set-up, predicting labels from examples, is just one flavor of ML and it’s called supervised learning. Even within deep learning, there are many other approaches, and we’ll discuss each in subsequent sections. To get going with machine learning, we need four things: data, a model of how to transform the data, a loss function to measure how well we’re doing, and an algorithm to tweak the model parameters such that the loss function is minimized.


Generally, the more data we have, the easier our job as modelers. When we have more data, we can train more powerful models. Data is at the heart of the resurgence of deep learning, and many of the most exciting models in deep learning don’t work without large datasets. Here are some examples of the kinds of data machine learning practitioners often engage with:

  • Images: Pictures taken by smartphones or harvested from the web, satellite images, photographs of medical conditions, ultrasounds, and radiologic images like CT scans and MRIs, etc.
  • Text: Emails, high school essays, tweets, news articles, doctor’s notes, books, and corpora of translated sentences, etc.
  • Audio: Voice commands sent to smart devices like Amazon Echo, or iPhone or Android phones, audio books, phone calls, music recordings, etc.
  • Video: Television programs and movies, YouTube videos, cell phone footage, home surveillance, multi-camera tracking, etc.
  • Structured data: This Jupyter notebook (it contains text, images, code), webpages, electronic medical records, car rental records, electricity bills, etc.


Usually, the data looks quite different from what we want to accomplish with it. For example, we might have photos of people and want to know whether they appear to be happy. So we might desire a model capable of ingesting a high-resolution image and outputting a happiness score. While some simple problems might be addressable with simple models, we’re asking a lot in this case. To do its job, our happiness detector needs to transform hundreds of thousands of low-level features (pixel values) into something quite abstract on the other end (happiness scores). Choosing the right model is hard, and different models are better suited to different datasets. In this book, we’ll be focusing mostly on deep neural networks. These models consist of many successive transformations of the data that are chained together top to bottom, thus the name deep learning. On our way to discussing deep nets, we’ll also discuss some simpler, shallower models.

  • Loss function. To assess how well we’re doing we need to compare the output from the model with the truth. Loss functions allow us to determine whether a stock prediction of $1,500 for AMZN by December 31, 2017 is correct. Depending on whether we decided to go short or long on it, we would incur different losses (or realize profits), hence our loss functions might be quite different.
  • Training. Usually, models have many parameters. These are the ones that we need to ‘learn’, by minimizing the loss incurred on training data. Unfortunately, doing well on the latter doesn’t guarantee that we will do well on (unseen) test data, as the analogy below illustrates.
    • Training Error - This is the error on the training set, i.e. the dataset on which we minimized the loss to find the model \(f\). This is equivalent to doing well on all the practice exams that a student might use to prepare for the real exam. Encouraging, but by no means a guarantee.
    • Test Error - This is the error incurred on an unseen test set. It can differ from the training error by quite a bit; when the test error is much worse, statisticians call this overfitting. In real-life terms, this is the equivalent of screwing up the real exam despite doing well on the practice exams.
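To make the gap between training error and test error concrete, here is a small sketch. Everything in it is a made-up toy: the task (label a number 1 if it is even), the split into "practice exam" and "real exam" numbers, and the two hypothetical models `memorizer` and `parity_rule`.

```python
# Hypothetical toy task: the label is 1 when the number is even, else 0.
train = [(n, int(n % 2 == 0)) for n in range(0, 10)]   # practice exams
test = [(n, int(n % 2 == 0)) for n in range(10, 20)]   # the real exam


def error_rate(predict, examples):
    """Fraction of examples on which the prediction disagrees with the label."""
    wrong = sum(1 for x, y in examples if predict(x) != y)
    return wrong / len(examples)


# A "model" that memorizes the practice answers and guesses 0 otherwise.
memory = dict(train)


def memorizer(x):
    return memory.get(x, 0)


# A model that captured the underlying pattern instead.
def parity_rule(x):
    return int(x % 2 == 0)


train_err_mem = error_rate(memorizer, train)
test_err_mem = error_rate(memorizer, test)
print("memorizer:   train", train_err_mem, "test", test_err_mem)
print("parity rule: train", error_rate(parity_rule, train),
      "test", error_rate(parity_rule, test))
```

The memorizer aces the practice exam and still gets half of the real exam wrong, which is exactly the failure that a low training error cannot rule out.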

In the following sections, we will discuss a few types of machine learning in some more detail. This helps to understand what exactly one aims to do. We begin with a list of objectives, i.e. a list of things that machine learning can do. Note that the objectives are complemented with a set of techniques of how to accomplish them, i.e. training, types of data, etc. The list below is really only sufficient to whet the readers’ appetite and to give us a common language when we talk about problems. We will introduce a larger number of such problems as we go along.

Supervised learning

Supervised learning describes the task of predicting targets \(y\) given inputs \(x\) by training on labeled examples. In probabilistic terms, supervised learning is concerned with estimating the conditional probability \(P(y|x)\). And while it’s just one among several approaches to machine learning, supervised learning accounts for the majority of machine learning in practice. Partly, that’s because many important or valuable tasks can be described crisply as supervised learning. Predict cancer vs not cancer, given a CT image. Predict the correct translation in French, given a sentence in English. Predict the price of a stock next month based on this month’s financial reporting data.

Even with the simple description “predict targets from inputs”, supervised learning can take a great many forms and require a great many modeling decisions, depending on the type, size, and number of inputs and outputs. For example, we use different models to process sequences (like strings of text or time series data) and to process fixed-length vector representations. We’ll visit many of these problems in depth throughout the first 9 parts of this book.


Perhaps the simplest supervised learning task to wrap your head around is regression. Here, we have some input data consisting of numerical values, like the square footage of a house, the number of bedrooms, the number of bathrooms, and the number of minutes (walking) to the center of town. Formally we call such a collection of features a vector. If you live in New York or San Francisco and you are not the CEO of Amazon, Google, Microsoft, or Facebook, the (sqft, no. of bedrooms, no. of bathrooms, walking distance) feature vector for your home might look something like: \([100, 0, .5, 60]\). However, if you live in Pittsburgh, it might look more like \([3000, 4, 3, 10]\). Feature vectors like this are essential for all the classic machine learning problems. We’ll typically denote all the features for any one example (like a house) \(\mathbf{x_i}\) and the set of feature vectors for all our examples \(X\).

What makes a problem regression is actually the outputs. Say that you’re in the market for a new home. You might want to estimate the fair market value of a house, given some features like these. The target value, the price of sale, is a real number. We denote any individual target \(y_i\) (corresponding to example \(\mathbf{x_i}\)) and the set of all targets \(\mathbf{y}\) (corresponding to all examples \(X\)). When our targets take on arbitrary real values in some range, we call this a regression problem. The goal of our model is to produce predictions (guesses of the price, in our example) that closely approximate the actual target values. We denote these predictions \(\hat{y}_i\), and if the notation seems wacky, then just ignore it for now. We’ll unpack it more thoroughly in the subsequent chapters.

Lots of practical problems are well-described regression problems. Predicting the rating that a user will assign to a movie is a regression problem. And if you designed a great algorithm to accomplish this feat in 2009, you might have won the $1 million Netflix Prize. Predicting the length of stay for patients in the hospital is also a regression problem. A good rule of thumb is that any How much? or How many? problem should suggest regression:

  • “How many hours will this surgery take?” … regression
  • “How many dogs are in this photo?” … regression

However, if you can easily pose your problem as “Is this a ___?”, then it’s likely classification, a different fundamental problem type that we’ll cover next.

Even if you’ve never worked with machine learning before, you’ve probably worked through a regression problem informally. Imagine, for example, that you had your drains repaired and the contractor spent \(x_1=3\) hours removing gunk from your sewage pipes, and then sent you a bill of \(y_1 = \$350\). Now imagine that your friend hired the same contractor for \(x_2 = 2\) hours and that he received a bill of \(y_2 = \$250\). If someone then asked you how much to expect on their upcoming gunk-removal invoice, you might make some reasonable assumptions (more hours \(\rightarrow\) more dollars). You might also assume that there’s some base charge and that the contractor then charges per hour. If these assumptions held, then given these two data points, you could already identify the contractor’s pricing structure: $100 per hour plus $50 to show up at your house. If you followed that much then you already understand the high-level idea behind linear regression.
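The arithmetic behind the contractor example can be spelled out directly. The sketch below assumes, as the text does, a base charge plus a constant hourly rate; the names `rate`, `base`, and `predict_bill` are ours.

```python
# The two observed (hours, bill in dollars) pairs from the text.
x1, y1 = 3, 350
x2, y2 = 2, 250

# Assuming bill = base + rate * hours, two points pin down both parameters.
rate = (y1 - y2) / (x1 - x2)   # dollars per hour
base = y1 - rate * x1          # fixed charge for showing up


def predict_bill(hours):
    """Predicted invoice under the assumed linear pricing structure."""
    return base + rate * hours


print(rate, base)          # 100.0 50.0
print(predict_bill(4))     # 450.0
```

With more than two data points, or with noisy bills, there is generally no line that passes through every point exactly, which is where loss functions come in.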

In this case, we could produce the parameters that exactly matched the contractor’s prices. Sometimes that’s not possible, e.g., if some of the variance owes to factors besides your features. In these cases, we’ll try to learn models that minimize the distance between our predictions and the observed values. In most of our chapters, we’ll focus on one of two very common losses, the \(L_1\) loss where \(l(y,y') = \sum_i |y_i-y_i'|\) and the \(L_2\) loss where \(l(y,y') = \sum_i (y_i - y_i')^2\). As we will see later, the \(L_2\) loss corresponds to the assumption that our data was corrupted by Gaussian noise, whereas the \(L_1\) loss corresponds to an assumption of noise from a Laplace distribution.
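As a quick illustration of the two formulas, here is a plain Python sketch of both losses. The target and prediction values are invented for the example.

```python
def l1_loss(y, y_pred):
    """Sum of absolute differences between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y, y_pred))


def l2_loss(y, y_pred):
    """Sum of squared differences between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_pred))


# Hypothetical bills (targets) and a model's guesses (predictions).
targets = [350.0, 250.0, 450.0]
predictions = [340.0, 250.0, 480.0]

print(l1_loss(targets, predictions))  # 40.0
print(l2_loss(targets, predictions))  # 1000.0
```

Note how the single large miss (30 dollars) dominates the squared loss far more than the absolute loss: squaring penalizes big errors disproportionately.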


While regression models are great for addressing How many? questions, lots of problems don’t bend comfortably to this template. For example, say we wanted to build an automated system for optical character recognition (OCR). In other words, given an image of a hand-written character, we’d like to say which letter or number or punctuation mark is depicted. This kind of problem is called classification and it’s treated with a distinct set of algorithms. In classification, we want to look at a feature vector and then say which among a set of categories (formally called classes) an example belongs to.

More formally, given example data \(X\), such as images, text, sound, video, medical diagnostics, performance of a car, motion sensor data, etc., we want to answer the question as to which class \(y \in Y\) the data belongs to. The simplest form of classification is when there are only two classes, a problem which we call binary classification. For example, our dataset \(X\) could consist of images of animals and our labels \(Y\) might be the classes \(\mathrm{\{cat, dog\}}\). While in regression, we sought a regressor to output a real value \(\hat{y}\), in classification, we seek a classifier, whose output \(\hat{y}\) is the predicted class assignment.

For reasons that we’ll get into as the book gets more technical, it’s pretty hard to optimize a model that can only output a hard categorical assignment, e.g. either cat or dog. It’s a lot easier instead to express the model in the language of probabilities. Given an example \(x\), the model assigns a probability \(\hat{y}_k\) to each label \(k\). Because these are probabilities, they need to be nonnegative numbers and add up to \(1\). This means that we only need \(K-1\) numbers to give the probabilities of \(K\) categories. This is easy to see for binary classification. If there’s a .6 probability that an unfair coin comes up heads, then there’s a .4 probability that it comes up tails. Returning to our animal classification example, a classifier might see an image and output the probability that the image is a cat \(\Pr(y=\mathrm{cat}\mid x) = 0.9\). We can interpret this number by saying that the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for the predicted class is one notion of confidence. It’s not the only notion of confidence and we’ll discuss different notions of uncertainty in more advanced chapters.

When we have more than two possible classes, we call the problem multiclass classification. Common examples include hand-written character recognition [0, 1, 2, 3 ... 9, a, b, c, ...]. While we attacked regression problems by trying to minimize the L1 or L2 loss functions, the common loss function for classification problems is called cross-entropy. MXNet Gluon provides a corresponding loss function in its gluon.loss module.
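To give a feel for cross-entropy before we treat it properly, here is a minimal sketch for a single example with one true class. In that case the loss reduces to the negative log of the probability the model assigned to the correct label. The function name and the probabilities are our own illustrative choices, and this is plain Python rather than the Gluon loss.

```python
import math


def cross_entropy(probs, true_class):
    """Negative log-probability that the model assigned to the true class."""
    return -math.log(probs[true_class])


# A hypothetical classifier output for one image.
prediction = {"cat": 0.9, "dog": 0.1}

loss_if_cat = cross_entropy(prediction, "cat")  # model was right and confident
loss_if_dog = cross_entropy(prediction, "dog")  # model was wrong and confident

print(f"true label cat: {loss_if_cat:.3f}")
print(f"true label dog: {loss_if_dog:.3f}")
```

A confident correct prediction costs little, while a confident wrong one costs a lot, so minimizing this loss pushes probability mass toward the true classes.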

Note that the most likely class is not necessarily the one that you’re going to use for your decision. Assume that you find this beautiful mushroom in your backyard:

Death cap - do not eat!

Now, assume that you built a classifier and trained it to predict if a mushroom is poisonous based on a photograph. Say our poison-detection classifier outputs \(\Pr(y=\mathrm{death\ cap}\mid\mathrm{image}) = 0.2\). In other words, the classifier is rather confident that our mushroom is not a death cap. Still, you’d have to be a fool to eat it. That’s because the certain benefit of a delicious dinner isn’t worth a 20% chance of dying from it. In other words, the effect of the uncertain risk by far outweighs the benefit. Let’s look at this in math. Basically, we need to compute the expected risk that we incur, i.e. we need to multiply the probability of the outcome with the benefit (or harm) associated with it:

\[L(\mathrm{action}\mid x) = \mathbf{E}_{y \sim p(y\mid x)}[\mathrm{loss}(\mathrm{action},y)]\]

Hence, the loss \(L\) incurred by eating the mushroom is \(L(a=\mathrm{eat}\mid x) = 0.2 \times \infty + 0.8 \times 0 = \infty\), whereas the cost of discarding it is \(L(a=\mathrm{discard}\mid x) = 0.2 \times 0 + 0.8 \times 1 = 0.8\). We got lucky: as any botanist would tell us, the above actually is a death cap.
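The same calculation can be written as a few lines of Python. The probabilities and losses are the ones used in the text (an infinite loss for eating a death cap, a loss of 1 for throwing away an edible mushroom); the label "edible" for the other outcome and the dictionary layout are our own choices.

```python
import math

# Classifier output: probability of each outcome given the image.
p = {"death cap": 0.2, "edible": 0.8}

# loss[(action, outcome)]: how bad each action is under each outcome.
loss = {
    ("eat", "death cap"): math.inf,
    ("eat", "edible"): 0.0,
    ("discard", "death cap"): 0.0,
    ("discard", "edible"): 1.0,
}


def expected_risk(action):
    """Expectation of the loss over outcomes drawn from p(y | x)."""
    return sum(p[y] * loss[(action, y)] for y in p)


risk_eat = expected_risk("eat")
risk_discard = expected_risk("discard")
best_action = min(["eat", "discard"], key=expected_risk)

print(risk_eat, risk_discard, best_action)  # inf 0.8 discard
```

The decision follows the action with the lowest expected risk, not the class with the highest probability.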

Classification can get much more complicated than just binary or even multiclass classification. For instance, there are some variants of classification for addressing hierarchies. Hierarchies assume that there exist some relationships among the many classes. So not all errors are equal: we would rather misclassify to a related class than to a distant one. Usually, this is referred to as hierarchical classification. One early example of a hierarchy is due to Linnaeus, who applied the idea to animals.

So in this case, it might not be so bad to mistake a poodle for a schnauzer, but our model would pay a huge penalty if it confused a poodle with a dinosaur. Which hierarchy is relevant might depend on how you plan to use the model. For example, rattlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be deadly.


Some classification problems don’t fit neatly into the binary or multiclass classification setups. For example, we could train a normal binary classifier to distinguish cats from dogs. And given the current state of computer vision, we can do this easily, with off-the-shelf tools. But no matter how accurate our model gets, we might find ourselves in trouble when the classifier encounters an image like this:

As you can see, there’s a cat in the picture. And a dog. And a tire, some grass, a door, concrete, rust, individual grass leaves, etc. Depending on what we want to do with our model ultimately, treating this as a binary classification problem might not make a lot of sense. Instead, we might want to give the model the option of saying the image depicts a cat and a dog. Or neither a cat nor a dog.

The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. Auto-tagging problems are typically best described as multi-label classification problems. Think of the tags people might apply to posts on a tech blog, e.g. “machine learning”, “technology”, “gadgets”, “programming languages”, “linux”, “cloud computing”, “AWS”. A typical article might have 5-10 tags applied because these concepts are correlated. Posts about “cloud computing” are likely to mention “AWS” and posts about “machine learning” could also deal with “programming languages”.

This problem emerges in the biomedical literature, where correctly tagging articles is important because it allows researchers to do exhaustive reviews of the literature. At the National Library of Medicine, a number of professional annotators go over each article that gets indexed in PubMed to associate each with the relevant terms from MeSH, a collection of roughly 28k tags. This is a time-consuming process and the annotators typically have a one-year lag between archiving and tagging. Machine learning can be used here to provide provisional tags until each article can have a proper manual review. Indeed, for several years, the BioASQ organization has hosted a competition to do precisely this.

Search and ranking

Sometimes we don’t just want to assign each example to a bucket or to a real value. In the field of information retrieval, we want to impose a ranking on a set of items. Take web search, for example. The goal is less to determine whether a particular page is relevant for a query than to decide which of the plethora of search results should be displayed to the user. We really care about the ordering of the relevant search results, and our learning algorithm needs to produce ordered subsets of elements from a larger set. In other words, if we are asked to produce the first 5 letters from the alphabet, there is a difference between returning A B C D E and C A B E D. Even if the result set is the same, the ordering within the set matters nonetheless.

A possible solution to this problem is to assign every element in the set of candidates a relevance score and then retrieve the top-rated elements. PageRank is an early example of such a relevance score. One of its peculiarities is that it didn’t depend on the actual query. Instead, it simply helped to order the results that contained the query terms. Nowadays search engines use machine learning and behavioral models to obtain query-dependent relevance scores. There are entire conferences devoted to this subject.
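The score-then-retrieve idea is simple enough to sketch directly. In this toy version the relevance scores are hard-coded, hypothetical numbers; in a real system they would come from a learned model. The page names and the helper `rank` are invented for illustration.

```python
def rank(candidates, score, k):
    """Return the k candidates with the highest score, best first."""
    return sorted(candidates, key=score, reverse=True)[:k]


# Hypothetical relevance scores for pages that matched a query.
relevance = {"page_a": 0.2, "page_b": 0.9, "page_c": 0.5, "page_d": 0.7}

top = rank(relevance, relevance.get, k=3)
print(top)  # ['page_b', 'page_d', 'page_c']
```

The output is an ordered list rather than a set, which reflects the point that the order of the results matters, not just which results are returned.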

Recommender systems

Recommender systems are another problem setting that is related to search and ranking. The problems are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the emphasis on personalization to specific users in the context of recommender systems. For instance, for movie recommendations, the results page for a SciFi fan and the results page for a connoisseur of Woody Allen comedies might differ significantly.

Such problems occur, e.g. for movie, product or music recommendation. In some cases, customers will provide explicit details about how much they liked the product (e.g. Amazon product reviews). In some other cases, they might simply provide feedback if they are dissatisfied with the result (skipping titles on a playlist). Generally, such systems strive to estimate some score \(y_{ij}\) as a function of user \(u_i\) and object \(o_j\). The objects \(o_j\) with the largest scores \(y_{ij}\) are then used as a recommendation. Production systems are considerably more advanced and take detailed user activity and item characteristics into account when computing such scores. Below is an example of the books recommended for deep learning, based on the author’s preferences.
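In code, the simplest version of this idea is a table of scores \(y_{ij}\) indexed by user and object, from which we pick each user’s highest-scoring objects. The users, titles, and scores below are invented for illustration; a real system would estimate the scores from data.

```python
# scores[user][item] plays the role of y_ij (hypothetical values).
scores = {
    "scifi_fan": {"Alien": 0.9, "Annie Hall": 0.2, "Blade Runner": 0.8},
    "comedy_fan": {"Alien": 0.1, "Annie Hall": 0.95, "Blade Runner": 0.3},
}


def recommend(user, n):
    """Return the n items with the largest scores for this user."""
    user_scores = scores[user]
    return sorted(user_scores, key=user_scores.get, reverse=True)[:n]


print(recommend("scifi_fan", 2))   # ['Alien', 'Blade Runner']
print(recommend("comedy_fan", 1))  # ['Annie Hall']
```

The same catalog yields different result pages for different users, which is the personalization that separates recommendation from plain ranking.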

Sequence Learning

So far we’ve looked at problems where we have some fixed number of inputs and produce a fixed number of outputs. Take some features of a home (square footage, number of bedrooms, number of bathrooms, walking time to downtown), and predict its value. Take an image (of fixed dimension) and produce a vector of probabilities (for a fixed number of classes). Take a user ID and a product ID and predict a star rating. And once we feed our fixed-length input into the model to generate an output, the model immediately forgets what it just saw.

This might be fine if our inputs really all look the same – or if successive inputs have nothing to do with each other. But what if we were dealing with video? Then our guess of what’s going on in each frame could be much stronger if we take into account the previous frames of the video. Or what if we wanted a model that could ingest sentences in some source language and predict their translation in another language?

And what if we wanted a model to monitor patients in the intensive care unit and to fire off an alert if their risk of death in the next 24 hours exceeded some modest threshold? We definitely wouldn’t want this model to throw away everything it knows about the patient history each hour and just make its predictions based on the most recent measurements.

Some of the more exciting applications of machine learning are sequence learning problems. They require a model either to ingest sequences of inputs or to emit sequences of outputs (or both!). These latter problems are sometimes referred to as seq2seq problems. Language translation is a seq2seq problem. Transcribing text from spoken speech is also a seq2seq problem. While it is impossible to consider all types of sequence transformations, a number of special cases are worth mentioning:

Tagging and Parsing

This involves annotating a text sequence with attributes. In other words, the number of inputs and outputs is essentially the same. For instance, we might want to know where the verbs and subjects are, or which words are the named entities. In general, the goal is to decompose and annotate text \(x\) based on structural and grammatical assumptions to get some annotation \(y\). This sounds more complex than it actually is. Below is a very simple example of annotating a sentence with tags regarding which word refers to a named entity.

Tom wants to have dinner in Washington with Sally.
E   -     -  -    -      -  E          -    E
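As a data structure, the annotation above is just a list of tags with the same length as the list of words. The sketch below pairs them up and pulls out the named entities; splitting on whitespace is a simplifying assumption that happens to work for this sentence.

```python
sentence = "Tom wants to have dinner in Washington with Sally."
tags = ["E", "-", "-", "-", "-", "-", "E", "-", "E"]

# Naive tokenization: drop the final period and split on spaces.
tokens = sentence.rstrip(".").split()

# Tagging produces exactly one output per input token.
assert len(tokens) == len(tags)

entities = [token for token, tag in zip(tokens, tags) if tag == "E"]
print(entities)  # ['Tom', 'Washington', 'Sally']
```

A tagging model’s job is to produce the `tags` list automatically from the tokens.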

Automatic Speech Recognition

Here the input sequence \(x\) is the sound of a speaker, and the output \(y\) is the textual transcript of what the speaker said. The challenge is that there are many more audio frames than characters of text (sound is typically sampled at 8kHz or 16kHz), i.e. there is no 1:1 correspondence between audio and text. In other words, this is a seq2seq problem where the output is much shorter than the input.

----D----e--e--e-----p----------- L----ea-------r---------ni-----ng-----
Deep Learning

Text to Speech

TTS is the inverse of Speech Recognition. That is, the input \(x\) is text and the output \(y\) is an audio file. Here, the output is much longer than the input, which is one of the main challenges. Moreover, while it is easy for humans to recognize a bad audio file, this isn’t quite so trivial for computers.

Machine Translation

Unlike in the previous cases where the order of the inputs was preserved, in machine translation, order inversion can be vital. In other words, while we are still converting one sequence into another, neither the number of inputs and outputs nor the order of corresponding data points is assumed to be the same. Consider the following example which illustrates the obnoxious tendency of Germans (Alex writing here) to place the verbs at the end of sentences.

German:          Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
English:         Did you already check out this excellent tutorial?
Wrong alignment: Did you yourself already this excellent tutorial looked-at?

A number of related problems exist. For instance, determining the order in which a user reads a webpage is a two-dimensional layout analysis problem. Likewise, for dialogue problems, we need to take world-knowledge and prior state into account. This is an active area of research.

Unsupervised learning

All the examples so far are related to Supervised Learning, i.e. situations where we feed the model a bunch of examples and a bunch of corresponding target values. You could think of supervised learning as having an extremely specialized job and an extremely anal boss. The boss stands over your shoulder and tells you exactly what to do in every situation until you learn to map from situations to actions. Working for such a boss sounds pretty lame. But on the other hand, it’s easy to please this boss. You just recognize the pattern as quickly as possible and imitate their actions.

In a completely opposite way, it could be frustrating to work for a boss who has no idea what they want you to do. But if you’re a data scientist, you’d better get used to it. The boss might just hand you a giant dump of data and tell you to do some data science with it! This sounds vague because it is. We call this class of problems unsupervised learning, and the type and number of questions we could ask are limited only by our creativity. We will address a number of unsupervised learning techniques in later chapters. To whet your appetite for now, we describe a few of the questions you might ask:

  • Can we find a small number of prototypes that accurately summarize the data? E.g. given a set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, mountain peaks, etc.? Likewise, given a collection of users (with their behavior), can we group them into users with similar behavior? This problem is typically known as clustering.
  • Can we find a small number of parameters that accurately capture the relevant properties of the data? E.g. the trajectories of a ball are quite well described by velocity, diameter, and mass of the ball. Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation problems. If the dependence is linear, it is called principal component analysis.
  • Is there a representation of (arbitrarily structured) objects in Euclidean space (i.e. the space of vectors in \(\mathbb{R}^n\)) such that symbolic properties can be well matched? This is called representation learning, and it is used to describe entities and their relations, such as Rome - Italy + France = Paris.
  • Is there a description of the root causes of much of the data that we observe? For instance, if we have demographic data about house prices, pollution, crime, location, education, salaries, etc., can we discover how they are related simply based on empirical data? The field of directed graphical models and causality deals with this.
  • An important and exciting recent development is generative adversarial networks. They are basically a procedural way of synthesizing data. The underlying statistical mechanisms are tests to check whether real and fake data are the same. We will devote a few notebooks to them.
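
The first two questions can be tried out on toy data. The sketch below, assuming NumPy and two synthetic blobs of points that we made up for illustration, implements plain k-means (Lloyd's algorithm) for clustering and principal component analysis via the singular value decomposition. The function names kmeans and pca are our own, and this is a bare-bones sketch rather than a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def pca(X, n_components):
    """PCA via SVD. Returns (projection, directions, explained variance ratio)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions (rows)
    explained = S ** 2 / np.sum(S ** 2)
    return Xc @ components.T, components, explained[:n_components]

# Synthetic data (an assumption): two well-separated blobs in the plane.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(X, k=2)               # two "prototypes"
projected, components, explained = pca(X, n_components=1)

print("centroids:\n", centroids)
print("first principal direction:", components[0])
print("share of variance explained:", explained[0])
```

Here the two centroids act as the prototypes from the first question, and the single principal direction plays the role of the small number of parameters from the second: one coordinate along the diagonal captures almost all of the variation in this data.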


So far we have not discussed where the data comes from, how we need to interact with the environment, whether it remembers what we did previously, or whether the environment wants to help us (e.g. a user reading text into a speech recognizer), is out to beat us (e.g. in a game), or doesn’t care (as in most cases). These problems are usually distinguished by monikers such as batch learning, online learning, control, and reinforcement learning.

We also have not discussed what happens when training and test data are drawn from different distributions (statisticians call this covariate shift). This is a problem that most of us will have experienced painfully when taking exams written by the lecturer, while the homework was composed by the TAs. Likewise, there is a large class of situations where we want our tools to be robust against malicious or malformed training data (robustness) or equally abnormal test data. We will introduce these aspects gradually throughout this tutorial to help practitioners deal with them in their work.
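
A toy experiment, assuming NumPy and a made-up target function y = x², hints at why such a mismatch matters. We fit a straight line on inputs from [0, 1] and then evaluate it on inputs from the same range and on inputs shifted to [2, 3]. The labeling rule never changes; only the inputs move. All names below are our own.

```python
import numpy as np

def target(x):
    """The fixed labeling rule (a made-up example): y = x^2."""
    return x ** 2

x_train = np.linspace(0.0, 1.0, 50)          # training inputs
x_test_same = np.linspace(0.0, 1.0, 37)      # test inputs from the same range
x_test_shifted = np.linspace(2.0, 3.0, 37)   # test inputs from a shifted range

# Least-squares straight line; np.polyfit returns the highest degree first.
slope, intercept = np.polyfit(x_train, target(x_train), deg=1)

def mse(x):
    """Mean squared error of the fitted line on inputs x."""
    prediction = slope * x + intercept
    return float(np.mean((prediction - target(x)) ** 2))

mse_same = mse(x_test_same)
mse_shifted = mse(x_test_shifted)
print("error on the training range:", mse_same)
print("error on the shifted range: ", mse_shifted)
```

The line is a decent approximation where it was trained and a poor one elsewhere, even though nothing about the relationship between inputs and labels has changed.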

When not to use machine learning

Let’s take a closer look at the idea of programming with data by considering an interaction that Joel Grus reported experiencing in a job interview. The interviewer asked him to code up Fizz Buzz. This is a children’s game where the players count from 1 to 100 and say ‘fizz’ whenever the number is divisible by 3, ‘buzz’ whenever it is divisible by 5, and ‘fizzbuzz’ whenever it satisfies both criteria. Otherwise, they just state the number. It looks like this:

1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 ...

The conventional way to solve such a task is quite simple.

In [1]:
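res = []
for i in range(1, 101):
    if i % 15 == 0:  # divisible by both 3 and 5
        res.append('fizzbuzz')
    elif i % 3 == 0:
        res.append('fizz')
    elif i % 5 == 0:
        res.append('buzz')
    else:
        res.append(str(i))
print(' '.join(res))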
1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz 16 17 fizz 19 buzz fizz 22 23 fizz buzz 26 fizz 28 29 fizzbuzz 31 32 fizz 34 buzz fizz 37 38 fizz buzz 41 fizz 43 44 fizzbuzz 46 47 fizz 49 buzz fizz 52 53 fizz buzz 56 fizz 58 59 fizzbuzz 61 62 fizz 64 buzz fizz 67 68 fizz buzz 71 fizz 73 74 fizzbuzz 76 77 fizz 79 buzz fizz 82 83 fizz buzz 86 fizz 88 89 fizzbuzz 91 92 fizz 94 buzz fizz 97 98 fizz buzz

Needless to say, this isn’t very exciting if you’re a good programmer. Joel proceeded to ‘implement’ this problem in Machine Learning instead. For that to succeed, he needed a number of pieces:

  • Data X [1, 2, 3, 4, ...] and labels Y ['fizz', 'buzz', 'fizzbuzz', identity]
  • Training data, i.e. examples of what the system is supposed to do, such as [(2, 2), (6, fizz), (15, fizzbuzz), (23, 23), (40, buzz)]
  • Features that map the data into something that the computer can handle more easily, e.g. x -> [(x % 3), (x % 5), (x % 15)]. This is optional but helps a lot if you have it.

Armed with this, Joel wrote a classifier in TensorFlow (code). The interviewer was nonplussed … and the classifier didn’t have perfect accuracy.
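
To see how the three ingredients listed above fit together, consider the following minimal sketch. It is not Joel's TensorFlow classifier: we swap in a 1-nearest-neighbor rule written in NumPy, and we assume (our choice) that the numbers 101 to 1000 serve as training data while 1 to 100 are held out for testing. All function names are hypothetical.

```python
import numpy as np

LABELS = ['identity', 'fizz', 'buzz', 'fizzbuzz']   # the label set Y

def label(x):
    """Ground-truth class index, used only to build the labeled data."""
    if x % 15 == 0:
        return 3
    if x % 5 == 0:
        return 2
    if x % 3 == 0:
        return 1
    return 0

def features(x):
    """The feature map from the text: x -> [x % 3, x % 5, x % 15]."""
    return np.array([x % 3, x % 5, x % 15])

# Assumed split: train on 101..1000, test on 1..100.
train_x = range(101, 1001)
test_x = range(1, 101)
train_f = np.stack([features(x) for x in train_x])
train_y = np.array([label(x) for x in train_x])

def predict(x):
    """1-nearest-neighbor: copy the label of the closest training example."""
    distances = np.abs(train_f - features(x)).sum(axis=1)
    return int(train_y[distances.argmin()])

def decode(x):
    """Turn a predicted class back into what a player would say."""
    c = predict(x)
    return str(x) if c == 0 else LABELS[c]

res = [decode(x) for x in test_x]
accuracy = float(np.mean([predict(x) == label(x) for x in test_x]))
print(' '.join(res[:16]))
print('accuracy on 1..100:', accuracy)
```

With these hand-crafted features the held-out numbers are classified perfectly, because every feature vector seen at test time already appears in the training set. The features are doing nearly all of the work here, which is another way of saying that the problem never needed learning in the first place.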

Quite obviously, this is silly. Why would you go to the trouble of replacing a few lines of Python with something much more complicated and error prone? However, there are many cases where no simple Python script exists, yet a 3-year-old child can solve the problem perfectly.

[Figure: four example images, labeled cat, cat, dog, dog.]

Fortunately, this is precisely where machine learning comes to the rescue. We can ‘program’ a cat detector by providing our machine learning system with many examples of cats and dogs. This way it will eventually learn a function that will e.g. emit a very large positive number if it’s a cat, a very large negative number if it’s a dog, and something closer to zero if it isn’t sure. But this is just barely scratching the surface of what machine learning can do.
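
As a rough sketch of such a scoring function, assume each image has already been reduced to two numeric features (a strong simplification; the synthetic points below are made up, and real detectors work from pixels). We fit a linear score by gradient descent on the logistic loss, so that cats (+1) receive positive scores and dogs (-1) negative ones. The names and hyperparameters are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for image features (an assumption, not real image data).
cats = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
dogs = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([cats, dogs])
y = np.concatenate([np.ones(50), -np.ones(50)])   # +1 = cat, -1 = dog

w = np.zeros(2)   # weights of the linear score
b = 0.0           # bias
learning_rate = 0.1

for _ in range(200):
    margins = y * (X @ w + b)
    # Gradient of the mean logistic loss log(1 + exp(-margin))
    # with respect to the score of each example.
    coeff = -y / (1.0 + np.exp(margins))
    w -= learning_rate * (X.T @ coeff) / len(X)
    b -= learning_rate * coeff.mean()

def score(x):
    """Positive means 'cat', negative means 'dog', near zero means unsure."""
    return float(np.asarray(x, dtype=float) @ w + b)

print("clearly a cat:", score([2.0, 2.0]))
print("clearly a dog:", score([-2.0, -2.0]))
print("in between:   ", score([0.0, 0.0]))
```

A point deep in cat territory gets a clearly positive score, a point deep in dog territory a clearly negative one, and a point halfway between the two groups lands much closer to zero.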


Machine Learning is vast. We cannot possibly cover it all. On the other hand, the chain rule is simple, so it’s easy to get started.