So far we did not worry very much about where the data came from and how the models that we build get deployed. Not caring about it can be problematic. Many failed machine learning deployments can be traced back to this situation. This chapter is meant to help with detecting such situations early and points out how to mitigate them. Depending on the case this might be rather simple (ask for the ‘right’ data) or really difficult (implement a reinforcement learning system).

Covariate Shift

At its heart is a problem that is easy to understand but also equally easy to miss. Consider being given the challenge of distinguishing cats and dogs. Our training data consists of images of the following kind:









At test time we are asked to classify the following images:









Obviously this is unlikely to work well. The training set consists of photos, while the test set contains only cartoons. The colors aren’t even accurate. Training on a dataset that looks substantially different from the test set without some plan for how to adapt to the new domain is a bad idea. Unfortunately, this is a very common pitfall. Statisticians call this Covariate Shift, i.e. the situation where the distribution over the covariates (aka training data) is shifted on test data relative to the training case. Mathematically speaking, we are referring the case where \(p(x)\) changes but \(p(y|x)\) remains unchanged.

Concept Shift

A related problem is that of concept shift. This is the situation where the the labels change. This sounds weird - after all, a cat is a cat is a cat. Well, cats maybe but not soft drinks. There is considerable concept shift throughout the USA, even for such a simple term:


If we were to build a machine translation system, the distribution \(p(y|x)\) would be different, e.g. depending on our location. This problem can be quite tricky to spot. A saving grace is that quite often the \(p(y|x)\) only shifts gradually (e.g. the click-through rate for NOKIA phone ads). Before we go into further details, let us discuss a number of situations where covariate and concept shift are not quite as blatantly obvious.


Medical Diagnostics

Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine, giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics. Not so fast …

Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for a disease that affects mainly older men and they’d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money.

Self Driving Cars

A company wanted to build a machine learning system for self-driving cars. One of the key components is a roadside detector. Since real annotated data is expensive to get, they had the (smart and questionable) idea to use synthetic data from a game rendering engine as additional training data. This worked really well on ‘test data’ drawn from the rendering engine. Alas, inside a real car it was a disaster. As it turned out, the roadside had been rendered with a very simplistic texture. More importantly, all the roadside had been rendered with the same texture and the roadside detector learned about this ‘feature’ very quickly.

A similar thing happened to the US Army when they first tried to detect tanks in the forest. They took aerial photographs of the forest without tanks, then drove the tanks into the forest and took another set of pictures. The so-trained classifier worked ‘perfectly’. Unfortunately, all it had learned was to distinguish trees with shadows from trees without shadows - the first set of pictures was taken in the early morning, the second one at noon.

Nonstationary distributions

A much more subtle situation is where the distribution changes slowly and the model is not updated adequately. Here are a number of typical cases:

  • We train a computational advertising model and then fail to update it frequently (e.g. we forget to incorporate that an obscure new device called an iPad was just launched).

  • We build a spam filter. It works well at detecting all spam that we’ve seen so far. But then the spammers wisen up and craft new messages that look quite unlike anything we’ve seen before.

  • We build a product recommendation system. It works well for the winter. But then it keeps on recommending Santa hats after Christmas.

More Anecdotes

  • We build a classifier for “Not suitable/safe for work” (NSFW) images. To make our life easy, we scrape a few seedy Subreddits. Unfortunately the accuracy on real life data is lacking (the pictures posted on Reddit are mostly ‘remarkable’ in some way, e.g. being taken by skilled photographers, whereas most real NSFW images are fairly unremarkable …). Quite unsurprisingly the accuracy is not very high on real data.

  • We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data - the offending examples are close-ups where the face fills the entire image (no such data was in the training set).

  • We build a web search engine for the USA market and want to deploy it in the UK.

In short, there are many cases where training and test distribution \(p(x)\) are different. In some cases, we get lucky and the models work despite the covariate shift. We now discuss principled solution strategies. Warning - this will require some math and statistics.

Covariate Shift Correction

Assume that we want to estimate some dependency \(p(y|x)\) for which we have labeled data \((x_i,y_i)\). Alas, the observations \(x_i\) are drawn from some distribution \(q(x)\) rather than the ‘proper’ distribution \(p(x)\). To make progress, we need to reflect about what exactly is happening during training: we iterate over training data and associated labels \(\{(x_1, y_1), \ldots (x_m, y_m)\}\) and update the weight vectors of the model after every minibatch. Depending on the situation we also apply some penalty to the parameters, e.g. \(L_2\) regularization. In other words, we want to solve

\[\mathop{\mathrm{minimize}}_w \frac{1}{m} \sum_{i=1}^m l(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|w\|_2^2\]

Statisticians call the first term an empirical average, that is an average computed over the data drawn from \(p(x) p(y|x)\). If the data is drawn from the ‘wrong’ distribution \(q\), we can correct for that by using the following simple identity:

\[\mathbf{E}_{x \sim p(x)} [f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbf{E}_{x \sim q(x)} \left[f(x) \frac{p(x)}{q(x)}\right]\]

In other words, we need to re-weight each instance by the ratio of probabilities that it would have been drawn from the correct distribution \(\beta(x) := p(x)/q(x)\). Alas, we do not know that ratio, so before we can do anything useful we need to estimate it. Many methods are available, e.g. some rather fancy operator theoretic ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. Note that for any such approach, we need samples drawn from both distributions - the ‘true’ \(p\), e.g. by access to training data, and the one used for generating the training set \(q\) (the latter is trivially available).

In this case there exists a very effective approach that will give almost as good results: logistic regression. This is all that is needed to compute estimate probability ratios. We learn a classifier to distinguish between data drawn from \(p(x)\) and data drawn from \(q(x)\). If it is impossible to distinguish between the two distributions then it means that the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly over/underweighted accordingly. For simplicity’s sake assume that we have an equal number of instances from both distributions, denoted by \(x_i \sim p(x)\) and \(x_i′ \sim q(x)\) respectively. Now denote by \(z_i\) labels which are 1 for data drawn from \(p\) and -1 for data drawn from \(q\). Then the probability in a mixed dataset is given by

\[p(z=1|x) = \frac{p(x)}{p(x)+q(x)} \text{ and hence } \frac{p(z=1|x)}{p(z=-1|x)} = \frac{p(x)}{q(x)}\]

Hence, if we use a logistic regression approach where \(p(z=1|x)=\frac{1}{1+\exp(−f(x)}\) it follows (after some simple algebra) that \(\beta(x) = \exp(f(x))\). In summary, we need to solve two problems: first one to distinguish between data drawn from both distributions, and then a reweighted minimization problem where we weigh terms by \(\beta\), e.g. via the head gradients. Here’s a prototypical algorithm for that purpose:

CovariateShiftCorrector(X, Z)
    X: Training dataset (without labels)
    Z: Test dataset (without labels)

    generate training set with {(x_i, -1) ... (z_j, 1)}
    train binary classifier using logistic regression to get function f
    weigh data using beta_i = exp(f(x_i)) or
                     beta_i = min(exp(f(x_i)), c)
    use weights beta_i for training on X with labels Y

Generative Adversarial Networks use the very idea described above to engineer a data generator such that it cannot be distinguished from a reference dataset. For this, we use one network, say \(f\) to distinguish real and fake data and a second network \(g\) that tries to fool the discriminator \(f\) into accepting fake data as real. We will discuss this in much more detail later.

Concept Shift Correction

Concept shift is much harder to fix in a principled manner. For instance, in a situation where suddenly the problem changes from distinguishing cats from dogs to one of distinguishing white from black animals, it will be unreasonable to assume that we can do much better than just training from scratch using the new labels. Fortunately, in practice, such extreme shifts almost never happen. Instead, what usually happens is that the task keeps on changing slowly. To make things more concrete, here are some examples:

  • In computational advertising, new products are launched, old products become less popular. This means that the distribution over ads and their popularity changes gradually and any click-through rate predictor needs to change gradually with it.

  • Traffic cameras lenses degrade gradually due to environmental wear, affecting image quality progressively.

  • News content changes gradually (i.e. most of the news remains unchanged but new stories appear).

In such cases, we can use the same approach that we used for training networks to make them adapt to the change in the data. In other words, we use the existing network weights and simply perform a few update steps with the new data rather than training from scratch.

A Taxonomy of Learning Problems

Armed with knowledge about how to deal with changes in \(p(x)\) and in \(p(y|x)\), let us consider a number of problems that we can solve using machine learning.

  • Batch Learning. Here we have access to training data and labels \(\{(x_1, y_1), \ldots (x_n, y_n)\}\), which we use to train a network \(f(x,w)\). Later on, we deploy this network to score new data \((x,y)\) drawn from the same distribution. This is the default assumption for any of the problems that we discuss here. For instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we trained it, we ship it as part of a smart catdoor computer vision system that lets only cats in. This is then installed in a customer’s home and is never updated again (barring extreme circumstances).

  • Online Learning. Now imagine that the data \((x_i, y_i)\) arrives one sample at a time. More specifically, assume that we first observe \(x_i\), then we need to come up with an estimate \(f(x_i,w)\) and only once we’ve done this, we observe \(y_i\) and with it, we receive a reward (or incur a loss), given our decision. Many real problems fall into this category. E.g. we need to predict tomorrow’s stock price, this allows us to trade based on that estimate and at the end of the day we find out whether our estimate allowed us to make a profit. In other words, we have the following cycle where we are continuously improving our model given new observations.

\[\mathrm{model} ~ f_t \longrightarrow \mathrm{data} ~ x_t \longrightarrow \mathrm{estimate} ~ f_t(x_t) \longrightarrow \mathrm{observation} ~ y_t \longrightarrow \mathrm{loss} ~ l(y_t, f_t(x_t)) \longrightarrow \mathrm{model} ~ f_{t+1}\]
  • Bandits. They are a special case of the problem above. While in most learning problems we have a continuously parametrized function \(f\) where we want to learn its parameters (e.g. a deep network), in a bandit problem we only have a finite number of arms that we can pull (i.e. a finite number of actions that we can take). It is not very surprising that for this simpler problem stronger theoretical guarantees in terms of optimality can be obtained. We list it mainly since this problem is often (confusingly) treated as if it were a distinct learning setting.

  • Control (and nonadversarial Reinforcement Learning). In many cases the environment remembers what we did. Not necessarily in an adversarial manner but it’ll just remember and the response will depend on what happened before. E.g. a coffee boiler controller will observe different temperatures depending on whether it was heating the boiler previously. PID (proportional integral derivative) controller algorithms are a popular choice there. Likewise, a user’s behavior on a news site will depend on what we showed him previously (e.g. he will read most news only once). Many such algorithms form a model of the environment in which they act such as to make their decisions appear less random (i.e. to reduce variance).

  • Reinforcement Learning. In the more general case of an environment with memory, we may encounter situations where the environment is trying to cooperate with us (cooperative games, in particular for non-zero-sum games), or others where the environment will try to win. Chess, Go, Backgammon or StarCraft are some of the cases. Likewise, we might want to build a good controller for autonomous cars. The other cars are likely to respond to the autonomous car’s driving style in nontrivial ways, e.g. trying to avoid it, trying to cause an accident, trying to cooperate with it, etc.

One key distinction between the different situations above is that the same strategy that might have worked throughout in the case of a stationary environment, might not work throughout when the environment can adapt. For instance, an arbitrage opportunity discovered by a trader is likely to disappear once he starts exploiting it. The speed and manner at which the environment changes determines to a large extent the type of algorithms that we can bring to bear. For instance, if we know that things may only change slowly, we can force any estimate to change only slowly, too. If we know that the environment might change instantaneously, but only very infrequently, we can make allowances for that. These types of knowledge are crucial for the aspiring data scientist to deal with concept shift, i.e. when the problem that he is trying to solve changes over time.

For whinges or inquiries, open an issue on GitHub.