Scaleway • published Aug 6, 2020

Active Learning, part 1: the Theory


Active Learning, part 1: the Theory Scale • Olga Petrova • 06/08/20 • 7 min read

Active learning: when and why do we need it?

Active learning is still a relatively niche approach in the machine learning world, but that is bound to change. After all, active learning provides solutions to not one, but two challenging problems that have been poisoning the lives of many a data scientist. I am, of course, talking about the data: namely, its (a) quantity and (b) quality.

Let us start with the former. It is no secret that training modern machine learning (ML) models requires large quantities of training data. This is especially true for deep learning (ML carried out via deep artificial neural nets), where it is not uncommon for training sets to number in the hundreds of thousands and beyond. To make matters worse, many practical applications come in the form of supervised ML tasks: i.e., not only do we require all these training samples, but we also need a way to label them. Labeling data is time-consuming and expensive. It is bad enough if it involves human experts manually annotating text or images, but what if labeling entails invasive medical tests to confirm a patient's diagnosis? Or drilling down into the rock to test for oil?

There are plenty of scenarios where unlabeled data is easy to obtain, yet the labeling budget imposes severe limitations. This is precisely where semi-supervised learning shines: leveraging unlabeled data to achieve a supervised ML task with fewer labels than would otherwise be needed. It is a well-established fact that the performance of semi-supervised models often depends strongly on the selection of the training samples that have been labeled. Roughly speaking, the more "representative" or "informative" the labeled samples are, the better.

If we have the choice of which samples to label, however, how can we determine which ones would be of most use for our task? Sometimes this can be seen from a manual inspection (although combing through unlabeled data manually does not scale particularly well), and other times it cannot. Would it not be nice if the model itself could tell us which datapoints it would prefer to know the labels for? Well, it can - in fact, that is precisely what active learning is all about.

Paraphrasing George Orwell, some training samples are more equal than others!

The issue of data quality does not only manifest itself in the semi-supervised setting. Even if all of your data is already labeled, more is not always better when it comes to training an ML model. On the contrary, outliers and other types of spammy data may lead your model astray. If these represent a significant portion of your training set, a model trained in an active learning regime, where unhelpful data is ranked down, may even outperform a fully supervised model that had access to the entire dataset from the start!

Active learning: what is it?

Active learning is part of the so-called Human-in-the-Loop (HITL) machine learning paradigm. The idea behind HITL is to combine human and artificial intelligence to solve various data-driven tasks. Depending on how you look at it, all of machine learning is at least somewhat HITL, but some areas more than others. In active learning, human participation is as explicit as it is iterative:

1. The oracle (the source of the ground truth labels, e.g. the human expert) supplies the model with some labeled data.

2. The model gets trained on those labeled samples, and any others it may have gotten previously (it is then, most likely, tested on the validation set to keep track of its performance).

3. The model determines which unlabeled samples it would most like to have labeled next, and sends a request to the oracle.

And on and on it goes. Ideally, you repeat steps 1-3 until you are satisfied with the model's performance, but in the real world you may instead find that performance stops improving, or that you have run out of labeling budget.
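To make the loop concrete, here is a minimal sketch of steps 1-3 in plain Python. Everything in it is an illustrative assumption rather than part of the original article: the model is a toy 1-D threshold classifier, `oracle` is a hypothetical stand-in for the human expert, and the query rule (pick the sample closest to the current decision boundary) is just one possible choice. The validation step is left out, and the loop simply stops when the labeling budget runs out.

```python
def train(labeled):
    """Fit a toy 1-D classifier: the boundary is the midpoint of the class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def query(boundary, pool):
    """Pick the unlabeled sample closest to the boundary (an assumed query rule)."""
    return min(pool, key=lambda x: abs(x - boundary))


def oracle(x):
    """Hypothetical human expert: the true class boundary sits at 0.605."""
    return int(x > 0.605)


def active_learning(pool, seed, budget):
    # Step 1: the oracle labels an initial seed set (must contain both classes).
    labeled = [(x, oracle(x)) for x in seed]
    pool = [x for x in pool if x not in seed]
    for _ in range(budget):
        boundary = train(labeled)          # step 2: train on everything labeled so far
        x = query(boundary, pool)          # step 3: choose what to label next
        pool.remove(x)
        labeled.append((x, oracle(x)))     # back to step 1: the oracle answers
    return train(labeled), labeled


pool = [i / 100 for i in range(101)]       # unlabeled data: 0.00, 0.01, ..., 1.00
boundary, labeled = active_learning(pool, seed=[0.0, 1.0], budget=10)
print(f"estimated boundary after {len(labeled)} labels: {boundary:.3f}")
```

Note that `train` knows nothing about the loop around it, which is why the classifier can stay unchanged while the querying logic is swapped out.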

The model in question can be any supervised ML algorithm; active learning puts no restrictions on you in that regard. For the sake of example, let us assume that we are dealing with image classification. Implementation-wise, you can think of active learning as something that is wrapped around your classifier. In fact, the classifier itself does not require any changes compared to its plain old passive learning version. Passive learning is basically the kind of machine learning that we are all used to, where labeled examples are sampled at random, rather than in accordance with the feedback received from the classifier. Standard supervised machine learning can be viewed as a special case of passive learning - one where you just happened to randomly sample all of your available training data!

The most nontrivial part of active learning lies in step 3 above: how does the model decide which samples are the most beneficial to label at the current stage? It turns out that there are multiple ways of doing this.

Active learning: how is it done?

Chicken or egg?

This cat would have preferred an egg.


Let us first state the obvious: one has to start somewhere. Simply initializing your classifier to a random state and throwing unlabeled data at it will not get you very far. Better alternatives include:

Transfer learning: pre-train your classifier on another labeled dataset in the hopes that some of that knowledge will carry over to your data.

Random query: label some number of training samples at random, and use them to train the initial classifier (or fine-tune a pre-trained one, if you fancy combining the two strategies).

There are other, more sophisticated things that one can do, of course. You can, for example, try clustering your data first, and sample points from each cluster. This approach is not always possible, but it is particularly helpful when we do not know how many classes there are in the underlying data distribution.
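As a rough illustration of the clustering idea, the sketch below runs a tiny k-means on 1-D data and picks the point nearest each centroid as the initial batch to label. The function names, the choice of k-means, and the 1-D setting are all assumptions made for brevity. The number of clusters `k` is a guess you supply, not the true number of classes.

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D k-means; centroids start at evenly spaced quantiles of the data."""
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep the old one if empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


def seed_from_clusters(points, k):
    """Return one point per non-empty cluster: the member closest to its centroid."""
    centroids, clusters = kmeans_1d(points, k)
    return [min(c, key=lambda p: abs(p - m)) for m, c in zip(centroids, clusters) if c]


unlabeled = [0.1, 0.12, 0.15, 0.5, 0.52, 0.55, 0.9, 0.93, 0.95]
print(seed_from_clusters(unlabeled, k=3))  # these points go to the oracle first
```

Compared with a purely random query, this spends the first few labels on distinct regions of the data, so a small or rare group is less likely to be missed by chance.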

However we choose to start our training, there are several choices to be made along the way. One decision to make is:

Streaming or Pooling?

In a streaming scenario, the model is presented with training samples, one at a time. The model will then either ignore the sample, or query its label. In our classifier example, one could imagine there being a hyperparameter that is a probability threshold for the most likely class. If the sample's probability does not reach this threshold, the…
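A minimal sketch of such a threshold rule might look as follows. It assumes that a sample whose top class probability falls below the threshold gets its label queried, which is one common reading and not something the excerpt above spells out. Here `predict_proba` is a hypothetical callable returning class probabilities for a single sample, and the default threshold of 0.8 is arbitrary.

```python
def should_query(class_probabilities, threshold=0.8):
    """Query the label when the most likely class falls short of the threshold."""
    return max(class_probabilities) < threshold


def stream(samples, predict_proba, threshold=0.8):
    """Look at samples one at a time and return those whose labels to request."""
    queried = []
    for sample in samples:
        if should_query(predict_proba(sample), threshold):
            queried.append(sample)   # uncertain: ask the oracle
        # otherwise the sample is ignored
    return queried


def toy_predict_proba(x):
    """Hypothetical two-class model: P(class 1) is x clipped to [0, 1]."""
    p = min(max(x, 0.0), 1.0)
    return [1 - p, p]


print(stream([0.05, 0.5, 0.7, 0.95], toy_predict_proba))
```

Raising the threshold makes the model ask for more labels, while lowering it saves labeling budget at the risk of skipping informative samples.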
