Active Learning, part 2: the Practice
Scale • Olga Petrova • 06/08/20 • 18 min read
This blog post is the continuation of Active Learning, part 1: the Theory, with a focus on how to apply that theory to an image classification task with PyTorch.
In part 1 we talked about active learning: a semi-supervised machine learning approach in which the model figures out which of the unlabeled data would be most useful to get labels for. As the model gets access to more (data, label) pairs, its understanding of which training samples are most informative supposedly grows, allowing us to get away with fewer labeled samples without compromising the model's final performance. The hardest part of the process is determining the informativeness of the unlabeled samples. The choices are dictated by the selected query strategy, the most common strategies having been discussed in the previous post.
Before we move on to the code, let us remind ourselves of the steps inside the active learning loop:
1. The oracle (e.g. you) labels some of the data and adds it to the labeled dataset L.
2. The model gets trained on L.
3. Using a query strategy, the model determines which samples from the unlabeled dataset U it would most like to have labeled next.
4. A request to label the data chosen in step 3 gets sent to the oracle, and we go back to step 1.
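The steps above can be sketched as a short, framework-free loop. This is only an illustration of the control flow, not the notebook's implementation: `active_learning_loop`, `oracle`, `train` and `query` are hypothetical names, and the "model" in the toy usage is just a threshold on numbers between 0 and 1.

```python
import random

def active_learning_loop(unlabeled, oracle, train, query,
                         rounds=3, batch_size=2, seed=0):
    """Pool-based active learning loop. All callables are placeholders
    supplied by the caller; samples must be hashable and distinct."""
    rng = random.Random(seed)
    pool = list(unlabeled)   # U: samples that still lack a label
    labeled = {}             # L: sample -> label
    model = None
    for _ in range(rounds):
        if not pool:
            break
        k = min(batch_size, len(pool))
        # Step 3: choose what to label next. There is no model in the
        # first round, so we fall back to a random batch.
        batch = rng.sample(pool, k) if model is None else query(model, pool, k)
        # Steps 4 and 1: the oracle labels the batch, which moves from U to L.
        for x in batch:
            labeled[x] = oracle(x)
            pool.remove(x)
        # Step 2: train on everything labeled so far.
        model = train(labeled)
    return model, labeled

# Toy problem (made up for illustration): the true label is 1 when x >= 0.5.
def oracle(x):
    return int(x >= 0.5)

def train(labeled):
    # The "model" is a threshold halfway between the classes seen so far.
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    lo = max(zeros) if zeros else 0.0
    hi = min(ones) if ones else 1.0
    return (lo + hi) / 2

def query(threshold, pool, k):
    # Ask for the k samples closest to the current decision boundary.
    return sorted(pool, key=lambda x: abs(x - threshold))[:k]

points = [i / 20 for i in range(21)]
threshold, labeled = active_learning_loop(points, oracle, train, query,
                                          rounds=4, batch_size=2)
print(len(labeled), threshold)
```

Only 8 of the 21 points get labeled here; the point of the sketch is that each round's requests depend on the model trained in the previous round.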
There are, of course, many different ways of implementing the steps above. In this blog post, I am going to go through the decisions that I made for my PyTorch implementation, in the hope that it will help you easily adjust my code (see the Jupyter notebook on GitHub) to your situation.
Quickstart for the Docker Crowd
```
docker run -it -p 8888:8888 --shm-size=16g opetrova/active_dogs
root@6baef783fa64:/workspace# jupyter notebook --port 8888 --allow-root
```

Copy the bottom URL, paste it into your browser, and get training! Both the notebook and the datasets are included.
Poodles & Co.
To demonstrate how active learning works in practice, I chose a ten-breed subset of the Stanford Dogs Dataset for my image classification task. Technically, the project employs both transfer and active learning, as I start with a model that has been pre-trained on ImageNet. The latter actually includes the ten dog breeds among its classes. This allows me to illustrate quite a few things with a very small number of training samples, since the training mostly comes down to the network figuring out that each of the ten newly added output nodes corresponds to one of: chihuahua, pekinese, basset, whippet, malinois, collie, great dane, chow, and miniature and standard poodles. Ever tried telling miniature and standard poodles apart from a photo? Not an easy task, and I bet that this is the decision boundary that the margin query strategy will focus on. Let us continue to find out!
Photo by Hannah Lim
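To make the bet about the poodles concrete, here is a minimal sketch of margin-based querying, assuming the usual definition of the margin as the gap between a sample's two most probable classes. The function names and the probability values are made up for illustration and are not taken from the notebook.

```python
def margin(probs):
    """Gap between the two largest class probabilities of one sample."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def margin_query(prob_by_index, k):
    """Return the k sample indices with the smallest margins,
    i.e. the samples the model is most torn about."""
    return sorted(prob_by_index, key=lambda i: margin(prob_by_index[i]))[:k]

# Hypothetical softmax outputs over three classes, keyed by sample index.
predictions = {
    0: [0.90, 0.05, 0.05],   # confident prediction, large margin
    1: [0.48, 0.47, 0.05],   # torn between two classes, tiny margin
    2: [0.60, 0.30, 0.10],
}
print(margin_query(predictions, 2))  # [1, 2]
```

A photo that the model scores almost equally as a miniature and a standard poodle would look like sample 1 and be queried first.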
The Data
Typically, I start my machine learning projects with setting up the data pipeline. In PyTorch, this usually involves writing a custom dataset class that inherits from torch.utils.data.Dataset and is then used together with an instance of torch.utils.data.DataLoader to get the data nicely shuffled and split into mini-batches, ready for training. Normally I would write two different dataset classes for unlabeled and labeled data. In the case of active learning, however, the samples will actually go from one category to the other in the course of training. Thus, here we take a different approach: all of the training data belongs to the same dataset. There are two values associated with each sample: 1) a class label (set to an arbitrary value for samples that have not been labeled yet), and 2) a unique index that the various functions that we shall write later on can use to refer to the sample. The dataset object also has a variable called unlabeled_mask: a numpy array of zeros and ones, where zeros mark labeled samples and ones mark unlabeled samples.
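Before the full class, the bookkeeping described above can be sketched in isolation. This toy version assumes six samples and uses free-standing variables instead of a dataset object; the label values are arbitrary, and `np.flatnonzero` is just one convenient way to turn the mask into index arrays.

```python
import numpy as np

num_samples = 6
labels = [0] * num_samples              # placeholder label until the oracle provides one
unlabeled_mask = np.ones(num_samples)   # 1 = unlabeled, 0 = labeled

def update_label(idx, new_label):
    """Store the oracle's label and mark the sample as labeled."""
    labels[idx] = new_label
    unlabeled_mask[idx] = 0

update_label(2, 7)
update_label(4, 3)

unlabeled_idx = np.flatnonzero(unlabeled_mask)       # indices still in U
labeled_idx = np.flatnonzero(unlabeled_mask == 0)    # indices already in L
print(unlabeled_idx.tolist())  # [0, 1, 3, 5]
print(labeled_idx.tolist())    # [2, 4]
```

Index arrays like these are what make a single dataset workable: a sampler or a query function can be restricted to either subset without copying any data.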
```python
import os

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

# Note: is_image(filename) is a helper that is not shown in this excerpt;
# it is assumed to be defined elsewhere in the notebook.

class IndexedDataset(Dataset):

    def __init__(self, dir_path, transform=None, test=False):
        '''
        Args:
        - dir_path (string): path to the directory containing images
        - transform (torchvision.transforms.) (default=None)
        - test (boolean): True for labeled images, False otherwise (default=False)
        '''
        self.dir_path = dir_path
        self.transform = transform

        # Collect every image file below dir_path, including subdirectories
        image_filenames = []
        for (dirpath, dirnames, filenames) in os.walk(dir_path):
            image_filenames += [os.path.join(dirpath, file) for file in filenames if is_image(file)]
        self.image_filenames = image_filenames

        # We assume that in the beginning, the entire dataset is unlabeled, unless it is flagged as 'test':
        if test:
            # The image's label is given by the first digit of its subdirectory's name
            # E.g. the label for the image file ./dogs/train/6_great_dane/n02109047_22481.webp is 6
            self.labels = [int(f[len(self.dir_path)+1]) for f in self.image_filenames]
            self.unlabeled_mask = np.zeros(len(self.image_filenames))
        else:
            self.labels = [0]*len(self.image_filenames)
            self.unlabeled_mask = np.ones(len(self.image_filenames))

    def __len__(self):
        return len(self.image_filenames)

    def __getitem__(self, idx):
        img_name = self.image_filenames[idx]
        image = Image.open(img_name)
        if self.transform:
            image = self.transform(image)
        # Return the index as well, so that other functions can refer to this sample
        return image, self.labels[idx], idx

    # Display the image [idx] and its filename
    def display(self, idx):
        img_name = self.image_filenames[idx]
        print(img_name)
        img = mpimg.imread(img_name)
        imgplot = plt.imshow(img)
        plt.show()

    # Set the label of image [idx] to 'new_label'
    def update_label(self, idx, new_label):
        self.labels[idx] = new_label
        self.unlabeled_mask[idx] = 0

    # Set the label of image [idx] to that read from its filename
    def label_from_filename(self, idx):
        self.labels[idx] = int(self.image_filenames[idx][len(self.dir_path)+1])
        self.unlabeled_mask[idx] = 0
```

Now let's instantiate ourselves a couple of datasets:
```python
from torch.utils.data import DataLoader
from torchvision import transforms

train_dir = './dogs/train'
test_dir = './dogs/test'

# Training images start out unlabeled; test images are labeled from their directory names
train_set = IndexedDataset(train_dir, transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]))
test_set = IndexedDataset(test_dir, transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]), test=True)
test_loader = DataLoader(test_set, batch_size=1024, shuffle=False, num_workers=10)
```

Notice that so far we have only created the test DataLoader, but not one for training. The reason…