Smart data annotation for your computer vision projects: CVAT on Scaleway
Build • Olga Petrova • 04/07/22 • 9 min read
In this article we are going to look at how to set up a data annotation platform for image and video files stored in Scaleway object storage, using the open source CVAT tool.
Introduction
We have all heard the phrase "data is the new oil". Data has certainly been fueling many of the recent technological advancements, yet the comparison goes further than that. Much like crude oil, data needs to be processed before it can be put to use. The processing stages typically include cleanup, various transformations, and, depending on the data and the use case, manual annotation. The demand for the latter is high in the fields of computer vision and natural language processing (NLP): in other words, for the data formats that are most natural for humans, as opposed to the structured data that is best viewed in table form. Manual data annotation is a time-consuming and expensive process. To make matters worse, deep artificial neural networks, the current state of the art for both computer vision and NLP, are the algorithms that require the largest amounts of data to train. Efficient annotation tools with time-saving extrapolation features and other types of automation go a long way towards what is arguably the most crucial stage of the machine learning project's life cycle: building the training dataset.
CVAT (short for the Computer Vision Annotation Tool) is an open-source image and video annotation platform that came out of Intel. It supports the most common computer vision tasks: image-level classification, as well as object detection and image segmentation, where areas of interest on an image are selected via bounding boxes and polygonal (or pixel-wise) image masks, respectively.
Image source: Ronny Restrepo
In addition to providing a Chrome-based annotation interface and basic user management features, CVAT cuts down on the number of manual annotations needed by automating a part of the process. In this blog post, we are going to focus on how to install CVAT on the Scaleway public cloud.
Data annotation on the cloud: why and how?
The most straightforward way of running CVAT on a cloud would be to simply start an instance (a virtual machine hosted by the cloud provider), and follow the Quick Installation Guide available as part of the CVAT documentation. Once installed, you can access CVAT by connecting to the instance via SSH tunneling, and going to localhost:8080 in the Google Chrome browser. You can then upload images and videos from your local server, and proceed with the annotation just as you would in the case of a local installation.
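The SSH tunneling mentioned above can be sketched as follows. This is a minimal illustration rather than part of the CVAT guide: the IP address is a placeholder from the documentation range, the helper name `cvat_tunnel_cmd` is made up for this example, and the `root` user and port 8080 are taken from the text. The helper only prints the command, so you can review it before running it.

```shell
# Sketch: forward a local port to the CVAT server running on the instance.
# INSTANCE_IP is a placeholder (documentation-range address), not a real host.
INSTANCE_IP="${INSTANCE_IP:-203.0.113.10}"
LOCAL_PORT="${LOCAL_PORT:-8080}"

# Hypothetical helper: prints the tunnel command for a given IP ($1) and local port ($2).
# -N: open the tunnel without running a remote command
# -L: forward the local port to port 8080 on the instance
cvat_tunnel_cmd() {
  printf 'ssh -N -L %s:localhost:8080 root@%s\n' "$2" "$1"
}

# Print the command for the configured instance.
cvat_tunnel_cmd "$INSTANCE_IP" "$LOCAL_PORT"
```

With the printed command running in a terminal, CVAT should be reachable at localhost:8080 (or whichever local port you chose) for as long as the tunnel stays open.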
However, going about it in this notably uncloud-like way brings you none of the advantages of cloud computing. First, think of your data storage. Computer vision projects require a lot of training data, so scalability and cost-efficiency are must-haves. Object storage has become the industry's method of choice for storing unstructured data. With virtually no limit on the size and number of files to be stored, generous free tiers (e.g. Scaleway offers 75 GB of free object storage every month), and high redundancy to ensure the safety and availability of your data, it is hard to think of a better place to store "the new oil".
Depending on the size of your labeling workforce, you might also want to enable autoscaling of your annotation tool. For the time being, let us assume that the data annotation operation that you are running is manageable enough that a single instance running CVAT will suffice. Still, you do not necessarily want to give every annotator SSH access to your instance. How to avoid that is something we are also going to discuss in the next section.
CVAT on Scaleway
Gathering the resources
As we have established in the previous section, there are two cloud resources that we need to take our data annotation to the next level: an object storage bucket and an instance. Here's a step-by-step guide to procuring them:
1. If you have not already, you will need to create an account and log into console.scaleway.com.
2. In order to SSH to your Scaleway instances, you'll need to create an SSH key. To mount the object storage for CVAT use, you are also going to need to generate an API key (when you do, make sure to take note of the Access and Secret Keys because you will be needing them shortly).
3. Now it is time to create your object storage bucket! This can be done from the Storage / Object Storage tab in the Scaleway console.
Look at that - object storage is free as long as we do not go over 75 GB!
Once your bucket is created, you can add the files that you would like to have labeled to it - e.g. via the drag and drop interface available through the Scaleway console. Let us fill our bucketofdogs with some photos of, well, dogs:
One of the pieces of information that you will be needing later on is the Bucket ID. The bucket's ID can be read off the Bucket Settings tab above, but is in fact none other than the bucket's name (i.e. bucketofdogs in my example).
4. Now that our precious dog photos are safely stored in a fancy data center, we are going to need an instance. Scaleway offers a wide range of instances suitable for different purposes. If you want to make use of CVAT's auto-annotation features (a topic for another blog post), I would advise you to get one of the high-end GPU instances. For the basic manual annotation use case, let us start with the dev range:
Here's an instance that will host your CVAT server for 1 euro cent per hour (or €7.30 per month)
Once your instance is created, you will arrive at its Overview page, where, among other things, you will find your Instance ID. Make note of it. On the same page, you will see the following SSH command: ssh root@[Public IP of your instance]. At this point, you should use it to SSH to your instance and proceed with the installation of CVAT.
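For larger datasets, the drag-and-drop upload from step 3 can be replaced by a scripted one run from your local machine, since Scaleway object storage exposes an S3-compatible API. The sketch below is an illustration under several assumptions that are not in the original walkthrough: the AWS CLI is installed and configured with the Access and Secret Keys from step 2, the bucket lives in the fr-par region, and `./dogs` is a hypothetical local folder. The helper names are made up for this example, and the script only prints the command rather than executing it.

```shell
# Sketch: build a bulk-upload command for a Scaleway bucket (S3-compatible API).
# Assumed values: adjust the region and source folder to your own setup.
BUCKET="${BUCKET:-bucketofdogs}"   # bucket name from the example above
REGION="${REGION:-fr-par}"         # assumption: bucket created in the Paris region
SRC_DIR="${SRC_DIR:-./dogs}"       # hypothetical local folder with the images

# Hypothetical helper: S3 endpoint URL for a given Scaleway region ($1).
scw_s3_endpoint() {
  printf 'https://s3.%s.scw.cloud\n' "$1"
}

# Hypothetical helper: prints the AWS CLI command that would copy a folder ($1)
# recursively into a bucket ($2) in the given region ($3).
upload_cmd() {
  printf 'aws s3 cp %s s3://%s/ --recursive --endpoint-url %s\n' \
    "$1" "$2" "$(scw_s3_endpoint "$3")"
}

# Print the command for review; paste it into a terminal to run the upload.
upload_cmd "$SRC_DIR" "$BUCKET" "$REGION"
```

Printing the command first keeps the sketch safe to run anywhere; the actual upload only happens once you execute the printed line with valid credentials.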
Installing CVAT
Let us start with the prerequisites:
sudo apt-get update
sudo apt-get --no-install-recommends install -y \
  apt-transport-https \
  ca-certificates \
  curl \
  gnupg-agent \
  software-properties-common \
  s3fs
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) \
  stable"
sudo apt-get update
sudo apt-get --no-install-recommends install -y…