
Improving annotation efficiency for deep learning algorithms for computer vision

Bosch Research Blog | Post by Amit Arvind Kale, 2021-01-14


Image classification, object detection, semantic segmentation and instance segmentation are all tasks of interest in computer vision. The recent success of deep learning approaches for these tasks has led to a need for large amounts of annotated data. Labeling complexity and cost increase in the order in which the tasks are listed, with the highest cost incurred for instance-level segmentation, where every instance of an object such as a vehicle or a human has to be annotated at pixel level.

Challenge: Huge effort for labeled data

Example of annotation for semantic segmentation

This figure shows an example of annotation for semantic segmentation. It is easy to see that the time and effort required by a labeler for such a task can be quite high, leading to very high labeling costs. Such a task can take anywhere from one to two hours for one image. Since the semantic segmentation algorithms are expected to perform well even for rare events, the number of images that need to be labeled can run into the millions with proportionally high cost!

Several approaches exist to improve this situation, e.g. choosing the right subset of data to send for labeling instead of sending all the frames, pre-training a deep learning network using self-supervision approaches, or using synthetic data where the labels are effectively free. In this post, however, we focus on the labeling task itself.

1 – 2 hours is roughly the time required by humans to label one image

Solution: pre-labeling by using deep learning

One way to reduce annotation effort is to “pre-label” an image: instead of having the labeler annotate every pixel from the ground up, we convert the task into one of correcting the mistakes of the pre-labeling. The next question is how to pre-label an image. In early work from the pre-deep-learning era, while building an academic dataset for video surveillance, we proposed a first version of pre-labels based on background subtraction and object detection:

More about offline generation of high quality background subtraction data.

More about the interactive generation of “ground-truth” in background subtraction from partially labeled examples.

More recently, the answer to the pre-labeling question lies in the use of deep learning itself (depicted in the animation below):

If we are able to label a certain quantity of data manually, a network or an ensemble of networks can be trained for this task. An ensemble essentially means using more than one network for the task at hand, in such a way that each network is good at one particular aspect. For example, one network can be highly accurate for roads and road markings, another for a different object category. The ensemble combines the outputs of these networks, taking into account how well each one performs for the object category in question. Admittedly, inference on an image takes longer, given that we typically use 4-5 networks in the ensemble. However, unlike inference with the perception stack on embedded hardware, pre-labeling is not limited by time or memory: we can use multiple GPUs in a PC environment to generate pre-labels.
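The combination step described above can be illustrated with a minimal sketch. It assumes that each network outputs per-pixel class probabilities and that a per-class reliability weight (for example, a validation IoU) is known for each network. The function name `combine_prelabels`, the weights and the toy probabilities are all hypothetical and are not taken from the tool described in this post.

```python
from typing import List, Sequence

# probs[pixel][class]: per-pixel class probabilities from one network
ProbMap = List[List[float]]


def combine_prelabels(prob_maps: Sequence[ProbMap],
                      class_weights: Sequence[Sequence[float]]) -> List[int]:
    """Fuse the outputs of several networks into one pre-label per pixel.

    class_weights[n][c] is an assumed reliability of network n for class c.
    Each class score is the weighted sum of the networks' probabilities,
    and the class with the highest score becomes the pre-label.
    """
    num_pixels = len(prob_maps[0])
    num_classes = len(prob_maps[0][0])
    labels = []
    for p in range(num_pixels):
        scores = [0.0] * num_classes
        for probs, weights in zip(prob_maps, class_weights):
            for c in range(num_classes):
                scores[c] += weights[c] * probs[p][c]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Toy example with two classes (0 = road, 1 = vehicle) and two pixels.
net_a = [[0.8, 0.2], [0.6, 0.4]]   # assumed to be strong on roads
net_b = [[0.4, 0.6], [0.3, 0.7]]   # assumed to be strong on vehicles
weights = [[0.9, 0.5],             # reliability of net_a per class
           [0.5, 0.9]]             # reliability of net_b per class

prelabels = combine_prelabels([net_a, net_b], weights)
print(prelabels)  # [0, 1]
```

Weighting by class lets a network dominate only where it is trusted, which is one simple way to reflect "keeping in mind their performance for that object category"; other fusion rules are equally possible.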

Process of “pre-labeling” using deep learning

The figure below shows the qualitative improvements in our work when moving from a single network to two networks and then to four networks.

Qualitative improvements by using a single network, two networks and four networks

For every 10% improvement in pre-label accuracy, the correction effort is reduced by 25%.

User interface as gateway for inaccuracies?

One observation we made during our work is that, while such performance gains are indeed realizable, pre-labels bring user aspects into play that go beyond the conventional requirement for generic DNNs, namely the highest possible performance in terms of the mean Intersection over Union (mIoU) metric. For instance, consider a network that produces a specific mIoU for an image with the errors spread evenly over the image area, and another network that produces the same mIoU but with errors that are not uniformly distributed: some regions are segmented highly accurately, while others are inaccurate. A labeler prefers the latter network to the former, even if its mIoU is lower, since it is easier to keep the accurate segments and redo the wrong ones from the ground up than to spend time correcting minor errors in every object.
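To make this concrete, here is a small sketch in which two predictions reach the same mIoU while requiring a different number of corrections. For brevity the "image" is a single scanline of 16 pixels, and the number of contiguous error runs serves as a crude, assumed proxy for labeler effort. The helper names `mean_iou` and `error_runs` are illustrative only.

```python
from typing import List


def mean_iou(pred: List[int], truth: List[int], num_classes: int) -> float:
    """Mean Intersection over Union across the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


def error_runs(pred: List[int], truth: List[int]) -> int:
    """Count contiguous runs of wrong pixels (assumed proxy for edits)."""
    runs = 0
    in_run = False
    for p, t in zip(pred, truth):
        wrong = p != t
        if wrong and not in_run:
            runs += 1
        in_run = wrong
    return runs


truth = [0] * 8 + [1] * 8

# Network 1: four isolated errors scattered over the scanline.
scattered = list(truth)
scattered[1] = 1
scattered[5] = 1
scattered[10] = 0
scattered[14] = 0

# Network 2: the same number of errors per class, but in one region.
clustered = list(truth)
clustered[6] = 1
clustered[7] = 1
clustered[8] = 0
clustered[9] = 0

print(mean_iou(scattered, truth, 2), error_runs(scattered, truth))  # 0.6 4
print(mean_iou(clustered, truth, 2), error_runs(clustered, truth))  # 0.6 1
```

Both predictions score an mIoU of 0.6, yet the clustered one needs a single correction instead of four, which mirrors the labeler preference described above.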

Additionally, when shifting to pre-labels, it is important that the front-end correction tool takes this into account. For instance, typical labeling tools rely on polygon-based marking of objects. Pre-label boundaries are much denser than those drawn by hand as polygons, so representing them as polygons makes the number of polygon points too large. Where errors are present, it is feasible to use active contour approaches, in which a couple of anchor points are provided to snap the boundary into the correct place.
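As an illustration of the point-count problem, the sketch below thins a dense boundary with the Ramer-Douglas-Peucker algorithm. This is a generic simplification technique chosen for the example, not the active contour approach mentioned above and not necessarily what our tool does. The function names, the tolerance value and the treatment of the boundary as an open polyline are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points: List[Point], tolerance: float) -> List[Point]:
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_segment(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        # Every intermediate point is close enough to the chord: drop them.
        return [first, last]
    # Keep the farthest point and simplify both halves recursively.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


# A dense, pixel-by-pixel boundary along two edges of an object.
dense = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (4, 1), (4, 2), (4, 3), (4, 4)]
sparse = simplify(dense, tolerance=0.5)
print(len(dense), len(sparse))  # 9 3
print(sparse)                   # [(0, 0), (4, 0), (4, 4)]
```

A larger tolerance yields fewer points for the labeler to drag but a coarser boundary, so the value would have to be tuned against the accuracy the labels require.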

Achievement: efficiency gains of 70% in semantic segmentation

Pre-labeling promises advantages in terms of annotation effort. However, several user interface elements must be integrated into the labeling tool to achieve these gains. We at Bosch Research built a tool incorporating some of the observations above and found that it achieves efficiency gains of almost 70% compared to the standard polygon-based approach to annotation. Further gains can be realized with access to time-synchronized multi-modal sensors such as lidar and radar.

Thank you for the contributions from our colleagues at Bosch Research and Robert Bosch India in this work.

What are your thoughts on this topic?

Please feel free to share them via LinkedIn or to contact me directly.


Author: Amit Arvind Kale

Amit is a Principal Senior Expert in computer vision at the Research and Technology Center in India. His research is motivated by the desire to organize and manage large-scale video and multi-sensor data (petabytes), with the goal of smartly curating the data that algorithm development needs. This includes automated approaches to select the most representative subset of a large set of images, and to search and retrieve scenes of interest from the stored images. His research has multiple goals, such as reducing the cost of ground truth generation by removing redundancies, and supporting function development and testing by finding difficult cases where algorithms do not work well, which can then be used to collect more such cases or to generate them synthetically. To achieve this, he and his team explore the structure and representation power of deep convolutional neural networks. They develop human-computer interfaces that go hand in hand with the deep learning approaches to ensure ease of use for the end users.
