Data Annotation

Labeled data is required for all supervised machine learning projects.

Labels are added to raw data, such as images, text, audio, and video, in order to train algorithms to map inputs to outputs. If training is successful, the model will be able to accurately map inputs to outputs when exposed to real, unannotated data.

[Figure: The ML Cycle]

While synthetic data or data obtained from open or paid datasets may come pre-labeled, data collected from the real world often needs to be annotated. (See our guide to acquiring training data here.)

While data annotation is an essential component of machine learning projects, it's also time-consuming, accounting for as much as 25% of an ML project's average duration. It's therefore imperative to make the right choices when it comes to labeling your data.

There are three main options for labeling data:

  1. In-house labeling
  2. Crowdsourcing
  3. Working with managed labeling providers

It’s also possible to automate some labeling tasks – read our guide to automated data labeling here.

This article will explore the three main data labeling options and compare their respective advantages and disadvantages.

1: In-House Labeling

In-house data labeling involves hiring teams of in-house annotators to work alongside data engineers and data scientists.

In some sectors and industries, in-house labeling is beneficial when exceptional domain knowledge is needed, e.g., cell microscopy, where close collaboration with scientists and imaging experts is required. Another major advantage is retaining tight control over the work environment and protocols, which is a boon for sensitive business-critical ML projects.

Larger or data-centric organizations may already employ data scientists and engineers; however, using highly skilled workers for time-intensive labeling work is typically expensive.

On top of this, data annotation has become its own sub-field within AI and machine learning and often requires niche skills, especially for more complex projects that require a high level of domain knowledge. It’s not always possible to simply allocate existing IT teams to data labeling without significant retraining.

As such, developing in-house labeling capability normally requires investment, which is typically only a viable option for ongoing, long-term projects that unfold over many months or years.

Advantages of In-House Labeling

  • In-house labeling provides complete internal control over all labeling processes. Enterprises working on tightly controlled, business-critical projects often use in-house teams for this reason.
  • Since the business retains all control over the labeling process, compliance is likely easier to handle.
  • For complex machine learning projects that unfold over long periods, in-house labeling provides long-term assurance and reliability.

Disadvantages of In-House Labeling

  • In-house labeling is the most expensive option, especially when employees need to be trained, retrained, and managed.
  • Pre-existing data scientists and engineers may not be able to jump straight into labeling, and even if they can, they are typically much more expensive than outsourced resources.
  • Organizations might need to hire a lot of labelers, which involves negotiating contracts and pensions and complying with employment laws, rules, and regulations.

2: Crowdsourced Labeling

Some forms of data labeling are straightforward and require little to no domain knowledge or data expertise. These tasks can be crowdsourced to hundreds or thousands of individual workers using a platform like Amazon Mechanical Turk (MTurk).

Labelers learn what to do from a few well-labeled examples (known as gold sets) and then label large volumes of data themselves. Crowdsourced data labeling also operates through freelance platforms like Fiverr and Upwork.
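As a simple illustration of how a gold set supports quality control, here is a minimal Python sketch assuming a hypothetical classification task: it scores each worker's answers against the known gold labels and keeps only workers who clear an accuracy bar. All item IDs, labels, and the threshold are illustrative.

```python
# Minimal sketch: screening crowd workers against a gold set.
# All item IDs, labels, and the threshold below are hypothetical.

GOLD_LABELS = {"img_001": "cat", "img_002": "dog", "img_003": "car"}
ACCURACY_THRESHOLD = 0.8  # workers below this bar are filtered out

def score_worker(worker_answers: dict[str, str]) -> float:
    """Fraction of gold-set items the worker labeled correctly."""
    scored = [item for item in worker_answers if item in GOLD_LABELS]
    if not scored:
        return 0.0
    correct = sum(worker_answers[item] == GOLD_LABELS[item] for item in scored)
    return correct / len(scored)

def passing_workers(all_answers: dict[str, dict[str, str]]) -> list[str]:
    """Return the IDs of workers whose gold-set accuracy meets the threshold."""
    return [worker_id for worker_id, answers in all_answers.items()
            if score_worker(answers) >= ACCURACY_THRESHOLD]

answers = {
    "worker_a": {"img_001": "cat", "img_002": "dog", "img_003": "car"},
    "worker_b": {"img_001": "dog", "img_002": "dog", "img_003": "bus"},
}
print(passing_workers(answers))  # ['worker_a']; worker_b scores 1/3 and is rejected
```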

Crowdsourcing is appropriate for data annotation tasks where scale is the top priority and some quality can be sacrificed in exchange for an extremely large dataset. However, while crowdsourcing generally results in good-quality data for simple labels (e.g., bounding boxes), problems may arise for challenging annotation tasks (e.g., panoptic segmentation, tracking, or NLP).

[Figure: Crowdsourcing suits simple bounding box labeling tasks]

Moreover, data security, regulation, and compliance issues have plagued crowdsourcing projects. MTurk was recently integrated with Amazon SageMaker to provide a more modern, closed environment for data labeling tasks, but sending sensitive and potentially valuable data to thousands of contractors still requires something of a leap of faith.
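As a rough sketch of what that integration looks like in practice, the boto3 snippet below creates a SageMaker Ground Truth bounding-box job routed to the MTurk public workforce. Treat it as an outline under stated assumptions: the job name, bucket paths, role ARN, and price are placeholders, and the AWS-managed Lambda ARNs shown are the us-east-1 ones and vary by region.

```python
# Hedged sketch: sending a bounding-box labeling job to the MTurk public
# workforce via SageMaker Ground Truth. All names, buckets, the role ARN,
# and the price are placeholders to replace for your own account.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_labeling_job(
    LabelingJobName="street-scenes-bboxes",  # placeholder
    LabelAttributeName="bounding-box",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/output/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://my-bucket/label-categories.json",
    HumanTaskConfig={
        # This AWS-owned workteam ARN routes tasks to the public MTurk crowd.
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:394669845002:workteam/public-crowd/default",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        # AWS-managed Lambdas for the built-in bounding-box task (us-east-1).
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox"
        },
        "TaskTitle": "Draw boxes around cars and pedestrians",
        "TaskDescription": "Draw a tight box around every car and pedestrian.",
        "NumberOfHumanWorkersPerDataObject": 3,  # redundancy for consolidation
        "TaskTimeLimitInSeconds": 300,
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 6}
        },
    },
)
```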

On the upside, crowdsourcing can rapidly scale data annotation tasks at a fairly low cost, especially compared to in-house labeling.

Crowdsourced data can be combined with expertly labeled data

Large crowdsourced datasets can be combined with a small percentage of comprehensively annotated data. For example, a large volume of street scenes can be crowdsource-labeled with simple bounding boxes for significant features, like people or cars.

Meanwhile, a managed or in-house team creates an ultra-high quality dataset that might involve image segmentation, boundary labeling, polygon annotation, tracking, and even LiDAR. Models are then trained on a combination of both datasets.
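As a simplified sketch of how such a blend might be fed to a model, the PyTorch snippet below oversamples a small expert set so every batch mixes both sources. Random tensors stand in for real images, and the binary labels merely mark the source; real bounding boxes and segmentation masks would need task-specific datasets and collation.

```python
# Sketch: blending a large crowdsourced dataset with a small expert one.
# Random tensors are stand-ins; labels 0/1 only mark the data source.
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

crowd_ds = TensorDataset(torch.randn(1_000, 3, 32, 32),
                         torch.zeros(1_000, dtype=torch.long))   # cheap, coarse
expert_ds = TensorDataset(torch.randn(50, 3, 32, 32),
                          torch.ones(50, dtype=torch.long))      # costly, precise

combined = ConcatDataset([crowd_ds, expert_ds])  # crowd indices first

# Give each source half the sampling mass so the tiny expert set is
# oversampled rather than swamped by the crowd data.
weights = torch.cat([
    torch.full((len(crowd_ds),), 0.5 / len(crowd_ds)),
    torch.full((len(expert_ds),), 0.5 / len(expert_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined))

loader = DataLoader(combined, batch_size=32, sampler=sampler)
images, labels = next(iter(loader))
print(images.shape, labels.float().mean().item())  # mean near 0.5: both sources present
```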

Advantages of Crowdsourced Data Labeling

  • Rapidly scale up the project’s labeling capacity without hiring internal employees. Crowdsourcing is also cost-effective for many large-scale labeling projects.
  • Access a large, cross-geography workforce that likely has some skill and experience in data labeling.
  • Ability to load labeled data straight into AI and ML platforms, like SageMaker.

Disadvantages of Crowdsourced Data Labeling

  • Crowdsourcing generally produces decent results when the target is obvious, but complex tasks introduced to the crowdsourcing market might not produce accurate results.
  • While crowdsourcing is generally inexpensive, fewer quality-management controls mean labeled data may need to be re-checked before use, which increases cost (see the consolidation sketch after this list).
  • Exposing valuable or sensitive data to crowdsourcing workforces is potentially risky.
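As a minimal illustration of that re-checking step, the sketch below consolidates redundant crowd labels by majority vote and routes low-agreement items to expert review; the item IDs, votes, and threshold are all hypothetical.

```python
# Sketch: majority-vote consolidation of redundant crowd labels.
# Items whose workers disagree too much are flagged for expert review.
from collections import Counter

raw_labels = {
    "img_001": ["cat", "cat", "cat"],   # three workers per item (hypothetical)
    "img_002": ["dog", "dog", "cat"],
    "img_003": ["bus", "car", "truck"],
}

AGREEMENT_THRESHOLD = 2 / 3  # minimum fraction of workers that must agree

consolidated, needs_review = {}, []
for item, votes in raw_labels.items():
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= AGREEMENT_THRESHOLD:
        consolidated[item] = label
    else:
        needs_review.append(item)  # escalate to a trained reviewer

print(consolidated)  # {'img_001': 'cat', 'img_002': 'dog'}
print(needs_review)  # ['img_003']
```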

3: Managed Labeling Services

Machine learning is a fast-developing and dynamic industry that demands reliable, accurate results. At the same time, sacrificing skill for speed is generally not an option, and it’s often more time-consuming to backtrack during data preparation than it is to get things right the first time.

Managed data labeling services are tailored to the modern AI and ML market. In contrast to crowdsourcing, managed labelers are trained in data labeling and have access to the necessary skills and data labeling software required to create exceptional datasets.

Moreover, working with a labeling partner enables access to domain specialists without hiring them in-house. The labeling partner will work with the client to establish their needs and acquire the niche skills and knowledge needed to produce accurate datasets.

Managed data services produce the highest-quality data

VentureBeat reported on a study comparing managed data labeling teams with crowdsourced teams, which found that managed teams produced data of 25% higher quality.

Managed data labeling and data service teams can handle annotation tasks on a project or subscription basis, making pricing as flexible as crowdsourcing, but with plenty of added value in terms of skill, insight, and customization.

Managed annotation services or ‘human in the loop’ workforces ensure high labeling standards, control over labeling requirements, accountability, and project security – all for a fraction of the price of in-house teams.

Advantages of Managed Service Data Labeling

  • Managed Service data labeling allows organizations to access specialist teams of highly-trained data labelers on demand.
  • Data labeling is often an iterative task that involves revisions and improvements as the model develops. Managed teams retain project expertise and knowledge where crowdsourced labelers do not.
  • By virtue of being a company accountable to clients for service quality, managed human-in-the-loop (HITL) workforces typically deliver a higher quality of service than individuals on a crowdsourcing platform.
  • With a managed service, data security can be assured through recognised and enforceable standards.

Disadvantages of Managed Service Data Labeling

  • Managed data labeling is typically more expensive than crowdsourcing.
  • Managed labeling services are not typically as rapidly scalable as crowdsourced labeling, where you might gain access to millions of people on demand.
  • For projects requiring a broad variety of labelers, more than one managed service provider may be required.

Summary: How to Label Your Training Datasets

Data labeling is a non-trivial task. While the process of labeling data for supervised machine learning can seem straightforward, it's easy to underestimate the difficulty involved in creating a high-quality, accurately labeled dataset.

Crowdsourcing labels can be a rapid and cost-effective way to create training data, but where projects demand sophisticated labeling, agile iteration of requirements, or secure environments, it may not be the best option. In-house teams are an excellent way to build high-quality datasets in a controlled environment, but are typically an expensive option. Managed services can offer a middle ground, balancing scalability, cost, quality, and security.

Contact us to learn more about data labeling solutions across medical, industrial imaging, agriculture, autonomous vehicles, linguistics, geospatial analysis, facial mapping, and consumer analytics.
