Importance of Data Quality for Machine Learning

Despite fears of AI taking over the world, AI is, at its core, still reactive and doesn’t truly learn on its own. It can only make predictions based on the data it’s given. That is why data quality should be at the core of every machine learning and AI project.

Without high-quality training data, you can’t have a high-quality algorithm. In this article, we will discuss the impact bad data has on businesses, define what bad data is, and look at some real-world examples of how flawed training data can completely destroy AI and ML projects. So let’s start.

The Big-Picture Impact of Data Quality

Quantifying the financial impact of bad data is not easy, but there have been several attempts. A paper published in MIT Sloan Management Review estimated that most companies lose between 15% and 25% of their revenue due to bad data.

Similarly, Gartner found that organizations believe poor-quality data costs them $15 million in losses annually. These numbers aren’t too surprising if you consider that only 3% of businesses compile data that meets basic quality standards.

The bottom line is that low-quality data is an issue, regardless of how you measure it. The problem is felt especially keenly in ML and AI projects, as they depend heavily on good training data. So let’s discuss the impact of low-quality data on machine learning and AI projects.


What Can Be Considered Bad Data for Machine Learning?

In the most general terms, bad data for machine learning and AI models is any dataset that, when used as training data, causes an algorithm to perform its intended function incorrectly. Granted, this definition is very broad. To be more specific, bad data has one or more of these characteristics:

  • Low quantity – At their core, ML and AI algorithms need to recognize the patterns on which they base their predictions. To do that, they need large amounts of training data to analyze. Even the simplest problems typically require thousands of examples before an algorithm can find the correct patterns. So, one characteristic of bad data is insufficient quantity.
  • Non-representative data – Besides quantity, training data needs to be representative of the problem the algorithm is meant to solve. A model predicts new cases based on the cases it saw during training. If the training data is non-representative, the model can’t make accurate predictions.
  • Inaccurate data – Again, ML and AI algorithms learn from the input that you give them. If the training data contains a lot of errors, the algorithms will learn from those errors and give you inaccurate predictions.
  • Irrelevant features – Even if a dataset is large enough, representative, and accurate, the model can still make inaccurate predictions or consume unnecessary resources if the training data contains too many irrelevant features. In this context, features are measurable data elements that algorithms use for analysis. Feature engineering is an ML technique that extracts relevant features from existing datasets and is often crucial in ML and AI projects.
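
The four characteristics above can be turned into simple automated checks. The following is a minimal sketch in Python, not a definitive implementation: the function name audit_dataset, the thresholds min_rows and min_label_share, and the sample data are all assumptions made up for illustration. It uses label share as a rough proxy for representativeness, missing values as a proxy for inaccuracy, and constant columns as a proxy for irrelevant features.

```python
from collections import Counter


def audit_dataset(rows, label_key, min_rows=1000, min_label_share=0.10):
    """Return plain-text warnings for a dataset given as a list of dicts.

    The thresholds are hypothetical defaults, not established standards.
    """
    if not rows:
        return ["dataset is empty"]
    issues = []

    # 1. Low quantity: too few examples to find reliable patterns.
    if len(rows) < min_rows:
        issues.append(f"low quantity: {len(rows)} rows (fewer than {min_rows})")

    # 2. Non-representative data: a label that almost never occurs.
    label_counts = Counter(row[label_key] for row in rows)
    for label, count in sorted(label_counts.items(), key=lambda kv: str(kv[0])):
        share = count / len(rows)
        if share < min_label_share:
            issues.append(f"under-represented label {label!r}: {share:.0%} of rows")

    feature_keys = [key for key in rows[0] if key != label_key]
    for key in feature_keys:
        values = [row.get(key) for row in rows]
        # 3. Inaccurate data: missing values are the easiest errors to spot.
        missing = values.count(None)
        if missing:
            issues.append(f"feature {key!r}: {missing} missing value(s)")
        # 4. Irrelevant features: a constant column carries no signal.
        if len(set(values)) == 1:
            issues.append(f"feature {key!r} is constant")
    return issues


# Synthetic example data, invented purely for illustration.
sample = [{"sqft": 900 + 10 * i, "region": "north", "sold": 1} for i in range(19)]
sample.append({"sqft": None, "region": "north", "sold": 0})

issues = audit_dataset(sample, label_key="sold")
for issue in issues:
    print(issue)
```

On this sample, the audit flags all four problems: too few rows, a rarely seen label, a missing value, and a constant feature. Real projects would need richer checks (for example, comparing the training distribution against production data), but even crude checks like these catch many issues before training starts.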

So, if these are the characteristics of bad data, what’s good data quality for machine learning and AI? The opposite: datasets that are large enough, representative, accurate, and contain relevant features.

But getting to good training data is not an easy task. It heavily relies on professional data acquisition, cleaning, labeling and annotation, and other pre-processing tasks. In fact, data scientists generally spend about 80% of their time preparing data, which leaves only 20% for the actual analysis.

And yet, this time is well spent. Skipping it leaves ML and AI projects with low-quality data, which can and does lead to some very bad outcomes. So let’s stop talking in generalities and discuss some real examples where bad data caused huge issues.

Real-World Examples of Projects Based on Bad Data

Below, you’ll find three examples of how bad data killed AI projects at Amazon, Microsoft, and Zillow. Keep in mind that these are some of the wealthiest and most innovative companies in the world, with astronomical budgets and world-class experts working for them. You can just imagine how many ML and AI projects never see the light of day at companies that have fewer resources.

Zestimate

In 2018, Zillow, a US-based real estate company, launched Zillow Offers, an iBuying program that relied on Zestimate, the company’s AI model for estimating home values. The purpose was to use Zestimate to purchase fixer-upper homes that could be renovated and quickly flipped for a large profit.

The result? Zillow lost $881 million because of the project, and Zillow Offers, the business unit running the home-flipping project based on Zestimate, had to lay off 25% of its employees. So why did this happen? Because of bad data.

Without going too deep into it, the AI algorithm overestimated home values based on its training data, leading Zillow to pay well above market price. This not only disrupted an already expensive housing market but also left Zillow with properties that it had to sell for less than it had paid.

In total, an AI algorithm founded on low-quality data cost a company hundreds of millions of dollars and disrupted an entire market.

Amazon

As far back as 2014, Amazon started working on an AI project that was supposed to help with recruitment. The purpose of the AI was to sort through millions of resumes and rate them to find the best candidates. The project was never finished, so the information we have is limited, but some crucial details were reported by Reuters.

The model was trained on historical resumes that had previously been submitted to the company, and it learned to recognize about 50,000 keywords found in them. Again, the information we have is limited, so we can’t say what feature engineering techniques the Amazon team applied to the datasets, whether it used data augmentation, etc.

But we do know the final result: the AI discriminated against female candidates, penalizing specific terms that were found in women’s resumes. This happened in part because of the preponderance of male resumes in the training data, as most people who applied for the targeted jobs during the period that the dataset covered were male.
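
To see how a skewed history can turn into a penalty on specific terms, consider the toy sketch below. It is not Amazon’s method, which was never published in detail. The scoring rule (hire rate per keyword), the function name keyword_hire_rates, the keywords club_a and club_b, and the resume history are all hypothetical and synthetic.

```python
from collections import Counter


def keyword_hire_rates(history):
    """Score each keyword by the share of past resumes containing it that were hired."""
    seen = Counter()
    hired = Counter()
    for keywords, was_hired in history:
        for keyword in keywords:
            seen[keyword] += 1
            if was_hired:
                hired[keyword] += 1
    return {keyword: hired[keyword] / seen[keyword] for keyword in seen}


# Synthetic history: every candidate lists the same skill ("python").
# "club_a" appears on most resumes; "club_b" appears on only two,
# and neither of those candidates happened to be hired.
history = (
    [({"python", "club_a"}, True)] * 6
    + [({"python", "club_a"}, False)] * 2
    + [({"python", "club_b"}, False)] * 2
)

rates = keyword_hire_rates(history)
for keyword in sorted(rates):
    print(f"{keyword}: {rates[keyword]:.2f}")
```

Here the naive scorer gives club_b a rate of zero based on just two resumes, even though those candidates list the same skill as everyone else. A model that ranks new resumes with these scores would systematically push down anyone who mentions the rare term, which is the same mechanism, in miniature, by which an unbalanced training set produces biased predictions.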

Amazon stated that the algorithm was never put into practice due to the deficiencies and because they could not eliminate the bias, but regardless, the impact of low-quality data on machine learning and AI algorithms is clear.

Microsoft’s Tay

The consequences of this example weren’t as serious as those of the previous two, but it perfectly encapsulates the data science credo of “garbage in, garbage out,” i.e., that ML and AI algorithms are only as good as their training data.

In 2016, Microsoft released an AI chatbot on Twitter, Tay, that was shut down only 16 hours after it was launched. Tay used its interactions with other Twitter users as the data it learned from. Very soon, the chatbot started giving very inflammatory answers to queries, many of them sexist and racist.

Microsoft claimed that trolls intentionally fed the chatbot bad data, which could very well be the case. But in hindsight, that was always going to be the most likely outcome. In any case, this is just another example, albeit a less consequential one, of what bad data can do to AI and ML models.


Need Some Help with Your ML and AI Projects?

Aya Data provides data annotation and data acquisition services at scale. Our teams work across the entire AI value chain to provide you with good, workable datasets that you can use to train your ML and AI models. If you wish to know how we can help you avoid the pitfalls of bad data and add value to your project, feel free to contact us to speak to an expert.
