The Lack of Diversity in Datasets And Why We Should Care

A significant issue in machine learning today is the lack of diversity in training datasets. When models are trained on narrow or homogeneous data, the result is often biased AI systems that can cause real-world harm, particularly to underrepresented communities. Ethical concerns arise when these biases reinforce stereotypes or lead to unequal outcomes. While AI researchers continue to advocate for greater transparency and inclusivity in dataset creation, regulatory frameworks are still evolving. Until stronger policies are in place, the push for diverse, representative data remains largely voluntary, yet it is crucial for building fair and trustworthy AI.

Despite the advancement and adoption of machine learning, there is much work to be done on bias, diversity, and inclusion within datasets themselves. Leaving specific communities out of datasets results in a lack of representation embedded within algorithms. One manifestation of this problem is facial recognition failing to detect Black faces, as highlighted by the Algorithmic Justice League in the documentary Coded Bias. Facial recognition can also misidentify faces, resulting in harm to those communities. One striking case occurred in 2015, when Google Photos labeled a Black couple as gorillas; Google temporarily resolved the problem by removing the ‘gorilla’ tag from its categorization. A similar failure occurred at Meta (known at that time as Facebook), when a user watching a mid-2020 video from a British tabloid featuring Black men saw an automated prompt asking whether they would like to “keep seeing videos of Primates.”

These machine learning examples typically involve supervised learning, which requires humans to manually label data. However, even unsupervised machine learning algorithms, which learn from vast quantities of data without human labeling, have problems as well. One notable example is OpenAI’s GPT-3, a language generation model that creates text from little input; it was trained on 570 GB of data and exhibits an anti-Muslim bias in its generated text. The training dataset includes text posted to the internet, such as English-language Wikipedia, and books uploaded to the internet. The training data contains linguistic regularities that reflect unconscious human biases, such as racism, sexism, and ableism.

The inability to address ethical problems in machine learning systems has begun to affect companies on the frontier of AI development. Google Cloud recently turned down a request for a custom financial AI, saying that the research to combat unfair biases must catch up and, “until that time, we are not in a position to deploy solutions.” Meta announced plans to shut down its decade-old facial recognition system, deleting the face scan data of more than one billion users.

Without a way to properly address these ethical issues, the progress of AI will be slowed and could potentially stall altogether.

How Can We Move Forward?

Academic researchers in AI ethics are pushing for changes that reduce the chances of deploying machine learning models in contexts for which they are not well suited. One paper, Model Cards for Model Reporting, proposes model cards, which document a model’s performance characteristics, as a way to avoid this issue. The model cards would accompany trained machine learning models and provide “benchmarked evaluation in a variety of conditions, such as different cultural, demographic, or phenotypic groups (e.g. race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g. age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains.” The model cards would also include the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
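To make the idea of benchmarked evaluation across groups more concrete, here is a minimal Python sketch of how accuracy could be broken down by a single attribute and by an intersection of attributes. It is an illustration only, not the procedure from the paper: the function name `disaggregated_accuracy`, the record fields, and the sample records are all hypothetical and made up for this example.

```python
from collections import defaultdict


def disaggregated_accuracy(records, group_keys):
    """Return accuracy per subgroup, where a subgroup is the tuple of
    values a record has for the attribute names in group_keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = tuple(rec[key] for key in group_keys)
        total[group] += 1
        if rec["label"] == rec["prediction"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records (invented for illustration, not real results).
records = [
    {"sex": "female", "skin_type": "V", "label": 1, "prediction": 0},
    {"sex": "female", "skin_type": "V", "label": 1, "prediction": 1},
    {"sex": "female", "skin_type": "II", "label": 1, "prediction": 1},
    {"sex": "male", "skin_type": "V", "label": 0, "prediction": 0},
    {"sex": "male", "skin_type": "II", "label": 1, "prediction": 1},
    {"sex": "male", "skin_type": "II", "label": 0, "prediction": 0},
]

# Single-attribute breakdown, then an intersectional breakdown.
by_sex = disaggregated_accuracy(records, ["sex"])
by_sex_and_skin = disaggregated_accuracy(records, ["sex", "skin_type"])

for group, accuracy in sorted(by_sex_and_skin.items()):
    print(group, round(accuracy, 2))
```

In this toy data, the breakdown by sex alone shows a gap, but only the intersectional breakdown shows that the errors are concentrated in one subgroup. That is the kind of detail an aggregate accuracy number would hide and a model card is meant to surface.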

Another paper, Datasheets for Datasets, suggests a datasheet for each dataset that would describe its operating characteristics, test results, recommended uses, and other information. Datasheets would be used to improve transparency and accountability in the machine learning community. Microsoft, Google, and IBM have started to pilot datasheets for datasets within their product teams. Creating more documentation within the machine learning development process is one step closer to building more inclusive algorithms.
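As a rough sketch of what a machine-readable datasheet might look like, the Python example below stores a few descriptive fields alongside per-group record counts and flags groups that fall below a chosen share of the data. The `Datasheet` class, its field names, the 10% threshold, and the sample numbers are assumptions made for illustration; they are not the template from the paper or from any company’s pilot.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Datasheet:
    """Hypothetical, simplified datasheet record for a dataset."""
    name: str
    recommended_uses: List[str]
    collection_notes: str
    group_counts: Dict[str, int] = field(default_factory=dict)

    def group_shares(self) -> Dict[str, float]:
        """Fraction of all records that belongs to each group."""
        total = sum(self.group_counts.values())
        if total == 0:
            return {}
        return {group: n / total for group, n in self.group_counts.items()}

    def underrepresented(self, min_share: float = 0.1) -> List[str]:
        """Groups whose share of the data is below min_share (an assumed cutoff)."""
        shares = self.group_shares()
        return sorted(g for g, share in shares.items() if share < min_share)


# Hypothetical dataset description (values invented for illustration).
sheet = Datasheet(
    name="example-faces-v1",
    recommended_uses=["research on face detection"],
    collection_notes="Images gathered from public web pages.",
    group_counts={"group_a": 800, "group_b": 150, "group_c": 50},
)

print(sheet.group_shares())
print(sheet.underrepresented())
```

A fixed threshold is a crude measure, and what counts as adequate representation depends on the intended use. The point of the sketch is only that recording group composition next to recommended uses makes gaps visible before a model is trained.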

Changing Regulations Around Data Diversity

Even as more large companies build AI systems, diversity and inclusion in datasets have yet to become heavily regulated.

The EU has proposed rules that classify AI systems into three risk categories. The Canadian federal government has started requiring algorithmic impact assessments for all systems delivered to the federal government. The US Federal Trade Commission published an article clarifying its authority under existing law to pursue enforcement actions against organizations that fail to mitigate AI bias or other unfair or harmful outcomes through the use of AI. The Office for AI in the UK recently released the National AI Strategy, with plans to develop the UK’s position on governing and regulating AI, which is set for publication in early 2022.

Although North America and Europe have started moving towards more regulation, the actual timeline of when these rules will come into effect is unclear. GDPR, for example, was proposed in 2012, adopted in 2016, and went into effect in 2018. Until official regulation comes into effect, companies will lack the momentum needed to bring more diversity into their datasets.

At Aya Data, we recognise the urgent need for diverse and inclusive datasets in AI. As a leader in data solutions, we are committed to tackling the ethical challenges posed by dataset biases. We invite researchers, developers, and organisations to collaborate with us in creating fairer AI systems.

Let’s prioritise diversity in our datasets to ensure accurate outcomes for all communities. Join us in advocating for better data practices and support our initiatives to make a meaningful impact in the AI field. Together, we can shape a future where technology serves everyone. Follow the link and contact us today to explore how we can work together on inclusive AI solutions.
