The Art of the Dataset

Artists working with AI face multiple decisions during their creative process. Typically, they need to come up with a compelling concept, find a relevant dataset, choose a suitable algorithm and curate the generated images for display. Each stage offers plenty of choices that heavily influence the aesthetics of the final artwork, and each has become a focus area for a different type of artist. Critical media artists are intent on exploring complex concepts through AI, technology fans aim to generate the perfect visuals and fine artists consider the data that are used to train computer vision and natural language processing systems. Artists in this final group investigate datasets: they create their own data, explore the role of human labour in AI art and address inequality by tackling bias in dataset creation, data annotation and classification.

Anna Ridler centres her artistic practice on the idea of owning the dataset component when making AI art. By producing her own data, whether drawings or photographs, Ridler controls exactly which images make up the dataset that the algorithm is trained on and therefore has a direct impact on the aesthetics of the generated images. In her early work Fall of the House of Usher, the artist makes over 200 black and white ink drawings based on the 1920s silent film adaptation of Edgar Allan Poe’s story. Ridler then trains the computer vision system on these drawings and makes an animation out of the generated images, a fine art interpretation of the film for the AI age. The animation reveals the artist’s individual style: eyes and eyebrows are drawn similarly, while background objects appear and disappear across the frames because of their secondary importance. This example illustrates how much the input data influences the final output: the particularities of the dataset (style, bias, omission) will be reflected in the generated images.


In her subsequent artwork Myriad (Tulips), Ridler emphasises the human labour involved in making a dataset. Over the course of the tulip season in the Netherlands, the artist took 10,000 photographs of tulips, then labelled and classified the images to form the dataset for her work Mosaic Virus. The scale of the time and effort involved in making such a dataset is made clear when the works are exhibited: Ridler pins thousands of photographs on a wall alongside the videos of the generated flowers, highlighting the typically forgotten component of AI art. Trevor Paglen takes a similar exhibition approach in From ‘Apple’ to ‘Anomaly’, an artwork that delves into ImageNet, a popular dataset that underpins many computer vision models. Paglen fills the entire gallery wall with 30,000 images, inviting the viewer to walk up and review which images are clustered around particular data labels. Some categories are straightforward, such as ‘strawberry’ or ‘sun’, whereas others classify images by personal habits or character traits, which are highly subjective.

The bias in dataset creation is made clear by Mimi Onuoha in her work The Library of Missing Datasets, which highlights the groups we exclude in our data collection process by presenting a filing cabinet of empty folders, each labelled with a description of the data it should contain. In a similar spirit, Jake Elwes identified the lack of representation of drag and gender-fluid faces in commonly used datasets and proceeded to make Zizi – Queering the Dataset, adding 1,000 new images to an existing dataset, then retraining the model on it to achieve a more inclusive result.

Meanwhile, in Feminist Data Set, Caroline Sinders reviews the whole AI process through a feminist lens, from data collection and data labelling to chatbot design. The project frequently takes the form of workshops and toolkits, designed to engage the public and to offer concrete means of addressing data bias. Influenced by open-source models of knowledge collection such as Wikipedia, Sinders focuses on collective work, aiming to make future datasets as free from bias as possible by involving many female perspectives. In her accompanying toolkit, Sinders details her stance “that not one person or one single entity should be responsible for knowledge creation or gathering, it should fall to a community to reflect the community’s ideas”.

Today’s artists working with datasets deal with the tangible, human-centred aspects of machine learning that can be controlled by the artist instead of focussing on the generative powers of technologies they did not develop. By investigating the dataset, artists are able to visually explain to the audience how machine learning works and the decisions taken at each stage of the data collection, labeling and classification process, prompting companies and individuals to review what can be done to mitigate bias. For Aya Data, making unbiased datasets is at the heart of its mission: more inclusive datasets and AI systems will allow us to create a fairer society for everyone to enjoy.

By Luba Elliott
