
Debate: Piracy is a Part of the Foundation of the World’s Leading AI Services

Nov 14, 2023 | In media

Presented here is the Rights Alliance’s opinion piece, published by Technology Media House’s outlet DataTech on Tuesday 14 November 2023. The full article can be read here.

In the development of artificial intelligence, creative content has become a sought-after source of training data. We would be wise to learn from the past as copyright is once again framed as the enemy of technology.

The development of artificial intelligence fascinates people all over the world, not least artists, who are embracing the new tools. At the same time, creative content creators are losing control over the use of their work, as quality human-generated content has become a coveted source of training data for generative AI developers.

The Rights Alliance’s takedown of the illegal training dataset Books3 has received widespread public attention. It has also sparked opposition from AI developers who are perpetuating the same ideas that fuelled the development of the illegal file-sharing environment in the early 2000s. But the notion that copyright limits freedom of expression and the development of a digitalised democratic society only serves those who profit from the use of illegal content.

In the name of technology

With the rapid development of artificial intelligence, images, text, music, and other creative content have become valuable data for training large language models, image generation programmes, and the like. One such example is the organisation EleutherAI, which is behind the distribution of the Books3 training dataset and describes books as an indispensable source of training data: “We included [books3] because books are invaluable for long-range context modeling research and coherent storytelling.”

However, Books3 is illegal: it contains approximately 200,000 e-books sourced from an illegal German file-sharing service. The Rights Alliance successfully had Books3 removed from its distributor EleutherAI and from other corners of the internet after identifying illegal copies of Danish works in the dataset. The takedown has drawn significant media attention and support from rights holders globally, marking the first instance of an illegal training dataset being taken down through rights enforcement.

Yet the takedown of Books3 faces opposition from developers and others who argue that copyright enforcement hampers inevitable technological progress or even “destroys the world”, as seen in reactions on X.

This case exemplifies the justification of using illegal content in artificial intelligence training, echoing the views that fuelled the illegal file-sharing community. At the core of this argument lies the belief that an individual’s right to copy, share, and use content is a prerequisite for freedom of expression and democracy in the digital age.

The advent of the internet favoured the illegal file-sharing community, with slogans like ‘information wants to be free’ delaying regulatory efforts by politicians and authorities. The sentiment persisted, as seen in the opposition to the ACTA treaty in 2012, which framed copyright enforcement as antithetical to democracy.

Similarly, Shawn Presser, the developer behind the Books3 dataset, contends that the use of creative content in training data is vital for the widespread development of artificial intelligence:

“[I]t’s crucial that you and I can make our own ChatGPTs, for the same reason it was crucial that anybody could make their own website back in the ’90s. […] The only way to replicate models like ChatGPT is to create datasets like Books3. And every for-profit company does this secretly, without releasing the datasets to the public.”

While it is true that the major tech companies are not transparent about their data usage, the reality is that Meta, StabilityAI and Bloomberg have all used Books3 to train previous versions of their models. It is therefore hard to accept that illegal datasets such as Books3 should democratise the development of artificial intelligence when the same data has already contributed to the training of the world’s leading AI services.

Piracy is no longer just a question of individuals using and sharing illegal content; it is now, in part, the foundation of the success of generative AI services developed by the world’s most profitable companies:

“A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that’s come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.”

The end does not justify the means

It is paradoxical that artificial intelligence relies on high-quality, human-generated content while simultaneously undermining the creative industry: first by misusing the content, and then by generating new content that competes with rights holders’ products. This risks a downward spiral in which the supply of creative content deteriorates as creative players lose control, and content quality declines as machines replace humans in cultural content creation. This will impoverish not only artists, but the entire world.

For rights holders to maintain control over their content, transparency from AI developers is essential. The takedown of Books3 emphasises this: unusually, the dataset contained information about the contents and sources of its data. However, developers of artificial intelligence services are increasingly concealing the data used to train their models. For instance, Meta does not disclose training data information for the latest version of its AI service, LLaMA-2, unlike LLaMA-1, where the training data, including Books3, was listed. Without transparency requirements, the ability of rights holders to protect their content depends on the goodwill of tech companies that are heavily reliant on that very content.

The history of the file-sharing services must not repeat itself: if creative content is to be used in artificial intelligence training, it must be done with respect for the contributors who enrich the world with art and valuable training data. Dispelling the myth that copyright impedes technological progress and individual freedom is crucial. The experience of combating illegal file sharing shows that this viewpoint primarily benefits those who profit from training AI services on illegal content.

Read the full opinion piece at DataTech here