The result comes after a written request from the Rights Alliance and means that Denmark continues to lead the way in enforcing rights in relation to artificial intelligence.
Danish media organisations’ exclusive control over their own content is crucial to maintaining the foundation of Danish journalism. But media copyright is increasingly challenged by ‘web crawlers’ that copy content from the organisations’ websites in order to make articles illegally available in datasets used to train AI services.
Now, after a written request from the Rights Alliance, the web archive Common Crawl has stopped copying content from the websites of a number of Danish media houses. Until now, Common Crawl had been copying full-length articles from the media organisations’ websites for a long time without obtaining permission from, or providing compensation to, the rights holders.
Danish rights holders take the lead
Danish rights holders are once again leading the way internationally when it comes to enforcing rights in relation to artificial intelligence. The same was true last year, when we succeeded in having the controversial Books3 training dataset removed; it included up to 200,000 illegal copies of works by Danish and international authors. The only other rights holder in the world to have achieved the same result is The New York Times, which, in connection with its lawsuit, has requested that Common Crawl remove illegal copies.
Thomas Heldrup, Head of Content Protection and Enforcement at the Rights Alliance, says:
‘When content is made available for free and can be freely used by AI developers, their incentive to pay for the rights holders’ content disappears. By enforcing against the illegal copying of content used to train AI, we can give control back to the rights holders and strengthen their position in negotiations with AI developers. It also sends a clear signal to AI developers that the use of creative content requires authorisation from the respective rights holders.’
Generative AI trained on data from Common Crawl
In response to our request, Common Crawl will also review its existing datasets in order to remove content belonging to the Danish media houses in question.
But enforcement on behalf of the media houses doesn’t stop at Common Crawl: content from the web archive forms the basis of many datasets on the web that tech companies use to train artificial intelligence.
One example is Google’s popular C4 dataset, which is based on copies from Common Crawl and has been used by OpenAI, Meta, Google and others to train generative AI. In July and August alone, the C4 dataset was downloaded almost 200,000 times from the Hugging Face platform. The Rights Alliance therefore continues to uncover the use of illegal copies of Danish media content in training data in order to enforce the rights of media houses.
Also read Wired’s article: Publishers Target Common Crawl In Fight Over AI Training Data
Get updates on our AI efforts on Thomas’s LinkedIn here.