Rights holders have criticized the design of the code of practice for providers of general-purpose AI models and the transparency template for training data, both published this summer. Here are the challenges we see for copyright enforcement.
Over the summer, the European Commission published the final version of the code of practice for providers of general-purpose AI models, together with the final version of a transparency template that these providers must complete with information about the training material they have used. Guidelines were also issued clarifying selected concepts in the AI Act itself. On August 2, the transparency obligations for providers of general-purpose AI models entered into force.
The Danish Rights Alliance and other rights holder organizations have worked intensively to give rights holders a real opportunity to enforce their copyright. We have done this by participating in the working group on the code of practice and submitting comments on a draft transparency template issued by the EU AI Office at the beginning of the year.
Although these efforts have helped to tighten certain obligations, the final code of practice and template fall far short of what we consider necessary for effective copyright enforcement. We therefore share the criticism recently voiced by several rights organizations at the international level, as well as at the national level by our members Danske Forlag, Producentforeningen, and Koda, among others.
The transparency template provides limited enforcement options
While the idea behind the code of practice is that signatory AI providers can demonstrate their compliance with the AI Act, the hope was that the transparency template would give rights holders the insight into AI providers’ training data that enforcement requires. Here we summarize the main challenges we face in our work to enforce copyright.
Compliance with obligations expected in 2026 at the earliest
Although the copyright obligations in the AI Act have entered into force, concrete action from AI providers can be expected in about a year at the earliest. This is because the AI Office will not begin enforcing violations of the AI Act until August 2, 2026, and AI providers who have signed the code of practice have the same deadline to demonstrate compliance. In addition, providers of general-purpose AI models placed on the European market before August 2, 2025, only have to comply with the obligations from August 2, 2027.
History shows that AI providers comply with regulation only when enforcement forces them to. Already, OpenAI has failed to publish training data for its latest GPT-5 model, even though the model was launched on the European market after August 2, 2025.
Insufficient transparency requirements for datasets, crawled domains, and illegal file-sharing services
Under the AI Act, providers of general-purpose AI models must prepare and publish a sufficiently detailed summary of the content used to train the model, in accordance with the AI Office’s template. Unfortunately, we must note that the template does not require enough information for the effective exploitation and enforcement of copyright.
Below, we describe real-world examples where AI providers are not required to disclose sufficient information.
1. Datasets
For datasets, only “large” publicly available datasets must be listed by name and link. An AI provider is required to publish details about a dataset only if it constitutes more than 3 percent of all publicly available datasets used in training within a specific content category (e.g., text, audio, or video). A dataset below the 3 percent threshold need only be described in general terms.
Based on our case concerning the Books3 dataset, the Danish Rights Alliance has recommended that the threshold be removed entirely, as our experience shows that AI providers use datasets containing illegal content that often make up only a small share of the total training data. The threshold was lowered from 5 to 3 percent, but that is not enough.
To illustrate the shortcomings, we can look at the same case involving the Books3 dataset, which Meta, among others, used to train its Llama 1 AI model. In addition to Books3, Meta used a publicly available dataset of text from Common Crawl totaling 3.3 TB, as well as data from a number of other public datasets, bringing Meta’s total training data to approximately 4.7 TB. Since Books3 consisted of at most 85 GB of text, it accounted for less than 2 percent of the total training data in the text category, well under the 3 percent threshold. This means that Meta would not have to disclose the name of and link to Books3 if it placed Llama 1 on the European market today.
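The arithmetic behind the threshold can be sketched in a few lines of Python. This is a minimal illustration using the approximate figures above; the template’s exact basis for computing a dataset’s share of a content category may differ.

```python
# Sketch of the 3 percent disclosure threshold from the transparency template,
# using the approximate figures from the Llama 1 / Books3 example.
# Sizes are approximations taken from public reporting, not official numbers.

DISCLOSURE_THRESHOLD = 0.03  # name and link required above 3 % of a category


def must_be_named(dataset_gb: float, category_total_gb: float) -> bool:
    """True if a dataset's share of its content category exceeds the
    threshold, so it must be listed by name and link."""
    return dataset_gb / category_total_gb > DISCLOSURE_THRESHOLD


books3_gb = 85          # upper estimate for Books3
text_total_gb = 4_700   # ~4.7 TB of text training data in total

share = books3_gb / text_total_gb
print(f"Books3 share: {share:.1%}")              # Books3 share: 1.8%
print(must_be_named(books3_gb, text_total_gb))   # False: no name/link required
```

Under these figures, Books3 stays below the threshold and would only have to be described in general terms.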
Since almost all providers of general-purpose AI models use Common Crawl data, we presume this is a general transparency problem across all popular AI models.
2. Collection of training material from internet domains
According to the template, AI providers are only required to list the most “relevant” domains from which they have collected data: the top 10 percent of domains, ranked by the amount of data collected from each domain, measured representatively across all content categories.
This limitation in the degree of transparency means that we will probably not be informed of domains belonging to Danish rights holders, as AI providers focus their collection on domains with text in the major language areas such as English and Spanish. This distorts enforcement options and particularly affects small and medium-sized rights holders, as well as those from smaller language areas such as Denmark.
For AI providers that are small or medium-sized enterprises, the prospects for transparency are even worse. Their domains only have to be listed if they are among the top 5 percent of domains or the 1,000 most used domains.
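The effect of these cut-offs can be sketched as follows. The domain names and sizes are made up for illustration, ranking by data volume is our reading of “relevant,” and we assume the SME rule (“top 5 percent or the 1,000 most used domains”) means whichever set is smaller.

```python
# Sketch of the "most relevant domains" rule from the transparency template.
# Assumptions: relevance is ranked purely by data volume per domain, and the
# SME rule means the smaller of "top 5 %" and "top 1,000". Domains are fictional.


def domains_to_list(domain_bytes: dict[str, int], sme: bool = False) -> list[str]:
    """Domains a provider would have to disclose, ranked by data volume."""
    ranked = sorted(domain_bytes, key=domain_bytes.get, reverse=True)
    if sme:
        cutoff = min(max(1, int(len(ranked) * 0.05)), 1000)  # SME: top 5 % or top 1,000
    else:
        cutoff = max(1, int(len(ranked) * 0.10))             # default: top 10 %
    return ranked[:cutoff]


# A crawl of 5,000 large domains plus one small Danish domain:
crawl = {f"site{i}.example": 10_000 - i for i in range(5_000)}
crawl["danishpublisher.dk"] = 1  # hypothetical small rights holder's domain

print("danishpublisher.dk" in domains_to_list(crawl))  # False: never disclosed
```

Because a small domain in a small language area rarely reaches the top of the volume ranking, it falls outside the disclosure cut-off no matter how systematically it was crawled.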
3. Collection of training data from illegal file-sharing services
It has repeatedly emerged in US court cases concerning AI and copyright that AI providers such as Meta, Anthropic, and OpenAI have collected training data from illegal file-sharing services such as LibGen.
Since downloading content from LibGen and similar sites does not involve collection by crawlers or bots, we are concerned that AI providers will list it under section 2.6, “other sources of data,” which requires only a “narrative” description of the data sources and data — if they describe content collected from illegal file-sharing services at all.
Where does this leave rights holders?
Because there will be no intervention against AI providers until August 2, 2026, at the earliest, we do not expect any new developments from them for the time being. Even after that date, rights holders will not have sufficient insight into whether their content is being used to train general-purpose AI models. This applies in particular to small and medium-sized rights holders and to rights holders from smaller language areas such as Denmark.
The disclosure thresholds and the vague requirements for the content of the documentation mean that Danish rights holders will have limited insight into the use of their content. Our enforcement work therefore cannot rest on the limited knowledge that AI providers are required to make available under the transparency template.
We and other rights holders therefore continue to face the daunting task of documenting which copyright-protected content is used to train general-purpose AI models, documentation that is crucial for effectively exploiting and enforcing one’s copyrights.
The Commission will regularly assess whether there is a need to revise the transparency template, including in light of practical experience and technological developments. The Commission may choose to revise the template before its enforcement powers take effect on August 2, 2026.
The Danish Rights Alliance will therefore continue to do what we can to highlight the significant challenges to effective copyright enforcement.
