Key takeaways:
- A German court has ruled in favor of GEMA, finding that OpenAI's use of copyrighted songs while training its LLMs was unlawful.
- The court relied on the memorization vs reproduction distinction, in light of the TDM exceptions under the EU Copyright Directive, to rule in favor of GEMA.
- This is in contrast to cases in the US, like UMG vs Anthropic or NYT vs OpenAI & Microsoft, where no final decisions have been made.
OpenAI has been dealt a significant blow, with the court ruling against the AI giant in the case brought by GEMA.
GEMA is a collective management organization representing music industry stakeholders (including lyricists, composers, and producers). It sued OpenAI for using nine well-known German songs to train its AI models, especially its GPT-4 and GPT-4o models.
First, GEMA disabled ChatGPT’s web search capability and then assigned it the role of a ‘lyrics expert.’ After this, prompts like “What are the lyrics to (this song)?”, “Can you tell me the chorus?” or “What was the first verse of (this song)?” were fed into ChatGPT.
Surprisingly (or unsurprisingly), the model was able to produce exact verses of the songs requested, albeit with a bit of hallucination. For instance, it produced 25 consecutive words from “36 Grad” and over 70 words of “Über den Wolken.”
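A claim like “25 consecutive words” can be checked mechanically by finding the longest run of consecutive words shared between the original lyrics and the model’s output. The sketch below is purely illustrative: the function name and the sample strings are hypothetical, not the actual lyrics or the methodology used in the case.

```python
def longest_common_word_run(original: str, generated: str) -> int:
    """Return the length (in words) of the longest run of consecutive
    words that appears in both texts."""
    a = original.lower().split()
    b = generated.lower().split()
    best = 0
    # Dynamic programming over word positions: prev[j] holds the length
    # of the common run ending at the previous word of `a` and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical strings -- not the lyrics at issue in the case:
print(longest_common_word_run(
    "und es war sommer und es war sommer",
    "the model wrote und es war sommer too"))  # prints 4
```

A run of dozens of consecutive words, as reported in the GEMA tests, is far beyond what chance word overlap produces, which is why it reads as evidence of memorization rather than coincidence.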
And hey, these aren’t any under-the-radar songs ChatGPT just came across in the dumps of the internet. They are, in fact, hyper-popular German songs.
Think of songs like “Take Me Home, Country Roads” by John Denver or “Hips Don’t Lie” by Shakira. Reproducing ultra-mainstream songs like these verbatim isn’t something that could have happened by chance.
The Court’s Ruling
Of course, there can be no doubt that those songs were a part of ChatGPT’s training data. After establishing this, the court relied on two major components to arrive at its decision: memorization and reproduction, and the EU’s Text and Data Mining (TDM) exception.
Any AI model’s training process can be broken down into three distinct parts.
- First, the extraction or scraping of data from various web sources to form a central corpus.
- Second, the analysis and training of the model based on this data.
- Third, the actual use of these models to generate outputs based on the trained data.
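The three phases above can be mapped out in a toy sketch. Every function name here is hypothetical and purely illustrative; this is not how any real LLM pipeline is implemented, only a map of where the court’s legal questions attach.

```python
# Toy sketch of the three phases of AI model development -- illustrative
# only. All function names are hypothetical.

def phase_1_collect(urls):
    # Extraction/scraping: gather raw text into a central corpus.
    # Per the ruling, this step can fall under the EU's TDM exception.
    return [f"<text scraped from {u}>" for u in urls]

def phase_2_train(corpus):
    # Analysis/training: fit model parameters on the corpus.
    # Memorization, per the court, happens here if the model retains
    # some of the training text in its parameters.
    return {"weights": f"parameters fit on {len(corpus)} documents"}

def phase_3_generate(model, prompt):
    # Use: generate outputs. Infringement, per the court, arises only
    # if this output reproduces protected training data.
    return f"completion for {prompt!r} from {model['weights']}"

corpus = phase_1_collect(["https://example.com/lyrics"])
model = phase_2_train(corpus)
print(phase_3_generate(model, "What are the lyrics to ...?"))
```

The court’s analysis essentially asks at which of these three stages copyright law bites: the first two were judged permissible under the TDM exception, while the third is where GEMA’s claim succeeded.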
Memorization vs. Reproduction: The Legal Line That Matters
Memorization kicks in during the second phase of development, i.e., training. The court assumes that memorization will occur during the training of an AI model, which happens when the model retains some of the information or data it was actually trained on.
There can be a debate on whether this is a valid assumption. We think it is. Any model will have to memorize some part of the training data to reproduce content similar to the data it was trained on.
The court ruled that everything was fine up to this point and within the scope of the EU’s TDM exception. Articles 3 and 4 of the EU Copyright Directive permit AI developers to train models on copyrighted data, provided rights holders are offered an opt-out.
TDM enables the temporary storage of data (similar to RAM), format conversions, and analysis.
However, the problem arises when the model begins reproducing that data in its outputs, indicating that it has permanently stored the data it used, something not covered by the TDM exception and thus in violation of the EU’s copyright laws.
Another case that can help us understand this memorization vs reproduction argument is Getty Images vs Stability AI, the maker of Stable Diffusion. Here, Getty Images filed direct and secondary infringement claims against Stability AI, accusing it of using its copyrighted images in Stable Diffusion’s training data.
However, in that case, the judge ruled against Getty Images, finding that the model did not ‘reproduce’ any of the copyrighted material in its output.
Even here, the fact that an AI model can memorize some of the training data was well established. This brings us to a very key demarcation: merely memorizing training data does not constitute copyright infringement.
If you memorize a verse of Shakespeare’s poetry and learn it line by line, that does not mean you have infringed his copyright. However, when you start writing the same verses, passing them off as your own, and publishing them under your name, that’s when copyright infringement kicks in.
That’s exactly the distinction courts have alluded to. Infringement occurs when the AI model begins reproducing the data it previously memorized during training. This is the basic difference between these two similar cases. In the GEMA case, ChatGPT reproduced exact verses of the songs it was trained on. However, Stable Diffusion did not reproduce the images owned by Getty Images.
This case has set a very important precedent for copyright claims in the EU. We’ll probably see courts citing these two cases while establishing the validity of copyright infringement cases in the future.
The Ethical Grey Zone, or the Loophole: When AI Mimics Without Reproducing
While it’s not expressly stated, the court’s ruling implies that using someone’s copyrighted material for training isn’t infringement as long as there’s no reproduction. This means that if AI companies ensure no reproduction takes place, they can use any and all copyrighted material for training their AI models. This still seems unethical, doesn’t it?
For example, you could build an LLM trained exclusively on Shakespeare’s entire life’s work while ensuring there’s no reproduction at all. Whenever a user prompts it to write in ‘Shakespeare’s style and tone,’ the model could generate Shakespeare-like writing without actually reproducing any of his original content. Technically, this is not copyright infringement, but it still hijacks the very individuality of Shakespeare.
Similarly, consider if you trained an LLM model exclusively on Ed Sheeran’s songs and then fed it lyrics and music of your own. The model would be able to generate a song that seems like Ed Sheeran himself has made it. This is still a massive grey area in the entire copyright infringement fiasco.
In these cases, while there’s no exact reproduction, the output still infringes on the identity of the artist himself. This way, it becomes easier for AI companies to mimic or take inspiration from copyright holders while steering clear of infringement claims.
EU vs US AI Laws
While the EU’s copyright framework and TDM exceptions focus on protecting the rights of copyright owners, US lawmakers seem to be in a state of confusion when it comes to holding AI companies accountable for eating into years of hard work by these artists.
Similar cases in the US have turned into long-drawn-out battles in court with no end in sight. Here are a few cases that highlight the indecision of the US courts.
Universal Music Group (UMG) vs Anthropic
A case similar to GEMA vs OpenAI is UMG vs Anthropic. UMG, along with ABKCO Music and other publishers, filed a lawsuit against Anthropic, accusing it of using as many as 500 copyrighted songs to train its Claude LLM. As expected, Anthropic claimed ‘fair use’ and denied any copyright infringement.
What’s interesting is that UMG had requested a temporary injunction to stop Anthropic from using the copyrighted content while the case was ongoing. While the court decided there was enough evidence for the case to go to trial, it rejected the injunction plea due to the lack of sufficient demonstration of any ‘irreparable harm’ done to the plaintiff.
This is pretty confusing. It’s like the court saying, ‘Yes, a crime may have been committed, but we won’t stop the defendant from continuing until it’s proven.’ The case was filed in October 2023, and a decision has not yet been made. Anthropic has been able to use copyrighted songs to train its AI for two years, and it will be legally allowed to do so until the case concludes.
Even if the court finds Anthropic guilty, it would, in all likelihood, impose a monetary penalty and/or instruct it to remove the songs from the LLM’s training data. Now, AI companies view these penalties as the ‘cost of doing business,’ or the cost of training their AI models.
Let’s circle back to the memorization vs. reproduction debate cited by the German court. How would an AI model ‘un-learn’ or ‘forget’ data that it has already been trained on? Would this require some sort of reverse engineering?
This is exactly the kind of confusion in the American laws, a loophole that the AI companies are milking while legislators try to wrap their heads around this newfound problem.
New York Times (NYT) vs OpenAI and Microsoft
Something similar happened in the NYT vs OpenAI and Microsoft case, where the plaintiffs accused the AI companies of using copyrighted news articles to train their LLMs. The defendants claimed that they only used ‘publicly available’ articles, which, in their view, does not infringe any copyright.
The case has been in pre-trial since December 2023. Sadly, the court hasn’t permitted any injunction in this case, allowing the companies to continue using the alleged copyrighted material for training.
We can see a clear pattern in how US courts and laws view the entire copyright matter. It seems as if the courts are busy creating an ‘illusion’ for plaintiffs that their rights are important and they have, in fact, been wronged by the AI companies. However, at the same time, the cases are progressing very slowly, with no injunctions, allowing the AI giants to continue their operations.
The Scarlett Johansson Case
This is also not the first time OpenAI has disregarded its moral compass in developing its AI chatbot. In 2024, OpenAI rolled out ChatGPT’s new Voice feature, which contained one particular voice (named ‘Sky’) that sounded eerily similar to Scarlett Johansson.
According to reports, Sam Altman had previously asked Johansson to lend her voice to ChatGPT, which she politely declined. However, OpenAI allegedly went on to mimic her voice through a voice actor in a later update.
While OpenAI denied any wrongdoing, public feedback was overwhelmingly negative, with several users claiming it was actually the actress’s voice. Not only is this a violation of personal rights, but an AI behemoth like OpenAI doing so sets the wrong precedent for the industry. Ultimately, though, OpenAI had to pause Sky.
This also highlights the stark difference between EU laws and laws in the US. First of all, there’s no equivalent of the EU AI Act in the US, which provides considerable leeway for AI companies to operate in the grey area.
Even in this case, no legal proceedings have taken place, since the laws vary by state and there’s no central legislation governing the wrongful use of data or personal attributes by AI companies.
The Way Ahead
All in all, we’re not refuting the fact that artificial intelligence is the future of technology. However, it’s clear to everyone that building and training these AI models comes at a cost – both monetary and ethical. Now, it’s up to us whether we want to (or should) incur such costs.
The immediate ethical issue is the unfair use of copyrighted material, regardless of what the laws in various regions suggest. While the EU views it more progressively and in favor of those being wronged by such infringements, the US seems to side with the capitalistic AI companies.
Amidst all this, it’s the actual copyright holders that are suffering. And while Hollywood stars, high-profile authors, and artists might get away with compensation or licensing agreements with AI companies, that’s not a luxury available to indigenous or independent copyright holders.
This is why it’s more important now than ever to push for strict copyright laws in the US.