OpenAI, the company behind the famed ChatGPT chatbot, has claimed in a recent legal filing that it needs copyrighted materials in order to continue training its AI models.
The company must keep releasing improved models in order to sustain itself, with a long-promised GPT-5 on the way later this year or sometime in 2025.
In this new filing, OpenAI appears to be suggesting that it should be allowed to use copyrighted materials for free, since the alternative is its business collapsing.
How OpenAI’s Argument Works
OpenAI’s filing was submitted to the UK House of Lords Communications and Digital Committee and argues that it would be “impossible” to create a valuable, market-leading AI model using public domain content alone.
As the evidence filing puts it:
“Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment but would not provide AI systems that meet the needs of today’s citizens.”
OpenAI is already facing lawsuits related to the unauthorized use of copyrighted materials: The New York Times alleges “massive copyright infringement” for the use of its content for training, while the Authors Guild has also sued over the use of famous authors’ works in AI training.
Does the Argument Hold Up in the Court of Public Opinion?
Personally, I’m reminded of a similar argument Facebook made years ago, when complaints about its poor content moderation emerged. Facebook’s response was that it was too large a platform to moderate properly, with the apparent implication that its size should shield it from consequences. To me, the argument suggested the opposite: that Facebook was too big to continue existing.
In OpenAI’s case, the company appears to be arguing that it can’t afford the copyrighted material it needs to create AI. If you accept that premise, I don’t see why the conclusion should be that OpenAI gets the copyrighted material for free. The more reasonable next step, in my view, would be for the company to change its approach, or perhaps to disband entirely.
Some critics seem to agree, with one particularly insightful X/Twitter comment comparing the situation to a hypothetical in which a drug dealer argues a similar case:
It’s not the same because drug dealers provide an actual service, but imagine if a drug dealer made this argument. “Your laws are fucking my money up” https://t.co/KgMlmcOIoj
— Patrick Cosmos (@veryimportant) September 4, 2024
LLM Training Troubles Will Likely Continue
The tech industry may have been distancing itself from the “move fast and break things” ethos that once defined it, but OpenAI’s legal troubles seem to indicate that many top tech companies still struggle to leave that mindset behind.
AI companies in need of training materials may find themselves facing further problems in the near future, according to the results of a new study: More than 57% of today’s internet content may be AI-generated already. This could result in a snake-eating-its-tail situation, with large language models (LLMs) training on content that was itself produced by earlier LLMs, a degradation loop researchers have dubbed “model collapse.”
AI has yet to deliver on many of its promises to revolutionize the world. Proving that the industry can afford to pay for the resources it needs in order to exist would be a great step towards doing just that.