Study: DeepSeek Could Be Training on OpenAI’s Output

Researchers found startlingly similar stylistic traits in results from DeepSeek and ChatGPT, suggesting copyright foul play.

A new study has found alarmingly similar outputs from DeepSeek and ChatGPT, fanning the flames in a battle over the IP of training data.

Microsoft and OpenAI have launched their own probe into whether DeepSeek improperly obtained data to train its AI model.

The new study, from Copyleaks, found that 74.2% of DeepSeek’s written text is stylistically similar to OpenAI’s ChatGPT outputs, which appears to back the two companies’ claims of foul play.

What Did the Researchers Find?

Using screening technology and three AI classifiers, the Copyleaks team studied texts from Claude, Gemini, Llama, and OpenAI models.

The classifiers identified what the company calls “subtle stylistic features” such as sentence structure, vocabulary, and phrasing. For a classification to be made, all three classifiers had to “agree,” delivering what Copyleaks says was a “99.88% precision rate and just a 0.04% false-positive rate, accurately identifying texts from both known and unknown AI models.”
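The “all three must agree” rule can be illustrated with a minimal sketch. The three toy classifiers below, their names, and the cues they look for are hypothetical stand-ins, not the models Copyleaks used, which the study summary does not describe. Only the unanimity logic reflects the description above.

```python
from typing import Callable, Optional, Sequence

# A classifier maps a text to the name of the model it thinks wrote it.
Classifier = Callable[[str], str]


def unanimous_label(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return a label only if every classifier agrees; otherwise None."""
    labels = {clf(text) for clf in classifiers}
    return next(iter(labels)) if len(labels) == 1 else None


# Hypothetical toy classifiers keyed on crude stylistic cues (illustration only).
def by_vocabulary(text: str) -> str:
    return "model_a" if "delve" in text.lower() else "model_b"


def by_sentence_length(text: str) -> str:
    sentences = [s for s in text.split(".") if s.strip()]
    avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return "model_a" if avg_words > 8 else "model_b"


def by_phrasing(text: str) -> str:
    return "model_a" if "it is important to note" in text.lower() else "model_b"


CLASSIFIERS = [by_vocabulary, by_sentence_length, by_phrasing]

# Prints a label when all three agree, or None when they do not.
print(unanimous_label("Short one. Delve in.", CLASSIFIERS))
```

Requiring unanimity trades coverage for confidence: a text on which the classifiers split gets no label at all, which is one plausible way to keep false positives low.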

 


These classifiers were then tested with DeepSeek-R1 and the team found that “74.2% of the generated texts aligned with OpenAI’s stylistic fingerprints.” This was in marked contrast to, for example, Microsoft’s Phi-4 model, which “demonstrated a 99.3% disagreement rate.”
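A figure such as 74.2% can be read as a simple proportion: the share of a model’s texts that the classifiers attributed to OpenAI. The sketch below shows that arithmetic on made-up labels. The sample data and the function name are assumptions for illustration, and the study’s exact definition of its agreement and disagreement rates is not given here.

```python
from typing import Optional, Sequence


def attribution_rate(labels: Sequence[Optional[str]], target: str) -> float:
    """Fraction of texts whose label equals `target`.

    `None` stands for a text on which the classifiers did not agree.
    """
    if not labels:
        raise ValueError("need at least one label")
    return sum(label == target for label in labels) / len(labels)


# Made-up labels for eight hypothetical texts; not the study's data.
sample = ["openai", "openai", None, "openai", "other", "openai", "openai", None]
rate = attribution_rate(sample, "openai")
print(f"{rate:.1%} of sample texts attributed to openai")
```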

What Does This Mean?

The similarity is striking, and it does not show up in content from the other LLMs tested, so it certainly seems like DeepSeek has been trained on OpenAI’s output.

“This discovery raises concerns about DeepSeek-R1’s resemblance to OpenAI’s model, particularly regarding data sourcing, IP rights, and transparency.” – Copyleaks team

The team adds that this potential “undisclosed reliance on existing models” could also reinforce biases, limit the diversity of responses that come from LLMs, and “pose legal or ethical risks.”

The researchers also go as far as suggesting that their findings could undermine “DeepSeek’s claims of a groundbreaking, low-cost training method.” If the Chinese company is using OpenAI’s data, it may have “misled the market contributing to NVIDIA’s $593 billion single-day loss and giving DeepSeek an unfair advantage,” they state.

A Breakthrough in Tracking IP?

OpenAI and Microsoft continue their probe, but Copyleaks suggests that its research provides a new tool offering “model-specific attribution.”

“This is a breakthrough that fundamentally changes how we approach AI content. This capability is crucial for multiple reasons, including improving overall transparency, ensuring ethical AI training practices, and, most importantly, protecting the intellectual property rights of AI technologies and, hopefully, preventing their potential misuse.” – Shai Nisan, Chief Data Scientist at Copyleaks

And this could benefit not only the AI companies jostling for dominance but perhaps also the many organizations that have accused them of copyright infringement.


Written by:
Katie has been a journalist for more than twenty years. At 18 years old, she started her career at the world's oldest photography magazine before joining the launch team at Wired magazine as News Editor. After a spell in Hong Kong writing for Cathay Pacific's inflight magazine about the Asian startup scene, she is now back in the UK. Writing from Sussex, she covers everything from nature restoration to data science for a beautiful array of magazines and websites.