Thousands of secrets, including login credentials for Amazon Web Services (AWS), MailChimp, and WalkScore, have been found in an AI training dataset used by the likes of DeepSeek.
In September, a report from Deloitte suggested that almost three out of every four professionals the company surveyed put data privacy among their top three concerns surrounding the rapid rollout of generative AI tools.
Of equal concern is where the data used to train these AI models comes from, and news of leaks like this will only justify those fears.
Open-Source Dataset Trawled
The secret data was found by cybersecurity researchers, who trawled through 400 terabytes of information.
This was collected from 2.67 billion web pages archived in 2024 and held by Common Crawl, a non-profit that has maintained an open-source repository of web data since 2008. Because the archive is free for anyone to use, it is popular with LLM developers. It currently hosts around 250 petabytes of web data, with more added constantly.
It was a team at Truffle Security that analyzed this data and found almost 12,000 “valid secrets”, including API keys and passwords, hardcoded in the archive.
Login Details Found in AI Dataset
The Truffle team said that they were prompted to carry out the analysis after wondering why Large Language Models (LLMs) were instructing developers to hardcode API keys.
In the blog post announcing its discoveries, the team says that it detected 11,908 live secrets, spread across 2.76 million web pages.
Even more worryingly, it also found a high reuse rate among the secrets: 63% were repeated across multiple web pages. “In one extreme case, a single WalkScore API key appeared 57,029 times across 1,871 subdomains,” the researchers write.
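The kind of scan described above can be sketched, in simplified form, as pattern matching over archived pages. The patterns, sample data, and function names below are illustrative assumptions, not Truffle Security’s actual tooling; real scanners use far more patterns and also verify whether each credential is still live.

```python
import re

# Illustrative patterns only: an AWS access key ID begins with "AKIA"
# followed by 16 uppercase alphanumerics; a Mailchimp key is a 32-char
# hex string with a datacenter suffix such as "-us1".
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def find_secrets(pages):
    """Scan a mapping of {url: page_text} and record each match with its source page."""
    hits = []
    for url, text in pages.items():
        for name, pattern in SECRET_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((name, match, url))
    return hits

def reuse_rate(hits):
    """Fraction of distinct secrets that appear on more than one page."""
    pages_per_secret = {}
    for name, secret, url in hits:
        pages_per_secret.setdefault((name, secret), set()).add(url)
    counts = [len(urls) for urls in pages_per_secret.values()]
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 1) / len(counts)
```

Run against a crawl archive, a scan like this would surface hardcoded keys and show how often the same credential recurs across pages, which is the reuse statistic the researchers report.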
Impact of Login Details Being Exposed by AI
Needless to say, any exposed login data is bad news for the original account holders, leaving them highly vulnerable. And because these datasets are used by some of the world’s biggest AI pioneers, their AI tools could be weaponized by cybercriminals looking to uncover login credentials before launching an attack.
Truffle Security is reported to now be working with the vendors to help fix the issue. It has also issued some advice for the AI industry as a whole. It writes: “LLMs may benefit from improved alignment and additional safeguards – potentially through techniques like Constitutional AI – to reduce the risk of inadvertently reproducing or exposing sensitive information.”
This technique has been developed by Anthropic – one of the few AI players to consistently push for an AI safety framework to be put in place. Others – including OpenAI – seem keen to push ahead with innovation at all costs – with the full support of the Trump administration.