Thousands of secrets, including login credentials for Amazon Web Services (AWS), MailChimp, and WalkScore, have been found in an AI training dataset used by the likes of DeepSeek.
In September, a report from Deloitte suggested that almost three out of every four professionals the company surveyed put data privacy among their top three concerns surrounding the rapid rollout of generative AI tools.
Of equal concern is where the data used to train these AI models comes from, and news of leaks like this will only justify those fears.
Open-Source Dataset Trawled
The secret data was found by cybersecurity researchers, who trawled through 400 terabytes of information.
This was collected from 2.67 billion web pages archived in 2024 and held by Common Crawl, a non-profit that has maintained an open-source repository of web data since 2008. Because the archive is free for anyone to use, it is popular with LLM developers. It currently hosts around 250 petabytes of web data, with more added constantly.
It was a team at Truffle Security that analyzed this data and found almost 12,000 “valid secrets”, including API keys and passwords, hardcoded in the archive.
Login Details Found in AI Dataset
The Truffle team said that they were prompted to carry out the analysis after wondering why Large Language Models (LLMs) were instructing developers to hardcode API keys.
In the blog post announcing its discoveries, the team says that it detected 11,908 live secrets, spread across 2.76 million web pages.
Even more worryingly, it also found a high reuse rate among the secrets: 63% were repeated across multiple web pages. “In one extreme case, a single WalkScore API key appeared 57,029 times across 1,871 subdomains,” the researchers write.
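The kind of scan described above can be sketched, in simplified form, as pattern matching over archived pages. The patterns, sample data, and function names below are illustrative assumptions, not Truffle Security’s actual tooling; real scanners use far more patterns and also verify whether each credential is still live.

```python
import re

# Illustrative patterns only: an AWS access key ID begins with "AKIA"
# followed by 16 uppercase alphanumerics; a Mailchimp key is a 32-char
# hex string with a datacenter suffix such as "-us1".
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def find_secrets(pages):
    """Scan a mapping of {url: page_text} and record each match with its source page."""
    hits = []
    for url, text in pages.items():
        for name, pattern in SECRET_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((name, match, url))
    return hits

def reuse_rate(hits):
    """Fraction of distinct secrets that appear on more than one page."""
    pages_per_secret = {}
    for name, secret, url in hits:
        pages_per_secret.setdefault((name, secret), set()).add(url)
    counts = [len(urls) for urls in pages_per_secret.values()]
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 1) / len(counts)
```

Run against a crawl archive, a scan like this would surface hardcoded keys and show how often the same credential recurs across pages, which is the reuse statistic the researchers report.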
Impact of Login Details Being Exposed by AI
Needless to say, any exposed login data is bad news for the original account holders, leaving them highly vulnerable. And because these datasets are used by some of the world’s biggest AI pioneers, their AI tools could be weaponized by cybercriminals looking to uncover login credentials before launching an attack.
Truffle Security is reported to now be working with the vendors to help fix the issue. It has also issued some advice for the AI industry as a whole. It writes: “LLMs may benefit from improved alignment and additional safeguards – potentially through techniques like Constitutional AI – to reduce the risk of inadvertently reproducing or exposing sensitive information.”
This technique has been developed by Anthropic – one of the few AI players to consistently push for an AI safety framework to be put in place. Others – including OpenAI – seem keen to push ahead with innovation at all costs – with the full support of the Trump administration.