Developers Get Peace of Mind with Dataset from Getty Images

Enterprise developers will soon be able to access a sample of images from Getty Images via the Hugging Face hub.

Updated on September 9, 2024

Enterprise developers will soon be able to access a sample of images from Getty Images via the Hugging Face hub. The initial dataset includes 3,750 images from 15 categories, including healthcare, nature and travel.

The announcement comes at a time when developers are working to boost public confidence in AI-generated content by tackling AI hallucinations.

Responsible Sourcing of Training Sets

Getty Images holds and licenses more than 572 million “visual assets,” and more than 200 million of these are available for licensing, either for free or with a paid subscription.

These images — which include some of the earliest photographs taken — have been vetted legally and are of commercial quality.

In an announcement, the Hugging Face team explains that this open dataset can “enhance the performance of your machine learning and AI models.” It adds that each image has an average of 50 keywords as well as human-inputted captions.

A Better Deal for Creators

The announcement also promises that the content is “commercially safe” and that creators will be compensated for its use. This will be welcome news for creators after several publicized spats between artists and AI ventures over the widespread mining of copyrighted works.

The company clarified that the arrangement is about exactly that: getting a fair deal for creators and delivering high-quality training datasets that can be used with confidence.

“Imagine building or enhancing your AI/ML capabilities with data that’s not only diverse and high quality but also comes with the peace of mind that it’s responsibly sourced. That’s what we’re bringing to the table,” Andrea Gagliano, head of data science and AI/ML at Getty Images, told VentureBeat.

Changing How Developers Get Data

Gagliano said she hopes the move will push AI companies to adopt officially licensed content as standard practice, which would reduce disputes over copyright. It would also make AI technology far more reliable, as this data is high-quality, legally sound and vetted.

From a developer’s point of view, this will mean far less time spent deleting low-quality data, filtering out copyrighted content, and filling in the blanks when metadata is missing.
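To give a sense of the cleanup work described above, here is a minimal Python sketch of the kind of filtering developers often run on scraped image data. It is an illustration only: the `ImageRecord` fields, the `clean` function, the `"licensed"` label and the keyword threshold are all hypothetical and do not reflect the schema of the Getty Images dataset or any Hugging Face API.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ImageRecord:
    # Hypothetical record shape, invented for this example.
    path: str
    caption: str = ""
    keywords: List[str] = field(default_factory=list)
    license: str = ""  # empty string means the license is unknown


def clean(
    records: Sequence[ImageRecord],
    min_keywords: int = 3,
    allowed_licenses: Sequence[str] = ("licensed",),
) -> List[ImageRecord]:
    """Keep records that have a caption, enough keywords and a known license."""
    kept = []
    for record in records:
        if not record.caption.strip():
            continue  # missing caption: metadata would have to be filled in by hand
        if len(record.keywords) < min_keywords:
            continue  # too few keywords to be useful for training
        if record.license not in allowed_licenses:
            continue  # unknown or disallowed license: possible copyright risk
        kept.append(record)
    return kept


records = [
    ImageRecord("a.jpg", "A nurse checks a chart", ["nurse", "hospital", "chart"], "licensed"),
    ImageRecord("b.jpg", "", ["beach"], "licensed"),
    ImageRecord("c.jpg", "Mountain at dawn", ["mountain", "dawn", "sky"], ""),
]

cleaned = clean(records)
print([record.path for record in cleaned])  # ['a.jpg']
```

With a dataset that already ships with captions, keywords and cleared licensing, most of these checks would have nothing left to discard.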

The dataset is open and free to use, but developers will have to abide by some rules relating to the redistribution of the dataset and the creation of products or services that would directly compete with Getty Images.

Gagliano stated her hope that the deal could have far-reaching implications. “Our goal is to show that it is possible to accommodate licensing for all the content required to train functional AI models – developing business models that enable the creation of high-quality AI models while respecting creator IP,” she said.