
Public Domain Data

#AIinEd, AI, AI Pioneers, Artificial Intelligence, Data, Open Data, Open Source, Wales Wide Web

21st November 2024

Jamillah Knowles / Better Images of AI / Data People / CC-BY 4.0

I've been looking at how we could use Open Source software to develop Generative AI applications for education. One of the issues, of course, is the data used to train the AI. It's interesting that reports say the quality of training data is getting worse, probably because so much poor-quality data is now being produced by AI itself. So I was interested in an article, The Making of PD12M: Image Acquisition, published on the Spawning blog.

It reports that, in the evolving landscape of AI data collection, the Spawning team has introduced Public Domain 12M (PD12M), an innovative dataset of 12.4 million image-text pairs that addresses critical challenges in acquiring AI training data. Unlike traditional web scraping, PD12M focuses on ethically sourced images from reputable cultural institutions like Europeana, Wikimedia, and the Smithsonian.

The dataset tackles several persistent issues in AI training data: copyright concerns, image quality, and consent. By exclusively using images with Public Domain Marks or CC0 licenses, PD12M minimizes legal and ethical complications. The team carefully curated images from OpenGLAM institutions, ensuring high-quality, professionally photographed artworks with verified metadata.

Key innovations include a 14-day delay for Wikimedia uploads to allow community flagging, restrictive license selection, and a unique image hosting approach. Rather than placing the download burden on the original institutions, the images are hosted by AWS Open Data, representing approximately 30 TB of high-quality image data.
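To make the curation rules a little more concrete, here is a minimal sketch in Python of how the license restriction and the 14-day Wikimedia delay could be combined into one eligibility check. This is not Spawning's actual pipeline: the `ImageRecord` fields, the license labels and the `is_eligible` function are all hypothetical names I have made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: licenses are recorded as these short labels.
ALLOWED_LICENSES = {"CC0", "PDM"}  # CC0 or Public Domain Mark only
WIKIMEDIA_DELAY = timedelta(days=14)  # time allowed for community flagging


@dataclass
class ImageRecord:
    """Hypothetical metadata record for one candidate image."""
    url: str
    source: str      # e.g. "wikimedia", "europeana", "smithsonian"
    license: str     # e.g. "CC0", "PDM", "CC-BY"
    uploaded: date   # date the image was uploaded to the source


def is_eligible(record: ImageRecord, today: date) -> bool:
    """Return True if the image passes the license and delay rules."""
    # Rule 1: only public domain material is accepted.
    if record.license not in ALLOWED_LICENSES:
        return False
    # Rule 2: recent Wikimedia uploads wait 14 days before inclusion.
    if record.source == "wikimedia" and today - record.uploaded < WIKIMEDIA_DELAY:
        return False
    return True


if __name__ == "__main__":
    today = date(2024, 11, 21)
    candidates = [
        ImageRecord("https://example.org/a.jpg", "wikimedia", "CC0", date(2024, 11, 15)),
        ImageRecord("https://example.org/b.jpg", "wikimedia", "PDM", date(2024, 10, 1)),
        ImageRecord("https://example.org/c.jpg", "europeana", "CC-BY", date(2020, 1, 1)),
    ]
    for record in candidates:
        print(record.url, is_eligible(record, today))
```

The point of the sketch is that both rules are simple metadata checks: nothing about the image itself needs to be analysed to decide whether it is a safe candidate.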

For education professionals, this approach represents a model of responsible AI development: transparent, ethical, and focused on quality over quantity. It demonstrates how careful data curation can create more reliable and trustworthy AI training resources.

Graham Attwell

Graham Attwell is Director of Pontydysgu. He is an Associate Fellow at the Institute for Employment Research, University of Warwick, and a Gastwissenschaftler at the Institut Technik und Bildung, University of Bremen. Born in 1953, he has a BA (Hons) degree in History from the University of Wales: Swansea College.
