This article is based on the LAION post 'LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets'. All credit for this research goes to the researchers of this project.
LAION-5B, an AI training dataset containing over five billion image-text pairs, was recently published on the Large-scale Artificial Intelligence Open Network (LAION) blog. This dataset, 14 times larger than its predecessor LAION-400M, contains images and captions collected from the Internet, making it the largest openly accessible image-text dataset. It comprises 2.32 billion images with English text, 2.26 billion images with text in other languages, and 1.27 billion images whose text could not be assigned to a language unambiguously. Image tags with alt-text values were found by processing files from the Common Crawl collection. The images were then retrieved and filtered with CLIP to retain only those whose content matched the alt-text description.
In recent years, large training datasets have fueled an increase in multimodal AI models, especially those trained on images and text. In 2021, OpenAI published Contrastive Language–Image Pre-training (CLIP), a model trained on 400 million image-text pairs that achieved remarkable performance on several multimodal benchmarks without fine-tuning. Although OpenAI open-sourced the CLIP code and model weights, it did not release the underlying dataset. As a result, LAION set out to replicate OpenAI's dataset collection; the result, LAION-400M, was released last year. LAION-400M contains 413 million image-text pairs and has been used in various studies and experiments. The release also included nearest-neighbor indices over the data, a web demo using them for semantic search, and a replication of CLIP trained on the data.
A three-step workflow was used to collect the new dataset, LAION-5B. First, a distributed cluster of worker machines scanned Common Crawl data files to capture all HTML image tags with alt-text attributes. The alt text was run through language detection; if the detection confidence was low, the language was flagged as 'unknown'. Next, the raw images were downloaded from the captured URLs and passed, together with their alt text, to a CLIP model to generate embeddings for both. Each embedding pair was given a similarity score, and low-similarity pairs were discarded. Finally, duplicates were removed, along with examples whose text was shorter than five characters or whose image resolution was too high. This dataset opens up large-scale multilingual training and study of language-vision models to the general public, previously possible only for those with large proprietary datasets. The LAION research team released the dataset with the aim of democratizing multimodal AI research. It can be downloaded from the Hugging Face website.
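The workflow above can be sketched in a few dozen lines. The sketch below is illustrative only: it extracts (src, alt) pairs from HTML image tags with Python's standard-library parser, stubs out CLIP with a toy bag-of-characters embedder (the real pipeline embeds the downloaded image bytes with CLIP's image tower), and applies the caption-length and similarity filters. All names, the threshold value, and the toy embedder are assumptions, not LAION's actual code.

```python
import math
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt", "").strip():
                self.pairs.append((a["src"], a["alt"].strip()))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


def toy_embed(text):
    # Stand-in for a real CLIP encoder: a bag-of-characters vector,
    # used purely to make the filter runnable end to end.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def filter_pairs(html, threshold=0.3, min_len=5):
    """Keep pairs with long-enough captions and similar embeddings."""
    collector = AltTextCollector()
    collector.feed(html)
    kept = []
    for src, alt in collector.pairs:
        if len(alt) < min_len:  # drop very short captions
            continue
        # The real pipeline compares CLIP image and text embeddings;
        # here both sides go through the toy text embedder.
        sim = cosine(toy_embed(src), toy_embed(alt))
        if sim >= threshold:
            kept.append((src, alt, round(sim, 2)))
    return kept


page = '<img src="red-fox.jpg" alt="a red fox"><img src="x.png" alt="ok">'
print(filter_pairs(page))  # the short "ok" caption is filtered out
```

The same shape scales out naturally: each worker in the distributed cluster runs this extract-embed-filter loop over its shard of Common Crawl files, and the surviving pairs are deduplicated downstream.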
LAION has released the following packages under the LAION-5B project:
- laion2B-en: 2.32 billion pairs with English text
- laion2B-multi: 2.26 billion pairs with text in more than 100 other languages
- laion1B-nolang: 1.27 billion pairs whose text language could not be clearly detected
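As a quick sanity check, the three subset sizes quoted in the release add up to the headline figure of over five billion pairs:

```python
# Pair counts in billions, as quoted in the LAION-5B release.
subsets = {
    "laion2B-en": 2.32,     # English captions
    "laion2B-multi": 2.26,  # 100+ other languages
    "laion1B-nolang": 1.27, # language not clearly detectable
}

total = sum(subsets.values())
print(f"{total:.2f}B pairs")  # prints "5.85B pairs"
```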