GlotCC

GlotCC is a multilingual corpus built from CommonCrawl using GlotLID for language identification and the Ungoliant pipeline.

The current version (V1) covers more than 1000 languages and is filtered using filters adopted from C4, CCNet, MADLAD-400, RedPajama-Data-v2, OSCAR, Gopher, RefinedWeb, FineWeb, Datatrove, Dolma, Pile-CC, Pretrainer's Guide, and GlotScript.
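To give a feel for what a document-level quality filter of this kind looks like, here is a minimal sketch loosely modeled on the word-count and mean-word-length heuristics described for Gopher. It is not the GlotCC implementation: the function name `passes_quality_filters` and all threshold values are assumptions chosen for illustration, and whitespace tokenization is a simplification that does not suit languages written without spaces.

```python
def passes_quality_filters(
    text: str,
    min_words: int = 50,          # assumed threshold, not GlotCC's actual value
    max_words: int = 100_000,     # assumed threshold, not GlotCC's actual value
    min_mean_len: float = 3.0,    # assumed threshold, not GlotCC's actual value
    max_mean_len: float = 10.0,   # assumed threshold, not GlotCC's actual value
) -> bool:
    """Return True if a document passes two simple heuristic checks."""
    # Naive whitespace tokenization; a real pipeline would need
    # script-aware segmentation for many languages.
    words = text.split()
    if not words:
        return False

    # Check 1: document length in words must fall inside the allowed range.
    if not (min_words <= len(words) <= max_words):
        return False

    # Check 2: mean word length must fall inside the allowed range.
    mean_len = sum(len(w) for w in words) / len(words)
    return min_mean_len <= mean_len <= max_mean_len


if __name__ == "__main__":
    document = "word " * 60
    print(passes_quality_filters(document))
```

In practice, filters like this are usually computed as per-document signals and thresholded per language, since reasonable values differ widely across scripts.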

The logo features a llama, drawn in the style of C.C. from the anime Code Geass, reading a book.

Dataset

Statistics of the dataset

Running the pipeline

Instructions for running the pipeline are available at kargaranamir/ungoliant. The README there is up to date.

Summary of Quality Signals

Acknowledgements

We appreciate the collaborators who are collectively advancing the frontier of open datasets and LLMs.
Thanks to the community and friends who helped audit this dataset and improve its quality, and to everyone contributing to the GlotCC dataset.
Our gratitude extends to the exceptional team at OSCAR for leading the development of open pipelines and datasets from CommonCrawl, and to the remarkable team at CommonCrawl.

License

GlotCC data is released under the following licensing scheme: We do not own any of the text from which this data has been extracted. The data is licensed under the terms of the CommonCrawl Terms of Use. We license the actual packaging, metadata, and annotations of this data under the Creative Commons CC0 license.
The Ungoliant license remains unchanged: Apache License 2.0.
The GlotLID license remains unchanged: Apache License 2.0.

Citation

If you find our repo and data useful for your research, please cite:

GlotCC Dataset:

@article{kargaran2024glotcc,
title        = {Glot{CC}: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages},
author       = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
journal      = {arXiv preprint},
year         = {2024},
url          = {https://github.com/cisnlp/GlotCC/}
}

GlotLID Language Identification:

@inproceedings{kargaran2023glotlid,
title        = {Glot{LID}: Language Identification for Low-Resource Languages},
author       = {Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year         = 2023,
booktitle    = {The 2023 Conference on Empirical Methods in Natural Language Processing},
url          = {https://openreview.net/forum?id=dl4e3EBz5j}
}
