# GlotCC

**GlotCC is a multilingual corpus built by the GlotLID language identification and Ungoliant pipeline from CommonCrawl.** Current version ([V1](https://huggingface.co/datasets/cis-lmu/GlotCC-V1)) supports more than **1000 languages** and is filtered based on adopted filters from C4, CCNet, MADLAD-400, RedPajama-Data-v2, OSCAR, Gopher, RefinedWeb, FineWeb, Datatrove, Dolma, Pile-CC, Pretrainer's Guide, and GlotScript.

™ The logo features a llama with the style of C.C. from the Code Geass anime reading a book. ## Dataset ### Statistics of the dataset ## Running the pipeline We provide guidance on how to run the pipeline at [kargaranamir/ungoliant](https://github.com/kargaranamir/ungoliant). The readme is up-to-date. ## Summary of Quality Signals ## Acknowledgements - We appreciate the collaborators who are collectively advancing the frontier of open datasets and LLM models. - Thanks to the community and friends who enable the auditing of this dataset with higher quality. Also, to everyone contributing to the GlotCC dataset. - Our gratitude extends to the exceptional team at OSCAR for leading the development of open piplines and datasets from CommonCrawl, and to the remarkable team at CommonCrawl. ## License - GlotCC data is released under the following licensing scheme: We do not own any of the text from which this data has been extracted. The data is licensed under the terms of the CommonCrawl [Terms of Use](https://commoncrawl.org/terms-of-use). We license the actual packaging, metadata, and annotations of this data under the Creative Commons [CC0 license](https://github.com/cisnlp/GlotCC/blob/main/LICENSE). - Ungoliant license remains unchanged as the [Apache License 2.0](https://github.com/kargaranamir/ungoliant/blob/main/LICENSE). - GlotLID license remains unchanged as the [Apache License 2.0](https://github.com/cisnlp/GlotLID/blob/main/LICENSE). ## Citation If you find our repo and data useful for your research, please cite: GlotCC Dataset: ```js @article{kargaran2024glotcc, title = {Glot{CC}: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages}, author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich}, journal = {arXiv preprint}, year = {2024}, url = {https://github.com/cisnlp/GlotCC/} } ``` GlotLID Language Identification: ```js @inproceedings{kargaran2023glotlid, title = {Glot{LID}: Language Identification for Low-Resource Languages}, author = {Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich}, year = 2023, booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing}, url = {https://openreview.net/forum?id=dl4e3EBz5j} } ```