GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages -- under review
crawler
multlingual
corpus-linguistics
glot
language-identification
commoncrawl
common-crawl
glotcc
multilingual-dataset
Updated Jun 12, 2024 - Jupyter Notebook