Code used for the creation of OBELISC, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

linhduongtuan/OBELISC

 
 

OBELISC

OBELISC is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

Dataset page: https://huggingface.co/datasets/HuggingFaceM4/OBELISC

Visualization of OBELISC web documents: https://huggingface.co/spaces/HuggingFaceM4/obelisc_visualization

Paper: https://arxiv.org/abs/2306.16527
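To make "interleaved image-text document" concrete, here is a minimal sketch of one way such a document could be represented and walked in reading order. It assumes a layout of two parallel lists, `texts` and `images`, where each position holds either a text span or an image URL and the other slot is `None`. These field names, the helper `interleave`, and the toy document are illustrative assumptions, not taken from this README; check the dataset page for the actual schema.

```python
from typing import List, Optional, Tuple


def interleave(
    texts: List[Optional[str]], images: List[Optional[str]]
) -> List[Tuple[str, str]]:
    """Merge two parallel lists into one ordered list of (kind, value) pairs.

    Hypothetical helper: assumes that at each position exactly one of
    `texts[i]` and `images[i]` is set, and the other is None.
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")
    sequence = []
    for text, image in zip(texts, images):
        # Reject positions where both slots are filled or both are empty.
        if (text is None) == (image is None):
            raise ValueError("each position must hold either a text or an image")
        if text is not None:
            sequence.append(("text", text))
        else:
            sequence.append(("image", image))
    return sequence


# Toy document; the URL is a placeholder, not a real dataset entry.
doc = {
    "texts": ["A paragraph before the picture.", None, "A paragraph after it."],
    "images": [None, "https://example.com/cat.jpg", None],
}
sequence = interleave(doc["texts"], doc["images"])
for kind, value in sequence:
    print(f"{kind}: {value}")
```

Keeping text and images in one ordered sequence preserves where each image appeared relative to the surrounding paragraphs, which is the property that distinguishes interleaved documents from isolated image-caption pairs.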

Goal and organization of obelisc

The folder obelisc is aimed to:

The core techniques are implemented in the sub-folder processors and invoked from callers. The configs used for extracting and filtering the documents are in configs.
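As a rough illustration of this kind of split between processors, callers and configs, the sketch below separates a filtering technique from the code that invokes it and from its parameters. Every name here (`FilterConfig`, `TextLengthFilter`, `run_filter`) and the word-count thresholds are hypothetical and chosen for the example only; they are not the classes or settings used in this repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FilterConfig:
    """Config: parameters only, no logic (hypothetical thresholds)."""

    min_words: int = 3
    max_words: int = 1000


class TextLengthFilter:
    """Processor: implements the technique and performs no I/O."""

    def __init__(self, config: FilterConfig) -> None:
        self.config = config

    def keep(self, text: str) -> bool:
        # Count whitespace-separated words and compare against the bounds.
        num_words = len(text.split())
        return self.config.min_words <= num_words <= self.config.max_words


def run_filter(texts: List[str], config: FilterConfig) -> List[str]:
    """Caller: wires a config to a processor and applies it to the data."""
    processor = TextLengthFilter(config)
    return [text for text in texts if processor.keep(text)]


kept = run_filter(
    ["one two", "one two three four", ""],
    FilterConfig(min_words=3, max_words=10),
)
print(kept)
```

With this arrangement, a threshold can be changed in the config without touching the processor, and the same processor can be reused by several callers.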

We refer to our paper for details about these steps.

In visualization, there are different Streamlit visualizations:

Goal and organization of build_obelisc

The folder build_obelisc contains all the scripts that were used to create OBELISC, numbered in chronological order.

These scripts often call methods defined in processors, but they also define other useful methods of their own.

Citation

If you use this dataset or this code, please cite:

@inproceedings{laurencon2023obe,
title={OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Lauren{\c{c}}on and Lucile Saulnier and L{\'e}o Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023}
}
