My implementation of KOSMOS-2.5 from Microsoft Research, based on the paper "KOSMOS-2.5: A Multimodal Literate Model".
- Lucidrains
- Agorians
pip install kosmos2-torch
import torch
from kosmos.model import Kosmos
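# Usage

# Dummy image batch: (batch, channels, height, width)
img = torch.randn(1, 3, 256, 256)

# Dummy token ids in [0, 20000): (batch, sequence length)
text = torch.randint(0, 20000, (1, 1024))

# Build the model with its default settings and run a forward pass
model = Kosmos()
output = model(img, text)
print(output)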
The table below summarizes the datasets used in the paper "KOSMOS-2.5: A Multimodal Literate Model", with metadata and source links:
Dataset | Modality | # Samples | Domain | Source |
---|---|---|---|---|
IIT-CDIP | Text + Layout | 27.6M pages | Scanned documents | Link |
arXiv papers | Text + Layout | 20.9M pages | Research papers | Link |
PowerPoint slides | Text + Layout | 6.2M pages | Presentation slides | Web crawl |
General PDF | Text + Layout | 155.2M pages | Diverse PDF files | Web crawl |
Web screenshots | Text + Layout | 100M pages | Webpage screenshots | Link |
README | Text + Markdown | 2.9M files | GitHub README files | Link |
DOCX | Text + Markdown | 1.1M pages | Word documents | Web crawl |
LaTeX | Text + Markdown | 3.7M pages | Research papers | Link |
HTML | Text + Markdown | 6.3M pages | Webpages | Link |
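
To get a feel for how the data splits between the two modalities, the counts in the table can be totalled per group. The snippet below is a minimal sketch that only re-adds the numbers listed above; the dictionary names and the `summarize` helper are made up for illustration and are not part of this package. README is counted in files rather than pages, so the Markdown total mixes units.

```python
# Sample counts in millions, copied from the table above.
# Note: README is counted in files; every other row is in pages.
LAYOUT = {
    "IIT-CDIP": 27.6,
    "arXiv papers": 20.9,
    "PowerPoint slides": 6.2,
    "General PDF": 155.2,
    "Web screenshots": 100.0,
}
MARKDOWN = {
    "README": 2.9,
    "DOCX": 1.1,
    "LaTeX": 3.7,
    "HTML": 6.3,
}


def summarize(group):
    """Return the group total (in millions) and each dataset's share of it."""
    total = sum(group.values())
    shares = {name: count / total for name, count in group.items()}
    return total, shares


layout_total, layout_shares = summarize(LAYOUT)
markdown_total, markdown_shares = summarize(MARKDOWN)

print(f"Text + Layout:   {layout_total:.1f}M")    # 309.9M
print(f"Text + Markdown: {markdown_total:.1f}M")  # 14.0M

# Per-dataset share within the Text + Layout group
for name, share in layout_shares.items():
    print(f"  {name}: {share:.1%}")
```

By these numbers, General PDF alone makes up roughly half of the Text + Layout data.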
MIT
@misc{2309.11419,
Author = {Tengchao Lv and Yupan Huang and Jingye Chen and Lei Cui and Shuming Ma and Yaoyao Chang and Shaohan Huang and Wenhui Wang and Li Dong and Weiyao Luo and Shaoxiang Wu and Guoxin Wang and Cha Zhang and Furu Wei},
Title = {Kosmos-2.5: A Multimodal Literate Model},
Year = {2023},
Eprint = {arXiv:2309.11419},
}