Otter: A Multi-Modal Model with In-Context Instruction Tuning

Li, Bo; Zhang, Yuanhan; Chen, Liangyu; Wang, Jinghao; Yang, Jingkang; Liu, Ziwei

Computer Science > Computer Vision and Pattern Recognition

arXiv:2305.03726 (cs)

[Submitted on 5 May 2023]

Abstract: Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data. This is exemplified by GPT-3, which was further developed into InstructGPT and ChatGPT, models that effectively follow natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers so that more researchers can incorporate the models into their customized training and inference pipelines.

Comments:	Technical Report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2305.03726 [cs.CV]
	(or arXiv:2305.03726v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.03726

Submission history

From: Bo Li
[v1] Fri, 5 May 2023 17:59:46 UTC (4,781 KB)
