i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Yang, Ziyi; Khademi, Mahmoud; Xu, Yichong; Pryzant, Reid; Fang, Yuwei; Zhu, Chenguang; Chen, Dongdong; Qian, Yao; Gao, Mei; Chen, Yi-Ling; Gmyr, Robert; Kanda, Naoyuki; Codella, Noel; Xiao, Bin; Shi, Yu; Yuan, Lu; Yoshioka, Takuya; Zeng, Michael; Huang, Xuedong

arXiv:2305.12311 (cs)

[Submitted on 21 May 2023]

Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence. However, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations by an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
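The data flow described in the abstract (per-modality encoders, a fusion step into a shared space, then autoregressive decoding) can be illustrated with a toy sketch. This is not the authors' implementation: the random linear projections stand in for pretrained encoders, mean pooling stands in for the modality-fusing encoder, and every name here (`ENCODERS`, `fuse`, `generate`, `SHARED_DIM`, the four-token vocabulary, the feature sizes) is a hypothetical chosen for illustration. It only shows how any subset of modalities can feed one greedy decoding loop.

```python
import random

SHARED_DIM = 4  # assumed size of the shared representational space


def make_projection(in_dim, out_dim, seed):
    """Random linear map; a stand-in for a pretrained encoder (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical per-modality feature sizes: vision=6, language=5, speech=3.
ENCODERS = {
    "vision": make_projection(6, SHARED_DIM, seed=0),
    "language": make_projection(5, SHARED_DIM, seed=1),
    "speech": make_projection(3, SHARED_DIM, seed=2),
}

VOCAB = ["<eos>", "a", "dog", "barks"]  # toy vocabulary (assumption)
# Tied weights: row i is both the embedding and the output vector of token i.
OUT = make_projection(SHARED_DIM, len(VOCAB), seed=3)


def fuse(inputs):
    """Encode whichever modalities are present and join them into one sequence."""
    fused = []
    for name in ("vision", "language", "speech"):
        for frame in inputs.get(name, []):
            fused.append(project(ENCODERS[name], frame))
    if not fused:
        raise ValueError("at least one modality is required")
    return fused


def mean_pool(vectors):
    """Average a list of equal-length vectors (stand-in for attention)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def generate(inputs, max_len=5):
    """Greedy autoregressive decoding conditioned on the fused sequence."""
    memory = fuse(inputs)
    generated = []
    for _ in range(max_len):
        # Context mixes the fused inputs with the tokens generated so far.
        context = mean_pool(memory + [OUT[i] for i in generated])
        scores = project(OUT, context)
        next_id = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[next_id] == "<eos>":
            break
        generated.append(next_id)
    return [VOCAB[i] for i in generated]


if __name__ == "__main__":
    # Vision plus speech, with no language input.
    print(generate({"vision": [[0.1] * 6], "speech": [[0.2] * 3, [0.3] * 3]}))
```

Because the weights are random, the generated tokens carry no meaning; the point is only that `generate` accepts one, two, or all three modalities through the same interface.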

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.12311 [cs.CL]
	(or arXiv:2305.12311v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.12311

Submission history

From: Ziyi Yang
[v1] Sun, 21 May 2023 01:25:44 UTC (274 KB)
