
Listen, Think, and Understand

Illustration of CAV-MAE.


LTU-AS (Second Generation):

LTU-AS was accepted at ASRU 2023. See you in Taipei!

[Paper] [HuggingFace Space] [ASRU Peer Review] [Compare LTU-1 and LTU-AS]

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)


LTU (First Generation):

[Paper] [HuggingFace Space]

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)


Abstract:

The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans can not only classify sounds into coarse-grained categories, but also listen to the details of a sound, explain the reason for a prediction, infer what the sound implies, and understand the scene and what action needs to be taken. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability, but they lack audio perception capabilities. Therefore, we ask the question: can we build an AI model that has both audio perception and reasoning ability?

In this paper, we propose a novel audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and used an autoregressive training framework and a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. Moreover, it exhibits remarkable reasoning and comprehension abilities in the audio domain. To the best of our knowledge, LTU is the first audio-enabled large language model that bridges audio perception with advanced reasoning.
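
To make the data format concrete, here is a minimal Python sketch of an OpenAQA-style (audio, question, answer) tuple and the perception-to-understanding curriculum idea (closed-ended perception questions before open-ended reasoning ones). The field names (`audio_path`, `question`, `answer`, `closed_ended`) and the two-stage ordering are illustrative assumptions, not the released dataset format or training code.

```python
from dataclasses import dataclass

@dataclass
class AudioQA:
    """One (audio, question, answer) tuple, OpenAQA-style (hypothetical field names)."""
    audio_path: str    # path to the audio clip
    question: str      # e.g., "What sound events are present?"
    answer: str        # free-text answer the model learns to generate autoregressively
    closed_ended: bool # True for classification-style perception questions

def curriculum_order(samples: list[AudioQA]) -> list[AudioQA]:
    """Assumed simplification of the perception-to-understanding curriculum:
    present closed-ended perception questions before open-ended reasoning ones."""
    closed = [s for s in samples if s.closed_ended]
    open_ended = [s for s in samples if not s.closed_ended]
    return closed + open_ended

# Toy usage: the closed-ended question is scheduled first.
data = [
    AudioQA("dog_bark.wav", "Why might the dog be barking?", "Possibly alerting its owner.", False),
    AudioQA("dog_bark.wav", "What sound is this?", "A dog barking.", True),
]
for s in curriculum_order(data):
    print(s.question, "->", s.answer)
```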

What about the code? We plan to release the code, but our institute needs to review the software release first; we are preparing for that review.
