dsesclei/generative-beatsaber

Audio encoder + LLM for rhythm game map generation

Architecture

(architecture diagram)

Preprocessing

Songs are divided into segments roughly 10 seconds long, each containing some number of beats (which varies per song with its BPM). The raw audio for each segment is then passed through a neural audio codec (Descript Audio Codec) to produce the initial audio embeddings.
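The beat-aligned segmentation above can be sketched as follows. This is a hypothetical illustration, not code from the repository: the function name, the 10-second target, and the rounding strategy are assumptions.

```python
# Hypothetical sketch: split a song into ~10 s segments that each span a
# whole number of beats, given the song's BPM. Names are illustrative.

TARGET_SECONDS = 10.0

def segment_boundaries(duration_s: float, bpm: float) -> list:
    """Return (start, end) times so each segment spans a whole number of beats."""
    beat_s = 60.0 / bpm                                # seconds per beat
    beats_per_segment = max(1, round(TARGET_SECONDS / beat_s))
    seg_s = beats_per_segment * beat_s                 # actual length, ~10 s
    boundaries = []
    t = 0.0
    while t < duration_s:
        boundaries.append((t, min(t + seg_s, duration_s)))
        t += seg_s
    return boundaries

# A 95-second song at 128 BPM: 21 beats per segment, 9.84375 s each.
segs = segment_boundaries(duration_s=95.0, bpm=128.0)
```

Because the segment length snaps to a whole beat count, the actual length drifts slightly from 10 seconds depending on BPM, and the final segment is truncated at the end of the song.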

Training

These codec embeddings are further processed by a Conformer followed by a Perceiver (as in BLIP-3), the latter of which converts each variable-length segment into a fixed number of LLM embeddings that can be used in place of tokens.
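The key property of the Perceiver step is that a fixed set of learned queries cross-attends over a variable-length input, so the output length depends only on the number of queries. A stripped-down, pure-Python illustration of that mechanism (single head, no projections; not the repository's implementation):

```python
import math

def resample(inputs, queries):
    """Cross-attention resampling: each fixed query attends over the
    variable-length inputs, so the output always has len(queries)
    vectors regardless of len(inputs)."""
    dim = len(inputs[0])
    outputs = []
    for q in queries:
        # Attention scores: dot product of the query with each input.
        scores = [sum(qi * xi for qi, xi in zip(q, x)) for x in inputs]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Output is the attention-weighted average of the inputs.
        outputs.append([sum(w * x[d] for w, x in zip(weights, inputs))
                        for d in range(dim)])
    return outputs

# 7 input vectors in, 3 query vectors -> always 3 output vectors.
out = resample(inputs=[[float(i), float(i + 1)] for i in range(7)],
               queries=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

A real Perceiver adds learned key/value/query projections, multiple heads, and stacked layers, but the shape contract shown here is the point: variable-length audio in, a fixed token budget out.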

Llama-3-8B is fine-tuned with LoRA on these songs paired with their tokenized beatmaps. The prompt is composed of the full list of audio embeddings, a header, and then the embeddings for each segment interleaved with that segment's note tokens. Phrased in code:

all_audio_embeddings + header_tokens + audio_embeddings[0] + segment_tokens[0] + audio_embeddings[1] + segment_tokens[1] ...

Or in tokens:

AUDIO_0 AUDIO_1 ... AUDIO_N <header> Difficulty: expert-plus | BPM level: 3 | Rating: 9 | walls </header> AUDIO_0 [red middle far-left down] [blue bottom left down-left] [12% blue bottom far-right right] [25% blue bottom left left] ... end AUDIO_1 start [red bottom right right] [12% red middle far-left up-left] ...
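A minimal sketch of how that interleaved sequence could be assembled. The argument names mirror the pseudocode above; elements are opaque strings here, whereas in practice they would be embedding vectors or token ids:

```python
# Hypothetical sketch of assembling the interleaved prompt sequence.

def build_prompt(all_audio_embeddings, header_tokens,
                 audio_embeddings, segment_tokens):
    """Full audio context, then header, then per-segment audio + notes."""
    prompt = list(all_audio_embeddings) + list(header_tokens)
    for seg_audio, seg_notes in zip(audio_embeddings, segment_tokens):
        prompt += list(seg_audio) + list(seg_notes)
    return prompt

prompt = build_prompt(
    all_audio_embeddings=["A0", "A1"],
    header_tokens=["<header>", "expert-plus", "</header>"],
    audio_embeddings=[["A0"], ["A1"]],
    segment_tokens=[["n0a", "n0b"], ["n1a"]],
)
```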

Notes are in the format [percent along segment, color, row, col, cut direction].
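A hypothetical parser for a single note token, based on the examples above (where notes at the start of a segment omit the leading percent). The exact token grammar in the repository may differ:

```python
# Hypothetical sketch: parse "[12% blue bottom far-right right]" into fields.

def parse_note(token: str) -> dict:
    parts = token.strip("[]").split()
    # The percent (position along the segment) is optional; notes at the
    # start of a segment omit it, as in "[red middle far-left down]".
    if parts[0].endswith("%"):
        percent = int(parts[0].rstrip("%"))
        parts = parts[1:]
    else:
        percent = 0
    color, row, col, direction = parts
    return {"percent": percent, "color": color,
            "row": row, "col": col, "direction": direction}

note = parse_note("[12% blue bottom far-right right]")
```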

Code is also included for experimenting with spectrograms in place of the codec, and with a Q-Former (BLIP-2) in place of the Perceiver.

Similar projects and papers
