Stars
Build real-time multimodal AI applications 🤖🎙️📹
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
🐮📢 The first AI voice assistant that interrupts *you*
Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
LSLM implements full duplex modeling in interactive speech language models, based on research by Ma et al. (2024). This project advances human-computer interaction through real-time spoken dialogue…
PyTorch code and models for V-JEPA self-supervised learning from video.
LivePortrait is an advanced deep learning-based system for animating portrait images. It uses a two-stage training process to create realistic and controllable animations from static portrait images.
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker
[NeurIPS 2024] Official PyTorch implementation code for realizing the technical part of Mamba-based traversal of rationale (Meteor) to improve performance on numerous vision-language tasks…
secutron / Emote-hack
Forked from johndpope/Emote-hack. Using ChatGPT (now Claude 3) to reverse-engineer code from the EMOTE white paper. WIP
FFmpeg libav tutorial - learn how media works from basic to transmuxing, transcoding and more. Translations: 🇺🇸 🇨🇳 🇰🇷 🇪🇸 🇻🇳 🇧🇷
[CVPR] MARLIN: Masked Autoencoder for facial video Representation LearnINg
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
NITEC: Versatile Hand-Annotated Eye Contact Dataset for Ego-Vision Interaction (WACV24)
🔍 Explore Egocentric Vision: research, data, challenges, real-world apps. Stay updated & contribute to our dynamic repository! Work-in-progress; join us!
EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset
A Lightweight Face Recognition and Facial Attribute Analysis (Age, Gender, Emotion and Race) Library for Python
Low-latency AI companion voice chat in 60 lines of code using faster_whisper and ElevenLabs input streaming
A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.
VITS2 backbone with multilingual BERT (Korean supported)
🎤📄 An innovative tool that transforms audio or video files into text transcripts and generates concise meeting minutes. Stay organized and efficient in your meetings, and get ready for Phase 2 wher…