Starred repositories
[PyTorch] Generative retrieval model based on RQ-VAE from "Recommender Systems with Generative Retrieval"
Official PyTorch implementation of SegFormer
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
The most powerful and modular diffusion model GUI, API and backend with a graph/nodes interface.
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
🎥 Python and OpenCV-based scene cut/transition detection program & library.
Translate a video from one language to another and add dubbing. API calls are also supported.
Collection of handy online tools for developers, with great UX.
The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
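✯ A directly accessible TV/radio logo library and related tools ✯ 🔕 Permanently free, direct access, fully open source, continuously improved channel logos, IPv4/IPv6 dual-stack access supported 🔕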
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding
🔥🔥🔥 Web-based Linux server management control panel. / A modern, open-source Linux server operations and management panel.
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"
Reference implementation for DPO (Direct Preference Optimization)
Document Artificial Intelligence
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference
[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.
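Extracts hard-coded subtitles from videos and generates srt files. No third-party API is required; text recognition runs locally. A deep-learning-based video subtitle extraction framework that includes subtitle region detection and subtitle content extraction. A GUI tool for extracting hard-coded subtitles (hardsub) from videos and generating srt files.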
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-sim…