Stars
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
A throughput-oriented, high-performance serving framework for LLMs
MSCCL++: A GPU-driven communication stack for scalable AI applications
A data annotation toolbox that supports image, audio, and video data.
A Comprehensive Toolkit for High-Quality PDF Content Extraction
A one-stop, open-source, high-quality data extraction tool; supports extraction from PDFs, web pages, and e-books in multiple formats.
[NeurIPS'24 Spotlight] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
✨✨Latest Advances on Multimodal Large Language Models
[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
FlashInfer: Kernel Library for LLM Serving
A curated list of open-source Chinese large language models, focusing on smaller models that can be privately deployed at low training cost, covering base models, vertical-domain fine-tuning and applications, datasets, and tutorials.
SGLang is a fast serving framework for large language models and vision language models.
LLM Group Chat Framework: chat with multiple LLMs at the same time.
HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Official inference library for Mistral models
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks