lvhan028 (China)

Material for gpu-mode lectures

Jupyter Notebook · 2,988 stars · 300 forks · Updated Nov 9, 2024
C++ · 32 stars · 1 fork · Updated Nov 7, 2024

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python · 613 stars · 47 forks · Updated Sep 4, 2024
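An FP16xINT4 kernel fuses 4-bit weight dequantization into the FP16 matmul on the GPU. As a rough illustration only (not the actual kernel or its API), here is a minimal NumPy sketch of symmetric per-tensor 4-bit quantization and the dequantize-then-multiply it accelerates; the function names and quantization granularity are my own simplification:

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor quantization: map floats onto integers in [-8, 7].
    scale = float(np.max(np.abs(w))) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover approximate FP16 weights; per-element error is at most ~scale / 2.
    return q.astype(np.float16) * np.float16(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float16)  # stand-in "weights"
x = rng.standard_normal(4).astype(np.float16)       # stand-in "activations"
q, s = quantize_int4(w)
y = dequantize_int4(q, s) @ x                       # FP16 matmul on dequantized INT4 weights
```

A real kernel keeps the weights packed in INT4 in memory (a 4x reduction versus FP16) and dequantizes tiles inside the matmul, which is where the bandwidth-bound speedup comes from.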

A throughput-oriented high-performance serving framework for LLMs

Cuda · 630 stars · 25 forks · Updated Sep 21, 2024

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ · 248 stars · 39 forks · Updated Nov 12, 2024

A data annotation toolbox that supports image, audio, and video data.

Python · 848 stars · 78 forks · Updated Nov 11, 2024

A Comprehensive Toolkit for High-Quality PDF Content Extraction

Python · 5,494 stars · 367 forks · Updated Oct 24, 2024

A one-stop, open-source, high-quality data extraction tool; supports extraction from PDFs, webpages, and e-books in multiple formats.

Python · 13,900 stars · 1,044 forks · Updated Nov 12, 2024

[NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while…

Python · 779 stars · 36 forks · Updated Nov 11, 2024
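Dynamic sparse attention estimates which score entries matter for each query and skips the rest. The toy NumPy sketch below is my own simplification (per-query top-k pruning of a dense score matrix, not the pattern-based estimation the actual method uses), intended only to show the general shape of score pruning:

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=4):
    # Compute the full score matrix, then keep only the `keep` largest scores
    # per query row; real dynamic-sparse kernels avoid materializing the dense
    # matrix at all, which is where the latency savings come from.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -keep][:, None]  # per-row k-th largest score
    scores = np.where(scores >= kth, scores, -np.inf)  # mask everything smaller
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept scores
    return w @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
out = topk_sparse_attention(q, k, v, keep=4)
```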

Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.

Python · 5,664 stars · 514 forks · Updated Oct 18, 2024

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.

Python · 10,032 stars · 636 forks · Updated Nov 11, 2024

Self-host LLMs with LMDeploy and BentoML

Python · 16 stars · 2 forks · Updated Jul 30, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python · 4,618 stars · 425 forks · Updated Nov 12, 2024

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Python · 2,516 stars · 154 forks · Updated Oct 10, 2024

✨✨Latest Advances on Multimodal Large Language Models

12,591 stars · 804 forks · Updated Nov 10, 2024

[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

324 stars · 10 forks · Updated Mar 22, 2024

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.

Go · 97,414 stars · 7,753 forks · Updated Nov 12, 2024

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python · 20,185 stars · 2,228 forks · Updated Aug 12, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda · 1,418 stars · 131 forks · Updated Nov 11, 2024

A curated collection of open-source Chinese large language models, focusing on smaller models that can be privately deployed and trained at low cost, covering base models, domain-specific fine-tunes and applications, datasets, and tutorials.

15,922 stars · 1,470 forks · Updated Sep 19, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python · 5,993 stars · 496 forks · Updated Nov 12, 2024

LLM Group Chat Framework: chat with multiple LLMs at the same time.

TypeScript · 251 stars · 23 forks · Updated Apr 10, 2024

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

Python · 1,511 stars · 127 forks · Updated Oct 29, 2024

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

C++ · 7,957 stars · 412 forks · Updated Sep 6, 2024

Mamba SSM architecture

Python · 13,166 stars · 1,122 forks · Updated Nov 5, 2024
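Mamba is built on selective state-space models. As a hedged illustration only (fixed A, B, C matrices rather than Mamba's input-dependent "selective" ones, and a naive loop rather than the hardware-aware parallel scan), the core linear recurrence it is built on looks like:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    # Discrete state-space recurrence: x_t = A x_{t-1} + B u_t,  y_t = C x_t.
    # Mamba makes A, B, C functions of the input ("selective"); here they are fixed.
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

# 1-D example: an input impulse decays geometrically through the hidden state.
A = np.array([[0.5]])
B = np.array([1.0])
C = np.array([1.0])
y = ssm_scan([1.0, 0.0, 0.0], A, B, C)  # -> [1.0, 0.5, 0.25]
```

Because the recurrence is linear in the state, it can be evaluated with an associative scan in O(log T) parallel depth, which is what makes SSMs competitive with attention at long sequence lengths.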

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python · 1,893 stars · 175 forks · Updated Nov 8, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.

C++ · 8,626 stars · 984 forks · Updated Nov 12, 2024

Official inference library for Mistral models

Jupyter Notebook · 9,709 stars · 861 forks · Updated Nov 12, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 6,656 stars · 364 forks · Updated Jul 11, 2024
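The attention-sink idea keeps the KV entries of the first few tokens permanently while the rest of the cache acts as a sliding window. A minimal sketch of that eviction policy, using a plain list as a stand-in for a real per-token KV cache (the function and parameter names are mine, not the library's API):

```python
def evict(cache, n_sink=4, window=8):
    # Keep the first n_sink entries (the attention "sinks") plus the most
    # recent `window` entries; everything in between is evicted.
    if len(cache) <= n_sink + window:
        return list(cache)
    return cache[:n_sink] + cache[-window:]

kv = list(range(20))   # stand-in for 20 per-token KV entries
kept = evict(kv)       # -> [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

This bounds cache memory at n_sink + window entries regardless of stream length, which is what enables streaming generation over effectively unbounded inputs.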