chenbohua3

Organizations

@AlibabaPAI


Speeds up long-context LLM inference by computing attention approximately with dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python 660 22 Updated Aug 13, 2024

CUDA Templates for Linear Algebra Subroutines

C++ 5,216 877 Updated Aug 20, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

996 21 Updated Jul 31, 2024

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frame…

Python 393 23 Updated Aug 5, 2024

A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers)

9,189 697 Updated May 31, 2024

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Python 378 16 Updated Aug 13, 2024

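A repository for the Chinese-speaking community that analyzes the architecture and implementation of smart contract applications on the market.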

Solidity 1,479 327 Updated Aug 3, 2024

Leaked prompts of GPTs

28,097 3,764 Updated Jul 9, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,074 63 Updated Feb 14, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 7,983 871 Updated Aug 20, 2024

A state-of-the-art open visual language model | multimodal pre-trained model

Python 5,797 397 Updated May 29, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 6,474 360 Updated Jul 11, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,256 169 Updated Jul 16, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 25,296 3,656 Updated Aug 23, 2024

A curated list for Efficient Large Language Models

Python 1,058 75 Updated Aug 22, 2024

Development repository for the Triton language and compiler

C++ 12,308 1,487 Updated Aug 23, 2024

Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment

Python 1,012 41 Updated May 31, 2024

Code for the paper "Evaluating Large Language Models Trained on Code"

Python 2,260 325 Updated Feb 5, 2024

A Pythonic framework to simplify AI service building

Python 2,615 167 Updated Aug 21, 2024

Inference Llama 2 in one file of pure 🔥

Mojo 2,088 138 Updated May 21, 2024

Lepton Examples

Jupyter Notebook 140 18 Updated Jul 25, 2024

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

Python 5,926 510 Updated Aug 22, 2024

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Python 14,563 2,560 Updated Aug 20, 2024

A framework for few-shot evaluation of language models.

Python 6,210 1,645 Updated Aug 22, 2024

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

Python 2,794 583 Updated Jul 19, 2024

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

Python 2,095 205 Updated Aug 23, 2024

AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,512 359 Updated Aug 14, 2024

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.

Python 990 123 Updated Apr 17, 2024

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.

C++ 786 159 Updated Aug 20, 2024