Stars
Heterogeneous AI Computing Virtualization Middleware
eBPF distributed networking observability tool for Kubernetes
k8spacket - collects TCP traffic and TLS connection metadata in the Kubernetes cluster using eBPF and visualizes it in Grafana
An open-source script tool that adds CPU, NVMe, and SSD temperature and load information to the Proxmox VE web UI.
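For flavor, here is a minimal BCC-based sketch of the eBPF approach such observability tools build on: it counts tcp_connect() calls per PID. This is a generic illustration (assumes a Linux host with BCC installed), not k8spacket's or Retina's actual code.

```python
# Minimal BCC sketch: count tcp_connect() calls per process via a kprobe.
# Generic illustration of eBPF-based TCP observability; not k8spacket's code.
import time
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);  // pid -> connection count

int trace_tcp_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&pid, &zero);
    if (val) (*val)++;
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_connect", fn_name="trace_tcp_connect")

print("Tracing tcp_connect()... Ctrl-C to stop.")
try:
    while True:
        time.sleep(5)
        for pid, count in b["counts"].items():
            print(f"pid={pid.value} connects={count.value}")
except KeyboardInterrupt:
    pass
```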
Tools for merging pretrained large language models.
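The simplest of the merge methods such tools implement is a weighted linear average of parameter tensors; a toy sketch under that assumption (checkpoint paths are hypothetical, and real tools also handle SLERP, TIES/DARE, tokenizer alignment, and sharded checkpoints):

```python
# Toy weighted linear merge of two checkpoints with identical architectures.
import torch

def linear_merge(state_dicts, weights):
    """Merge parameter tensors key-by-key as a weighted average."""
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return merged

# Hypothetical paths, for illustration only.
sd_a = torch.load("model_a.pt", map_location="cpu")
sd_b = torch.load("model_b.pt", map_location="cpu")
torch.save(linear_merge([sd_a, sd_b], [0.6, 0.4]), "merged.pt")
```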
SGLang is a fast serving framework for large language models and vision language models.
[NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
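The core idea behind dynamic sparse prefill, sketched in NumPy (block size, top-k, and the mean-pooled importance proxy are illustrative choices; causal masking is omitted, and this is not the project's kernels):

```python
# Toy block-sparse attention: per query block, attend only to the top-k key
# blocks ranked by a cheap approximation (mean-pooled block dot products).
import numpy as np

def block_sparse_attention(Q, K, V, block=64, topk=4):
    n, d = Q.shape                                  # assumes n % block == 0
    nb = n // block
    Qb = Q.reshape(nb, block, d)
    Kb = K.reshape(nb, block, d)
    # Cheap importance estimate: dot products of block mean vectors.
    scores = Qb.mean(1) @ Kb.mean(1).T              # (nb, nb)
    keep = np.argsort(-scores, axis=1)[:, :topk]    # top-k key blocks per query block

    out = np.zeros_like(Q)
    for i in range(nb):
        ks = np.concatenate([K[j*block:(j+1)*block] for j in keep[i]])
        vs = np.concatenate([V[j*block:(j+1)*block] for j in keep[i]])
        att = Qb[i] @ ks.T / np.sqrt(d)
        att = np.exp(att - att.max(1, keepdims=True))
        att /= att.sum(1, keepdims=True)
        out[i*block:(i+1)*block] = att @ vs
    return out
```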
Disaggregated serving system for Large Language Models (LLMs).
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
PbootCMS is a permanently open-source, free PHP enterprise website CMS built on an all-new kernel: an efficient, lean, and powerful PHP CMS that is free for commercial use and meets the needs of all kinds of enterprise website projects. Its template tags are simple enough that anyone who knows HTML can quickly build an enterprise site, and the project officially provides a large number of free website templates, aiming to give developers and enterprises the best website-building solution.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at batch sizes up to the medium range of 16-32 tokens.
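For reference, the math a W4A16 kernel like this implements, sketched in NumPy with an assumed group size of 128 (real kernels fuse the dequantization into the GEMM; this only shows the arithmetic):

```python
# Reference math for FP16xINT4 (W4A16) inference: weights stored as 4-bit
# integers with per-group FP16 scales, dequantized and multiplied by FP16
# activations. Fused kernels do this on-chip; here it is plain NumPy.
import numpy as np

def quantize_int4(W, group=128):
    """Symmetric 4-bit groupwise quantization along the input dimension."""
    k, n = W.shape
    Wg = W.reshape(k // group, group, n)
    scales = np.abs(Wg).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    q = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def w4a16_matmul(x, q, scales):
    W_deq = (q.astype(np.float16) * scales).reshape(-1, q.shape[-1])
    return x.astype(np.float16) @ W_deq

x = np.random.randn(16, 512).astype(np.float16)   # small batch, as in the claim
W = np.random.randn(512, 1024).astype(np.float16)
q, s = quantize_int4(W)
y = w4a16_matmul(x, q, s)                          # (16, 1024)
```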
Distributed ML Training and Fine-Tuning on Kubernetes
A clean, elegant, beautiful and powerful admin template, based on Vue3, Vite5, TypeScript, Pinia, NaiveUI and UnoCSS.
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
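The recipe here is a KV-cache eviction policy: permanently keep the first few "sink" tokens plus a sliding window of recent ones. A minimal sketch of that policy (cache sizes are illustrative):

```python
# Minimal sketch of the attention-sink KV cache policy: keep the first
# `n_sink` entries forever plus the most recent `window` ones.
from collections import deque

class SinkCache:
    def __init__(self, n_sink=4, window=1024):
        self.n_sink = n_sink
        self.sinks = []                      # first tokens' KV, never evicted
        self.recent = deque(maxlen=window)   # rolling window, auto-evicts

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def view(self):
        # KV entries the model attends over at the current step.
        return self.sinks + list(self.recent)
```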
The OpenAIOS vGPU device plugin for Kubernetes, originating from the OpenAIOS project, virtualizes GPU device memory so that applications can access a larger memory space than the physical capacity.
[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
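What makes multi-LoRA serving scale: all requests share the base weights, and each request adds only its own low-rank delta x·A_i·B_i. A toy PyTorch sketch (adapter names are hypothetical; real servers batch heterogeneous adapters with custom kernels rather than a Python loop):

```python
# Toy multi-LoRA linear layer: one shared base weight, per-request adapters.
import torch

class MultiLoRALinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.adapters = {}  # adapter_id -> (A: d_in x r, B: r x d_out)

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def forward(self, x, adapter_ids):
        y = self.base(x)  # shared base compute for the whole batch
        deltas = torch.stack([
            x[i] @ self.adapters[name][0] @ self.adapters[name][1]
            for i, name in enumerate(adapter_ids)  # per-request low-rank delta
        ])
        return y + deltas

layer = MultiLoRALinear(64, 64)
for name in ("adapter-a", "adapter-b"):            # hypothetical adapter names
    layer.add_adapter(name, 0.01 * torch.randn(64, 8), 0.01 * torch.randn(8, 64))
x = torch.randn(2, 64)
y = layer(x, ["adapter-a", "adapter-b"])           # one adapter per request
print(y.shape)                                     # torch.Size([2, 64])
```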
A blazing fast inference solution for text embeddings models
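A usage sketch against a locally running server, assuming the default /embed route and port shown in the project's local-deployment docs:

```python
# Query a locally running text-embeddings-inference server.
# Endpoint and port are assumptions based on a default local deployment.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": ["embed this sentence", "and this one"]},
    timeout=30,
)
embeddings = resp.json()   # list of float vectors, one per input
print(len(embeddings), len(embeddings[0]))
```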
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
The Triton TensorRT-LLM Backend
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
[EMNLP'23, ACL'24] Speeds up LLM inference and sharpens the model's perception of key information by compressing the prompt and KV cache, achieving up to 20x compression with minimal performance loss.
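A toy sketch of the perplexity-style idea behind such compression: score tokens with a small causal LM and drop the most predictable ones. This is an illustration of the concept under those assumptions, not the project's actual algorithm:

```python
# Toy perplexity-based prompt compression: use a small causal LM to score
# each token's surprisal and keep only the most informative tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt, keep_ratio=0.5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given tokens < t (the first token is always kept).
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(surprisal.numel() * keep_ratio))
    keep = torch.topk(surprisal, k).indices.sort().values + 1
    kept_ids = torch.cat([ids[0, :1], ids[0, keep]])
    return tok.decode(kept_ids)

print(compress("The quick brown fox jumps over the lazy dog near the river bank."))
```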
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
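For context, the generic building block of quantization-aware training is fake quantization with a straight-through estimator (STE), sketched below in PyTorch; this shows the QAT mechanism in general, not LLM-QAT's data-free training recipe:

```python
# Generic fake quantization with a straight-through estimator (STE):
# the forward pass uses quantized weights, while gradients flow through
# to the full-precision weights as if no rounding had happened.
import torch

def fake_quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # STE: forward value is w_q, gradient is taken with respect to w.
    return w + (w_q - w).detach()

w = torch.randn(64, 64, requires_grad=True)
y = (fake_quantize(w) @ torch.randn(64, 8)).sum()
y.backward()
print(w.grad.shape)  # gradients reach the full-precision weights
```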
The official repo of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.