Stars
Heterogeneous AI Computing Virtualization Middleware
eBPF distributed networking observability tool for Kubernetes
k8spacket - collects TCP traffic and TLS connection metadata in the Kubernetes cluster using eBPF and visualizes it in Grafana
An open-source script tool that adds CPU, NVMe, and SSD temperature and load information to the Proxmox VE web UI.
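For flavor, here is a minimal BCC-based sketch of the eBPF approach such observability tools build on: it counts tcp_connect() calls per PID. This is a generic illustration (assumes a Linux host with BCC installed), not k8spacket's or Retina's actual code.

```python
# Minimal BCC sketch: count tcp_connect() calls per process via a kprobe.
# Generic illustration of eBPF-based TCP observability; not k8spacket's code.
import time
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);  // pid -> connection count

int trace_tcp_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&pid, &zero);
    if (val) (*val)++;
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_connect", fn_name="trace_tcp_connect")

print("Tracing tcp_connect()... Ctrl-C to stop.")
try:
    while True:
        time.sleep(5)
        for pid, count in b["counts"].items():
            print(f"pid={pid.value} connects={count.value}")
except KeyboardInterrupt:
    pass
```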
Tools for merging pretrained large language models.
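The simplest of the merge methods such tools implement is a weighted linear average of parameter tensors; a toy sketch under that assumption (checkpoint paths are hypothetical, and real tools also handle SLERP, TIES/DARE, tokenizer alignment, and sharded checkpoints):

```python
# Toy weighted linear merge of two checkpoints with identical architectures.
import torch

def linear_merge(state_dicts, weights):
    """Merge parameter tensors key-by-key as a weighted average."""
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return merged

# Hypothetical paths, for illustration only.
sd_a = torch.load("model_a.pt", map_location="cpu")
sd_b = torch.load("model_b.pt", map_location="cpu")
torch.save(linear_merge([sd_a, sd_b], [0.6, 0.4]), "merged.pt")
```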
SGLang is a fast serving framework for large language models and vision language models.
[NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate, dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
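The core idea behind dynamic sparse prefill, sketched in NumPy (block size, top-k, and the mean-pooled importance proxy are illustrative choices; causal masking is omitted, and this is not the project's kernels):

```python
# Toy block-sparse attention: per query block, attend only to the top-k key
# blocks ranked by a cheap approximation (mean-pooled block dot products).
import numpy as np

def block_sparse_attention(Q, K, V, block=64, topk=4):
    n, d = Q.shape                                  # assumes n % block == 0
    nb = n // block
    Qb = Q.reshape(nb, block, d)
    Kb = K.reshape(nb, block, d)
    # Cheap importance estimate: dot products of block mean vectors.
    scores = Qb.mean(1) @ Kb.mean(1).T              # (nb, nb)
    keep = np.argsort(-scores, axis=1)[:, :topk]    # top-k key blocks per query block

    out = np.zeros_like(Q)
    for i in range(nb):
        ks = np.concatenate([K[j*block:(j+1)*block] for j in keep[i]])
        vs = np.concatenate([V[j*block:(j+1)*block] for j in keep[i]])
        att = Qb[i] @ ks.T / np.sqrt(d)
        att = np.exp(att - att.max(1, keepdims=True))
        att /= att.sum(1, keepdims=True)
        out[i*block:(i+1)*block] = att @ vs
    return out
```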
Disaggregated serving system for Large Language Models (LLMs).
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
PbootCMS is a permanently open-source, free PHP enterprise website CMS built on an all-new kernel: an efficient, lean, and powerful PHP CMS that is free for commercial use and meets the needs of all kinds of enterprise website projects. Its template tags are simple enough that anyone who knows HTML can quickly build an enterprise site, and the project officially provides a large number of free website templates, aiming to give developers and enterprises the best website-building solution.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at batch sizes up to the medium range of 16-32 tokens.
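For reference, the math a W4A16 kernel like this implements, sketched in NumPy with an assumed group size of 128 (real kernels fuse the dequantization into the GEMM; this only shows the arithmetic):

```python
# Reference math for FP16xINT4 (W4A16) inference: weights stored as 4-bit
# integers with per-group FP16 scales, dequantized and multiplied by FP16
# activations. Fused kernels do this on-chip; here it is plain NumPy.
import numpy as np

def quantize_int4(W, group=128):
    """Symmetric 4-bit groupwise quantization along the input dimension."""
    k, n = W.shape
    Wg = W.reshape(k // group, group, n)
    scales = np.abs(Wg).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    q = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def w4a16_matmul(x, q, scales):
    W_deq = (q.astype(np.float16) * scales).reshape(-1, q.shape[-1])
    return x.astype(np.float16) @ W_deq

x = np.random.randn(16, 512).astype(np.float16)   # small batch, as in the claim
W = np.random.randn(512, 1024).astype(np.float16)
q, s = quantize_int4(W)
y = w4a16_matmul(x, q, s)                          # (16, 1024)
```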
Distributed ML Training and Fine-Tuning on Kubernetes
A clean, elegant, beautiful and powerful admin template, based on Vue3, Vite5, TypeScript, Pinia, NaiveUI and UnoCSS.
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
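The recipe here is a KV-cache eviction policy: permanently keep the first few "sink" tokens plus a sliding window of recent ones. A minimal sketch of that policy (cache sizes are illustrative):

```python
# Minimal sketch of the attention-sink KV cache policy: keep the first
# `n_sink` entries forever plus the most recent `window` ones.
from collections import deque

class SinkCache:
    def __init__(self, n_sink=4, window=1024):
        self.n_sink = n_sink
        self.sinks = []                      # first tokens' KV, never evicted
        self.recent = deque(maxlen=window)   # rolling window, auto-evicts

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def view(self):
        # KV entries the model attends over at the current step.
        return self.sinks + list(self.recent)
```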
The OpenAIOS vGPU device plugin for Kubernetes, originating from the OpenAIOS project, virtualizes GPU device memory so that applications can access a larger memory space than the physical capacity.
[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
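What makes multi-LoRA serving scale: all requests share the base weights, and each request adds only its own low-rank delta x·A_i·B_i. A toy PyTorch sketch (adapter names are hypothetical; real servers batch heterogeneous adapters with custom kernels rather than a Python loop):

```python
# Toy multi-LoRA linear layer: one shared base weight, per-request adapters.
import torch

class MultiLoRALinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.adapters = {}  # adapter_id -> (A: d_in x r, B: r x d_out)

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def forward(self, x, adapter_ids):
        y = self.base(x)  # shared base compute for the whole batch
        deltas = torch.stack([
            x[i] @ self.adapters[name][0] @ self.adapters[name][1]
            for i, name in enumerate(adapter_ids)  # per-request low-rank delta
        ])
        return y + deltas

layer = MultiLoRALinear(64, 64)
for name in ("adapter-a", "adapter-b"):            # hypothetical adapter names
    layer.add_adapter(name, 0.01 * torch.randn(64, 8), 0.01 * torch.randn(8, 64))
x = torch.randn(2, 64)
y = layer(x, ["adapter-a", "adapter-b"])           # one adapter per request
print(y.shape)                                     # torch.Size([2, 64])
```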
A blazing fast inference solution for text embeddings models
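A usage sketch against a locally running server, assuming the default /embed route and port shown in the project's local-deployment docs:

```python
# Query a locally running text-embeddings-inference server.
# Endpoint and port are assumptions based on a default local deployment.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": ["embed this sentence", "and this one"]},
    timeout=30,
)
embeddings = resp.json()   # list of float vectors, one per input
print(len(embeddings), len(embeddings[0]))
```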
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
The Triton TensorRT-LLM Backend
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
[EMNLP'23, ACL'24] Speeds up LLM inference and sharpens the model's perception of key information by compressing the prompt and KV cache, achieving up to 20x compression with minimal performance loss.
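A toy sketch of the perplexity-style idea behind such compression: score tokens with a small causal LM and drop the most predictable ones. This is an illustration of the concept under those assumptions, not the project's actual algorithm:

```python
# Toy perplexity-based prompt compression: use a small causal LM to score
# each token's surprisal and keep only the most informative tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt, keep_ratio=0.5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given tokens < t (the first token is always kept).
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(surprisal.numel() * keep_ratio))
    keep = torch.topk(surprisal, k).indices.sort().values + 1
    kept_ids = torch.cat([ids[0, :1], ids[0, keep]])
    return tok.decode(kept_ids)

print(compress("The quick brown fox jumps over the lazy dog near the river bank."))
```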
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
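For context, the generic building block of quantization-aware training is fake quantization with a straight-through estimator (STE), sketched below in PyTorch; this shows the QAT mechanism in general, not LLM-QAT's data-free training recipe:

```python
# Generic fake quantization with a straight-through estimator (STE):
# the forward pass uses quantized weights, while gradients flow through
# to the full-precision weights as if no rounding had happened.
import torch

def fake_quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # STE: forward value is w_q, gradient is taken with respect to w.
    return w + (w_q - w).detach()

w = torch.randn(64, 64, requires_grad=True)
y = (fake_quantize(w) @ torch.randn(64, 8)).sum()
y.backward()
print(w.grad.shape)  # gradients reach the full-precision weights
```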
The official repo of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.