Starred repositories
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
Auditing and relabeling cross-distribution Linux wheels.
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
FlashInfer: Kernel Library for LLM Serving
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
A project aimed at measuring the real-world performance of Large Language Model (LLM) inference frameworks, inspired by the concepts in deepspeed-fastgen.
Python packaging and dependency management made easy
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
A unified evaluation framework for large language models
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Robust Speech Recognition via Large-Scale Weak Supervision
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Ongoing research training transformer models at scale
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and…
Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting yo…
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Sparsity-aware deep learning inference runtime for CPUs
Implementation of Nougat: Neural Optical Understanding for Academic Documents
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
A natural language interface for computers