- UC Berkeley
- San Francisco Bay Area
- https://zhuohan.li
- @zhuohan123
- in/zhuohan-li
Stars
Dynamic Memory Management for Serving LLMs without PagedAttention
A framework for few-shot evaluation of language models.
A fast communication-overlapping library for tensor parallelism on GPUs.
HabanaAI / vllm-fork
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
A visual no-code/code-free web crawler/spider (易采集): visual browser automation, data-collection, and crawler software that lets you design and run crawling tasks graphically without writing code. Also known as ServiceWrapper, an intelligent service-wrapping system for web applications.
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
Arena-Hard-Auto: An automatic LLM benchmark.
DSPy: The framework for programming—not prompting—foundation models
A parallel framework for training deep neural networks
[ICML 2024] CLLMs: Consistency Large Language Models
Universal LLM Deployment Engine with ML Compilation
Standardized Serverless ML Inference Platform on Kubernetes
An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization.
Building a quick conversation-based search demo with Lepton AI.
LlamaIndex is a data framework for your LLM applications
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
SGLang is yet another fast serving framework for large language models and vision language models.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.