Stars
Run PyTorch LLMs locally on servers, desktop and mobile
On-device AI across mobile, embedded and edge for PyTorch
Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop]
Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling"
Generative AI extensions for onnxruntime
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
Deploying LLMs offline on the NVIDIA Jetson platform marks the dawn of a new era in embodied intelligence, where devices can function independently without continuous internet access.
✨✨Latest Advances on Multimodal Large Language Models
Fast job queuing and RPC in python with asyncio and redis.
Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation"
Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
A utility library to help integrate Python applications with Metropolis Microservices for Jetson
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Universal LLM Deployment Engine with ML Compilation
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently
Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
Development repository for the Triton language and compiler
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
PygmalionAI's large-scale inference engine
Large Language Model Text Generation Inference
A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
Programming accelerated applications with CUDA C/C++, enough to be able to begin work accelerating your own CPU-only applications for performance gains, and for moving into novel computational territory