- Seoul, Republic of Korea
- https://cpm0722.github.io
- in/hansu-kim-b15b2920b
Stars
llama3.np is a pure NumPy implementation of the Llama 3 model.
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
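"Pure NumPy" means every Transformer building block is written as plain array math. As a hedged illustration (not code taken from llama3.np itself), RMSNorm, the normalization Llama-family models apply before attention and MLP blocks, reduces to a few NumPy lines; the `eps` value is a typical default, not necessarily the repo's:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """RMSNorm as used in Llama-family models: scale by the inverse
    root-mean-square of the activations, then apply a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Toy usage: a batch of 2 token embeddings of width 4.
x = np.array([[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5]])
print(rms_norm(x, np.ones(4)))
```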
Perplexica is an AI-powered search engine and an open-source alternative to Perplexity AI.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
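Several of those topics (AWQ, SmoothQuant, WINT8/4) revolve around weight quantization. As a rough NumPy sketch of only the basic idea — the real methods add activation-aware scaling and calibration — symmetric per-channel int8 quantization maps each weight row to 8-bit integers plus one float scale:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization: q = round(w / scale)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```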
A framework for few-shot evaluation of language models.
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
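The LoRA structure is what makes serving thousands of adapters feasible: each adapter is just two small matrices added on top of a shared frozen weight, so one base model plus per-request low-rank deltas covers every adapter. A minimal NumPy sketch of a LoRA forward pass (illustrative names, not S-LoRA's API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x @ W + (alpha / r) * x @ A @ B, where A (d x r) and B (r x d_out)
    are the small per-adapter matrices and W is the shared frozen weight."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

d, d_out, r = 64, 64, 8
x = np.random.randn(2, d)
W = np.random.randn(d, d_out)          # shared across all adapters
A = np.random.randn(d, r) * 0.01       # adapter-specific
B = np.zeros((r, d_out))               # B starts at zero in LoRA
print(lora_forward(x, W, A, B).shape)  # (2, 64)
```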
Accelerate your Hugging Face Transformers 7.6-9x. Native to Hugging Face and PyTorch.
Architecture decision record (ADR) examples for software planning, IT leadership, and template documentation
Write scalable load tests in plain Python 🚗💨
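"Plain Python" is literal here — this is Locust, where a load test is an ordinary Python class. A small example in Locust's documented style (the endpoints are placeholders); run it with `locust -f locustfile.py --host https://example.com`:

```python
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    """Each simulated user waits 1-3 s between tasks and hits two endpoints."""
    wait_time = between(1, 3)

    @task(3)          # weight 3: visited three times as often as /about
    def index(self):
        self.client.get("/")

    @task
    def about(self):
        self.client.get("/about")
```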
Official implementation of project Honeybee (CVPR 2024)
A series of GPU optimization topics introducing in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, s…
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
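The generation loop itself is the simple part: feed the tokens so far, take the argmax (or a sample) over the last position's logits, append, repeat. A toy sketch with a stand-in model — gpt-fast's actual speed comes from compilation and KV caching, which this deliberately omits:

```python
import numpy as np

VOCAB = 50257

def stub_model(tokens):
    """Stand-in for a real model: returns random logits per position."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal((len(tokens), VOCAB))

def generate(prompt, max_new_tokens=8):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = stub_model(tokens)
        tokens.append(int(logits[-1].argmax()))  # greedy: pick the top logit
    return tokens

print(generate([1, 2, 3]))
```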
Machine Learning Engineering Open Book
Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Explains complex systems using visuals and simple terms. Helps you prepare for system design interviews.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
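The attention-sink idea: when the KV cache overflows, keep the first few "sink" tokens (which absorb disproportionate attention) plus a sliding window of recent tokens, and evict the middle. A hedged sketch of that eviction policy on a list of cached entries — not StreamingLLM's actual implementation:

```python
def evict_kv_cache(cache, n_sink=4, window=1020):
    """Keep the first n_sink entries plus the last `window` entries;
    drop everything in between once the cache exceeds n_sink + window."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = list(range(2000))              # stand-in for per-token KV entries
kept = evict_kv_cache(cache)
print(len(kept), kept[:6], kept[-2:])  # 1024 [0, 1, 2, 3, 980, 981] [1998, 1999]
```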
Fast inference engine for Transformer models
Hackable and optimized Transformers building blocks, supporting a composable construction.
C++ Library Manager for Windows, Linux, and macOS
An unnecessarily tiny implementation of GPT-2 in NumPy.
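In that spirit, the heart of GPT-2 fits in a few NumPy lines: causal self-attention is two matmuls around a masked softmax. A minimal single-head sketch (an illustration, not that repo's code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tri(T, dtype=bool), scores, -1e9)  # mask future tokens
    return softmax(scores) @ v

T, d = 5, 16
q, k, v = (np.random.randn(T, d) for _ in range(3))
print(causal_attention(q, k, v).shape)  # (5, 16)
```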