cpm0722

Organizations

@Soongsil-Developers

llama3.np is a pure NumPy implementation of the Llama 3 model.

Python 937 71 Updated Jun 2, 2024

llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.

Cuda 271 17 Updated Jun 4, 2024

LLM inference in C/C++

C++ 62,012 8,902 Updated Jul 22, 2024

Perplexica is an AI-powered search engine. It is an open-source alternative to Perplexity AI.

TypeScript 11,264 1,002 Updated Jul 22, 2024

📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

2,031 138 Updated Jul 20, 2024

A framework for few-shot evaluation of language models.

Python 5,910 1,571 Updated Jul 22, 2024

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,644 83 Updated Jan 21, 2024

Accelerate your Hugging Face Transformers 7.6-9x. Native to Hugging Face and PyTorch.

Python 663 63 Updated Jun 23, 2024

Architecture decision record (ADR) examples for software planning, IT leadership, and template documentation

11,609 2,398 Updated Jun 12, 2024

Write scalable load tests in plain Python 🚗💨

Python 24,260 2,930 Updated Jul 22, 2024

Official implementation of project Honeybee (CVPR 2024)

Python 400 18 Updated May 10, 2024

row-major matmul optimization

C++ 568 79 Updated Sep 9, 2023

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 765 121 Updated Jul 29, 2023

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python 5,380 488 Updated Jul 13, 2024

Machine Learning Engineering Open Book

Python 10,281 616 Updated Jul 18, 2024

Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

Rust 18,949 1,297 Updated Jul 22, 2024

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 1,771 166 Updated Jul 10, 2024

Explains complex systems using visuals and simple terms. Helps you prepare for system design interviews.

60,774 6,276 Updated May 16, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 7,620 830 Updated Jul 19, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 3,451 306 Updated Jul 22, 2024

The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!

Python 6,823 767 Updated Jul 22, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 6,381 355 Updated Jul 11, 2024

Fast inference engine for Transformer models

C++ 3,074 273 Updated Jul 11, 2024

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 8,072 569 Updated Jul 18, 2024

C++ Library Manager for Windows, Linux, and macOS

CMake 22,404 6,195 Updated Jul 22, 2024

The Mojo Programming Language

Mojo 22,367 2,550 Updated Jul 22, 2024

An unnecessarily tiny implementation of GPT-2 in NumPy.

Python 3,135 403 Updated Apr 24, 2023

Inference Llama 2 in one file of pure C

C 16,877 1,974 Updated Jul 13, 2024