A collection of benchmarks to measure basic GPU capabilities.

Jupyter Notebook 220 35 Updated Jun 21, 2024
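
For a flavor of what such benchmarks look like, here is a minimal sketch of a GPU memory-bandwidth micro-benchmark in PyTorch. It is illustrative only and not taken from the repository; the tensor size and iteration count are arbitrary choices.

```python
import torch

def copy_bandwidth_gbps(n_bytes=1 << 30, iters=10):
    """Time device-to-device copies and report effective bandwidth in GB/s."""
    assert torch.cuda.is_available(), "requires a CUDA GPU"
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    dst.copy_(src)  # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
    return 2 * n_bytes * iters / seconds / 1e9  # each copy reads + writes n_bytes

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbps():.0f} GB/s effective copy bandwidth")
```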

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 15 1 Updated Aug 18, 2024
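
For orientation, this is the computation both flash attention versions implement, written naively in NumPy; the point of flash attention is to produce the same output while tiling the work so the full N x N score matrix never materializes in GPU memory. This sketch is not code from the repository.

```python
import numpy as np

def attention(Q, K, V):
    """Naive scaled dot-product attention for one head; Q, K, V: (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N): the memory bottleneck
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q, K, V = (np.random.randn(128, 64) for _ in range(3))
out = attention(Q, K, V)  # (128, 64)
```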

A stripped-down implementation of flash-attention using cutlass, written for its teaching value.

Cuda 26 Updated Aug 12, 2024

A Chinese-language encyclopedia of modern C++, written by a team led by 小彭老师 (Teacher Peng).

Typst 466 31 Updated Aug 20, 2024

A tutorial series on x86-64 SIMD vector optimization.

C++ 92 8 Updated Jul 7, 2024

Code for Palu: Compressing KV-Cache with Low-Rank Projection

Python 23 1 Updated Aug 10, 2024
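
To illustrate the idea named in the description (not Palu's actual algorithm), a KV-cache block can be compressed with a low-rank projection; the sketch below uses a truncated SVD as the simplest such projection.

```python
import numpy as np

def low_rank_compress(kv, rank):
    """kv: (seq_len, hidden) cache block; returns factors whose product approximates kv."""
    U, S, Vt = np.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (seq_len, rank)
    B = Vt[:rank]               # (rank, hidden)
    return A, B                 # store these two factors instead of kv

kv = np.random.randn(512, 128)
A, B = low_rank_compress(kv, rank=32)
rel_err = np.linalg.norm(kv - A @ B) / np.linalg.norm(kv)
print(f"{A.size + B.size} floats stored vs {kv.size}, relative error {rel_err:.3f}")
```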

Demonstration of various hardware effects on CUDA GPUs.

C++ 334 25 Updated Nov 22, 2023

An fp8 flash attention for the Ada architecture, implemented with the cutlass library.

Cuda 39 2 Updated Aug 12, 2024

SGLang is yet another fast serving framework for large language models and vision language models.

Python 4,229 277 Updated Aug 20, 2024

Large Language Model Text Generation Inference

Python 8,613 990 Updated Aug 20, 2024

A way to use CUDA to accelerate the top-k algorithm.

Cuda 29 7 Updated Jul 11, 2017
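
The CPU baseline such a kernel accelerates is partial selection; a minimal NumPy sketch follows (torch.topk is the usual off-the-shelf GPU equivalent). Illustrative only, not the repository's code.

```python
import numpy as np

def top_k(values, k):
    """Return the k largest values and their indices, sorted descending."""
    idx = np.argpartition(values, -k)[-k:]    # O(n) partial selection
    idx = idx[np.argsort(values[idx])[::-1]]  # sort only the k survivors
    return values[idx], idx

vals, idx = top_k(np.random.randn(1_000_000), k=10)
```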

Notes on deep learning systems: the mathematical foundations of deep learning, detailed explanations of basic neural network components, model-tuning strategies, model compression algorithms, and a hands-on guide to implementing a deep learning inference framework.

Python 319 49 Updated Feb 2, 2024

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

Python 14,463 955 Updated Aug 20, 2024
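
One of the techniques behind this kind of memory-efficient finetuning is a LoRA-style adapter: the pretrained weight is frozen and only a low-rank update is trained. The sketch below is a generic illustration in plain PyTorch; the names, shapes, and hyperparameters are assumptions, and none of it is Unsloth's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight and bias
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen full-rank path + trainable low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only A and B receive gradients
```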

The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression".

Python 65 4 Updated Aug 12, 2024

LLM101n: Let's build a Storyteller

27,091 1,477 Updated Aug 1, 2024

Source code for Twitter's Recommendation Algorithm

Scala 61,935 12,159 Updated Jul 10, 2024

Using a four-layer perceptron, the highest accuracy reaches 97%.

Python 2 Updated Jun 5, 2024
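
A four-layer perceptron of the kind the description refers to can be written in a few lines of PyTorch; the layer sizes below are illustrative guesses (e.g. an MNIST-shaped input), not the repository's configuration.

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),  # layer 1
    nn.Linear(256, 128), nn.ReLU(),  # layer 2
    nn.Linear(128, 64), nn.ReLU(),   # layer 3
    nn.Linear(64, 10),               # layer 4: class logits
)
```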

🏠 [ECCV 2024] Pytorch implementation of 'HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression'

Python 181 10 Updated Jul 9, 2024

Learning about CUDA by writing PTX code.

Python 28 Updated Feb 27, 2024

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Python 61 3 Updated Aug 13, 2024

Python 127 16 Updated Jul 23, 2024

Accessible large language models via k-bit quantization for PyTorch.

Python 5,930 600 Updated Aug 20, 2024
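
The core idea of k-bit weight quantization can be sketched as per-tensor absmax rounding; bitsandbytes' actual kernels are block-wise and considerably more sophisticated, so treat this as a conceptual sketch only.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization of a float tensor to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize_int8(q, s)).max())
```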

QQQ is an innovative and hardware-optimized W4A8 quantization solution.

Python 49 3 Updated Aug 2, 2024

[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling

Python 1,330 100 Updated Jul 10, 2024

A repository dedicated to evaluating the performance of quantized LLaMA3 across various quantization methods.

Python 148 6 Updated Aug 9, 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Python 191 16 Updated Aug 16, 2024
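
The primitive named in KIVI's title is asymmetric (zero-point) quantization at 2 bits per value; a minimal sketch follows. KIVI's per-channel/per-token grouping and bit-packing details are not reproduced here.

```python
import numpy as np

def quantize_asym(x, bits=2):
    """Asymmetric quantization: map [min, max] onto 2**bits integer levels."""
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def dequantize_asym(q, scale, lo):
    return q.astype(np.float32) * scale + lo

kv = np.random.randn(64, 128).astype(np.float32)
q, s, z = quantize_asym(kv)
rel_err = np.linalg.norm(kv - dequantize_asym(q, s, z)) / np.linalg.norm(kv)
print("levels used:", np.unique(q), "relative error:", round(rel_err, 3))
```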

The best library for implementation of all Data Structures and Algorithms - Trees + Graph Algorithms too!

C++ 2,747 993 Updated Mar 16, 2024

Lightning fast C++/CUDA neural network framework

C++ 3,619 443 Updated Jul 31, 2024