Starred repositories

THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE.

Cuda 83 50 Updated Feb 21, 2024

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.

Python 290 35 Updated Sep 11, 2024

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

C++ 525 83 Updated Nov 18, 2020

C++ Implementation of PyTorch Tutorials for Everyone

C++ 1,954 260 Updated May 6, 2024

This is a code repository for pytorch c++ (or libtorch) tutorial.

C++ 734 121 Updated Nov 2, 2021

C++ library based on tensorrt integration

C++ 2,582 547 Updated May 24, 2023

A curated list of awesome C++ (or C) frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff.

59,297 7,790 Updated Oct 1, 2024

Fast CUDA matrix multiplication from scratch

Cuda 447 61 Updated Dec 28, 2023

A list of awesome compiler projects and papers for tensor computation and deep learning.

2,345 300 Updated Jul 14, 2024

Yinghan's Code Sample

Cuda 279 53 Updated Jul 25, 2022

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda 270 43 Updated Nov 28, 2021

Achieve peak performance on x86 CPUs and NVIDIA GPUs

C++ 63 14 Updated Oct 7, 2024

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…

C++ 1,106 505 Updated Aug 21, 2024

Instructions, Docker images, and examples for Nsight Compute and Nsight Systems

Cuda 126 18 Updated May 19, 2020

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 28,413 4,214 Updated Oct 15, 2024

Training material for Nsight developer tools

C 127 33 Updated Aug 8, 2024

Implementation of popular deep learning networks with TensorRT network definition API

C++ 6,932 1,769 Updated Oct 11, 2024

Demonstration of various hardware effects on CUDA GPUs.

C++ 351 28 Updated Nov 22, 2023

Demonstration of various hardware effects.

C++ 2,831 159 Updated Feb 29, 2024

Thin, unified, C++-flavored wrappers for the CUDA APIs

C++ 784 80 Updated Sep 23, 2024

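AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks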

Jupyter Notebook 10,794 1,562 Updated Oct 9, 2024

The C++ Core Guidelines are a set of tried-and-true guidelines, rules, and best practices about coding in C++

CSS 42,637 5,432 Updated Oct 4, 2024

Useful CMake Examples

CMake 12,360 2,491 Updated Feb 28, 2024

CMake Cookbook recipes.

C++ 2,704 696 Updated Jun 1, 2021

A Deep-Learning-Based Chinese Speech Recognition System

Python 7,793 1,891 Updated Sep 26, 2024

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 814 128 Updated Jul 29, 2023

A collection of compiler learning resources.

Python 2,097 325 Updated May 27, 2024

Open deep learning compiler stack for cpu, gpu and specialized accelerators

Python 11,694 3,457 Updated Oct 15, 2024

Source code examples from the Parallel Forall Blog

HTML 1,230 631 Updated Jul 23, 2024