- Moscow, Russia
- @dvmazur
Starred repositories
Triton-based implementation of Sparse Mixture of Experts.
Trio – a friendly Python library for async concurrency and I/O
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet.
nsync is a C library that exports various synchronization primitives, such as mutexes
🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Lightning fast C++/CUDA neural network framework
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
Efficient Triton Kernels for LLM Training
A fast communication-overlapping library for tensor parallelism on GPUs.
News and material links related to GPU programming
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Perplexica is an AI-powered search engine and an open-source alternative to Perplexity AI
🦜🔗 Build context-aware reasoning applications
Vision utilities for web interaction agents 👀
Fast Inference of MoE Models with CPU-GPU Orchestration
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Minimalistic large language model 3D-parallelism training
YSDA course in Natural Language Processing
Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Ext…
Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory