Stars
General technology for enabling AI capabilities w/ LLMs and MLLMs
[ACL 2024] Progressive LLaMA with Block Expansion.
Official GitHub repo for AutoDetect, an automated weakness detection framework for LLMs.
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension
🩺 The first Chinese multimodal medical LLM that can read chest X-rays and summarize chest radiographs.
Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
CMMLU: Measuring massive multitask language understanding in Chinese
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models
Alpaca dataset from Stanford, cleaned and curated
microsoft / Megatron-DeepSpeed
Forked from NVIDIA/Megatron-LM. Ongoing research training transformer language models at scale, including: BERT & GPT-2
An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
A PyTorch extension: tools for easy mixed-precision and distributed training in PyTorch
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Ongoing research training transformer models at scale
FastAPI framework, high performance, easy to learn, fast to code, ready for production
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
Official code for ICLR 2022 paper: "PoNet: Pooling Network for Efficient Token Mixing in Long Sequences".
EasyNLP: A Comprehensive and Easy-to-use NLP Toolkit
A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
This repo contains our ACL 2017 paper data and source code
A wide variety of research projects developed by the SpokenNLP team of Speech Lab, Alibaba Group.
A repository for individuals to experiment with and reproduce the LLM pre-training process.
Zstandard - Fast real-time compression algorithm
The official repo of Pai-Megatron-Patch, developed by Alibaba Cloud for large-scale LLM & VLM training.