
TeaChat


TeaChat uses exam question banks as a domain corpus covering the nine high-school subjects: mathematics, Chinese, English, physics, chemistry, biology, politics, history, and geography. Combining fine-tuning, RAG, and multi-agent techniques, it answers and explains gaokao (Chinese college entrance exam) problems. The project responds to the goals of deepening education reform and promoting educational equity: a free teacher AI that anyone can use, narrowing the gap in educational resources.

Framework

Fine-tuning alone did not perform well, so the system now layers RAG and multi-agent components on top of the fine-tuned model.

QuickStart

Create a virtual environment:

conda create -n teachat python=3.10
conda activate teachat

Clone the project:

git clone https://github.com/time1527/TeaChat.git
cd TeaChat

Install dependencies:

pip install -r requirements.txt

Run:

bash run.sh

Catalogue

├── assets: images
├── data: incremental pretraining and SFT data processing
├── evaluate: evaluation
├── finetune: fine-tuning configs
├── gradio_app.py: front end
├── LICENSE
├── multi_agent: MetaGPT agents
├── ocr
├── rag: retrieval
├── rag_data: RAG data preparation
├── README.md
├── requirements.txt
├── run.sh: lmdeploy serve + gradio_app
└── test: tests
  • data: fuzzy deduplication within and across datasets via MinHash; the training data is additionally deduplicated against the domain evaluation sets using both exact and fuzzy matching (see the MinHash sketch after this list)
  • evaluate: the high-school portions of AGIEval, GAOKAO-Bench, CMMLU, and C-Eval serve as the domain evaluation sets; the fine-tuned models are evaluated zero-shot
  • finetune: YeungNLP/firefly-train-1.1M and CoT_data.json from QingyiSi/Alpaca-CoT as general-purpose datasets, the high-school data from WanJuan1.0 as the domain dataset; supervised fine-tuning of internlm2-chat-1_8b with QLoRA (a rough QLoRA sketch follows this list)
  • multi_agent: keypoint/major/question extraction and retrieval implemented with MetaGPT (sketch after this list)
  • rag_data
    • video-link data: bilibili video URLs collected by crawling
    • knowledge-point data: GPT recognizes the tables of contents of the PEP (People's Education Press) textbooks, with manual checking; the PDF content is then extracted by page number (see the pypdf sketch after this list)
    • QA data: the high-school data from WanJuan1.0
  • rag
    • metadata filtering: langchain's BM25Retriever rewritten to support metadata filtering (sketch after this list)
    • web retrieval: langchain's WebResearchRetriever rewritten to use GoogleSerperAPIWrapper and to drop the steps that need an LLM (sketch after this list)
    • hybrid retrieval: BM25FilterRetriever + FAISS.as_retriever()
    • reranking (a combined hybrid-retrieval-plus-rerank sketch closes this list)
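
A minimal sketch of the MinHash fuzzy deduplication mentioned in the data item, using the datasketch library; the character-trigram shingling and the 0.8 Jaccard threshold are assumptions, not necessarily what data/ implements:

from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # hash character trigrams; Chinese text has no whitespace to tokenize on
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):
        m.update(text[i:i + 3].encode("utf-8"))
    return m

def fuzzy_dedup(samples: list[str]) -> list[str]:
    # estimated Jaccard similarity >= 0.8 counts as a near-duplicate (assumed)
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for idx, text in enumerate(samples):
        m = minhash_of(text)
        if lsh.query(m):            # already kept something too similar: drop
            continue
        lsh.insert(str(idx), m)     # LSH keys must be unique strings
        kept.append(text)
    return kept

Deduplicating against the evaluation sets works the same way, except the eval samples are inserted into the LSH index first and only training samples with no hit are kept.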
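
The actual fine-tuning runs off the XTuner configs under finetune/, which are not reproduced here. For orientation only, an equivalent QLoRA setup in plain transformers + peft + bitsandbytes might look like this (the hyperparameters are assumptions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-1_8b",
    quantization_config=bnb,
    trust_remote_code=True,
)
# low-rank adapters on the attention and MLP projections (InternLM2 layer names)
lora = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["wqkv", "wo", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapter weights are trainable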
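
For the multi_agent item, a minimal sketch of one extraction step following MetaGPT's documented Action pattern; the prompt wording and JSON keys are illustrative, not the repo's actual code:

from metagpt.actions import Action

class ExtractFields(Action):
    # prompt and JSON keys are illustrative assumptions
    PROMPT_TEMPLATE: str = (
        "For the following gaokao question, return JSON with the keys "
        '"major" (subject), "keypoint" (knowledge point) and "question" (stem).\n'
        "{question}"
    )

    async def run(self, question: str) -> str:
        prompt = self.PROMPT_TEMPLATE.format(question=question)
        # _aask sends the prompt to the configured LLM and returns its reply
        return await self._aask(prompt)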
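
Once GPT plus manual checking has mapped textbook chapters to page numbers, extracting those pages is mechanical. A sketch with pypdf; the file name and page range are placeholders:

from pypdf import PdfReader

def extract_pages(pdf_path: str, first: int, last: int) -> str:
    # pull the text of a 0-based, inclusive page range out of one textbook PDF
    reader = PdfReader(pdf_path)
    return "\n".join(
        page.extract_text() or "" for page in reader.pages[first : last + 1]
    )

chapter_text = extract_pages("textbook.pdf", 12, 25)  # placeholder path/pages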
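
A sketch of the metadata-filtered BM25 retrieval described in the rag item. It leans on the internals of langchain's BM25Retriever (vectorizer, docs, k, preprocess_func) as found in recent langchain_community releases; the field name filter_meta and the over-fetch factor are assumptions, and the repo's BM25FilterRetriever may differ:

from typing import Optional
from langchain_community.retrievers import BM25Retriever

class BM25FilterRetriever(BM25Retriever):
    # documents must match every key/value in filter_meta to be returned
    filter_meta: Optional[dict] = None   # e.g. {"subject": "physics"}

    def _get_relevant_documents(self, query, *, run_manager):
        # over-fetch from BM25, then apply the metadata filter
        tokens = self.preprocess_func(query)
        candidates = self.vectorizer.get_top_n(tokens, self.docs, n=self.k * 5)
        if self.filter_meta:
            candidates = [
                d for d in candidates
                if all(d.metadata.get(k) == v for k, v in self.filter_meta.items())
            ]
        return candidates[: self.k]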
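
For the LLM-free web retrieval item, a sketch that swaps in GoogleSerperAPIWrapper and skips the query-generation step that WebResearchRetriever would normally delegate to an LLM; the chunk sizes and k are assumptions:

from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def web_search_docs(query: str, k: int = 3):
    # Serper returns structured Google results; no LLM is involved at any step
    search = GoogleSerperAPIWrapper()   # reads SERPER_API_KEY from the environment
    hits = search.results(query).get("organic", [])[:k]
    urls = [h["link"] for h in hits if "link" in h]
    docs = WebBaseLoader(urls).load()   # fetch and parse each result page
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_documents(docs)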
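
Finally, a combined sketch of the hybrid-retrieval and reranking stages: langchain's EnsembleRetriever fuses the sparse and dense result lists by weighted reciprocal rank fusion, and a cross-encoder reorders the merged hits. The stock BM25Retriever stands in for the repo's BM25FilterRetriever, and the model names and weights are assumptions:

from langchain.retrievers import EnsembleRetriever
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from sentence_transformers import CrossEncoder

def build_hybrid_retriever(docs):
    # sparse (BM25) and dense (FAISS) retrievers merged by rank fusion
    bm25 = BM25Retriever.from_documents(docs)
    emb = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")
    dense = FAISS.from_documents(docs, emb).as_retriever(search_kwargs={"k": 4})
    return EnsembleRetriever(retrievers=[bm25, dense], weights=[0.5, 0.5])

def rerank(query: str, docs, top_n: int = 4):
    # a cross-encoder scores each (query, passage) pair jointly, then reorders
    scorer = CrossEncoder("BAAI/bge-reranker-base")
    scores = scorer.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]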

v0.1

  • Incremental pretraining data collection: 2024/03/26
  • Incremental pretraining data preparation: 2024/04/15
  • Pre-SFT evaluation: 2024/04/23
  • SFT data collection: 2024/03/26
  • SFT data preparation: 2024/04/15
  • SFT
    • internlm2_1.8b_chat + domain data: 2024/04/21
    • internlm2_1.8b_chat + domain data + general data: 2024/04/27
    • "internlm2_1.8b_chat + domain data + general data" + domain data: 2024/05/07
  • Post-SFT evaluation:
    • internlm2_1.8b_chat + domain data: 2024/04/23
    • internlm2_1.8b_chat + domain data + general data: 2024/05/05
    • "internlm2_1.8b_chat + domain data + general data" + domain data: 2024/05/07
  • RAG data collection: 2024/04/20
  • RAG: 2024/04/26
  • Multi-Agent: 2024/05

Data Used

  1. Dataset QingyiSi/Alpaca-CoT: licensed under apache-2.0
  2. Dataset YeungNLP/firefly-train-1.1M: part of the Firefly project
  3. Dataset WanJuan1.0: licensed under CC BY-NC 4.0

Reference

  1. InternLM/Tutorial: https://github.com/InternLM/Tutorial

  2. slimpajama data-processing pipeline: https://github.com/Cerebras/modelzoo/tree/Release_2.1.1/modelzoo/transformers/data_processing/slimpajama

  3. langchain: https://github.com/langchain-ai/langchain

  4. @misc{2023xtuner,
        title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
        author={XTuner Contributors},
        howpublished = {\url{https://github.com/InternLM/xtuner}},
        year={2023}
    }
    
  5. @misc{2023lmdeploy,
        title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
        author={LMDeploy Contributors},
        howpublished = {\url{https://github.com/InternLM/lmdeploy}},
        year={2023}
    }
    
  6. @misc{alpaca-cot,
      author = {Qingyi Si and Zheng Lin},
      school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
      title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
    }
    
  7. @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }