
TeaChat


TeaChat uses exam question banks as a domain corpus covering the nine high-school subjects: mathematics, Chinese, English, physics, chemistry, biology, politics, history, and geography. Combining fine-tuning, RAG, and multi-agent techniques, it answers and explains gaokao (Chinese college entrance exam) problems. The project responds to the goals of deepening education reform and promoting educational equity: a free teacher AI that anyone can use, narrowing the gap in educational resources.

Framework

Fine-tuning alone did not perform well, so the system now layers RAG and multi-agent components on top of the fine-tuned model.

QuickStart

Create a virtual environment:

conda create -n teachat python=3.10
conda activate teachat

Clone the project:

git clone https://github.com/time1527/TeaChat.git
cd TeaChat

Install dependencies:

pip install -r requirements.txt

Run:

bash run.sh

Catalogue

├── assets: images
├── data: incremental pretraining and SFT data processing
├── evaluate: evaluation
├── finetune: fine-tuning configs
├── gradio_app.py: front end
├── LICENSE
├── multi_agent: MetaGPT agents
├── ocr
├── rag: retrieval
├── rag_data: RAG data preparation
├── README.md
├── requirements.txt
├── run.sh: lmdeploy serve + gradio_app
└── test: tests
  • data: fuzzy deduplication within and across datasets via MinHash; the training data is additionally deduplicated against the domain evaluation sets using both exact and fuzzy matching (see the MinHash sketch after this list)
  • evaluate: the high-school portions of AGIEval, GAOKAO-Bench, CMMLU, and C-Eval serve as the domain evaluation sets; the fine-tuned models are evaluated zero-shot
  • finetune: YeungNLP/firefly-train-1.1M and CoT_data.json from QingyiSi/Alpaca-CoT as general-purpose datasets, the high-school data from WanJuan1.0 as the domain dataset; supervised fine-tuning of internlm2-chat-1_8b with QLoRA (a rough QLoRA sketch follows this list)
  • multi_agent: keypoint/major/question extraction and retrieval implemented with MetaGPT (sketch after this list)
  • rag_data
    • video-link data: bilibili video URLs collected by crawling
    • knowledge-point data: GPT recognizes the tables of contents of the PEP (People's Education Press) textbooks, with manual checking; the PDF content is then extracted by page number (see the pypdf sketch after this list)
    • QA data: the high-school data from WanJuan1.0
  • rag
    • metadata filtering: langchain's BM25Retriever rewritten to support metadata filtering (sketch after this list)
    • web retrieval: langchain's WebResearchRetriever rewritten to use GoogleSerperAPIWrapper and to drop the steps that need an LLM (sketch after this list)
    • hybrid retrieval: BM25FilterRetriever + FAISS.as_retriever()
    • reranking (a combined hybrid-retrieval-plus-rerank sketch closes this list)
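
A minimal sketch of the MinHash fuzzy deduplication mentioned in the data item, using the datasketch library; the character-trigram shingling and the 0.8 Jaccard threshold are assumptions, not necessarily what data/ implements:

from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # hash character trigrams; Chinese text has no whitespace to tokenize on
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):
        m.update(text[i:i + 3].encode("utf-8"))
    return m

def fuzzy_dedup(samples: list[str]) -> list[str]:
    # estimated Jaccard similarity >= 0.8 counts as a near-duplicate (assumed)
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for idx, text in enumerate(samples):
        m = minhash_of(text)
        if lsh.query(m):            # already kept something too similar: drop
            continue
        lsh.insert(str(idx), m)     # LSH keys must be unique strings
        kept.append(text)
    return kept

Deduplicating against the evaluation sets works the same way, except the eval samples are inserted into the LSH index first and only training samples with no hit are kept.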
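
The actual fine-tuning runs off the XTuner configs under finetune/, which are not reproduced here. For orientation only, an equivalent QLoRA setup in plain transformers + peft + bitsandbytes might look like this (the hyperparameters are assumptions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-1_8b",
    quantization_config=bnb,
    trust_remote_code=True,
)
# low-rank adapters on the attention and MLP projections (InternLM2 layer names)
lora = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["wqkv", "wo", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapter weights are trainable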
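
For the multi_agent item, a minimal sketch of one extraction step following MetaGPT's documented Action pattern; the prompt wording and JSON keys are illustrative, not the repo's actual code:

from metagpt.actions import Action

class ExtractFields(Action):
    # prompt and JSON keys are illustrative assumptions
    PROMPT_TEMPLATE: str = (
        "For the following gaokao question, return JSON with the keys "
        '"major" (subject), "keypoint" (knowledge point) and "question" (stem).\n'
        "{question}"
    )

    async def run(self, question: str) -> str:
        prompt = self.PROMPT_TEMPLATE.format(question=question)
        # _aask sends the prompt to the configured LLM and returns its reply
        return await self._aask(prompt)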
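
Once GPT plus manual checking has mapped textbook chapters to page numbers, extracting those pages is mechanical. A sketch with pypdf; the file name and page range are placeholders:

from pypdf import PdfReader

def extract_pages(pdf_path: str, first: int, last: int) -> str:
    # pull the text of a 0-based, inclusive page range out of one textbook PDF
    reader = PdfReader(pdf_path)
    return "\n".join(
        page.extract_text() or "" for page in reader.pages[first : last + 1]
    )

chapter_text = extract_pages("textbook.pdf", 12, 25)  # placeholder path/pages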
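
A sketch of the metadata-filtered BM25 retrieval described in the rag item. It leans on the internals of langchain's BM25Retriever (vectorizer, docs, k, preprocess_func) as found in recent langchain_community releases; the field name filter_meta and the over-fetch factor are assumptions, and the repo's BM25FilterRetriever may differ:

from typing import Optional
from langchain_community.retrievers import BM25Retriever

class BM25FilterRetriever(BM25Retriever):
    # documents must match every key/value in filter_meta to be returned
    filter_meta: Optional[dict] = None   # e.g. {"subject": "physics"}

    def _get_relevant_documents(self, query, *, run_manager):
        # over-fetch from BM25, then apply the metadata filter
        tokens = self.preprocess_func(query)
        candidates = self.vectorizer.get_top_n(tokens, self.docs, n=self.k * 5)
        if self.filter_meta:
            candidates = [
                d for d in candidates
                if all(d.metadata.get(k) == v for k, v in self.filter_meta.items())
            ]
        return candidates[: self.k]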
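
For the LLM-free web retrieval item, a sketch that swaps in GoogleSerperAPIWrapper and skips the query-generation step that WebResearchRetriever would normally delegate to an LLM; the chunk sizes and k are assumptions:

from langchain_community.utilities import GoogleSerperAPIWrapper
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def web_search_docs(query: str, k: int = 3):
    # Serper returns structured Google results; no LLM is involved at any step
    search = GoogleSerperAPIWrapper()   # reads SERPER_API_KEY from the environment
    hits = search.results(query).get("organic", [])[:k]
    urls = [h["link"] for h in hits if "link" in h]
    docs = WebBaseLoader(urls).load()   # fetch and parse each result page
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_documents(docs)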
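
Finally, a combined sketch of the hybrid-retrieval and reranking stages: langchain's EnsembleRetriever fuses the sparse and dense result lists by weighted reciprocal rank fusion, and a cross-encoder reorders the merged hits. The stock BM25Retriever stands in for the repo's BM25FilterRetriever, and the model names and weights are assumptions:

from langchain.retrievers import EnsembleRetriever
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from sentence_transformers import CrossEncoder

def build_hybrid_retriever(docs):
    # sparse (BM25) and dense (FAISS) retrievers merged by rank fusion
    bm25 = BM25Retriever.from_documents(docs)
    emb = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")
    dense = FAISS.from_documents(docs, emb).as_retriever(search_kwargs={"k": 4})
    return EnsembleRetriever(retrievers=[bm25, dense], weights=[0.5, 0.5])

def rerank(query: str, docs, top_n: int = 4):
    # a cross-encoder scores each (query, passage) pair jointly, then reorders
    scorer = CrossEncoder("BAAI/bge-reranker-base")
    scores = scorer.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]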

v0.1

  • Incremental pretraining data collection: 2024/03/26
  • Incremental pretraining data preparation: 2024/04/15
  • Pre-SFT evaluation: 2024/04/23
  • SFT data collection: 2024/03/26
  • SFT data preparation: 2024/04/15
  • SFT
    • internlm2_1.8b_chat + domain data: 2024/04/21
    • internlm2_1.8b_chat + domain data + general data: 2024/04/27
    • "internlm2_1.8b_chat + domain data + general data" + domain data: 2024/05/07
  • Post-SFT evaluation:
    • internlm2_1.8b_chat + domain data: 2024/04/23
    • internlm2_1.8b_chat + domain data + general data: 2024/05/05
    • "internlm2_1.8b_chat + domain data + general data" + domain data: 2024/05/07
  • RAG data collection: 2024/04/20
  • RAG: 2024/04/26
  • Multi-Agent: 2024/05

Data Used

  1. Dataset QingyiSi/Alpaca-CoT: licensed under apache-2.0
  2. Dataset YeungNLP/firefly-train-1.1M: part of the Firefly project
  3. Dataset WanJuan1.0: licensed under CC BY-NC 4.0

Reference

  1. InternLM/Tutorial: https://github.com/InternLM/Tutorial

  2. slimpajama data-processing pipeline: https://github.com/Cerebras/modelzoo/tree/Release_2.1.1/modelzoo/transformers/data_processing/slimpajama

  3. langchain: https://github.com/langchain-ai/langchain

  4. @misc{2023xtuner,
        title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
        author={XTuner Contributors},
        howpublished = {\url{https://github.com/InternLM/xtuner}},
        year={2023}
    }
    
  5. @misc{2023lmdeploy,
        title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
        author={LMDeploy Contributors},
        howpublished = {\url{https://github.com/InternLM/lmdeploy}},
        year={2023}
    }
    
  6. @misc{alpaca-cot,
      author = {Qingyi Si and Zheng Lin},
      school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
      title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
    }
    
  7. @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }