A new model, AutoCoder_QW_7B, has been uploaded. In this model, we fixed the previous problem where the model would only start the code interpreter when you asked it to verify its code.
The base model of AutoCoder_QW_7B is CodeQwen1.5-7b.
We introduce a new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024): 90.9% vs. 90.2%.
Additionally, compared to previous open-source models, AutoCoder offers a new feature: whenever the user wants to execute the code, it can automatically install the required packages and keep trying to run the code until it decides there are no issues.
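The repository does not describe the interpreter's internals, but the install-and-retry idea can be illustrated with a minimal sketch; everything below (function name, error parsing) is hypothetical and is not AutoCoder's actual implementation:

```python
# Hypothetical sketch of an install-and-retry loop, not AutoCoder's real code.
import subprocess
import sys

def run_with_auto_install(code: str, max_attempts: int = 5):
    """Run a code snippet; if a package is missing, pip-install it and retry."""
    result = None
    for _ in range(max_attempts):
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result  # ran cleanly, no issues detected
        if "ModuleNotFoundError" in result.stderr:
            # Install the module named in the error, then try again.
            # (Note: module names do not always match pip package names.)
            missing = result.stderr.rsplit("'", 2)[-2]
            subprocess.run([sys.executable, "-m", "pip", "install", missing])
        else:
            return result  # a different error; surface it to the user
    return result
```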
- Difference between the code interpreter of AutoCoder and that of GPT-4 Turbo:
Below are video demos comparing the code interpreters of GPT-4 Turbo and AutoCoder:
GPT-4o cannot access external libraries.
AutoCoder-gpt4o.mp4
AutoCoder can automatically install the required packages. This feature broadens the range of applications for the code interpreter.
AutoCoder_demo.mp4
- Difference between the code interpreter of AutoCoder and the current open-source code interpreter OpenCodeInterpreter:
The code interpreter of AutoCoder, like GPT-4 Turbo's, is called only when the user needs to verify the code, whereas OpenCodeInterpreter runs all generated Python code.
The models are available on Hugging Face:
AutoCoder (33B) AutoCoder-S (6.7B)
The base models of AutoCoder (33B) and AutoCoder-S (6.7B) are deepseek-coder.
The base model of AutoCoder_QW_7B is CodeQwen1.5-7b.
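As a quick-start sketch, the models can be loaded with the standard transformers API. The repository id below is an assumption; replace it with the exact id shown on the Hugging Face model card.

```python
# Quick-start loading sketch; "Bin12345/AutoCoder_QW_7B" is an assumed repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bin12345/AutoCoder_QW_7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```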
- Create the conda env
conda create -n AutoCoder python=3.11
conda activate AutoCoder
pip install -r requirements.txt
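Optionally, sanity-check the environment before running the evaluation scripts. This assumes requirements.txt installs torch and transformers; adjust if your setup differs.

```python
# Quick environment check (assumes torch and transformers are installed).
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```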
- Test on HumanEval: 90.9% on base, 78.0% on base + extra. (Skip to Step 5 if you don't want to test its performance on benchmarks.)
cd Evaluation
python test_humaneval.py
After this step, you will get a file named AutoCoder_HumanEval+.jsonl that follows the EvalPlus format.
Then follow the testing framework from the EvalPlus GitHub to see the results.
NOTE:
- Don't forget to use evalplus's `evalplus.sanitize` to post-process the code.
- If you don't use greedy decoding for code generation (for example, if you set `do_sample=True`), you will probably see different results.
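Before sanitizing, you can quickly inspect the generated file. The snippet assumes the usual EvalPlus jsonl layout with one JSON object per line and a task_id field; check the actual file if the keys differ.

```python
# Peek at the generated samples (assumes EvalPlus-style jsonl with "task_id").
import json

with open("AutoCoder_HumanEval+.jsonl") as f:
    samples = [json.loads(line) for line in f]

print("number of samples:", len(samples))
print("keys in first record:", list(samples[0].keys()))
print("first task id:", samples[0].get("task_id"))
```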
- Test on MBPP: 82.5% on base, 70.6% on base + extra. (Skip to Step 5 if you don't want to test its performance on benchmarks.)
python test_mbpp.py
Post-process to delete the natural language before testing:
python postprocess_mbpp.py
You will get a file named AutoCoder_Mbpp+-sanitized.jsonl after this step; it contains the extracted code blocks.
Then test it directly with the EvalPlus GitHub framework (you don't need to use evalplus's `evalplus.sanitize` to post-process the code this time).
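postprocess_mbpp.py handles the extraction for you; purely as an illustration of the idea (not the actual script), pulling fenced code blocks out of a model response can look like this:

````python
# Illustrative only -- not the actual postprocess_mbpp.py.
import re

def extract_code_blocks(response: str) -> str:
    """Return the contents of all fenced code blocks in the response."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, flags=re.DOTALL)
    return "\n".join(blocks) if blocks else response  # fall back to raw text

example = "Here is the solution:\n```python\ndef add(a, b):\n    return a + b\n```"
print(extract_code_blocks(example))
````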
- Test on DS-1000. (Skip to Step 5 if you don't want to test its performance on benchmarks.)
python test_ds1000.py
After this step, you will get a jsonl file containing the extracted code blocks. Then test it directly with the DS-1000 GitHub framework.
- Web demo (includes code interpreter)
Install Gradio and run:
pip install gradio==3.48.0
cd Web_demo
python chatbot.py
- We suggest setting `do_sample = True` (the default setting here) while using the code interpreter.
- It would be preferable to use Linux for deploying everything.
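Web_demo/chatbot.py already provides the full interface; as a rough, hypothetical sketch of a minimal Gradio chat front-end over the model (not the repository's actual code), with `do_sample=True` as suggested above:

```python
# Hypothetical minimal chat UI, not the repository's chatbot.py.
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bin12345/AutoCoder_QW_7B"  # assumed repo id; use the model card's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def respond(message, history):
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True)
    # Strip the prompt tokens and return only the newly generated text.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

gr.ChatInterface(respond).launch()
```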
If you have any inquiries, please feel free to raise an issue or reach out to [email protected].
@misc{lei2024autocoder,
title={AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}},
author={Bin Lei and Yuchen Li and Qiuwu Chen},
year={2024},
eprint={2405.14906},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
Thanks to Tianyu Zheng, the first author of OpenCodeInterpreter, for guidance on some technical details.