We introduce AutoCoder, a new model designed for code generation. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024): 90.9% vs. 90.2%.
Additionally, compared to previous open-source models, AutoCoder offers a new feature: whenever the user wishes to execute the code, it can automatically install the required packages and keep attempting to run the code until it deems there are no issues.
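To make the install-and-retry idea concrete, here is a minimal, rule-based sketch of such a loop. It is not AutoCoder's actual implementation: the function names (`run_with_auto_install`, `pip_install`), the regular expression, and the retry limit are all hypothetical. It also assumes the missing module's name matches its pip package name, which is not always true.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Matches the error line Python prints when an import cannot be resolved.
MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def pip_install(package: str) -> bool:
    """Install a package with pip; return True if pip exits successfully."""
    proc = subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def run_with_auto_install(code: str, max_attempts: int = 3, installer=pip_install):
    """Run `code` in a subprocess, installing missing modules between attempts.

    Returns (succeeded, stdout, stderr) from the last attempt.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        stdout = stderr = ""
        for _ in range(max_attempts):
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=60,
            )
            stdout, stderr = proc.stdout, proc.stderr
            if proc.returncode == 0:
                return True, stdout, stderr
            match = MISSING_MODULE.search(stderr)
            if match is None:
                break  # Not an import problem; installing will not help.
            # Assumption: the top-level module name is also the pip package name.
            package = match.group(1).split(".")[0]
            if not installer(package):
                break  # Installation failed; stop retrying.
        return False, stdout, stderr
```

The `installer` argument is injectable so that the loop can be exercised without network access, for example by passing a function that only records the requested package names.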
- Difference between the code interpreters of AutoCoder and GPT-4 Turbo:
Below are video demos comparing the code interpreters of GPT-4 Turbo and AutoCoder:
GPT-4o cannot access external libraries.
AutoCoder-gpt4o.mp4
AutoCoder can automatically install the required packages, which broadens the range of tasks the code interpreter can handle.
AutoCoder_demo.mp4
- Difference between the code interpreter of AutoCoder and the current open-source code interpreter OpenCodeInterpreter:
The code interpreter of AutoCoder, like that of GPT-4 Turbo, is called only when the user needs to verify the code, whereas OpenCodeInterpreter runs all generated Python code.
The model is available on Hugging Face: AutoCoder (33B) and AutoCoder-S (6.7B).
The base model is deepseek-coder.
- Create the conda env
conda create -n AutoCoder python=3.11
conda activate AutoCoder
pip install -r requirements.txt
- Test on HumanEval: 90.9% on base, 78.0% on base + extra.
cd Evaluation
python test_humaneval.py
After this step, you will get a file named AutoCoder_HumanEval+.jsonl, which follows the EvalPlus format.
Then evaluate it with the testing framework from the EvalPlus GitHub repository to see the results.
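As a rough illustration of the file produced by this step: EvalPlus-style sample files are JSON Lines, with one object per task carrying a `task_id` and either a `solution` or a `completion` field. The sketch below writes and checks such a file. The helper names and the toy solution are hypothetical, and the repository's test_humaneval.py may produce its output differently.

```python
import json


def write_samples(path, samples):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


def read_samples(path):
    """Read a JSON Lines file and check the fields EvalPlus-style files carry."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Tolerate blank lines.
            record = json.loads(line)
            if "task_id" not in record:
                raise ValueError(f"line {line_no}: missing 'task_id'")
            if "solution" not in record and "completion" not in record:
                raise ValueError(f"line {line_no}: missing 'solution' or 'completion'")
            samples.append(record)
    return samples


if __name__ == "__main__":
    # Toy example; the solution text is made up for illustration.
    demo = [{"task_id": "HumanEval/0", "solution": "def f():\n    return 1\n"}]
    write_samples("demo_samples.jsonl", demo)
    print(read_samples("demo_samples.jsonl"))
```

Checking the file this way before handing it to the evaluator can catch truncated or malformed lines early.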
NOTE:
- Don't forget to use evalplus's evalplus.sanitize to post-process the code.
- If you don't use greedy decoding for code generation (for example, if you set do_sample=True), you will probably see different results.
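The effect of the decoding setting can be illustrated with a toy example. With greedy decoding, the most likely token is chosen at every step, so repeated runs give the same output; with sampling, tokens are drawn at random in proportion to their probability, so outputs (and benchmark scores) can vary between runs. The probabilities and function names below are made up for illustration and have nothing to do with the model's real distribution.

```python
import random

# Hypothetical next-token probabilities, made up for illustration.
NEXT_TOKEN_PROBS = {"return": 0.6, "yield": 0.3, "pass": 0.1}


def greedy_pick(probs):
    """Greedy decoding: always take the most likely token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    # Greedy picks are identical on every call.
    print([greedy_pick(NEXT_TOKEN_PROBS) for _ in range(5)])
    # Sampled picks can change from call to call and from run to run.
    print([sample_pick(NEXT_TOKEN_PROBS, rng) for _ in range(5)])
```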
- Test on MBPP: 82.5% on base, 70.6% on base + extra.
python test_mbpp.py
Post-process the output to remove the natural-language text before testing:
python postprocess_mbpp.py
After this step, you will get an AutoCoder_Mbpp+-sanitized.jsonl file containing all the extracted code blocks.
Then test it directly with the EvalPlus GitHub repository (this time you don't need to use evalplus's evalplus.sanitize
to post-process the code).
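The post-processing step described above keeps only the code from each model response. A minimal sketch of that idea follows, assuming the model wraps its code in Markdown-style triple-backtick fences. This is not the repository's postprocess_mbpp.py, and the function name `extract_code_blocks` is hypothetical.

```python
import re

# Build the fence marker programmatically so this example can itself live
# inside a fenced block.
FENCE = "`" * 3

# An opening fence, an optional info string (e.g. "python"), then the body
# up to the next closing fence.
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(response: str) -> list:
    """Return the contents of every fenced block, dropping surrounding prose."""
    return [block.strip() for block in CODE_BLOCK.findall(response)]


if __name__ == "__main__":
    reply = (
        "Here is the function:\n"
        + FENCE + "python\n"
        + "def add(a, b):\n    return a + b\n"
        + FENCE + "\n"
        + "Let me know if you need tests."
    )
    print(extract_code_blocks(reply))
```

A response with no fenced block yields an empty list, so a caller would need to decide whether to fall back to the raw text in that case.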
- Test on DS-1000.
python test_ds1000.py
After this step, you will get a jsonl file containing all the extracted code blocks. Then test it directly with the DS-1000 GitHub repository.
- Web demo (includes the code interpreter)
Install the Gradio-related packages:
cd Web_demo
pip install -r requirements.txt
Run it:
python chatbot.py
NOTE:
- Currently the model will only start the code interpreter if you ask it to verify its code. I am still fine-tuning it on an instruction dataset, which will give it the ability to start the code interpreter when the user asks to run code. I will update the model when this is finished.
- We suggest setting do_sample = True (the default setting here) while using the code interpreter.
If you have any inquiries, please feel free to raise an issue or reach out to [email protected].
@misc{lei2024autocoder,
title={AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}},
author={Bin Lei and Yuchen Li and Qiuwu Chen},
year={2024},
eprint={2405.14906},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
Thanks to Tianyu Zheng, the first author of OpenCodeInterpreter, for guidance on some technical details.