
AutoCoder

Introduction

We introduce a new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024): 90.9% vs. 90.2%.

Additionally, compared with previous open-source models, AutoCoder offers a new feature: whenever the user wishes to execute the code, it can automatically install the required packages and attempt to run the code until it deems there are no issues.
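As a rough illustration of how such an install-and-retry loop can work (a minimal sketch, not AutoCoder's actual implementation; the error parsing, retry limit, and package-name handling are simplifying assumptions):

import re
import subprocess
import sys

def run_with_auto_install(script_path, max_retries=3):
    """Run a script; pip-install any missing module and retry."""
    for _ in range(max_retries):
        result = subprocess.run([sys.executable, script_path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout  # ran cleanly
        # Look for "ModuleNotFoundError: No module named 'xyz'" in stderr.
        match = re.search(r"No module named '([\w.]+)'", result.stderr)
        if match is None:
            return result.stderr  # a non-import failure: give up
        # NOTE: assumes the pip package name matches the module name,
        # which is not always true (e.g. cv2 -> opencv-python).
        missing = match.group(1).split(".")[0]
        subprocess.run([sys.executable, "-m", "pip", "install", missing])
    return result.stderr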

  • Difference between the code interpreter of AutoCoder and that of GPT-4 Turbo:

Below are the video demos comparing the code interpreters of GPT-4 Turbo and AutoCoder:

GPT-4o cannot access external libraries.

AutoCoder-gpt4o.mp4

AutoCoder can automatically install the required packages. This feature expands the scope of the code interpreter's applications.

AutoCoder_demo.mp4
  • Difference between the code interpreter of AutoCoder and the open-source code interpreter OpenCodeInterpreter:

The code interpreter of AutoCoder, like that of GPT-4 Turbo, is invoked only when the user needs to verify the code, whereas OpenCodeInterpreter runs all generated Python code.

Model

The model is available on Hugging Face: AutoCoder (33B) and AutoCoder-S (6.7B).

The base model is deepseek-coder.
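For example, the model can be loaded with the Hugging Face transformers library (a minimal sketch; the model id Bin12345/AutoCoder is assumed to match the links above, and the prompt format is an assumption, so check the model card for exact usage):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Bin12345/AutoCoder"  # assumed from the Hugging Face links above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) matches the benchmark settings below.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))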

Quick Start

  1. Create the conda environment:
conda create -n AutoCoder python=3.11
conda activate AutoCoder
pip install -r requirements.txt
  2. Test on HumanEval: 90.9% on base, 78.0% on base + extra.
cd Evaluation
python test_humaneval.py

After this step, you will get a file named AutoCoder_HumanEval+.jsonl, which follows the EvalPlus format.

Then follow the testing framework in the EvalPlus GitHub repository to see the results.

NOTE:

  • Don't forget to use EvalPlus's evalplus.sanitize to post-process the code (see the example after this list).
  • If you do not use greedy decoding for code generation (for example, if you set do_sample=True), you will probably see different results.
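For reference, the sanitize and evaluation steps with the EvalPlus pip package typically look like the following (the exact CLI is taken from the EvalPlus docs and may change, so check their README):

pip install evalplus
evalplus.sanitize --samples AutoCoder_HumanEval+.jsonl
evalplus.evaluate --dataset humaneval --samples AutoCoder_HumanEval+-sanitized.jsonl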
  3. Test on MBPP: 82.5% on base, 70.6% on base + extra.
python test_mbpp.py

Post-process to remove the natural language before testing:

python postprocess_mbpp.py

After this step you will get an AutoCoder_Mbpp+-sanitized.jsonl file with all the code blocks extracted. Then test it directly with the EvalPlus GitHub repository (you don't need to use evalplus.sanitize to post-process the code this time).
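For example (again assuming the EvalPlus CLI shown above):

evalplus.evaluate --dataset mbpp --samples AutoCoder_Mbpp+-sanitized.jsonl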

  4. Test on DS-1000.
python test_ds1000.py

After this step you will get a .jsonl file with all the code blocks extracted. Then test it directly with the DS-1000 GitHub repository.

  5. Web demo (includes the code interpreter)

Install the Gradio-related packages:

cd Web_demo
pip install -r requirements.txt

Run it:

python chatbot.py

NOTE:

  • Currently, the model starts the code interpreter only if you ask it to verify its code. I am still fine-tuning it on an instruction dataset that will give it the ability to invoke the code interpreter whenever a user asks to run code. I will update the model when that is finished.

  • We suggest setting do_sample=True (the default setting here) while using the code interpreter.
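For instance, with the transformers library a sampling setup might look like this (a sketch; the temperature and top_p values are arbitrary assumptions, and the Web_demo defaults may differ):

from transformers import GenerationConfig

# do_sample=True enables sampling; the specific values below are assumptions.
gen_config = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.95)
# Then pass it to generation: model.generate(**inputs, generation_config=gen_config)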

Contact

If you have any inquiries, please feel free to raise an issue or reach out to [email protected].

Citation

@misc{lei2024autocoder,
      title={AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}}, 
      author={Bin Lei and Yuchen Li and Qiuwu Chen},
      year={2024},
      eprint={2405.14906},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Acknowledgments

Thanks to Tianyu Zheng, the first author of OpenCodeInterpreter, for guidance on some technical details.
