
CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

A PyTorch implementation of the paper CURE: Code-Aware Neural Machine Translation for Automatic Program Repair.

File Structure

  • results: This folder contains all the bugs in both the Defects4J and QuixBugs benchmarks that CURE fixed. Each file contains the buggy line, CURE's patch, and the developer's patch.
  • candidate_patches: This folder contains all the candidate patches CURE generated for bugs in each benchmark
  • data: This folder contains the vocabulary file, subword tokenizer, some training data examples, and the GPT PL model pre-trained on code.
    • vocabulary
      • subword.txt: the subword tokenizer model needed by subword-nmt
      • vocabulary.txt: the vocabulary file used in CURE's paper
    • models: This folder is used to save the models
      • code_gpt.pt: the saved GPT PL model pre-trained on code
    • patches: This folder is used to save the generated patches
      • gpt_conut_1.txt: an example file that contains the candidate patches generated by a GPT-CoNuT model, including 100 patches for each QuixBugs bug.
      • gpt_fconv_1.txt: an example file that contains the candidate patches generated by a GPT-FConv model, including 100 patches for each QuixBugs bug.
    • data: This folder is used to save the training data and validation data
      • CURE uses the source code training data shared by the previous work CoCoNuT
  • src: This folder includes the source code for CURE's APR model

Dependencies

  • Python 3.8
  • PyTorch 1.4.0
  • NumPy 1.18.1
  • Huggingface transformers 2.10.0
  • subword-nmt

Usage

To train a GPT-CoNuT model, run src/trainer/gpt_conut_trainer.py. Some settings you may need to change (a sketch follows this list):

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/
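
The configuration lives in the trainer script itself. As a minimal sketch, assuming the settings are plain variables near the script's entry point (the training and validation file names below are made-up placeholders):

    # Hypothetical settings in src/trainer/gpt_conut_trainer.py; the actual
    # variable layout in the script may differ.
    vocab_file = 'data/vocabulary/vocabulary.txt'  # vocabulary used in CURE's paper
    train_file = 'data/data/training_bpe.txt'      # made-up training-data path
    valid_file = 'data/data/validation_bpe.txt'    # made-up validation-data path
    gpt_file = 'data/models/code_gpt.pt'           # the saved GPT PL model
    save_dir = 'data/models/'                      # where checkpoints are written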

To train a GPT-FConv model, run src/trainer/gpt_fconv_trainer.py. Some settings you may need to change (a hyper-parameter sketch follows this list):

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/
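
The path settings are edited the same way as for GPT-CoNuT. For hyper_parameter, a minimal sketch of such a grouping (key names and values are illustrative, not CURE's actual defaults):

    # Hypothetical hyper-parameter grouping; keys and values are illustrative only.
    hyper_parameter = {
        'encoder_layers': 4,   # number of encoder layers
        'decoder_layers': 4,   # number of decoder layers
        'embed_dim': 512,      # token embedding dimension
        'dropout': 0.1,        # dropout rate
    }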

To prepare input for new test data, see data/data/prepare_testing_data.py; make sure you read the readme file and follow the three steps to prepare the test input.

To generate patches, run src/tester/generator.py. Some settings you may need to change (an input-format sketch follows this list):

  • vocab_file: the path to the vocabulary file used by the model
  • input_file: the input data to the model for generating patches, with each line describing one bug in the following format: buggy line <CTX> surrounding function. See candidate_patches/QuixBugs/quixbugs_bpe.txt for reference.
  • identifier_txt_file: the valid identifiers for each bug, with each line being a list of valid identifiers separated by spaces. See candidate_patches/QuixBugs/identifier.txt for reference.
  • identifier_token_file: the tokenized identifiers for each bug, with each line being a list of valid identifiers tokenized by camel case, underscores, and subwords, separated by \t. See candidate_patches/QuixBugs/identifier.tokens for reference.
  • output_file: the path to the output result
  • beam_size: the number of candidate patches generated by each model
  • model_file: the path to the saved APR model
  • CURE's trained models are available at https://zenodo.org/record/7030145#.YwvXfFvMI5l
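
For illustration, a minimal sketch of writing one bug to input_file in the format above (the buggy line and surrounding function are made-up placeholders; real inputs are subword-tokenized, so see quixbugs_bpe.txt for actual examples):

    # Hypothetical example: each input line is 'buggy line <CTX> surrounding function'.
    buggy_line = 'while ( lo < hi ) {'
    context = 'public static int bisect ( int lo , int hi ) { while ( lo < hi ) { ... } }'
    with open('data/data/input_bpe.txt', 'w') as f:  # made-up input_file path
        f.write(buggy_line + ' <CTX> ' + context + '\n')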

data/patches/gpt_conut_1.txt and data/patches/gpt_fconv_1.txt are example candidate patches generated by the GPT-CoNuT and GPT-FConv models for the QuixBugs benchmark.

To validate the candidate patches generated by the models, first run src/validation/rerank.py, which reranks the patches generated by all the models and dumps the result into data/patches/reranked_patches.json. Then run src/validation/validate_quixbugs.py or src/validation/validate_defects4j.py, which runs the unit test cases (provided by QuixBugs or Defects4J) to validate the candidate patches. The final result is dumped into data/patches/validated_patches.json.
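
As a rough sketch of inspecting the reranked output, assuming the JSON maps each bug id to a best-first list of candidate patch strings (the actual layout of reranked_patches.json may differ):

    import json

    # Assumed structure: {bug_id: [patch_1, patch_2, ...]}, ranked best-first.
    with open('data/patches/reranked_patches.json') as f:
        reranked = json.load(f)
    for bug_id, patches in list(reranked.items())[:3]:
        print(bug_id, '->', patches[0])  # top-ranked candidate for each bug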

If you use CURE for academic purposes, please cite:

@inproceedings{jiang2021cure,
  author={Jiang, Nan and Lutellier, Thibaud and Tan, Lin},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)}, 
  title={CURE: Code-Aware Neural Machine Translation for Automatic Program Repair}, 
  year={2021},
  pages={1161-1173},
  doi={10.1109/ICSE43902.2021.00107}
}
