This is the code repository for the project "CodeRAG-Bench: Can Retrieval Augment Code Generation?".
Create a new environment:
conda create -n crag python=3.10 -y
conda activate crag
And install the necessary libraries:
pip install -r requirements.txt
- Retrieval: Code to run retrieval, with BM25, dense retrievers via sentence-transformers, and proprietary API embeddings.
- Generation: Code to run model generation and execution-based evaluation.
- Preprocess: Code to preprocess raw data for retrieval pool construction; see inside the directory for details.
cd retrieval/
Before running retrieval on a dataset, you need to create its datastore as follows:
python create/${data_name}.py
# choices for ${data_name}
# basic programming: 'humaneval', 'mbpp', 'live_code_bench'
# open-domain: 'ds1000', 'odex'
# repository-level: 'repoeval_repo', 'swebench_repo'
To run a new dataset, simply reformat your data into the official BEIR format: a dataset directory containing three files, corpus.jsonl, queries.jsonl, and qrels/test.tsv.
See code_search_net.py for an example script that converts CodeSearchNet into a document-to-code retrieval task.
Make sure your resulting dataset is placed under retrieval/datasets.
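For reference, here is a minimal sketch of the three files in BEIR format; the dataset name my_dataset and the example contents are purely illustrative:
import csv
import json
import os

# Hypothetical dataset directory; replace my_dataset with your own dataset name.
root = "retrieval/datasets/my_dataset"
os.makedirs(os.path.join(root, "qrels"), exist_ok=True)

# corpus.jsonl: one document per line with an id, title, and text.
with open(os.path.join(root, "corpus.jsonl"), "w") as f:
    f.write(json.dumps({"_id": "doc0", "title": "", "text": "def add(a, b):\n    return a + b"}) + "\n")

# queries.jsonl: one query per line.
with open(os.path.join(root, "queries.jsonl"), "w") as f:
    f.write(json.dumps({"_id": "q0", "text": "Add two numbers."}) + "\n")

# qrels/test.tsv: tab-separated relevance judgments (query-id, corpus-id, score).
with open(os.path.join(root, "qrels", "test.tsv"), "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerow(["q0", "doc0", 1])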
Run your embedding models by loading them from sentence-transformers as follows:
python3 eval_beir_sbert_canonical.py \
--model YOUR_MODEL_NAME_OR_PATH \
--dataset TASK_NAME \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE
By specifying the output file name --output_file, you can save the retrieval scores as a JSON file, for example:
{'ndcg': {'NDCG@1': 0.61667, 'NDCG@3': 0.68203, 'NDCG@5': 0.70804, 'NDCG@10': 0.72701, 'NDCG@100': 0.74926, 'NDCG@1000': 0.75551}, 'mrr': {'MRR@1': 0.61667, 'MRR@3': 0.67278, 'MRR@5': 0.68611, 'MRR@10': 0.69368, 'MRR@100': 0.69721, 'MRR@1000': 0.69744}, 'recall': {'Recall@1': 0.58817, 'Recall@3': 0.728, 'Recall@5': 0.79294, 'Recall@10': 0.84789, 'Recall@100': 0.95, 'Recall@1000': 0.99667}, 'precision': {'P@1': 0.61667, 'P@3': 0.26444, 'P@5': 0.176, 'P@10': 0.09533, 'P@100': 0.01077, 'P@1000': 0.00113}}
--results_file specifies the file in which to store the retrieval results, which will be used in the subsequent RAG evaluations.
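To inspect the saved scores programmatically, you can load the score file directly; a minimal sketch, assuming the score file is the JSON dictionary shown above:
import json

# Load the score file written via --output_file.
with open("PATH_TO_YOUR_SCORE_FILE") as f:
    scores = json.load(f)

print(scores["ndcg"]["NDCG@10"], scores["recall"]["Recall@10"])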
For open retrieval, you can load a corpus file from our huggingface space (see the list of the corpora) and generate embeddings using a single GPU or multiple GPUs.
- Generate embeddings using a single GPU
python generate_embeddings.py \
--model YOUR_MODEL_NAME_OR_PATH \
--output_dir OUTPUT_EMBEDDING_DIR \
--hf_datasets HF_DATASET_NAME \
--shard_id 0 \
--num_shards 1
- Generate embeddings using multiple GPUs (e.g., 8)
for i in {0..7}; do
export CUDA_VISIBLE_DEVICES=${i}
nohup python generate_embeddings.py --model_name_or_path YOUR_MODEL_NAME_OR_PATH \
--output_dir OUTPUT_EMBEDDING_DIR \
--hf_datasets HF_DATASET_NAME \
--shard_id ${i} --num_shards 8 > ./log/embeddings_logs.${i} 2>&1 &
done
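Each shard processes a contiguous slice of the corpus; conceptually, the split works along these lines (a simplified sketch, not the exact logic of generate_embeddings.py):
def shard_slice(corpus, shard_id, num_shards):
    """Return the contiguous slice of the corpus handled by one shard."""
    shard_size = (len(corpus) + num_shards - 1) // num_shards  # ceiling division
    start = shard_id * shard_size
    return corpus[start:start + shard_size]

# Example: 8 shards over a 100-document corpus cover every document exactly once.
docs = [f"doc_{i}" for i in range(100)]
assert sum(len(shard_slice(docs, i, 8)) for i in range(8)) == len(docs)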
Now you can load generated embeddings to run open retrieval for a target dataset.
python3 eval_beir_sbert_open.py \
--model avsolatorio/GIST-large-Embedding-v0 \
--embdding_path "OUTPUT_EMBEDDING_DIR/*" \
--dataset DATASET_NAME \
--hf_dataset HF_DATASET_NAME \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE
Start with a fresh environment with Python 3.10:
# install pyserini
pip install pyserini==0.25.0
# install openjdk-11 and maven (if you don't have any)
conda install -c conda-forge openjdk=11 maven -y
For more information on installing pyserini, please refer to the pyserini installation guide.
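As a quick sanity check that pyserini and openjdk-11 are wired up correctly, the import below should succeed; this is a minimal check only and does not build or query any index:
# This import loads the Java bindings, so it fails quickly if the JDK is misconfigured.
from pyserini.search.lucene import LuceneSearcher

print("pyserini is ready")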
We provide a convenient meta script to navigate your experiments:
Preprocess the corpus files of the existing datasets into a pyserini-indexable format. For each dataset, the modified corpus will be saved in OUTPUT_DIR/{DATASET_NAME}_corpus/edit.jsonl:
python3 modify_corpus_for_bm25.py \
--dataset DATASET_NAME ("all" to run on all datasets) \
--output_metadir OUTPUT_DIR \
--stage preprocess
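For reference, pyserini's JSON collection format expects an id and a contents field per document; the preprocessing stage performs a conversion along these lines (a simplified sketch of the idea, not the exact script):
import json

# Convert a BEIR-style corpus.jsonl into pyserini's indexable JSON format.
with open("corpus.jsonl") as fin, open("edit.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        pyserini_doc = {
            "id": doc["_id"],
            "contents": (doc.get("title", "") + "\n" + doc["text"]).strip(),
        }
        fout.write(json.dumps(pyserini_doc) + "\n")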
Index the corpus from OUTPUT_DIR/{DATASET_NAME}_corpus/edit.jsonl; the index will be saved in INDEX_DIR/{DATASET_NAME}_corpus/:
python3 modify_corpus_for_bm25.py \
--dataset DATASET_NAME ("all" to run on all datasets) \
--output_metadir OUTPUT_DIR \
--index_dir INDEX_DIR \
--stage index
Search the queries from the target dataset using BM25:
python3 modify_corpus_for_bm25.py \
--dataset DATASET_NAME ("all" to run on all datasets) \
--output_metadir OUTPUT_DIR \
--index_dir INDEX_DIR \
--top_k TOP_K \
--k1 K1 \
--b B \
--stage search
The score file will be saved in results/{DATASET_NAME}_k1={K1}_b={B}_pyserini_bm25_output.jsonl, and the retrieval results in results/{DATASET_NAME}_k1={K1}_b={B}_pyserini_bm25.jsonl.
For your convenience, you can run all the stages at once by passing --stage all and the corresponding parameters.
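If you want to query a built index directly outside the meta script, pyserini's LuceneSearcher can load it. A minimal sketch; the index path follows the layout above (humaneval_corpus is only an example), and the k1/b values and query are illustrative:
from pyserini.search.lucene import LuceneSearcher

# Load a BM25 index produced by the indexing stage (example path).
searcher = LuceneSearcher("INDEX_DIR/humaneval_corpus/")
searcher.set_bm25(k1=1.2, b=0.75)  # example parameter values

hits = searcher.search("return the sum of two integers", k=10)
for hit in hits:
    print(hit.docid, hit.score)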
For repository-level datasets, we preprocess, index, and search each instance's corpus separately:
python eval_beir_pyserini_repo.py \
--dataset DATASET_NAME \
--output_metadir OUTPUT_DIR \
--index_dir INDEX_DIR \
--top_k TOP_K \
--k1 K1 \
--b B \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE
The modified corpus will be saved in OUTPUT_DIR/{DATASET_NAME}_corpus/{instance dir}/edit.jsonl, and the index in INDEX_DIR/{DATASET_NAME}_corpus/{instance dir}.
Datasets other than swe-bench-lite are supported by this meta script:
python3 eval_corpora_ablations.py \
--model bm25 \
--dataset DATASET_NAME ("all" to run on all datasets) \
--corpus CORPUS_NAME ("all" to run on all corpora) \
--output_metadir OUTPUT_DIR \
--index_dir INDEX_DIR \
--top_k TOP_K \
--k1 K1 \
--b B \
--stage {preprocess, index, search, all}
The modified corpus will be saved in OUTPUT_DIR/{CORPUS_NAME}_corpus/edit.jsonl, and the index in INDEX_DIR/{CORPUS_NAME}_corpus/.
The search results will be saved in results/{DATASET_NAME}_k1={K1}_b={B}_pyserini_bm25_corpus, with the score file in {DATASET_NAME}_corpus={CORPUS_NAME}_k1={K1}_b={B}_pyserini_bm25_output.jsonl and the retrieval results in {DATASET_NAME}_corpus={CORPUS_NAME}_k1={K1}_b={B}_pyserini_bm25.jsonl.
- For non-repository-level datasets: run your API-based models by computing embeddings from the proprietary APIs as follows:
# voyage.ai
python3 eval_voyage.py \
--dataset TASK_NAME \
--model MODEL_NAME (default is voyage-code-2) \
--api_key_fp PATH_TO_YOUR_API_KEY_FILE (need to have a new line) \
--batch_size YOUR_BATCH_SIZE \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE

# openai
python3 eval_openai.py \
--dataset TASK_NAME \
--model MODEL_NAME (default is text-embedding-3-small) \
--api_key_fp PATH_TO_YOUR_API_KEY_FILE (need to have a new line) \
--batch_size YOUR_BATCH_SIZE \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE
--run_async can be used to run the retrieval asynchronously.
The default behavior is to cache and reuse the generated document embeddings and document ids: for voyage in datasets/{dataset}/voyage_doc_embeddings.npy and datasets/{dataset}/voyage_doc_ids.json, and for openai in datasets/{dataset}/doc_embeddings.npy and datasets/{dataset}/doc_ids.json.
Query embeddings and query-to-id mappings are also cached: for voyage in datasets/{dataset}/voyage_query_embeddings.npy and datasets/{dataset}/voyage_queryidx2truncatedidx.json, and for openai in datasets/{dataset}/query_embeddings.npy and datasets/{dataset}/queryidx2truncatedidx.json.
Please erase these files if you would like a fresh start.
- For repository-level datasets:
python3 eval_api_repo.py \
--dataset TASK_NAME \
--model MODEL_NAME \
--api_key_fp PATH_TO_YOUR_API_KEY_FILE (need to have a new line) \
--batch_size YOUR_BATCH_SIZE \
--output_file PATH_TO_YOUR_SCORE_FILE \
--results_file PATH_TO_YOUR_RETRIEVAL_RESULTS_FILE
--run_async can be used to run the retrieval asynchronously. Query embeddings, query-to-id mappings, document embeddings, and document ids will be saved in datasets/{dataset}/{instance dir}, where {instance dir} corresponds to one specific instance of the dataset.
Datasets other than swe-bench-lite are supported.
Add --corpus_path THE_PATH_TO_YOUR_CORPUS_FILE (we expect an edit.jsonl in a separate directory; you can reuse the one processed for BM25) to the above commands.
Query embeddings and query-to-id mappings will be cached in the same manner as above, while document embeddings and document ids will be cached for voyage in THE_PARENT_DIR_OF_CORPUS_PATH/voyage_doc_embeddings.npy and THE_PARENT_DIR_OF_CORPUS_PATH/voyage_doc_ids.json, and for openai in THE_PARENT_DIR_OF_CORPUS_PATH/doc_embeddings.npy and THE_PARENT_DIR_OF_CORPUS_PATH/doc_ids.json.
Please erase them if you would like a fresh start.
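If you want to reuse the cached embeddings outside these scripts, retrieval reduces to a similarity search over the saved matrices. A minimal sketch, assuming the openai cache layout described above and (approximately) unit-normalized embeddings; the dataset name humaneval is only an example:
import json
import numpy as np

# Load the cached document and query embeddings (openai layout described above).
doc_emb = np.load("datasets/humaneval/doc_embeddings.npy")       # shape: (num_docs, dim)
doc_ids = json.load(open("datasets/humaneval/doc_ids.json"))     # list of document ids
query_emb = np.load("datasets/humaneval/query_embeddings.npy")   # shape: (num_queries, dim)

# Dot-product similarity, then take the top-10 documents for the first query.
scores = query_emb @ doc_emb.T
top_k = np.argsort(-scores[0])[:10]
print([doc_ids[i] for i in top_k])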
The main.py script supports running code generation with any model supported by huggingface or OpenAI.
To run no-retrieval generation on the original dataset, specify its huggingface dataset name in the dataset_path argument:
python main.py --task "humaneval" \
--model "bigcode/starcoder2-7b" \
--dataset_path "openai_humaneval" \
--allow_code_execution
Set --allow_code_execution to evaluate generations with code execution; this is required for all tasks.
Note that the task should align with the dataset_path. All available tasks are:
- basic programming: 'humaneval', 'mbpp', 'lcb' (for livecodebench)
- open domain: 'ds1000-all-completion', 'odex-en'
- repository level: 'repoeval-function', 'swebench-lite'
To run generation with previous retrieval results, e.g., "retrieval/humaneval/gist_large.json", specify the file as follows:
python main.py --task "humaneval" \
--model "bigcode/starcoder2-7b" \
--dataset_path "json" --data_files_test "retrieval/humaneval/gist_large.json" \
--allow_code_execution
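Conceptually, retrieval-augmented generation prepends the top retrieved documents to each task prompt before generation. The sketch below only illustrates the idea; the actual prompt construction is handled inside main.py via flags such as --topk_docs, and the function and argument names here are hypothetical:
def build_rag_prompt(prompt, retrieved_docs, topk_docs=5):
    """Illustrative only: prepend the top-k retrieved documents to the task prompt."""
    context = "\n\n".join(retrieved_docs[:topk_docs])
    return f"# Retrieved context:\n{context}\n\n{prompt}"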
Running the main.py script will automatically conduct execution-based evaluation after the generation is finished.
However, the RepoEval(-function) and SWE-bench(-Lite) datasets require additional setup due to their problem complexity.
After downloading the repositories for RepoEval (e.g., under retrieval/output/repoeval/repositories/function_level/), obtaining the retrieval results (e.g., /path/to/retrieval/results/retriever-name.jsonl), and obtaining the code generation outputs from the previous steps (e.g., /path/to/generation/outputs/model-name.json), we can run execution-based evaluation.
First, build an environment called repoeval. Note that you can use a separate environment to run the experiments; it will automatically switch to the repoeval environment when running execution.
cd generation
conda env create --file eval/tasks/custom_metrics/repoeval_environment.yml -n repoeval
Then, run the following command to test the environment. It runs the tests under all the repositories and checks whether the original code (i.e., the ground truth code) can pass all the test cases.
cd generation
PYTHONPATH=./ python eval/tasks/custom_metrics/repoeval_execution.py
If all the tests pass, you can evaluate your code generation outputs with pass@1.
cd generation
MODEL_NAME="model-name"
RETRIEVAL_FILE="/path/to/retrieval/results/retriever-name.jsonl"
GENERATION_OUTPUTS="/path/to/generation/outputs/model-name.json"
PYTHONPATH=./ python main.py --task "repoeval-function" --model $MODEL_NAME --dataset_path "json" \
--data_files_test $RETRIEVAL_FILE \
--metric_output_path results/repoeval-function_${PROMPT_NAME}_4k_${MODEL_SAVE_NAME}_evaluation_results.vllm.json \
--max_length_input 3596 --max_length_generation 4096 --precision auto \
--save_every_k_tasks 100 --ignore_eos --model_backend vllm --new_tokens_only --topk_docs 5 \
--allow_code_execution \
--load_generations_path $GENERATION_OUTPUTS
After obtaining the generation output file from the previous steps, e.g., /path/to/generation/outputs/model-name.json, run the following to transform it in place for the SWE-bench evaluation harness:
python generation/eval/tasks/custom_metrics/swebench_transform.py \
--output_path /path/to/generation/outputs/model-name.json
Then, start the SWE-bench evaluation Docker image provided by OpenDevin (beware: it is over 30GB in size):
docker run -it \
-v /path/to/generation/outputs:/swe_bench_output \
ghcr.io/opendevin/eval-swe-bench:full-v1.0 /bin/bash
Then run the evaluation inside the container to get the final evaluation stats:
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/swe_util/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_model_report.sh --output-file /swe_bench_output/model-name.json --model-name "model-name" --dataset swe-bench-test-lite
We also explored document chunking and reranking for better RACG.
To rerank the top-100 retrieved documents and potentially obtain better top-k (k << 100) documents, we use an existing reranker model:
python rerank.py --results_path ${retrieval_results_file}
This script will prompt you to choose the query field from a list of options available in the results_path file.
For example, on the humaneval retrieval results, it will prompt:
Choose query field from [dict_keys(['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'docs'])]:
You can type "prompt" and press Enter, and it will automatically finish the reranking.
For other datasets, use "text" for mbpp and livecodebench, "prompt" for ds-1000, "intent" for odex, "prompt" for repoeval, and "problem_statement" for swebench.
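Under the hood, reranking scores each (query, document) pair with a reranker model; below is a minimal sketch of the idea using a sentence-transformers cross-encoder. The model name and example texts are illustrative and not necessarily what rerank.py uses:
from sentence_transformers import CrossEncoder

# Example reranker model; swap in the one rerank.py actually uses.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Write a function that adds two numbers."
docs = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]

# Score every (query, document) pair and sort documents by descending score.
scores = reranker.predict([(query, d) for d in docs])
reranked = [d for _, d in sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)]
print(reranked[0])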
To take the first N-token chunk of each retrieved document and process the results into a new retrieval file:
python chunk.py --results_path ${retrieval_results_file} --max_num_tokens 500
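First-N-token chunking can be reproduced with any HuggingFace tokenizer; a minimal sketch (the gpt2 tokenizer here is only an example, not necessarily the one chunk.py uses):
from transformers import AutoTokenizer

def first_n_token_chunk(text, max_num_tokens=500, tokenizer_name="gpt2"):
    """Keep only the first max_num_tokens tokens of a retrieved document."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(token_ids[:max_num_tokens])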
We also support heuristic-based chunking specifically for library documentation, which keeps only the beginning textual descriptions. To use it, run:
python chunk.py --results_path ${retrieval_results_file} --is_docs