Collections of research, benchmarks and tools towards more robust and reliable language models for code; LM4Code; LM4SE; reliable LLM; LLM4Code

yueyueL/ReliableLM4Code

This repository builds on our recent work, "Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey" and "Large language models for software engineering: A systematic literature review". It includes supporting material for our research and a curated collection of LM4Code papers and other resources (datasets, tutorials, etc.). The focus is primarily on papers that use pre-trained models, especially large language models, to improve the reliability of language models in software engineering research.

For more details, please access this site

Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research focused on learning-based code intelligence, such as automated bug repair and test case generation. Despite their great potential, language models for code intelligence (LM4Code) are susceptible to pitfalls that hinder realistic performance and, in turn, reduce their reliability and applicability in real-world deployment. These challenges call for a comprehensive understanding: not just identifying the issues, but also examining their possible implications and existing solutions in order to build more reliable language models tailored to code intelligence. Following a well-defined systematic research approach, we conducted an extensive literature review to uncover the pitfalls inherent in LM4Code. In total, we identified 67 primary studies from top-tier venues. After carefully examining these studies, we designed a taxonomy of pitfalls in LM4Code research and conducted a systematic study to summarize the issues, implications, current solutions, and challenges of different pitfalls for LM4Code systems. We developed a comprehensive classification scheme that dissects pitfalls across four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. Through this study, we aim to provide a roadmap for researchers and practitioners, facilitating their understanding and utilization of LM4Code in reliable and trustworthy ways.

Please feel free to send a pull request to add papers and relevant content that are not listed here. We have uploaded our complete paper lists, with detailed review information, to Google Drive.

Papers

Data Collection and Labeling

Unbalanced Distribution

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), arxiv, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays! (2023), ICSE, X Yang, et al. [pdf]
  • On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
  • Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), arxiv, B Steenhoek, et al. [pdf]

Label Errors

  • Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
  • XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, et al. [pdf]
  • Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper) (2023), ISSTA, X Nie, et al. [pdf]

Data Noise

  • Slice-Based Code Change Representation Learning (2023), SANER, F Zhang, et al. [pdf]
  • Are we building on the rock? on the importance of data preprocessing for code summarization (2022), FSE, L Shi, et al. [pdf]
  • Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? (2018), ASE, Z Liu, et al. [pdf]

System Design and Learning

Data Snooping

  • AutoTransform: automated code transformation to support modern code review process (2022), ICSE, P Thongtanunam, C Pornprasit, C Tantithamthavorn. [pdf]
  • Can Neural Clone Detection Generalize to Unseen Functionalities? (2021), ASE, C Liu, et al. [pdf]
  • CD-VulD: Cross-Domain Vulnerability Discovery Based on Deep Domain Adaptation (2020), TDSC, S Liu, et al. [pdf]
  • Deep just-in-time defect prediction: how far are we? (2021), ISSTA, Z Zeng, et al. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, et al. [pdf]
  • Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models (2023), ICSE, S Gao, et al. [pdf]
  • Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
  • Syntax and Domain Aware Model for Unsupervised Program Translation (2023), ICSE, F Liu, J Li, L Zhang. [pdf]
  • How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
  • On the Evaluation of Neural Code Summarization (2022), ICSE, E Shi, Y Wang, L Du, et al. [pdf]

Spurious Correlations

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
  • Explaining mispredictions of machine learning models using rule induction (2021), FSE, J Cito, I Dillig, S Kim, et al. [pdf]
  • Interpreting Deep Learning-based Vulnerability Detector Predictions Based on Heuristic Searching (2021), TOSEM, D Zou, Y Zhu, S Xu, et al. [pdf]
  • Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code (2021), ASE, M Paltenghi, M Pradel. [pdf]
  • Vulnerability detection with fine-grained interpretations (2021), FSE, Y Li, S Wang, TN Nguyen. [pdf]
  • What do they capture? a structural analysis of pre-trained language models for source code (2022), ICSE, Y Wan, W Zhao, H Zhang, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
  • Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond (2023), ISSTA, E Shi, Y Wang, H Zhang, et al. [pdf]

Inappropriate Model Design

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking (2022), TSE, H Wang, P Ma, Y Yuan, et al. [pdf]
  • Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
  • Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
  • XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, G Li, J Zhang, et al. [pdf]
  • RepresentThemAll: A Universal Learning Representation of Bug Reports (2023), ICSE, S Fang, T Zhang, Y Tan, et al. [pdf]
  • Template-based Neural Program Repair (2023), ICSE, X Meng, X Wang, H Zhang, et al. [pdf]

Performance Evaluation

Inappropriate Baseline

  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), arxiv, Z Liu, K Liu, X Xia, et al. [pdf]

Inappropriate Evaluation Dataset

  • Deep Learning Based Program Generation From Requirements Text: Are We There Yet? (2020), TSE, H Liu, M Shen, J Zhu, et al. [pdf]
  • Generating realistic vulnerabilities via neural code editing: an empirical study (2022), FSE, Y Nong, Y Ou, M Pradel, et al. [pdf]

Low Reproducibility

  • An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]

Inappropriate Performance Measures

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
  • Multi-task learning based pre-trained language model for code completion (2020), ASE, F Liu, G Li, Y Zhao, et al. [pdf]
  • On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
  • Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
  • SynShine: Improved Fixing of Syntax Errors (2022), TSE, T Ahmed, NR Ledesma, P Devanbu. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
  • Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
  • Tare: Type-Aware Neural Program Repair (2023), ICSE, Q Zhu, Z Sun, W Zhang, et al. [pdf]
  • How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
  • GitHub Copilot AI pair programmer: Asset or Liability? (2023), JSS, AM Dakhel, V Majdinasab, A Nikanjam, et al. [pdf]

Deployment and Maintenance

Real-World Constraints

  • Examining Zero-Shot Vulnerability Repair with Large Language Models (2023), S&P, H Pearce, B Tan, B Ahmad, et al. [pdf]
  • A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices (2020), TIFS, R Feng, S Chen, X Xie, et al. [pdf]
  • Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
  • When Code Completion Fails: A Case Study on Real-World Completions (2019), ICSE, VJ Hellendoorn, S Proksch, HC Gall, et al. [pdf]
  • Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants (2023), arxiv, G Sandoval, H Pearce, T Nys, et al. [pdf]
  • Grounded Copilot: How Programmers Interact with Code-Generating Models (2023), OOPSLA1, S Barke, MB James, N Polikarpova. [pdf]
  • LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (2023), arxiv, J Lu, L Yu, X Li, et al. [pdf]
  • Compressing Pre-trained Models of Code into 3 MB (2022), ASE, J Shi, Z Yang, B Xu, et al. [pdf]

Attack Threats

  • You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion (2021), USENIX Security, R Schuster, C Song, E Tromer, et al. [pdf]
  • Adversarial Robustness of Deep Code Comment Generation (2022), TOSEM, Y Zhou, X Zhang, J Shen, et al. [pdf]
  • An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]
  • Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (2020), AAAI, H Zhang, Z Li, G Li, et al. [pdf]
  • Semantic Robustness of Models of Source Code (2020), SANER, G Ramakrishnan, J Henkel, Z Wang, et al. [pdf]
  • You see what I want you to see: poisoning vulnerabilities in neural code search (2022), FSE, Y Wan, S Zhang, H Zhang, et al. [pdf]
  • Contrabert: Enhancing code pre-trained models via contrastive learning (2023), ICSE, S Liu, B Wu, X Xie, et al. [pdf]
  • On the robustness of code generation techniques: An empirical study on github copilot (2023), ICSE, A Mastropaolo, L Pascarella, E Guglielmi, et al. [pdf]
  • Two sides of the same coin: Exploiting the impact of identifiers in neural code comprehension (2023), ICSE, S Gao, C Gao, C Wang, et al. [pdf]
  • Multi-target Backdoor Attacks for Code Pre-trained Models (2023), ACL, Y Li, S Liu, K Chen, et al. [pdf]
  • Backdooring Neural Code Search (2023), ACL, W Sun, Y Chen, G Tao, et al. [pdf]
  • ReCode: Robustness Evaluation of Code Generation Models (2022), ACL, S Wang, Z Li, H Qian, et al. [pdf]
  • Natural Attack for Pre-trained Models of Code (2022), ICSE, Z Yang, J Shi, J He, et al. [pdf]
  • Coprotector: Protect open-source code against unauthorized training usage with data poisoning (2022), WWW, Z Sun, X Du, F Song, et al. [pdf]
  • On the Security Vulnerabilities of Text-to-SQL Models (2023), ISSRE, X Peng, Y Zhang, J Yang, et al. [pdf]

Security Concerns in Generated Code

  • Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions (2022), S&P, H Pearce, B Ahmad, B Tan, et al. [pdf]
  • Automated repair of programs from large language models (2023), ICSE, Z Fan, X Gao, M Mirchev, et al. [pdf]
  • Cctest: Testing and repairing code completion systems (2023), ICSE, Z Li, C Wang, Z Liu, et al. [pdf]
  • Analyzing Leakage of Personally Identifiable Information in Language Models (2023), S&P, N Lukas, A Salem, R Sim, et al. [pdf]
  • CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot (2023), USENIX Security, L Niu, S Mirza, Z Maradni, et al. [pdf]

Language Models for Code Intelligence

Decoder-only Models

GPT-1

GPT-2

GPT-3

Codex

GPT-NeoX

GPT-Neo

  • Release Date: 2021-03
  • Source: GitHub

CodeGen

InstructGPT

CodeGeeX

  • Title: CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X
  • Year: 2023
  • Paper: Link

GPT-J

LLaMA

ChatGPT

StableLM-Alpha

InCoder

  • Title: InCoder: A Generative Model for Code Infilling and Synthesis
  • Authors: Daniel Fried et al.
  • Release Date: 2023
  • Paper: Link

GPT-4

WizardCoder

PanGu-Coder

OPT

StarCoder

SantaCoder

PaLM

Vicuna

  • Release Date: 2023-03
  • Blog: Link

Flan-UL2

CPM-Bee

MT-NLG

GLM

YaLM

  • Release Date: 2022-06
  • Institute: Yandex
  • Blog: YaLM Blog

Alpaca

  • Release Date: 2023-03
  • Institute: Stanford University
  • Access: Alpaca GitHub

RWKV-4

  • Release Date: 2022-09
  • Institute: Independent (BlinkDL)
  • Access: RWKV-4 GitHub

Sparrow

Falcon

  • Release Date: 2023-05
  • Institute: Technology Innovation Institute (TII)
  • Access: Falcon Homepage

Code Llama

RedPajama-INCITE

DeciCoder-1B

OpenLLaMA

CodeGPT

Encoder-only Models

BERT

ALBERT

RoBERTa

CodeBERT

GraphCodeBERT

Encoder-decoder Models

AlphaCode

  • Release Date: 2022-02
  • Access: AlphaCode
  • Institute: DeepMind

T5

CodeT5

CodeT5+

UniXcoder

PLBART

CodeReviewer

Relevant Surveys on LM4Code

  • Large Language Models for Software Engineering: Survey and Open Problems, 2023, paper
  • Large Language Models for Software Engineering: A Systematic Literature Review, 2023, paper
  • A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends, 2023, paper
  • Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code, 2023, paper
  • Software testing with large language model: Survey, landscape, and vision, 2023, paper
  • Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey, 2023, paper
  • Generative Artificial Intelligence for Software Engineering--A Research Agenda, 2023, paper
  • A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly, 2023, paper
  • Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps, 2023, paper
  • Large language models meet NL2Code: A survey, 2023, paper
  • A Survey on Pretrained Language Models for Neural Code Intelligence, 2022, paper

General Surveys on AI4SE

  • A systematic literature review on the use of deep learning in software engineering research, TOSEM 2022, paper
  • A survey on deep learning for software engineering, CSUR 2022, paper
  • Software engineering for AI-based systems: a survey, TOSEM 2021, paper
  • Machine/deep learning for software engineering: A systematic literature review, TSE 2022, paper
  • Machine Learning Applied to Software Testing: A Systematic Mapping Study, 2019, paper
  • A survey of machine learning for big code and naturalness, CSUR 2018, paper

General Surveys on LLM

  • Large Language Models: A Comprehensive Survey of Applications, Challenges, Limitations, and Future Prospects, 2023, paper
  • A survey of large language models, 2023, paper
  • A Survey on Evaluation of Large Language Models, 2023, paper
  • Recent advances in natural language processing via large pre-trained language models: A survey, CSUR 2023, paper
  • A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4, 2023, paper
  • Challenges and Applications of Large Language Models: A Survey, 2023, paper
  • Harnessing the power of llms in practice: A survey on chatgpt and beyond, 2023, paper
  • A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT, 2023, paper

Repositories and Resources for LM4Code

  • LLM4SE: Large Language Models for Software Engineering
    • Repository
    • This repository is associated with prominent software engineering conferences like ICSE, FSE, and ASE.
  • Awesome-Code-LLM
    • Repository
    • The repository for a survey that comprehensively reviews LLM research for code. Works in each category are ordered chronologically. It also serves as a curated list of language modeling research for code and related datasets.
  • awesome-ai4code-papers
    • Repository
    • A collection of recent papers, benchmarks and datasets of AI4Code domain.
  • ml4code
    • Repository
    • Research on machine learning for source code.
  • awesome-machine-learning-on-source-code
    • Repository
    • Cool links & research papers related to Machine Learning applied to source code (MLonCode)
  • saltudelft/ml4se
    • Repository
    • A curated list of papers, theses, datasets, and tools related to the application of Machine Learning for Software Engineering
  • CUHK-ARISE/ml4code-dataset
    • Repository
    • A collection of datasets for machine learning for big code

Repositories and Resources for LLM

  • Awesome-LLM4Tool: A Curated List of Resources for LLM Tools
    • Repository
    • Offers a curated list of papers, repositories, tutorials, and resources related to large language models for tools.
  • LLMsPracticalGuide: A Curated List of Practical Resources
    • Repository
    • It includes an evolutionary tree of modern Large Language Models to trace the development over the years
  • Hannibal046/Awesome-LLM
    • Repository
    • Awesome-LLM: a curated list of Large Language Models
  • awesome-decentralized-llm
    • Repository
    • Collection of LLM resources that can be used to build products you can "own" or to perform reproducible research.
  • RUCAIBox/LLMSurvey
    • Repository
    • The official GitHub page for the survey paper "A Survey of Large Language Models".
  • tensorchord/Awesome-LLMOps
    • Repository
    • An awesome & curated list of best LLMOps tools for developers
  • luban-agi/Awesome-Domain-LLM
    • Repository
    • A curated list of domain-specific large language models in Chinese
  • underlines/awesome-ml
    • Repository
    • Curated list of useful LLM / Analytics / Datascience resources

Benchmarks

Bug Repair

Defects4J

ManyBugs/IntroClass

BugAID

CoCoNut

QuixBugs

Bugs.jar

BugsInPy

DeepFix

Code Generation/Synthesis

CONCODE

HumanEval

MBPP/MathQA-Python

Code Summarization

CODE-NN

TL-CodeSum

CodeSearchNet

Citation

If you find this repository useful, please cite our survey papers:

@article{she2023pitfalls,
  title={Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey},
  author={She, Xinyu and Liu, Yue and Zhao, Yanjie and He, Yiling and Li, Li and Tantithamthavorn, Chakkrit and Qin, Zhan and Wang, Haoyu},
  journal={arXiv preprint arXiv:2310.17903},
  year={2023}
}

@article{hou2023large,
  title={Large language models for software engineering: A systematic literature review},
  author={Hou, Xinyi and Zhao, Yanjie and Liu, Yue and Yang, Zhou and Wang, Kailong and Li, Li and Luo, Xiapu and Lo, David and Grundy, John and Wang, Haoyu},
  journal={arXiv preprint arXiv:2308.10620},
  year={2023}
}
