Code Sentence Embedding repository for submission at IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 21)

martin-wey/code-sentence-embeddings-saner

Code Sentence Embedding - https://arxiv.org/abs/2008.03731

Code Sentence Embedding (CSE) repository for the paper Combining Code Embedding and Static Analysis for Function-Call Completion submitted at IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 21).

This repository contains all the necessary artifacts and instructions to replicate the experiments of the paper.

Dataset


Our dataset is based on Allamanis et al.'s 2013 GitHub Java Corpus (http://groups.inf.ed.ac.uk/cup/javaGithub/). We used Eclipse JDT Core to parse the ASTs of all .java files in this dataset and retrieve function-call sequences. We limited the scope of the sequences to method declarations: each sequence contains the name of the declared method followed by the function calls in the method body.

  • The dataset used to train our models is available here.
  • The dataset used to test our models is available here.

Experiment 1 - Naturalness of function calls


We used SLP-Core to compute the cross-entropy of our test projects for n-gram models of various orders with Jelinek-Mercer (JM) smoothing. The code is in the following Java project: https://github.com/mweyssow/cse-saner/tree/master/cse-slp.

The code requires the slp-core.jar file to be added as an external jar. It can then be run within an IDE.

The file cse/JavaEntropyRunner.java takes the path to the training set as its first program argument and the path to the test set as its second program argument. Use the _*sequences.txt files to compute the cross-entropy of a specific test project.

  • the variable cutOffValues can be altered to allow more cut-off values to be tested.
  • the variable modelOrderValues can be altered to consider other order values (in the paper, we chose model orders n=1..10).
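
To make the cross-entropy measurement above concrete, here is a minimal Python sketch of an n-gram model with Jelinek-Mercer interpolation. It is not the SLP-Core implementation: the function names, the fixed interpolation weight lam=0.5, and the uniform floor over the vocabulary plus one unknown slot are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(sequences, order):
    """Count n-grams of length 1..order and how often each context was followed."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        for i, token in enumerate(seq):
            for n in range(1, order + 1):
                start = i - n + 1
                if start < 0:
                    break  # not enough history for this order
                context = tuple(seq[start:i])
                ngrams[context + (token,)] += 1
                contexts[context] += 1
    return ngrams, contexts


def jm_prob(token, context, ngrams, contexts, vocab_size, lam=0.5):
    """Interpolate estimates from the shortest to the longest seen context."""
    # Assumed floor: uniform over the vocabulary plus one "unknown" slot.
    prob = 1.0 / (vocab_size + 1)
    for length in range(len(context) + 1):
        ctx = context[len(context) - length:]
        seen = contexts[ctx]
        if seen > 0:
            prob = lam * ngrams[ctx + (token,)] / seen + (1 - lam) * prob
    return prob


def cross_entropy(sequences, ngrams, contexts, vocab_size, order, lam=0.5):
    """Average number of bits per token under the interpolated model."""
    total_bits, n_tokens = 0.0, 0
    for seq in sequences:
        for i, token in enumerate(seq):
            context = tuple(seq[max(0, i - order + 1):i])
            p = jm_prob(token, context, ngrams, contexts, vocab_size, lam)
            total_bits -= math.log2(p)
            n_tokens += 1
    return total_bits / n_tokens
```

Under these assumptions, a lower cross-entropy on a test project means its function-call sequences are more predictable from the training set. Varying the order argument plays the role that modelOrderValues plays in the Java runner.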

Experiment 2 - Function-call completion with PV model


We used Gensim to train and evaluate paragraph vector (PV) models (i.e., doc2vec). All the code is implemented in Python 3 in the following project: https://github.com/mweyssow/cse-saner/tree/master/cse.

The only required dependency is Gensim. The code uses argparse so that it can be called from the terminal.

The file cse/d2v_train.py is used to train the model. python d2v_train.py --help can be called from the terminal to show all the available arguments. The training logs will be saved in a .log file and the model will be saved in a .bin file. Here is an example of usage:

  • python d2v_train.py --train_set='./data/plain_method_data.txt' --dm=1 --vector_dim=300 --window=8 --min_count=10 --epochs=20 --hs=1 --negative=5 --ns_exponent=0.75 --dbow_words=0
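
For readers unfamiliar with Gensim, the sketch below shows how command-line flags like the ones above could map onto Gensim's Doc2Vec parameters. It is a hypothetical reconstruction, not the contents of d2v_train.py: the function names, the default values, the flag-to-parameter mapping (in particular --vector_dim to vector_size), and the assumption that the training file holds one whitespace-separated sequence per line are all illustrative.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the flags in the usage example."""
    parser = argparse.ArgumentParser(description="Train a doc2vec model (sketch)")
    parser.add_argument("--train_set", required=True)
    parser.add_argument("--dm", type=int, default=1)
    parser.add_argument("--vector_dim", type=int, default=300)
    parser.add_argument("--window", type=int, default=8)
    parser.add_argument("--min_count", type=int, default=10)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--hs", type=int, default=1)
    parser.add_argument("--negative", type=int, default=5)
    parser.add_argument("--ns_exponent", type=float, default=0.75)
    parser.add_argument("--dbow_words", type=int, default=0)
    return parser


def doc2vec_kwargs(args):
    """Assumed mapping from CLI flags to Gensim Doc2Vec keyword arguments."""
    return dict(
        dm=args.dm,
        vector_size=args.vector_dim,
        window=args.window,
        min_count=args.min_count,
        epochs=args.epochs,
        hs=args.hs,
        negative=args.negative,
        ns_exponent=args.ns_exponent,
        dbow_words=args.dbow_words,
    )


def train(args):
    """Train on a file assumed to hold one whitespace-separated sequence per line."""
    # Imported lazily so the argument handling works without Gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    with open(args.train_set, encoding="utf-8") as handle:
        corpus = [
            TaggedDocument(words=line.split(), tags=[i])
            for i, line in enumerate(handle)
        ]
    model = Doc2Vec(**doc2vec_kwargs(args))
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

In this sketch, calling train(build_parser().parse_args()) would return a trained model that can then be stored with model.save(...).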

Once the model is trained, you can run the evaluation with the file cse/d2v_eval_eclipse.py. The script takes as arguments the path to the model .bin file, the path to the training set (JSON format) and the path to the test project (JSON format) to be evaluated.

  • the variable topk_range can be altered to evaluate the model for several suggestion lists of size k.
  • the script returns the Recall@k and MRR@k metrics.
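
As a reference for the two metrics, the following sketch computes Recall@k and MRR@k from ranked suggestion lists using their usual definitions. The function names and input layout are assumptions; the evaluation script may organise its data differently.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose expected call appears in the top-k suggestions."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank, counting a miss beyond position k as 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(targets)
```

Looping over several values of k, as topk_range does, shows how both metrics change as the suggestion list grows.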

Experiment 3 - Function-call completion with n-gram model


For this last experiment, we extended the SLP-Core library so that completion can be performed together with the static analysis.

The file cse/JavaFunctionPredictionRunner.java evaluates an n-gram model for function-call completion. You can choose one of the smoothing methods available in the SLP-Core library (here we use JM), change the vocabulary cut-off, or change the order of the n-gram model.

The file can be executed within an IDE (as for the first experiment).

  • the first program argument is the path to the training set.
  • the second program argument is the path to the test sequences (*sequences.txt in the test dataset).
  • the third program argument is the path to the static analysis proposals (_*proposals.txt in the test dataset).

As for the second experiment, the program returns the Recall@k and MRR@k for the project under test.
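
One plausible way to combine a language model with static analysis proposals is to keep only the candidates that the static analysis allows and rank them by model score. The sketch below illustrates that idea in Python; it is an assumption about the general approach, not the extended SLP-Core code, and the function name and data layout are hypothetical.

```python
def complete_with_proposals(scores, proposals, k):
    """Rank the candidates allowed by static analysis and return the top k.

    scores: dict mapping each candidate call to its model probability.
    proposals: set of calls that the static analysis considers valid here.
    """
    allowed = [(call, score) for call, score in scores.items() if call in proposals]
    # Highest score first; ties are broken alphabetically for determinism.
    allowed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [call for call, _ in allowed[:k]]
```

With this filtering, a call that the model rates highly but that is not valid at the completion site never reaches the suggestion list.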
