IR2Vec is an LLVM IR based framework that generates distributed representations of source code in an unsupervised manner. These representations can serve as inputs to machine learning tasks that operate on programs.
This repo contains the source code and relevant information described in the paper (arXiv). Please see here for more details.
IR2Vec: LLVM IR Based Scalable Program Embeddings, S. VenkataKeerthy, Rohit Aggarwal, Shalini Jain, Maunendra Sankar Desarkar, Ramakrishna Upadrasta, and Y. N. Srikant
- Requirements
- Binaries
- Building from Source
- How to generate IR2Vec program representations?
- Experiments
- Citation
- Contributions
- License
- cmake (>= 3.13.4)
- GNU Make (4.2.1)
- LLVM (8.0.1) - src, release
  - Support for the latest LLVM versions will be added soon
- Eigen library (3.3.7)
- Python (3.6.7)
- Other Python requirements
  - For training the vocabulary: listed in seed_embeddings/OpenKE/requirements.txt
  - For running experiments: listed in experiments/exp_requirements.yaml
  - A Conda/Anaconda based virtual environment is assumed
(Experiments are done on an Ubuntu 18.04 machine)
Binaries are autogenerated for every relevant check-in using GitHub Actions. The generated artifacts are attached to the successful runs of the Publish workflow and can be found here.
mkdir build && cd build
- IR2Vec uses the Eigen library. If your system already has Eigen (3.3.7) set up, you can skip this step.
- Download and extract the released version.
wget https://gitlab.com/libeigen/eigen/-/archive/3.3.7/eigen-3.3.7.tar.gz
tar -xvzf eigen-3.3.7.tar.gz
mkdir eigen-build && cd eigen-build
cmake ../eigen-3.3.7 && make
cd ../
cmake -DLT_LLVM_INSTALL_DIR=<path_to_LLVM_build_dir> -DEigen3_DIR=<path_to_eigen_build_dir> ../src
make
This process generates the ir2vec binary under the build/bin directory.
Generating embeddings for some programs may need more stack space, so run ulimit -s unlimited in each session, or add this command to your .bashrc.
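For scripts that launch ir2vec from Python, a similar effect can be approximated by raising the soft stack limit before starting the child process. The following is a minimal sketch, assuming a Unix-like system; the helper name raise_stack_limit is hypothetical and not part of IR2Vec. It raises the soft limit only as far as the hard limit, which is the most an unprivileged process is allowed to do.

```python
import resource


def raise_stack_limit():
    """Raise the soft stack limit to the hard limit (hypothetical helper).

    Child processes started afterwards (e.g. via subprocess) inherit the
    new limit. When the hard limit is unlimited, this has the same effect
    as running `ulimit -s unlimited` in the shell.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft != hard:
        # Lifting the soft limit up to the hard limit needs no privileges.
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)
```

Call raise_stack_limit() once before spawning ir2vec; the limit applies to processes started after the call.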
To verify the correctness of the build, run make verify-all.
ir2vec -<mode> -vocab <seedEmbedding-file-path> -o <output-file> -level <p|f> -class <class-number> <input-ll-file>
- mode - can be one of sym/fa
  - sym denotes Symbolic representation
  - fa denotes Flow-Aware representation
- vocab - the path to the seed embeddings file
- o - file to which the embeddings are appended (Note: if the file doesn't exist, a new file is created; otherwise the embeddings are appended to it)
- level - can be one of the characters p/f
  - p denotes program level encoding
  - f denotes function level encoding
- class - the only non-mandatory argument. Used to specify class labels for classification tasks (to be used with level p). Defaults to -1. When not equal to -1, the pass prints class-number followed by the corresponding embeddings
Please use --help for further details.
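When ir2vec is driven from a script, the options above can be assembled into an argument list programmatically. The sketch below is illustrative only: build_ir2vec_command is a hypothetical helper, out.txt and input.ll are placeholder file names, and it assumes the ir2vec binary is reachable on PATH when the command is eventually run.

```python
import shlex


def build_ir2vec_command(mode, vocab, output, level, input_ll, class_number=None):
    """Assemble the ir2vec argument list from the options (hypothetical helper)."""
    if mode not in ("sym", "fa"):
        raise ValueError("mode must be 'sym' or 'fa'")
    if level not in ("p", "f"):
        raise ValueError("level must be 'p' or 'f'")
    cmd = ["ir2vec", "-" + mode, "-vocab", vocab, "-o", output, "-level", level]
    if class_number is not None:
        # -class is the only optional argument; ir2vec defaults it to -1.
        cmd += ["-class", str(class_number)]
    cmd.append(input_ll)
    return cmd


# Placeholder file names; prints the command line without running it.
cmd = build_ir2vec_command(
    "fa",
    "vocabulary/seedEmbeddingVocab-300-llvm8.txt",
    "out.txt",
    "p",
    "input.ll",
    class_number=3,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to invoke the tool.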
Format of the output embeddings in output_file
- If the level is p:
  <class-number> <Embeddings>
  (class-number is printed only if it is not -1)
- If the level is f:
  <function-name> = <Embeddings>
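The two output formats can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: the embedding values are assumed to be whitespace-separated numbers, the function names parse_program_line and parse_function_line are hypothetical, and the sample lines are made up for illustration rather than taken from real ir2vec output.

```python
def parse_program_line(line, has_class):
    """Parse one program-level line: '[<class-number>] <Embeddings>'."""
    fields = line.split()
    if has_class:
        # The class label is present only when -class was not -1.
        return int(fields[0]), [float(x) for x in fields[1:]]
    return None, [float(x) for x in fields]


def parse_function_line(line):
    """Parse one function-level line: '<function-name> = <Embeddings>'."""
    # Split at the last '=' so that function names containing '=' stay intact.
    name, sep, values = line.rpartition("=")
    if not sep:
        raise ValueError("expected '<function-name> = <Embeddings>'")
    return name.strip(), [float(x) for x in values.split()]


# Made-up sample lines for illustration, not real ir2vec output.
print(parse_program_line("3 0.5 -1.25 2.0", has_class=True))
print(parse_function_line("main = 0.5 -1.25 2.0"))
```

The caller has to pass has_class explicitly because a program-level line does not itself indicate whether its first field is a label or the first embedding value.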
ir2vec -fa -vocab vocabulary/seedEmbeddingVocab-300-llvm8.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>
ir2vec -sym -vocab vocabulary/seedEmbeddingVocab-300-llvm8.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>
@article{VenkataKeerthy-2020-IR2Vec,
author = {VenkataKeerthy, S. and Aggarwal, Rohit and Jain, Shalini and Desarkar, Maunendra Sankar and Upadrasta, Ramakrishna and Srikant, Y. N.},
title = {{IR2Vec: LLVM IR Based Scalable Program Embeddings}},
year = {2020},
issue_date = {December 2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {17},
number = {4},
issn = {1544-3566},
url = {https://doi.org/10.1145/3418463},
doi = {10.1145/3418463},
journal = {ACM Trans. Archit. Code Optim.},
month = dec,
articleno = {32},
numpages = {27},
keywords = {heterogeneous systems, representation learning, compiler optimizations, LLVM, intermediate representations}
}
Please feel free to raise issues to file a bug, to pose a question, or to initiate any related discussions. Pull requests are welcome :)
IR2Vec is released under a BSD 4-Clause License. See the LICENSE file for more details.