
IR2Vec

IR2Vec is an LLVM IR based framework that generates distributed representations of source code in an unsupervised manner. These representations can be used as input to machine learning tasks that operate on programs.

This repo contains the source code and relevant information described in the paper (arXiv). Please see here for more details.

IR2Vec: LLVM IR Based Scalable Program Embeddings, S. VenkataKeerthy, Rohit Aggarwal, Shalini Jain, Maunendra Sankar Desarkar, Ramakrishna Upadrasta, and Y. N. Srikant

Requirements

(Experiments are done on an Ubuntu 18.04 machine)

Binaries - Artifacts

Binaries are generated automatically for every relevant check-in using GitHub Actions. The artifacts are attached to the successful runs of the Publish workflow and can be found here.

Building from source

  1. mkdir build && cd build
  2. IR2Vec uses the Eigen library. If your system already has Eigen (3.3.7) set up, you can skip this step.
    1. Download and extract the released version.
      • wget https://gitlab.com/libeigen/eigen/-/archive/3.3.7/eigen-3.3.7.tar.gz
      • tar -xvzf eigen-3.3.7.tar.gz
    2. mkdir eigen-build && cd eigen-build
    3. cmake ../eigen-3.3.7 && make
    4. cd ../
  3. cmake -DLT_LLVM_INSTALL_DIR=<path_to_LLVM_build_dir> -DEigen3_DIR=<path_to_eigen_build_dir> ../src
  4. make

This process generates the ir2vec binary under the build/bin directory.

Generating embeddings for some programs may need more stack space, so run ulimit -s unlimited in each session, or add this command to your .bashrc.
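If ir2vec is driven from a Python script rather than an interactive shell, the stack limit can also be raised from the script itself. The following is a minimal sketch using the standard `resource` module (Unix only); the helper name `raise_stack_limit` is hypothetical and not part of IR2Vec. It raises the soft limit to the hard limit, which matches `ulimit -s unlimited` only when the hard limit is itself unlimited.

```python
import resource


def raise_stack_limit():
    """Raise the soft stack limit of this process to its hard limit.

    Child processes started afterwards inherit the new limit. If the hard
    limit is finite, that is the most an unprivileged process can request.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)


if __name__ == "__main__":
    soft, hard = raise_stack_limit()
    unlimited = soft == resource.RLIM_INFINITY
    print("stack limit:", "unlimited" if unlimited else soft)
    # Any ir2vec process launched from this script after this point
    # (for example with subprocess.run) inherits the raised limit.
```
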

To verify the correctness of the build, run make verify-all

How to generate IR2Vec program representations?

ir2vec -<mode> -vocab <seedEmbedding-file-path> -o <output-file> -level <p|f> -class <class-number> <input-ll-file>

Command-Line options

  • mode - can be one of sym/fa
    • sym denotes Symbolic representation
    • fa denotes Flow-Aware representation
  • vocab - the path to the seed embeddings file
  • o - file to which the embeddings are appended (Note: if the file doesn't exist, it is created; otherwise the embeddings are appended to it)
  • level - can be one of chars p/f.
    • p denotes program level encoding
    • f denotes function level encoding
  • class - the only optional argument. Specifies the class label for classification tasks (to be used with level p). Defaults to -1. When it is not -1, the pass prints class-number followed by the corresponding embeddings

Please use --help for further details.
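When many files need to be embedded, the invocation above can be assembled programmatically. The sketch below builds the argument list from the documented options and runs it with `subprocess`. The helper names `build_ir2vec_command` and `run_ir2vec` and the file names in the usage example are hypothetical; only the flags themselves come from the usage line above, and the binary is assumed to be reachable as `ir2vec` on the PATH.

```python
import subprocess


def build_ir2vec_command(mode, vocab, output, level, input_ll,
                         class_number=None, binary="ir2vec"):
    """Return the argv list for one ir2vec invocation.

    mode:  "sym" (Symbolic) or "fa" (Flow-Aware)
    level: "p" (program level) or "f" (function level)
    class_number: optional class label; omitted from the command when None,
                  so ir2vec falls back to its default of -1.
    """
    if mode not in ("sym", "fa"):
        raise ValueError("mode must be 'sym' or 'fa'")
    if level not in ("p", "f"):
        raise ValueError("level must be 'p' or 'f'")
    cmd = [binary, f"-{mode}", "-vocab", vocab, "-o", output, "-level", level]
    if class_number is not None:
        cmd += ["-class", str(class_number)]
    cmd.append(input_ll)
    return cmd


def run_ir2vec(**kwargs):
    """Run ir2vec and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_ir2vec_command(**kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_ir2vec_command(
        "fa", "vocab.txt", "embeddings.txt", "p", "prog.ll", class_number=3)))
```

Because the output file is appended to, calling `run_ir2vec` in a loop over several inputs with the same `output` collects all embeddings in one file.
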

Format of the output embeddings in output_file

  • If the level is p:
<class-number> <Embeddings>

class-number is printed only if it is not -1

  • If the level is f
<function-name> = <Embeddings>
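The two line formats above can be read back with a few lines of Python. This is a minimal sketch, not a parser shipped with IR2Vec: the function names are hypothetical, and it assumes the embedding values are whitespace-separated floating-point numbers, which the format description above does not state explicitly.

```python
def parse_program_line(line, has_class):
    """Parse one program-level line: [<class-number>] <Embeddings>.

    has_class must be True when ir2vec was run with a class number other
    than -1, since the label is only printed in that case.
    Returns (class_number or None, list of floats).
    """
    fields = line.split()
    if has_class:
        return int(fields[0]), [float(x) for x in fields[1:]]
    return None, [float(x) for x in fields]


def parse_function_line(line):
    """Parse one function-level line: <function-name> = <Embeddings>.

    Returns (function_name, list of floats).
    """
    name, sep, values = line.partition("=")
    if not sep:
        raise ValueError("expected '<function-name> = <Embeddings>'")
    return name.strip(), [float(x) for x in values.split()]


if __name__ == "__main__":
    # Made-up sample lines with very short vectors, for illustration only.
    print(parse_program_line("3 0.5 -1.25", has_class=True))
    print(parse_function_line("main = 0.5 -1.25"))
```
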

Flow-Aware Embeddings

  • ir2vec -fa -vocab vocabulary/seedEmbeddingVocab-300-llvm8.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>

Symbolic Embeddings

  • ir2vec -sym -vocab vocabulary/seedEmbeddingVocab-300-llvm8.txt -o <output_file> -level <p|f> -class <class-number> <input_ll_file>

Experiments

Citation

@article{VenkataKeerthy-2020-IR2Vec,
author = {VenkataKeerthy, S. and Aggarwal, Rohit and Jain, Shalini and Desarkar, Maunendra Sankar and Upadrasta, Ramakrishna and Srikant, Y. N.},
title = {{IR2Vec: LLVM IR Based Scalable Program Embeddings}},
year = {2020},
issue_date = {December 2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {17},
number = {4},
issn = {1544-3566},
url = {https://doi.org/10.1145/3418463},
doi = {10.1145/3418463},
journal = {ACM Trans. Archit. Code Optim.},
month = dec,
articleno = {32},
numpages = {27},
keywords = {heterogeneous systems, representation learning, compiler optimizations, LLVM, intermediate representations}
}

Contributions

Please feel free to raise issues to file a bug, to pose a question, or to initiate any related discussions. Pull requests are welcome :)

Contributors
  1. S. VenkataKeerthy
  2. Rohit Aggarwal

License

IR2Vec is released under a BSD 4-Clause License. See the LICENSE file for more details.