# Introduction

This repository is an adaptation of the [bert](https://github.com/google-research/bert) repository. Its purpose is to visualize BERT's self-attention weights after the model has been fine-tuned on the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/). However, it can be extended to any text classification dataset by creating an appropriate `DataProcessor` class (a sketch is given in the Examples section below). See [run_classifier.py](run_classifier.py) for details.

# Usage

1. Create a TSV file for each of the IMDb training and test sets. Refer to the [imdb_data](https://github.com/hsm207/imdb_data) repo for instructions; a minimal sketch of the file-writing step is also given in the Examples section below.
2. Fine-tune BERT on the IMDb training set. Refer to the official BERT repo for fine-tuning instructions. Alternatively, you can skip this step by downloading the fine-tuned model from [here](https://drive.google.com/open?id=13Ajyk6xejy3kRU7Ewo_5slCo9db2bOdk). The pre-trained model (BERT base uncased) used to perform the fine-tuning can also be downloaded from [here](https://drive.google.com/open?id=1f23aE84MlPY1eQqzyENt4Fk_DGucof_4).
3. Visualize BERT's attention weights. Refer to the [BERT_viz_attention_imdb](/bert_attn_viz/notebooks/BERT_viz_attention_imdb.ipynb) notebook for more details.

# How it works

The forward pass has been modified to return a list of `{layer_i: layer_i_attention_weights}` dictionaries, one per encoder layer. The shape of `layer_i_attention_weights` is `(batch_size, num_multihead_attn, max_seq_length, max_seq_length)`.

You can specify a function to process this list by passing it as a parameter to the `load_bert_model` function in the [explain.model](explain/model.py) module. The function's output is available in the result of the Estimator's `predict` call under the key `'attention'`.

Currently, only two attention processor functions have been defined, namely `average_last_layer_by_head` and `average_first_layer_by_head`. See [explain.attention](explain/attention.py) for implementation details, and the Examples section below for an illustrative sketch.

# Model Performance Metrics

The fine-tuned model achieved an accuracy of 0.9407 on the IMDb test set. Fine-tuning was done with the following hyperparameters:

* maximum sequence length: 512
* training batch size: 8
* learning rate: 3e-5
* number of epochs: 3
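# Examples

The sketches in this section are illustrative only; check the source files referenced above for the authoritative implementations.

To extend the repository to another text classification dataset, you subclass `DataProcessor`, as mentioned in the introduction. The sketch below follows the `DataProcessor` interface from the upstream BERT `run_classifier.py`; the class name `MyTaskProcessor` and the TSV layout (label in the first column, text in the second) are assumptions made for illustration, not part of this repository.

```python
import os

from run_classifier import DataProcessor, InputExample


class MyTaskProcessor(DataProcessor):
    """Hypothetical processor for a binary text classification TSV dataset."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
            # Assumed TSV layout: column 0 = label, column 1 = review text.
            examples.append(InputExample(guid=guid, text_a=line[1],
                                         text_b=None, label=line[0]))
        return examples
```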
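For step 1 of the usage instructions, the exact file format is defined by the [imdb_data](https://github.com/hsm207/imdb_data) repo; the sketch below simply shows one way to write a tab-separated file with the label-then-text layout assumed above, using toy rows in place of the real reviews.

```python
import csv

# Toy rows standing in for the IMDb reviews; the real data comes from the
# imdb_data repo referenced in step 1 of the usage instructions.
rows = [
    ("1", "A wonderful, moving film."),
    ("0", "Dull plot and wooden acting."),
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)
```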
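Finally, an attention processor reduces the per-layer attention tensors to something easy to plot. The sketch below averages the final layer's attention over its heads, mirroring what a function like `average_last_layer_by_head` might do; the actual implementation lives in [explain.attention](explain/attention.py), and the input format assumed here is the list of single-entry `{layer_name: tensor}` dictionaries described in the *How it works* section.

```python
import tensorflow as tf


def average_last_layer_by_head(attention_list):
    """Average the final layer's attention weights across heads.

    Assumes `attention_list` is a list of single-entry dicts, one per layer,
    each mapping a layer name to a tensor of shape
    (batch_size, num_heads, max_seq_length, max_seq_length).
    """
    last_layer = list(attention_list[-1].values())[0]
    # Collapse the head dimension -> (batch_size, max_seq_length, max_seq_length)
    return tf.reduce_mean(last_layer, axis=1)
```

Whatever the chosen processor returns is what shows up under the `'attention'` key in each prediction dictionary produced by the Estimator's `predict` call.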