OthelloScope

Replicating neuroscope for OthelloGPT MLP neurons. Main challenges:

- Get OthelloGPT to work
- Get template files and HTML to work
- Find out how to find the most activating dataset examples --> Use the linear probe
- Figure out a good way to represent MLP neurons
- Figure out a good way to differentiate polysemantic neurons from monosemantic neurons
- Show a neuron's most activating examples
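
Finding a neuron's most activating examples can be sketched as a top-k search over cached activations. This is a minimal illustration, not the repository's code: the `(n_games, n_moves, n_neurons)` array layout and the name `top_activating_examples` are assumptions, and the sketch does not cover the linear-probe step mentioned above.

```python
import numpy as np

def top_activating_examples(acts, neuron, k=10):
    """Return the k (game, move, activation) triples where `neuron` fires most.

    acts: array of shape (n_games, n_moves, n_neurons) (hypothetical layout).
    """
    neuron_acts = acts[:, :, neuron]              # (n_games, n_moves)
    flat = neuron_acts.ravel()
    k = min(k, flat.size)
    # argpartition selects the k largest values; then sort those k descending.
    top = np.argpartition(flat, -k)[-k:]
    top = top[np.argsort(flat[top])[::-1]]
    games, moves = np.unravel_index(top, neuron_acts.shape)
    return [(int(g), int(m), float(flat[i])) for g, m, i in zip(games, moves, top)]

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 59, 8))    # toy stand-in for cached MLP activations
acts[3, 20, 2] = 100.0                # plant an obvious maximum for neuron 2
examples = top_activating_examples(acts, neuron=2, k=3)
```

Each returned triple identifies a game and move index, which a viewer page could use to render the corresponding board.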

Concrete tasks

- Create a function that computes the variance of each neuron's activations, so that neurons can be sorted in descending order
- Separate the ours, theirs, and blanks
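
The two tasks above could look roughly like the following. This is a sketch under stated assumptions: the `(n_samples, n_neurons)` activation layout, the integer board encoding (`BLANK`, `BLACK`, `WHITE`), and both function names are hypothetical and not taken from the repository.

```python
import numpy as np

def rank_neurons_by_variance(acts):
    """Return neuron indices sorted by activation variance, highest first.

    acts: array of shape (n_samples, n_neurons) (hypothetical layout).
    """
    variances = acts.var(axis=0)
    order = np.argsort(variances)[::-1]
    return order, variances[order]

# Hypothetical board encoding; the real data may use different values.
BLANK, BLACK, WHITE = 0, 1, 2
OURS, THEIRS = 1, 2

def to_player_relative(board, player_to_move):
    """Relabel a black/white board as ours/theirs/blank for the player to move."""
    board = np.asarray(board)
    rel = np.full(board.shape, BLANK)
    rel[board == player_to_move] = OURS
    rel[(board != BLANK) & (board != player_to_move)] = THEIRS
    return rel

acts = np.array([[0.0, 0.0, 0.0],
                 [1.0, 10.0, 0.1],
                 [2.0, 20.0, 0.2]])
order, sorted_vars = rank_neurons_by_variance(acts)
relative = to_player_relative([BLANK, BLACK, WHITE, BLACK], WHITE)
```

Ranking by variance is only one possible ordering; it surfaces neurons whose activations change a lot across positions, which is not the same as being monosemantic.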

Othello World's original README

Update 02/13/2023 🔥🔥🔥

Neel Nanda just released a TransformerLens version of Othello-GPT (Colab, Repo Notebook), which should boost mechanistic interpretability research on it.

Othello World

This repository provides the code for training, probing, and intervening on Othello-GPT from Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, to be presented at ICLR 2023.
The implementation is based on minGPT, thanks to Andrej Karpathy.

Abstract

Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.

Installation

Some plotting functions require LaTeX to be installed on your machine; check this FAQ for installation instructions.
Then use these commands to set up:

conda env create -f environment.yml
conda activate othello
python -m ipykernel install --user --name othello --display-name "othello"
mkdir -p ckpts/battery_othello

Training Othello-GPT

Download the championship dataset and the synthetic dataset and save them in the data subfolder.
Then see train_gpt_othello.ipynb for training and validation. Alternatively, checkpoints can be downloaded from here to skip this step.
The default experiment setting requires $8$ GPUs and uses up to roughly $12$ GB of memory on each. Once the code is set up, you can use jupyter nbconvert --execute --to notebook --allow-errors --ExecutePreprocessor.timeout=-1 train_gpt_othello.ipynb --inplace --output ckpts/checkpoint.ipynb to run it in the background.

Probing Othello-GPT

Use train_probe_othello.py to train probes.
For example, to train a nonlinear probe with hidden size $64$ on internal representations extracted from layer $6$ of the Othello-GPT trained on the championship dataset, run python train_probe_othello.py --layer 6 --twolayer --mid_dim 64 --championship.
Checkpoints will be saved to ckpts/battery_othello, or can alternatively be downloaded from here. These checkpoints are produced by produce_probes.sh.
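
To make the shape of a two-layer probe concrete, here is a minimal NumPy sketch of a forward pass only. It is not the code in train_probe_othello.py: the residual width of 512, the three classes per square, the ReLU nonlinearity, and the names `init_probe` and `probe_forward` are all assumptions for illustration; only the hidden size of 64 comes from the command above.

```python
import numpy as np

D_MODEL = 512     # assumed width of the Othello-GPT internal representation
MID_DIM = 64      # matches --mid_dim 64 in the command above
N_SQUARES = 64    # 8x8 Othello board
N_CLASSES = 3     # assumed: three possible states per square

def init_probe(rng, d_model=D_MODEL, mid_dim=MID_DIM):
    """Randomly initialise a two-layer probe (untrained, for shape checking)."""
    return {
        "w1": rng.normal(scale=d_model ** -0.5, size=(d_model, mid_dim)),
        "b1": np.zeros(mid_dim),
        "w2": rng.normal(scale=mid_dim ** -0.5, size=(mid_dim, N_SQUARES * N_CLASSES)),
        "b2": np.zeros(N_SQUARES * N_CLASSES),
    }

def probe_forward(params, acts):
    """Map activations (batch, d_model) to logits (batch, squares, classes)."""
    hidden = np.maximum(acts @ params["w1"] + params["b1"], 0.0)   # ReLU
    logits = hidden @ params["w2"] + params["b2"]
    return logits.reshape(-1, N_SQUARES, N_CLASSES)

rng = np.random.default_rng(0)
params = init_probe(rng)
acts = rng.normal(size=(4, D_MODEL))    # toy stand-in for layer-6 activations
pred = probe_forward(params, acts).argmax(axis=-1)   # predicted state per square
```

Dropping the hidden layer (a single matrix from `D_MODEL` to `N_SQUARES * N_CLASSES`) would give the linear-probe counterpart.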

Intervening Othello-GPT

See intervening_probe_interact_column.ipynb for the intervention experiment, where we can customize (1) which model to intervene on, (2) the pre-intervention board state, and (3) which square(s) to intervene on.
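
The general idea of a probe-guided intervention can be illustrated with a deliberately simplified sketch: nudge an activation vector by gradient descent until a probe reads out a different state for one square. This is not the notebook's procedure. It assumes a linear softmax probe for a single square (the experiments above use trained probes on a real model), and `intervene`, `probe_w`, the learning rate, and the step count are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def intervene(x, probe_w, target, lr=0.5, steps=100):
    """Gradient descent on x so a linear probe (classes, d) predicts `target`.

    Minimises cross-entropy between softmax(probe_w @ x) and the target class;
    the gradient with respect to x is probe_w.T @ (p - onehot).
    """
    x = x.copy()
    onehot = np.eye(probe_w.shape[0])[target]
    for _ in range(steps):
        p = softmax(probe_w @ x)
        x -= lr * probe_w.T @ (p - onehot)
    return x

rng = np.random.default_rng(0)
d = 16                                           # toy activation width
probe_w = rng.normal(size=(3, d)) / np.sqrt(d)   # toy probe for one square
x = rng.normal(size=d)                           # toy activation vector
current = int(np.argmax(probe_w @ x))
target = (current + 1) % 3                       # pick a different state
x_new = intervene(x, probe_w, target)
```

In a real experiment the edited activation would be written back into the model's forward pass so that the effect on the predicted moves can be observed.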

Attribution via Intervention Plots

See plot_attribution_via_intervention_othello.ipynb for the attribution via intervention experiment, where we can also customize (1) which model to intervene on, (2) the pre-intervention board state, and (3) which square(s) to attribute.

How to Cite

@inproceedings{
li2023emergent,
title={Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task},
author={Kenneth Li and Aspen K Hopkins and David Bau and Fernanda Vi{\'e}gas and Hanspeter Pfister and Martin Wattenberg},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=DeG07_TcZvT}
}
