GitHub - jnward/monosemanticity-repro

This is my reproduction of Anthropic's research on monosemanticity in language models. I wrote a blog post about it where I share details and results.

This repo is a work in progress, so the clearest code to start from if you want to try this yourself is in neuron_resampling.ipynb.

Future work:

Self-contained, minimal Jupyter/Colab notebook
Implement Anthropic's recent updates to their research (unconstrained encoder norm, L2 regularization, etc.)
Experiments with multi-layer transformers
Better docs...

Files:

README.md
autoencoder.py
config.py
data_utils.py
features.md
model_utils.py
neuron_resampling.ipynb
prepare_dataset.ipynb
requirements.txt
train_autoencoder.py
train_transformer.py
transformer.ipynb
transformer.py
transformer_pile.ipynb
