This is my reproduction of Anthropic's research on monosemanticity in language models. I wrote a blog post about it where I share details and results.
This repo is a work in progress, so if you want to try this yourself, the clearest code is in neuronresampling.ipynb.
Future work:
- Self-contained, minimal Jupyter/Colab notebook
- Implement Anthropic's recent updates to their research (unconstrained encoder norm, L2 regularization, etc.)
- Experiments with multi-layer transformers
- Better docs...