jnward/monosemanticity-repro

This is my reproduction of Anthropic's research on monosemanticity in language models. I wrote a blog post about it where I share details and results.

This repo is a work in progress. If you want to try this yourself, the clearest code is in neuronresampling.ipynb.

Future work:

  • Self-contained, minimal Jupyter/Colab notebook
  • Implement Anthropic's recent updates to their research (unconstrained encoder norm, L2 regularization, etc.)
  • Experiments with multi-layer transformers
  • Better docs...
