Stars
Mechanistic Interpretability Visualizations using React
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
A curation of awesome tools, documents and projects about LLM Security.
A framework for few-shot evaluation of language models.
Representation Engineering: A Top-Down Approach to AI Transparency
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Sparse Autoencoder for Mechanistic Interpretability
Training Sparse Autoencoders on Language Models
Sparse Autoencoder for Mechanistic Interpretability
winnieyangwannan / SAELens
Forked from jbloomAus/SAELens
Training Sparse Autoencoders on Language Models
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
A scientific instrument for investigating latent spaces
Code for internal lab sharing - polishing has started but is by no means complete
A beautiful, simple, clean, and responsive Jekyll theme for academics