Stanford University
Pinned repositories:
- ravel (Public): Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
- verbatim-memorization (Public): Demystifying Verbatim Memorization in Large Language Models. Updated Jul 29, 2024.
- eval-neuron-explanation (Public): A framework for evaluating natural language explanations of neurons.
- char-iit (Public): A causal intervention framework to learn robust and interpretable character representations inside subword-based language models.