Coup probes

This repository contains the experiments for the Coup probes post. It includes code to train probes that identify theft advice and to evaluate how well they generalize under format variations and jailbreak suffixes.

Once the seed dataset is generated, theft_probe/run.py runs the relevant scripts to generate the jailbreaks and the model activations. theft_probe/train_probes.py then trains and evaluates the probes, and theft_probe/plot.py generates the figures.
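The order of the steps above can be sketched as a small driver script. This is only an illustration: it assumes the seed dataset already exists, that the commands are run from the repository root, and that each script can be run without arguments, none of which is confirmed here. The helper names build_commands and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script paths as named in this README, in the order they are meant to run.
# Assumption: each script runs without command-line arguments.
PIPELINE = [
    "theft_probe/run.py",           # jailbreaks and model activations
    "theft_probe/train_probes.py",  # probe training and evaluation
    "theft_probe/plot.py",          # figures
]


def build_commands(python=sys.executable):
    """Return one command per pipeline stage, in execution order."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(runner=subprocess.run):
    """Run each stage in order; check=True stops at the first failing stage."""
    for cmd in build_commands():
        runner(cmd, check=True)
```

Calling run_pipeline() from the repository root would run the three stages one after another. Passing a different runner makes it possible to inspect the commands without executing them.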