Skip to content

Latest commit

 

History

History
7 lines (4 loc) · 862 Bytes

README.md

File metadata and controls

7 lines (4 loc) · 862 Bytes

Coup probes

This is a repository that contains experiments of the Coup probes post. It contains code to train probes to identify theft advice, and evaluate their generalization abilities under format variations and jaibreak suffixes.

Once the seed dataset is generated, theft_probe/run.py runs the relevant scripts to generate jailbreaks and the model activations. Then theft_probe/train_probes.py trains the probes and evaluates them, and theft_probe/plot.py generates figures.

This is a fork of the official repository for "Universal and Transferable Adversarial Attacks on Aligned Language Models" by Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson.