Preliminary results here: https://docs.google.com/document/d/1LCjjnUPN51gHl_rmCWEmmtbY-Wu1dixzOif14e-7i-U
Structure of the repository:
- generate many hidden states with
generate_main.py
(it uses Collin's code from the zip files, copied inutils_generation
) - find direction using CCS recursively by adding new constraints each time using
generate_ccs_dirs_main.py
(it uses Collin's code from the github repo + bugfixes + custom things I want for recursion & debuging, the core functions are inutils.py
). You can see what I use it for by looking atscript.sh
(it's a bit messy). - analyze the directions found by CCS by running
analyze_ccs_dirs_main.py
(looking at a single layer) oranalyze_ccs_dirs_layers_main.py
(looking at multiple layers at a time) as Python notebooks. - get a sense of what the CCS loss does by running
vizualisation.py
as a Python notebook. - see what happens when you mix directions by running
mixing_directions.py
as a Python notebook.
Note: evaluate.py
, generate_ccs_dirs.py
, evaluate.py
, and analyze_ccs_dirs.py
legacy files from when I only looked at datasets composed of a single prompt. They are not used anymore.