Expanding SAE evals #196
Conversation
Hey Josh, Thanks again for this PR. Sorry for the delay in responding. Major thoughts:
Next steps:
Specific responses:
Agreed. I think we should maybe separate out a general eval workflow from whatever we do in training, though they could rely on some shared implementations to avoid duplication. I think of auto-interp / N2G stuff as kind of separate, but I understand why you might want that stuff as well when doing evals. I think it's important to have scripts that make sense on a piece of hardware (e.g. auto-interp relying on external LMs doesn't use the GPU much, so that should probably be a separate script).
Sounds great! We can probably whip up a prototype in streamlit or something: save some artefacts / results to disk, load them with plotly express, and use streamlit to play with the structure of how we want to present things. Claude could probably produce a decent scaffold in minutes.
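A rough sketch of what such a prototype could look like (the file name and column names below are hypothetical placeholders, not artifacts produced by this PR):

```python
# Rough sketch of a streamlit dashboard for browsing saved eval results.
# "eval_results.csv" and the column names are hypothetical placeholders.
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("SAE eval results")

# Assumes the eval script saved a CSV of per-SAE metrics to disk.
df = pd.read_csv("eval_results.csv")

metric = st.selectbox(
    "Metric", ["l0", "l1", "kl_divergence", "variance_explained"]
)
fig = px.bar(df, x="sae_name", y=metric, color="dataset", barmode="group")
st.plotly_chart(fig)
```

Something like this could be launched with `streamlit run eval_dashboard.py` and iterated on quickly.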
Have you seen the post I supervised Evan Anders working on? https://www.lesswrong.com/posts/8QRH8wKcnKGhpAu2o/examining-language-model-performance-with-reconstructed
I've addressed the first two in the PR by adding an Eval config. The shuffle function just allows us to shuffle Huggingface datasets by using their built-in shuffle buffer. Since we already have our own shuffle buffer in the library this isn't useful for the shuffle buffer itself (hence why I set a buffer size of 1), but it also shuffles the shards of the dataset (per https://huggingface.co/docs/datasets/v2.20.0/stream#shuffle). This means that we can e.g. eval SAEs starting on different shards of a large web text dataset than we trained on.
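For reference, a minimal sketch of the shard-shuffling trick described above, using the `datasets` streaming API (the dataset name here is only an example):

```python
# Sketch of the shard-shuffling idea: shuffle a streamed HF dataset with a
# buffer size of 1 so that only the shard order changes.
from datasets import load_dataset

ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

# buffer_size=1 makes the example-level shuffle a no-op (the library already has
# its own shuffle buffer), but .shuffle() still permutes the order of the dataset
# shards, so evals can start from different shards than training did.
ds = ds.shuffle(seed=42, buffer_size=1)
```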
I have, but evidently I forgot all the cool results you guys found :)
I was thinking this sort of thing exactly! Probably next I'm going to add neuron-to-graph using the OpenAI implementation; once we have that, I think this sort of display could show some pretty novel and interesting info :) In addition to the tests in the PR, I ran a quick test that spun up training a GPT-2 SAE to ensure the metrics were working, and visually it looked okay!
Haven't forgotten about this. I wanted to make sure the eval runner respects the config set in training (which I think it may now partly ignore), but otherwise this is good to go. We'll look into getting some enterprise-hosted runners so we can trigger benchmarking of new SAEs :)
Description
Building off discussion with @jbloomAus, this PR adds some metrics (L0, L1, KL divergence, variance explained) to the eval suite in evals.py, cleans up the eval suite, and adds a script to evals.py to run the eval suite on any regex set of SAEs + any group of datasets + any group of context lengths. The eventual goal is to be able to more easily compare SAEs and assess the quality of proposed SAE improvements.
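For readers unfamiliar with these metrics, here is a rough sketch of how they can be computed from SAE feature activations and reconstructions. This is an illustration of the general technique, not the exact implementation in evals.py:

```python
import torch
import torch.nn.functional as F

def reconstruction_metrics(feature_acts, sae_out, original_acts):
    """Rough sketch of per-batch SAE metrics (not the exact evals.py code).

    feature_acts:  [n_tokens, d_sae]   SAE feature activations
    sae_out:       [n_tokens, d_model] SAE reconstruction of the activations
    original_acts: [n_tokens, d_model] original model activations
    """
    l0 = (feature_acts > 0).float().sum(-1).mean()        # avg active features
    l1 = feature_acts.abs().sum(-1).mean()                 # avg L1 of activations
    resid_var = (original_acts - sae_out).pow(2).sum()
    total_var = (original_acts - original_acts.mean(0)).pow(2).sum()
    var_explained = 1 - resid_var / total_var
    return {
        "l0": l0.item(),
        "l1": l1.item(),
        "variance_explained": var_explained.item(),
    }

def kl_divergence(logits_with_sae, original_logits):
    """KL between the original next-token distribution and the one obtained when
    the hook-point activations are replaced by the SAE reconstruction."""
    log_q = F.log_softmax(logits_with_sae, dim=-1)  # [n_tokens, d_vocab]
    log_p = F.log_softmax(original_logits, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```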
Next steps:
I haven't added tests because it's not yet clear exactly how we want to integrate these metrics into unit testing, since the tests can take a while to run, but here is the result of running `python sae_lens/evals.py "gpt2-small-res-jb" "blocks.8.hook_resid_pre" --save_path "gpt2_small_jb_layer8_resid_pre_eval_results.csv" --eval_batch_size_prompts 16 --num_eval_batches 5`
One interesting finding from these new benchmarks is that the L0 (and MSE/var explained) increases drastically with longer context length (which was somewhat expected from the conversation between Joseph Bloom and Sam Marks).
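One hypothetical way to probe this effect further would be to track L0 per token position rather than averaged over the whole context. The sketch below is illustrative only and not part of this PR:

```python
import torch

def l0_by_position(feature_acts):
    """feature_acts: [batch, seq_len, d_sae] SAE feature activations."""
    # Mean number of active features at each token position, averaged over the batch.
    return (feature_acts > 0).float().sum(dim=-1).mean(dim=0)  # [seq_len]
```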
Type of change
Checklist:
You have tested formatting, typing and unit tests (acceptance tests not currently in use).
You have run `make check-ci` to check format and linting. (You can run `make format` to format code if needed.)