Expanding SAE evals #196
Merged: 9 commits, Jul 11, 2024

Conversation

@JoshEngels (Contributor) commented on Jun 26, 2024

Description

Building off a discussion with @jbloomAus, this PR adds some metrics (L0, L1, KL divergence, variance explained) to the eval suite in evals.py, cleans up the eval suite, and adds a script (in evals.py) to run the eval suite over any regex-matched set of SAEs, any group of datasets, and any group of context lengths. The eventual goal is to make it easier to compare SAEs and assess the quality of proposed SAE improvements.
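
For context, here is a minimal sketch of how metrics like these are commonly computed. Tensor names and shapes are assumptions for illustration, not the implementation added in evals.py:

```python
import torch.nn.functional as F

def reconstruction_and_sparsity_metrics(acts, recon, feature_acts):
    """acts, recon: [n_tokens, d_model]; feature_acts: [n_tokens, d_sae] (torch tensors)."""
    l0 = (feature_acts > 0).float().sum(-1).mean()   # mean number of active features per token
    l1 = feature_acts.abs().sum(-1).mean()           # mean L1 norm of the feature activations
    mse = (recon - acts).pow(2).sum(-1).mean()       # mean per-token squared reconstruction error
    total_var = (acts - acts.mean(0)).pow(2).sum(-1).mean()
    explained_variance = 1 - mse / total_var         # 1.0 = perfect reconstruction
    l2_ratio = (recon.norm(dim=-1) / acts.norm(dim=-1)).mean()
    return dict(l0=l0, l1=l1, mse=mse,
                explained_variance=explained_variance, l2_ratio=l2_ratio)

def kl_div_with_sae(logits_clean, logits_with_sae):
    """Mean KL(P_clean || P_with_sae) over tokens; logits: [n_tokens, d_vocab]."""
    logp_clean = F.log_softmax(logits_clean, dim=-1)
    logp_with_sae = F.log_softmax(logits_with_sae, dim=-1)
    return F.kl_div(logp_with_sae, logp_clean, log_target=True, reduction="batchmean")
```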

Next steps:

  • Add additional, more interesting metrics to these evals
  • Add config to specify which evals to run? This will likely be necessary as some evals will require running over a much larger set of inputs (e.g. AutoInterp or Neuron to Graph), and we definitely don't want to do this every eval step during training.
  • Add the output of these evals to a publicly visible frontend (wandb?)

I haven't added tests because it's not yet clear exactly how we want to integrate these metrics into unit testing (running them can take a while), but here is the result of running `python sae_lens/evals.py "gpt2-small-res-jb" "blocks.8.hook_resid_pre" --save_path "gpt2_small_jb_layer8_resid_pre_eval_results.csv" --eval_batch_size_prompts 16 --num_eval_batches 5`

sae_id,context_size,dataset,metrics/ce_loss_score,metrics/ce_loss_without_sae,metrics/ce_loss_with_sae,metrics/ce_loss_with_ablation,metrics/kl_div_score,metrics/kl_div_without_sae,metrics/kl_div_with_sae,metrics/kl_div_with_ablation,metrics/l2_norm_in,metrics/l2_norm_out,metrics/l2_ratio,metrics/explained_variance,metrics/l0,metrics/l1,metrics/mse,metrics/total_tokens_evaluated
gpt2-small-res-jb-blocks.8.hook_resid_pre,64,Skylion007/openwebtext,0.9845677614212036,3.960983991622925,4.076873779296875,11.470545768737793,0.9784350395202637,0,0.1678992509841919,7.785748481750488,150.4132080078125,142.9733123779297,0.9286446571350098,0.8952922224998474,63.646095275878906,188.48428344726562,820.5320434570312,5120
gpt2-small-res-jb-blocks.8.hook_resid_pre,64,lighteval/MATH,0.9586054086685181,3.457794666290283,3.7396981716156006,10.267949104309082,0.9578211903572083,0,0.3119097352027893,7.394940376281738,152.2721405029297,141.8019256591797,0.9004862904548645,0.8332929015159607,67.1050796508789,219.2670440673828,1358.131103515625,5120
gpt2-small-res-jb-blocks.8.hook_resid_pre,128,Skylion007/openwebtext,0.9811103940010071,3.578336477279663,3.7281081676483154,11.507119178771973,0.9785649180412292,0,0.17235903441905975,8.040968894958496,126.40776824951172,118.80797576904297,0.9266132712364197,0.8637218475341797,66.3515625,168.2150115966797,847.5152587890625,10240
gpt2-small-res-jb-blocks.8.hook_resid_pre,128,lighteval/MATH,0.9497612714767456,3.115217924118042,3.4698617458343506,10.174386978149414,0.9544948935508728,0,0.3422759473323822,7.521711826324463,128.04052734375,117.19371795654297,0.8958619236946106,0.7796635031700134,70.74580383300781,201.8173828125,1422.8131103515625,10240
gpt2-small-res-jb-blocks.8.hook_resid_pre,256,Skylion007/openwebtext,0.9705211520195007,3.4974114894866943,3.7319839000701904,11.454726219177246,0.9675344824790955,0,0.2600250840187073,8.009261131286621,114.35155487060547,108.47358703613281,0.9435735940933228,0.8196613192558289,73.0569839477539,178.72763061523438,1090.837890625,20480
gpt2-small-res-jb-blocks.8.hook_resid_pre,256,lighteval/MATH,0.9374359846115112,2.9319350719451904,3.384852647781372,10.17119312286377,0.9416958093643188,0,0.4426324963569641,7.591780185699463,115.85066223144531,108.4005355834961,0.9277817010879517,0.7144442796707153,79.93281555175781,217.1159210205078,1775.419189453125,20480
gpt2-small-res-jb-blocks.8.hook_resid_pre,512,Skylion007/openwebtext,0.9241127371788025,3.218637228012085,3.8472740650177,11.502455711364746,0.9269356727600098,0,0.6009768843650818,8.225316047668457,108.37286376953125,137.26498413085938,1.2824950218200684,-0.1812371015548706,160.3723602294922,349.6664123535156,7341.86279296875,40960
gpt2-small-res-jb-blocks.8.hook_resid_pre,512,lighteval/MATH,0.8801499009132385,2.8067667484283447,3.696810245513916,10.233075141906738,0.8844277858734131,0,0.8902671933174133,7.703125,110.2323226928711,153.8268280029297,1.4150439500808716,-1.0064818859100342,191.05508422851562,464.3595275878906,13022.7548828125,40960

One interesting finding from these new benchmarks is that L0 and MSE increase drastically with longer context lengths (and variance explained drops sharply), which was somewhat expected from the conversation between Joseph Bloom and Sam Marks.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

  • I have run make check-ci to check format and linting. (you can run make format to format code if needed.)

@jbloomAus (Owner) commented:

Hey Josh,

Thanks again for this PR. Sorry for the delay in responding.

Major thoughts:

  • It doesn't make sense to do all the sparsity evaluation work in training and then pop those metrics out. We should just have the training code compute only the reconstruction-related metrics. (So maybe a little bit of refactoring is required here?)
  • The KL div should probably be optional as it can be computationally expensive. By default I'd want it off during SAE training (see the sketch after this list).
  • Can you please provide a bit of detail / commentary on the stuff you added to the data store for shuffling? It seems good, but I want to make sure I understand it.
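
A purely hypothetical sketch of the kind of toggles being discussed; the names and defaults are assumptions, not the config actually added in this PR:

```python
from dataclasses import dataclass

@dataclass
class EvalToggles:
    # Hypothetical names/defaults for illustration only.
    compute_kl: bool = False               # expensive; off by default during SAE training
    compute_ce_loss: bool = True           # reconstruction / CE-loss-recovered metrics
    compute_sparsity_metrics: bool = True  # L0 / L1 on feature activations
    eval_batch_size_prompts: int = 16
    n_eval_batches: int = 5
```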

Next steps:
@JoshEngels If you are able to address the above, we can likely merge as soon as you're done. We should probably do a quick performance test too, to check that we haven't massively slowed down training (I don't think we will have if KL div / sparsity metrics are turned off).

Specific response:

Add config to specify which evals to run? This will likely be necessary as some evals will require running over a much larger set of inputs (e.g. AutoInterp or Neuron to Graph), and we definitely don't want to do this every eval step during training.

Agreed. I think we should maybe separate out a general eval workflow from whatever we do in training, though they could rely on some shared implementations to avoid duplication. I think of the auto-interp / N2G stuff as kind of separate, but I understand why you might want it as well when doing evals. I think it's important to have scripts that make sense on a given piece of hardware (e.g. autointerp relying on external LMs doesn't use the GPU as much, so that should probably be a separate script).

Add the output of these evals to a publicly visible frontend (wandb?)

Sounds great! We can probably whip up a prototype in streamlit or something (save some artefacts / results to disk and then load with plotly express). We can use streamlit to play with the structure of how we want to present things. Claude could probably produce a decent scaffold in minutes.
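
For illustration, a rough sketch of such a prototype, assuming the CSV produced by the command above and that streamlit/plotly are installed (none of this is part of the PR):

```python
import pandas as pd
import plotly.express as px
import streamlit as st

# Load eval results saved to disk by evals.py and plot a chosen metric
# against context length, with one line per dataset.
df = pd.read_csv("gpt2_small_jb_layer8_resid_pre_eval_results.csv")
metric = st.selectbox("Metric", [c for c in df.columns if c.startswith("metrics/")])
fig = px.line(df, x="context_size", y=metric, color="dataset", markers=True)
st.plotly_chart(fig)
```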

One interesting finding from these new benchmarks is that L0 and MSE increase drastically with longer context lengths (and variance explained drops sharply), which was somewhat expected from the conversation between Joseph Bloom and Sam Marks.

Have you seen the post I supervised Evan Anders working on? https://www.lesswrong.com/posts/8QRH8wKcnKGhpAu2o/examining-language-model-performance-with-reconstructed

@JoshEngels (Contributor, Author) commented on Jul 4, 2024

Major thoughts:

I've addressed the first two in the PR by adding an Eval config. The shuffle function just allows us to shuffle Hugging Face datasets using their built-in shuffle buffer. Since we already have our own shuffle buffer in the library, this isn't useful for the buffer itself (hence why I set a buffer size of 1), but it also shuffles the shards of the dataset (per https://huggingface.co/docs/datasets/v2.20.0/stream#shuffle). This means that we can, e.g., eval SAEs starting on different shards of a large web-text dataset than the ones we trained on.
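
For reference, a minimal sketch of the shard-shuffling trick described above (the dataset name is just an example taken from the results table):

```python
from datasets import load_dataset

# For a streaming (iterable) dataset, .shuffle() both fills an approximate
# shuffle buffer and permutes the order of the underlying shards. With
# buffer_size=1 the buffer is a no-op, so the only effect is shard-level
# shuffling, letting evals start from different shards than training did.
ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=1)
```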

Have you seen the post I supervised Evan Anders working on?

I have, but evidently I forgot all the cool results you guys found :)

Sounds great! We can probably whip up a prototype in streamlit or something (save some artefacts / results to disk and then load with plotly express). We can use streamlit to play with the structure of how we want to present things. Claude could probably produce a decent scaffold in minutes.

I was thinking of this sort of thing exactly! Next I'm probably going to add Neuron to Graph using the OpenAI implementation; once we have that, I think this sort of display could show some pretty novel and interesting info :)

In addition to the tests in the PR, I ran a quick test that spun up training of a GPT-2 SAE to ensure the metrics were working, and visually everything looked okay!

@jbloomAus (Owner) commented:

Haven't forgotten about this. I wanted to make sure the eval runner respects the config set in training (which I think it might currently partly ignore), but otherwise this is gtg. We'll look into getting some enterprise hosted runners so we can trigger benchmarking of new SAEs :)

@curt-tigges merged commit 7b84053 into jbloomAus:main on Jul 11, 2024
5 checks passed