Expanding SAE evals #196
Merged: 9 commits, Jul 11, 2024

Conversation

@JoshEngels (Contributor) commented on Jun 26, 2024

Description

Building off a discussion with @jbloomAus, this PR adds some metrics (L0, L1, KL divergence, variance explained) to the eval suite in evals.py, cleans up the eval suite, and adds a script (in evals.py) to run the eval suite over any regex-matched set of SAEs, any group of datasets, and any group of context lengths. The eventual goal is to make it easier to compare SAEs and assess the quality of proposed SAE improvements.
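
For context, here is a minimal sketch of how metrics like these are commonly computed. Tensor names and shapes are assumptions for illustration, not the implementation added in evals.py:

```python
import torch.nn.functional as F

def reconstruction_and_sparsity_metrics(acts, recon, feature_acts):
    """acts, recon: [n_tokens, d_model]; feature_acts: [n_tokens, d_sae] (torch tensors)."""
    l0 = (feature_acts > 0).float().sum(-1).mean()   # mean number of active features per token
    l1 = feature_acts.abs().sum(-1).mean()           # mean L1 norm of the feature activations
    mse = (recon - acts).pow(2).sum(-1).mean()       # mean per-token squared reconstruction error
    total_var = (acts - acts.mean(0)).pow(2).sum(-1).mean()
    explained_variance = 1 - mse / total_var         # 1.0 = perfect reconstruction
    l2_ratio = (recon.norm(dim=-1) / acts.norm(dim=-1)).mean()
    return dict(l0=l0, l1=l1, mse=mse,
                explained_variance=explained_variance, l2_ratio=l2_ratio)

def kl_div_with_sae(logits_clean, logits_with_sae):
    """Mean KL(P_clean || P_with_sae) over tokens; logits: [n_tokens, d_vocab]."""
    logp_clean = F.log_softmax(logits_clean, dim=-1)
    logp_with_sae = F.log_softmax(logits_with_sae, dim=-1)
    return F.kl_div(logp_with_sae, logp_clean, log_target=True, reduction="batchmean")
```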

Next steps:

  • Add additional, more interesting metrics to these evals
  • Add config to specify which evals to run? This will likely be necessary as some evals will require running over a much larger set of inputs (e.g. AutoInterp or Neuron to Graph), and we definitely don't want to do this every eval step during training.
  • Add the output of these evals to a publicly visible frontend (wandb?)

I haven't added tests because it's not yet clear exactly how we want to integrate these metrics into unit testing (running them can take a while), but here is the result of running `python sae_lens/evals.py "gpt2-small-res-jb" "blocks.8.hook_resid_pre" --save_path "gpt2_small_jb_layer8_resid_pre_eval_results.csv" --eval_batch_size_prompts 16 --num_eval_batches 5`

sae_id,context_size,dataset,metrics/ce_loss_score,metrics/ce_loss_without_sae,metrics/ce_loss_with_sae,metrics/ce_loss_with_ablation,metrics/kl_div_score,metrics/kl_div_without_sae,metrics/kl_div_with_sae,metrics/kl_div_with_ablation,metrics/l2_norm_in,metrics/l2_norm_out,metrics/l2_ratio,metrics/explained_variance,metrics/l0,metrics/l1,metrics/mse,metrics/total_tokens_evaluated
gpt2-small-res-jb-blocks.8.hook_resid_pre,64,Skylion007/openwebtext,0.9845677614212036,3.960983991622925,4.076873779296875,11.470545768737793,0.9784350395202637,0,0.1678992509841919,7.785748481750488,150.4132080078125,142.9733123779297,0.9286446571350098,0.8952922224998474,63.646095275878906,188.48428344726562,820.5320434570312,5120
gpt2-small-res-jb-blocks.8.hook_resid_pre,64,lighteval/MATH,0.9586054086685181,3.457794666290283,3.7396981716156006,10.267949104309082,0.9578211903572083,0,0.3119097352027893,7.394940376281738,152.2721405029297,141.8019256591797,0.9004862904548645,0.8332929015159607,67.1050796508789,219.2670440673828,1358.131103515625,5120
gpt2-small-res-jb-blocks.8.hook_resid_pre,128,Skylion007/openwebtext,0.9811103940010071,3.578336477279663,3.7281081676483154,11.507119178771973,0.9785649180412292,0,0.17235903441905975,8.040968894958496,126.40776824951172,118.80797576904297,0.9266132712364197,0.8637218475341797,66.3515625,168.2150115966797,847.5152587890625,10240
gpt2-small-res-jb-blocks.8.hook_resid_pre,128,lighteval/MATH,0.9497612714767456,3.115217924118042,3.4698617458343506,10.174386978149414,0.9544948935508728,0,0.3422759473323822,7.521711826324463,128.04052734375,117.19371795654297,0.8958619236946106,0.7796635031700134,70.74580383300781,201.8173828125,1422.8131103515625,10240
gpt2-small-res-jb-blocks.8.hook_resid_pre,256,Skylion007/openwebtext,0.9705211520195007,3.4974114894866943,3.7319839000701904,11.454726219177246,0.9675344824790955,0,0.2600250840187073,8.009261131286621,114.35155487060547,108.47358703613281,0.9435735940933228,0.8196613192558289,73.0569839477539,178.72763061523438,1090.837890625,20480
gpt2-small-res-jb-blocks.8.hook_resid_pre,256,lighteval/MATH,0.9374359846115112,2.9319350719451904,3.384852647781372,10.17119312286377,0.9416958093643188,0,0.4426324963569641,7.591780185699463,115.85066223144531,108.4005355834961,0.9277817010879517,0.7144442796707153,79.93281555175781,217.1159210205078,1775.419189453125,20480
gpt2-small-res-jb-blocks.8.hook_resid_pre,512,Skylion007/openwebtext,0.9241127371788025,3.218637228012085,3.8472740650177,11.502455711364746,0.9269356727600098,0,0.6009768843650818,8.225316047668457,108.37286376953125,137.26498413085938,1.2824950218200684,-0.1812371015548706,160.3723602294922,349.6664123535156,7341.86279296875,40960
gpt2-small-res-jb-blocks.8.hook_resid_pre,512,lighteval/MATH,0.8801499009132385,2.8067667484283447,3.696810245513916,10.233075141906738,0.8844277858734131,0,0.8902671933174133,7.703125,110.2323226928711,153.8268280029297,1.4150439500808716,-1.0064818859100342,191.05508422851562,464.3595275878906,13022.7548828125,40960

One interesting finding from these new benchmarks is that L0 and MSE increase drastically with longer context lengths (and variance explained drops sharply), which was somewhat expected from the conversation between Joseph Bloom and Sam Marks.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

  • I have run make check-ci to check format and linting. (you can run make format to format code if needed.)

@jbloomAus (Owner) commented:

Hey Josh,

Thanks again for this PR. Sorry for the delay in responding.

Major thoughts:

  • It doesn't make sense to do all the sparsity evaluation work in training and then pop those metrics out. We should just have the training code compute only the reconstruction-related metrics. (So maybe a little bit of refactoring is required here?)
  • The KL div should probably be optional as it can be computationally expensive. By default I'd want it off during SAE training (see the sketch after this list).
  • Can you please provide a bit of detail / commentary on the stuff you added to the data store for shuffling? It seems good, but I want to make sure I understand it.
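
A purely hypothetical sketch of the kind of toggles being discussed; the names and defaults are assumptions, not the config actually added in this PR:

```python
from dataclasses import dataclass

@dataclass
class EvalToggles:
    # Hypothetical names/defaults for illustration only.
    compute_kl: bool = False               # expensive; off by default during SAE training
    compute_ce_loss: bool = True           # reconstruction / CE-loss-recovered metrics
    compute_sparsity_metrics: bool = True  # L0 / L1 on feature activations
    eval_batch_size_prompts: int = 16
    n_eval_batches: int = 5
```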

Next steps:
@JoshEngels If you are able to address the above, we can likely merge as soon as you're done. We should probably do a quick performance test too, to check that we haven't massively slowed down training (I don't think we will have if KL div / sparsity metrics are turned off).

Specific response:

Add config to specify which evals to run? This will likely be necessary as some evals will require running over a much larger set of inputs (e.g. AutoInterp or Neuron to Graph), and we definitely don't want to do this every eval step during training.

Agreed. I think we should maybe separate out a general eval workflow from whatever we do in training, though they could rely on some shared implementations to avoid duplication. I think of the auto-interp / N2G stuff as kind of separate, but I understand why you might want it as well when doing evals. I think it's important to have scripts that make sense on a given piece of hardware (e.g. autointerp relying on external LMs doesn't use the GPU as much, so that should probably be a separate script).

Add the output of these evals to a publicly visible frontend (wandb?)

Sounds great! We can probably whip up a prototype in streamlit or something (save some artefacts / results to disk and then load with plotly express). We can use streamlit to play with the structure of how we want to present things. Claude could probably produce a decent scaffold in minutes.
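
For illustration, a rough sketch of such a prototype, assuming the CSV produced by the command above and that streamlit/plotly are installed (none of this is part of the PR):

```python
import pandas as pd
import plotly.express as px
import streamlit as st

# Load eval results saved to disk by evals.py and plot a chosen metric
# against context length, with one line per dataset.
df = pd.read_csv("gpt2_small_jb_layer8_resid_pre_eval_results.csv")
metric = st.selectbox("Metric", [c for c in df.columns if c.startswith("metrics/")])
fig = px.line(df, x="context_size", y=metric, color="dataset", markers=True)
st.plotly_chart(fig)
```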

One interesting finding from these new benchmarks is that L0 and MSE increase drastically with longer context lengths (and variance explained drops sharply), which was somewhat expected from the conversation between Joseph Bloom and Sam Marks.

Have you seen the post I supervised Evan Anders working on? https://www.lesswrong.com/posts/8QRH8wKcnKGhpAu2o/examining-language-model-performance-with-reconstructed

@JoshEngels (Contributor, Author) commented on Jul 4, 2024

Major thoughts:

I've addressed the first two in the PR by adding an Eval config. The shuffle function just allows us to shuffle Hugging Face datasets using their built-in shuffle buffer. Since we already have our own shuffle buffer in the library, this isn't useful for the buffer itself (hence why I set a buffer size of 1), but it also shuffles the shards of the dataset (per https://huggingface.co/docs/datasets/v2.20.0/stream#shuffle). This means that we can, e.g., eval SAEs starting on different shards of a large web-text dataset than the ones we trained on.
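
For reference, a minimal sketch of the shard-shuffling trick described above (the dataset name is just an example taken from the results table):

```python
from datasets import load_dataset

# For a streaming (iterable) dataset, .shuffle() both fills an approximate
# shuffle buffer and permutes the order of the underlying shards. With
# buffer_size=1 the buffer is a no-op, so the only effect is shard-level
# shuffling, letting evals start from different shards than training did.
ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=1)
```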

Have you seen the post I supervised Evan Anders working on?

I have, but evidently I forgot all the cool results you guys found :)

Sounds great! We can probably whip up a prototype in streamlit or something (save some artefacts / results to disk and then load with plotly express). We can use streamlit to play with the structure of how we want to present things. Claude could probably produce a decent scaffold in minutes.

I was thinking of this sort of thing exactly! Next I'm probably going to add Neuron to Graph using the OpenAI implementation; once we have that, I think this sort of display could show some pretty novel and interesting info :)

In addition to the tests in the PR, I ran a quick test that spun up training of a GPT-2 SAE to ensure the metrics were working, and visually everything looked okay!

@jbloomAus (Owner) commented:

Haven't forgotten about this. I wanted to make sure the eval runner respects the config set in training (which I think it might currently partly ignore), but otherwise this is gtg. We'll look into getting some enterprise hosted runners so we can trigger benchmarking of new SAEs :)

@curt-tigges merged commit 7b84053 into jbloomAus:main on Jul 11, 2024
5 checks passed