
What Is Learned, When? | EleutherAI Community Project

Reference thesis: https://nsaphra.net/uploads/thesis.pdf

Older work (including that thesis) uses LSTMs, and it is unclear whether any of its findings transfer to a modern transformer architecture. The Pythia checkpoints give a good reference point.

Goals for the project

As JDC has outlined:

  1. Establish what the fully trained pythia-12b embeddings have learned.
  2. Look through the checkpoints to see at what point the model learns each of those things, how quickly it learns them, and what the learning curve looks like (a checkpoint-loading sketch follows this list).
  3. See whether this extends to other pythia models/sizes.
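
A minimal sketch of the checkpoint-walking part, assuming the checkpoints are published as Hugging Face Hub revisions named step{N}, which is the convention EleutherAI's Pythia repositories use (branches step0 through step143000). The helper name load_embeddings is ours, not part of any library:

```python
import torch
from transformers import AutoModelForCausalLM

def load_embeddings(step: int, repo: str = "EleutherAI/pythia-12b") -> torch.Tensor:
    """Load one training checkpoint and return its input-embedding matrix."""
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        revision=f"step{step}",     # each checkpoint lives on its own branch
        torch_dtype=torch.float16,  # load weights in half precision
    )
    # Input-embedding matrix, shape (vocab_size, hidden_size).
    return model.get_input_embeddings().weight.detach()

emb_final = load_embeddings(143000)  # step143000 is the fully trained model
```

For prototyping, the smaller Pythia sizes (e.g. EleutherAI/pythia-70m) expose the same revision layout and are much cheaper to download.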

TODO List:

  • Upload all pythia-12b checkpoints (or a meaningful subset, e.g. every power of 2) to HF
  • Analyze token meanings/categories in the fully trained pythia-12b model (a nearest-neighbor sketch follows this list)
  • Analyze when those meanings show up during pythia-12b training
  • Potentially expand this to other pythia model sizes to see whether the findings hold across scales
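
One concrete form the token-category analysis could take is nearest neighbors of a probe token under cosine similarity, reusing emb_final from the sketch above. The probe word and k are illustrative choices, not part of the project plan:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-12b")

def nearest_neighbors(emb: torch.Tensor, word: str, k: int = 10) -> list[str]:
    token_id = tokenizer.encode(word)[0]
    # Normalize rows so a dot product equals cosine similarity.
    normed = torch.nn.functional.normalize(emb.float(), dim=-1)
    sims = normed @ normed[token_id]      # similarity to every vocab token
    top = sims.topk(k + 1).indices[1:]    # drop the probe token itself
    return [tokenizer.decode([int(i)]) for i in top]

# If " Monday" clusters with the other weekdays, the embedding has
# plausibly learned that category.
print(nearest_neighbors(emb_final, " Monday"))
```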

Links to data

GSON has uploaded weights here: https://huggingface.co/amphora/pythia-12b-weights

And data on their cosine similarities here: https://huggingface.co/amphora/pythia-12b-weights/tree/main/cos_sim
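
The cos_sim folder suggests precomputed cosine similarities across checkpoints; since its exact file layout isn't documented here, the sketch below recomputes a per-token similarity directly from two checkpoints loaded with load_embeddings from the first sketch. The step numbers are illustrative:

```python
import torch

def per_token_drift(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between corresponding rows; shape (vocab_size,).
    return torch.nn.functional.cosine_similarity(emb_a.float(), emb_b.float(), dim=-1)

# An early checkpoint vs. the final one.
sims = per_token_drift(load_embeddings(1000), load_embeddings(143000))
print(f"mean per-token cos-sim, step1000 vs. final: {sims.mean():.3f}")
# Tokens far below 1.0 were still moving late in training; tokens near
# 1.0 settled early.
```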

References

Naomi Saphra. Training Dynamics of Neural Language Models. PhD thesis, University of Edinburgh, 2021. https://nsaphra.net/uploads/thesis.pdf
