This is just a small language model that I made for fun. It's not very good, but it's a good way to learn how GPT-like transformer models actually work.
NOTE: Training is not well optimized and the dataset is rather small, so there is a lot of room for improvement.
Before using standard methods
After making some trivial changes:
- Added RMSNorm (a minimal sketch follows this list)
- Added gradient clipping (also sketched below)
- Cut the learning rate to 1/10th of its original value
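For reference, here is a minimal RMSNorm sketch, assuming the model is written in PyTorch; the class name and `eps` value are illustrative, not necessarily what this repo uses:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (as used in LLaMA): scales activations
    by their RMS instead of subtracting the mean like LayerNorm."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```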
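And a sketch of the gradient-clipping and learning-rate changes inside a toy training step; the model, data, and base learning rate here are stand-in assumptions, not this repo's actual values:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the snippet runs on its own.
model = nn.Linear(16, 16)
x = torch.randn(8, 16)

# e.g. a base LR of 3e-4 cut to 1/10th -> 3e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

loss = model(x).pow(2).mean()
loss.backward()

# Clip the global gradient norm before stepping so one bad batch
# can't produce an exploding update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```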
Training loss over time
NOTE: The differences between this model and GPT-3 were taken from LLaMA and PaLM.