
Modded-NanoGPT

This is a variant of the PyTorch GPT-2 trainer from Andrej Karpathy's llm.c repo. It:

  • Trains 2x more efficiently (taking only 5B tokens instead of 10B to reach the same validation loss).
  • Has simpler code (446 lines instead of 858).
  • Implements modernizations like rotary embeddings.


To run it:

python data/fineweb.py
./run.sh

This will produce a 124M-parameter transformer trained on 5B tokens, which reaches a validation loss of 3.2818 on the FineWeb validation set. For comparison, the original llm.c trainer reaches a validation loss of 3.2847 after training on 10B tokens.

To simplify the code, some features have been removed, including text generation. In addition, to obtain the training speedup, the architecture and hyperparameters diverge somewhat from a strict reproduction of the GPT-2 paper.

The speedup is due to the following changes:

  • Increased the learning rate by 3x
  • Switched to a trapezoidal learning rate schedule, following arXiv:2405.18392 (see the schedule sketch below)
  • Switched to rotary embeddings (sketched below)
  • Removed the special initialization for linear layers before residuals; instead, the output of the attention block is scaled down by a fixed scalar (see the block sketch below)
  • Removed all affine scale and bias parameters from the architecture and switched to RMSNorm (this actually causes a slight slowdown; it was done purely to reduce code complexity)
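
Below is a minimal sketch of the trapezoidal (warmup-stable-decay) schedule described in arXiv:2405.18392. The warmup and warmdown lengths are illustrative placeholders, not the values this trainer actually uses.

# Sketch of a trapezoidal learning rate schedule (arXiv:2405.18392).
# warmup_iters and warmdown_iters are placeholders, not this repo's settings.
def get_lr(step, max_lr, num_iterations, warmup_iters=250, warmdown_iters=2000):
    if step < warmup_iters:
        return max_lr * (step + 1) / warmup_iters             # linear warmup
    if step < num_iterations - warmdown_iters:
        return max_lr                                         # constant plateau
    return max_lr * (num_iterations - step) / warmdown_iters  # linear decay to zero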

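Rotary embeddings replace GPT-2's learned position embeddings by rotating pairs of query/key channels through position-dependent angles. A minimal PyTorch sketch using the common rotate-half formulation is shown below; the exact implementation in this repo may differ.

import torch

def apply_rotary_emb(x, base=10000):
    # x: (batch, n_heads, seq_len, head_dim)
    B, H, T, D = x.shape
    half = D // 2
    # per-channel inverse frequencies, base**(-2i/D) for i in [0, D/2)
    freqs = (1.0 / base) ** (torch.arange(half, device=x.device, dtype=torch.float32) / half)
    theta = torch.arange(T, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys (not values) before the attention product:
# q, k = apply_rotary_emb(q), apply_rotary_emb(k)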
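
The last two items amount to a simpler pre-norm block: a parameter-free RMSNorm in place of LayerNorm, and a fixed scalar on the attention output in place of the GPT-2-style scaled initialization. A sketch, with the scalar value as a placeholder rather than the trainer's actual constant:

import torch

def rmsnorm(x, eps=1e-6):
    # parameter-free RMSNorm: no learned scale or bias
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

ATTN_SCALE = 0.5  # placeholder fixed scalar, not necessarily the value used here

def block(x, attn, mlp):
    # pre-norm residual block; only the attention branch is scaled down
    x = x + ATTN_SCALE * attn(rmsnorm(x))
    x = x + mlp(rmsnorm(x))
    return x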