Meta MegaByte model experiments.

Based on the MEGABYTE-pytorch repository by lucidrains and nanoGPT by Andrej Karpathy. All training runs are publicly available on Neptune.ai.
- Create training script
- Add Neptune.ai logging
- Add cosine learning rate scheduler (see the sketch below)
- Add weight initialisation based on MegaByte paper
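A minimal sketch of the warmup-plus-cosine schedule, in the style of nanoGPT's scheduler. The step counts and LR bounds below are illustrative values taken from the runs described later, not the exact per-run settings:

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=25000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Applied manually each step:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```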
```python
DIM_HEAD = 64
HEADS = 8
NUM_TOKENS = 256
DIM = (768, 512, 256)
DEPTH = (6, 4, 2)
MAX_SEQ_LEN = (512, 4, 4)
FLASH_ATTN = False
```
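These hyperparameters map directly onto the `MEGABYTE` constructor from MEGABYTE-pytorch; a sketch of the setup, following that repository's README (the input shape for the three-stage case is an assumption extrapolated from its two-stage example):

```python
import torch
from MEGABYTE_pytorch import MEGABYTE

model = MEGABYTE(
    num_tokens = 256,           # byte-level vocabulary
    dim = (768, 512, 256),      # model width per stage, global to local
    depth = (6, 4, 2),          # transformer layers per stage
    max_seq_len = (512, 4, 4),  # patch lengths; 512 * 4 * 4 = 8192-byte context
    dim_head = 64,
    heads = 8,
    flash_attn = False
)

x = torch.randint(0, 256, (1, 512, 4, 4))  # one full-context batch of bytes
loss = model(x, return_loss = True)        # autoregressive cross-entropy
loss.backward()
```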
- Baseline: Lucidrains' defaults plus cosine LR (Neptune: MEG-16)
  - More or less the same as with no LR schedule; very slightly worse.
- Double Batch Size from 4 to 8 (Effective 16 to 32)
  - Naive Attempt: Slightly worse convergence (Neptune: MEG-18)
  - LR (6e-4 decaying to 6e-5):
    - (BS := 8, Neptune: MEG-19) Faster and better convergence per training step. However, validation loss vs wall-clock time is worse; a clear trade-off.
    - (BS := 4, Neptune: MEG-20) Learning rate too high for this batch size: training loss plateaus and then worsens, while validation loss stays flat.
- AMP (Neptune: MEG-21). This run fixed the loss scaling and achieves the best validation loss at 1.29, but gradients eventually exploded at around step 7,000. Better to use 3e-4 rather than 6e-4 as the maximum learning rate (see the AMP sketch after this list).
  - Attempted Continuation (Neptune: MEG-22). Unsuccessful; either the best performance had already been reached, or it failed because the optimiser state wasn't restored.
- 6e-4 with Batch Size 20 (Effective 80) on an A10 24GB on Lambda Cloud (Neptune: MEG-32). Appears to converge as well as, if not better than, any other attempt in wall-clock time, and is much better per step. However, training blew up unexpectedly before finishing; MEG-34 details the continuation attempt, which failed because the optimiser state wasn't preserved alongside the weights, so Adam's moving-average moment estimates were effectively reset (see the checkpointing note in the sketch after this list).
- 6e-4 with Batch Size 40 (Effective 160) on an A100 40GB on Lambda Cloud (Neptune: MEG-39). Same behaviour as the A10 run but with even better performance against wall-clock time.
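A minimal sketch of the AMP training step with dynamic loss scaling, gradient clipping, and a checkpoint that saves the optimiser state alongside the weights (the lesson from the MEG-22/MEG-34 continuation failures). The Neptune logging key and clip threshold are illustrative assumptions:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

def train_step(model, optimizer, batch, run=None, clip=1.0):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch, return_loss=True)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees the true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    scaler.step(optimizer)      # skips the update if inf/nan gradients appear
    scaler.update()
    if run is not None:
        run["train/loss"].append(loss.item())  # Neptune.ai logging (key assumed)
    return loss.item()

def save_checkpoint(model, optimizer, scaler, step, path):
    # Restoring only the model weights resets Adam's moment estimates;
    # save the optimiser (and scaler) state so continuations resume correctly.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scaler": scaler.state_dict(),
        "step": step,
    }, path)
```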
```python
DIM_HEAD = 64
HEADS = 16
NUM_TOKENS = 256
DIM = (1024, 512, 256)
DEPTH = (24, 4, 2)  # (12, 4, 2)
MAX_SEQ_LEN = (512, 4, 4)
FLASH_ATTN = False
```
NOTES:
- Trying the SophiaG optimiser instead of Adam (see the sketch after this list)
- Trying RedPajama-Data-1T-Sample instead of enwik8
- Batch Size := 20 on A6000 48GB, LR=3e-4 (Neptune: MEG-67)
  - Training Continuation (lr=3e-4 was too high for the larger 270M model vs the 52M model; continuing from the 1,000-step checkpoint at 2e-4 with 0 warmup on continuation) (Neptune: MEG-72)
  - Model gradients exploded because the 3e-4 learning rate was too high. Should just use the original 2e-4 as suggested in other forks and in the original paper. Possibly SophiaG also dislikes a higher LR here.
- Same as above but switching the LR back to 2e-4, attempting a full 1-epoch run over 12 hours (Neptune: MEG-80). This lasted until around step 2,000 before gradients exploded. Needs a longer warmup; a high starting LR is fine but it needs to decay more.
- Same as above but using the Adam optimiser (Neptune: MEG-83). SophiaG is just better overall, with faster convergence across the board, but it needs some tuning.
- Try 1e-4 LR (Neptune: MEG-87). SophiaG seems to be very sensitive to the LR, or maybe to model size; something is very sensitive, as the gradients always explode. This run plateaued much later in training, suggesting the gradient issue is related either to the LR or to some other aspect of SophiaG.
- Changed rho to 0.05, changed grad clip to 1.0, changed the LR back to 2e-4 (decaying to 2e-5), still with a warmup of 2,000 steps (Neptune: MEG-93). Made no difference; switching back to Adam.
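For reference, a sketch of the SophiaG loop following the reference implementation's README (github.com/Liuhong99/Sophia). The Hessian-update interval, the logits-returning call, and the reuse of `model`/`loader` from the setup above are assumptions:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG  # reference implementation: github.com/Liuhong99/Sophia

# model and loader come from the training setup above
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.04, weight_decay=1e-1)  # MEG-93 also tried rho=0.05

k = 10  # update the Hessian EMA every k steps
for step, batch in enumerate(loader):
    if step % k != k - 1:
        loss = model(batch, return_loss=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step(bs=batch.numel())  # bs = tokens per step, per the README
        optimizer.zero_grad(set_to_none=True)
    else:
        # Gauss-Newton-Bartlett Hessian estimate: backprop a loss against
        # labels sampled from the model's own logits, then fold it into
        # the optimiser's Hessian EMA.
        logits = model(batch)  # assumes logits are returned when return_loss=False
        y_sample = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                       y_sample.view(-1))
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```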
- Initial Training Run (Neptune: MEG-142, MEG-143, MEG-144, MEG-147, MEG-149, MEG-151, MEG-157). Very successful, using a 512 sequence length and the GPT-Neo tokeniser (see the data-loading sketch below).
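A sketch of the data side of this run, using the standard Hugging Face IDs for the GPT-Neo tokeniser and the RedPajama sample; the preprocessing details are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# ~1B-token sample of the RedPajama corpus
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

def tokenize(example):
    # truncate to the 512-token sequence length used in these runs
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
```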
megabyte_25k_1.2836014032363892.pt
MegaByte model trained on enwik8 for 25k batches (~36.44 passes over the dataset). Produces text as seen below. The model is starting to spell words correctly, learn 2-/3-gram sequences, etc., and is still converging. Trained for just over 12 hours on an RTX 3060 Ti (full training on an RTX 3060 Ti would take roughly 50 hours). Final validation loss is ~1.28 with a perplexity of 3.59.
Validation loss: 1.2836014032363892
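For reference, converting that loss (assuming it is the mean cross-entropy in nats per byte) to perplexity and bits-per-byte:

```python
import math

val_loss = 1.2836014032363892
print(f"perplexity:    {math.exp(val_loss):.2f}")      # e**loss
print(f"bits per byte: {val_loss / math.log(2):.3f}")  # nats to bits
```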