
WORK IN PROGRESS (do not use)

Latent Large Language Models

A work in progress.

(1) Node Setup

For a development setup with fast iterative deployment on a LAN, follow the instructions in the playbooks/ directory.

For Internet-scale training, we will need to build a Docker container...

(2) Dataset Setup

This project uses my dataloader pip package from https://github.com/catid/dataloader

(3) Training

# Create and activate the conda environment
conda create -n lllm python=3.10 -y
conda activate lllm

# Install dependencies (NumPy must stay below 2.0)
pip install "numpy<2.0"
pip install packaging torch==2.3.1 torchvision torchaudio
pip install mamba_ssm
pip install causal-conv1d
pip install flash-attn
pip install -r requirements.txt

# pip install cupy  # only needed for 1-bit LAMB

Follow the instructions in the train/ directory.

TODO

Training TODO:

Model TODO:

Dataloader TODO:

Dataloader future improvements:

  • RHO-loss for the dataset, using LLaMA-3 8B to provide a reference loss for each token (these need to be converted to our tokenizer via approximation); see the sketch below
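
As a rough illustration of how that selection could work (not code from this repo), the sketch below assumes per-token reference losses from LLaMA-3 8B have already been approximated onto our tokenizer so they align with `labels`; the model is assumed to return logits directly, and `select_rho_batch` is a hypothetical helper.

```python
# Hypothetical RHO-loss selection sketch (not the project's actual dataloader code).
# Assumes `reference_loss` holds per-token LLaMA-3 8B losses already remapped onto
# our tokenizer, so it lines up with `labels`, and `model(input_ids)` returns logits.
import torch
import torch.nn.functional as F

def select_rho_batch(model, input_ids, labels, reference_loss, keep_fraction=0.5):
    """Keep the examples with the highest reducible loss:
    reducible = current training loss - frozen reference model loss."""
    with torch.no_grad():
        logits = model(input_ids)                                # (B, T, vocab)
        train_loss = F.cross_entropy(
            logits.transpose(1, 2), labels, reduction="none"     # per-token loss (B, T)
        ).mean(dim=1)                                            # per-example loss (B,)
        reducible = train_loss - reference_loss.mean(dim=1)
    k = max(1, int(keep_fraction * input_ids.size(0)))
    keep = reducible.topk(k).indices                             # hardest learnable examples
    return input_ids[keep], labels[keep]
```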

Training future experiments:

FFN experiments:

  • Sharing FFN weights onion-style (https://arxiv.org/abs/2104.06022)
  • Share the majority of FFN weights between consecutive layers, only replacing a few of them at each layer (see the sketch after this list)
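
A minimal sketch of the second idea, assuming a shared bank of up-projection weight blocks where consecutive layers overlap in all but one block; `SlidingSharedFFNStack` and its dimensions are hypothetical, not part of the repo.

```python
# Hypothetical sketch: consecutive layers share most FFN up-projection rows and
# only swap out one block of rows per layer (not the project's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingSharedFFNStack(nn.Module):
    def __init__(self, n_layers, d_model=1024, d_ff=4096, blocks_per_layer=8):
        super().__init__()
        block_rows = d_ff // blocks_per_layer
        n_blocks = n_layers + blocks_per_layer - 1       # shared bank of weight blocks
        self.blocks = nn.ParameterList(
            [nn.Parameter(torch.randn(block_rows, d_model) * 0.02) for _ in range(n_blocks)]
        )
        self.down = nn.ModuleList(
            [nn.Linear(d_ff, d_model, bias=False) for _ in range(n_layers)]
        )
        self.blocks_per_layer = blocks_per_layer

    def ffn(self, x, layer):
        # Layer i uses blocks [i : i + blocks_per_layer]; layer i+1 reuses all but one.
        rows = [self.blocks[layer + j] for j in range(self.blocks_per_layer)]
        w1 = torch.cat(rows, dim=0)                      # (d_ff, d_model)
        return self.down[layer](F.gelu(F.linear(x, w1)))
```

Here adjacent layers differ in only one block of up-projection rows, matching "only replace a few of them each time"; the down-projections are kept per-layer for simplicity.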

Future model experiments:

Fine-tuning ideas:

  • Take LLaMA-3 70B Instruct-tuned output from each data chunk, and train the model to generate the same continuations (a way to skip fine-tuning?)
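
A hedged sketch of that idea using the Hugging Face transformers API; the checkpoint name and the `distill_chunk` helper are placeholders, and the student model is assumed to return logits directly.

```python
# Hypothetical sketch: an instruct-tuned teacher writes a continuation for each
# data chunk, and the model is trained to produce the same text with ordinary
# next-token cross-entropy (not the project's actual pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "meta-llama/Meta-Llama-3-70B-Instruct"          # assumption: gated HF checkpoint
teacher_tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype=torch.bfloat16)

def distill_chunk(student, student_tok, chunk_text, max_new_tokens=256):
    # 1) Teacher continues the raw data chunk.
    prompt = teacher_tok(chunk_text, return_tensors="pt").input_ids
    out = teacher.generate(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    target_text = teacher_tok.decode(out[0], skip_special_tokens=True)

    # 2) Student is trained to reproduce the same text under its own tokenizer.
    ids = student_tok(target_text, return_tensors="pt").input_ids
    logits = student(ids[:, :-1])                         # assumed (B, T, vocab) logits
    return F.cross_entropy(logits.transpose(1, 2), ids[:, 1:])
```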

Onion training:

  1. Start with a very small model: nn.Embed -> SambaBlock1 -> Quantized1 (8-bit) -> SambaBlock1 -> heads. nn.Embed is taken from a pre-trained large model and is frozen, and the SambaBlock1 blocks share parameters. One FFN head reproduces the input token ids with a reconstruction loss, a second head predicts the next token with cross-entropy loss, and a third head predicts the token after that.
  2. Train the model with loss = reconstruction + next_token + second_next_token until it converges.
  3. Freeze SambaBlock1 and insert a new SambaBlock2: nn.Embed -> SambaBlock1 -> Quantized1 -> SambaBlock2 -> Quantized2 -> SambaBlock2 -> Quantized1 -> SambaBlock1 -> heads.
  4. Continue training until convergence.
  5. Repeat with a third block, and so on. The Quantized layer works like an auto-encoder bottleneck that is split in half whenever more blocks are inserted between its two sides.
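
A minimal PyTorch sketch of this nesting pattern, with placeholder `SambaBlock` and `QuantizedBottleneck` modules standing in for the real ones (which are not defined in this README); the same block instances are reused on the way back out, which gives the shared-parameter structure described above.

```python
# Hypothetical sketch of the onion structure (not the project's actual code).
import torch
import torch.nn as nn

class SambaBlock(nn.Module):                     # placeholder for the real Samba block
    def __init__(self, d):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.ffn(x)

class QuantizedBottleneck(nn.Module):            # placeholder 8-bit-style bottleneck
    def forward(self, x):
        q = torch.round(x.clamp(-1, 1) * 127) / 127
        return x + (q - x).detach()              # straight-through estimator

class OnionModel(nn.Module):
    def __init__(self, vocab, d, depth):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.embed.weight.requires_grad_(False)  # frozen, taken from a pre-trained model
        self.blocks = nn.ModuleList([SambaBlock(d) for _ in range(depth)])
        self.quants = nn.ModuleList([QuantizedBottleneck() for _ in range(depth)])
        self.recon_head = nn.Linear(d, vocab)    # reproduce the input token ids
        self.next_head = nn.Linear(d, vocab)     # predict the next token
        self.next2_head = nn.Linear(d, vocab)    # predict the token after that

    def forward(self, ids):
        x = self.embed(ids)
        # Descend: Block1 -> Quant1 -> Block2 -> Quant2 -> ...
        for block, quant in zip(self.blocks, self.quants):
            x = quant(block(x))
        # Ascend: ... -> Block2 -> Quant1 -> Block1, reusing the same block instances.
        for i in range(len(self.blocks) - 1, -1, -1):
            x = self.blocks[i](x)
            if i > 0:
                x = self.quants[i - 1](x)
        return self.recon_head(x), self.next_head(x), self.next2_head(x)

ids = torch.randint(0, 1000, (2, 16))
recon, nxt, nxt2 = OnionModel(vocab=1000, d=64, depth=1)(ids)
# depth=1 gives Embed -> Block1 -> Quant1 -> Block1 -> heads; inserting a new block
# (depth=2) would freeze Block1 and continue training, as in step 3. Step 2's
# objective would be loss = reconstruction + next_token + second_next_token.
```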