A work in progress.
For a development setup with fast iterative deployment on a LAN, follow the instructions in the playbooks/ directory.
For Internet-scale training, we will need to build a Docker container...
This uses my dataloader pip package from https://github.com/catid/dataloader
conda create -n lllm python=3.10 -y
conda activate lllm
pip install "numpy<2.0"
pip install packaging torch==2.3.1 torchvision torchaudio
pip install mamba_ssm
pip install causal-conv1d
pip install flash-attn
pip install -r requirements.txt
#pip install cupy (only for 1-bit LAMB)
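A quick sanity check after installation (a minimal sketch; it only confirms that the core packages import and that a CUDA GPU is visible):

```python
# check_env.py - verify the core packages import and the GPU is visible
import torch
import flash_attn       # noqa: F401
import mamba_ssm        # noqa: F401
import causal_conv1d    # noqa: F401

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```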
Follow the instructions in the train/ directory.
Training TODO:
- Implement Async DiLoCo: https://arxiv.org/pdf/2401.09135v1 (see the outer-step sketch after this list)
- Activation compression: https://github.com/zirui-ray-liu/Exact
- Modified loss function: https://openreview.net/pdf?id=vHOO1lxggJ
- Patch-based training: https://arxiv.org/pdf/2407.12665
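For the Async DiLoCo item above, a rough sketch of the outer step, shown synchronously for clarity (function name and the outer optimizer hyperparameters are assumptions; the async variant applies worker deltas as they arrive rather than all at once):

```python
import torch

def diloco_outer_step(global_model, worker_models, outer_opt):
    """One synchronous DiLoCo outer step: average the per-worker pseudo-gradients
    (global weights minus local weights after H inner AdamW steps), apply them
    with the outer optimizer, then re-sync the workers to the new global weights."""
    outer_opt.zero_grad()
    with torch.no_grad():
        for g_param, *local_params in zip(global_model.parameters(),
                                          *(m.parameters() for m in worker_models)):
            # Pseudo-gradient: average direction the workers moved away from the global weights
            g_param.grad = torch.stack([g_param - lp for lp in local_params]).mean(dim=0)
    outer_opt.step()
    for m in worker_models:
        m.load_state_dict(global_model.state_dict())

# Outer optimizer in the SGD + Nesterov momentum style of the paper (values are assumptions):
# outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
```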
Model TODO:
- Use u-muP parameterization: https://arxiv.org/pdf/2407.17465
- Add ALiBi/RoPE positional encoding
- Replace SWA with KV-cache compression as in https://github.com/VITA-Group/LoCoCo/: modify FA2 to expose vertical (per-token) attention scores, take the Top-M tokens with M = 2K, then use a learned projection to pick the Top-K tokens to keep from each window (see the sketch after this list)
- N-gram accelerated LM architecture based on results like https://arxiv.org/pdf/2407.12034
- DRUGS on the second half of the model: https://github.com/EGjoni/DRUGS
- Have some layers attend to the top-k input tokens from the middle of the model
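For the LoCoCo-style KV-compression item above, a rough sketch of the Top-M then learned Top-K selection (module name and shapes are hypothetical; the real version would consume per-token scores exposed by a modified FA2 kernel rather than recomputing them):

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    """Pick K tokens to keep from each window: coarse Top-M by attention score
    (M = 2K), then a learned projection re-scores the M candidates and keeps Top-K."""
    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.m = 2 * k
        self.score_proj = nn.Linear(dim, 1)

    def forward(self, keys, token_scores):
        # keys: (batch, window, dim); token_scores: (batch, window) "vertical" scores from FA2
        m_scores, m_idx = token_scores.topk(self.m, dim=-1)                   # coarse Top-M
        m_keys = keys.gather(1, m_idx.unsqueeze(-1).expand(-1, -1, keys.size(-1)))
        refined = self.score_proj(m_keys).squeeze(-1)                         # learned re-scoring
        k_local = refined.topk(self.k, dim=-1).indices                        # fine Top-K among the M
        keep_idx = m_idx.gather(1, k_local)                                   # map back to window positions
        return keep_idx

# keep_idx would then be used to gather the K/V entries that stay in the compressed cache.
```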
Dataloader TODO:
- Add support for returning the list of concatenated samples in flash_attn varlen format (see the cu_seqlens sketch after this list)
- Add support for DCLM-Baseline 4T: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
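For the flash_attn item above, a minimal sketch of the varlen packing format (concatenated tokens plus cu_seqlens offsets, the layout consumed by flash_attn_varlen_func):

```python
import torch

def pack_samples(samples):
    """Concatenate variable-length token samples into one flat tensor and build
    the cu_seqlens offsets that flash_attn's varlen kernels expect."""
    tokens = torch.cat(samples, dim=0)                        # (total_tokens,)
    lengths = torch.tensor([len(s) for s in samples], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(samples) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    max_seqlen = int(lengths.max())
    return tokens, cu_seqlens, max_seqlen

# Example:
# samples = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
# tokens -> [1, 2, 3, 4, 5], cu_seqlens -> [0, 3, 5], max_seqlen -> 3
```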
Dataloader future improvements:
- RHO-loss for the dataset, using LLaMA-3 8B to provide a reference loss for each token; the reference losses need to be converted to our tokenizer via approximation (see the selection sketch below)
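A rough sketch of the RHO-loss-style selection, assuming per-token reference losses have already been approximated in our tokenizer (the helper name and keep fraction are hypothetical):

```python
import torch

def rho_select(train_losses, ref_losses, keep_fraction=0.5):
    """Keep the tokens with the highest reducible loss, i.e. current training loss
    minus reference-model loss (the RHO-loss criterion)."""
    reducible = train_losses - ref_losses              # (num_tokens,)
    k = max(1, int(keep_fraction * reducible.numel()))
    keep_idx = reducible.topk(k).indices               # the most "learnable" tokens
    mask = torch.zeros_like(reducible, dtype=torch.bool)
    mask[keep_idx] = True
    return mask                                        # True = token contributes to the loss
```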
Training future experiments:
- Meta-learning to estimate the weight updates that larger batch sizes and more iterations would produce, from smaller batch sizes and single steps
- Try optimi https://optimi.benjaminwarner.dev/
- DataStates-LLM DeepSpeed checkpointing: https://github.com/datastates/datastates-llm
- Distributed training using LoRA-Pro: https://arxiv.org/abs/2407.18242
FFN experiments:
- Sharing FFN weights onion-style https://arxiv.org/abs/2104.06022
- Share the majority of FFN weights between consecutive layers but only replace a few of them each time
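A rough sketch of the partial-sharing idea above (hypothetical module; only a small slab of the FFN projections is layer-specific, the rest is reused from the previous layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallySharedFFN(nn.Module):
    """FFN where most projection rows are shared with the previous layer; only
    `num_fresh` rows/columns are layer-specific and get replaced at each depth."""
    def __init__(self, dim, hidden, num_fresh, shared_up, shared_down):
        super().__init__()
        # shared_up: (hidden - num_fresh, dim), shared_down: (dim, hidden - num_fresh),
        # both nn.Parameters reused across consecutive layers.
        self.shared_up = shared_up
        self.shared_down = shared_down
        self.fresh_up = nn.Parameter(torch.randn(num_fresh, dim) * dim ** -0.5)
        self.fresh_down = nn.Parameter(torch.randn(dim, num_fresh) * hidden ** -0.5)

    def forward(self, x):
        up_w = torch.cat([self.shared_up, self.fresh_up], dim=0)        # (hidden, dim)
        down_w = torch.cat([self.shared_down, self.fresh_down], dim=1)  # (dim, hidden)
        return F.gelu(x @ up_w.t()) @ down_w.t()
```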
Future model experiments:
- Use an FFN output from the previous layer to select which tokens are masked in the next SWA layer (MoD; see the routing sketch after this list)
- Instead of gating the whole layer, gate each attention head; do the same for gating FFN output heads
- VAR-style or diffusion style final block that predicts multiple tokens at once
- Byte model
- Spend a layer at the KV-cache point to decide, per token, which tokens to drop (the mask for the second half)
- RWKV-6 for the first half; mix GLA and 20% full attention for the second half of YOCO
- https://github.com/sustcsonglin/flash-linear-attention for implementations of fast linear attention
- RWKV-6: https://github.com/Ronsor/rwkv-simple
- https://github.com/HazyResearch/ThunderKittens
- DeltaNet retrieval heads: https://github.com/sustcsonglin/flash-linear-attention/blob/4929ae7370f99ca60ab1d6e3903ba91ca9d981f0/fla/layers/delta_net.py#L48 https://arxiv.org/pdf/2406.06484
- Mamba2 with cu_seqlens: https://github.com/zigzagcai/mamba (needs some fixes from the master mamba branch)
- Mamba2 h_t ring buffer: https://x.com/MrCatid/status/1800995821543907626
- Samba: have SWA cross-attend to much earlier tokens deeper in the model to avoid squashed-representation collapse
- VQVAE compression of embedding layers: https://arxiv.org/abs/2406.11837
- SparseK KV cache compression: https://arxiv.org/pdf/2406.16747
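For the MoD-style item at the top of this list, a rough routing sketch (module name, capacity, and scoring details are assumptions):

```python
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    """Mixture-of-Depths-style router: score each token from the previous layer's
    FFN output and let only the top `capacity` fraction through the next SWA layer;
    the remaining tokens skip it via the residual path."""
    def __init__(self, dim, capacity=0.25):
        super().__init__()
        self.capacity = capacity
        self.score = nn.Linear(dim, 1)

    def forward(self, x, ffn_out, swa_layer):
        # x, ffn_out: (batch, seq, dim)
        scores = self.score(ffn_out).squeeze(-1)                    # (batch, seq)
        k = max(1, int(self.capacity * x.size(1)))
        keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep causal order
        gather_idx = keep.unsqueeze(-1).expand(-1, -1, x.size(-1))
        processed = swa_layer(x.gather(1, gather_idx))              # SWA on kept tokens only
        out = x.clone()
        out.scatter_(1, gather_idx, processed)                      # unselected tokens pass through
        return out
```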
Fine-tuning ideas:
- Take LLaMA-3 70B Instruct-tuned output for each data chunk, and train the model to generate the same continuations (a possible way to skip fine-tuning? see the sketch below)
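A rough sketch of that idea (hypothetical names; the teacher continuations would be generated offline and stored alongside the data):

```python
import torch
import torch.nn.functional as F

def continuation_loss(student, prompt_ids, teacher_continuation_ids):
    """Train the student to reproduce teacher (LLaMA-3 70B Instruct) continuations:
    standard next-token cross-entropy, scored only over the continuation span."""
    input_ids = torch.cat([prompt_ids, teacher_continuation_ids], dim=1)
    logits = student(input_ids)                        # (batch, seq, vocab)
    start = prompt_ids.size(1)
    pred = logits[:, start - 1:-1, :]                  # positions predicting the continuation
    target = teacher_continuation_ids
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```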
Onion training:
1. Start with a very small model: nn.Embed -> SambaBlock1 -> Quantized1 (8-bit) -> SambaBlock1 -> heads. nn.Embed is taken from a pre-trained large model and is frozen, and the two SambaBlock1 instances share parameters. One FFN head reproduces the input token ids with a reconstruction loss, a second FFN head predicts the next token with cross-entropy loss, and a third head predicts the token after that.
2. Train the model with loss = reconstruction + next_token + second_next_token until it converges.
3. Freeze SambaBlock1 and insert a new SambaBlock2: nn.Embed -> SambaBlock1 -> Quantized1 -> SambaBlock2 -> Quantized2 -> SambaBlock2 -> Quantized1 -> SambaBlock1 -> heads.
4. Continue training until convergence.
5. Repeat with a third block, and so on. Each Quantized layer is an auto-encoder-like bottleneck that gets split in half when new blocks are inserted between its two sides.
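A minimal structural sketch of the onion schedule, assuming hypothetical SambaBlock modules and a simple 8-bit fake-quantization stub in place of the auto-encoder-style Quantized layers (which would actually be split when new blocks are inserted):

```python
import torch
import torch.nn as nn

class Quantize8(nn.Module):
    """Stand-in for the auto-encoder-like Quantized bottleneck (8-bit fake quantization)."""
    def forward(self, x):
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
        return torch.round(x / scale).clamp(-127, 127) * scale

class OnionModel(nn.Module):
    def __init__(self, embed, first_block, dim, vocab):
        super().__init__()
        self.embed = embed                          # frozen, from a pre-trained large model
        self.embed.requires_grad_(False)
        self.blocks = nn.ModuleList([first_block])  # grows toward the centre of the onion
        self.quant = Quantize8()
        self.recon_head = nn.Linear(dim, vocab)     # reconstruction of input token ids
        self.next_head = nn.Linear(dim, vocab)      # next-token prediction
        self.next2_head = nn.Linear(dim, vocab)     # second-next-token prediction

    def forward(self, ids):
        x = self.embed(ids)
        # Walk down then back up the onion: B1 ... BN ... B1 (weights shared per depth),
        # with a quantized bottleneck between every pair of blocks.
        path = list(self.blocks) + list(reversed(self.blocks))
        for i, blk in enumerate(path):
            x = blk(x)
            if i < len(path) - 1:
                x = self.quant(x)
        return self.recon_head(x), self.next_head(x), self.next2_head(x)

    def grow(self, new_block):
        """Freeze the existing blocks and insert a new block at the centre of the onion."""
        for blk in self.blocks:
            blk.requires_grad_(False)
        self.blocks.append(new_block)
```

Training (step 2 above) would then sum cross-entropy over the three heads, and each call to grow() starts the next stage of the schedule.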