# Master the Training of NanoGPT with DLRover

Welcome to an exhaustive guide on how to train the `NanoGPT` model using DLRover.

## What's NanoGPT?

NanoGPT is a specialized version of the famous GPT (Generative Pretrained Transformer) model.
What makes it unique is its role in evaluating the scalability and elasticity of the DLRover job controller.
It provides the ability to tweak hyperparameters like _n_layer_, _n_head_, and _n_embedding_,
making it possible to conduct tests on GPT models of varying sizes.

For a more in-depth dive into the fascinating world of NanoGPT, don't hesitate to visit [NanoGPT](https://github.com/karpathy/nanoGPT)
for the source code and a plethora of other valuable resources.

## Local GPT Training

- Pull the image with the model and data.

```bash
docker pull registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:pytorch-example
docker run -it registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:pytorch-example bash
cd /dlrover/examples/pytorch/nanogpt/
```

- Local run the training by `dlrover-run`

```bash
dlrover-run --nnodes=1 --max_restarts=2 --nproc_per_node=2 \
    train.py --n_layer 48 --n_head 16 --n_embd 1600 \
    --data_dir './' --epochs 50 --save_memory_interval 50 \
    --save_storage_interval 500
```

You also can run the FSDP and DeepSpeed example by using the `fsdp_train.py` and `ds_train.py`.

## Distributed GPT Training on k8s - Let's Dive In

### Setting Up the DLRover Job Controller

Follow the comprehensive guide in the [Controller Deployment](dlrover/docs/deployment/controller.md)
document to get your DLRover job controller up and running.

### Getting Started with a Sample YAML

Starting off with your journey to evaluating the performance of DLRover, you'll be submitting multiple training jobs.
This will be done using NanoGPT with a variety of parameter settings to gauge performance under different conditions.

Kick off the process with the following command:

```bash
kubectl -n dlrover apply -f  examples/pytorch/nanogpt/elastic_job.yaml
```

Upon successful application of the job configuration,
you can monitor the status of the training nodes using the command below:

```bash
kubectl -n dlrover get pods
```

Expect an output that resembles this:

```bash
NAME                                              READY   STATUS    RESTARTS   AGE
dlrover-controller-manager-7dccdf6c4d-grmks       2/2     Running   0          12h
elasticjob-torch-nanogpt-dlrover-master.          1/1     Running   0          20s
torch-nanogpt-edljob-worker-0                     1/1     Running   0          11s
torch-nanogpt-edljob-worker-1                     1/1     Running   0          11s
```

### Examine the results obtained from two different parameter settings

parameter settings 1:

```bash
# parameter settings in examples/pytorch/nanogpt/ddp_elastic_job.yaml
--n_layer 6 \
--n_head 6 \
--n_embd 384
```

parameter settings 2:

```bash
# parameter settings in examples/pytorch/nanogpt/ddp_elastic_job.yaml
--n_layer 12 \
--n_head 12 \
--n_embd 768
```

#### More detailed description of the pods

Worker-0 Logs

```bash
kubectl logs -n dlrover torch-nanogpt-edljob-worker-0
```

results with parameter settings 1:

```text
iter 0: loss 4.2279, time 4542.46ms, mfu -100.00%, lr 6.00e-04, total time 4.54s
iter 1: loss 3.5641, time 4439.20ms, mfu -100.00%, lr 6.00e-04, total time 8.98s
iter 2: loss 4.2329, time 4477.08ms, mfu -100.00%, lr 6.00e-04, total time 13.46s
iter 3: loss 3.6564, time 4579.50ms, mfu -100.00%, lr 6.00e-04, total time 18.04s
iter 4: loss 3.5026, time 4494.54ms, mfu -100.00%, lr 6.00e-04, total time 22.53s
iter 5: loss 3.2993, time 4451.15ms, mfu 0.33%, lr 6.00e-04, total time 26.98s
iter 6: loss 3.3318, time 4391.21ms, mfu 0.33%, lr 6.00e-04, total time 31.38s
```

results with parameter settings 2:

```text
iter 0: loss 4.4201, time 31329.07ms, mfu -100.00%, lr 6.00e-04, total time 31.33s
iter 1: loss 4.6237, time 30611.01ms, mfu -100.00%, lr 6.00e-04, total time 61.94s
iter 2: loss 6.7593, time 30294.34ms, mfu -100.00%, lr 6.00e-04, total time 92.23s
iter 3: loss 4.2238, time 30203.78ms, mfu -100.00%, lr 6.00e-04, total time 122.44s
iter 4: loss 6.1183, time 30100.29ms, mfu -100.00%, lr 6.00e-04, total time 152.54s
iter 5: loss 5.0796, time 30182.75ms, mfu 0.33%, lr 6.00e-04, total time 182.72s
iter 6: loss 4.5217, time 30303.39ms, mfu 0.33%, lr 6.00e-04, total time 213.02s
```

Worker-1 Logs

```bash
kubectl logs -n dlrover torch-nanogpt-edljob-worker-1
```

results with parameter settings 1:

```text
iter 0: loss 4.2382, time 4479.40ms, mfu -100.00%, lr 6.00e-04, total time 4.48s
iter 1: loss 3.5604, time 4557.53ms, mfu -100.00%, lr 6.00e-04, total time 9.04s
iter 2: loss 4.3411, time 4408.12ms, mfu -100.00%, lr 6.00e-04, total time 13.45s
iter 3: loss 3.7863, time 4537.51ms, mfu -100.00%, lr 6.00e-04, total time 17.98s
iter 4: loss 3.5153, time 4489.47ms, mfu -100.00%, lr 6.00e-04, total time 22.47s
iter 5: loss 3.3428, time 4567.38ms, mfu 0.32%, lr 6.00e-04, total time 27.04s
iter 6: loss 3.3700, time 4334.36ms, mfu 0.32%, lr 6.00e-04, total time 31.37s
```

results with parameter settings 2:

```text
iter 0: loss 4.4402, time 31209.29ms, mfu -100.00%, lr 6.00e-04, total time 31.21s
iter 1: loss 4.5574, time 30688.11ms, mfu -100.00%, lr 6.00e-04, total time 61.90s
iter 2: loss 6.7668, time 30233.15ms, mfu -100.00%, lr 6.00e-04, total time 92.13s
iter 3: loss 4.2619, time 30400.66ms, mfu -100.00%, lr 6.00e-04, total time 122.53s
iter 4: loss 6.2001, time 29960.20ms, mfu -100.00%, lr 6.00e-04, total time 152.49s
iter 5: loss 5.0426, time 30222.85ms, mfu 0.32%, lr 6.00e-04, total time 182.71s
iter 6: loss 4.5057, time 30200.79ms, mfu 0.32%, lr 6.00e-04, total time 212.92s
```

### Building from Docker - Step by Step

### Preparing Your Data

To begin, you need a text document which can be a novel, drama, or any textual content.
For instance, you can name this document as data.txt.

Here's an example of a Shakespearean dialogue:p

```text
BUCKINGHAM:
Welcome, sweet prince, to London, to your chamber.

GLOUCESTER:
Welcome, dear cousin, my thoughts' sovereign
The weary way hath made you melancholy.

PRINCE EDWARD:
No, uncle; but our crosses on the way
Have made it tedious, wearisome, and heavy
I want more uncles here to welcome me.
```

Alternatively, you can use our provided data, which is available in the [examples/pytorch/nanogpt/data.txt](examples/pytorch/nanogpt/data.txt).
This data has already been prepared for use.

### Time to Run the Preparation Script

Now that you have your data, let's run the preparation script as follows:

```bash
python examples/pytorch/nanogpt/prepare.py --src_data_path data.txt
This command generates a train.bin and val.bin file in the data directory.
```

### Building the Training Image for PyTorch Models

Having prepared the data, the final step involves building the training image of PyTorch models. Here's how you do it:

```bash
docker build -t easydl/dlrover-train-nanogpt:test -f docker/pytorch/nanogpt.dockerfile .
```

And voila! You're all set to run the model and dive into the world of Natural Language Processing.
I hope this adds more life and detail to your README document. Let me know if there's anything else you need help with!

## References

This eaxmple is built upon and significantly influenced by the [NanoGPT](https://github.com/karpathy/nanoGPT) project.
Several scripts from the project, including but not limited to `prepare.py`, `train.py`, and `model.py`,
have been adapted to our specific requirements.

The original scripts can be found in the NanoGPT repository: [NanoGPT](https://github.com/karpathy/nanoGPT)

## Acknowledgments

We would like to express our sincere gratitude to the authors and contributors of the NanoGPT project.
Their work has provided us with a strong foundation for our example,
and their insights have been invaluable for our development process. Thank you!