For information regarding system architecture, please refer to the Cloud TPU System Architecture page.
You can either follow these tutorials available on the Google Cloud website:
- Training FairSeq Transformer on Cloud TPUs
- Training Resnet50 on Cloud TPUs
- Training PyTorch models on Cloud TPU Pods
Or follow the instructions in this README to run your model.
First, create your Cloud TPU node with the release you wish to use (TPU software version, e.g. pytorch-0.5).
Once you've created a Cloud TPU node, you can train your PyTorch models with either the prebuilt Docker images or the prebuilt PyTorch/XLA Compute VM image.
Follow these steps to train a PyTorch model with Docker on a TPU:
- Create a Compute VM and install Docker (or use the COS VM image).
  - Note: make sure the Compute VM is within the same zone as the TPU node you created, or else performance will suffer. Ideally, also create a VM that has at least 16 cores (n1-standard-16) so that it is not VM compute/network bound.
  Docker images with torch and torch_xla preinstalled in the pytorch conda environment are distributed under gcr.io/tpu-pytorch/xla.
- SSH into the VM and pull the stable Docker image into the VM:
    (vm)$ docker pull gcr.io/tpu-pytorch/xla:r0.5
  Note that we also publish the following nightly Docker image versions, but we recommend that you use a stable version (r0.5):
    gcr.io/tpu-pytorch/xla:nightly
    gcr.io/tpu-pytorch/xla:nightly_YYYYMMDD (e.g. gcr.io/tpu-pytorch/xla:nightly_20190531)
  If you decide to use a nightly image, be sure to create a TPU with the pytorch-nightly version.
- After pulling the Docker image, where $TPU_IP_ADDRESS (e.g. 10.1.1.2) is your TPU internal IP displayed in the GCP UI, you can either:
  - Run the container with a single command:
      (vm)$ docker run --shm-size 16G -e XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" gcr.io/tpu-pytorch/xla:r0.5 python /pytorch/xla/test/test_train_mnist.py
  - Run the script in an interactive shell:
      (vm)$ docker run -it --shm-size 16G gcr.io/tpu-pytorch/xla:r0.5
      (pytorch) root@CONTAINERID:/$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
      (pytorch) root@CONTAINERID:/$ python pytorch/xla/test/test_train_mnist.py
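The single-command Docker invocation above is assembled from three pieces: the image tag, the XRT_TPU_CONFIG value, and the script path. The following is a minimal sketch of how those pieces fit together, not part of PyTorch/XLA itself. The helper names (`image_tag`, `xrt_tpu_config`, `docker_run_argv`) are hypothetical, and the sketch only builds the argument list; it does not start Docker.

```python
import ipaddress
import shlex
from datetime import date
from typing import List, Optional

IMAGE_REPO = "gcr.io/tpu-pytorch/xla"  # image repository named in this README
TPU_PORT = 8470                        # port used in the XRT_TPU_CONFIG examples


def image_tag(release: str = "r0.5", nightly_date: Optional[date] = None) -> str:
    """Return a full image reference (hypothetical helper).

    With nightly_date set, this produces the nightly_YYYYMMDD form;
    otherwise it uses the given release tag.
    """
    if nightly_date is not None:
        return f"{IMAGE_REPO}:nightly_{nightly_date:%Y%m%d}"
    return f"{IMAGE_REPO}:{release}"


def xrt_tpu_config(tpu_ip: str, port: int = TPU_PORT) -> str:
    """Format the XRT_TPU_CONFIG value for a single TPU node."""
    ipaddress.ip_address(tpu_ip)  # raises ValueError if tpu_ip is not an IP address
    return f"tpu_worker;0;{tpu_ip}:{port}"


def docker_run_argv(tpu_ip: str, image: str, script: str) -> List[str]:
    """Build the argv for the single-command `docker run` shown above."""
    return [
        "docker", "run", "--shm-size", "16G",
        "-e", f"XRT_TPU_CONFIG={xrt_tpu_config(tpu_ip)}",
        image, "python", script,
    ]


if __name__ == "__main__":
    argv = docker_run_argv(
        "10.1.1.2", image_tag(), "/pytorch/xla/test/test_train_mnist.py"
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(argv))
```

Validating the IP before formatting catches a stale or mistyped address early, which is cheaper than waiting for the training script to hang at startup.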
Follow these steps to train a PyTorch model with the PyTorch/XLA VM image on a TPU:
- Create a Compute VM with the PyTorch/XLA image.
  - In the GCP Console, go to the VM Instances page.
  - Click Create Instance.
  - Make sure the Compute VM is within the same zone as the TPU node you created, or else performance will suffer. Ideally, also create a VM that has at least 16 cores (n1-standard-16) so that it is not VM compute/network bound.
  - In the Boot disk section, click Change to choose our PyTorch/XLA image.
  - At the bottom of the OS Images tab, select the Debian GNU/Linux 9 Stretch + PyTorch/XLA image.
  - Choose an appropriate disk size based on your dataset and click Select.
  - Click Create to create the instance.
- SSH into the VM and activate the conda environment you wish to use. Each release (e.g. 0.1, 0.5, nightly) is a separate conda environment.
    (vm)$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
    (vm)$ conda env list
    # conda environments:
    #
    base                  *  /anaconda3
    torch-xla-0.1            /anaconda3/envs/torch-xla-0.1
    torch-xla-0.5            /anaconda3/envs/torch-xla-0.5
    torch-xla-nightly        /anaconda3/envs/torch-xla-nightly
    (vm)$ conda activate torch-xla-0.5
    (torch-xla-0.5)$ cd /usr/share/torch-xla-0.5/pytorch/xla
    (torch-xla-0.5)$ python test/test_train_mnist.py
To update the wheels torch and torch_xla to the latest nightly distribution (only updates your torch-xla-nightly conda env), run:
    (vm)$ cd /usr/share/torch-xla-nightly/pytorch/xla
    (vm)$ . ./scripts/update_nightly_torch_wheels.sh
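The examples above suggest a naming pattern on the VM image: a release maps to a conda environment named torch-xla-RELEASE and a source checkout under /usr/share/torch-xla-RELEASE/pytorch/xla. The sketch below encodes that pattern so scripts do not hard-code mismatched pairs. It assumes the pattern holds for the three releases shown in the `conda env list` output; `layout_for` and `ReleaseLayout` are hypothetical names, and nothing here inspects the actual VM.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

# Releases that appear in the `conda env list` example in this README.
KNOWN_RELEASES = ("0.1", "0.5", "nightly")


class ReleaseLayout(NamedTuple):
    conda_env: str            # name to pass to `conda activate`
    repo_dir: PurePosixPath   # directory containing test/test_train_mnist.py


def layout_for(release: str) -> ReleaseLayout:
    """Map a release name to its conda env and checkout path (assumed pattern)."""
    if release not in KNOWN_RELEASES:
        raise ValueError(f"unknown release {release!r}; expected one of {KNOWN_RELEASES}")
    env = f"torch-xla-{release}"
    return ReleaseLayout(env, PurePosixPath("/usr/share") / env / "pytorch" / "xla")


if __name__ == "__main__":
    layout = layout_for("0.5")
    print(f"conda activate {layout.conda_env}")
    print(f"cd {layout.repo_dir}")
```

Deriving both values from one release string keeps the active environment and the script path in step.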
Whereas the previous section focused on training on a single TPU node, this section discusses distributed training in TPU Pods.
The recommended setup for running distributed training on TPU Pods pairs Compute VM instance groups with TPU Pods. Each Compute VM in the instance group drives 8 cores on the TPU Pod, and using an instance group ensures that all of the Compute VMs use an identical base image.
Training on pods can be broken down into three steps:
- Create your instance group (recommended) or use a list of VM instances
- Create your TPU Pod
- Start distributed training
To create your instance group:
1. Create an instance template.
   - During creation, go to the "Identity and API access" → "Access Scopes" section and select "Allow full access to all Cloud APIs".
   - If you already have a VM instance running that you used to train PyTorch/TPU workloads and want to use that exact setup for distributed training: instructions.
   - Or, you can create an instance template using the PyTorch/XLA VM image we provide: instructions.
2. Create an instance group to drive the TPU pod.
   - This instance group is where the entire input pipeline runs and where we feed all the tensors into the TPUs for training.
   - Use the instance template created in step (1) to create your instance group.
   - Make sure to (a) create the instance group in a single zone (the same zone as the TPU Pod you'll create), (b) disable autoscaling and health checks, and (c) set the number of instances (size of the instance group) to the number of cores / 8 (e.g. for a v3-32 you'd create an instance group of size 32/8 = 4).
   - Here are the instructions for creating an instance group: instructions.
To create your TPU Pod:
- Create a TPU pod (same as creating regular TPUs, just select more cores when selecting the TPU type).
  - Make sure that the TPU is in the same zone as the instance group.
  - Make sure that the size of your instance group follows: number of instances in group = number of TPU cores / 8.
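The sizing rule above (number of instances = number of TPU cores / 8) is easy to get wrong when switching pod sizes, so here is a minimal sketch that derives it from the TPU type string. It assumes that the number after the dash in a type such as v3-32 is the core count, as in the example above. `instance_group_size` is a hypothetical helper, not a Cloud TPU API.

```python
import re

CORES_PER_VM = 8  # each Compute VM drives 8 TPU cores, per this README


def instance_group_size(tpu_type: str) -> int:
    """Return how many VMs are needed to drive a TPU of the given type.

    Assumes the "v<generation>-<cores>" naming used in this README,
    e.g. "v3-32" has 32 cores.
    """
    match = re.fullmatch(r"v(\d+)-(\d+)", tpu_type)
    if match is None:
        raise ValueError(f"unrecognized TPU type: {tpu_type!r}")
    cores = int(match.group(2))
    if cores == 0 or cores % CORES_PER_VM != 0:
        raise ValueError(f"core count {cores} is not a positive multiple of {CORES_PER_VM}")
    return cores // CORES_PER_VM


if __name__ == "__main__":
    for tpu in ("v3-8", "v3-32"):
        print(tpu, "needs", instance_group_size(tpu), "VM(s)")
```

Rejecting core counts that are not a multiple of 8 makes a mistyped TPU type fail loudly instead of silently producing an undersized group.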
To start distributed training:
- SSH into any of the VMs in the instance group and enter an environment where you have torch and torch_xla installed (whether that's a conda environment or a Docker container).
- Let's say the command you ran on a v3-8 was: XLA_USE_BF16=1 python test/test_train_imagenet.py --fake_data.
- To distribute training as a conda environment process:
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python /usr/share/torch-xla-nightly/pytorch/xla/test/test_train_imagenet.py --fake_data
- Or, to distribute training as a docker container:
(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:nightly --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
If you opt not to use an instance group, you can instead use a list of VM instances that you may have already created (or can create individually). Make sure that you create all the VM instances in the same zone as the TPU node, and that the VMs have the same configuration (datasets, VM size, disk size, etc.). Then you can start distributed training after creating your TPU pod. The difference is in the python -m torch_xla.distributed.xla_dist command. For example, to use a list of VMs, run the following command (e.g. conda with v3-32):
(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
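The three xla_dist invocations above (conda, Docker, and an explicit VM list) differ only in a few flags. As a rough sketch, the helper below assembles the argument list using only the flags that appear in this README (--tpu, --vm, --conda-env, --docker-image, --docker-run-flag, --env). The function name `xla_dist_argv` and its parameters are hypothetical, and it only builds the command; it does not launch training or check which flags your installed version accepts.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def xla_dist_argv(
    tpu: str,
    train_cmd: Sequence[str],
    conda_env: Optional[str] = None,
    docker_image: Optional[str] = None,
    docker_run_flags: Sequence[str] = (),
    vms: Sequence[str] = (),
    env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argv for `python -m torch_xla.distributed.xla_dist`."""
    # Exactly one execution environment (conda or Docker) must be chosen.
    if (conda_env is None) == (docker_image is None):
        raise ValueError("pass exactly one of conda_env or docker_image")

    argv = ["python", "-m", "torch_xla.distributed.xla_dist", f"--tpu={tpu}"]
    for vm in vms:  # only needed when not using an instance group
        argv += ["--vm", vm]
    if conda_env is not None:
        argv.append(f"--conda-env={conda_env}")
    else:
        argv.append(f"--docker-image={docker_image}")
        argv += [f"--docker-run-flag={flag}" for flag in docker_run_flags]
    for key, value in (env or {}).items():
        argv.append(f"--env={key}={value}")
    # Everything after "--" is the training command run on each VM.
    argv.append("--")
    argv += list(train_cmd)
    return argv


if __name__ == "__main__":
    cmd = xla_dist_argv(
        tpu="my-pod",  # placeholder TPU pod name
        train_cmd=["python", "test/test_train_imagenet.py", "--fake_data"],
        conda_env="torch-xla-nightly",
        vms=["vm-1", "vm-2", "vm-3", "vm-4"],  # placeholder VM names
        env={"XLA_USE_BF16": "1"},
    )
    print(shlex.join(cmd))
```

Keeping the training command as a separate list after `--` mirrors the command-line layout and avoids quoting mistakes when the script takes its own flags.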
To learn more about TPU Pods check out this blog post.
You can also run your models on Colab. However, do note that performance may at times be severely impacted when running on Colab compared to creating your own VM and TPU pair.
Please note that we have nightly releases available, so users usually don't have to build manually. Building from source is mainly for OSS contributors. Please refer to the contribution guide for instructions on building from source.
To run the tests, follow one of the options below:
- Run on local CPU using the XRT client:
    export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
    export XRT_WORKERS="localservice:0;grpc://localhost:40934"
  Select any free TCP port you prefer instead of 40934 (totally arbitrary).
- Run on Cloud TPU using the XRT client by setting the XRT_TPU_CONFIG environment variable:
    export XRT_TPU_CONFIG="tpu_worker;0;<IP of the TPU node>:8470"
Note that the IP of the TPU node can change if the TPU node is reset. If PyTorch seems to hang at startup, verify that the IP of your TPU node is still the same as the one you have configured.
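For the local CPU option, the port is arbitrary as long as it is free. One way to pick one is to ask the operating system for an unused port and then format the two variables. This is a minimal sketch: it assumes the worker address takes the form grpc://localhost:PORT, and `find_free_port` and `local_xrt_env` are hypothetical helper names.

```python
import socket
from typing import Dict


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def local_xrt_env(port: int) -> Dict[str, str]:
    """Return the environment variables for running on local CPU via XRT."""
    return {
        "XRT_DEVICE_MAP": "CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0",
        "XRT_WORKERS": f"localservice:0;grpc://localhost:{port}",
    }


if __name__ == "__main__":
    # Print export lines that can be pasted into a shell.
    for name, value in local_xrt_env(find_free_port()).items():
        print(f'export {name}="{value}"')
```

The port is released before it is used, so another process could in principle claim it in between; for a local test run that small race is usually acceptable.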
If you plan to build from source, and therefore use the latest PyTorch/TPU code base, we suggest selecting the Nightly builds when you create a Cloud TPU instance.
Then run test/run_tests.sh and test/cpp/run_tests.sh to verify the setup is working.
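If the tests appear to hang when targeting a Cloud TPU, a quick TCP check against the configured address can help rule out a stale IP before digging further. The sketch below only tests whether something accepts connections on the given host and port (8470 in this README's examples). A successful connection does not prove the TPU is healthy, and `tpu_port_open` is a hypothetical helper rather than part of PyTorch/XLA.

```python
import socket


def tpu_port_open(host: str, port: int = 8470, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    tpu_ip = "10.1.1.2"  # placeholder: the IP used in your XRT_TPU_CONFIG
    if tpu_port_open(tpu_ip):
        print(f"{tpu_ip}:8470 accepts connections")
    else:
        print(f"cannot reach {tpu_ip}:8470; check whether the TPU IP has changed")
```

Run it from the Compute VM that drives the TPU, since the TPU internal IP is generally not reachable from outside its network.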
Please check out the API Guideline for best practices on writing models that run on TPU and TPU Pod devices.
If you see bad performance when using PyTorch/XLA, please check out the troubleshooting guide for how to avoid common pitfalls and how to debug.
We use github issues to communicate with users and open source contributors. Please file an issue for questions, bug reports, feature requests, install issues, RFCs, thoughts, etc.
Please refer to the contribution guide for detailed instructions.