Tutorial for Training

Every experiment is defined by a YAML file under the projects/UNINEXT/configs folder. UNINEXT has three training stages: (1) Objects365 pretraining, (2) image-level joint training, and (3) video-level joint training. The corresponding YAML files start with obj365v2_32g, image_joint, and video_joint_r50, respectively. By default, we train UNINEXT using 32 or 16 A100 GPUs. For users who are interested in only a subset of tasks, such as object detection (OD) and instance segmentation (IS), we also provide single-task YAML files, whose names start with single_task. By default, we run these single-task experiments on a single node with 8 GPUs.
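The prefix convention above can be sketched as a small shell helper that reports which stage a config name belongs to. This is only an illustration: the function name stage_of_config is hypothetical and not part of UNINEXT, and the prefixes are copied as-is from the paragraph above.

```shell
# Sketch: map a config name (without ".yaml") to the training stage
# implied by its prefix. stage_of_config is a hypothetical helper.
stage_of_config() {
    case "$1" in
        obj365v2_32g*)    echo "Objects365 pretraining" ;;
        image_joint*)     echo "image-level joint training" ;;
        video_joint_r50*) echo "video-level joint training" ;;
        single_task*)     echo "single-task training" ;;
        *)                echo "unknown" ;;
    esac
}

stage_of_config image_joint_r50   # prints: image-level joint training
```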

Single-Node Training

On a single node with 8 GPUs, run

python3 launch.py --nn 1 --uni 1 \
--config-file projects/UNINEXT/configs/${EXP_NAME}.yaml \
--resume OUTPUT_DIR outputs/${EXP_NAME}

Replace ${EXP_NAME} with the name of the config file to run (without the .yaml extension). Note that video-level tasks depend on the weights of image-level tasks. The single-task configs are listed below.

| Task | YAML | Property |
| --- | --- | --- |
| OD&IS | single_task_det | image |
| REC&RES | single_task_rec | image |
| SOT&VOS | single_task_sot | video |
| VIS | single_task_vis | video |
| RVOS | single_task_rvos | video |
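As a rough illustration of how a row of the table plugs into the single-node command, the sketch below builds the command for one task and prints it instead of running it. It assumes that the YAML column holds the full config file name without the .yaml extension; the helper name single_node_cmd is hypothetical.

```shell
# Dry-run sketch: print the single-node launch command for one task.
# Assumption: the YAML column is the config file name without ".yaml".
# single_node_cmd is a hypothetical helper, not part of UNINEXT.
single_node_cmd() {
    exp_name="$1"   # e.g. single_task_det
    printf '%s\n' "python3 launch.py --nn 1 --uni 1 --config-file projects/UNINEXT/configs/${exp_name}.yaml --resume OUTPUT_DIR outputs/${exp_name}"
}

# Print the command for OD&IS (an image-level task).
single_node_cmd single_task_det
```

Printing first makes it easy to check the config path and output directory before launching. Piping the output to sh would run it.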

Multiple-Node Training

Taking image-level joint training of UNINEXT with a ResNet-50 backbone as an example, run the following commands.

On node 0, run

python3 launch.py --nn 2 --port <PORT> --worker_rank 0 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file projects/UNINEXT/configs/image_joint_r50.yaml \
--resume OUTPUT_DIR ./image_joint_r50

On node 1, run

python3 launch.py --nn 2 --port <PORT> --worker_rank 1 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file projects/UNINEXT/configs/image_joint_r50.yaml \
--resume OUTPUT_DIR ./image_joint_r50

<MASTER_ADDRESS> should be the IP address of node 0. <PORT> must be the same on all nodes. If <PORT> is not specified, the program generates a random number as <PORT>.
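Since the two commands differ only in --worker_rank, one way to keep <PORT> and <MASTER_ADDRESS> consistent across nodes is to generate both commands from shared variables. The sketch below only prints the commands. The default address 10.0.0.1 and port 29500 are placeholders chosen for illustration, not values from UNINEXT, and multi_node_cmd is a hypothetical helper.

```shell
# Dry-run sketch: build the two-node launch command for a given node rank.
# The defaults below are placeholders; set MASTER_ADDRESS and PORT yourself.
MASTER_ADDRESS="${MASTER_ADDRESS:-10.0.0.1}"   # IP address of node 0 (placeholder)
PORT="${PORT:-29500}"                          # must match on every node (placeholder)

# multi_node_cmd is a hypothetical helper, not part of UNINEXT.
multi_node_cmd() {
    rank="$1"   # 0 on node 0, 1 on node 1
    printf '%s\n' "python3 launch.py --nn 2 --port ${PORT} --worker_rank ${rank} --master_address ${MASTER_ADDRESS} --uni 1 --config-file projects/UNINEXT/configs/image_joint_r50.yaml --resume OUTPUT_DIR ./image_joint_r50"
}

multi_node_cmd 0   # command to run on node 0
multi_node_cmd 1   # command to run on node 1
```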