# Environment for AI training management on OVHAI

- Install the required CLI tools: Bash, Docker, the Bitwarden CLI, and the jq CLI.
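A quick sanity check can confirm the tools are on `PATH`. The command names here are assumptions (`docker` for Docker, `bw` for the Bitwarden CLI, `jq` for jq):

```shell
# Print a line for each required tool that is missing from PATH.
# Command names are assumptions: docker, bw (Bitwarden CLI), jq.
for tool in bash docker bw jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```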
- Create or share Bitwarden secrets for S3 and ovhai. Required fields:
  - S3 (for DVC): custom fields named `accesskeyid` and `secretaccesskey`
  - OVHAI:
    - username — the ovhcloud nichandle username (use the restricted user's credentials, not the owner's!)
    - password — the ovhcloud nichandle password
    - custom field named `projectid` — the OVH Public Cloud project ID
    - custom field named `region` — the OVH Public Cloud region, uppercase, e.g. `GRA`
    - custom field named `docker-registry` — the URL of the ovhai Docker registry
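For illustration, a custom field can be read with the Bitwarden CLI and jq. The snippet below inlines a sample of the JSON shape that `bw get item` returns so the jq filter is self-contained; in real use you would pipe `bw get item "<secret-name>"` (on an unlocked vault) into the same filter:

```shell
# Sample of the JSON an item with custom fields yields from `bw get item`;
# the item name and values here are placeholders.
item='{"fields":[{"name":"accesskeyid","value":"AKIAEXAMPLE"},{"name":"secretaccesskey","value":"examplesecret"}]}'
# Extract the `accesskeyid` custom field.
printf '%s' "$item" | jq -r '.fields[] | select(.name == "accesskeyid") | .value'
```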
- Add ovhai-env as a submodule:

  ```bash
  $ git submodule init
  $ git submodule add [email protected]:Jblew/ovhai-env.git env
  ```
- Run `env/bootstrap.sh`
- Configure the environment in `.env`:

  ```bash
  OVHAI_PLATFORM="darwin"
  IMAGE_NAME="cnn-tutorial"
  BITWARDEN_DVC_S3_CREDENTIALS_SECRET="name-of-the-bitwarden-secret"
  BITWARDEN_OVHAI_SECRET="..."
  IMAGE_DIR="${PROJECT_DIR}/image"
  IMAGE_ID_OUT_FILE="${PROJECT_DIR}/.image.id"
  DVC_S3_CREDENTIALS_OUT_FILE="${PROJECT_DIR}/.dvc/aws.credentials"
  OVHAI_CPU_COUNT=2
  OVHAI_GPU_COUNT=1
  CONTAINER_WORKDIR="/w"
  CONTAINER_CMD="python src/train.py"
  VOLUME_DATA_DIR="${PROJECT_DIR}/data"
  VOLUME_DATA_MOUNT="/w/data"
  VOLUME_DATA_NAME="cnn-tutorial-data"
  VOLUME_SRC_DIR="${PROJECT_DIR}/src"
  VOLUME_SRC_MOUNT="/w/src"
  VOLUME_SRC_NAME="cnn-tutorial-src"
  JOBS_DIR="${PROJECT_DIR}/jobs"
  VOLUME_OUTPUTS_MOUNT="/w/outputs"
  VOLUME_OUTPUTS_NAME_BASE="cnn-tutorial-outputs-"
  PARAMSJSON_FILE="${PROJECT_DIR}/params.json"
  CONTAINER_PARAMSJSON_ENV_NAME="PARAMSJSON"
  ```
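As a rough sketch (not the repo's actual implementation), scripts such as `env/setup.sh` can consume a `.env` file like this by exporting every variable it assigns. A two-variable sample file is written here so the snippet runs on its own; note that the real `.env` references `PROJECT_DIR`, which must be defined before sourcing:

```shell
# Hypothetical sketch: load a .env-style file and export its variables.
PROJECT_DIR="$(pwd)"        # must exist before sourcing, since .env expands it
cat > .env.sample <<'EOF'
IMAGE_NAME="cnn-tutorial"
OVHAI_GPU_COUNT=1
EOF
set -a                      # export all variables assigned while sourcing
. ./.env.sample
set +a
echo "image=${IMAGE_NAME} gpus=${OVHAI_GPU_COUNT}"
```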
- Run `env/setup.sh` to install and configure the environment
- Run `env/image-build.sh` to build the Docker image and push it to the ovhai registry
- Run the training job using `env/train-on-ovhai.sh`. This creates a directory for the experiment in `jobs/` with configuration files that the results step uses later. The `params.json` file is serialized into an environment variable and passed to the training container.
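The `params.json` handoff can be pictured as follows. This is a minimal sketch, not the script's actual code; it assumes the variable name `PARAMSJSON` set by `CONTAINER_PARAMSJSON_ENV_NAME` in the `.env` example:

```shell
# Hypothetical sketch of the params.json -> environment variable handoff.
cat > params.json <<'EOF'
{"lr": 0.001, "epochs": 10}
EOF
export PARAMSJSON="$(cat params.json)"   # name assumed from CONTAINER_PARAMSJSON_ENV_NAME
# Inside the container, training code can parse the parameters back out:
python3 -c 'import json, os; print(json.loads(os.environ["PARAMSJSON"])["epochs"])'
```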
- Get logs and download results using `env/results-download.sh`. This command downloads logs and outputs for every subdirectory in `jobs/` that does not already have them.
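One way to picture the "only what's missing" behavior (a guess at the logic for illustration, not the script's actual code) is to scan `jobs/` for subdirectories that lack a downloaded outputs folder:

```shell
# Hypothetical sketch: list jobs/ subdirectories without downloaded outputs.
mkdir -p jobs/run-a/outputs jobs/run-b   # run-a already downloaded, run-b not
for d in jobs/*/; do
  [ -d "${d}outputs" ] || echo "needs download: ${d}"
done
```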