This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Cancel queued AzureML jobs when starting a PR build (#640)
AzureML jobs from failed previous PR builds do not get cancelled, consuming excessive resources. Now kill all queued and running jobs before starting new ones.
ant0nsc committed Jan 25, 2022
1 parent 6a4919c commit 61d9cab
Showing 9 changed files with 112 additions and 18 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ loss.

### Added
- ([#594](https://github.com/microsoft/InnerEye-DeepLearning/pull/594)) When supplying a "--tag" argument, the AzureML jobs use that value as the display name, to more easily distinguish runs.
- ([#640](https://github.com/microsoft/InnerEye-DeepLearning/pull/640)) Cancel AzureML jobs from previous runs of the PR build in the same branch to reduce AML load
- ([#577](https://github.com/microsoft/InnerEye-DeepLearning/pull/577)) Commandline switch `monitor_gpu` to monitor
GPU utilization via Lightning's `GpuStatsMonitor`, switch `monitor_loading` to check batch loading times via
`BatchTimeCallback`, and `pl_profiler` to turn on the Lightning profiler (`simple`, `advanced`, or `pytorch`)
8 changes: 8 additions & 0 deletions azure-pipelines/azureml-conda-environment.yml
@@ -0,0 +1,8 @@
name: AzureML_SDK
channels:
  - defaults
dependencies:
  - pip=20.1.1
  - python=3.7.3
  - pip:
    - azureml-sdk==1.36.0
14 changes: 14 additions & 0 deletions azure-pipelines/build-pr.yml
@@ -17,6 +17,12 @@ variables:
  disable.coverage.autogenerate: 'true'

jobs:
  - job: CancelPreviousJobs
    pool:
      vmImage: 'ubuntu-18.04'
    steps:
      - template: cancel_aml_jobs.yml

  - job: Windows
    pool:
      vmImage: 'windows-2019'
@@ -30,6 +36,7 @@ jobs:
      - template: build.yaml

  - job: TrainInAzureML
    dependsOn: CancelPreviousJobs
    variables:
      - name: tag
        value: 'TrainBasicModel'
@@ -48,6 +55,7 @@ jobs:
          test_run_title: tests_after_training_single_run

  - job: RunGpuTestsInAzureML
    dependsOn: CancelPreviousJobs
    variables:
      - name: tag
        value: 'RunGpuTests'
@@ -70,6 +78,7 @@ jobs:
  # is trained, because we use this build to also check the "submit_for_inference" code, that
  # presently only handles single channel models.
  - job: TrainInAzureMLViaSubmodule
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModel2Epochs1Channel'
@@ -90,6 +99,7 @@ jobs:

  # Train a 2-element ensemble model
  - job: TrainEnsemble
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModelForEnsembleTest'
@@ -114,6 +124,7 @@ jobs:

  # Train a model on 2 nodes
  - job: Train2Nodes
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModel2EpochsMoreData'
@@ -135,6 +146,7 @@ jobs:
          test_run_title: tests_after_training_2node_run

  - job: TrainHelloWorld
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'HelloWorld'
@@ -152,6 +164,7 @@ jobs:
  # Run HelloContainer on 2 nodes. HelloContainer uses native Lighting test set inference, which can get
  # confused after doing multi-node training in the same script.
  - job: TrainHelloContainer
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'HelloContainer'
@@ -176,6 +189,7 @@ jobs:
  # regressions in AML when requesting more than the default amount of memory. This needs to run with all subjects to
  # trigger the bug, total runtime 10min
  - job: TrainLung
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'Lung'
2 changes: 2 additions & 0 deletions azure-pipelines/build_data_quality.yaml
@@ -1,6 +1,8 @@
steps:
  - template: checkout.yml

  - template: prepare_conda.yml

  - bash: |
      conda env create --file InnerEye-DataQuality/environment.yml --name InnerEyeDataQuality
      source activate InnerEyeDataQuality
46 changes: 46 additions & 0 deletions azure-pipelines/cancel_aml_jobs.py
@@ -0,0 +1,46 @@
# ------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------
import os

from azureml._restclient.constants import RunStatus
from azureml.core import Experiment, Run, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication


def cancel_running_and_queued_jobs() -> None:
    environ = os.environ
    print("Authenticating")
    auth = ServicePrincipalAuthentication(
        tenant_id='72f988bf-86f1-41af-91ab-2d7cd011db47',
        service_principal_id=environ["APPLICATION_ID"],
        service_principal_password=environ["APPLICATION_KEY"])
    print("Getting AML workspace")
    workspace = Workspace.get(
        name="InnerEye-DeepLearning",
        auth=auth,
        subscription_id=environ["SUBSCRIPTION_ID"],
        resource_group="InnerEye-DeepLearning")
    branch = environ["BRANCH"]
    print(f"Branch: {branch}")
    if not branch.startswith("refs/pull/"):
        print("This branch is not a PR branch, hence not cancelling anything.")
        exit(0)
    experiment_name = branch.replace("/", "_")
    print(f"Experiment: {experiment_name}")
    experiment = Experiment(workspace, name=experiment_name)
    print(f"Retrieved experiment {experiment.name}")
    for run in experiment.get_runs(include_children=True, properties={}):
        assert isinstance(run, Run)
        status_suffix = f"'{run.status}' run {run.id} ({run.display_name})"
        if run.status in (RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.FINALIZING, RunStatus.CANCELED,
                          RunStatus.CANCEL_REQUESTED):
            print(f"Skipping {status_suffix}")
        else:
            print(f"Cancelling {status_suffix}")
            run.cancel()


if __name__ == "__main__":
    cancel_running_and_queued_jobs()
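
The two decisions the script makes (which experiment a branch maps to, and which runs are left alone) can be tried out without an AzureML workspace. The following is a minimal sketch and not part of this commit. The names `experiment_name_for_branch`, `should_cancel` and `SKIPPED_STATUSES` are hypothetical, and the status strings are an assumption about the values behind the `RunStatus` constants used in the script.

```python
from typing import Optional

# Assumption: these strings match the RunStatus constants that the script skips
# (COMPLETED, FAILED, FINALIZING, CANCELED, CANCEL_REQUESTED).
SKIPPED_STATUSES = frozenset({"Completed", "Failed", "Finalizing", "Canceled", "CancelRequested"})


def experiment_name_for_branch(branch: str) -> Optional[str]:
    """Return the experiment name for a PR branch ref, or None for any other branch."""
    if not branch.startswith("refs/pull/"):
        return None
    return branch.replace("/", "_")


def should_cancel(status: str) -> bool:
    """A run is cancelled unless it has already finished or is already being cancelled."""
    return status not in SKIPPED_STATUSES


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    print(experiment_name_for_branch("refs/pull/640/merge"))
    print(should_cancel("Queued"))
```

Keeping the skip list in one place makes it easy to see that any status outside it, including ones not listed explicitly, is treated as cancellable, which mirrors the `else` branch of the script.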
27 changes: 27 additions & 0 deletions azure-pipelines/cancel_aml_jobs.yml
@@ -0,0 +1,27 @@
steps:
  - checkout: self

  - template: prepare_conda.yml

  # https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
  - task: Cache@2
    displayName: Use cached Conda environment AzureML_SDK
    inputs:
      # Beware of changing the cache key or path independently, safest to change in sync
      key: 'usr_share_miniconda_azureml_conda | "$(Agent.OS)" | azure-pipelines/azureml-conda-environment.yml'
      path: /usr/share/miniconda/envs
      cacheHitVar: CONDA_CACHE_RESTORED

  - script: conda env create --file azure-pipelines/azureml-conda-environment.yml
    displayName: Create Conda environment AzureML_SDK
    condition: eq(variables.CONDA_CACHE_RESTORED, 'false')

  - bash: |
      source activate AzureML_SDK
      python azure-pipelines/cancel_aml_jobs.py
    displayName: Cancel jobs from previous run
    env:
      SUBSCRIPTION_ID: $(InnerEyeDevSubscriptionID)
      APPLICATION_ID: $(InnerEyeDeepLearningServicePrincipalID)
      APPLICATION_KEY: $(InnerEyeDeepLearningServicePrincipalKey)
      BRANCH: $(Build.SourceBranch)
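
The last step passes four environment variables to `cancel_aml_jobs.py`. When the script is run outside the pipeline without them, the first `environ[...]` lookup raises a `KeyError` that names only one variable. A minimal sketch, not part of this commit, of checking all four up front; `REQUIRED_VARIABLES` and `missing_variables` are hypothetical names.

```python
import os
from typing import List, Mapping

# The variables that the pipeline step passes to cancel_aml_jobs.py.
REQUIRED_VARIABLES = ("SUBSCRIPTION_ID", "APPLICATION_ID", "APPLICATION_KEY", "BRANCH")


def missing_variables(environ: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All required environment variables are set.")
```

Taking the environment as a `Mapping` argument rather than reading `os.environ` inside the function keeps the check easy to call with a plain dictionary.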
18 changes: 0 additions & 18 deletions azure-pipelines/checkout.yml
@@ -2,21 +2,3 @@ steps:
  - checkout: self
    lfs: true
    submodules: true

  - bash: |
      subdir=bin
      echo "Adding this directory to PATH: $CONDA/$subdir"
      echo "##vso[task.prependpath]$CONDA/$subdir"
    displayName: Add conda to PATH
    condition: succeeded()
  - bash: |
      conda install conda=4.8.3 -y
      conda --version
      conda list
    displayName: Print conda version and initial package list
  - bash: |
      sudo chown -R $USER /usr/share/miniconda
    condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
    displayName: Take ownership of conda installation
2 changes: 2 additions & 0 deletions azure-pipelines/inner_eye_env.yml
@@ -3,6 +3,8 @@ steps:

  - template: store_settings.yml

  - template: prepare_conda.yml

  # https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
  - task: Cache@2
    displayName: Use cached Conda environment
12 changes: 12 additions & 0 deletions azure-pipelines/prepare_conda.yml
@@ -0,0 +1,12 @@
steps:
  - bash: |
      subdir=bin
      echo "Adding this directory to PATH: $CONDA/$subdir"
      echo "##vso[task.prependpath]$CONDA/$subdir"
    displayName: Add conda to PATH
    condition: succeeded()
  - bash: |
      sudo chown -R $USER /usr/share/miniconda
    condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
    displayName: Take ownership of conda installation
