[v1.x] Address CI failures with docker timeouts (v2) #19890

josephevans · 2021-02-12T05:21:14Z

Description

Add a random sleep (2-10 seconds) to give docker time to flush pulled images to disk and to minimize the chance of a race condition between jenkins slave slots on the same machine, which causes docker run to time out.

mxnet-bot · 2021-02-12T05:21:17Z

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-gpu, clang, miscellaneous, windows-gpu, unix-gpu, sanity, windows-cpu, centos-cpu, website, edge]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

waytrue17

LGTM, thanks!

josephevans · 2021-02-12T19:16:15Z

These 3 PRs are showing symptoms of the docker run command failing:

#19868
#19851
#19888

mseth10 · 2021-02-12T19:24:18Z

ci/safe_docker_run.py

@@ -117,6 +119,9 @@ def run(self, *args, **kwargs) -> int:
 ret = 0
 try:
 # Race condition:
+ # add a random sleep to (a) give docker time to flush disk buffer after pulling image
+ # and (b) minimize race conditions between jenkins runs on same host
+ time.sleep(random.randint(2,10))


How does random help, versus, say, a fixed wait of 5 seconds?

Each jenkins slave (linux cpu nodes, at least) has 2 "slots" that can run jobs in parallel. When 2 jobs using the same docker image start at the exact same time on these 2 slots, both attempt to pull the image from ECR and start a container. If we randomize the delay, the idea is that both containers won't be requested to start at the exact same time.
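To make the random-versus-fixed argument concrete, here is a minimal sketch. It is not the code in ci/safe_docker_run.py; the helper names jitter_seconds and same_delay_probability are hypothetical, and it assumes both slots start at the same instant and draw their delays independently.

```python
import random
from fractions import Fraction
from typing import Optional


def jitter_seconds(low: int = 2, high: int = 10,
                   rng: Optional[random.Random] = None) -> int:
    """Pick a whole-second delay in [low, high], like random.randint(2, 10) in the diff."""
    source = rng if rng is not None else random
    return source.randint(low, high)


def same_delay_probability(low: int = 2, high: int = 10) -> Fraction:
    """Chance that two slots drawing independently land on the same delay."""
    choices = high - low + 1  # randint is inclusive on both ends
    return Fraction(1, choices)


if __name__ == "__main__":
    # A fixed wait keeps both slots in lockstep (probability 1 of the same delay).
    # With randint(2, 10) there are 9 possible delays, so the two slots still
    # collide on the same second about 1 time in 9.
    print("same-delay probability:", same_delay_probability())
    print("example delay:", jitter_seconds())
```

Under these assumptions the random delay reduces, but does not eliminate, simultaneous starts, which fits with the retry attempts that the commit message says are already implemented.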

mseth10

LGTM, thanks for the fix!

* Add random sleep only, since retry attempts are already implemented.
* Reduce random sleep to 2-10 sec.

Co-authored-by: Joe Evans <[email protected]>

* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)
* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)
  * Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.
  * Set symlink for python3 to point to newly installed 3.6 version.
  * Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.
  * Setup symlinks in /usr/local/bin, since it comes first in the path.
  * Don't use absolute path for python3 executable, just use python3 from path.
  Co-authored-by: Joe Evans <[email protected]>
* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)
  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)
  * Set default for cache_intermediate.
  * Make sure we sanitize region extracted from registry, since we pass it to os.system.
  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] Address CI failures with docker timeouts (v2) (#19890)
  * Add random sleep only, since retry attempts are already implemented.
  * Reduce random sleep to 2-10 sec.
  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] CI fixes to make more stable and upgradable (#19895)
  * Test moving pipelines from p3 to g4.
  * Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
  * Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
  * Remove old files.
  * Fix comment
  * Set default environment variables
  * Fix GPU syntax.
  * Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
  * Check if codecov works without providing parameters now.
  * Send docker stderr to sys.stderr
  * Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.
  Co-authored-by: Joe Evans <[email protected]>
* fix cd
* fix cudnn version for cu10.2 buiuld
* WAR the dataloader issue with forked processes holding stale references (#19924)
* skip some tests
* fix ski[
* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)
* update cude compt for cd
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update runtime_functions.sh
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update Dockerfile.build.ubuntu_gpu_cu102
* update command

Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Przemyslaw Tredak <[email protected]>
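The #19895 notes in the commit message above mention calling docker through subprocess with a list parameter (to avoid shell injection) and trying '--gpus all' before falling back to '--runtime nvidia'. The following is a rough sketch of that idea only, not the actual CI script: build_docker_cmd, run_cmd, run_with_gpu_fallback and the injectable runner argument are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# GPU flag sets to try in order, as described in the commit notes.
GPU_FLAG_CANDIDATES = (["--gpus", "all"], ["--runtime", "nvidia"])


def build_docker_cmd(image: str, command: Sequence[str],
                     gpu_flags: Sequence[str] = ()) -> List[str]:
    """Build an argv list; each element is passed to docker as-is, with no shell parsing."""
    return ["docker", "run", "--rm", *gpu_flags, image, *command]


def run_cmd(cmd: Sequence[str]) -> int:
    """Run the command without a shell (shell=False is the default) and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_with_gpu_fallback(image: str, command: Sequence[str],
                          runner: Callable[[Sequence[str]], int] = run_cmd) -> int:
    """Try each GPU flag set in turn and stop at the first zero exit code.

    Simplification (an assumption of this sketch): any non-zero exit code
    triggers the fallback, including a failure of the containerized command.
    """
    ret = 1
    for flags in GPU_FLAG_CANDIDATES:
        ret = runner(build_docker_cmd(image, command, flags))
        if ret == 0:
            break
    return ret
```

Because the command is a list, an argument such as "hi; rm -rf /" reaches docker as a single literal string rather than being interpreted by a shell.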

Add random sleep only, since retry attempts are already implemented.

3b76515

josephevans requested review from aaronmarkham and marcoabreu as code owners February 12, 2021 05:21

lanking520 added the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) Feb 12, 2021

Reduce random sleep to 2-10 sec.

6406b83

lanking520 added the pr-work-in-progress (PR is still work in progress) and pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels, and removed the pr-awaiting-testing and pr-work-in-progress labels Feb 12, 2021

waytrue17 approved these changes Feb 12, 2021

lanking520 added the pr-awaiting-review label (PR is waiting for code review) and removed the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) Feb 12, 2021

mseth10 reviewed Feb 12, 2021

mseth10 approved these changes Feb 12, 2021

mseth10 merged commit b5b6743 into apache:v1.x Feb 12, 2021

josephevans deleted the ci_docker_timeouts_v1.x_v2 branch February 12, 2021 19:36

josephevans mentioned this pull request Feb 24, 2021

[v1.8.x] Backport PRs from v1.x branch #19946

Closed
