This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI] Prevent timeouts when rebuilding containers with docker. #13818

Merged: 4 commits, Jan 11, 2019

Conversation

larroy
Contributor

@larroy larroy commented Jan 9, 2019

Increase timeout from 120 to 180 for pipelines
Increase timeout for docker pull as we get timeout when rebuilding the docker cache:

http://jenkins.mxnet-ci.amazon-ml.com/job/restricted-docker-cache-refresh/job/master/1190/console

Limit parallel builds to 10
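
As a rough illustration of the two mitigations above (a timeout around slow commands such as docker pull, and a cap on parallel work), here is a minimal Python sketch. It is not the code from this PR: the names `run_with_timeout` and `run_all` and the 600-second default are hypothetical placeholders. Only the limit of 10 comes from the description.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Assumed placeholder values: the timeout is illustrative only;
# the parallelism cap of 10 mirrors the PR description.
DEFAULT_TIMEOUT_S = 600
MAX_PARALLEL = 10


def run_with_timeout(cmd, timeout_s=DEFAULT_TIMEOUT_S):
    """Run cmd; return True on exit code 0, False on failure or timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it does not finish within timeout_s seconds.
        subprocess.run(cmd, check=True, timeout=timeout_s,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return False


def run_all(commands, max_parallel=MAX_PARALLEL, timeout_s=DEFAULT_TIMEOUT_S):
    """Run commands with at most max_parallel in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda c: run_with_timeout(c, timeout_s), commands))
```

A caller might use it as `run_all([["docker", "pull", tag] for tag in tags])`, where `tags` is a hypothetical list of image tags. Capping the worker count keeps a cache rebuild from starting every pull at once, which is one plausible way slow pulls end up hitting the timeout.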

Description

Mitigation for failing CI

fixes #13817

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
      • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@anirudhacharya
Member

@mxnet-label-bot add [pr-awaiting-review, Scala]

Member

@lanking520 lanking520 left a comment


It looks good to me. Could you please point to some PRs that have this issue?

@larroy
Contributor Author

larroy commented Jan 10, 2019

@larroy
Contributor Author

larroy commented Jan 10, 2019

I restarted 4 PRs because of this issue

Contributor

@aaronmarkham aaronmarkham left a comment


@larroy
Contributor Author

larroy commented Jan 10, 2019

@aaronmarkham yes, could we move that into the ci/ folder for consistency? it's easy to miss if we have scripts and infrastructure in the docs folder.

@marcoabreu
Contributor

Could we hold on with the merge please. I'm not really sure whether this fixes the problem or works around another regression

@aaronmarkham
Contributor

> @aaronmarkham yes, could we move that into the ci/ folder for consistency? it's easy to miss if we have scripts and infrastructure in the docs folder.

When I first put this together @marcoabreu and I discussed that, but I can't remember why it was better to have it in docs. Maybe that's changed? Marco, do you remember why? If we need to leave it there we could add some notes to the CI readme so it doesn't get overlooked.

@larroy
Contributor Author

larroy commented Jan 10, 2019

> Could we hold on with the merge please. I'm not really sure whether this fixes the problem or works around another regression

While we hold off, CI keeps timing out. It took me quite a while to get this PR to pass CI because of the timeouts (I had to manually rebuild the cache). What steps are you taking to understand whether it fixes the problem? What makes you think my fix doesn't address the problem? If CI is failing, we can't merge the PRs that fix CI, because master is protected.

@gavinmbell

👍
lgtm

Member

@lanking520 lanking520 left a comment


A small concern there. Otherwise LGTM

ci/docker_cache.py (review thread resolved)
Contributor

@marcoabreu marcoabreu left a comment


See above

@larroy
Contributor Author

larroy commented Jan 11, 2019

Responded. I would appreciate it if this could be merged to prevent further CI failures.

Contributor

@lebeg lebeg left a comment


would be great to merge

@marcoabreu marcoabreu merged commit 9c3253d into apache:master Jan 11, 2019
zhaoyao73 added a commit to zhaoyao73/incubator-mxnet that referenced this pull request Jan 11, 2019
* upstream/master: (109 commits)
  Code modification for  testcases of various network models in directory example (apache#12498)
  [CI] Prevent timeouts when rebuilding containers with docker. (apache#13818)
  fix Makefile for rpkg (apache#13590)
  change to compile time (apache#13835)
  Disabled flaky test (apache#13758)
  Improve license_header tool by only traversing files under revision c… (apache#13803)
  Removes unneeded nvidia driver ppa installation (apache#13814)
  Add Local test stage and option to jump directly to menu item from commandline (apache#13809)
  Remove MXNET_STORAGE_FALLBACK_LOG_VERBOSE from test_autograd.py (apache#13830)
  Fix scala doc build break for v1.3.1 (apache#13820)
  [MXNET-1263] Unit Tests for Java Predictor and Object Detector APIs (apache#13794)
  [MXNET-1260] Float64 DType computation support in Scala/Java (apache#13678)
  onnx export ops (apache#13821)
  [MXNET-880] ONNX export: Random uniform, Random normal, MaxRoiPool (apache#13676)
  fix minor indentation (apache#13827)
  Fixing a symlink issue with R install (apache#13708)
  remove useless code (apache#13777)
  ONNX ops: norm exported and lpnormalization imported (apache#13806)
  Add new Maven build for Scala package (apache#13819)
  Dockerfiles for Publish Testing (apache#13707)
  ...
@larroy larroy deleted the ci branch April 5, 2019 18:34
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…#13818)

* Prevent timeouts when rebuilding containers with docker.
Increase timeout from 120 to 180 for pipelines

* Increase docker cache timeout

* Increase timeout also for docs

* limit parallel builds to 10
Labels
pr-awaiting-review PR is waiting for code review Scala
Development

Successfully merging this pull request may close these issues.

[CI] timeouts due to docker image rebuild
9 participants