ClearML Dockerfile fix #9876

Merged
merged 1 commit into from
Oct 20, 2022

glenn-jocher
Member

@glenn-jocher glenn-jocher commented Oct 20, 2022

Signed-off-by: Glenn Jocher [email protected]

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Updates the YOLOv5 Dockerfile to optimize Python package installation.

📊 Key Changes

  • Removed torchtext and torchvision from the pip uninstall command.
  • Updated pip install command to exclude clearml and fix OpenCV version constraint.

🎯 Purpose & Impact

  • 🔍 The Dockerfile is simplified by removing unnecessary uninstalls.
  • 📈 Users will experience more stable builds due to the pinned version of OpenCV.
  • ❌ Removing clearml could hint at a streamlining of dependencies for specific use-cases, reducing image size and build time.

Signed-off-by: Glenn Jocher <[email protected]>
@glenn-jocher
Member Author

@thepycoder I ran into a ValueError with the default ClearML install in the Dockerfile during DDP training. It occurs whenever ClearML is installed and a training command is run with no ClearML authentication or other setup steps taken.

root@44816c00311e:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 2,3
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 29 (delta 17), reused 22 (delta 13), pack-reused 0
Unpacking objects: 100% (29/29), 19.13 KiB | 2.13 MiB/s, done.
From https://github.com/ultralytics/yolov5
   6371de8..3b1a9d2  master     -> origin/master
   319b395..c6e9ea5  exp8       -> origin/exp8
github: ⚠️ YOLOv5 is out of date by 1 commit. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.2-203-g6371de8 Python-3.8.13 torch-1.12.1+cu113 CUDA:2 (A100-SXM-80GB, 81251MiB)
                                                             CUDA:3 (A100-SXM-80GB, 81251MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 524, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 93, in train
    loggers = Loggers(save_dir, weights, opt, hyp, LOGGER)  # loggers instance
  File "/usr/src/app/utils/loggers/__init__.py", line 121, in __init__
    self.clearml = ClearmlLogger(self.opt, self.hyp)
  File "/usr/src/app/utils/loggers/clearml/clearml_utils.py", line 87, in __init__
    self.task = Task.init(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 601, in init
    task = cls._create_dev_task(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3122, in _create_dev_task
    task = cls(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 199, in __init__
    super(Task, self).__init__(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/task/task.py", line 155, in __init__
    super(Task, self).__init__(id=task_id, session=session, log=log)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 145, in __init__
    super(IdObjectBase, self).__init__(session, log, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 39, in __init__
    self._session = session or self._get_default_session()
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
    InterfaceBase._default_session = Session(
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_api/session/session.py", line 186, in __init__
    raise ValueError(
ValueError: ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 418 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 417) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-20_18:10:27
  host      : 44816c00311e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 417)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@44816c00311e:/usr/src/app# 

Also occurs with basic single-GPU training:

root@44816c00311e:/usr/src/app# python train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 2
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=2, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (5/5), 3.12 KiB | 3.13 MiB/s, done.
From https://github.com/ultralytics/yolov5
 * [new branch]      glenn-jocher-patch-2 -> origin/glenn-jocher-patch-2
github: ⚠️ YOLOv5 is out of date by 1 commit. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.2-203-g6371de8 Python-3.8.13 torch-1.12.1+cu113 CUDA:2 (A100-SXM-80GB, 81251MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 524, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 93, in train
    loggers = Loggers(save_dir, weights, opt, hyp, LOGGER)  # loggers instance
  File "/usr/src/app/utils/loggers/__init__.py", line 121, in __init__
    self.clearml = ClearmlLogger(self.opt, self.hyp)
  File "/usr/src/app/utils/loggers/clearml/clearml_utils.py", line 87, in __init__
    self.task = Task.init(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 601, in init
    task = cls._create_dev_task(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3122, in _create_dev_task
    task = cls(
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 199, in __init__
    super(Task, self).__init__(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/task/task.py", line 155, in __init__
    super(Task, self).__init__(id=task_id, session=session, log=log)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 145, in __init__
    super(IdObjectBase, self).__init__(session, log, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 39, in __init__
    self._session = session or self._get_default_session()
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
    InterfaceBase._default_session = Session(
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_api/session/session.py", line 186, in __init__
    raise ValueError(
ValueError: ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
root@44816c00311e:/usr/src/app# 
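
Both tracebacks end in the same ValueError, raised because neither `~/clearml.conf` nor the `CLEARML_API_HOST` environment variable is present. One possible guard, sketched below purely as an illustration and not as the fix that was merged, is to check for those two sources before constructing the logger and to fall back to no ClearML logging otherwise. The helper names `clearml_configured` and `init_clearml_logger`, and the `make_logger` callable, are hypothetical; the check covers only the two configuration sources named in the error message.

```python
import os
from pathlib import Path


def clearml_configured(env=None, home=None):
    """Return True if a config source named in the ValueError is present.

    Hypothetical helper: it only looks at CLEARML_API_HOST and
    <home>/clearml.conf, the two sources the error message mentions.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    return bool(env.get("CLEARML_API_HOST")) or (home / "clearml.conf").is_file()


def init_clearml_logger(make_logger, env=None, home=None):
    """Call make_logger() only when ClearML looks configured, else return None.

    make_logger is an assumed zero-argument callable, for example
    lambda: ClearmlLogger(opt, hyp) in a training script.
    """
    if not clearml_configured(env, home):
        return None
    try:
        return make_logger()
    except ValueError as err:  # same exception type as in the tracebacks above
        print(f"ClearML logging disabled: {err}")
        return None
```

With a guard like this, an unconfigured container would simply train without ClearML logging instead of aborting at logger construction.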

@glenn-jocher glenn-jocher mentioned this pull request Oct 20, 2022
1 task
@glenn-jocher glenn-jocher merged commit eef9057 into master Oct 20, 2022
@glenn-jocher glenn-jocher deleted the glenn-jocher-patch-2 branch October 20, 2022 18:17
@glenn-jocher
Member Author

glenn-jocher commented Oct 20, 2022

@thepycoder this seems unrelated to Docker, so I'll raise a bug report.

EDIT: raised in #9877

@glenn-jocher glenn-jocher linked an issue Oct 20, 2022 that may be closed by this pull request
1 task
Development

Successfully merging this pull request may close these issues.

Multi-GPU distributed error