Problems with the --multi-scale option with CUDA #7678

Closed
1 task done
DP1701 opened this issue May 3, 2022 · 33 comments
Labels
bug Something isn't working

Comments

@DP1701

DP1701 commented May 3, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Training fails when the --multi-scale option is enabled. It stops at the very start of the first epoch.

(YOLOv5_enviroment) userA@dgx:~/yolov5$ python train.py --multi-scale

train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-171-gc4862fc torch 1.11.0+cu113 CUDA:0 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
WARNING: DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|█
Plotting labels to runs/train/exp2/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp2
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/8 [00:00<?, ?it/s]                                                                                                                                          
Traceback (most recent call last):
  File "train.py", line 668, in <module>
    main(opt)
  File "train.py", line 563, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 349, in train
    pred = model(imgs)  # forward
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
    res = scatter_map(inputs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

### Environment

YOLOv5 🚀 v6.1-170-gbff6e51 torch 1.11.0+cu113 CUDA:0 (A100-SXM4-40GB, 40537MiB)
Python 3.8.10

pip list

Package Version


absl-py 1.0.0
albumentations 1.1.0
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.9
cycler 0.11.0
fonttools 4.28.3
google-auth 2.3.3
google-auth-oauthlib 0.4.6
grpcio 1.42.0
idna 3.3
imageio 2.13.3
importlib-metadata 4.8.2
joblib 1.1.0
kiwisolver 1.3.2
Markdown 3.3.6
matplotlib 3.5.1
networkx 2.6.3
numpy 1.21.4
oauthlib 3.1.1
opencv-python 4.5.4.60
opencv-python-headless 4.5.4.60
packaging 21.3
pandas 1.3.5
Pillow 8.4.0
pip 20.0.2
pkg-resources 0.0.0
protobuf 3.19.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 3.0.6
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
qudida 0.0.4
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.8
scikit-image 0.19.0
scikit-learn 1.0.1
scipy 1.7.3
seaborn 0.11.2
setuptools 44.0.0
six 1.16.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
thop 0.0.31.post2005241907
threadpoolctl 3.0.0
tifffile 2021.11.2
torch 1.11.0+cu113
torchaudio 0.11.0+cu113
torchvision 0.12.0+cu113
tqdm 4.62.3
typing-extensions 4.0.1
urllib3 1.26.7
Werkzeug 2.0.2
wheel 0.37.0
zipp 3.6.0


nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

Ubuntu 20.04.3 LTS


### Minimal Reproducible Example

python train.py --multi-scale

### Additional

_No response_

### Are you willing to submit a PR?

- [ ] Yes I'd like to help by submitting a PR!
DP1701 added the bug label on May 3, 2022
@glenn-jocher
Member

glenn-jocher commented May 3, 2022

@DP1701 your error message clearly states RuntimeError: CUDA error: out of memory.

YOLOv5 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU, keep your batch size small enough that you do not use all of your GPU memory; otherwise you will see a CUDA Out Of Memory (OOM) error and your training will crash. You can observe your CUDA memory utilization either with the nvidia-smi command or in your console output:

Screenshot 2021-05-28 at 12 19 51
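For anyone who wants to check memory from a script rather than by eye, here is a minimal sketch that parses the CSV form of nvidia-smi output. The function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical helpers, not part of YOLOv5, and the sample text is illustrative rather than real output from this thread.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...].

    Needs an NVIDIA driver with nvidia-smi on PATH; not called below.
    """
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Illustrative sample text (assumed values, not taken from a real run).
sample = "3911, 40537\n0, 40537\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample)):
    print(f"GPU {index}: {used}/{total} MiB ({used / total:.0%})")
```

On a machine with a GPU, `query_gpu_memory()` could be polled while training runs to see how close each device is to its limit.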

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are:

  • Reduce --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Train with multi-GPU at the same --batch-size
  • Upgrade your hardware to a larger GPU
  • Train on free GPU backends with up to 16GB of CUDA memory: Open In Colab Open In Kaggle
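To get a feel for how much the first two options might save, here is a rough rule-of-thumb sketch. It assumes activation memory grows linearly with batch size and with pixel count (image side squared) and ignores weights and optimizer state, so treat the ratios as ballpark figures only. `relative_memory` is a hypothetical helper, not a YOLOv5 function.

```python
def relative_memory(batch_size, img_size, ref_batch_size=16, ref_img_size=640):
    """Rough activation-memory ratio versus a reference run.

    Assumption: memory scales with batch_size * img_size**2.
    Weights, optimizer state and allocator overhead are ignored.
    """
    return (batch_size / ref_batch_size) * (img_size / ref_img_size) ** 2


print(relative_memory(8, 640))   # half the batch size
print(relative_memory(16, 320))  # half the image side
print(relative_memory(16, 960))  # 1.5x the image side
```

Under this assumption, an image side 1.5 times the nominal one costs about 2.25 times the memory. That may be relevant with --multi-scale, since the logs in this thread show img_size values of up to 928 for a nominal 640.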

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.
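The idea of solving for a batch size can be sketched with a straight-line fit of GPU memory against batch size. This is an assumption about the general approach, not a copy of the AutoBatch code; `fit_line` and `estimate_batch_size` are hypothetical names. The sample figures are the batch sizes and GPU_mem values from the AutoBatch profile table that the reporter posts later in this thread, together with the 39.47G free memory from the same log.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_batch_size(batch_sizes, mem_gb, free_gb, fraction=0.9):
    """Largest batch size whose fitted memory stays under fraction * free_gb."""
    slope, intercept = fit_line(batch_sizes, mem_gb)
    return int((free_gb * fraction - intercept) / slope)


# Profile figures copied from the reporter's AutoBatch log (imgsz 640).
batch_sizes = [1, 2, 4, 8, 16]
mem_gb = [0.281, 0.476, 0.883, 1.739, 3.347]

print(estimate_batch_size(batch_sizes, mem_gb, free_gb=39.47))  # 172 with these figures
```

Note that the profile is measured at a fixed image size (640 in that log), so a batch size chosen this way leaves little headroom if the image size grows during training.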

Screenshot 2021-11-06 at 12 31 10

Good luck 🍀 and let us know if you have any other questions!

@DP1701
Author

DP1701 commented May 3, 2022

@glenn-jocher
With --batch -1 --epochs 10
GPU: A100, device 6 (its memory is empty and no other computation is running on it)

Screenshot 2022-05-03 at 09 16 53

(YOLOv5_enviroment) userA@dgx:~/YOLO_detectors/yolov5_new/yolov5$ python train.py --batch -1 --epochs 10 --multi-scale --device 6
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=6, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-172-ge305aba torch 1.11.0+cu113 CUDA:6 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (A100-SXM4-40GB) 39.59G total, 0.07G reserved, 0.05G allocated, 39.47G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7235389       16.53         0.281         22.84         14.51        (1, 3, 640, 640)                    list
     7235389       33.06         0.476         23.82         14.13        (2, 3, 640, 640)                    list
     7235389       66.13         0.883          23.1         14.99        (4, 3, 640, 640)                    list
     7235389       132.3         1.739         24.25         17.83        (8, 3, 640, 640)                    list
     7235389       264.5         3.347         36.36         28.69       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 172 for CUDA:0 35.63G/39.59G (90%)
Scaled weight_decay = 0.0013437500000000001
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?i
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/
Plotting labels to runs/train/exp10/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp10
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     35.2G   0.04462   0.05522   0.01507      1686       704: 100%|██████████| 1/1 [00:03<00:00,  3.85s/it]                                                                                           
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
                 all        128        929      0.669      0.661      0.712      0.475

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     39.2G   0.04532   0.04671   0.01566      1492       736: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]                                                                                           
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
                 all        128        929      0.701      0.631      0.703       0.46

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     39.2G   0.05562    0.2054   0.03119      1816       352: 100%|██████████| 1/1 [00:00<00:00,  6.71it/s]                                                                                           
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:01<00:00,  1.38s/it]
                 all        128        929      0.708      0.632      0.701      0.457

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     39.2G   0.04507   0.07174   0.01706      1529       576: 100%|██████████| 1/1 [00:00<00:00,  3.15it/s]                                                                                           
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:00<00:00,  2.53it/s]
                 all        128        929      0.737      0.623      0.706      0.462

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/1 [00:00<?, ?it/s]                                                                                                                                                                           
Traceback (most recent call last):
  File "train.py", line 668, in <module>
    main(opt)
  File "train.py", line 563, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 349, in train
    pred = model(imgs)  # forward
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 135, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 158, in _forward_once
    x = m(x)  # run
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/yolov5_new/yolov5/models/yolo.py", line 57, in forward
    x[i] = self.m[i](x[i])  # conv
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/raid/USERDATA/userA/YOLO_detectors/YOLOv5_enviroment/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

With --batch 8 --epochs 10

(YOLOv5_enviroment) userA@dgx:~/YOLO_detectors/yolov5_new/yolov5$ python train.py --batch 8 --epochs 10 --multi-scale --device 6
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=6, multi_scale=True, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-172-ge305aba torch 1.11.0+cu113 CUDA:6 (A100-SXM4-40GB, 40537MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 270 layers, 7235389 parameters, 7235389 gradients, 16.5 GFLOPs

Transferred 349/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?i
val: Scanning '/raid/USERDATA/userA/YOLO_detectors/yolov5_new/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/
Plotting labels to runs/train/exp12/labels.jpg... 

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp12
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     3.82G   0.04653   0.07317   0.02241       119       448: 100%|██████████| 16/16 [00:03<00:00,  4.15it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.67it/s]
                 all        128        929      0.662      0.685       0.72      0.471

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     4.05G   0.04652   0.07795    0.0223       113       640: 100%|██████████| 16/16 [00:01<00:00, 14.66it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.19it/s]
                 all        128        929      0.805      0.618      0.735      0.475

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     4.29G   0.04754   0.08584   0.01922        84       928: 100%|██████████| 16/16 [00:01<00:00, 14.69it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.48it/s]
                 all        128        929      0.743      0.641      0.717      0.457

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     4.29G   0.04975   0.08548   0.01883       119       832: 100%|██████████| 16/16 [00:01<00:00, 15.61it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.16it/s]
                 all        128        929      0.739      0.653      0.714      0.427

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       4/9     4.29G    0.0512   0.06284   0.02203        69       384: 100%|██████████| 16/16 [00:01<00:00, 14.97it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 11.38it/s]
                 all        128        929      0.653      0.645      0.659      0.354

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       5/9     4.29G   0.05181   0.08603   0.01982        61       448: 100%|██████████| 16/16 [00:01<00:00, 15.84it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 15.19it/s]
                 all        128        929      0.734      0.666      0.725      0.394

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       6/9     4.29G   0.05344   0.07487   0.01788       125       800: 100%|██████████| 16/16 [00:01<00:00, 15.45it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.91it/s]
                 all        128        929      0.703      0.627      0.694      0.393

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       7/9     4.29G   0.05734   0.07648   0.01843        76       544: 100%|██████████| 16/16 [00:01<00:00, 15.83it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.70it/s]
                 all        128        929      0.692      0.597      0.672      0.375

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       8/9     4.29G   0.05089   0.09513    0.0185        83       416: 100%|██████████| 16/16 [00:01<00:00, 15.81it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.99it/s]
                 all        128        929      0.777       0.65      0.733      0.455

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       9/9     4.29G   0.05116   0.09732   0.01881       153       800: 100%|██████████| 16/16 [00:01<00:00, 15.54it/s]                                                                                         
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 8/8 [00:00<00:00, 14.55it/s]
                 all        128        929      0.714      0.703      0.755      0.478

10 epochs completed in 0.006 hours.
Optimizer stripped from runs/train/exp12/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp12/weights/best.pt, 14.8MB
...

@glenn-jocher
Member

@DP1701 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. If batch -1 causes issues then I'd suggest you don't use it.

@Symbadian

Hi @DP1701, I am running on a GPU. Can you guide me on how you initialized the GPU device, please?
Was there something in the code that you amended, and if so, which part of the code was it?

I am trying to run my GPU via a Linux server and it's proving extremely challenging!

Thanks in advance for acknowledging my digital presence.

@DP1701
Author

DP1701 commented Jan 16, 2023

Hi @Symbadian,

You don't have to change anything in the code. The following command in the terminal is sufficient:

(For multi-GPU training)

python3 -m torch.distributed.launch --nproc_per_node NUMBER_OF_GPUs train.py --data path_to_your_data --img image_size --weights weights --batch batch_size --epochs number_of_epochs --device number_of_devices

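The placeholders (NUMBER_OF_GPUs, path_to_your_data, image_size, weights, batch_size, number_of_epochs, number_of_devices) are values you have to provide. Important: the batch size must be greater than 0 when using multi-GPU training.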
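
A minimal sketch of how the placeholders in the command above might be filled in and sanity-checked before launching. The helper name `build_ddp_command` and the example values are hypothetical. The size check follows the "must be greater than 0" note above, applied here to the `--batch` value; the divisibility check and the comma-separated `--device` list are assumptions about multi-GPU runs, not something stated in this thread.

```python
import shlex


def build_ddp_command(num_gpus, data, img, weights, batch, epochs):
    """Assemble the multi-GPU launch command from this comment as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # The note above asks for a value greater than 0 in multi-GPU runs.
    if batch <= 0:
        raise ValueError("batch size must be greater than 0 for multi-GPU training")
    # Assumption: the total batch is split evenly across the GPUs.
    if batch % num_gpus != 0:
        raise ValueError("batch size should be a multiple of the number of GPUs")
    # Assumption: --device takes comma-separated GPU indices such as 0,1,2,3.
    devices = ",".join(str(i) for i in range(num_gpus))
    return [
        "python3", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "train.py",
        "--data", data,
        "--img", str(img),
        "--weights", weights,
        "--batch", str(batch),
        "--epochs", str(epochs),
        "--device", devices,
    ]


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own.
    cmd = build_ddp_command(4, "coco128.yaml", 640, "yolov5s.pt", 64, 10)
    print(shlex.join(cmd))
```

Building the command as a list keeps every placeholder explicit, so an empty value (for example a missing image size) fails early instead of producing a malformed command line.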

It is important that you have installed PyTorch with CUDA support. Otherwise it will not work.
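
One way to check the CUDA requirement before training, as a rough sketch: the function name `cuda_report` is hypothetical, and it only reports what the installed torch build says about itself.

```python
def cuda_report():
    """Return a one-line summary of whether PyTorch can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # torch.version.cuda is None when the build has no CUDA support compiled in.
    if torch.version.cuda is None:
        return f"torch {torch.__version__} was built without CUDA support"
    if not torch.cuda.is_available():
        return (
            f"torch {torch.__version__} was built for CUDA {torch.version.cuda}, "
            "but no GPU is visible"
        )
    return f"torch {torch.__version__} sees {torch.cuda.device_count()} CUDA device(s)"


if __name__ == "__main__":
    print(cuda_report())
```

If the report says the build has no CUDA support, the startup banner of train.py will typically end in CPU instead of listing a CUDA device.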

@Symbadian

@DP1701 hi pal, I wish that was the case! This is my error with the torch library, and I am not sure how to solve it:

ERROR: Could not find a version that satisfies the requirement torchvision>=0.8.1 (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3)
ERROR: No matching distribution found for torchvision>=0.8.1
requirements: Command 'pip install "torchvision>=0.8.1" ' returned non-zero exit status 1.
YOLOv5 🚀 v6.2-61-gffbce385 Python-3.10.8 torch-1.12.1 CPU

@Symbadian

Symbadian commented Jan 16, 2023

@DP1701 I ran this command and it came back with:
launch.py: error: unrecognized arguments: --nproc_per_node 4 train.py --data coco128.yaml --img
Where can I find the repo download that you are using?
Can you guide me to it, please?

@DP1701
Author

DP1701 commented Jan 16, 2023

What packages do you have installed?

pip3 list

Try this:

python3 train.py 

Does it work?

@Symbadian

@DP1701 hey pal, I managed to install the necessary PyTorch packages. Here is my package list:

Package                 Version
----------------------- --------------------
absl-py                 1.4.0
aiohttp                 3.8.3
aiosignal               1.3.1
asttokens               2.2.1
async-timeout           4.0.2
attrs                   22.2.0
backcall                0.2.0
blinker                 1.5
Bottleneck              1.3.5
brotlipy                0.7.0
cachetools              5.2.1
certifi                 2022.12.7
cffi                    1.15.1
charset-normalizer      3.0.1
click                   8.0.4
colorama                0.4.6
contourpy               1.0.5
cryptography            38.0.4
cycler                  0.11.0
decorator               5.1.1
executing               1.2.0
flit_core               3.6.0
fonttools               4.25.0
frozenlist              1.3.3
future                  0.18.2
google-auth             2.15.0
google-auth-oauthlib    0.4.6
grpcio                  1.42.0
idna                    3.4
importlib-metadata      6.0.0
ipython                 8.8.0
jedi                    0.18.2
kiwisolver              1.4.4
Markdown                3.4.1
MarkupSafe              2.1.1
matplotlib              3.6.2
matplotlib-inline       0.1.6
multidict               6.0.2
munkres                 1.1.4
numexpr                 2.8.4
numpy                   1.23.5
oauthlib                3.2.2
opencv-python           4.7.0.68
packaging               23.0
pandas                  1.5.2
parso                   0.8.3
patsy                   0.5.3
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  9.3.0
pip                     22.3.1
prompt-toolkit          3.0.36
protobuf                3.20.1
psutil                  5.9.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyasn1                  0.4.8
pyasn1-modules          0.2.7
pycparser               2.21
Pygments                2.14.0
PyJWT                   2.6.0
pyOpenSSL               23.0.0
pyparsing               3.0.9
PySocks                 1.7.1
python-dateutil         2.8.2
pytz                    2022.7
pyu2f                   0.1.5
PyYAML                  6.0
requests                2.28.2
requests-oauthlib       1.3.1
rsa                     4.9
scipy                   1.9.3
seaborn                 0.12.2
setuptools              65.6.3
six                     1.16.0
stack-data              0.6.2
statsmodels             0.13.2
tensorboard             2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
thop                    0.1.1.post2209072238
torch                   1.12.1
torchvision             0.1.8
tqdm                    4.64.1
traitlets               5.8.1
typing_extensions       4.4.0
urllib3                 1.26.14
wcwidth                 0.2.6
Werkzeug                2.2.2
wheel                   0.37.1
yarl                    1.8.1
zipp                    3.11.0

@Symbadian

And when I ran the next bit (python3 train.py), I got a new error:

Traceback (most recent call last):
  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 42, in <module>
    import val as validate  # for end-of-epoch mAP
  File "/home/MattCCTV/YOLO9Classes/yolov5/val.py", line 37, in <module>
    from models.common import DetectMultiBackend
  File "/home/MattCCTV/YOLO9Classes/yolov5/models/common.py", line 23, in <module>
    from utils.dataloaders import exif_transpose, letterbox
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/dataloaders.py", line 31, in <module>
    from utils.augmentations import (Albumentations, augment_hsv, classify_albumentations, classify_transforms, copy_paste,
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/augmentations.py", line 12, in <module>
    import torchvision.transforms.functional as TF
ModuleNotFoundError: No module named 'torchvision.transforms.functional'; 'torchvision.transforms' is not a package


I am now trying to find out what that is and how to solve this...I am not sure this is problematic!!!
In the read-me file, I followed all the instructions line by line...

and still, these errors persist...

@DP1701
Author

DP1701 commented Jan 16, 2023

Uninstall torch and torchvision.

Then type in:

pip3 install torch torchvision

@Symbadian

still pal,

$ python3 train.py
Traceback (most recent call last):
  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 42, in <module>
    import val as validate  # for end-of-epoch mAP
  File "/home/MattCCTV/YOLO9Classes/yolov5/val.py", line 37, in <module>
    from models.common import DetectMultiBackend
  File "/home/MattCCTV/YOLO9Classes/yolov5/models/common.py", line 23, in <module>
    from utils.dataloaders import exif_transpose, letterbox
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/dataloaders.py", line 31, in <module>
    from utils.augmentations import (Albumentations, augment_hsv, classify_albumentations, classify_transforms, copy_paste,
  File "/home/MattCCTV/YOLO9Classes/yolov5/utils/augmentations.py", line 13, in <module>
    import torchvision.transforms.functional as TF
ModuleNotFoundError: No module named 'torchvision.transforms.functional'; 'torchvision.transforms' is not a package

@DP1701
Author

DP1701 commented Jan 16, 2023

Do you still have torchvision 0.1.8 installed?
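
A quick way to answer that from inside the same interpreter that runs train.py, sketched with the standard library only. The helper name `installed_version` is hypothetical, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        print(name, installed_version(name))
```

Running this with the same python3 that launches training avoids confusion when pip3 and python3 point at different environments.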

@Symbadian

Symbadian commented Jan 16, 2023 via email

@DP1701
Author

DP1701 commented Jan 16, 2023

Install this:

pip3 install torch==1.12.0+cu116 torchvision==0.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

But uninstall torch and torchvision beforehand.
Torchvision 0.1.8 is out of date.

@Symbadian

Symbadian commented Jan 16, 2023 via email

@Symbadian

Symbadian commented Jan 16, 2023 via email

@Symbadian

@DP1701 got this error:

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu116 (from versions: none)
ERROR: No matching distribution found for torch==1.12.0+cu116

@DP1701
Author

DP1701 commented Jan 16, 2023

Here are all the versions:

Link

You have to check which system, which Python version, and which CUDA version you have installed.

You could also try just:

pip3 install torch==1.12.0
pip3 install torchvision==0.13.0

Python >=3.7, <=3.10 is required
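
The Python range quoted above can be checked with a few lines. This is only a sketch: the function name `python_in_range` is hypothetical, and the bounds are taken directly from the line above and compared on major and minor version only.

```python
import sys


def python_in_range(version=None, minimum=(3, 7), maximum=(3, 10)):
    """Check a (major, minor) version against the inclusive range quoted above."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = sys.version_info[:2]
    return minimum <= tuple(version[:2]) <= maximum


if __name__ == "__main__":
    print(sys.version.split()[0], python_in_range())
```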

@Symbadian

@DP1701 Yip, went through those already..
None of them seems to be working for me...
Hence, I tried the torchvision 0.1.8 and that seems to be the only one that's working

@Symbadian

Symbadian commented Jan 16, 2023

Can you guide me to the repo?
So that I can get the latest files for processing..
This cannot be the right way for the installs... I am getting too many errors at this stage..

@DP1701
Author

DP1701 commented Jan 16, 2023

You need at least torchvision>=0.8.1 for YOLOv5.
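
Note that 0.1.8 and 0.8.1 use the same digits in a different order, so they are easy to confuse at a glance. A small sketch of the comparison, with hypothetical helper names `version_tuple` and `meets_minimum`; it is a simplified parser for illustration, and the `packaging` library is the more robust choice for real version handling.

```python
import re


def version_tuple(version):
    """Turn '0.13.0+cu116' into (0, 13, 0), ignoring any local '+...' suffix."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at non-numeric segments such as 'post3'.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="0.8.1"):
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for candidate in ("0.1.8", "0.2.2.post3", "0.13.0+cu116"):
        print(candidate, meets_minimum(candidate))
```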

@Symbadian

@DP1701 I would have to agree with you here! But the packages are not installing, no matter what I do!
I have been trying to get these installs for two weeks now and every day I am faced with the same challenge!

Maybe I should just update the repo files and be done with it???!!!

@DP1701
Author

DP1701 commented Jan 16, 2023

If by repo files you mean the files from YOLO, then nothing will change. Alternatively, you could try miniforge (Conda): Link.

After the installation, create a new environment with:

conda create --name YOLOv5 python=3.9.13
conda activate YOLOv5

Then install PyTorch using the Conda install instructions.

@DP1701
Author

DP1701 commented Jan 16, 2023

And then install the rest that YOLOv5 needs with pip

@Symbadian

@DP1701 Ok will try that now..

@Symbadian

conda create -name YOLOv5 python=3.9.13

Hi @DP1701 I got this:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - yolo9c

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-ppc64le
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-ppc64le
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

@DP1701
Author

DP1701 commented Jan 16, 2023

Insert a second - character before name:

conda create --name YOLOv5 python=3.9.13

@Symbadian

Symbadian commented Jan 16, 2023

ok will try that now
it took quite a while to delete the files from the GPU server, my apologies

  File "/home/MattCCTV/YOLO9Classes/yolov5/train.py", line 29, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'

May I request some guidance on the torch install command that I should use here, please?

My confidence is a tad low. I thought I had an idea, but it seems to be trickier than expected!

Thus far, I have been unsuccessful with my choice of torch version.

@DP1701
Author

DP1701 commented Jan 16, 2023

Once the environment is created and activated correctly, you must first install torch and torchvision with Conda. Take a look at the link to PyTorch that I sent you today.
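
To confirm that the new environment is really the one running train.py, something like the following may help. It is a sketch with a hypothetical function name, `describe_environment`; it reads the `CONDA_DEFAULT_ENV` variable, which `conda activate` sets, and checks whether torch and torchvision can be found without importing them.

```python
import importlib.util
import os
import sys


def describe_environment():
    """Collect which interpreter is running and whether it can find torch."""
    return {
        "executable": sys.executable,
        # CONDA_DEFAULT_ENV is set by `conda activate`; it is absent outside Conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "torch_found": importlib.util.find_spec("torch") is not None,
        "torchvision_found": importlib.util.find_spec("torchvision") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A freshly created environment starts empty, so `torch_found: False` right after `conda create` is expected until torch is installed into that environment.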

@Symbadian

Symbadian commented Jan 16, 2023

@DP1701 I see, but when I check the compatibility in pytorch/pytorch#47776,

Torch doesn't seem to work well here with this:
conda create --name YOLOv5 python=3.9.13

So I tried:
conda install -c pytorch-lts torchvision

And I got the error below:

to be incompatible with the existing python installation in your environment:

Specifications:

  - torchvision -> python[version='>=3.8,<3.9.0a0']

Your python: python==3.9.12

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The following specifications were found to be incompatible with your CUDA driver:

  - feature:/linux-ppc64le::__cuda==10.2=0
  - feature:|@/linux-ppc64le::__cuda==10.2=0

Your installed CUDA driver is: 10.2

I'm in need of some assistance to understand this, please.
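
One detail in the output above may be worth checking: the Conda channel URLs end in linux-ppc64le and the driver is reported as CUDA 10.2. The sketch below only extracts the platform from a channel URL and prints the architecture of the running interpreter. The helper names `subdir_from_channel` and `split_subdir` are hypothetical, and whether a prebuilt torch or torchvision package exists for a given architecture and CUDA version has to be verified against the PyTorch install page; this snippet cannot tell.

```python
import platform


def subdir_from_channel(url):
    """Return the last path component of a Conda channel URL, e.g. 'linux-ppc64le'."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def split_subdir(subdir):
    """Split 'linux-ppc64le' into ('linux', 'ppc64le'); 'noarch' has no architecture."""
    os_name, sep, arch = subdir.partition("-")
    return (os_name, arch) if sep else (subdir, None)


if __name__ == "__main__":
    # Channel URLs copied from the error output earlier in this thread.
    channels = [
        "https://repo.anaconda.com/pkgs/main/linux-ppc64le",
        "https://repo.anaconda.com/pkgs/main/noarch",
    ]
    for url in channels:
        print(split_subdir(subdir_from_channel(url)))
    # Architecture reported by the running interpreter, for comparison.
    print(platform.machine())
```

If the architecture is not one the chosen package index publishes builds for, pip reports "from versions: none", which matches the earlier torch==1.12.0+cu116 error; that connection is a plausible reading of the logs, not a confirmed diagnosis.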

3 participants