Add InfiniteDataLoader class #876

Merged · 2 commits merged into ultralytics:master from NanoCode012:dataloaders on Aug 31, 2020

Conversation

@NanoCode012 (Contributor) commented Aug 29, 2020

Feature

The DataLoader takes a chunk of time at the start of every epoch to start its worker processes. With this InfiniteDataLoader class, which subclasses DataLoader, the workers only need to be initialized once, at epoch 1.
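
For reference, a minimal sketch of the idea (the class names match the ones this PR adds to utils/datasets.py per the PR summary; see the actual diff for the authoritative implementation):

```python
import torch.utils.data as data


class _RepeatSampler:
    """Sampler that repeats forever, so workers are never torn down."""

    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        while True:
            yield from iter(self.sampler)


class InfiniteDataLoader(data.DataLoader):
    """DataLoader that reuses its workers across epochs."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Wrap the batch sampler so it never raises StopIteration
        # (object.__setattr__ bypasses DataLoader's attribute guard)
        object.__setattr__(self, 'batch_sampler', _RepeatSampler(self.batch_sampler))
        self.iterator = super().__iter__()  # workers start once, here

    def __len__(self):
        return len(self.batch_sampler.sampler)  # batches per epoch

    def __iter__(self):
        # Yield exactly one epoch's worth of batches, then return control to train.py
        for _ in range(len(self)):
            yield next(self.iterator)
```

Because `__iter__` still yields only `len(self)` batches, the normal per-epoch loop in train.py is unchanged; only the worker processes persist across epochs.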

Other repos implement this as well; that is where I got the idea.
Resources:

Results

Theoretically, this would reduce training time on small datasets more than on large ones, since their epoch times are shorter. The training-time gap should also widen as the number of epochs increases.

I would like to test this fully before the PR is ready, especially in DDP mode, as the resources said conflicting things about the DDP sampler.

If there is a particular dataset you would like me to test speed on, please feel free to comment.

master: 08e97a2
Env: conda py37
Model: 5s
Batch-size / Total Batch-size: 64
GPU: V100
Epochs: 10

| dataset | InfiniteDataLoader | GPU(s) | Total times (h) |
| --- | --- | --- | --- |
| coco128 | No | 1 | 0.038, 0.040, 0.039, 0.041 |
| coco128 | Yes | 1 | 0.022, 0.024, 0.021 |
| coco128 | No | 2 | 0.032, 0.032 |
| coco128 | Yes | 2 | 0.018, 0.020 |
| coco2017 | No | 1 | 2.099, 2.067, 2.029 |
| coco2017 | Yes | 1 | 2.577, 2.438, 2.438 /// 1.989, 1.993, 1.996 |
| coco2017 | No | 2 | 1.309, 1.321, 1.326 |
| coco2017 | Yes | 2 | 2.141, 2.119, 1.797 /// 1.268, 1.237, 1.265 |

Note:

  • coco128: possibly due to the finetuning hyps' lr, the mAPs stay around 68 even after 10 epochs, which was a bit unusual.
  • coco2017: mAP seems consistent

(See comment below for graphs)

Commands to test
# single gpu
python train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --batch-size 64 --epoch 10 --device 3

# multi-gpu
python -m torch.distributed.launch --master_port 9950 --nproc_per_node 2 train.py --data coco128.yaml --cfg yolov5s.yaml --weights yolov5s.pt --batch-size 64 --epoch 10 --device 3,4

Prospects:

multiprocessing.spawn causes dataloaders to initialize workers at the start of every epoch (or batch) in a slow way, which slows training down significantly. That is why we are still using launch. Maybe this PR could open the way to using mp.spawn; see the sketch below. That way, we could eliminate DP and just use --device to specify single-GPU or DDP.
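
As a rough illustration of that prospect only (not part of this PR; `train_worker` and the port are placeholders), an mp.spawn-based launcher could look something like this:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def train_worker(rank, world_size):
    # Each spawned process initializes its own process group and trains on one GPU
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '9950')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build model and dataloaders here; an InfiniteDataLoader would keep workers alive
    dist.destroy_process_group()


if __name__ == '__main__':
    n = torch.cuda.device_count()
    if n > 1:
        mp.spawn(train_worker, args=(n,), nprocs=n)  # DDP: one process per GPU
    else:
        train_worker(0, 1)  # single-GPU fallback
```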

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Introduction of an Infinite DataLoader for improved data loading in the YOLOv5 training pipeline.

πŸ“Š Key Changes

  • Replaced the traditional torch.utils.data.DataLoader with a new custom InfiniteDataLoader.
  • Added two new internal classes: InfiniteDataLoader and _RepeatSampler.
    • InfiniteDataLoader reuses workers to constantly feed data.
    • _RepeatSampler is designed to repeat the sampling process indefinitely.

🎯 Purpose & Impact

  • The purpose of these changes is to optimize the data loading mechanism during model training, which can potentially:
    • Minimize data loading bottlenecks by reusing workers more efficiently.
    • Improve training speed due to less downtime between batches.
  • The impact should be noticeable on systems with high compute capabilities, as they will better maintain data throughput during training sessions. πŸš€
  • Users may experience smoother and faster training iterations, especially when training on large datasets. πŸŽοΈπŸ’¨

Only initializes at first epoch. Saves time.
@glenn-jocher (Member)

@NanoCode012 this is super interesting. The same idea actually crossed my mind yesterday, because when I watch the evolution VMs running, sometimes I run the nvidia-smi command, and occasionally a GPU is at 0% utilization. Evolution epochs do not run test.py after each epoch, nor do they save checkpoints, but there is still a period of several seconds of 0% utilization like you said, to reinitialize the dataloader.

I think this might really help smaller datasets train faster. That's amazing you've already implemented it. Are you keeping an if statement at the end of each batch to check when an epoch would have elapsed and then running all the post-epoch functionality then? I'll check out the changes.

(Two review comments on utils/datasets.py — outdated, resolved)
@NanoCode012 (Contributor, Author)

Hm, I added results from COCO. I'm not sure if there was some kind of bottleneck when I was training the InfiniteDataLoader set, because I had another 5m training running at the same time (on a different GPU). It finished when I started to train the non-infinite one.

If it wasn't a bottleneck, this could be a problem. I've looked at PyTorch Lightning and rwightman's PyTorch repo, though, and they use very similar code to the current one.

@NanoCode012 (Contributor, Author)

I did a re-run, and there actually was some bottleneck. The new results are lower than their non-infinite counterpart.

@glenn-jocher (Member)

I tested out the Infinite dataloader on VOC training.

The epoch transition did seem a few seconds faster to me, final time was roughly the same, about 5 hours for both. mAP seemed fine with the infinite loader too; I did not see any problems there. That's really strange that you saw some slower times with infinite. What do the double slashes // mean?

With regards to the COCO finetune results, this is uncharted territory for me. The finetuning hyps are evolved for 50 epochs of VOC, which I hope will be closer to the typical custom dataset than full COCO from scratch. That's really interesting that you achieved similar results for 20 and 100 finetuning epochs as from 300 from scratch. That's actually awesome news, because then perhaps I can run an evolution campaign for COCO finetuning using only 20 epochs results rather than having to evolve hyps on 300 epochs from scratch (which is impractical really for anyone without free cloud access, looking at you google brain team).

I did a re-run, and there actually was some bottleneck. The new results are lower than their non-infinite counterpart.

What do you mean by lower? Slower?

@NanoCode012 (Contributor, Author) commented Aug 31, 2020

The epoch transition did seem a few seconds faster to me, final time was roughly the same, about 5 hours for both.

Hm. I think this PR would be geared toward those training on custom datasets with short epoch times but a large number of epochs. I read somewhere that it can save 1 hr of training for COCO 300 epochs on 8x V100.

If we assume 10 seconds saved per epoch, that's 10 * 300 = 3000 s = 50 minutes per GPU over 300 epochs. That would save some money.

That's really strange that you saw some slower times with infinite. What do the double slashes // mean?

The slashes separate my first and second runs with Infinite. The first run probably had some bottleneck from my 5m trainings, which is why I ran it a second time.

What do you mean by lower? Slower?

Sorry, by lower I meant lower training time, i.e. faster training speed. I will add a graph to visualize this later on.

For example, COCO 2017
Base: 2.099 | Inf: 1.989

That's actually awesome news, because then perhaps I can run an evolution campaign for COCO finetuning using only 20 epochs results rather than having to evolve hyps on 300 epochs from scratch (which is impractical really for anyone without free cloud access, looking at you google brain team).

Oh yes! You are right. However, the fine-tuned versions were not able to reach the same mAP as the ones trained from scratch, so it should only be used as a guide. I'm thinking of setting aside one or two GPUs to test this theory for a week. Could you give me the evolve commands and the starting hyps? (Should we use the InfiniteDataLoader branch?)

20 epochs would take around 10 * 20 = 200 minutes ≈ 3.3 hours per generation for 5s. One week could do 40-55 generations. If the first week gets some promising results, I may let it run longer.

@NanoCode012 (Contributor, Author) commented Aug 31, 2020

For some visualization, here are a few graphs. Time scale is in minutes. Some runs are averaged.

[training-time comparison graphs]

I did 300 epochs for coco128 to see the gap over long epochs. Single GPU. When the batch size is the whole dataset, it seems to perform poorly in both speed and mAP.

[graphs]

@glenn-jocher (Member)

@NanoCode012 ah ok, so infinite provides faster speed in all cases, and does not appear to harm mAP. It seems to be good news all around. Is this ready to merge then?

@NanoCode012 marked this pull request as ready for review August 31, 2020 17:40
@NanoCode012 (Contributor, Author) commented Aug 31, 2020

Ready!

I'm thinking of setting one or two GPU to test this theory for a week. Could you give me the evolve commands and the starting hyp?

Would you like me to try evolving 5s regarding the earlier theory?

@glenn-jocher (Member)

Regarding the evolution, the results I'm seeing on VOC are exciting, but I'm not sure if they are repeatable on COCO. Here is a basic summary of what I've been doing:

  • YOLOv3 from scratch 30 epochs: last year I started from darknet defaults, and evolved several hundred generations using yolov3-tiny for 30 epochs (10% of full training). This produced mixed results, as 1) yolov3-tiny was different enough than yolov3-spp that improvements in tiny (10M params) did not correlate well with improvements in yolov3-spp (60M params), and 2) improvements when training to 30 epochs did not correlate well at all to improvements training to 300. Hyps evolved for 30 epochs would tend to severely overtrain when applied for 300 epochs. I was stuck since I did not have sufficient computing power to evolve yolov3-spp for 300 epochs, so I did some manual tuning, and these are essentially the hyps we see today in hyp.scratch.yaml.
  • YOLOv5m finetuning VOC 50 epochs. A month or two ago I fixed and updated evolution for YOLOv5, and separately we gained a PR for voc.yaml. I realized VOC is similar to COCO, but trains much faster, about 12x faster per epoch. I also saw that finetuning VOC for 50 epochs produced better results than training it from scratch for 300 epochs. With the v3.0 release I obtained a YOLOv5m 50-epoch finetuning baseline of about 86.5 mAP using the same hyps as from scratch. I started a hyp evolution campaign for 300 generations (now about 200 generations complete), and have seen mAP increase from 86.5 to 89.5 currently. The evolution command is in utils/evolve.sh with a few changes to code (will provide full details later):
# Hyperparameter evolution commands
while true; do
  python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50 --evolve --bucket ult/voc --device $1
done
  • YOLOv5m on COCO 30 epochs? This is the potential next step. Both of the problems with YOLOv3 evolution are somewhat addressed now, because the v5 models are more similar to each other than the v3 models were. I think/hope 5m results can correlate better with 5l and 5x than v3-tiny could with v3-spp. 5s may be too small to do this well though, but I don't really know. Second, since you found finetuning for 20-100 epochs provided similar final mAPs to training from scratch, I think perhaps we can evolve COCO finetuning hyps on this shorter training cycle. It's still very slow, but much faster than trying to evolve from 300-epoch results.

I'll raise a new issue with more details and complete code/commands for COCO finetuning evolution later today.

@glenn-jocher merged commit 1e15aad into ultralytics:master Aug 31, 2020
@glenn-jocher (Member) commented Sep 2, 2020

@NanoCode012 ok, I'm thinking of how to run COCO finetuning evolution. This is going to be pretty slow unfortunately, but that's just how it is. I'm trying to figure out if I can get a 9-image mosaic update in before doing this. My tests don't seem to show a huge effect from switching from 4 to 9 image mosaic unfortunately. I should have more solid results in about a day.
#897

In any case, I've created a tentative yolov5:evolve docker image for evolving COCO https://hub.docker.com/r/ultralytics/yolov5/tags.

The most efficient manner of evolving (i.e. FLOPS/day) is to evolve using separate containers, one per GPU. So, for example, on a 4-GPU machine the loop below (as shown in evolve.sh) runs 4 containers, assigning a different GPU to each container, with all containers pulling data from the local volume mounted with -v.

# Start on 4-GPU machine
for i in 0 1 2 3; do
  t=ultralytics/yolov5:evolve && sudo docker pull $t && sudo docker run -d --ipc=host --gpus all -v "$(pwd)"/VOC:/usr/src/VOC $t bash utils/evolve.sh $i
  sleep 60 # avoid simultaneous evolve.txt read/write
done

I usually assign a special GCP bucket to receive evolution results, and then all containers (no matter which machine or GPU they are on) read and write from that same common source. This works well most of the time, but occasionally I run into simultaneous read/write problems with gsutil, the GCP command line utility that handles all the traffic in and out of the bucket. I'm going to try to deploy a fix for this, and should have everything all set in the next couple days to begin COCO finetuning evolution.

The command I had in mind is this (it would go in evolve.sh), and the hyp.finetune.yaml file would be updated to the latest VOC results, which are below. On a T4, which I'm using, each epoch takes about 1 hour, so we'd get about 4 generations done every 5 days. If I can deploy 8 T4s, this would be about 45 generations per week. If you can pitch in GPU hours also (and of course anyone else reading this who would like to contribute), then we can both evolve to/from the same bucket, and maybe get this done faster!

# Hyperparameter evolution commands
while true; do
  # python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50 --evolve --bucket ult/voc --device $1
  python train.py --batch 40 --weights yolov5m.pt --data coco.yaml --img 640 --epochs 30 --evolve --bucket ult/voc --device $1
done
# Hyperparameters for VOC finetuning
# python train.py --batch 64 --weights yolov5m.pt --data voc.yaml --img 512 --epochs 50
# See tutorials for hyperparameter evolution https://docs.ultralytics.com/yolov5


# Hyperparameter Evolution Results
# Generations: 249
#                   P         R     mAP.5 mAP.5:.95       box       obj       cls
# Metrics:        0.6     0.936     0.896     0.684    0.0115   0.00805   0.00146

lr0: 0.0032
lrf: 0.12
momentum: 0.843
weight_decay: 0.00036
giou: 0.0296
cls: 0.243
cls_pw: 0.631
obj: 0.301
obj_pw: 0.911
iou_t: 0.2
anchor_t: 2.91
anchors: 3.63
fl_gamma: 0.0
hsv_h: 0.0138
hsv_s: 0.664
hsv_v: 0.464
degrees: 0.373
translate: 0.245
scale: 0.898
shear: 0.602
perspective: 0.0
flipud: 0.00856
fliplr: 0.5
mixup: 0.243

@NanoCode012 (Contributor, Author) commented Sep 2, 2020

Hi @glenn-jocher . I saw that you've done 250+ generations already. Cool! (I think we should create a new Issue for this, so there is more visibility and potentially more helpers.)

I built a simple Docker image off your yolov5:evolve with JupyterLab, so I can visualize it as it evolves. (Here is the link for anyone reading who is curious, and for reference.)

I checked that I can cp/ls from your bucket. Do I need certain permissions to copy to the bucket? (Not familiar with GCP)

Fine-tuning cannot reach the same mAP as training from scratch.
I did a comparison of finetuning COCO on the 5m at different batch sizes and epoch counts.

5m finetune

Command:

python train.py --data coco.yaml --cfg yolov5m.yaml --weights yolov5m.pt --batch-size $bs --epoch $e

Base mAP is 62.4 (from test.py run normally, not using the COCO eval mAP).

Overview: [graph]

Closeup at peak: [graph]

Total time for each: [graph]

From the above, it is safe to say that batch size 64 with 40 epochs produces the "best" results. I'm not sure whether I should re-run it to confirm this.

A small test has been done for 5l. Nothing conclusive can be said.

5l finetune

Base mAP is 65.7 (from test.py run normally, not using the COCO eval mAP).

[graph]

I will use one or two V100s for this. We can see how it turns out after a week or two.
If we go with bs 64 and 40 epochs, it would take around 15 h (non-infinite) per generation. One week would be about 11 generations.

On a T4, which I'm using, each epoch takes about 1 hour, so we'd get about 4 generations done every 5 days. If I can deploy 8 T4's, this would be about 45 generations per week.

Is this time for COCO or VOC? If it is COCO, this would be amazing, because I could only do 11 generations a week with a single V100, whereas 4 T4s could easily do 20. Your setup is really efficient!

This works well most of the time, but occasionally I run into simultaneous read/write problems with gsutil, the GCP command line utility that handles all the traffic in and out of the bucket. I'm going to try to deploy a fix for this, and should have everything all set in the next couple days to begin COCO finetuning evolution.

I will wait for the fix then! Meanwhile, I will set a few runs for a better comparison for the 5l.

Edit: I just saw that you use batch-size 40 for 5m. I didn't realize you changed it. Will set finetune test for this.

@glenn-jocher (Member)

@NanoCode012 agree, will open a new issue on this soon.

--batch 40 is the max possible for YOLOv5m GCP training on a single 15 GB T4 with the docker image, or --batch 48 on a single 16 GB V100. I guess you must have used multi-GPU to reach 64.

It's possible your finetuning mAPs are higher than you think: test.py runs a slightly more comprehensive (but slightly slower) mAP solution when called directly i.e. python test.py than when called by train.py, so testing after training is complete will always result in a slight mAP increase, perhaps 0.5-1.0% higher. The main differences are 0.5 pad (i.e. 16 pixels), and possibly multi_label NMS (this is always multi_label=True now, but may be set to False during training in the future). pycocotools mAP is unrelated to the above, and will also independently increase mAP slightly.

By the way, I just tested a custom dataset and was surprised to see that it finetunes much better with hyp.scratch.yaml than hyp.finetune.yaml. I'm pretty confused by this result. It's possible evolution results on one dataset may not correlate well to a second dataset unfortunately. I'll have to think about it.

[results screenshot]

@glenn-jocher (Member)

@NanoCode012 ah sorry, to answer your other question, the times are for COCO. COCO trains about 10X slower than VOC.

VOC can train much faster because it has fewer images (16k vs 118k), which are natively smaller and which I train smaller (512 vs 640), and it can --cache for faster dataloading due to the smaller size. A VOC 512 epoch takes 5 min on a T4 or 2 min on a V100, vs 60 min or 20 min for COCO 640.

@glenn-jocher mentioned this pull request Sep 2, 2020
@NanoCode012 deleted the dataloaders branch September 3, 2020 04:04
@NanoCode012 (Contributor, Author)

Hi @glenn-jocher ,

It's possible your finetuning mAPs are higher than you think: test.py runs a slightly more comprehensive (but slightly slower) mAP solution when called directly i.e. python test.py than when called by train.py, so testing after training is complete will always result in a slight mAP increase, perhaps 0.5-1.0% higher

I tried the command below but got the reverse: test gave a lower mAP.

# From ReadMe.md
python test.py --data coco.yaml --img 640 --conf 0.001 --weights ...

As an example, from my bs 64, 40-epoch coco2017 run:

|  | Tensorboard | test | pycocotools |
| --- | --- | --- | --- |
| last | 61.64 | 61.4 (-0.14) | 62.28 (+0.64) |
| best | 62.13 | 61.9 (-0.23) | 62.77 (+0.64) |

On coco128 google colab,

|  | train | test |
| --- | --- | --- |
| last | 70.5 | 70.2 |
| best | 70.5 | 70.2 |

It's possible evolution results on one dataset may not correlate well to a second dataset unfortunately. I'll have to think about it.

Hmm, I was actually thinking there could be one hyp file for each dataset/goal (hyp_voc, hyp_coco, hyp_scratch) that users could choose from, the reason being that different custom datasets may be affected differently by different hyps.

The hard part would be usability (explaining it to users, tutorials) and maintenance. It's hard to have a one-size-fits-all solution.

to answer your other question, the times are for COCO. COCO trains about 10X slower than VOC.

Okay!

@glenn-jocher (Member)

I tried to do the below but got the reverse. test got a lower mAP.

Oh, that's really strange. I've not seen that before. I was just running through the VOC results. I checked the difference here between using the final last.pt mAP from training and running test.py afterwards using last.pt. Most improved, with the greatest improvement in mAP@0.5:0.95.

|  | train mAP@0.5 | test mAP@0.5 |
| --- | --- | --- |
| YOLOv5s | 85.6 | 85.7 |
| YOLOv5m | 89.3 | 89.3 |
| YOLOv5x | 90.2 | 90.5 |
| YOLOv5x | 91.4 | 91.5 |

|  | train mAP@0.5:0.95 | test mAP@0.5:0.95 |
| --- | --- | --- |
| YOLOv5s | 60.4 | 60.7 |
| YOLOv5m | 68.0 | 68.3 |
| YOLOv5x | 70.4 | 70.8 |
| YOLOv5x | 72.4 | 73.0 |

@glenn-jocher (Member)

Best VOC mAP is 92.2!

@glenn-jocher (Member)

!python test.py --data voc.yaml --weights '../drive/My Drive/cloud/runs/voc/exp3_yolov5x/weights/last.pt' --img 640 --iou 0.50 --augment

Namespace(augment=True, batch_size=32, conf_thres=0.001, data='./data/voc.yaml', device='', img_size=640, iou_thres=0.5, merge=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['../drive/My Drive/cloud/runs/voc/exp3_yolov5x/weights/last.pt'])
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)

Fusing layers... 
Model Summary: 284 layers, 8.85745e+07 parameters, 8.45317e+07 gradients
Scanning labels ../VOC/labels/val.cache (4952 found, 0 missing, 0 empty, 0 duplicate, for 4952 images): 4952it [00:00, 18511.94it/s]
               Class      Images     Targets           P           R      [email protected]  [email protected]:.95: 100% 155/155 [04:38<00:00,  1.80s/it]
                 all    4.95e+03     1.2e+04       0.587       0.963       0.922       0.743
Speed: 53.0/1.3/54.2 ms inference/NMS/total per 640x640 image at batch-size 32

@NanoCode012 (Contributor, Author)

Best VOC mAP is 92.2!

Hi @glenn-jocher, congrats! How much did it go up by changing only the hyps? Do the hyps affect the models differently, since you used the 5m model to train?
Would evolving the hyps cause them to overfit to VOC only?


--iou 0.50

Adding this changed quite a lot!

From the earlier comment, bs 64, 40 epochs:

|  | Tensorboard | test | pycocotools |
| --- | --- | --- | --- |
| last | 61.64 | 62 (+0.36) | 62.81 (+1.17) |
| best | 62.13 | 62.5 (+0.37) | 63.32 (+1.19) |
| yolov5m | - | 62.8 | 63.53 (+0.73) |

Here's an interesting effect of the finetune vs scratch on the 5m (still training)

[graph]


Since mosaic9 did not work out, are you planning to add some more changes (like with the new file sotabench) before we apply evolve to COCO?

@glenn-jocher (Member) commented Sep 3, 2020

@NanoCode012 haha, yes that's what I'm worried about. I thought I was evolving some finetuning hyps that would be good for the whole world, but now I'm thinking maybe they're just mainly good for VOC. The final hyps are drastically different from the from-scratch hyps I started from. lr0, for example, drops from 0.01 to 0.003, and momentum dropped from 0.937 to 0.843. This produced about a +3 mAP increase on VOC.

The good news is that all models s/m/l/x appeared to benefit equally from evolving on YOLOv5m, so that's great news on its own. That means that 5m results at least can be counted on to correlate well with the entire range. I'm going to finetune the 4 models from the initial hyps (just once for 50 epochs) as well, to do a better before-and-after comparison, because right now I'm just relying on my memory.

I was looking at sotabench yesterday and decided to try it out, as their existing results seem quite slow to me. It's possible we exceed all of the existing models greatly in terms of mAP at a given speed level. But I found a few bugs in their examples, submitted a PR to their repo, and altogether found support very limited there (the forum has 10 posts over 2 years), which is unfortunate because the idea seems great.

Mosaic9 didn't fail, it just didn't provide markedly different results than mosaic4 in the VOC tests I ran. I think a key factor in the mosaic is cropping at the image edges, but this is for further exploration. So I suppose that yes, I just need to fix the gsutil overwrite bug and then we can start finetuning COCO. I see your plot there, that's super important, as I'm not sure which hyps to start evolving from. Looks like blue is going to win, but let's wait.

glenn-jocher added a commit that referenced this pull request Sep 4, 2020
@NanoCode012 (Contributor, Author) commented Sep 4, 2020

Hi @glenn-jocher , contrary to our expectations, scratch won! Finetune is from gen 306 of VOC.

[graph]

|  | Tensorboard | test | pycocotools |
| --- | --- | --- | --- |
| scratch | 62.47 | 62.3 (-0.17) | 63.14 (+0.67) |
| finetune | 61.48 | 61.7 (+0.22) | 62.55 (+1.07) |
| prev finetune | 62.13 | 61.9 (-0.22) | 62.77 (+0.64) |

python test.py --data coco.yaml --weights ... (without setting --iou)

This really made me wonder whether all my past results could have been improved if I had used the scratch hyps, or whether the current hyps are overfitting to VOC.

I see that you've made a bug fix for gsutil. I'm thinking of the possibility that two nodes read and upload at the same time, cancelling each other out.

I was thinking of using a mutex-lock-style approach:

read lock.txt

while lock is in-use
    sleep a few seconds
    read lock.txt


set lock to in-use
do calculation
release lock

This is to prevent writing at the same time. Theoretically, there is not a high chance that the two of us would read at the same time, but if others were to join in, the chances would increase.

This would be a sure-fire way of blocking, albeit expensive, and I don't think this is the norm for blocking in Python.
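
A rough Python sketch of that mutex idea, assuming the lock lives next to evolve.txt in the GCS bucket (the lock object name and helper names are made up for illustration):

```python
import subprocess
import time


def gcs_exists(url):
    # `gsutil -q stat` exits 0 if the object exists, non-zero otherwise
    return subprocess.run(['gsutil', '-q', 'stat', url]).returncode == 0


def with_bucket_lock(bucket, fn, poll_s=5):
    lock = f'{bucket}/evolve.lock'          # hypothetical lock object
    while gcs_exists(lock):                 # wait while the lock is in use
        time.sleep(poll_s)
    with open('lock.txt', 'w') as f:        # set lock to in-use
        f.write('in-use')
    subprocess.run(['gsutil', 'cp', 'lock.txt', lock], check=True)
    try:
        fn()                                # do calculation (read/append/upload evolve.txt)
    finally:
        subprocess.run(['gsutil', 'rm', lock], check=True)  # release lock
```

As noted above, the check-then-create step is not atomic, so two nodes polling at exactly the same moment could still collide; it only narrows the window.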

Edit: Added table.
Edit: I set a few tests with 5m on hyp scratch.

@glenn-jocher (Member)

@NanoCode012 wow! Yup, well that's interesting. It's likely the hyperparameter space has many local minima, and hyp.scratch.yaml clearly appears to be a better local minimum for COCO than hyp.finetune.yaml, so we should start the hyp evolution there. That's unfortunate that the VOC finetuning results do not correlate well with COCO.

Yes, I made a fix! When evolving to a local evolve.txt it is almost impossible to read/write at the same time, as the speeds are near-instantaneous (and the file is small), so there are no issues evolving locally. But when evolving from different VMs/nodes to a cloud evolve.txt, gsutil can take several seconds to make the connection and read/write, which sometimes causes a corrupt file if another VM is doing the same at the same time; gsutil then deletes the file when it detects corruption, losing all results (!). The new fix should avoid this by ignoring corrupted/empty files, so only a single generation from a single node would be lost rather than the entire file.
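
For illustration only, the "ignore corrupted/empty files" idea could look something like the sketch below when pulling evolve.txt from the bucket (the helper name and paths are hypothetical; the actual fix lives in the repo):

```python
import os
import subprocess


def pull_evolve(bucket='gs://ult/voc'):  # bucket name taken from the commands above
    tmp = 'evolve.txt.tmp'
    subprocess.run(['gsutil', 'cp', f'{bucket}/evolve.txt', tmp])
    # Only accept the download if it is present and non-empty; otherwise keep the
    # local copy, so a corrupted cloud read costs one generation, not the whole file
    if os.path.exists(tmp) and os.path.getsize(tmp) > 0:
        os.replace(tmp, 'evolve.txt')
    elif os.path.exists(tmp):
        os.remove(tmp)
```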

Ok, so we know to start from hyp.scratch.yaml and we know to use YOLOv5m; now all that's left is to decide on a number of epochs and start. I see you used 40 there with good results.

@glenn-jocher (Member) commented Sep 4, 2020

@NanoCode012 I just finished a set of YOLOv5m 20 epochs finetuning for each hyp file. I get the same results, scratch is better. We can essentially start evolution now, but another idea came to me. I'm thinking the dip in results on epoch 1 may be due to the warmup settings. The warmup slowly ramps up lr and momentum during the first 3 epochs; it is mainly intended for training from scratch to help stability. The initial values for lr and momentum are 0.0 and 0.9 generally, but there is a param group 2 that starts with a very aggressive lr of 0.10, and actually ramps this down to lr0. When training from scratch this helps adjust output biases especially, and works well because bias params are few in number and are not naturally unstable the way weights can be.

I'm thinking this might be related to the initial drop on finetuning. The effect on final mAP is likely limited, but I'm exploring whether to adjust the warmup when pretrained=True.
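For reference, a simplified sketch of the warmup logic described above (nw is the number of warmup iterations, e.g. 3 epochs times batches per epoch; the real train.py code may differ, e.g. by folding the lr schedule into the target values):

```python
import numpy as np


def warmup_step(optimizer, ni, nw, lr0=0.01, momentum=0.937):
    """Ramp lr/momentum over the first nw iterations (ni = batches seen so far)."""
    if ni > nw:
        return
    for j, g in enumerate(optimizer.param_groups):
        # biases (param group 2) start at an aggressive lr of 0.10 and ramp *down* to lr0;
        # the other groups ramp *up* from 0.0
        g['lr'] = float(np.interp(ni, [0, nw], [0.1 if j == 2 else 0.0, lr0]))
        if 'momentum' in g:
            g['momentum'] = float(np.interp(ni, [0, nw], [0.9, momentum]))
```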

[results plot]

EDIT: Another option is to make these warmup settings evolve-able, but we already have 23 hyps so I'd rather not grow the group if I don't have to.

@glenn-jocher (Member)

@NanoCode012 ok, I am testing a handful of warmup strategies for finetuning with hyp.scratch.yaml now (including no warmup). I should have results by tomorrow.

@NanoCode012 (Contributor, Author) commented Sep 5, 2020

Hi @glenn-jocher

The new fix should avoid this by ignoring corrupted/empty files, so only a single generation from a single node would be lost rather than the entire file.

Thanks for the explanation.

Ok, so we know to start from hyp.scratch.yaml and we know to use YOLOv5m; now all that's left is to decide on a number of epochs and start. I see you used 40 there with good results.

Yep. hyp scratch.

Batch-size test from the hyp finetune runs, 5m for 40 epochs.
(I forgot to mention I'm using a single 32 GB V100, not multi-GPU.)

| Batch-size | Highest mAP @ epoch | Total time |
| --- | --- | --- |
| 32 | 61.93 @ 21 | 15 h 30 mn |
| 40 | 61.83 @ 26 | 13 h 58 mn |
| 64 | 62.13 @ 25 | 14 h 5 mn |

Time is only for reference. There can be bottlenecks as I train multiple runs at a time. Would using multiple containers be faster than a single container? I've only done multiple trainings in a single container at a time, so I can keep track of which commit version I'm on and consolidate the results on TensorBoard.

Yesterday I also set up tests for epoch counts besides 40, such as 20, 30, 50, 60, 70, 100. The last three are at around epoch 55, so we will see the results by tomorrow as well. This should give us a good grasp of which to choose. We should balance the number of epochs against accuracy; I suspect that with longer training the accuracy becomes only marginally better.

I am testing a handful of warmup strategies for finetuning with hyp.scratch.yaml now (including no warmup). I should have results by tomorrow.

Looking forward to these results!

@NanoCode012 (Contributor, Author) commented Sep 5, 2020

Hi @glenn-jocher , my tests on different epochs are now done.

Overview: [graph]

Closer look near peak: [graph]

An extra 10 epochs is around 3-4 hours. An epoch is around 19-20 mins.

Results at highest:

| Total epochs | Total time | Highest train mAP @ epoch |
| --- | --- | --- |
| 20 | 7 hr 22 mn | 62.07 @ 20 |
| 30 | 10 hr 47 mn | 62.35 @ 29 |
| 40 | 13 h 8 mn | 62.47 @ 40 |
| 50 | 17 h 29 mn | 62.49 @ 47 |
| 60 | 20 h 58 mn | 62.49 @ 55 |
| 70 | 1 d 0 h 18 mn | 62.53 @ 68 |
| 100 | 1 d 10 h 12 mn | 62.62 @ 90 |

We should safely be able to use 40 epochs at the highest batch size (64 for me) unless your warmup results prove otherwise. My next concern is how far we are going to tune the hyps (so as not to overfit) and whether they will correlate directly with training from scratch, as we don't have any conclusive evidence that they would, only that finetuning reaches nearly the same value. For example, pycocotools for finetune_100 reaches 63.32.


I just finished a set of YOLOv5m 20 epochs finetuning for each hyp file. I get the same results, scratch is better.

Do you also see test scores lower than train scores? I see them across my entire training.

@glenn-jocher (Member)

@NanoCode012 wow, that's really good work! I really need to switch to TensorBoard for my cloud training; I'm still stuck plotting results.txt files. I tested 12 different warmup techniques (YOLOv5m --epochs 10) and was surprised to see minimal impact overall. One interesting takeaway is that results320, no warmup, shows by far the best GIoU loss (need to rename this to box loss, CIoU is used now), which seems to show that box loss always benefits from more training to a greater degree than the other two.

[warmup comparison plots]

If I zoom in on the final 5 epochs I see that 322 was the best, which starts from a very low initial momentum of 0.5 for all param groups. But 324 (pink, initial momentum = 0.8, initial bias lr0 = 0.01) showed the best trend in the final epochs.
[results plot]

So I think I'll create a docker image with a mix of the best two results above (initial momentum 0.6, initial bias lr 0.05), and also make them evolveable.

From your results we see diminishing returns from increased epochs, with 20 to 30 showing the largest improvement vs added time. So it looks like 30 may be a good sweet spot.

@glenn-jocher (Member)

@NanoCode012 BTW, about your other question, how finetune hyps will relate to scratch hyps, I really don't know. Evolving finetuning hyps on COCO is probably our most feasible task. We could evolve from-scratch hyps similarly, perhaps for more epochs, i.e. 50 or 100, but these would take much longer to test, since we'd want to apply them to a full 300 epochs, and it's possible that scratch hyps evolved for even 100 epochs would overfit far too soon when trained to 300 epochs, so we'd be back to playing the game of preventing early overfitting in our results. This is basically what I did with YOLOv3 last year; it's a very difficult path for those with limited resources. If we were Google we could use a few 8x V100 machines to evolve scratch hyps over 300 full epochs, problem solved, but that door isn't open to us unfortunately, so evolving finetuning for 30 epochs seems like our best bet.

To be clear though I don't know how well it will work, it's a big experiment. In ML often you just have to do it and find out.

@glenn-jocher (Member)

I plotted just the momentum changes I tried, leaving everything else the same. The results are huge! Lowering initial momentum helps greatly retain some of the initial mAP, and leads to lower validation losses. This makes me think I should test lower initial momentum values as well, maybe 0.2, 0.1, 0.0.
[results plot]

This is initial bias LR, all else equal. Default 0.1 seems too high, 0.0 too low.
[results plot]

This is the number of warmup epochs. Well, hmm looking at this one could argue that the best finetuning warmup strategy is no warmup strategy.
[results plot]

Closeup of warmup epochs. The zero-warmup trend in the final epochs looks the best; the red slope is steepest.
[results plot]

@glenn-jocher (Member)

@NanoCode012 hyp evolution is all set now! See #918

burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this pull request Jan 1, 2021
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022