ResNet synthetic data performance enhancement. #5225
Merged
All numbers are from a DGX-1 with V100s.
tl;dr: I improved synthetic data performance from ~4,800 images/sec to ~5,500 images/sec, a 14.6% speedup on ResNet V1 FP16; the gain may be larger with smaller models.
The current synthetic data has a couple of problems: 1) the dtype is set to float32 and is then always cast on the GPU (something that also needs to change for real data, though it is less problematic there; I will open a PR for that next), and 2) it does not appear to use prefetch. Together these create a situation where real data is faster than synthetic data: ~5,200 images/sec on ResNet V1 with real data versus ~4,800 images/sec with synthetic data.
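To make the two fixes concrete, here is a minimal sketch of a synthetic input pipeline that generates the batch in the target dtype on the host and adds prefetch. This is an illustration, not the code from this PR: the function name `synthetic_dataset` and its parameters are mine, and the shapes are just placeholders.

```python
import tensorflow as tf

def synthetic_dataset(batch_size=32, height=224, width=224, channels=3,
                      dtype=tf.float16):
    # Build one random batch once and repeat it, so there is no
    # per-step data-generation work on the host.
    images = tf.random.uniform([batch_size, height, width, channels],
                               dtype=tf.float32)
    # Cast to the training dtype here, in the input pipeline,
    # instead of casting every batch on the GPU.
    images = tf.cast(images, dtype)
    labels = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensors((images, labels)).repeat()
    # Prefetch so input production overlaps with GPU compute.
    return ds.prefetch(buffer_size=tf.data.AUTOTUNE)
```

With both changes in place the accelerator receives ready-to-use fp16 batches, which is what closes the gap against real data.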
During my testing I found:
This solution still has the host-to-device copy, which I believe can only be removed with a custom dataset, and I doubt that is worth doing in the near term.
As follow-up work: moving the tf.cast to fp16 into the input pipeline for real data, and then removing the tf.cast in resnet_run_loop, gave a small but seemingly consistent improvement. It also seems more correct and keeps work off the GPU.
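The follow-up idea above could look something like the sketch below: a map step that casts images inside the tf.data pipeline so the GPU-side cast in resnet_run_loop can be dropped. The helper name `cast_in_pipeline` is mine, not from the repo.

```python
import tensorflow as tf

def cast_in_pipeline(dataset, dtype=tf.float16):
    # Cast on the CPU input threads as part of the pipeline, so the
    # model receives tensors already in the training dtype and no
    # per-batch cast runs on the GPU.
    return dataset.map(
        lambda images, labels: (tf.cast(images, dtype), labels),
        num_parallel_calls=tf.data.AUTOTUNE)
```

Because map runs on the input threads and can be parallelized, the cast overlaps with GPU compute instead of competing with it.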