[MPS] Fast math env var #129007
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129007
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 4a15373 with merge base 9a7e251. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Do you mind landing it separately from the rest of the stack?
Also, it would be nice to expose it not just through the environment variable, but via a property as well, something like `torch.backends.mps.use_fast_math`?
Also, I feel like we are at the point where it's probably time to start compiling those kernels offline rather than during first use of the op, so it would not be helpful anyway.
Hmm, where do we usually store this kind of state, in Python or C++? Can I use the environment variable as the ground truth, so that the property and the environment variable stay synchronized?
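A minimal sketch of what that could look like, treating the environment as the single source of truth so the two views cannot drift apart. The property was never landed, so the names below (including the assumed variable name `PYTORCH_MPS_FAST_MATH`) are hypothetical:

```python
import os

_ENV_VAR = "PYTORCH_MPS_FAST_MATH"  # assumed name; confirm against the PR diff

def get_use_fast_math() -> bool:
    # Read straight from the environment so there is no second copy of the state.
    return os.environ.get(_ENV_VAR, "0") == "1"

def set_use_fast_math(enabled: bool) -> None:
    # Writing back to the environment keeps the C++ side (which calls getenv)
    # and any subprocesses consistent with the Python view. Note the flag only
    # affects kernels that have not been compiled yet.
    os.environ[_ENV_VAR] = "1" if enabled else "0"
```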
@malfet If we plan to compile the kernels offline, let's not put much effort into adding the property.
Sure
@pytorchbot merge -f "Lint + MPS are green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR generalizes the `multi_tensor_apply` function for other fused optimizers.

Pull Request resolved: #129105
Approved by: https://github.com/malfet
ghstack dependencies: #129006, #129008, #129007
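For context, here is a conceptual sketch of the batching idea behind `multi_tensor_apply`, illustrated with the public `torch._foreach_*` ops rather than the internal API: instead of launching one kernel per parameter tensor, a fused optimizer applies one operation across the whole tensor list.

```python
import torch

def sgd_per_tensor(params, grads, lr):
    # Naive loop: one kernel launch per tensor, so launch overhead
    # dominates when there are many small tensors.
    for p, g in zip(params, grads):
        p.add_(g, alpha=-lr)

def sgd_batched(params, grads, lr):
    # Batched update: a single foreach op walks the whole tensor list,
    # similar in spirit to what multi_tensor_apply does internally.
    torch._foreach_add_(params, grads, alpha=-lr)
```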
```
[-------------------------------------- Fused SGD --------------------------------------]
                                                        |  Fused: True  |  Fused: False
1 threads: ------------------------------------------------------------------------------
      numel: 1024, num_tensors: 100, momentum: True     |       2       |       15
      numel: 1024, num_tensors: 100, momentum: False    |       2       |        5
      numel: 65536, num_tensors: 100, momentum: True    |       3       |       16
      numel: 65536, num_tensors: 100, momentum: False   |       2       |        5
      numel: 1048576, num_tensors: 100, momentum: True  |      11       |       16
      numel: 1048576, num_tensors: 100, momentum: False |       8       |        6
      numel: 1024, num_tensors: 500, momentum: True     |      29       |       70
      numel: 1024, num_tensors: 500, momentum: False    |      20       |       24
      numel: 65536, num_tensors: 500, momentum: True    |      33       |       76
      numel: 65536, num_tensors: 500, momentum: False   |      22       |       26
      numel: 1048576, num_tensors: 500, momentum: True  |      70       |       80
      numel: 1048576, num_tensors: 500, momentum: False |      43       |       40
      numel: 1024, num_tensors: 1000, momentum: True    |     108       |      139
      numel: 1024, num_tensors: 1000, momentum: False   |      72       |       48
      numel: 65536, num_tensors: 1000, momentum: True   |     116       |      150
      numel: 65536, num_tensors: 1000, momentum: False  |      77       |       52
      numel: 1048576, num_tensors: 1000, momentum: True |     190       |      170
      numel: 1048576, num_tensors: 1000, momentum: False|     120       |       50
```

```python
import torch  # needed for torch.arange and torch.mps below


def profile_fused_sgd():
    from torch.optim.sgd import sgd
    import torch.utils.benchmark as benchmark
    import itertools

    def profile(fn, params, grads, momentum_buffer_list, fused):
        fn(
            params,
            grads,
            momentum_buffer_list,
            momentum=True if len(momentum_buffer_list) > 0 else False,
            dampening=0.0,
            nesterov=False,
            foreach=False,
            fused=fused,
            lr=1e-3,
            weight_decay=0.0,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"
    results = []
    for num_tensors, numel, momentum in itertools.product(
        [100, 500, 1000], [1024, 65536, 1048576], [True, False]
    ):
        sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}"
        print(sublabel)
        params, grads = [
            [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)]
            for _ in range(2)
        ]
        momentum_buffer_list = (
            [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)]
            if momentum
            else []
        )
        fn = sgd
        for fused in [True, False]:
            t = benchmark.Timer(
                stmt="profile(fn, params, grads, momentum_buffer_list, fused)",
                label="Fused SGD",
                sub_label=sublabel,
                globals=locals(),
                description=f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
            results.append(t)
    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: #129350
Approved by: https://github.com/janeyx99
ghstack dependencies: #129006, #129008, #129007, #129105
Stack from ghstack (oldest at bottom):
Allow users to decide whether they want fast math enabled via an environment variable.
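As a usage sketch, assuming the variable is named `PYTORCH_MPS_FAST_MATH` (check the PR diff for the exact spelling), it has to be set before the first MPS kernel is compiled:

```python
import os

# Assumed variable name from this PR; set it before the first MPS kernel
# is compiled, since the flag is read at kernel-compilation time.
os.environ["PYTORCH_MPS_FAST_MATH"] = "1"

import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, device="mps")
    y = torch.exp(x)  # kernels compiled from here on may use fast math
```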