[feature request] torch.to(obj, device)
supporting recursive lists/dicts/tuples of tensors probably by uplifting/promoting torch.distributed.utils._recursive_to
#69431
Wouldn't this be nicely solved by just using pytree from here?
Maybe! But I think it's still useful to have this primitive in core, as it appears a lot in user code and very few people know about pytrees (I certainly don't). It might also be better to have a simple manual version, to be exactly sure what's happening. E.g. a simple version makes it clear that it would be slow-ish for giant lists of Python numbers (there would be a lot of checks before returning each value); it's not clear what happens with pytree, and it also brings in a flatten/unflatten concept. Also, torch.to doesn't exist now, or isn't documented.
Oh, I used that because you mentioned it above :p But indeed, that doesn't even exist.
The lack of this doesn't seem like a real impediment, though. We do write "crawlers" like this occasionally. NumPy doesn't have any functionality like this, does it?
Yeah, I just propose to use the opportunity and introduce torch.to as a slightly more generic version, supporting basic Python structures (also supported by TorchScript), be it with pytree utils or not. I'm proposing this on UX grounds, so I can't argue that the lack of it is an "impediment". However, similar functions get re-rolled in practically every project. I somewhat agree, but even when implementing custom device-movers for custom structures, having an existing utility that works for the known types would make the code simpler. My code above just passes non-PyTorch things through. And I also agree that it would have been nicer to have more modular/reusable (but still simple enough) "collation" / "conversion" routines. NumPy doesn't have a concept of device, so it obviously doesn't have this functionality (and the need for other automated conversions, like dtype conversions, is much smaller there), but that reason isn't relevant IMO.
I saw more examples of pytree elsewhere, so I now understand better that pytree traversal can indeed be used to implement this recursive device transfer :) But I would still suggest that it's a good default behavior for torch.to (to be introduced), without forcing users to write their own pytree'd version, since this recursive mode is often needed. There might also be optimizations like preserving data sharing, so that if we do torch.to(torch.split(...), ...) the converted tensors are still views over parts of a single tensor. But it might be too special-case.
I also wonder if tree_map is parallel (akin to the _foreach methods). Probably for
Related PR that actually implements this recursive utility: #77187, but not in a generic namespace :(
Found this issue after @vadimkantorov's comment on a related PR. Agreed that such a utility would be quite useful, and PT-D would then not need such custom logic to move inputs for DDP / FSDP. @albanD @jbschlosser @mruberry do we have any new thoughts on this feature, and whether this is something that the core team might be able to address?
Any reason for not using
I understand this would be convenient, but we try not to add sugar to core PyTorch operations.
To be explicit: this would involve use of
Well, for
There is some stuff about streams; not sure if tree_map will be able to handle it? It may also be a good idea to support non_blocking and calling custom methods.
So now, in theory, torch.to could just be implemented in terms of torch.utils._pytree.tree_map. Compared to vanilla tree_map, torch.to could theoretically do all allocations in one batch (asynchronously) / in a single allocation, but I'm not sure what a good design for that would be. It's also not clear whether one large allocation is better than many smaller ones with respect to reuse of already-allocated segments that might not be contiguous.
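To make the shape of the tree_map idea concrete, here is a pure-Python stand-in for torch.utils._pytree.tree_map_only (written without torch so it runs anywhere; the real pytree utilities also handle namedtuples and registered custom types, which this sketch does not):

```python
def tree_map_only(ty, fn, tree):
    """Apply `fn` to every leaf of type `ty` in a nested dict/list/tuple,
    leaving all other leaves untouched."""
    if isinstance(tree, ty):
        return fn(tree)
    if isinstance(tree, dict):
        return {k: tree_map_only(ty, fn, v) for k, v in tree.items()}
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map_only(ty, fn, v) for v in tree)
    return tree  # non-matching leaf: passed through unchanged

# With the real pytree utilities, a hypothetical torch.to(obj, device)
# could then be roughly a one-liner (sketch, not an existing API):
#   def to(obj, device, **kwargs):
#       return torch.utils._pytree.tree_map_only(
#           torch.Tensor, lambda t: t.to(device, **kwargs), obj)
```

Here `tree_map_only(str, str.upper, {"a": ["x", 1]})` would uppercase only the string leaves and leave the integer alone, which is exactly the "move only the tensors, pass everything else through" behavior being discussed.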
Although, in another context (sparse tensors), this post describes an optimization that torch.to could also do (minimizing the number of host->device copies): https://pytorch.org/blog/optimizing-production-pytorch-performance-with-graph-transformations/#31-combining-input-sparse-features
I guess pytree would be the way to go for this in 2024 :D
On the workability side, yes, it would cut it, but having some built-in option for a separate copy thread, or for efficiently moving a lot of tensors in one allocation, might be interesting, especially if it's already implemented in the distributed context. Otherwise, on the functional programming side, I think implementing torch.to as tree_map_only is a worthy shortcut for users to have in core, and for consistency between functions/methods :)
I do not think that changing torch.to() would be good for consistency, actually. That would make the behavior significantly different between the function and the methods.
For basic individual tensors, the behavior would be exactly the same anyway? So for all inputs acceptable to the instance .to method, the function torch.to would behave the same, right? I'm proposing to keep them strictly identical for all individual tensors (so both .to and torch.to would continue to exist), and potentially add more knobs/functionality for more complex inputs, like tensor lists or pytree object hierarchies.
Btw, so maybe it is the right moment for it to be promoted at least to
We should remove this code and just use tree_map there instead. It's going to be a lot more reliable...
I think it'd also be a great impl for a publicly advertised shortcut :) Otherwise, this duplicate code exists in distributed_utils and in ZeroOptimizer.py. A more custom op could somehow parallelize copies (if a large tensor list is passed as input) using multiple CUDA streams, or allocate a single large contiguous memory chunk on the GPU (and maybe do all this in async fashion, if that suits, so that the copies can be scheduled on a user-provided background CUDA stream dedicated to copies).
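The batching idea above can be sketched as a flatten / bulk-move / unflatten pattern. This is a pure-Python illustration; `bulk_move` is a hypothetical hook standing in for wherever the single large allocation or a _foreach-style batched transfer would go, and `predicate` stands in for an `isinstance(x, torch.Tensor)` check:

```python
def flatten_leaves(tree, predicate):
    """Collect all leaves matching `predicate`, in traversal order."""
    leaves = []
    def walk(node):
        if predicate(node):
            leaves.append(node)
        elif isinstance(node, dict):
            for v in node.values():
                walk(v)
        elif isinstance(node, (list, tuple)):
            for v in node:
                walk(v)
    walk(tree)
    return leaves

def unflatten_leaves(tree, predicate, new_leaves):
    """Rebuild `tree`, substituting matching leaves from `new_leaves` in order."""
    it = iter(new_leaves)
    def rebuild(node):
        if predicate(node):
            return next(it)
        if isinstance(node, dict):
            return {k: rebuild(v) for k, v in node.items()}
        if isinstance(node, (list, tuple)):
            return type(node)(rebuild(v) for v in node)
        return node
    return rebuild(tree)

def batched_to(tree, device, predicate, bulk_move):
    # One batched transfer for all leaves, instead of one .to() call per leaf
    leaves = flatten_leaves(tree, predicate)
    moved = bulk_move(leaves, device)
    return unflatten_leaves(tree, predicate, moved)
```

The point of the split is that `bulk_move` sees the whole list of tensors at once, so it is free to coalesce them into one contiguous buffer, issue copies on a side stream, or do a single allocation, none of which a per-leaf tree_map callback can do.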
I think that if we need to add one more API for our users to know about, I'd prefer for them to learn about pytree. It will allow them to solve many of their problems with one line, compared to a specialized API that solves only a single problem.
I think another advantage of introducing torch.to is that later on, some optimizations can be made for processing large lists.
In theory, nothing prevents Dynamo from tracing pytree work and doing fancy optimization there :D cc @zou3519 for pytree as a public feature.
Btw, funnily, global torch-bound
Seems like HF's attempt to implement something like
Often it is necessary to move model results to CPU (or inputs to GPU). Once the data structures get a bit complicated, dicts and lists appear often in model results, and we have to roll a little utility method like the one below. If other people have indeed had to write this sort of utility, it may make sense to include something like it in core.
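The referenced utility was not captured in this copy of the issue; a minimal sketch of what such a crawler typically looks like is below. It is duck-typed (anything with a `.to` method is moved) so it runs without torch; the function name and the `non_blocking` pass-through are illustrative, and real containers like namedtuples or custom classes would need extra handling (as torch.distributed.utils._recursive_to does):

```python
def recursive_to(obj, device, non_blocking=False):
    """Recursively move tensors nested inside dicts/lists/tuples to `device`.

    Anything exposing a `.to(device, ...)` method (torch.Tensor, nn.Module,
    or other tensor-likes) is moved; other values (ints, strings, None, ...)
    are passed through unchanged.
    """
    if hasattr(obj, "to"):
        return obj.to(device, non_blocking=non_blocking)
    if isinstance(obj, dict):
        return type(obj)(
            (k, recursive_to(v, device, non_blocking)) for k, v in obj.items()
        )
    if isinstance(obj, (list, tuple)):
        # NB: plain list/tuple only; namedtuples would need type(obj)(*items)
        return type(obj)(recursive_to(v, device, non_blocking) for v in obj)
    return obj  # non-tensor leaf: returned as-is
```

Usage would look like `outputs = recursive_to(outputs, "cpu")` after a forward pass, or `batch = recursive_to(batch, device, non_blocking=True)` in a training loop.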
cc @albanD @mruberry @jbschlosser @walterddr