[data] Add DataIterator.materialize
#43210
Merged
Why are these changes needed?
This PR introduces a `DataIterator.materialize` API that fully executes and consumes a data iterator and returns the result as a `MaterializedDataset` that the user can continue processing.

The reason to add this API is to support model training in Ray Train that requires the full dataset up front. For example, `xgboost` needs to consider the full dataset to fit decision trees, so it expects that dataset to be available in full.

The `get_dataset_shard` API, which bridges Ray Data and Ray Train, calls `streaming_split` on the dataset, where the number of splits is the number of training workers. This works well for SGD training schemes (typical for Torch and TensorFlow users), since the typical training procedure estimates the gradient on a small batch of data at a time. Fitting decision trees requires searching for the best split over the entire dataset, so batch-by-batch data loading is not suitable.

With this change, the following workflow is now possible:
XGBoost training with a data iterator
Note that there actually is support for xgboost training with data iterators, but it is experimental and possibly less performant: https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator
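
The difference between batch-by-batch consumption and materializing first can be illustrated with a minimal pure-Python sketch. Everything in it is a hypothetical stand-in written for illustration: `ToyDataIterator`, its `iter_batches` and `materialize` methods, and `best_split` only mimic the idea described above and are not the Ray Data or XGBoost implementations.

```python
from typing import Iterator, List, Tuple

Row = Tuple[float, int]  # (feature value, binary label)


class ToyDataIterator:
    """Hypothetical stand-in for a streaming data iterator (not Ray's class)."""

    def __init__(self, rows: List[Row], batch_size: int) -> None:
        self._rows = rows
        self._batch_size = batch_size

    def iter_batches(self) -> Iterator[List[Row]]:
        """Yield the data one small batch at a time, as SGD-style training would."""
        for start in range(0, len(self._rows), self._batch_size):
            yield self._rows[start:start + self._batch_size]

    def materialize(self) -> List[Row]:
        """Consume every batch and return the full dataset in one piece."""
        return [row for batch in self.iter_batches() for row in batch]


def best_split(rows: List[Row]) -> float:
    """Return the threshold t with the fewest errors for the rule: predict 1 if x > t."""

    def errors(threshold: float) -> int:
        return sum((x > threshold) != bool(y) for x, y in rows)

    candidates = sorted({x for x, _ in rows})
    return min(candidates, key=errors)


rows = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]
it = ToyDataIterator(rows, batch_size=2)

# Searching each batch separately yields a different threshold per batch.
per_batch_splits = [best_split(batch) for batch in it.iter_batches()]

# Materializing first lets the search see the entire dataset at once.
full_dataset = it.materialize()
global_split = best_split(full_dataset)
```

In this toy data, each batch suggests its own threshold, while the search over the materialized data finds the single threshold that separates all rows. That is the reason a tree-based trainer wants the whole shard before fitting.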
Related PR
This PR is a prerequisite for #42767