[data] Add `DataIterator.materialize` (#43210)
This PR introduces a `DataIterator.materialize` API that fully executes and consumes a data iterator and returns the result as a `MaterializedDataset` that the user can continue to process.

This API supports model training in Ray Train that requires the full dataset up front. For example, `xgboost` needs to consider the full dataset to fit decision trees and expects that full dataset to be available.

The `get_dataset_shard` API, which bridges Ray Data and Ray Train, calls `streaming_split` on the dataset, with the number of splits equal to the number of training workers. This works well for SGD-based training (typical for Torch and TensorFlow users), since each step estimates the gradient on a small batch of data. Fitting decision trees, by contrast, requires searching for the best split over the entire dataset, so batch-by-batch data loading is not suitable.

Signed-off-by: Justin Yu <[email protected]>
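To illustrate why a split search wants the whole dataset rather than one batch at a time, here is a minimal pure-Python sketch. It does not use Ray: `stream_batches`, `materialize_batches`, and `best_split` are hypothetical helpers invented for this example, and the toy data is made up. It only mimics the idea of draining a batch iterator into one in-memory dataset before fitting.

```python
from typing import Dict, Iterable, Iterator, List

Batch = Dict[str, List[float]]


def stream_batches(xs: List[float], ys: List[float], batch_size: int) -> Iterator[Batch]:
    """Hypothetical stand-in for a streaming iterator: yields small batches."""
    for start in range(0, len(xs), batch_size):
        yield {"x": xs[start:start + batch_size], "y": ys[start:start + batch_size]}


def materialize_batches(batches: Iterable[Batch]) -> Batch:
    """Hypothetical helper: consume every batch and hold the full data in memory."""
    full: Batch = {"x": [], "y": []}
    for batch in batches:
        full["x"].extend(batch["x"])
        full["y"].extend(batch["y"])
    return full


def best_split(data: Batch) -> float:
    """Return the threshold on x that minimizes the summed squared error
    of a single-split regression stump over the data it is given."""

    def sse(values: List[float]) -> float:
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(data["x"], data["y"]))
    best_threshold, best_error = pairs[0][0], float("inf")
    for i in range(1, len(pairs)):
        # Candidate threshold: midpoint between neighbouring x values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold


# Toy data (made up): the label jumps between x = 5 and x = 6.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

# Split search over the fully materialized data.
full = materialize_batches(stream_batches(xs, ys, batch_size=4))
global_split = best_split(full)

# Split search over only the first streamed batch.
first_batch_split = best_split(next(stream_batches(xs, ys, batch_size=4)))

print(global_split, first_batch_split)
```

In this toy case the first batch contains only one label value, so a split chosen from it says nothing about where the label actually changes; only the search over the materialized data finds the boundary.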