[Datasets] Add sub progress bar support for AllToAllOperator (#33302)
This PR adds sub progress bar support for `AllToAllOperator`. Before this PR, we did not report `AllToAllOperator` progress correctly with the streaming executor; the sub progress bar was only used internally in `AllToAllOperator.bulk_fn`. This PR hooks the sub progress bars up with overall progress bar reporting, so they show up properly in the console. The change includes:

* `AllToAllStage.sub_stage_names`: indicates the name of each sub-stage / sub progress bar (e.g. `ShuffleMap`).
* `AllToAllOperator.sub_progress_bar_names`: same as `AllToAllStage.sub_stage_names`.
* `AllToAllOperator.initialize_sub_progress_bars()` / `close_sub_progress_bars()`: called from `OpState` to control initializing and closing these sub progress bars.
* `TaskContext.sub_progress_bar_iter`: the iterator of sub progress bars to be used in each `AllToAllOperator.bulk_fn` (a minimal sketch of this flow follows the examples below).

Examples:

1. `random_shuffle()` and `repartition()`

```py
>>> import ray
>>> import time
>>> def sleep(x):
...     time.sleep(0.1)
...     return x
...
>>>
>>> for _ in (
...     ray.data.range_tensor(5000, shape=(80, 80, 3), parallelism=200)
...     .map_batches(sleep, num_cpus=2)
...     .map_batches(sleep, compute=ray.data.ActorPoolStrategy(2, 4))
...     .random_shuffle()
...     .map_batches(sleep, num_cpus=2)
...     .repartition(400)
...     .map_batches(sleep, num_cpus=2)
...     .iter_batches()
... ):
...     pass
2023-03-14 14:41:19,062 INFO worker.py:1550 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-03-14 14:41:21,218 INFO streaming_executor.py:74 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> TaskPoolMapOperator[MapBatches(sleep)] -> ActorPoolMapOperator[MapBatches(sleep)] -> AllToAllOperator[RandomShuffle] -> TaskPoolMapOperator[MapBatches(sleep)] -> AllToAllOperator[Repartition] -> TaskPoolMapOperator[MapBatches(sleep)]
Resource usage vs limits: 0.0/10.0 CPU, 0.0/0.0 GPU, 14.5 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:24<?, ?it/s]
2023-03-14 14:41:45,936 WARNING plan.py:572 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune
(raylet) Spilled 2922 MiB, 2208 objects, write throughput 1118 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
Resource usage vs limits: 2.0/10.0 CPU, 0.0/0.0 GPU, 677.51 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:31<?, ?it/s]
ReadRange: 0 active, 0 queued 1: 100%|██████████████████████████████████████████████████| 200/200 [00:29<00:00, 16.23it/s]
MapBatches(sleep): 0 active, 0 queued 2: 100%|██████████████████████████████████████████| 200/200 [00:29<00:00, 16.59it/s]
MapBatches(sleep): 0 active, 0 queued, 0 actors [200 locality hits, 0 misses] 3: 100%|██| 200/200 [00:29<00:00, 17.25it/s]
RandomShuffle: 0 active, 0 queued 4: 100%|██████████████████████████████████████████████| 200/200 [00:26<00:00, 19.93s/it]
*- ShuffleMap 5: 100%|████████████████████████████████████████████████████████████████| 200/200 [00:29<00:00, 80.30it/s]
*- ShuffleReduce 6: 100%|█████████████████████████████████████████████████████████████| 200/200 [00:29<00:00, 10.66it/s]
MapBatches(sleep): 0 active, 0 queued 7: 100%|██████████████████████████████████████████| 200/200 [00:26<00:00, 42.88it/s]
Repartition: 0 active, 0 queued 8: 0%| | 1/400 [00:28<3:12:04, 28.88s/it]
*- Repartition 9: 94%|███████████████████████████████████████████████████████████▌ | 378/400 [00:27<00:00, 23.50it/s]
MapBatches(sleep): 5 active, 362 queued 10: 8%|███▎ | 33/400 [00:31<00:56, 6.55it/s]
output: 3 queued 11: 8%|█████ | 32/400 [00:31<00:56, 6.57it/s]
```

2. `repartition(shuffle=True)`

```py
Resource usage vs limits: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:00<?, ?it/s]
ReadCSV->Repartition: 0 active, 0 queued, 0 output 1: 0%| | 0/10 [00:00<?, ?it/s]
*- ShuffleMap 2: 0%| | 0/10 [00:00<?, ?it/s]
*- ShuffleReduce 3: 0%| | 0/10 [00:00<?, ?it/s]
output: 0 queued 4: 0%| | 0/10 [00:00<?, ?it/s]
```

3. `sort()`

```py
Resource usage vs limits: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:00<?, ?it/s]
Sort: 0 active, 0 queued, 0 output 1: 0%| | 0/10 [00:00<?, ?it/s]
*- SortSample 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.72it/s]
*- ShuffleMap 3: 20%|████████████████████▏ | 2/10 [00:18<01:00, 7.58s/it]
*- ShuffleReduce 4: 0%| | 0/10 [00:00<?, ?it/s]
output: 0 queued 5: 0%| | 0/10 [00:00<?, ?it/s]
```

4. `groupby().aggregate()`

```py
Resource usage vs limits: 0.0/10.0 CPU, 0.0/0.0 GPU, 0.0 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:00<?, ?it/s]
Aggregate: 0 active, 0 queued, 0 output 1: 0%| | 0/10 [00:00<?, ?it/s]
*- SortSample 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.72it/s]
*- ShuffleMap 3: 20%|████████████████████▏ | 2/10 [00:18<01:00, 7.58s/it]
*- ShuffleReduce 4: 0%| | 0/10 [00:00<?, ?it/s]
output: 0 queued 5: 0%| | 0/10 [00:00<?, ?it/s]
```

5. `groupby().map_groups()`

```py
Resource usage vs limits: 4.0/10.0 CPU, 0.0/0.0 GPU, 792.74 MiB/512.0 MiB object_store_memory 0: 0%| | 0/1 [00:03<?, ?it/s]
Sort: 0 active, 0 queued, 0 output 1: 0%| | 0/10 [00:00<?, ?it/s]
*- SortSample 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 25.20it/s]
*- ShuffleMap 3: 0%| | 0/10 [00:00<?, ?it/s]
*- ShuffleReduce 4: 0%| | 0/10 [00:00<?, ?it/s]
MapBatches(group_fn): 4 active, 6 queued 5: 0%| | 0/10 [00:03<?, ?it/s]
output: 0 queued 6: 0%| | 0/10 [00:00<?, ?it/s]
```
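For readers unfamiliar with the internals, here is a minimal, self-contained sketch of the flow the bullet list above describes. It is not the actual Ray Data implementation: `SubProgressBar`, `shuffle_bulk_fn`, and the simplified `TaskContext` below are illustrative stand-ins, and only the `sub_progress_bar_iter` field, the `sub_progress_bar_names` ordering, and the initialize/close lifecycle come from this PR.

```py
# Minimal sketch (NOT the actual Ray Data internals) of the flow above:
# OpState initializes the named sub progress bars, each
# AllToAllOperator.bulk_fn pulls them in order from
# TaskContext.sub_progress_bar_iter, and OpState closes them afterwards.
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class SubProgressBar:
    # Illustrative stand-in for Ray's internal progress bar object.
    name: str  # e.g. "ShuffleMap", "ShuffleReduce"
    total: int
    completed: int = 0
    closed: bool = False

    def update(self, n: int = 1) -> None:
        self.completed += n

    def close(self) -> None:
        self.closed = True


@dataclass
class TaskContext:
    # Simplified stand-in for the real TaskContext; only the
    # `sub_progress_bar_iter` field comes from this PR.
    task_idx: int
    sub_progress_bar_iter: Optional[Iterator[SubProgressBar]] = None


def shuffle_bulk_fn(blocks: List[str], ctx: TaskContext) -> List[str]:
    # Each sub-stage grabs the next bar; the iteration order is assumed
    # to match AllToAllOperator.sub_progress_bar_names.
    map_bar = next(ctx.sub_progress_bar_iter)  # "ShuffleMap"
    mapped = [b.upper() for b in blocks]       # stand-in for the map phase
    map_bar.update(len(mapped))

    reduce_bar = next(ctx.sub_progress_bar_iter)  # "ShuffleReduce"
    reduced = sorted(mapped)                      # stand-in for the reduce phase
    reduce_bar.update(len(reduced))
    return reduced


# What OpState does conceptually around each bulk_fn call:
names = ["ShuffleMap", "ShuffleReduce"]  # AllToAllOperator.sub_progress_bar_names
bars = [SubProgressBar(name, total=3) for name in names]  # initialize_sub_progress_bars()
ctx = TaskContext(task_idx=0, sub_progress_bar_iter=iter(bars))
print(shuffle_bulk_fn(["b2", "b0", "b1"], ctx))  # ['B0', 'B1', 'B2']
for bar in bars:                                 # close_sub_progress_bars()
    bar.close()
```

One nice property of this shape is that `bulk_fn` stays decoupled from the executor: it only advances an iterator handed to it via `TaskContext`, so the executor remains free to decide how (or whether) those bars are rendered.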