Dataset doc updates (ray-project#19815)
ericl committed Nov 5, 2021
1 parent 44b38e9 commit 6102912
Showing 7 changed files with 44 additions and 9 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -22,7 +22,7 @@ Ray is packaged with the following libraries for accelerating machine learning w
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Train`_: Distributed Deep Learning (alpha)
- `Datasets`_: Flexible Distributed Data Loading (beta)
- `Datasets`_: Distributed Data Loading and Compute (beta)

As well as libraries for taking ML and distributed apps to production:

Binary file added doc/source/data/dataset-compute-1.png
Binary file added doc/source/data/dataset-loading-1.png
Binary file added doc/source/data/dataset-loading-2.png
39 changes: 37 additions & 2 deletions doc/source/data/dataset.rst
@@ -1,7 +1,7 @@
.. _datasets:

Datasets: Flexible Distributed Data Loading
===========================================
Datasets: Distributed Data Loading and Compute
==============================================

.. tip::

@@ -14,6 +14,41 @@ Ray Datasets are the standard way to load and exchange data in Ray libraries and
..
https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit
Data Loading for ML Training
----------------------------

Ray Datasets are designed to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>`. Compared to other loading solutions, Datasets is more flexible (e.g., can express higher-quality `per-epoch global shuffles <examples/big_data_ingestion.html>`__) and provides `higher overall performance <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant>`__.

Datasets is not intended as a replacement for more general data processing systems. Its utility is as the last-mile bridge from ETL pipeline outputs to distributed applications and libraries in Ray:

.. image:: dataset-loading-1.png
:width: 650px
:align: center

..
https://docs.google.com/presentation/d/1l03C1-4jsujvEFZUM4JVNy8Ju8jnY5Lc_3q7MBWi2PQ/edit
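The per-epoch global shuffle mentioned above reorders the *entire* dataset between epochs, rather than shuffling only within local shards. A minimal pure-Python sketch of that idea (illustrative only; Ray's actual implementation shuffles blocks in the distributed object store, not an in-memory list):

```python
import random

def epoch_batches(records, batch_size, num_epochs, seed=0):
    """Yield minibatches where every epoch sees a fresh *global* shuffle,
    i.e., a reordering over the whole dataset rather than within shards."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(records)   # copy; the source data stays intact
        rng.shuffle(order)      # global shuffle across all records
        for i in range(0, len(order), batch_size):
            yield order[i:i + batch_size]

batches = list(epoch_batches(range(10), batch_size=4, num_epochs=2))
# 3 batches per epoch (4 + 4 + 2 records), so 6 batches total
```

Every epoch still covers every record exactly once; only the order (and hence batch composition) changes between epochs.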
Ray-integrated DataFrame libraries can also be used seamlessly with Datasets, enabling a full data-to-ML pipeline to run entirely within Ray, without requiring data to be materialized to external storage:

.. image:: dataset-loading-2.png
:width: 650px
:align: center
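Conceptually, this integration hands each DataFrame partition over as a Dataset block without a round trip through external storage. A toy pure-Python sketch of that partition-to-block handoff (the column-dict "dataframe" here is a stand-in; the real path uses Ray's DataFrame constructors over Arrow data):

```python
def dataframe_to_blocks(columns, block_size):
    """Turn a columnar table (dict of equal-length column lists) into
    row-major blocks, mimicking how DataFrame partitions become Dataset
    blocks without materializing to external storage."""
    names = list(columns)
    n = len(columns[names[0]])
    rows = [{name: columns[name][i] for name in names} for i in range(n)]
    return [rows[i:i + block_size] for i in range(0, n, block_size)]

blocks = dataframe_to_blocks({"x": [1, 2, 3, 4, 5], "y": list("abcde")},
                             block_size=2)
# blocks[0] -> [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]
```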

See the following for more Dataset ML use cases and benchmarks:

- [slides] `Talk given at PyData 2021 <https://docs.google.com/presentation/d/1zANPlmrxQkjPU62I-p92oFO3rJrmjVhs73hL4YbM4C4>`_

General Parallel Compute
------------------------

Beyond data loading, Datasets simplifies general-purpose parallel GPU/CPU compute in Ray (e.g., for `GPU batch inference <dataset.html#transforming-datasets>`__). Datasets provides a higher-level API over Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management.

.. image:: dataset-compute-1.png
:width: 500px
:align: center

Since it is built on Ray, Datasets can leverage the full functionality of Ray's distributed scheduler, e.g., using actors to amortize expensive setup and scheduling work onto GPUs via the ``num_gpus`` argument.
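The batching-plus-pooled-workers pattern that Datasets manages internally can be sketched in plain Python, with a thread pool standing in for the Ray tasks or actors that would actually run the work (this mirrors the shape of ``Dataset.map_batches``, but is not Ray's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def map_batches(items, fn, batch_size, max_workers=4):
    """Apply fn to fixed-size batches in parallel and flatten the results.
    A thread pool stands in for the Ray tasks/actors that Datasets would
    schedule (with GPUs attached via num_gpus in the real API)."""
    batches = [items[i:i + batch_size]
               for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = pool.map(fn, batches)  # preserves batch order
    return [item for batch in mapped for item in batch]

squares = map_batches(list(range(8)), lambda b: [x * x for x in b],
                      batch_size=3)
# squares -> [0, 1, 4, 9, 16, 25, 36, 49]
```

Batching amortizes per-task overhead, which is why a vectorized per-batch function is preferable to a per-item one for GPU inference.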

Concepts
--------
Ray Datasets implement `Distributed Arrow <https://arrow.apache.org/>`__. A Dataset consists of a list of Ray object references to *blocks*. Each block holds a set of items in either an `Arrow table <https://arrow.apache.org/docs/python/data.html#tables>`__ or a Python list (for Arrow-incompatible objects). Having multiple blocks in a dataset allows for parallel transformation and ingest of the data (e.g., into :ref:`Ray Train <train-docs>` for ML training).
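The block structure described above can be sketched with plain lists (in a real Dataset each block is a Ray object reference to an Arrow table or Python list, so the per-block transforms below would run as parallel Ray tasks):

```python
def create_dataset(items, parallelism):
    """Partition items into at most `parallelism` blocks; a real Dataset
    holds a Ray object reference per block instead of an in-process list."""
    size = max(1, -(-len(items) // parallelism))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_dataset(blocks, fn):
    """Transform each block independently; with object references these
    per-block transforms become independent, parallel Ray tasks."""
    return [[fn(item) for item in block] for block in blocks]

ds = create_dataset(list(range(10)), parallelism=4)   # 4 blocks of <= 3 items
ds2 = map_dataset(ds, lambda x: x + 1)
```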
2 changes: 1 addition & 1 deletion python/ray/data/grouped_dataset.py
@@ -17,7 +17,7 @@

@PublicAPI(stability="beta")
class GroupedDataset(Generic[T]):
"""Implements a lazy dataset grouped by key (Experimental).
"""Represents a grouped dataset created by calling ``Dataset.groupby()``.
The actual groupby is deferred until an aggregation is applied.
"""
10 changes: 5 additions & 5 deletions python/ray/data/read_api.py
@@ -111,15 +111,15 @@ def range_arrow(n: int, *, parallelism: int = 200) -> Dataset[ArrowRow]:

@PublicAPI(stability="beta")
def range_tensor(n: int, *, shape: Tuple = (1, ),
parallelism: int = 200) -> Dataset[np.ndarray]:
parallelism: int = 200) -> Dataset[ArrowRow]:
"""Create a Tensor dataset from a range of integers [0..n).
Examples:
>>> ds = ray.data.range_tensor(1000, shape=(3, 10))
>>> ds.map_batches(lambda arr: arr ** 2).show()
>>> ds.map_batches(lambda arr: arr * 2, batch_format="pandas").show()
This is similar to range(), but uses np.ndarrays to hold the integers
in tensor form. The dataset has overall the shape ``(n,) + shape``.
This is similar to range_arrow(), but uses the ArrowTensorArray extension
type. The dataset elements take the form {"value": array(N, shape=shape)}.
Args:
n: The upper bound of the range of integer records.
@@ -128,7 +128,7 @@ def range_tensor(n: int, *, shape: Tuple = (1, ),
Parallelism may be limited by the number of items.
Returns:
Dataset holding the integers as tensors.
Dataset holding the integers as Arrow tensor records.
"""
return read_datasource(
RangeDatasource(),
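The docstring change above says each element takes the form ``{"value": array(N, shape=shape)}``. A rough NumPy mimic of that record layout (filling each tensor with its row index is an illustrative assumption; the real implementation stores the column as an ArrowTensorArray extension type):

```python
import numpy as np

def range_tensor_records(n, shape=(1,)):
    """Mimic the record form of range_tensor(): each element is a dict
    with a "value" field holding an ndarray of the given shape. The fill
    value (the row index i) is an assumption for illustration only."""
    return [{"value": np.full(shape, i)} for i in range(n)]

rows = range_tensor_records(3, shape=(2, 2))
# rows[2]["value"] is a 2x2 array filled with 2
```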
