Dataset doc updates (ray-project#19815)
ericl committed Nov 5, 2021
1 parent 44b38e9 commit 6102912
Showing 7 changed files with 44 additions and 9 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -22,7 +22,7 @@ Ray is packaged with the following libraries for accelerating machine learning w
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Train`_: Distributed Deep Learning (alpha)
- `Datasets`_: Flexible Distributed Data Loading (beta)
- `Datasets`_: Distributed Data Loading and Compute (beta)

As well as libraries for taking ML and distributed apps to production:

Binary file added doc/source/data/dataset-compute-1.png
Binary file added doc/source/data/dataset-loading-1.png
Binary file added doc/source/data/dataset-loading-2.png
39 changes: 37 additions & 2 deletions doc/source/data/dataset.rst
@@ -1,7 +1,7 @@
.. _datasets:

Datasets: Flexible Distributed Data Loading
===========================================
Datasets: Distributed Data Loading and Compute
==============================================

.. tip::

@@ -14,6 +14,41 @@ Ray Datasets are the standard way to load and exchange data in Ray libraries and
..
https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit
Data Loading for ML Training
----------------------------

Ray Datasets are designed to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>`. Compared to other loading solutions, Datasets is more flexible (e.g., can express higher-quality `per-epoch global shuffles <examples/big_data_ingestion.html>`__) and provides `higher overall performance <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant>`__.

Datasets is not intended as a replacement for more general data processing systems. Its utility is as the last-mile bridge from ETL pipeline outputs to distributed applications and libraries in Ray:

.. image:: dataset-loading-1.png
:width: 650px
:align: center

..
https://docs.google.com/presentation/d/1l03C1-4jsujvEFZUM4JVNy8Ju8jnY5Lc_3q7MBWi2PQ/edit
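The per-epoch global shuffle mentioned above reorders the *entire* dataset between epochs, rather than shuffling only within local shards. A minimal pure-Python sketch of that idea (illustrative only; Ray's actual implementation shuffles blocks in the distributed object store, not an in-memory list):

```python
import random

def epoch_batches(records, batch_size, num_epochs, seed=0):
    """Yield minibatches where every epoch sees a fresh *global* shuffle,
    i.e., a reordering over the whole dataset rather than within shards."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(records)   # copy; the source data stays intact
        rng.shuffle(order)      # global shuffle across all records
        for i in range(0, len(order), batch_size):
            yield order[i:i + batch_size]

batches = list(epoch_batches(range(10), batch_size=4, num_epochs=2))
# 3 batches per epoch (4 + 4 + 2 records), so 6 batches total
```

Every epoch still covers every record exactly once; only the order (and hence batch composition) changes between epochs.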
Ray-integrated DataFrame libraries can also be used seamlessly with Datasets, enabling a full data-to-ML pipeline to run entirely within Ray, without requiring data to be materialized to external storage:

.. image:: dataset-loading-2.png
:width: 650px
:align: center
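Conceptually, this integration hands each DataFrame partition over as a Dataset block without a round trip through external storage. A toy pure-Python sketch of that partition-to-block handoff (the column-dict "dataframe" here is a stand-in; the real path uses Ray's DataFrame constructors over Arrow data):

```python
def dataframe_to_blocks(columns, block_size):
    """Turn a columnar table (dict of equal-length column lists) into
    row-major blocks, mimicking how DataFrame partitions become Dataset
    blocks without materializing to external storage."""
    names = list(columns)
    n = len(columns[names[0]])
    rows = [{name: columns[name][i] for name in names} for i in range(n)]
    return [rows[i:i + block_size] for i in range(0, n, block_size)]

blocks = dataframe_to_blocks({"x": [1, 2, 3, 4, 5], "y": list("abcde")},
                             block_size=2)
# blocks[0] -> [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]
```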

See the following for more Dataset ML use cases and benchmarks:

- [slides] `Talk given at PyData 2021 <https://docs.google.com/presentation/d/1zANPlmrxQkjPU62I-p92oFO3rJrmjVhs73hL4YbM4C4>`_

General Parallel Compute
------------------------

Beyond data loading, Datasets simplifies general-purpose parallel GPU/CPU compute in Ray (e.g., for `GPU batch inference <dataset.html#transforming-datasets>`__). Datasets provides a higher-level API over Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management.

.. image:: dataset-compute-1.png
:width: 500px
:align: center

Since it is built on Ray, Datasets can leverage the full functionality of Ray's distributed scheduler, e.g., using actors to amortize expensive setup and scheduling work onto GPUs via the ``num_gpus`` argument.
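The batching-plus-pooled-workers pattern that Datasets manages internally can be sketched in plain Python, with a thread pool standing in for the Ray tasks or actors that would actually run the work (this mirrors the shape of ``Dataset.map_batches``, but is not Ray's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def map_batches(items, fn, batch_size, max_workers=4):
    """Apply fn to fixed-size batches in parallel and flatten the results.
    A thread pool stands in for the Ray tasks/actors that Datasets would
    schedule (with GPUs attached via num_gpus in the real API)."""
    batches = [items[i:i + batch_size]
               for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = pool.map(fn, batches)  # preserves batch order
    return [item for batch in mapped for item in batch]

squares = map_batches(list(range(8)), lambda b: [x * x for x in b],
                      batch_size=3)
# squares -> [0, 1, 4, 9, 16, 25, 36, 49]
```

Batching amortizes per-task overhead, which is why a vectorized per-batch function is preferable to a per-item one for GPU inference.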

Concepts
--------
Ray Datasets implement `Distributed Arrow <https://arrow.apache.org/>`__. A Dataset consists of a list of Ray object references to *blocks*. Each block holds a set of items in either an `Arrow table <https://arrow.apache.org/docs/python/data.html#tables>`__ or a Python list (for Arrow-incompatible objects). Having multiple blocks in a dataset allows for parallel transformation and ingest of the data (e.g., into :ref:`Ray Train <train-docs>` for ML training).
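The block structure described above can be sketched with plain lists (in a real Dataset each block is a Ray object reference to an Arrow table or Python list, so the per-block transforms below would run as parallel Ray tasks):

```python
def create_dataset(items, parallelism):
    """Partition items into at most `parallelism` blocks; a real Dataset
    holds a Ray object reference per block instead of an in-process list."""
    size = max(1, -(-len(items) // parallelism))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_dataset(blocks, fn):
    """Transform each block independently; with object references these
    per-block transforms become independent, parallel Ray tasks."""
    return [[fn(item) for item in block] for block in blocks]

ds = create_dataset(list(range(10)), parallelism=4)   # 4 blocks of <= 3 items
ds2 = map_dataset(ds, lambda x: x + 1)
```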
2 changes: 1 addition & 1 deletion python/ray/data/grouped_dataset.py
@@ -17,7 +17,7 @@

@PublicAPI(stability="beta")
class GroupedDataset(Generic[T]):
"""Implements a lazy dataset grouped by key (Experimental).
"""Represents a grouped dataset created by calling ``Dataset.groupby()``.
The actual groupby is deferred until an aggregation is applied.
"""
10 changes: 5 additions & 5 deletions python/ray/data/read_api.py
@@ -111,15 +111,15 @@ def range_arrow(n: int, *, parallelism: int = 200) -> Dataset[ArrowRow]:

@PublicAPI(stability="beta")
def range_tensor(n: int, *, shape: Tuple = (1, ),
parallelism: int = 200) -> Dataset[np.ndarray]:
parallelism: int = 200) -> Dataset[ArrowRow]:
"""Create a Tensor dataset from a range of integers [0..n).
Examples:
>>> ds = ray.data.range_tensor(1000, shape=(3, 10))
>>> ds.map_batches(lambda arr: arr ** 2).show()
>>> ds.map_batches(lambda arr: arr * 2, batch_format="pandas").show()
This is similar to range(), but uses np.ndarrays to hold the integers
in tensor form. The dataset has overall the shape ``(n,) + shape``.
This is similar to range_arrow(), but uses the ArrowTensorArray extension
type. The dataset elements take the form {"value": array(N, shape=shape)}.
Args:
n: The upper bound of the range of integer records.
@@ -128,7 +128,7 @@ def range_tensor(n: int, *, shape: Tuple = (1, ),
Parallelism may be limited by the number of items.
Returns:
Dataset holding the integers as tensors.
Dataset holding the integers as Arrow tensor records.
"""
return read_datasource(
RangeDatasource(),
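The docstring change above says each element takes the form ``{"value": array(N, shape=shape)}``. A rough NumPy mimic of that record layout (filling each tensor with its row index is an illustrative assumption; the real implementation stores the column as an ArrowTensorArray extension type):

```python
import numpy as np

def range_tensor_records(n, shape=(1,)):
    """Mimic the record form of range_tensor(): each element is a dict
    with a "value" field holding an ndarray of the given shape. The fill
    value (the row index i) is an assumption for illustration only."""
    return [{"value": np.full(shape, i)} for i in range(n)]

rows = range_tensor_records(3, shape=(2, 2))
# rows[2]["value"] is a 2x2 array filled with 2
```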
