[Data][Docs] Revise "ML Tensor Support" (ray-project#34999)
Users are confused by the inconsistent use of "tensor" and "array" in the Ray Data documentation. This PR clarifies that tensor data is represented as ndarrays.

In addition, this PR revises the "ML Tensor Support" user guide to remove redundant or outdated information.

Co-authored-by: angelinalg <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
3 people committed May 5, 2023
1 parent d01113a commit 1f576eb
Showing 9 changed files with 165 additions and 249 deletions.
192 changes: 0 additions & 192 deletions doc/source/data/data-tensor-support.rst

This file was deleted.

8 changes: 4 additions & 4 deletions doc/source/data/data.rst
@@ -29,7 +29,7 @@ Streaming Batch Inference
-------------------------

Ray Data simplifies general purpose parallel GPU and CPU compute in Ray through its
powerful :ref:`Datastream <datastream_concept>` primitive. Datastreams enable workloads such as
:ref:`GPU batch inference <ref-use-cases-batch-infer>` to run efficiently on large datasets,
maximizing resource utilization by keeping the working data within Ray object store memory.
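
As a rough sketch of the kind of workload this describes (the ``infer`` function and its output column are illustrative stand-ins, not part of the Ray API):

import ray

# Synthetic "image" data: 64x64 int64 ndarrays under a "data" column.
ds = ray.data.range_tensor(1000, shape=(64, 64))

# Stand-in model for illustration; a real pipeline would initialize a GPU
# model inside a callable class and pass num_gpus=1 to map_batches.
def infer(batch):
    batch["pred"] = batch["data"].sum(axis=(1, 2))
    return batch

# Blocks stream through infer in parallel, bounded by object store memory.
preds = ds.map_batches(infer, batch_format="numpy")
preds.show(2)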

@@ -44,16 +44,16 @@ As part of the Ray ecosystem, Ray Data can leverage the full functionality of Ray,
e.g., using actors for optimizing setup time and GPU scheduling, and supports data throughputs of
100GiB/s or more for common inference workloads.

To learn more about the features Ray Data supports, read the
:ref:`Data User Guide <data_user_guide>`.

---------------------------------------
Streaming Preprocessing for ML Training
---------------------------------------

Use Ray Data to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>` in a streaming fashion.
-Ray Data is intended to serve as a last-mile bridge from storage or ETL pipeline outputs to distributed
-applications and libraries in Ray. Don't use it as a replacement for more general data
+Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed
+applications and libraries in Ray. Don't use it as a replacement for more general data
processing systems.
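
A minimal sketch of this last-mile pattern, assuming an illustrative scaling step and batch size:

import ray

# In practice the source would be read_parquet, read_images, or another
# storage connector; range_tensor keeps the sketch self-contained.
ds = ray.data.range_tensor(10_000, shape=(8,))

# Streaming preprocessing: scale features batch by batch.
ds = ds.map_batches(lambda batch: {"data": batch["data"] / 255.0},
                    batch_format="numpy")

# Stream batches into the training loop instead of materializing the data.
for batch in ds.iter_batches(batch_size=256):
    pass  # replace with the training step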

.. image:: images/datastream-loading-1.png
36 changes: 3 additions & 33 deletions doc/source/data/doc_code/loading_data.py
@@ -20,36 +20,6 @@
# __gen_synth_tabular_range_end__
# fmt: on

-# fmt: off
-# __gen_synth_tensor_range_begin__
-# Create a Datastream of tensors.
-ds = ray.data.range_tensor(100 * 64 * 64, shape=(64, 64))
-# -> Datastream(
-#        num_blocks=200,
-#        num_rows=409600,
-#        schema={data: numpy.ndarray(shape=(64, 64), dtype=int64)}
-#    )
-
-ds.take_batch(5)
-# -> {'data': array(
-#        [[[0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          ...,
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0]],
-#         ...
-#         [[4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          ...,
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4]]])}
-# __gen_synth_tensor_range_end__
-# fmt: on
-
# fmt: off
# __from_items_begin__
# Create a Datastream from python dicts.
@@ -131,7 +101,7 @@
# fmt: off
# __read_images_begin__
ds = ray.data.read_images("example://image-datasets/simple")
# -> Datastream(num_blocks=3, num_rows=3,
# schema={image: numpy.ndarray(shape=(32, 32, 3), dtype=uint8)})

ds.take(1)
@@ -436,7 +406,7 @@
# 'passenger_count': 1,
# 'trip_distance': 1.5,
# 'rate_code_id': '1',
# 'store_and_fwd_flag': 'N',
# ...,
# }
# {
@@ -446,7 +416,7 @@
# 'passenger_count': 1,
# 'trip_distance': 2.5999999046325684,
# 'rate_code_id': '1',
# 'store_and_fwd_flag': 'N',
# ...,
# }
# __read_parquet_s3_end__
5 changes: 3 additions & 2 deletions doc/source/data/key-concepts.rst
@@ -15,7 +15,8 @@ Each block holds a set of records in an `Arrow table <https://arrow.apache.org/d
`pandas DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_.
Having multiple blocks in a datastream allows for parallel transformation and ingest.

-For ML use cases, Datastream also natively supports mixing :ref:`Tensors <data_tensor_support>` and tabular data.
+For ML use cases, Datastream natively supports mixing tensors with tabular data. To
+learn more, read :ref:`Working with tensor data <working_with_tensors>`.
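
A minimal sketch of such a mixed datastream, with illustrative column names:

import numpy as np
import ray

# Each row pairs a tensor column ("image") with a scalar column ("label").
ds = ray.data.from_items(
    [{"image": np.zeros((32, 32, 3), dtype=np.uint8), "label": i % 10}
     for i in range(8)]
)
print(ds.schema())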

The following figure visualizes a datastream with three blocks, each holding 1000 rows. Note that certain blocks
may not be computed yet. Normally, callers iterate over datastream blocks in a streaming fashion, so that not all
@@ -100,7 +101,7 @@ Fault tolerance

Datastream performs *lineage reconstruction* to recover data. If an application error or
system failure occurs, Datastream recreates lost blocks by re-executing tasks. If ``compute=ActorPoolStrategy(size=n)`` is used, then Ray
-will restart the actor used for computing the block prior to re-executing the task.
+restarts the actor used for computing the block prior to re-executing the task.

Fault tolerance is not supported if the original worker process that created the Datastream dies.
This is because the creator stores the metadata for the :ref:`objects <object-fault-tolerance>` that comprise the Datastream.
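
A sketch of the actor-based compute strategy described above (the ``Doubler`` class is illustrative):

import ray
from ray.data import ActorPoolStrategy

class Doubler:
    def __call__(self, batch):
        batch["data"] = batch["data"] * 2
        return batch

ds = ray.data.range_tensor(1000, shape=(4,))

# A fixed pool of two actors computes the blocks. If an actor is lost to a
# system failure, Ray restarts it before re-executing the failed task.
out = ds.map_batches(Doubler, compute=ActorPoolStrategy(size=2),
                     batch_format="numpy")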
37 changes: 23 additions & 14 deletions doc/source/data/loading-data.rst
@@ -30,12 +30,22 @@ Generating Synthetic Data
.. tab-item:: Tensor Range

Create a datastream from a range of integers, packing this integer range into
-    tensors of the provided shape.
-
-    .. literalinclude:: ./doc_code/loading_data.py
-      :language: python
-      :start-after: __gen_synth_tensor_range_begin__
-      :end-before: __gen_synth_tensor_range_end__
+    ndarrays of the provided shape.
+
+    .. doctest::
+
+        >>> import ray
+        >>> ds = ray.data.range_tensor(100 * 64 * 64, shape=(64, 64))
+        >>> ds.schema()
+        Schema({'data': numpy.ndarray(shape=(64, 64), dtype=int64)})
+        >>> ds.show(1)
+        {'data': array([[0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               ...,
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0]])}

.. _datastream_reading_from_storage:

@@ -105,10 +115,10 @@ Common File Formats

.. tab-item:: NumPy

-    Read NumPy files and directories. The NumPy data will be represented via the Ray Data
-    :class:`tensor extension type <ray.data.extensions.tensor_extension.ArrowTensorType>`.
-    Refer to the :ref:`tensor data guide <data_tensor_support>` for more information on working
-    with tensors.
+    Read NumPy files and directories.
+
+    This function represents NumPy data as ndarrays. To learn more, read
+    :ref:`Working with tensor data <working_with_tensors>`.

.. literalinclude:: ./doc_code/loading_data.py
:language: python
@@ -132,9 +142,8 @@

Call :func:`~ray.data.read_images` to read images.

-    This function represents image data using the Ray Data
-    :class:`tensor extension type <ray.data.extensions.tensor_extension.ArrowTensorType>`.
-    For more information on working with tensors, refer to the :ref:`tensor data guide <data_tensor_support>`.
+    This function represents images as ndarrays. To learn more, read
+    :ref:`Working with tensor data <working_with_tensors>`.

.. literalinclude:: ./doc_code/loading_data.py
:language: python
@@ -588,7 +597,7 @@ the collection. The execution results are then used to create a Datastream.
Reading From SQL Databases
--------------------------

Call :func:`~ray.data.read_sql` to read data from a database that provides a
`Python DB API2-compliant <https://peps.python.org/pep-0249/>`_ connector.
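
For example, a minimal sketch using the built-in ``sqlite3`` module, where ``example.db`` and the ``movies`` table are placeholders:

import sqlite3

import ray

# Any DB API2-compliant connector works; sqlite3 keeps the sketch
# self-contained.
def create_connection():
    return sqlite3.connect("example.db")

ds = ray.data.read_sql("SELECT * FROM movies", create_connection)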

.. tab-set::
4 changes: 2 additions & 2 deletions doc/source/data/performance-tips.rst
@@ -29,7 +29,7 @@ These stats can be used to understand the performance of your Datastream workload
.. code-block::
Stage 1 ReadRange->Map->Map: 16/16 blocks executed in 0.37s
* Remote wall time: 101.55ms min, 331.39ms max, 135.24ms mean, 2.16s total
* Remote cpu time: 7.42ms min, 15.88ms max, 11.01ms mean, 176.15ms total
* Peak heap memory usage (MiB): 157.18 min, 157.73 max, 157 mean
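
Output like the above comes from ``ds.stats()`` once the datastream has executed. A minimal sketch:

import ray

ds = ray.data.range_tensor(1000, shape=(4,))
ds = ds.map_batches(lambda batch: batch, batch_format="numpy")
ds.take_all()  # execute the pipeline so there are stats to report

# Prints per-stage wall time, CPU time, and memory figures like the above.
print(ds.stats())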
@@ -89,7 +89,7 @@ may incur data copies; which conversions cause data copying is given in the below

.. note::
\* No copies occur when converting between Arrow, Pandas, and NumPy formats for columns
-   represented in the Ray Data tensor extension type (except for bool arrays).
+   represented as ndarrays (except for bool arrays).


Parquet Column Pruning
2 changes: 1 addition & 1 deletion doc/source/data/user-guide.rst
@@ -16,7 +16,7 @@ show you how to achieve several tasks.
transforming-data
consuming-data
batch_inference
-   data-tensor-support
+   working-with-tensors
custom-datasource
data-internals
performance-tips
