[Data][Docs] Revise "ML Tensor Support" (ray-project#34999)
Users are confused by the inconsistent use of "tensor" and "array" in the Ray Data documentation. This PR clarifies that tensor data is represented as ndarrays.

In addition, this PR revises the "ML Tensor Support" user guide to remove redundant or outdated information.

Co-authored-by: angelinalg <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
3 people committed May 5, 2023
1 parent d01113a commit 1f576eb
Showing 9 changed files with 165 additions and 249 deletions.
192 changes: 0 additions & 192 deletions doc/source/data/data-tensor-support.rst

This file was deleted.

8 changes: 4 additions & 4 deletions doc/source/data/data.rst
@@ -29,7 +29,7 @@ Streaming Batch Inference
-------------------------

Ray Data simplifies general purpose parallel GPU and CPU compute in Ray through its
powerful :ref:`Datastream <datastream_concept>` primitive. Datastreams enable workloads such as
:ref:`GPU batch inference <ref-use-cases-batch-infer>` to run efficiently on large datasets,
maximizing resource utilization by keeping the working data within Ray object store memory.
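
As a rough sketch of the kind of workload this describes (the ``infer`` function and its output column are illustrative stand-ins, not part of the Ray API):

import ray

# Synthetic "image" data: 64x64 int64 ndarrays under a "data" column.
ds = ray.data.range_tensor(1000, shape=(64, 64))

# Stand-in model for illustration; a real pipeline would initialize a GPU
# model inside a callable class and pass num_gpus=1 to map_batches.
def infer(batch):
    batch["pred"] = batch["data"].sum(axis=(1, 2))
    return batch

# Blocks stream through infer in parallel, bounded by object store memory.
preds = ds.map_batches(infer, batch_format="numpy")
preds.show(2)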

@@ -44,16 +44,16 @@ As part of the Ray ecosystem, Ray Data can leverage the full functionality of Ray,
e.g., using actors for optimizing setup time and GPU scheduling, and supports data throughputs of
100GiB/s or more for common inference workloads.

To learn more about the features Ray Data supports, read the
:ref:`Data User Guide <data_user_guide>`.

---------------------------------------
Streaming Preprocessing for ML Training
---------------------------------------

Use Ray Data to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>` in a streaming fashion.
-Ray Data is intended to serve as a last-mile bridge from storage or ETL pipeline outputs to distributed
-applications and libraries in Ray. Don't use it as a replacement for more general data
+Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed
+applications and libraries in Ray. Don't use it as a replacement for more general data
processing systems.
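
A minimal sketch of this last-mile pattern, assuming an illustrative scaling step and batch size:

import ray

# In practice the source would be read_parquet, read_images, or another
# storage connector; range_tensor keeps the sketch self-contained.
ds = ray.data.range_tensor(10_000, shape=(8,))

# Streaming preprocessing: scale features batch by batch.
ds = ds.map_batches(lambda batch: {"data": batch["data"] / 255.0},
                    batch_format="numpy")

# Stream batches into the training loop instead of materializing the data.
for batch in ds.iter_batches(batch_size=256):
    pass  # replace with the training step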

.. image:: images/datastream-loading-1.png
36 changes: 3 additions & 33 deletions doc/source/data/doc_code/loading_data.py
@@ -20,36 +20,6 @@
# __gen_synth_tabular_range_end__
# fmt: on

-# fmt: off
-# __gen_synth_tensor_range_begin__
-# Create a Datastream of tensors.
-ds = ray.data.range_tensor(100 * 64 * 64, shape=(64, 64))
-# -> Datastream(
-#        num_blocks=200,
-#        num_rows=409600,
-#        schema={data: numpy.ndarray(shape=(64, 64), dtype=int64)}
-#    )
-
-ds.take_batch(5)
-# -> {'data': array(
-#        [[[0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          ...,
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0],
-#          [0, 0, 0, ..., 0, 0, 0]],
-#         ...
-#         [[4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          ...,
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4],
-#          [4, 4, 4, ..., 4, 4, 4]]])}
-# __gen_synth_tensor_range_end__
-# fmt: on
-
# fmt: off
# __from_items_begin__
# Create a Datastream from python dicts.
@@ -131,7 +101,7 @@
# fmt: off
# __read_images_begin__
ds = ray.data.read_images("example://image-datasets/simple")
# -> Datastream(num_blocks=3, num_rows=3,
# schema={image: numpy.ndarray(shape=(32, 32, 3), dtype=uint8)})

ds.take(1)
@@ -436,7 +406,7 @@
# 'passenger_count': 1,
# 'trip_distance': 1.5,
# 'rate_code_id': '1',
# 'store_and_fwd_flag': 'N',
# ...,
# }
# {
@@ -446,7 +416,7 @@
# 'passenger_count': 1,
# 'trip_distance': 2.5999999046325684,
# 'rate_code_id': '1',
# 'store_and_fwd_flag': 'N',
# ...,
# }
# __read_parquet_s3_end__
5 changes: 3 additions & 2 deletions doc/source/data/key-concepts.rst
@@ -15,7 +15,8 @@ Each block holds a set of records in an `Arrow table <https://arrow.apache.org/d
`pandas DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_.
Having multiple blocks in a datastream allows for parallel transformation and ingest.

-For ML use cases, Datastream also natively supports mixing :ref:`Tensors <data_tensor_support>` and tabular data.
+For ML use cases, Datastream natively supports mixing tensors with tabular data. To
+learn more, read :ref:`Working with tensor data <working_with_tensors>`.
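
A minimal sketch of such a mixed datastream, with illustrative column names:

import numpy as np
import ray

# Each row pairs a tensor column ("image") with a scalar column ("label").
ds = ray.data.from_items(
    [{"image": np.zeros((32, 32, 3), dtype=np.uint8), "label": i % 10}
     for i in range(8)]
)
print(ds.schema())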

The following figure visualizes a datastream with three blocks, each holding 1000 rows. Note that certain blocks
may not be computed yet. Normally, callers iterate over datastream blocks in a streaming fashion, so that not all
@@ -100,7 +101,7 @@ Fault tolerance

Datastream performs *lineage reconstruction* to recover data. If an application error or
system failure occurs, Datastream recreates lost blocks by re-executing tasks. If ``compute=ActorPoolStrategy(size=n)`` is used, then Ray
-will restart the actor used for computing the block prior to re-executing the task.
+restarts the actor used for computing the block prior to re-executing the task.

Fault tolerance is not supported if the original worker process that created the Datastream dies.
This is because the creator stores the metadata for the :ref:`objects <object-fault-tolerance>` that comprise the Datastream.
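
A sketch of the actor-based compute strategy described above (the ``Doubler`` class is illustrative):

import ray
from ray.data import ActorPoolStrategy

class Doubler:
    def __call__(self, batch):
        batch["data"] = batch["data"] * 2
        return batch

ds = ray.data.range_tensor(1000, shape=(4,))

# A fixed pool of two actors computes the blocks. If an actor is lost to a
# system failure, Ray restarts it before re-executing the failed task.
out = ds.map_batches(Doubler, compute=ActorPoolStrategy(size=2),
                     batch_format="numpy")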
37 changes: 23 additions & 14 deletions doc/source/data/loading-data.rst
@@ -30,12 +30,22 @@ Generating Synthetic Data
.. tab-item:: Tensor Range

Create a datastream from a range of integers, packing this integer range into
-    tensors of the provided shape.
-
-    .. literalinclude:: ./doc_code/loading_data.py
-      :language: python
-      :start-after: __gen_synth_tensor_range_begin__
-      :end-before: __gen_synth_tensor_range_end__
+    ndarrays of the provided shape.
+
+    .. doctest::
+
+        >>> import ray
+        >>> ds = ray.data.range_tensor(100 * 64 * 64, shape=(64, 64))
+        >>> ds.schema()
+        Schema({'data': numpy.ndarray(shape=(64, 64), dtype=int64)})
+        >>> ds.show(1)
+        {'data': array([[0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               ...,
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0],
+               [0, 0, 0, ..., 0, 0, 0]])}

.. _datastream_reading_from_storage:

@@ -105,10 +115,10 @@ Common File Formats

.. tab-item:: NumPy

-    Read NumPy files and directories. The NumPy data will be represented via the Ray Data
-    :class:`tensor extension type <ray.data.extensions.tensor_extension.ArrowTensorType>`.
-    Refer to the :ref:`tensor data guide <data_tensor_support>` for more information on working
-    with tensors.
+    Read NumPy files and directories.
+
+    This function represents NumPy data as ndarrays. To learn more, read
+    :ref:`Working with tensor data <working_with_tensors>`.

.. literalinclude:: ./doc_code/loading_data.py
:language: python
@@ -132,9 +142,8 @@

Call :func:`~ray.data.read_images` to read images.

-    This function represents image data using the Ray Data
-    :class:`tensor extension type <ray.data.extensions.tensor_extension.ArrowTensorType>`.
-    For more information on working with tensors, refer to the :ref:`tensor data guide <data_tensor_support>`.
+    This function represents images as ndarrays. To learn more, read
+    :ref:`Working with tensor data <working_with_tensors>`.

.. literalinclude:: ./doc_code/loading_data.py
:language: python
@@ -588,7 +597,7 @@ the collection. The execution results are then used to create a Datastream.
Reading From SQL Databases
--------------------------

Call :func:`~ray.data.read_sql` to read data from a database that provides a
`Python DB API2-compliant <https://peps.python.org/pep-0249/>`_ connector.
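
For example, a minimal sketch using the built-in ``sqlite3`` module, where ``example.db`` and the ``movies`` table are placeholders:

import sqlite3

import ray

# Any DB API2-compliant connector works; sqlite3 keeps the sketch
# self-contained.
def create_connection():
    return sqlite3.connect("example.db")

ds = ray.data.read_sql("SELECT * FROM movies", create_connection)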

.. tab-set::
4 changes: 2 additions & 2 deletions doc/source/data/performance-tips.rst
@@ -29,7 +29,7 @@ These stats can be used to understand the performance of your Datastream workload
.. code-block::
Stage 1 ReadRange->Map->Map: 16/16 blocks executed in 0.37s
* Remote wall time: 101.55ms min, 331.39ms max, 135.24ms mean, 2.16s total
* Remote cpu time: 7.42ms min, 15.88ms max, 11.01ms mean, 176.15ms total
* Peak heap memory usage (MiB): 157.18 min, 157.73 max, 157 mean
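
Output like the above comes from ``ds.stats()`` once the datastream has executed. A minimal sketch:

import ray

ds = ray.data.range_tensor(1000, shape=(4,))
ds = ds.map_batches(lambda batch: batch, batch_format="numpy")
ds.take_all()  # execute the pipeline so there are stats to report

# Prints per-stage wall time, CPU time, and memory figures like the above.
print(ds.stats())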
@@ -89,7 +89,7 @@ may incur data copies; which conversions cause data copying is given in the below

.. note::
\* No copies occur when converting between Arrow, Pandas, and NumPy formats for columns
-   represented in the Ray Data tensor extension type (except for bool arrays).
+   represented as ndarrays (except for bool arrays).


Parquet Column Pruning
2 changes: 1 addition & 1 deletion doc/source/data/user-guide.rst
@@ -16,7 +16,7 @@ show you how to achieve several tasks.
transforming-data
consuming-data
batch_inference
-   data-tensor-support
+   working-with-tensors
custom-datasource
data-internals
performance-tips
