[Data] [Docs] Standardize API Refs for Input/Output #37017

Merged · 21 commits · Jul 7, 2023
Merge branch 'master' of github.com:ray-project/ray into standardize-api-ref

Signed-off-by: amogkam <[email protected]>
amogkam committed Jun 30, 2023
commit e2ddefcd70231e4be8966a7db6d5b5e8ba9770bd
25 changes: 11 additions & 14 deletions python/ray/data/read_api.py
@@ -603,10 +603,9 @@ def read_parquet(
which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases.
For details on how the parallelism is automatically determined and guidance on how to tune it, see the :ref:`Tuning read parallelism guide <read_parallelism>`. Parallelism is upper bounded by the total number of records in all the parquet files.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the read tasks.
- tensor_column_schema: A dict of column name to tensor dtype and shape
+ tensor_column_schema: A dict of column name to pyarrow dtype and shape
mappings for converting a Parquet column containing serialized
- tensors (ndarrays) as their elements to our tensor column extension
- type. This assumes that the tensors were serialized in the raw
+ tensors (ndarrays) as their elements to Pyarrow tensors. This assumes that the tensors were serialized in the raw
NumPy array format in C-contiguous order (e.g. via
`arr.tobytes()`).
meta_provider: A :ref:`file metadata provider <metadata_provider>`. Custom
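
A minimal sketch of the `tensor_column_schema` usage this hunk documents, assuming a Parquet column whose cells hold raw C-contiguous ndarray bytes produced via `arr.tobytes()`; the path, column name, dtype, and shape below are hypothetical:

```python
import numpy as np
import ray

# Each row's "image" cell is assumed to hold raw bytes from a float32
# ndarray of shape (28, 28), serialized in C-contiguous order via
# arr.tobytes(). tensor_column_schema tells the reader how to decode it.
ds = ray.data.read_parquet(
    "s3://my-bucket/images.parquet",  # hypothetical path
    tensor_column_schema={"image": (np.float32, (28, 28))},
)
```
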
@@ -790,24 +789,22 @@ def read_parquet_bulk(
the dataset. Defaults to -1 which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see the :ref:`Tuning read parallelism guide <read_parallelism>`. Parallelism is upper bounded by the total number of records in all the parquet files.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the read tasks.
arrow_open_file_args: kwargs passed to `pyarrow.fs.FileSystem.open_input_file <https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.open_input_file>`_ when opening input files to read.
- tensor_column_schema: A dict of column name --> tensor dtype and shape
+ tensor_column_schema: A dict of column name to pyarrow dtype and shape
mappings for converting a Parquet column containing serialized
- tensors (ndarrays) as their elements to our tensor column extension
- type. This assumes that the tensors were serialized in the raw
+ tensors (ndarrays) as their elements to Pyarrow tensors. This assumes that the tensors were serialized in the raw
NumPy array format in C-contiguous order (e.g. via
- ``arr.tobytes()``).
- meta_provider: File metadata provider. Defaults to a fast file metadata
- provider that skips file size collection and requires all input paths to be
- files. Change to ``DefaultFileMetadataProvider`` or a custom metadata
- provider if directory expansion and/or file metadata resolution is required.
- partition_filter: Path-based partition filter, if any. Can be used
+ `arr.tobytes()`).
+ meta_provider: A :ref:`file metadata provider <metadata_provider>`. Custom
+ metadata providers may be able to resolve file metadata more quickly and/or accurately. In most cases you do not need to set this.
+ partition_filter: A :class:`~ray.data.datasource.partitioning.PathPartitionFilter`. Can be used
with a custom callback to read only selected partitions of a dataset.
By default, this filters out any file paths whose file extension does not
match "*.parquet*".
- arrow_parquet_args: Other parquet read options to pass to pyarrow.
+ arrow_parquet_args: Other parquet read options to pass to pyarrow. For the full
+ set of arguments, see `the pyarrow API <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment>`_

Returns:
- Dataset producing Arrow records read from the specified paths.
+ :class:`~ray.data.Dataset` producing records read from the specified paths.
"""
arrow_parquet_args = _resolve_parquet_args(
tensor_column_schema,
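
A minimal sketch of the custom partition-filter callback this hunk describes, using the `PathPartitionFilter` class the new docstring cross-references; the bucket layout and the `year` partition field are hypothetical, and default Hive-style `key=value` path parsing is assumed:

```python
import ray
from ray.data.datasource.partitioning import PathPartitionFilter

# Hypothetical Hive-partitioned layout: .../year=<YYYY>/part-*.parquet
paths = [
    "s3://my-bucket/events/year=2022/part-0.parquet",
    "s3://my-bucket/events/year=2023/part-0.parquet",
]

# The callback receives a dict mapping partition field names to values
# parsed from each file path; only paths it returns True for are read.
ds = ray.data.read_parquet_bulk(
    paths,
    partition_filter=PathPartitionFilter.of(
        lambda partitions: partitions.get("year") == "2023"
    ),
)
```
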