[Data] [Docs] Standardize API Refs for Input/Output #37017

Merged 21 commits on Jul 7, 2023
wip
Signed-off-by: amogkam <[email protected]>
amogkam committed Jun 30, 2023
commit e401bef1d4fa5e6255c2f088131bdb3ec90415a7
53 changes: 19 additions & 34 deletions python/ray/data/read_api.py
@@ -601,7 +601,7 @@ def read_parquet(
read during the file scan.
parallelism: The amount of parallelism to use for the dataset. Defaults to -1,
which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see the :ref:`Tuning read parallelism guide <read_parallelism>`. Parallelism is upper bounded by the total number of records in all the parquet files.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the read tasks.
tensor_column_schema: A dict of column name to tensor dtype and shape
mappings for converting a Parquet column containing serialized
@@ -760,51 +760,36 @@ def read_parquet_bulk(
),
**arrow_parquet_args,
) -> Dataset:
"""Create :class:`~ray.data.Dataset` from a large number (such as >1K) of parquet files.
"""Create :class:`~ray.data.Dataset` from parquet files without reading metadata.

Unlike :meth:`~ray.data.read_parquet`, this method doesn't fetch file metadata up front, so read latency doesn't grow with the number of input files.

You should use :meth:`~ray.data.read_parquet` in most cases.

Use :meth:`~ray.data.read_parquet_bulk` if all the provided paths point to files and metadata fetching using :meth:`~ray.data.read_parquet` takes too long or the parquet files do not all have a unified schema.

There may be performance slowdowns when using this method with parquet files that are very large.

.. warning::

Only file paths should be provided as input (i.e. no directory paths). An ``OSError`` is raised if one or more paths point to directories. If your use case requires directory paths, use :meth:`~ray.data.read_parquet` instead.

Examples:
>>> # Read multiple local files. You should always provide only input file
>>> # paths (i.e. no directory paths) when known to minimize read latency.
>>> ray.data.read_parquet_bulk( # doctest: +SKIP
... ["/path/to/file1", "/path/to/file2"])

>>> # Read a directory of files in remote storage. Caution should be taken
>>> # when providing directory paths, since the time to both check each path
>>> # type and expand its contents may result in greatly increased latency
>>> # and/or request rate throttling from cloud storage service providers.
>>> ray.data.read_parquet_bulk( # doctest: +SKIP
... "s3:https://bucket/path",
... meta_provider=DefaultFileMetadataProvider())
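
>>> # A hedged sketch combining the arguments documented below: read two
>>> # specific columns through an explicitly configured S3 filesystem.
>>> # The bucket, file names, column names, and region are hypothetical
>>> # placeholders, not values from this PR.
>>> import pyarrow.fs  # doctest: +SKIP
>>> ray.data.read_parquet_bulk( # doctest: +SKIP
...     ["s3://bucket/file1.parquet", "s3://bucket/file2.parquet"],
...     columns=["sepal.length", "variety"],
...     filesystem=pyarrow.fs.S3FileSystem(region="us-west-2"))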

Args:
paths: A single file path or a list of file paths.
filesystem: The pyarrow filesystem implementation to read from. These are specified in the `pyarrow docs <https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations>`_. You should specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with ``s3://``, the ``S3FileSystem`` is used.
columns: A list of column names to read. Only the specified columns will be read during the file scan.
parallelism: The amount of parallelism to use for
the dataset. Defaults to -1 which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see the :ref:`Tuning read parallelism guide <read_parallelism>`. Parallelism is upper bounded by the total number of records in all the parquet files.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the read tasks.
arrow_open_file_args: kwargs passed to `pyarrow.fs.FileSystem.open_input_file <https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.open_input_file>`_ when opening input files to read.
tensor_column_schema: A dict of column name to tensor dtype and shape
mappings for converting a Parquet column containing serialized
tensors (ndarrays) as their elements to our tensor column extension
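
As a hedged illustration of the ``parallelism`` and ``ray_remote_args`` arguments described above, a minimal sketch (the paths and values are hypothetical placeholders, not recommendations):

>>> ray.data.read_parquet_bulk( # doctest: +SKIP
...     ["/data/part-0.parquet", "/data/part-1.parquet"],
...     parallelism=4,
...     ray_remote_args={"num_cpus": 0.25})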