
[Datasets] Add basic e2e Datasets example on NYC taxi dataset (ray-project#24874)

This PR adds a dedicated docs page for examples and a basic end-to-end tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformation, and shuffling of tabular (Parquet) data, along with ingesting it into dummy model trainers and running dummy batch inference.
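
The committed notebook (`nyc_taxi_basic_processing.ipynb`) is the authoritative walkthrough. Purely as a hedged illustration of the flow described above, the sketch below uses the public Ray Datasets API (`read_parquet`, `map_batches`, `random_shuffle`, `iter_batches`); the Parquet path and the `trip_distance` column are illustrative assumptions, not taken from this PR.

```python
import ray
import pandas as pd

ray.init()

# Read and inspect the tabular (Parquet) data.
# NOTE: hypothetical path, not the dataset used in the actual notebook.
ds = ray.data.read_parquet("s3://my-bucket/nyc_taxi/2021.parquet")
print(ds.schema())
print(ds.count())

# Transform: drop rows with non-positive trip distance (illustrative filter;
# the "trip_distance" column name is an assumption about the taxi schema).
def drop_bad_rows(batch: pd.DataFrame) -> pd.DataFrame:
    return batch[batch["trip_distance"] > 0]

ds = ds.map_batches(drop_bad_rows, batch_format="pandas")

# Shuffle before feeding the data to a (dummy) trainer.
ds = ds.random_shuffle()

# Dummy "training": just iterate over batches; a real trainer would consume them.
for batch in ds.iter_batches(batch_size=4096, batch_format="pandas"):
    pass

# Dummy batch inference: map a no-op "model" over batches and peek at the output.
predictions = ds.map_batches(
    lambda df: df.assign(prediction=0.0), batch_format="pandas"
)
print(predictions.take(3))
```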
clarkzinzow committed May 19, 2022
1 parent 399334d commit 6c0a457
Showing 7 changed files with 1,286 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -210,3 +210,6 @@ workflow_data/

# vscode java extension generated
.factorypath

# Jupyter Notebooks
**/.ipynb_checkpoints/
7 changes: 6 additions & 1 deletion doc/source/_toc.yml
@@ -15,7 +15,12 @@ parts:
- file: data/getting-started
- file: data/key-concepts
- file: data/user-guide
- file: data/examples/big_data_ingestion
- file: data/examples/index
sections:
- file: data/examples/nyc_taxi_basic_processing
title: Processing the NYC taxi dataset
- file: data/examples/big_data_ingestion
title: Large-scale ML Ingest
- file: data/package-ref
- file: data/integrations

19 changes: 18 additions & 1 deletion doc/source/data/examples/BUILD
@@ -1,5 +1,22 @@
load("//bazel:python.bzl", "py_test_run_all_notebooks")

filegroup(
name = "data_examples",
srcs = glob(["*.ipynb"]),
visibility = ["//doc:__subpackages__"]
)

# --------------------------------------------------------------------
# Test all doc/source/data/examples notebooks.
# --------------------------------------------------------------------

# big_data_ingestion.ipynb is not tested right now due to its large resource requirements
# and the need for a general overhaul.

py_test_run_all_notebooks(
size = "medium",
include = ["*.ipynb"],
exclude = ["big_data_ingestion.ipynb"],
data = ["//doc/source/data/examples:data_examples"],
tags = ["exclusive", "team:ml"],
)
52 changes: 52 additions & 0 deletions doc/source/data/examples/index.rst
@@ -0,0 +1,52 @@
.. _datasets-examples-ref:

========
Examples
========

.. tip:: Check out the Datasets :ref:`User Guide <data_user_guide>` to learn more about
Datasets' features in-depth.

.. _datasets-recipes:

Simple Data Processing Examples
-------------------------------

Ray Datasets is a data processing engine that supports multiple data modalities and
types. Here you will find end-to-end examples of basic data processing with Ray
Datasets on tabular data, text (coming soon!), and imagery (coming soon!).

.. panels::
:container: container pb-4
:column: col-md-4 px-2 py-2
:img-top-cls: pt-5 w-75 d-block mx-auto

---
:img-top: /images/taxi.png

+++
.. link-button:: nyc_taxi_basic_processing
:type: ref
:text: Processing NYC taxi data using Ray Datasets
:classes: btn-link btn-block stretched-link

Scaling Out Datasets Workloads
------------------------------

These examples demonstrate using Ray Datasets on large-scale data over a multi-node Ray
cluster.

.. panels::
:container: container pb-4
:column: col-md-4 px-2 py-2
:img-top-cls: pt-5 w-75 d-block mx-auto

---
:img-top: /images/dataset-repeat-2.svg

+++
.. link-button:: big_data_ingestion
:type: ref
:text: Large-scale ML Ingest
:classes: btn-link btn-block stretched-link
