Description
We have a partitioned dataset with 6000+ objects stored in a folder in Google Cloud Storage.
Let's say the partitioning scheme is `experiment_name/YYYYMMDD/country/level`, where `level` identifies one of three possible names for the Parquet files.
The issue is that the first call to `catalog.load("dataset_name")` is slow (on the order of ~6 minutes), which considerably slows down our pipeline execution.
Subsequent calls are faster, but still take around 3 minutes each.
We profiled the code using SnakeViz in IPython; the resulting call stack and times are attached in the image below.
As you can see, the majority of the time is spent in the `_ls` method of gcsfs (used by fsspec for Google Cloud Storage), in particular reading from SSL sockets.
I understand this may not be a problem specific to Kedro but rather of the underlying file system implementation; however, I am raising this issue since it is significantly impacting our usage of the framework, and other users may face the same situation.
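To illustrate the access pattern we are seeing (this is a simplified stdlib sketch, not gcsfs itself): a recursive directory walk issues one listing call per directory it visits, which matches how gcsfs's `_ls` ends up being called once per partition folder. The folder names below are made up.

```python
import os
import tempfile

# Build a tiny directory tree mimicking the partition layout (made-up names)
root = tempfile.mkdtemp()
for exp in ("exp_a", "exp_b"):
    for day in ("20200101", "20200102"):
        d = os.path.join(root, exp, day)
        os.makedirs(d)
        open(os.path.join(d, "part.parquet"), "w").close()

# os.walk yields one tuple per directory it lists, i.e. one listing
# round-trip per directory -- the same pattern as a recursive _ls
listings = sum(1 for _ in os.walk(root))
print(listings)  # 7: the root + 2 experiment dirs + 4 day dirs
```

On a local disk each listing is cheap; over HTTPS to Cloud Storage each one is a network round-trip, which is where the SSL socket reads in the profile come from.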
Context
The call is issued automatically by Kedro while loading the dataset during pipeline execution.
The concrete implementation of the partitioned dataset is `pandas.ParquetDataSet`, and the Cloud Storage bucket is accessed using Service Account credentials stored in the `conf/local` folder.
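For reference, the catalog entry looks roughly like the following sketch (bucket name, path, and credentials key are placeholders):

```yaml
# conf/base/catalog.yml -- hypothetical entry
dataset_name:
  type: PartitionedDataSet
  path: gcs://my-bucket/experiments
  dataset: pandas.ParquetDataSet
  credentials: gcs_creds  # key defined in conf/local/credentials.yml
```

with `gcs_creds` in `conf/local/credentials.yml` pointing at the Service Account key file (gcsfs accepts it via its `token` option).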
Steps to Reproduce
This issue is difficult to reproduce without the same data we are using. Nonetheless, here are general guidelines to reproduce the issue:
1. Create a GCS bucket with subfolders following the partitioning scheme `level_1/level_2/level_3/level_4`. Cardinality for `level_1` is roughly 30, for `level_2` around 12, for `level_3` around 30, and for `level_4` it is 3. The average size of the last-level objects is 5 MB.
2. Add a `PartitionedDataSet` to the data catalog with a concrete implementation of `pandas.ParquetDataSet` (be sure to include credentials for bucket access).
3. Create a dummy pipeline with a node reading said dataset.
4. Modify `run.py` to use a `SequentialRunner` in async mode (`is_async=True`).
5. Run the created pipeline.
Expected Result
The load call should (?) be fast enough.
For comparison, a load call for a `PartitionedDataSet` with the same declaration but a partitioning scheme of `level_1/level_2/level_3` with the same cardinalities takes around 10 seconds.
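The gap between the two schemes is consistent with one listing call per directory. Using the cardinalities from the reproduction steps (an estimate, not measured numbers):

```python
c1, c2, c3 = 30, 12, 30  # cardinalities of the directory levels

# One ls() per directory: the root, each level_1 dir, each level_1/level_2
# dir, and (for the 4-level scheme) each level_1/level_2/level_3 dir
calls_3_level = 1 + c1 + c1 * c2                  # files live in level_2 dirs
calls_4_level = 1 + c1 + c1 * c2 + c1 * c2 * c3   # files live in level_3 dirs

print(calls_3_level, calls_4_level)  # 391 11191 -- roughly a 29x difference
```

A ~29x increase in listing round-trips matches the observed jump from ~10 seconds to ~6 minutes reasonably well.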
Actual Result
The load call takes approx. 6 minutes.
Environment
The environment is:
- Kedro version: 0.16.2
- Python version: 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
- Operating system and version: Windows 10 Pro, version 1909, build 18363.1082
Execution profile
Attached here you can find the profile file created with IPython's `%prun`. Unzip it and read it with the same Python version (3.7.7), otherwise the interpreter will complain about an invalid profile: profile.zip
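For anyone who wants to inspect such a profile, the stdlib `pstats` module can load it. The sketch below generates a small profile the same way `%prun` does and reads it back; replace the filename with the file extracted from profile.zip.

```python
import cProfile
import pstats

# Create a profile file (stand-in for the one inside profile.zip)
cProfile.run("sum(range(10_000))", "example.prof")

# Load it and print the 10 most expensive calls by cumulative time
stats = pstats.Stats("example.prof")
stats.sort_stats("cumulative").print_stats(10)
```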
Thank you for letting us know! As you know, all Kedro datasets are based on fsspec, so it would be great if you could also report the performance issue in their repository too :)
On Oct 2, 2020, 921kiyo changed the title from "PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage" to "[KED-2142] PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage".
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.