
[KED-2142] PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage #514

Closed
PieCampi opened this issue Sep 16, 2020 · 2 comments
Labels: Issue: Bug Report 🐞 Bug that needs to be fixed

@PieCampi

Description

We have a partitioned dataset with 6000+ objects stored in a folder in Google Cloud Storage.
Let's say the partitioning scheme is experiment_name/YYYYMMDD/country/level, where level is one of three possible names for the Parquet files.

The issue is that the first call to catalog.load("dataset_name") is slow (on the order of ~6 minutes), which considerably slows down our pipeline execution.

Subsequent calls are faster, but still take around 3 minutes each.

We profiled the code with SnakeViz in IPython; the resulting call stack and timings are shown in the image below.
As you can see, the majority of the time is spent in the _ls method of gcsfs (the fsspec implementation used for Google Cloud Storage), in particular reading from SSL sockets.

[Attached image: SnakeViz view of the profiling results]

I understand this may not be a problem specific to Kedro but rather one of the underlying filesystem implementation; however, I am raising the issue here because it significantly impacts our usage of the framework, and other users may face the same situation.
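
To double-check that the time goes into the object listing itself rather than anything Kedro does on top, the listing can be timed against gcsfs directly. This is only a sketch; the bucket, prefix and key-file path below are placeholders for our real ones:

```python
import time

import gcsfs

# Placeholders: substitute the real bucket/prefix and service-account key path.
fs = gcsfs.GCSFileSystem(token="conf/local/service-account.json")

start = time.perf_counter()
paths = fs.find("my-bucket/experiment_name")  # recursive listing of every object under the prefix
elapsed = time.perf_counter() - start
print(f"Found {len(paths)} objects in {elapsed:.1f} s")
```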

Context

The call is issued automatically by Kedro while loading the dataset during pipeline execution.

The concrete dataset behind the partitions is pandas.ParquetDataSet, and the Cloud Storage bucket is accessed using Service Account credentials stored in the conf/local folder.
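
For reference, our catalog entry is roughly equivalent to the following Python sketch (bucket, prefix and key-file paths are placeholders, and the credentials shape shown is an assumption about how they are passed through to gcsfs):

```python
from kedro.io import PartitionedDataSet

# Placeholders for the real bucket/prefix and the service-account key in conf/local.
dataset = PartitionedDataSet(
    path="gcs://my-bucket/experiment_name",
    dataset="pandas.ParquetDataSet",
    credentials={"token": "conf/local/service-account.json"},
)

# load() first lists every object under the prefix and then returns a dict of
# partition id -> lazy load function; the listing step is where the time goes.
partitions = dataset.load()
```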

Steps to Reproduce

This issue is difficult to reproduce without the same data we are using. Nonetheless, here are general guidelines to reproduce the issue:

  1. Create a GCS bucket with subfolders following the partitioning scheme level_1/level_2/level_3/level_4. Cardinality is roughly 30 for level_1, around 12 for level_2, around 30 for level_3, and 3 for level_4; the average size of the last-level objects is 5MB (a rough data-generation sketch follows this list).
  2. Add a PartitionedDataSet to the data catalog with a concrete implementation of pandas.ParquetDataSet (be sure to include the credentials needed to access the bucket)
  3. Create a dummy pipeline with a node reading said dataset
  4. Modify run.py to use a SequentialRunner in async mode (is_async=True)
  5. Run the created pipeline
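
As a rough sketch for step 1 (all names are placeholders; the dummy file written here is tiny, whereas our real objects are ~5 MB each, but the listing cost depends on the number of objects rather than their size):

```python
import gcsfs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(token="conf/local/service-account.json")  # placeholder key path
table = pa.Table.from_pandas(pd.DataFrame({"value": range(100)}))  # small dummy content

# Cardinalities roughly as described above: ~30 x ~12 x ~30 x 3 objects.
for lvl1 in [f"exp_{i:02d}" for i in range(30)]:
    for lvl2 in [f"202009{d:02d}" for d in range(1, 13)]:
        for lvl3 in [f"country_{c:02d}" for c in range(30)]:
            for lvl4 in ("a", "b", "c"):
                key = f"my-test-bucket/{lvl1}/{lvl2}/{lvl3}/{lvl4}.parquet"
                with fs.open(key, "wb") as f:
                    pq.write_table(table, f)
```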

Expected Result

The load call should (?) be fast enough.
For comparison, a load call for a PartitionedDataSet with the same declaration but a partitioning scheme of level_1/level_2/level_3 with the same cardinalities takes around 10 seconds.

Actual Result

The load call takes approx. 6 minutes.

Environment

The environment is:

  • Kedro version 0.16.2
  • Python version: 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
  • Operating system and version: Windows 10 Pro, version 1909, build 18363.1082

Execution profile

Attached here is the profile file created with IPython's %prun. Unzip it and read it with the same Python version (3.7.7), otherwise the interpreter will complain about an invalid profile.
profile.zip
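
For reproducing or inspecting such a profile, a sketch along these lines can be used (the filename is illustrative, and catalog is assumed to be an already-loaded DataCatalog):

```python
import cProfile
import pstats

# Equivalent of the IPython `%prun -D ...` session used above.
cProfile.runctx('catalog.load("dataset_name")', globals(), locals(),
                filename="partitioned_load.prof")

# Read the dump back with the same Python version it was created with,
# since the profile format is not guaranteed to be portable across versions.
stats = pstats.Stats("partitioned_load.prof")
stats.sort_stats("cumulative").print_stats(25)
```

The same .prof file can also be opened in the browser with SnakeViz (snakeviz partitioned_load.prof).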

PieCampi added the Issue: Bug Report 🐞 Bug that needs to be fixed label on Sep 16, 2020
921kiyo (Contributor) commented on Oct 2, 2020

Thank you for letting us know! As you know, all Kedro datasets are based on fsspec, so it would be great if you could report the performance issue in their repository as well :)

921kiyo changed the title from "PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage" to "[KED-2142] PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage" on Oct 2, 2020
idanov self-assigned this on Dec 14, 2020
stale bot commented on Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 12, 2021
stale bot closed this as completed on Apr 19, 2021