Description
We have a partitioned dataset with 6000+ objects stored in a folder in Google Cloud Storage.
Let's say the partitioning scheme is `experiment_name/YYYYMMDD/country/level`, where `level` identifies one of three possible names for the Parquet files.
The issue is that the first call to `catalog.load("dataset_name")` is slow (on the order of ~6 minutes), which considerably slows down our pipeline execution.
Subsequent calls are faster, but still take around 3 minutes each.
We profiled the code using SnakeViz in IPython; the resulting call stack and times are attached in the image below.
As you can see, the majority of the time is spent in the `_ls` method of gcsfs (used by fsspec for Google Cloud Storage), in particular reading from SSL sockets.
I understand this may not be a problem specific to Kedro but rather of the underlying file system implementation; however, I am raising this issue since it is significantly impacting our usage of the framework, and other users may face the same situation.
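To illustrate the access pattern we are seeing (this is a simplified stdlib sketch, not gcsfs itself): a recursive directory walk issues one listing call per directory it visits, which matches how gcsfs's `_ls` ends up being called once per partition folder. The folder names below are made up.

```python
import os
import tempfile

# Build a tiny directory tree mimicking the partition layout (made-up names)
root = tempfile.mkdtemp()
for exp in ("exp_a", "exp_b"):
    for day in ("20200101", "20200102"):
        d = os.path.join(root, exp, day)
        os.makedirs(d)
        open(os.path.join(d, "part.parquet"), "w").close()

# os.walk yields one tuple per directory it lists, i.e. one listing
# round-trip per directory -- the same pattern as a recursive _ls
listings = sum(1 for _ in os.walk(root))
print(listings)  # 7: the root + 2 experiment dirs + 4 day dirs
```

On a local disk each listing is cheap; over HTTPS to Cloud Storage each one is a network round-trip, which is where the SSL socket reads in the profile come from.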
Context
The call is issued automatically by Kedro while loading the dataset during pipeline execution.
The concrete implementation of the partitioned dataset is `pandas.ParquetDataSet`, and the Cloud Storage bucket is accessed using Service Account credentials stored in the `conf/local` folder.
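For reference, the catalog entry looks roughly like the following sketch (bucket name, path, and credentials key are placeholders):

```yaml
# conf/base/catalog.yml -- hypothetical entry
dataset_name:
  type: PartitionedDataSet
  path: gcs://my-bucket/experiments
  dataset: pandas.ParquetDataSet
  credentials: gcs_creds  # key defined in conf/local/credentials.yml
```

with `gcs_creds` in `conf/local/credentials.yml` pointing at the Service Account key file (gcsfs accepts it via its `token` option).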
Steps to Reproduce
This issue is difficult to reproduce without the same data we are using. Nonetheless, here are general guidelines to reproduce the issue:
1. Create a GCS bucket with subfolders following the partitioning scheme `level_1/level_2/level_3/level_4`. Cardinality for `level_1` is roughly 30, for `level_2` around 12, for `level_3` around 30, and for `level_4` it is 3. The average size of the last-level objects is 5 MB.
2. Add a `PartitionedDataSet` to the data catalog with a concrete implementation of `pandas.ParquetDataSet` (be sure to include credentials for bucket access).
3. Create a dummy pipeline with a node reading said dataset.
4. Modify `run.py` to use a `SequentialRunner` in async mode (`is_async=True`).
5. Run the created pipeline.
Expected Result
The load call should (?) be fast enough.
For comparison, a load call for a `PartitionedDataSet` with the same declaration but a partitioning scheme of `level_1/level_2/level_3` with the same cardinalities takes around 10 seconds.
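The gap between the two schemes is consistent with one listing call per directory. Using the cardinalities from the reproduction steps (an estimate, not measured numbers):

```python
c1, c2, c3 = 30, 12, 30  # cardinalities of the directory levels

# One ls() per directory: the root, each level_1 dir, each level_1/level_2
# dir, and (for the 4-level scheme) each level_1/level_2/level_3 dir
calls_3_level = 1 + c1 + c1 * c2                  # files live in level_2 dirs
calls_4_level = 1 + c1 + c1 * c2 + c1 * c2 * c3   # files live in level_3 dirs

print(calls_3_level, calls_4_level)  # 391 11191 -- roughly a 29x difference
```

A ~29x increase in listing round-trips matches the observed jump from ~10 seconds to ~6 minutes reasonably well.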
Actual Result
The load call takes approx. 6 minutes.
Environment
The environment is:
- Kedro version: 0.16.2
- Python version: 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
- Operating system and version: Windows 10 Pro, version 1909, build 18363.1082
Execution profile
Attached here you can find the profile file created with IPython's `%prun`. Unzip it and read it with the same Python version (3.7.7), otherwise the interpreter will complain about an invalid profile: profile.zip
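For anyone who wants to inspect such a profile, the stdlib `pstats` module can load it. The sketch below generates a small profile the same way `%prun` does and reads it back; replace the filename with the file extracted from profile.zip.

```python
import cProfile
import pstats

# Create a profile file (stand-in for the one inside profile.zip)
cProfile.run("sum(range(10_000))", "example.prof")

# Load it and print the 10 most expensive calls by cumulative time
stats = pstats.Stats("example.prof")
stats.sort_stats("cumulative").print_stats(10)
```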
Thank you for letting us know! As you know, all Kedro datasets are based on fsspec, so it would be great if you could also report the performance issue in their repository too :)
On Oct 2, 2020, 921kiyo changed the title from "PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage" to "[KED-2142] PartitionedDataset very slow in listing 5000+ partitions from Google Cloud Storage".
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.