Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

kfstorm · 2022-01-26T12:11:20Z

This reverts commit fa5c167.

Why are these changes needed?

This PR adds pandas block format support by implementing PandasRow, PandasBlockBuilder, PandasBlockAccessor.

Note that sort_and_partition, combine, merge_sorted_blocks, aggregate_combined_blocks in PandasBlockAccessor redirects to arrow block format implementation for now. They'll be implemented in a later PR.

Related issue number

#20719

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.

kfstorm · 2022-01-26T12:14:09Z

@ericl @jjyao I've fixed the slowdown of test_threaded_actor.

I need more input from you guys about pg_long_running_performance_test. Such as raw log files.

kfstorm · 2022-01-26T12:16:35Z

@jjyao Is it possible to run nightly tests with a PR build?

jjyao · 2022-01-26T16:27:33Z

@kfstorm Yes, PR build also generates the wheel that can be used in the nightly tests. Let me know if you don't have the instructions to do that.

ericl · 2022-01-26T21:18:12Z

Could we also add a test that "pandas" isn't imported when you import ray? This can go into e.g., test_basic.

ericl · 2022-01-26T21:18:22Z

//python/ray/data:tests/test_dataset TIMEOUT in 3 out of 3 in 900.0s

ericl · 2022-01-26T21:19:21Z


(raylet)     from ray.data.datasource import (
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/datasource/__init__.py", line 1, in <module>
  | (raylet)     from ray.data.datasource.datasource import (Datasource, RangeDatasource,
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/datasource/datasource.py", line 12, in <module>
  | (raylet)     from ray.data.impl.delegating_block_builder import DelegatingBlockBuilder
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/impl/delegating_block_builder.py", line 7, in <module>
  | (raylet)     from ray.data.impl.pandas_block import PandasRow, PandasBlockBuilder
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/impl/pandas_block.py", line 8, in <module>
  | (raylet)     pandas = lazy_import("pandas")
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/_private/utils.py", line 1165, in lazy_import
  | (raylet)     loader = importlib.util.LazyLoader(spec.loader)
  | (raylet) AttributeError: 'NoneType' object has no attribute 'loader'
 ```
 
 Hmm, is it available in all Python versions? A simpler way is to just trigger the pandas import inside the class.

clarkzinzow · 2022-01-26T22:12:34Z

python/ray/_private/utils.py

+def lazy_import(name):
+ if name in sys.modules:
+ return
+ spec = importlib.util.find_spec(name)


If the package doesn't exist (i.e. Pandas isn't installed), this will return None, which will cause the below spec.loader to fail with an AttributeError. We should check to see if this is None here and raise the canonical ModuleNotFoundError if so.

cc @ericl pretty sure this is the source of the test error

kfstorm · 2022-01-27T04:08:42Z

I've fixed lazy import. Let's see the CI results again.

ericl · 2022-01-27T07:14:23Z

or isinstance(applied, pd.core.frame.DataFrame)):
  | AttributeError: module 'pandas.core' has no attribute 'frame'
  | (_map_block_nosplit pid=11108) 2022-01-27 04:57:01,185	INFO worker.py:430 -- Task failed with retryable exception: TaskID(c5db14a0419b947bffffffffffffffffffffffff01000000).
  | (_map_block_nosplit pid=11108) Traceback (most recent call last):
  | (_map_block_nosplit pid=11108)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
  | (_map_block_nosplit pid=11108)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
  | (_map_block_nosplit pid=11108)   File "/ray/python/ray/data/impl/compute.py", line 46, in _map_block_nosplit
  | (_map_block_nosplit pid=11108)     for new_block in fn(block):
  | (_map_block_nosplit pid=11108)   File "/ray/python/ray/data/dataset.py", line 236, in transform
  | (_map_block_nosplit pid=11108)     or isinstance(applied, pd.core.frame.DataFrame)):
  | (_map_block_nosplit pid=11108) AttributeError: module 'pandas.core' has no attribute 'frame'
 <br class="Apple-interchange-newline">```

kfstorm · 2022-01-27T17:27:35Z

Yeah, that's strange. I'm still debugging.

kfstorm · 2022-01-28T13:03:54Z

Status update: This is a multi-threading issue. It can be reproduced with below script:

import sys
import importlib
import time

def lazy_import(name):
    try:
        return sys.modules[name]
    except KeyError:
        spec = importlib.util.find_spec(name)
        if not spec:
            raise ModuleNotFoundError(f"No module named '{name}'", name=name)
        module = importlib.util.module_from_spec(spec)
        loader = importlib.util.LazyLoader(spec.loader)
        loader.exec_module(module)
        # It's possible that another thread has done `import <name>`.
        # We only update sys.modules if the module is not in it already.
        module2 = sys.modules.setdefault(name, module)
        print(id(module), id(module2), "t1")
        return module2

import threading

t1 = threading.Thread(target=lazy_import, args=("pandas",), name="t1")

def foo():
    # time.sleep(0.5)
    import pandas
    print(id(pandas), "t2")
    print(pandas.core.frame.DataFrame)

t2 = threading.Thread(target=foo, name="t2")

def bar():
    time.sleep(0.5)
    import pandas
    print(id(pandas), "t3")
    print(pandas.core.frame.DataFrame)

t3 = threading.Thread(target=bar, name="t3")

for _ in [t1, t2, t3]:
    _.start()

for _ in [t1, t2, t3]:
    _.join()

Run while python python/test.py ; do echo hehe; done to reproduce it.

My guess is that when there are multiple threads trying to access any attribute of pandas, the underlying module loader's exec_module (https://github.com/python/cpython/blob/44afdbd5af4503e376148e9404b9c7a4f595b1fe/Lib/importlib/util.py#L247) will be executed multiple times.

This importlib.util.LazyLoader idea is basically for Python 3.6 support. From Python 3.7+, we can use __getattr__ on modules. So we can implement lazy import with a new approach. Mars has a good example: https://github.com/mars-project/mars/blob/7bce88a13334a89f2926d3bbb764894576faf1eb/mars/utils.py#L312. In the new approach, importlib.util.LazyLoader and the exec_module are replaced by importlib.import_module on first access of attributes. Since there are module-scope lock (https://github.com/python/cpython/blob/89fd7c34520aac493a8784a221366ed04452612b/Lib/importlib/_bootstrap.py#L163) implemented in importlib.import_module, it's thread-safe.

So I came up with some options:

Wrap spec.loader with a new class and enforce single execution of exec_module.
Drop the support for Python 3.6 and use the new approach to implement lazy import.
Do both, with Python version check.

@ericl @clarkzinzow Any comments or ideas are highly appreciated.

Reference: https://snarky.ca/lazy-importing-in-python-3-7/

ericl · 2022-01-28T19:15:08Z

@kfstorm what if we did something like this:

_pandas = None

def lazy_import_pandas():
    global _pandas
    if _pandas is None:
        import pandas
        _pandas = pandas
    return _pandas

And the in pandas_block.py, add pandas = lazy_import_pandas() prior to each use of pandas in PandasBlock?

kfstorm · 2022-01-29T02:12:47Z

@ericl It seems that your approach eagerly loads pandas. It will slow down 'import ray'.

ericl · 2022-01-29T02:14:45Z

@kfstorm , not if you only call that inside PandasBlock methods. It doesn't need to be loaded at the file level.

kfstorm · 2022-01-29T04:04:25Z

@ericl Then what's the difference with "import pandas" in every method? Will it be slightly faster?

ericl · 2022-01-29T04:21:33Z

Yes, it will be faster.

…

On Fri, Jan 28, 2022, 8:04 PM Kai Yang ***@***.***> wrote: @ericl <https://github.com/ericl> Then what's the difference with "import pandas" in every method? Will it be slightly faster? — Reply to this email directly, view it on GitHub <#21894 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAADUSUWQRJOS5SIAWXLDJTUYNRNJANCNFSM5M23YIIA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were mentioned.Message ID: ***@***.***>

bveeramani · 2022-01-30T05:02:13Z

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

Install Black

pip install -I black==21.12b0

Format changed files with Black

curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh

Commit your changes.

git add --all
git commit -m "Format Python code with Black"

Merge master into your branch.

git pull upstream master

Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

…dataframe

kfstorm · 2022-01-30T15:39:09Z

@ericl @jjyao CI and nightly test passed.

kfstorm added 3 commits January 26, 2022 19:53

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format imp…

fa28cdd

…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.

Update code after resolving conflicts

407087b

lazy import pandas

a03e3df

kfstorm requested review from ericl, jjyao and clarkzinzow January 26, 2022 12:11

kfstorm requested a review from scv119 as a code owner January 26, 2022 12:11

ericl self-assigned this Jan 26, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2022

clarkzinzow reviewed Jan 26, 2022

View reviewed changes

fix lazy_import

1980083

kfstorm added 3 commits January 30, 2022 18:47

assign pandas manually

0259da4

Format Python code with Black

51bf4e8

Merge remote-tracking branch 'upstream/master' into resubmit_dataset_…

364cce0

…dataframe

move code

cd80048

ericl approved these changes Jan 31, 2022

View reviewed changes

ericl merged commit 2038cc9 into ray-project:master Jan 31, 2022

kfstorm deleted the resubmit_dataset_dataframe branch February 7, 2022 06:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

kfstorm commented Jan 26, 2022

kfstorm commented Jan 26, 2022

kfstorm commented Jan 26, 2022

jjyao commented Jan 26, 2022

ericl commented Jan 26, 2022

ericl commented Jan 26, 2022

ericl commented Jan 26, 2022

clarkzinzow Jan 26, 2022 •

edited

Loading

clarkzinzow Jan 26, 2022

kfstorm commented Jan 27, 2022

ericl commented Jan 27, 2022

kfstorm commented Jan 27, 2022 •

edited

Loading

kfstorm commented Jan 28, 2022

ericl commented Jan 28, 2022

kfstorm commented Jan 29, 2022

ericl commented Jan 29, 2022

kfstorm commented Jan 29, 2022

ericl commented Jan 29, 2022 via email

bveeramani commented Jan 30, 2022

kfstorm commented Jan 30, 2022

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

Conversation

kfstorm commented Jan 26, 2022

Why are these changes needed?

Related issue number

Checks

kfstorm commented Jan 26, 2022

kfstorm commented Jan 26, 2022

jjyao commented Jan 26, 2022

ericl commented Jan 26, 2022

ericl commented Jan 26, 2022

ericl commented Jan 26, 2022

clarkzinzow Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

clarkzinzow Jan 26, 2022

Choose a reason for hiding this comment

kfstorm commented Jan 27, 2022

ericl commented Jan 27, 2022

kfstorm commented Jan 27, 2022 • edited Loading

kfstorm commented Jan 28, 2022

ericl commented Jan 28, 2022

kfstorm commented Jan 29, 2022

ericl commented Jan 29, 2022

kfstorm commented Jan 29, 2022

ericl commented Jan 29, 2022 via email

bveeramani commented Jan 30, 2022

‼️ ACTION REQUIRED ‼️

kfstorm commented Jan 30, 2022

clarkzinzow Jan 26, 2022 •

edited

Loading

kfstorm commented Jan 27, 2022 •

edited

Loading