Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894

Merged
merged 8 commits into from
Jan 31, 2022

Conversation

kfstorm
Copy link
Member

@kfstorm kfstorm commented Jan 26, 2022

This reverts commit fa5c167.

Why are these changes needed?

This PR adds pandas block format support by implementing PandasRow, PandasBlockBuilder, PandasBlockAccessor.

Note that sort_and_partition, combine, merge_sorted_blocks, aggregate_combined_blocks in PandasBlockAccessor redirects to arrow block format implementation for now. They'll be implemented in a later PR.

Related issue number

#20719

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kfstorm
Copy link
Member Author

kfstorm commented Jan 26, 2022

@ericl @jjyao I've fixed the slowdown of test_threaded_actor.

I need more input from you guys about pg_long_running_performance_test. Such as raw log files.

@kfstorm
Copy link
Member Author

kfstorm commented Jan 26, 2022

@jjyao Is it possible to run nightly tests with a PR build?

@jjyao
Copy link
Collaborator

jjyao commented Jan 26, 2022

@kfstorm Yes, PR build also generates the wheel that can be used in the nightly tests. Let me know if you don't have the instructions to do that.

@ericl ericl self-assigned this Jan 26, 2022
@ericl
Copy link
Contributor

ericl commented Jan 26, 2022

Could we also add a test that "pandas" isn't imported when you import ray? This can go into e.g., test_basic.

@ericl
Copy link
Contributor

ericl commented Jan 26, 2022

//python/ray/data:tests/test_dataset TIMEOUT in 3 out of 3 in 900.0s
 


@ericl
Copy link
Contributor

ericl commented Jan 26, 2022


(raylet)     from ray.data.datasource import (
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/datasource/__init__.py", line 1, in <module>
  | (raylet)     from ray.data.datasource.datasource import (Datasource, RangeDatasource,
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/datasource/datasource.py", line 12, in <module>
  | (raylet)     from ray.data.impl.delegating_block_builder import DelegatingBlockBuilder
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/impl/delegating_block_builder.py", line 7, in <module>
  | (raylet)     from ray.data.impl.pandas_block import PandasRow, PandasBlockBuilder
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/data/impl/pandas_block.py", line 8, in <module>
  | (raylet)     pandas = lazy_import("pandas")
  | (raylet)   File "/tmp/ray/session_2022-01-26_15-07-57_959732_1541/runtime_resources/conda/9d41a9fc1deb57000894b5d369c0ad86d32550ed/lib/python3.6/site-packages/ray/_private/utils.py", line 1165, in lazy_import
  | (raylet)     loader = importlib.util.LazyLoader(spec.loader)
  | (raylet) AttributeError: 'NoneType' object has no attribute 'loader'
 ```
 
 Hmm, is it available in all Python versions? A simpler way is to just trigger the pandas import inside the class.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2022
def lazy_import(name):
if name in sys.modules:
return
spec = importlib.util.find_spec(name)
Copy link
Contributor

@clarkzinzow clarkzinzow Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the package doesn't exist (i.e. Pandas isn't installed), this will return None, which will cause the below spec.loader to fail with an AttributeError. We should check to see if this is None here and raise the canonical ModuleNotFoundError if so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ericl pretty sure this is the source of the test error

@kfstorm
Copy link
Member Author

kfstorm commented Jan 27, 2022

I've fixed lazy import. Let's see the CI results again.

@ericl
Copy link
Contributor

ericl commented Jan 27, 2022

or isinstance(applied, pd.core.frame.DataFrame)):
  | AttributeError: module 'pandas.core' has no attribute 'frame'
  | (_map_block_nosplit pid=11108) 2022-01-27 04:57:01,185	INFO worker.py:430 -- Task failed with retryable exception: TaskID(c5db14a0419b947bffffffffffffffffffffffff01000000).
  | (_map_block_nosplit pid=11108) Traceback (most recent call last):
  | (_map_block_nosplit pid=11108)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
  | (_map_block_nosplit pid=11108)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
  | (_map_block_nosplit pid=11108)   File "/ray/python/ray/data/impl/compute.py", line 46, in _map_block_nosplit
  | (_map_block_nosplit pid=11108)     for new_block in fn(block):
  | (_map_block_nosplit pid=11108)   File "/ray/python/ray/data/dataset.py", line 236, in transform
  | (_map_block_nosplit pid=11108)     or isinstance(applied, pd.core.frame.DataFrame)):
  | (_map_block_nosplit pid=11108) AttributeError: module 'pandas.core' has no attribute 'frame'
 <br class="Apple-interchange-newline">```

@kfstorm
Copy link
Member Author

kfstorm commented Jan 27, 2022

Yeah, that's strange. I'm still debugging.

@kfstorm
Copy link
Member Author

kfstorm commented Jan 28, 2022

Status update: This is a multi-threading issue. It can be reproduced with below script:

import sys
import importlib
import time

def lazy_import(name):
    try:
        return sys.modules[name]
    except KeyError:
        spec = importlib.util.find_spec(name)
        if not spec:
            raise ModuleNotFoundError(f"No module named '{name}'", name=name)
        module = importlib.util.module_from_spec(spec)
        loader = importlib.util.LazyLoader(spec.loader)
        loader.exec_module(module)
        # It's possible that another thread has done `import <name>`.
        # We only update sys.modules if the module is not in it already.
        module2 = sys.modules.setdefault(name, module)
        print(id(module), id(module2), "t1")
        return module2

import threading

t1 = threading.Thread(target=lazy_import, args=("pandas",), name="t1")

def foo():
    # time.sleep(0.5)
    import pandas
    print(id(pandas), "t2")
    print(pandas.core.frame.DataFrame)

t2 = threading.Thread(target=foo, name="t2")

def bar():
    time.sleep(0.5)
    import pandas
    print(id(pandas), "t3")
    print(pandas.core.frame.DataFrame)

t3 = threading.Thread(target=bar, name="t3")

for _ in [t1, t2, t3]:
    _.start()

for _ in [t1, t2, t3]:
    _.join()

Run while python python/test.py ; do echo hehe; done to reproduce it.

My guess is that when there are multiple threads trying to access any attribute of pandas, the underlying module loader's exec_module (https://github.com/python/cpython/blob/44afdbd5af4503e376148e9404b9c7a4f595b1fe/Lib/importlib/util.py#L247) will be executed multiple times.

This importlib.util.LazyLoader idea is basically for Python 3.6 support. From Python 3.7+, we can use __getattr__ on modules. So we can implement lazy import with a new approach. Mars has a good example: https://github.com/mars-project/mars/blob/7bce88a13334a89f2926d3bbb764894576faf1eb/mars/utils.py#L312. In the new approach, importlib.util.LazyLoader and the exec_module are replaced by importlib.import_module on first access of attributes. Since there are module-scope lock (https://github.com/python/cpython/blob/89fd7c34520aac493a8784a221366ed04452612b/Lib/importlib/_bootstrap.py#L163) implemented in importlib.import_module, it's thread-safe.

So I came up with some options:

  1. Wrap spec.loader with a new class and enforce single execution of exec_module.
  2. Drop the support for Python 3.6 and use the new approach to implement lazy import.
  3. Do both, with Python version check.

@ericl @clarkzinzow Any comments or ideas are highly appreciated.

Reference: https://snarky.ca/lazy-importing-in-python-3-7/

@ericl
Copy link
Contributor

ericl commented Jan 28, 2022

@kfstorm what if we did something like this:

_pandas = None

def lazy_import_pandas():
    global _pandas
    if _pandas is None:
        import pandas
        _pandas = pandas
    return _pandas

And the in pandas_block.py, add pandas = lazy_import_pandas() prior to each use of pandas in PandasBlock?

@kfstorm
Copy link
Member Author

kfstorm commented Jan 29, 2022

@ericl It seems that your approach eagerly loads pandas. It will slow down 'import ray'.

@ericl
Copy link
Contributor

ericl commented Jan 29, 2022

@kfstorm , not if you only call that inside PandasBlock methods. It doesn't need to be loaded at the file level.

@kfstorm
Copy link
Member Author

kfstorm commented Jan 29, 2022

@ericl Then what's the difference with "import pandas" in every method? Will it be slightly faster?

@ericl
Copy link
Contributor

ericl commented Jan 29, 2022 via email

@bveeramani
Copy link
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black
pip install -I black==21.12b0
  1. Format changed files with Black
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  1. Commit your changes.
git add --all
git commit -m "Format Python code with Black"
  1. Merge master into your branch.
git pull upstream master
  1. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@kfstorm
Copy link
Member Author

kfstorm commented Jan 30, 2022

@ericl @jjyao CI and nightly test passed.

@ericl ericl merged commit 2038cc9 into ray-project:master Jan 31, 2022
@kfstorm kfstorm deleted the resubmit_dataset_dataframe branch February 7, 2022 06:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants