Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Fix legacy do_write #32661

Merged
merged 3 commits into from
Feb 21, 2023

Conversation

matthew29tang
Copy link
Contributor

@matthew29tang matthew29tang commented Feb 17, 2023

Why are these changes needed?

With the new write added (from #32015 and #32440), Ray Data intends to support both the write and do_write functions for now. The check currently uses the hasattr() function to ensure the datasource object has a write method before using it.

However, this is insufficient for a custom datasource that inherits from Datasource since Datasource has the write method implemented. If the custom datasource only has do_write implemented, hasattr(datasource, "write") will return True since hasattr() will detect methods via inheritance.

The solution is to check if the write method was overwritten from Datasource.write. Any class that has not implemented write will have the equality check return True

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing! Could we add a test for this case?

@matthew29tang matthew29tang changed the title Fix legacy do_write [Data] Fix legacy do_write Feb 17, 2023
@matthew29tang
Copy link
Contributor Author

How would you recommend adding a test for this? I don't see existing test cases for CustomDatasource to base off of.

@jianoaix
Copy link
Contributor

You may extend the Datasource and implement a TestDatasource, similar to https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/datasource.py#L318

@ericl
Copy link
Contributor

ericl commented Feb 21, 2023

Dataset tests seem to be failing (@jianoaix )

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 21, 2023
@matthew29tang
Copy link
Contributor Author

Apologies, I will investigate.

Quick sanity check, how do you run the data tests locally? I am able to run basic tests with python -m pytest -v python/ray/tests/test_mini.py with the working directory in the top level ray folder. But the data tests seem to not run with python -m pytest -v python/ray/data/tests/test_dataset.py (ModuleNotFoundError: No module named 'ray.data.tests')

@jianoaix
Copy link
Contributor

Apologies, I will investigate.

Quick sanity check, how do you run the data tests locally? I am able to run basic tests with python -m pytest -v python/ray/tests/test_mini.py with the working directory in the top level ray folder. But the data tests seem to not run with python -m pytest -v python/ray/data/tests/test_dataset.py (ModuleNotFoundError: No module named 'ray.data.tests')

Yes, that's how we run tests locally. Can you try with `python python/ray/data/tests/test_dataset.py'?

@matthew29tang
Copy link
Contributor Author

Running that in the same directory (top level ray) gives the same error:

Traceback (most recent call last):
  File "python/ray/data/tests/test_dataset.py", line 39, in <module>
    from ray.data.tests.conftest import *  # noqa
ModuleNotFoundError: No module named 'ray.data.tests'

@jianoaix
Copy link
Contributor

Running that in the same directory (top level ray) gives the same error:

Traceback (most recent call last):
  File "python/ray/data/tests/test_dataset.py", line 39, in <module>
    from ray.data.tests.conftest import *  # noqa
ModuleNotFoundError: No module named 'ray.data.tests'

The test failed is actually this one: https://buildkite.com/ray-project/oss-ci-build-pr/builds/12645

Signed-off-by: Matthew Tang <[email protected]>
Signed-off-by: Matthew Tang <[email protected]>
@matthew29tang
Copy link
Contributor Author

To my understanding, the current issue is that various front-facing Datasources do not implement write, but they inherit write from a parent (ex. FileBasedDatasource). This is likely why the initial author used hasattr() to check with inheritance.

The original problem still stands that hasattr() is vacuously True since anything that extends Datasource will thus have a write method (but raise NotImplementedError).

Still investigating a solution to this.

@jianoaix
Copy link
Contributor

To my understanding, the current issue is that various front-facing Datasources do not implement write, but they inherit write from a parent (ex. FileBasedDatasource). This is likely why the initial author used hasattr() to check with inheritance.

The original problem still stands that hasattr() is vacuously True since anything that extends Datasource will thus have a write method (but raise NotImplementedError).

Still investigating a solution to this.

Yeah, with the fix this PR, those subclasses of FileBasedDatasource will still run the legacy do_write() path, although the write() has been implemented in FileBasedDatasource.

@matthew29tang
Copy link
Contributor Author

Just pushed a new version which checks if the write method was overwritten from Datasource.write. Any class that has not implemented write will have the equality check return True

@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 21, 2023
@jianoaix
Copy link
Contributor

Just pushed a new version which checks if the write method was overwritten from Datasource.write. Any class that has not implemented write will have the equality check return True

That LGTM!

@jianoaix
Copy link
Contributor

It looks tests passing now. @matthew29tang Can you update the PR description based on current state?

@matthew29tang
Copy link
Contributor Author

Just edited the PR description.

@ericl ericl merged commit ed1f1e2 into ray-project:master Feb 21, 2023
@ericl
Copy link
Contributor

ericl commented Feb 21, 2023

Merged, thanks!

cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Mar 22, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance. 

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance.

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True

Signed-off-by: Edward Oakes <[email protected]>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance. 

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True
scottsun94 pushed a commit to scottsun94/ray that referenced this pull request Mar 28, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance. 

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True
cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this pull request Mar 28, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance. 

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
With the new `write` added (from ray-project#32015 and ray-project#32440), Ray Data intends to support both the `write` and `do_write` functions for now. The check currently uses the `hasattr()` function to ensure the datasource object has a `write` method before using it.

However, this is insufficient for a custom datasource that inherits from `Datasource` since `Datasource` has the `write` method implemented. If the custom datasource only has `do_write` implemented, `hasattr(datasource, "write")` will return True since `hasattr()` will detect methods via inheritance.

The solution is to check if the `write` method was overwritten from `Datasource.write`. Any class that has not implemented `write` will have the equality check return True

Signed-off-by: elliottower <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants