Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix opencensus protobuf bug #31632

Merged
merged 18 commits into from
Jan 13, 2023

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Jan 12, 2023

Why are these changes needed?

Here's how the conflict happens.

  1. We pull the opencensus-proto 0.3.0 for our bazel build.
  2. opencensus-proto 0.3.0 generated python code (metric_pb2 and resource_pb2), which is used by reporter.proto.
  3. We copy all python protobuf to python/ray/core/generated.
  4. Now reporter_pb2 is available from python/ray/core/generated.

reporter_pb2 is not supposed to be imported from core workers (it is only for dashboard agent). However, this was imported (accidentally) from gcs_pubsub.py because pubsub.proto requires this module (which was unnecessary).

As a result, reporter_pb2 was imported to the core worker, which subsequently imports metric_pb2 and resource_pb2. When we import the opencensus proto using open telemetry, we import resource_pb2 again, which causes the conflict.

Theoretically, when the same package is imported (which is our case), protobuf is supposed to handle it. (I guess) This wasn't the case because our cpp version uses opencensus-proto 0.3.0, and the python only supports opencensus-proto 0.1.0 (omg https://pypi.org/project/opencensus-proto/#history).

There are 4 potential solutions.

  1. [current PR] Make sure to not import reporter_pb2 from the core worker. It is the easiest fix and have the best compatibility story. Since we don't import reporter_pb2 (which is only for dashboard agent), there's no concern of conflict. Since we have unit tests now, we can avoid the future regressions.
  2. Use downloaded opencensus-proto for all python generated code (instead of cpp one we built). This requires us to introduce opencensus-proto 0.1.0 as a dependency, so I avoided this approach.
  3. Downgrade cpp opencensus-proto version to 0.1.0. This is not a very clean solution. What if opencensus releases 0.2.0?
  4. Copy & paste opencensus-proto and build our own proto with different names. This can potentially have compatibility issue with opencensus library, but it is unlikely (cuz the project doesn't seem to be maintained). This is the fallback option if the current approach doesn't work.

Related issue number

Closes #31358

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
python/ray/_private/test_utils.py Show resolved Hide resolved
@@ -21,7 +21,6 @@ import "src/ray/protobuf/common.proto";
import "src/ray/protobuf/dependency.proto";
import "src/ray/protobuf/gcs.proto";
import "src/ray/protobuf/logging.proto";
import "src/ray/protobuf/reporter.proto";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the critical fix

@rkooo567 rkooo567 changed the title [WIP] Fix opencensus protobuf bug [Core] Fix opencensus protobuf bug Jan 12, 2023
@rkooo567
Copy link
Contributor Author

@scv119 I am 80% sure this will pass all builds, but let's see...

@rickyyx
Copy link
Contributor

rickyyx commented Jan 12, 2023

Looks like some tests are still failing?

Theoretically, when the same package is imported (which is our case), protobuf is supposed to handle it. (I guess) This wasn't the case because our cpp version uses opencensus-proto 0.3.0, and the python only supports opencensus-proto 0.1.0 (omg https://pypi.org/project/opencensus-proto/#history).

This is really really old....Looks like it's not maintained at all. Do we have any plans to migrate away from it if so?

@rickyyx rickyyx added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2023
@scv119
Copy link
Contributor

scv119 commented Jan 12, 2023

@rkooo567 is the long term solution to put related protobuf into a different namespace so it will never conflict?

Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
Signed-off-by: SangBin Cho <[email protected]>
@rkooo567
Copy link
Contributor Author

@scv119 That's solution 4. I think the current solution is good enough in the long term (since we have a regression test now). If we ever have to import reporter.proto to core worker, solution 2 or 4 both will work.

@rkooo567
Copy link
Contributor Author

If anyone reorg proto or do something that breaks it, the test will fail, so we can detect them. If this ever should happen, we can choose option 4 (I will add TODO).

The only edge case possible is an end user imports reporter_pb2 (private) to the driver and uses opencensus proto at the same time. Tbh, I don’t think there will be anyone who will do this... Note that this issue has existed for 3 years, and this was the first issue ever raised.

Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkooo567 rkooo567 merged commit de1dbc8 into ray-project:master Jan 13, 2023
rkooo567 added a commit that referenced this pull request Jan 13, 2023
Signed-off-by: SangBin Cho <[email protected]>

Here's how the conflict happens.

We pull the opencensus-proto 0.3.0 for our bazel build.
opencensus-proto 0.3.0 generated python code (metric_pb2 and resource_pb2), which is used by reporter.proto.
We copy all python protobuf to python/ray/core/generated.
Now reporter_pb2 is available from python/ray/core/generated.
reporter_pb2 is not supposed to be imported from core workers (it is only for dashboard agent). However, this was imported (accidentally) from gcs_pubsub.py because pubsub.proto requires this module (which was unnecessary).

As a result, reporter_pb2 was imported to the core worker, which subsequently imports metric_pb2 and resource_pb2. When we import the opencensus proto using open telemetry, we import resource_pb2 again, which causes the conflict.

Theoretically, when the same package is imported (which is our case), protobuf is supposed to handle it. (I guess) This wasn't the case because our cpp version uses opencensus-proto 0.3.0, and the python only supports opencensus-proto 0.1.0 (omg https://pypi.org/project/opencensus-proto/#history).

There are 4 potential solutions.

[current PR] Make sure to not import reporter_pb2 from the core worker. It is the easiest fix and have the best compatibility story. Since we don't import reporter_pb2 (which is only for dashboard agent), there's no concern of conflict. Since we have unit tests now, we can avoid the future regressions.
Use downloaded opencensus-proto for all python generated code (instead of cpp one we built). This requires us to introduce opencensus-proto 0.1.0 as a dependency, so I avoided this approach.
Downgrade cpp opencensus-proto version to 0.1.0. This is not a very clean solution. What if opencensus releases 0.2.0?
Copy & paste opencensus-proto and build our own proto with different names. This can potentially have compatibility issue with opencensus library, but it is unlikely (cuz the project doesn't seem to be maintained). This is the fallback option if the current approach doesn't work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Jobs] Couldn't build proto file into descriptor pool! exception after installing ray in project
4 participants