Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dataframes] Call ray.init() on ray.dataframe import #1626

Merged
merged 4 commits into from
Feb 28, 2018

Conversation

kunalgosar
Copy link
Contributor

Ensure ray is initialized when dataframes are imported.

The behavior here is interesting because the dataframe import is sent to all the workers. We need to ensure that the init is only run on the main thread.

@richardliaw
Copy link
Contributor

you might want to try catch this just in case the user has already initialized ray in the code.

Also, what would be the ideal user experience in the cluster setting where ray has already started?

@kunalgosar
Copy link
Contributor Author

Wrapped it in a try/except now; thanks!

The current experience is that it prints out the webui url. Is there some way to avoid this? By checking that ray has been initialized without a call to ray.init?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4025/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4026/
Test PASSed.

@richardliaw
Copy link
Contributor

richardliaw commented Feb 27, 2018

I would actually advise against this. You might actually not want ray to take up all of the available CPU resources.

To answer your question, you would probably want to do something like ray.init(ray.services.get_node_ip() + ":6379") assuming the cluster is launched via the autoscaler.

Backtracking a little, what's the purpose of this PR? Perhaps there is a more elegant fix.

@devin-petersohn
Copy link
Member

We are expecting our users to be traditional Pandas users, and one of the main drivers of our system is that you only have to change the import statement. I don't think most users will care about reducing their number of CPUs below the default. I don't see a problem with it for now, but I do agree that there are cases where users won't want access to all CPUs. We can add some docs for it later, but for now I think it's ok.

@richardliaw
Copy link
Contributor

I see. Do you expect that the traditional user not to use multiple nodes then?

@devin-petersohn
Copy link
Member

Right now we aren't supporting it, but eventually, yes. It's not a big deal for changes. We will put something more effective together later so customization is possible.

@richardliaw
Copy link
Contributor

I see - thanks for clarifying.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4027/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4028/
Test PASSed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks @kunalgosar

@devin-petersohn devin-petersohn merged commit 4a15c2c into ray-project:master Feb 28, 2018
@robertnishihara
Copy link
Collaborator

We probably don't want the workers calling ray.init() right? Only the driver should do that.

When a data frame remote function is imported by a worker, the worker will try to do import ray.dataframe, which will cause the ray.init call to happen.

@devin-petersohn
Copy link
Member

Right now we're only supporting dataframes locally until we can get a ray.put(other_machine).

@robertnishihara
Copy link
Collaborator

Right, I'm referring to worker "processes" as opposed to worker "machines".

@kunalgosar
Copy link
Contributor Author

Is there some way to check that only the main thread on the main process runs the init function? It seems like the worker.py init check only looks at the name of the thread that is calling it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants