-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Dashboard] Add tune API to support tune tab in new backend #11009
Conversation
cc @richardliaw |
if not self._tensor_board_dir: | ||
tb = program.TensorBoard() | ||
tb.configure(argv=[None, "--logdir", str(self._logdir)]) | ||
tb.launch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this eventually get terminated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardliaw I just ran it and tensorboard died when I killed my ray cluster. It seems to fate share with its parent process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm then
from ray.tune import Analysis | ||
from tensorboard import program | ||
except ImportError: | ||
Analysis = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened if the Analysis
is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Analysis is none, the Tune tab availability check will fail, and the tune tab won't be displayed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The self.collect()
does not raise an exception? That will be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not, I added a check.
dashboard/modules/tune/tune_head.py
Outdated
transforms them, and serves them up over an API. | ||
|
||
Args | ||
logdir (str): Directory path to save the status information of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Args
comments can be removed? Or, can we move the comments to self._logdir
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thank you.
dashboard/modules/tune/tune_head.py
Outdated
try: | ||
experiment = req.query["experiment"] | ||
except KeyError: | ||
return await rest_response(success=False, message="Bad request") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @routes.get
decorator catches the handler exception and returns rest_response(success=False, message=<exception stack>)
. This try ... except ...
do the same thing, but makes the response much cleaner rest_response(success=False, message="Bad request")
.
dashboard/modules/tune/tune_head.py
Outdated
self._tensor_board_dir = self._logdir | ||
|
||
def _get_availability(self): | ||
return {"available": True, "trials_available": self._trials_available} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the _get_availability
is only used by the get_availability
, is it better to merge them?
dashboard/modules/tune/tune_head.py
Outdated
self._errors[str(trial)]["trial_id"] = str(trial_id) | ||
if str(trial_id) in self._trial_records.keys(): | ||
self._trial_records[str(trial_id)]["error"] = text | ||
self._trial_records[str(trial_id)][ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are too many str(trail_id)
, can we just trail_id = str(other_data["trial_id"].values[0])
?
dashboard/modules/tune/tune_head.py
Outdated
"hostname", "node_ip", "time_since_restore", | ||
"timesteps_since_restore", "iterations_since_restore", | ||
"experiment_tag", "trial_id" | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't care the order of the names, can we change the default_names
to a set?
timeit.timeit("'iterations_since_restore' in default_names_set", globals=globals())
0.04095575900282711
timeit.timeit("'iterations_since_restore' in default_names", globals=globals())
0.2835074480099138
@fyrestone @richardliaw could you please take a look to see if your concerns are addressed and approve if so? |
Great - could you post a screenshot of this working? |
@richardliaw This PR doesn't make this implementation the default dashboard yet. Currently I'm shepherding through multiple PRs to be able to make a PR to switch over to this dashboard by default. Given that, this branch doesn't even have the front-end changes included for me to show you this working. I did test it on a branch where I had integrated all these changes before making the PRs. When I make a PR to deprecate the old dashboard and turn this functionality on, I'll tag you in it and provide a screenshot. Is that alright @richardliaw ? |
Sounds good! No objections for merging from me. |
I'm not authorized to merge. @rkooo567 could you please merge? |
Why are these changes needed?
These changes are needed to enable running the existing Tune tab with the new backend. There are a few changes to the tune backend to make it work with async io, but otherwise it is pretty much identical to how it was in the old dashboard backend.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.I tested this alongside many other changes in a branch that contains the full migration code. That said, this is not yet unit tested.