Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Population based training UI #1862

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from
Draft

Population based training UI #1862

wants to merge 7 commits into from

Conversation

a9p
Copy link
Contributor

@a9p a9p commented May 5, 2022

What this PR does / why we need it:

Support the discovery of modulated hyperparameters rather than attempting to find a fixed set over the entire training process within the Katib UI.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

This PR provides some initial support for PBT within the Katib UI (#1382). Required core changes are introduced in #1833.

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: a9p
To complete the pull request process, please assign johnugeorge after the PR has been reviewed.
You can assign the PR to them by writing /assign @johnugeorge in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aws-kf-ci-bot
Copy link
Contributor

Hi @a9p. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@a9p a9p mentioned this pull request May 5, 2022
1 task
Copy link
Member

@Arhell Arhell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@aws-kf-ci-bot
Copy link
Contributor

@a9p: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
kubeflow-katib-presubmit be99840 link /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@a9p a9p marked this pull request as draft May 23, 2022 01:54
@a9p
Copy link
Contributor Author

a9p commented May 23, 2022

Changing this request back to WIP: following integration of feedback from #1833, this PR needs to be changed to parse Labels instead of Annotations for metadata.

@johnugeorge
Copy link
Member

@a9p Can you please update with latest review comments so that we can take this in the coming release?

@a9p
Copy link
Contributor Author

a9p commented Jun 20, 2022

@johnugeorge this PR has two components:

  1. pbt additions to the new experiment form
  2. pbt tab on the experiment results view

The first one should be good to use and the changes are encompassed in 593881a and 6145922 (only the prettier needs to be run again on this commit if we want to separate it).

The second one I think could use a bit more usability / code feedback and might be a better target for a subsequent cycle.

@coveralls
Copy link

coveralls commented Jun 20, 2022

Coverage Status

Coverage decreased (-7.09%) to 73.959% when pulling 5ca9a97 on a9p:pbt-ui into f7261de on kubeflow:master.

@stale
Copy link

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a9p Since we release Katib 0.15 recently, do you have some free time to finish this PR ?
It would be nice to give our users ability to monitor PBT Experiments in Katib UI.
Thanks for your effort!

@a9p
Copy link
Contributor Author

a9p commented May 22, 2023

Noting here if anyone has any thoughts --
A standard job for me runs ~1k trials. The UI data query (fetch_hp_job_info) seem to take a significant amount of time to aggregate and transmit the data to the browser (order minutes). On cursory glance it might be a DB bottleneck...as we are likely introducing a second query (or augmenting the job info one) in this PR for PBT job information, I was curious if anyone else has seen this performance characteristic.

@andreyvelich
Copy link
Member

andreyvelich commented May 23, 2023

@a9p Yes, we noticed this issue before.
Katib UI operates slowly when Experiment has many Trials.
We need to investigate if it is backend or db slowness. Also, UI rendering might take time.
@a9p It would be nice if you could create a separate issue to track this performance issue.

cc @kimwnasptd @elenzio9 @apo-ger

@andreyvelich
Copy link
Member

Hi @a9p, do you have any free time to complete this PR so we can merge it as part of our next release (Katib release-0.16)?

@a9p
Copy link
Contributor Author

a9p commented Jul 25, 2023

Hi @andreyvelich, it doesn't look like I will be able to wrap up this PR before feature freeze. Could we push this to the next release target as well?

@andreyvelich
Copy link
Member

Sure, no problem @a9p!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants