-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Add Custom Usage Tags for Reporting #27061
Conversation
python/ray/serve/deployment_state.py
Outdated
@@ -1771,6 +1775,9 @@ def deploy(self, deployment_name: str, deployment_info: DeploymentInfo) -> bool: | |||
self._deployment_states[deployment_name] = self._create_deployment_state( | |||
deployment_name | |||
) | |||
record_extra_usage_tag( | |||
TagKey.SERVE_NUM_DEPLOYMENTS, str(len(self._deployment_states)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this per cluster? usage stats is per cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
# Serve | ||
SERVE_USING_V1_API = auto() | ||
SERVE_USING_V2_API = auto() | ||
SERVE_NUM_DEPLOYMENTS = auto() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add descriptions of what this is measuring. (cc @thomasdesr @pcmoritz we shouldn't approve metrics without clear descriptions).
python/ray/serve/api.py
Outdated
|
||
return _private_api.serve_start(detached, http_options, dedicated_cpu, **kwargs) | ||
# Record after Ray has been started. | ||
record_extra_usage_tag(TagKey.SERVE_USING_V1_API, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record_extra_usage_tag(TagKey.SERVE_USING_V1_API, "") | |
record_extra_usage_tag(TagKey.SERVE_API_VERSION, "v1") |
RLLIB_FRAMEWORK = auto() | ||
RLLIB_ALGORITHM = auto() | ||
RLLIB_NUM_WORKERS = auto() | ||
|
||
# Serve | ||
SERVE_USING_V1_API = auto() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SERVE_USING_V1_API = auto() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments inline.
@ericl addressed. PTAL |
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
69a1e49
to
1a2f554
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question, how will the stats added in serve working with the doc mentioning "The data will be uploaded every hour with other usage stats data in the cluster."?
Simple scenario, i call deploy() twice in one hour, will we have two stats point in the database?
@@ -1771,6 +1775,9 @@ def deploy(self, deployment_name: str, deployment_info: DeploymentInfo) -> bool: | |||
self._deployment_states[deployment_name] = self._create_deployment_state( | |||
deployment_name | |||
) | |||
record_extra_usage_tag( | |||
TagKey.SERVE_NUM_DEPLOYMENTS, str(len(self._deployment_states)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a function to accept the int instead of converting to str?
(Not python expert, but it looks like inefficient)
@@ -1856,3 +1863,8 @@ def update(self): | |||
|
|||
for tag in deleted_tags: | |||
del self._deployment_states[tag] | |||
|
|||
if len(deleted_tags): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, do we need to collect the deleted num deployment? this will make the stats inaccurate since you also gather num deployment in the deploy(),
I suggest we don't gather stats in deploy() or delete(), we can have a dedicated stats thread and push the expected stats into the kv store.
(Just checked it is a synchronous call in record_extra_usage_tag. might not be very safe to do it in the critical code path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just trying to capture the "latest" state here since it's last writer win for that hour.
Signed-off-by: Rohan138 <[email protected]>
Signed-off-by: Frank Luan <[email protected]>
Signed-off-by: Scott Graham <[email protected]>
Signed-off-by: Chong-Li <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Signed-off-by: simon-mo [email protected]
Why are these changes needed?
This PR adds two custom usage tag to track the total number of deployments in Ray Serve, as well as usages of two different versions of APIs.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.