[Serve] fix autoscaling json serialization issue #42339
Conversation
Signed-off-by: Gene Su <[email protected]>
Shoutout to @zcin for finding this issue!
The serialized class definition should not be returned in ServeInstanceDetails
at all. This is a user-facing struct that is returned in the REST API and the serialized class definition is purely an implementation detail.
Signed-off-by: Gene Su <[email protected]>
Force-pushed from 5f0906e to 8ee4bd6
# `serialized_policy_def` is only used internally and should not be exposed to
# the REST API. This method iteratively removes it from each autoscaling config
# if it exists.
what'll happen here if the user passed a class definition directly for the autoscaling policy?
policy should carry the callable; serialized_policy_def should continue to be hidden away from users.
Right, but a callable is not JSON-serializable, so won't this error?
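(For context, this concern is easy to reproduce outside the PR: the standard library's json.dumps raises TypeError when a value in the structure is a callable. Illustrative snippet only, not code from the PR.)

```python
import json

def my_policy(ctx):
    # A stand-in for a user-provided autoscaling policy callable.
    return 1

try:
    json.dumps({"policy": my_policy})
except TypeError as e:
    # json.dumps cannot serialize function objects.
    print("not serializable:", e)
```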
No error. I added test cases for None, string, and callable policies: https://github.com/ray-project/ray/pull/42339/files#diff-1eaedadea6f2e3382aee6726429ebaad94801e3c55592b3e9f84a0d1b7cc1a14R80
What will the JSON serialized output be if the user passes in a callable?
Oh the policy is actually not part of the response.
{ 'applications': { 'default': { 'deployed_app_config': None,
'deployments': { 'autoscaling_app': { 'deployment_config': { 'autoscaling_config': { 'downscale_delay_s': 600.0,
'downscale_smoothing_factor': None,
'look_back_period_s': 30.0,
'max_replicas': 10,
'metrics_interval_s': 10.0,
'min_replicas': 1,
'smoothing_factor': 1.0,
'target_num_ongoing_requests_per_replica': 1.0,
'upscale_delay_s': 30.0,
'upscale_smoothing_factor': None},
'graceful_shutdown_timeout_s': 20.0,
'graceful_shutdown_wait_loop_s': 2.0,
'health_check_period_s': 10.0,
'health_check_timeout_s': 30.0,
'max_concurrent_queries': 100,
'name': 'autoscaling_app',
'ray_actor_options': { 'num_cpus': 1.0,
'runtime_env': { }},
'user_config': None},
'message': '',
'name': 'autoscaling_app',
'replicas': [ { 'actor_id': '171b2a617ac029cb037c941c01000000',
'actor_name': 'SERVE_REPLICA::default#autoscaling_app#9fydf52f',
'log_file_path': '/serve/replica_default_autoscaling_app_9fydf52f.log',
'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
'node_ip': '127.0.0.1',
'pid': 39584,
'replica_id': 'default#autoscaling_app#9fydf52f',
'start_time_s': 1705093459.648952,
'state': <ReplicaState.RUNNING: 'RUNNING'>,
'worker_id': '699cf6124f76f5c6902956ef7e859c59f580c98202c121109eb88815'}],
'status': <DeploymentStatus.HEALTHY: 'HEALTHY'>,
'status_trigger': <DeploymentStatusTrigger.CONFIG_UPDATE_COMPLETED: 'CONFIG_UPDATE_COMPLETED'>}},
'docs_path': None,
'last_deployed_time_s': 1705093459.4853802,
'message': '',
'name': 'default',
'route_prefix': '/',
'status': <ApplicationStatus.RUNNING: 'RUNNING'>}},
'controller_info': { 'actor_id': '9582543ca6801417fcd6370601000000',
'actor_name': 'SERVE_CONTROLLER_ACTOR',
'log_file_path': '/serve/controller_39559.log',
'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
'node_ip': '127.0.0.1',
'worker_id': 'f4f421a9b26e8f281f3c2f40cffd6c96069e4bedc4dc00937d3cc0a6'},
'grpc_options': {},
'http_options': {'host': '0.0.0.0'},
'proxies': { '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e': { 'actor_id': '35092b45d842d61dab6acc9701000000',
'actor_name': 'SERVE_PROXY_ACTOR-39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
'log_file_path': '/serve/proxy_127.0.0.1.log',
'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
'node_ip': '127.0.0.1',
'status': <ProxyStatus.HEALTHY: 'HEALTHY'>,
'worker_id': '3e4653db9129065ab21604b19aed6c4e22dc464120a8b0ab96346317'}},
let me see why that is
OK, this is because we do not include policy in the AutoscalingConfig proto:
ray/src/ray/protobuf/serve.proto, line 65 (at f76081f):
bytes serialized_policy_def = 12;
Added policy_name to the status. Let me know what you think. This is how it looks after the change:
{ 'applications': { 'default': { 'deployed_app_config': None,
'deployments': { 'autoscaling_app': { 'deployment_config': { 'autoscaling_config': { 'downscale_delay_s': 600.0,
'downscale_smoothing_factor': None,
'look_back_period_s': 30.0,
'max_replicas': 10,
'metrics_interval_s': 10.0,
'min_replicas': 1,
'policy_name': 'replica_queue_length_autoscaling_policy', ## <---new field
'smoothing_factor': 1.0,
'target_num_ongoing_requests_per_replica': 1.0,
'upscale_delay_s': 30.0,
'upscale_smoothing_factor': None},
'graceful_shutdown_timeout_s': 20.0,
'graceful_shutdown_wait_loop_s': 2.0,
'health_check_period_s': 10.0,
'health_check_timeout_s': 30.0,
'max_concurrent_queries': 100,
'name': 'autoscaling_app',
'ray_actor_options': { 'num_cpus': 1.0,
'runtime_env': { }},
'user_config': None},
'message': '',
'name': 'autoscaling_app',
'replicas': [ { 'actor_id': '2777ad457fa57bbb2480273301000000',
'actor_name': 'SERVE_REPLICA::default#autoscaling_app#qk092qkb',
'log_file_path': '/serve/replica_default_autoscaling_app_qk092qkb.log',
'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
'node_ip': '127.0.0.1',
'pid': 43655,
'replica_id': 'default#autoscaling_app#qk092qkb',
'start_time_s': 1705099874.064668,
'state': <ReplicaState.RUNNING: 'RUNNING'>,
'worker_id': 'a91e56f6fcd2d97bf8ec0a1cb0d036a3066bd4527ca5ccd9ba9a0821'}],
'status': <DeploymentStatus.HEALTHY: 'HEALTHY'>,
'status_trigger': <DeploymentStatusTrigger.CONFIG_UPDATE_COMPLETED: 'CONFIG_UPDATE_COMPLETED'>}},
'docs_path': None,
'last_deployed_time_s': 1705099873.9227,
'message': '',
'name': 'default',
'route_prefix': '/',
'status': <ApplicationStatus.RUNNING: 'RUNNING'>}},
'controller_info': { 'actor_id': 'abe00c23c7d83f4b75ad022a01000000',
'actor_name': 'SERVE_CONTROLLER_ACTOR',
'log_file_path': '/serve/controller_43653.log',
'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
'node_ip': '127.0.0.1',
'worker_id': '0531e60d5965a616fca6baee32ac311bdfdabf420a744b77f8b6ee28'},
'grpc_options': {},
'http_options': {'host': '0.0.0.0'},
'proxies': { '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38': { 'actor_id': 'e3a47ef12f2a2c8848e5875e01000000',
'actor_name': 'SERVE_PROXY_ACTOR-2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
'log_file_path': '/serve/proxy_127.0.0.1.log',
'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
'node_ip': '127.0.0.1',
'status': <ProxyStatus.HEALTHY: 'HEALTHY'>,
'worker_id': 'dd0916116d09ae0bb83a8b7b53459465bb60101a04048805a14105ca'}},
'proxy_location': <ProxyLocation.HeadOnly: 'HeadOnly'>,
'target_capacity': None}
Including a field in the response that isn't part of the input config schema is a little odd, and it breaks the existing behavior where this output can be used directly as input to apply the config.
I'd propose the following:
- Include an autoscaling_policy field.
- If the user gave an import path, we return it in the field.
- If the user gave a Callable, we get its import path to the best of our ability (f"{cls.__module__}.{cls.__name__}") and return it in the field.
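The proposal above could be sketched as follows (hypothetical helper name; assumes the callable's module and name are enough to reconstruct an import path, which is a best-effort heuristic as noted):

```python
from typing import Callable, Union

def policy_import_path(policy: Union[str, Callable]) -> str:
    """Return an import-path string for an autoscaling policy that may be
    given either as an import path or as a callable."""
    # If the user already gave an import path, return it unchanged.
    if isinstance(policy, str):
        return policy
    # Otherwise derive "module.name" from the callable, best effort.
    return f"{policy.__module__}.{policy.__name__}"
```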
Signed-off-by: Gene Su <[email protected]>
@edoakes @sihanwang41 Addressed all the comments, PTAL again when you have a chance :)
json.dumps(details)

# ensure internal field, serialized_policy_def, is not exposed
application = details["applications"]["app1"]
nit: the test case name says it checks JSON serialization. I'd suggest adding an expected result and doing a strict comparison between the JSON output and the expected output, so that we can catch new fields being leaked in the future.
I feel the test should only check the behavior we expect: in this case, that the output is JSON-serializable and does not contain the internal field (serialized_policy_def). Checking the exact output would be too rigid and hard to maintain. If it's not a strong preference, I would keep this as is :)
@pytest.mark.parametrize(
    "policy", [None, DEFAULT_AUTOSCALING_POLICY, default_autoscaling_policy]
)
def test_get_serve_instance_details_json_serializable(serve_instance, policy):
nit: same here. we don't need to check serialized_policy_def specifically.
Signed-off-by: Gene Su <[email protected]>
… autoscaling config Signed-off-by: Gene Su <[email protected]>
Why are these changes needed?
Following TDD, this PR adds test cases that reveal the issue, then fixes it by serializing the bytes object into a base64-encoded string.
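The base64 approach described above can be sketched like this (illustrative only; field names such as serialized_policy_def mirror the discussion, and the recursive helper is a hypothetical simplification of the actual fix):

```python
import base64
import json

def encode_bytes_fields(obj):
    """Recursively replace bytes values with base64-encoded strings so the
    structure survives json.dumps."""
    if isinstance(obj, bytes):
        return base64.b64encode(obj).decode("utf-8")
    if isinstance(obj, dict):
        return {k: encode_bytes_fields(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [encode_bytes_fields(v) for v in obj]
    return obj

# A bytes field like a pickled policy would make json.dumps raise TypeError;
# after encoding, serialization succeeds.
config = {"serialized_policy_def": b"\x80\x04"}
print(json.dumps(encode_bytes_fields(config)))
```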
Related issue number
Closes: #42319
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.