Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] fix autoscaling json serialization issue #42339

Merged

Conversation

GeneDer
Copy link
Contributor

@GeneDer GeneDer commented Jan 11, 2024

Why are these changes needed?

TDD ensure to add test cases that will reveal the issue and fix the issue by serialize bytes object into base64 encoded string.

Related issue number

Closes: #42319

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@GeneDer GeneDer marked this pull request as ready for review January 11, 2024 18:38
@GeneDer GeneDer requested a review from a team January 11, 2024 18:38
@GeneDer
Copy link
Contributor Author

GeneDer commented Jan 11, 2024

Shoutout to @zcin for finding this issue!

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialized class definition should not be returned in ServeInstanceDetails at all. This is a user-facing struct that is returned in the REST API and the serialized class definition is purely an implementation detail.

@GeneDer GeneDer force-pushed the fix-autoscaling-config-json-serializable branch from 5f0906e to 8ee4bd6 Compare January 11, 2024 22:19
Comment on lines +1050 to +1052
# `serialized_policy_def` is only used internally and should not be exposed to
# the REST api. This method iteratively removes it from each autoscaling config
# if exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what'll happen here if the user passed a class definition directly for the autoscaling policy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

policy should carry the callable, serialized_policy_def should continue to be hidden away from users

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but a callable is not JSON-serializable, so won't this error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will the JSON serialized output be if the user passes in a callable?

Copy link
Contributor Author

@GeneDer GeneDer Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the policy is actually not part of the response.

{ 'applications': { 'default': { 'deployed_app_config': None,
                                 'deployments': { 'autoscaling_app': { 'deployment_config': { 'autoscaling_config': { 'downscale_delay_s': 600.0,
                                                                                                                      'downscale_smoothing_factor': None,
                                                                                                                      'look_back_period_s': 30.0,
                                                                                                                      'max_replicas': 10,
                                                                                                                      'metrics_interval_s': 10.0,
                                                                                                                      'min_replicas': 1,
                                                                                                                      'smoothing_factor': 1.0,
                                                                                                                      'target_num_ongoing_requests_per_replica': 1.0,
                                                                                                                      'upscale_delay_s': 30.0,
                                                                                                                      'upscale_smoothing_factor': None},
                                                                                              'graceful_shutdown_timeout_s': 20.0,
                                                                                              'graceful_shutdown_wait_loop_s': 2.0,
                                                                                              'health_check_period_s': 10.0,
                                                                                              'health_check_timeout_s': 30.0,
                                                                                              'max_concurrent_queries': 100,
                                                                                              'name': 'autoscaling_app',
                                                                                              'ray_actor_options': { 'num_cpus': 1.0,
                                                                                                                     'runtime_env': { }},
                                                                                              'user_config': None},
                                                                       'message': '',
                                                                       'name': 'autoscaling_app',
                                                                       'replicas': [ { 'actor_id': '171b2a617ac029cb037c941c01000000',
                                                                                       'actor_name': 'SERVE_REPLICA::default#autoscaling_app#9fydf52f',
                                                                                       'log_file_path': '/serve/replica_default_autoscaling_app_9fydf52f.log',
                                                                                       'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
                                                                                       'node_ip': '127.0.0.1',
                                                                                       'pid': 39584,
                                                                                       'replica_id': 'default#autoscaling_app#9fydf52f',
                                                                                       'start_time_s': 1705093459.648952,
                                                                                       'state': <ReplicaState.RUNNING: 'RUNNING'>,
                                                                                       'worker_id': '699cf6124f76f5c6902956ef7e859c59f580c98202c121109eb88815'}],
                                                                       'status': <DeploymentStatus.HEALTHY: 'HEALTHY'>,
                                                                       'status_trigger': <DeploymentStatusTrigger.CONFIG_UPDATE_COMPLETED: 'CONFIG_UPDATE_COMPLETED'>}},
                                 'docs_path': None,
                                 'last_deployed_time_s': 1705093459.4853802,
                                 'message': '',
                                 'name': 'default',
                                 'route_prefix': '/',
                                 'status': <ApplicationStatus.RUNNING: 'RUNNING'>}},
  'controller_info': { 'actor_id': '9582543ca6801417fcd6370601000000',
                       'actor_name': 'SERVE_CONTROLLER_ACTOR',
                       'log_file_path': '/serve/controller_39559.log',
                       'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
                       'node_ip': '127.0.0.1',
                       'worker_id': 'f4f421a9b26e8f281f3c2f40cffd6c96069e4bedc4dc00937d3cc0a6'},
  'grpc_options': {},
  'http_options': {'host': '0.0.0.0'},
  'proxies': { '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e': { 'actor_id': '35092b45d842d61dab6acc9701000000',
                                                                             'actor_name': 'SERVE_PROXY_ACTOR-39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
                                                                             'log_file_path': '/serve/proxy_127.0.0.1.log',
                                                                             'node_id': '39d23fbf6cc4e348a7eae48648ed36c1644951a06daf12d3ba43658e',
                                                                             'node_ip': '127.0.0.1',
                                                                             'status': <ProxyStatus.HEALTHY: 'HEALTHY'>,
                                                                             'worker_id': '3e4653db9129065ab21604b19aed6c4e22dc464120a8b0ab96346317'}},

let me see why that is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is due to we do not include policy in the AutoscalingConfig

bytes serialized_policy_def = 12;
so when it's serialized the data back to deployment info, it doesn't contain that field. Kinda try out a few things, and just decided to add a new user facing field policy_name in the status. Let me know what you think. This is how it looks after

{ 'applications': { 'default': { 'deployed_app_config': None,
                                 'deployments': { 'autoscaling_app': { 'deployment_config': { 'autoscaling_config': { 'downscale_delay_s': 600.0,
                                                                                                                      'downscale_smoothing_factor': None,
                                                                                                                      'look_back_period_s': 30.0,
                                                                                                                      'max_replicas': 10,
                                                                                                                      'metrics_interval_s': 10.0,
                                                                                                                      'min_replicas': 1,
                                                                                                                      'policy_name': 'replica_queue_length_autoscaling_policy',  ## <---new field
                                                                                                                      'smoothing_factor': 1.0,
                                                                                                                      'target_num_ongoing_requests_per_replica': 1.0,
                                                                                                                      'upscale_delay_s': 30.0,
                                                                                                                      'upscale_smoothing_factor': None},
                                                                                              'graceful_shutdown_timeout_s': 20.0,
                                                                                              'graceful_shutdown_wait_loop_s': 2.0,
                                                                                              'health_check_period_s': 10.0,
                                                                                              'health_check_timeout_s': 30.0,
                                                                                              'max_concurrent_queries': 100,
                                                                                              'name': 'autoscaling_app',
                                                                                              'ray_actor_options': { 'num_cpus': 1.0,
                                                                                                                     'runtime_env': { }},
                                                                                              'user_config': None},
                                                                       'message': '',
                                                                       'name': 'autoscaling_app',
                                                                       'replicas': [ { 'actor_id': '2777ad457fa57bbb2480273301000000',
                                                                                       'actor_name': 'SERVE_REPLICA::default#autoscaling_app#qk092qkb',
                                                                                       'log_file_path': '/serve/replica_default_autoscaling_app_qk092qkb.log',
                                                                                       'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
                                                                                       'node_ip': '127.0.0.1',
                                                                                       'pid': 43655,
                                                                                       'replica_id': 'default#autoscaling_app#qk092qkb',
                                                                                       'start_time_s': 1705099874.064668,
                                                                                       'state': <ReplicaState.RUNNING: 'RUNNING'>,
                                                                                       'worker_id': 'a91e56f6fcd2d97bf8ec0a1cb0d036a3066bd4527ca5ccd9ba9a0821'}],
                                                                       'status': <DeploymentStatus.HEALTHY: 'HEALTHY'>,
                                                                       'status_trigger': <DeploymentStatusTrigger.CONFIG_UPDATE_COMPLETED: 'CONFIG_UPDATE_COMPLETED'>}},
                                 'docs_path': None,
                                 'last_deployed_time_s': 1705099873.9227,
                                 'message': '',
                                 'name': 'default',
                                 'route_prefix': '/',
                                 'status': <ApplicationStatus.RUNNING: 'RUNNING'>}},
  'controller_info': { 'actor_id': 'abe00c23c7d83f4b75ad022a01000000',
                       'actor_name': 'SERVE_CONTROLLER_ACTOR',
                       'log_file_path': '/serve/controller_43653.log',
                       'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
                       'node_ip': '127.0.0.1',
                       'worker_id': '0531e60d5965a616fca6baee32ac311bdfdabf420a744b77f8b6ee28'},
  'grpc_options': {},
  'http_options': {'host': '0.0.0.0'},
  'proxies': { '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38': { 'actor_id': 'e3a47ef12f2a2c8848e5875e01000000',
                                                                             'actor_name': 'SERVE_PROXY_ACTOR-2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
                                                                             'log_file_path': '/serve/proxy_127.0.0.1.log',
                                                                             'node_id': '2f61e8786194040ab4a43972a15dc2d1de41e368a36a1061bbbc9e38',
                                                                             'node_ip': '127.0.0.1',
                                                                             'status': <ProxyStatus.HEALTHY: 'HEALTHY'>,
                                                                             'worker_id': 'dd0916116d09ae0bb83a8b7b53459465bb60101a04048805a14105ca'}},
  'proxy_location': <ProxyLocation.HeadOnly: 'HeadOnly'>,
  'target_capacity': None}

Copy link
Contributor

@edoakes edoakes Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including a field in the response that isn't part of the input config schema is a little odd and breaks the existing behavior where this output can be directly used as input to apply the config

I'd propose the following:

  • Include the autoscaling_policy field.
  • If the user gave an import path, we return it in the field.
  • If the user gave a Callable, we get its import path to the best of our ability (f"{cls.__module__}.{cls.__name__}") and return it in the field

Signed-off-by: Gene Su <[email protected]>
Signed-off-by: Gene Su <[email protected]>
Copy link
Contributor Author

@GeneDer GeneDer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes @sihanwang41 Addressed all the comments, PTAL again when you have a chance :)

json.dumps(details)

# ensure internal field, serialized_policy_def, is not exposed
application = details["applications"]["app1"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the test case name is checking json serialization, suggest to add expected result and do strict compare between json output and expected output, so that we can prevent other fields in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the test should only check the behavior we expect, in this case the output should be json serializable and not containing internally field (serialized_policy_def). Checking the exact output would be too rigid and hard to maintain. If not a strong preference I would keep this as it :)

@pytest.mark.parametrize(
"policy", [None, DEFAULT_AUTOSCALING_POLICY, default_autoscaling_policy]
)
def test_get_serve_instance_details_json_serializable(serve_instance, policy):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same here. we don't need to check serialized_policy_def specifically.

@edoakes edoakes merged commit 0e02dbf into ray-project:master Jan 16, 2024
9 checks passed
@GeneDer GeneDer deleted the fix-autoscaling-config-json-serializable branch January 17, 2024 00:07
zcin pushed a commit to zcin/ray that referenced this pull request Jan 17, 2024
zcin pushed a commit to zcin/ray that referenced this pull request Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[serve] GET rest api fails when there are autoscaling deployments
3 participants