
Inference autoscaling #109667

Merged
merged 53 commits into from
Jul 9, 2024

Conversation

jan-elastic
Contributor

@jan-elastic jan-elastic commented Jun 13, 2024

Notes:

  • x-pack/dev-tools contains some scripts for testing / experimentation. The most important script (simulate-autoscaling) comes from this PR.
  • tests are missing (I'm working on that right now)
  • it's named "inference autoscaling" now, but I think some people want to rebrand it; I'll postpone renaming until the name is final
  • API documentation is missing (can go to follow-up PR)
  • telemetry is missing (can go to follow-up PR)
  • integration test is missing (I'll discuss with QA)

Squashed commit of the following:

commit d98bd3d39d833329ab83a8274885473db41ed08a
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 17:27:38 2024 +0200

    Increase measurement interval to 10secs

commit e808ae5be52c5ea4d5ff8ccb881a4a80de0254f9
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 17:09:33 2024 +0200

    jump -> jumps

commit c38cbdebfcec43e6982bb8bd1670519293161154
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 14:32:42 2024 +0200

    Remove unused estimator

commit 16101f32b539481cd4d648ebb5637a3309853552
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 14:31:30 2024 +0200

    Measure latency periodically + documentation

commit bc73bf29fde1d772701f0b71a7c8a0908669eb0f
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 12:53:19 2024 +0200

    Init variance to None

commit 0e73fa836fa9deec6ba55ef1161cc0dd71f35044
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 11:18:21 2024 +0200

    No autodetection of dynamics changes for latency

commit 75924a744d26a72835529598a6df1a2d22bdaddc
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 10:10:34 2024 +0200

    Move autoscaling code to own class

commit 23553bb8cccd6ed80ac667b12ec38a6d5562dd29
Author: Jan Kuipers <[email protected]>
Date:   Wed May 8 18:01:55 2024 +0200

    Improved autoscaling simulation

commit 2db606b2bba69d741fa231f369c633ea793294d5
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 15:01:40 2024 +0100

    Correct the dependency on allocations

commit 0e45cfbaf901cf9d440efa9b404058a67d000653
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 11:11:05 2024 +0100

    Tweak

commit a0f23a4a05875cd5df3863e5ad067b46a67c8cda
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 11:09:30 2024 +0100

    Correction

commit f9cdb140d298bd99c64c79f020c058d60bfba134
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 09:57:59 2024 +0100

    Allow extrapolation

commit 57eb1a661a2b97412f479606c23c54dfb7887f52
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 09:55:17 2024 +0100

    Simplify and estimate average duration rather than rate

commit 36dff17194f2bcf816013b112cf07d70c9eec161
Author: Tom Veasey <[email protected]>
Date:   Mon Apr 29 21:42:25 2024 +0100

    Kalman filter for simple state model for average inference duration as a function of time and allocation count

commit a1b85bd0deeabd5162f2ccd5a28672299025cee5
Author: Jan Kuipers <[email protected]>
Date:   Mon Apr 29 12:15:59 2024 +0200

    Improvements

commit 51040655fcfbfd221f2446542a955fb0f19fb145
Author: Jan Kuipers <[email protected]>
Date:   Mon Apr 29 09:33:10 2024 +0200

    Account for virtual cores / hyperthreading

commit 7a93407ecae6b6044108299a1d05f72cdf0d752a
Author: Jan Kuipers <[email protected]>
Date:   Fri Apr 26 16:58:25 2024 +0200

    Simulator for inference autoscaling.
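One of the commits above introduces a Kalman filter for a simple state model of average inference duration. As a rough illustration only, a one-dimensional Kalman filter for tracking a slowly drifting average duration could look like the sketch below (class name, noise parameters, and the single-variable state are assumptions for illustration; the actual model in this PR also depends on allocation count):

```java
// Minimal 1D Kalman filter sketch for tracking average inference duration.
// Illustrative only: the PR's real state model is richer (it accounts for
// the allocation count) and uses its own noise parameters.
public class DurationKalmanFilter {
    private double estimate;               // current estimate of the mean duration
    private double variance;               // uncertainty of that estimate
    private final double processNoise;     // how fast the true value may drift per step
    private final double measurementNoise; // noise in each observed duration

    public DurationKalmanFilter(double initialEstimate, double initialVariance,
                                double processNoise, double measurementNoise) {
        this.estimate = initialEstimate;
        this.variance = initialVariance;
        this.processNoise = processNoise;
        this.measurementNoise = measurementNoise;
    }

    /** Incorporate one observed inference duration. */
    public void update(double observedDuration) {
        variance += processNoise;                         // predict: uncertainty grows
        double gain = variance / (variance + measurementNoise);
        estimate += gain * (observedDuration - estimate); // correct toward observation
        variance *= (1.0 - gain);
    }

    public double estimate() {
        return estimate;
    }
}
```

With a large initial variance the filter trusts early observations heavily and converges quickly; the process-noise term keeps it responsive to genuine changes in inference duration later on.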
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.15.0 labels Jun 13, 2024
@jan-elastic jan-elastic marked this pull request as draft June 13, 2024 11:49
@jan-elastic jan-elastic added :ml Machine learning Team:ML Meta label for the ML team and removed needs:triage Requires assignment of a team area label labels Jun 19, 2024
@mark-vieira mark-vieira removed the request for review from a team July 2, 2024 18:52
Member

@davidkyle davidkyle left a comment


Looks great, left some minor comments

@@ -307,6 +334,12 @@ public ActionRequestValidationException validate() {
if (threadsPerAllocation < 1) {
    validationException.addValidationError("[" + THREADS_PER_ALLOCATION + "] must be a positive integer");
}
ActionRequestValidationException autoscaleException = adaptiveAllocationsSettings == null
    ? null
    : adaptiveAllocationsSettings.validate();
Member

Are there any invalid combinations of the adaptive allocation settings and the existing allocations/threads settings that should be checked for here? For example, if min_number_of_allocations > number_of_allocations.

Contributor Author

For simplicity, the behaviour is:

  • the initial number of allocations is number_of_allocations (the existing setting)
  • the scaler will adjust it to something in the range [min, max]

I don't think it's worth the effort to enforce number_of_allocations to be in the range [min, max] when enabled=true.

Member

I don't know the exact syntax and spelling, but a start deployment request looks something like:

{
    "number_of_allocations": 1,
    "threads_per_allocation": 1,
    "autoscaling": {
        "enabled": true,
        "min": 1,
        "max": 8
    }
}

If the user were to make a request where number_of_allocations is outside the [min, max] range, then they have made an error and should be told that the values are contradictory. This helps the user avoid mistakes.

{
    "number_of_allocations": 12,
    "threads_per_allocation": 1,
    "autoscaling": {
        "enabled": true,
        "min": 1,
        "max": 8
    }
}
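The cross-field check suggested here could be sketched as follows (class and method names are hypothetical, not the actual Elasticsearch code; the real validation builds an ActionRequestValidationException rather than returning a message string):

```java
// Hypothetical sketch of the suggested cross-field validation: reject a
// request whose fixed number_of_allocations falls outside the adaptive
// [min, max] range. Names are assumptions for illustration only.
public class AdaptiveAllocationsValidation {

    /** Returns an error message, or null if the combination is valid. */
    public static String validate(int numberOfAllocations, boolean adaptiveEnabled, int min, int max) {
        if (!adaptiveEnabled) {
            return null; // fixed allocation count, nothing to cross-check
        }
        if (min > max) {
            return "[min_number_of_allocations] must be less than or equal to [max_number_of_allocations]";
        }
        if (numberOfAllocations < min || numberOfAllocations > max) {
            return "[number_of_allocations] must be within [" + min + ", " + max
                + "] when adaptive allocations is enabled";
        }
        return null;
    }
}
```

Under this check, the first request above (number_of_allocations: 1 with range [1, 8]) would pass, while the second (number_of_allocations: 12) would be rejected as contradictory.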

Contributor Author

OK, I guess you're right. I've updated the API like this: when adaptive allocations is enabled, setting number_of_allocations is not allowed. It would be useless anyway, because the system would undo your change within 10 seconds.
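The final behaviour described here could be sketched as a simpler check (again with hypothetical names, not the actual implementation):

```java
// Hypothetical sketch of the agreed behaviour: an explicit
// number_of_allocations is rejected outright whenever adaptive allocations
// is enabled. Names are assumptions for illustration only.
public class StartDeploymentValidation {

    /** Returns an error message, or null if the request is valid. */
    public static String validate(Integer numberOfAllocations, boolean adaptiveEnabled) {
        if (adaptiveEnabled && numberOfAllocations != null) {
            return "[number_of_allocations] may not be set when adaptive allocations is enabled";
        }
        return null;
    }
}
```

This sidesteps the contradictory-range question entirely: with adaptive allocations enabled, the scaler alone controls the allocation count.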

@jan-elastic
Contributor Author

@davidkyle Processed your comments. PTAL

Member

@davidkyle davidkyle left a comment


LGTM

@jan-elastic jan-elastic merged commit 4cfe6cc into main Jul 9, 2024
17 checks passed
@jan-elastic jan-elastic deleted the inference-autoscaling branch July 9, 2024 09:51
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >feature :ml Machine learning Team:ML Meta label for the ML team v8.16.0