Pod network bandwidth usage #3602

Open
SilinPavel opened this issue Jul 10, 2024 · 1 comment
Labels
kind/enhancement New feature or request

SilinPavel commented Jul 10, 2024

Background
Let's introduce a mechanism for monitoring, notifying about, and (if possible) regulating network usage by pods.

This mechanism should be similar to the one we have for IDLE RUN monitoring.
Network usage data can be obtained from heapster elk.

The monitoring could be implemented as follows:

  • Introduce a set of System Preferences to configure the behaviour (system.pod.bandwidth.limit, system.pod.bandwidth.action, system.pod.bandwidth.action.backoff.period)
  • Introduce a new email notification, HIGH_CONSUMED_NETWORK_BANDWIDTH
  • Based on the new preferences, implement logic similar to idle run monitoring:
    • If system.pod.bandwidth.limit == 0, skip the check
    • If a pod consumes more network bandwidth than system.pod.bandwidth.limit for the configured period of time, notify users by email (based on the email notification settings) and put a label on the run
    • If the consumption is still above the limit after system.pod.bandwidth.action.backoff.period, perform the configured action
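
The decision flow in the list above could be sketched roughly as follows. This is only an illustration in Python, not the project's code: the `decide` function, the `Decision` enum and all parameter names are hypothetical, the units of the limit are not stated in this issue, and the consumption value is assumed to be already averaged over the configured period.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    SKIP = "skip"      # check disabled, or consumption is within the limit
    NOTIFY = "notify"  # send HIGH_CONSUMED_NETWORK_BANDWIDTH and label the run
    WAIT = "wait"      # user already notified, backoff period not over yet
    ACT = "act"        # backoff period is over, perform the configured action


def decide(limit: float,
           consumption: float,
           notified_at: Optional[float],
           now: float,
           backoff_period: float) -> Decision:
    """Pick the next step for one run (hypothetical helper).

    limit          -- value of system.pod.bandwidth.limit (0 disables the check)
    consumption    -- assumed: usage averaged over the configured period
    notified_at    -- time of the first notification, or None if not yet sent
    backoff_period -- value of system.pod.bandwidth.action.backoff.period,
                      assumed to be in the same time units as `now`
    """
    if limit == 0:
        return Decision.SKIP
    if consumption <= limit:
        return Decision.SKIP
    if notified_at is None:
        return Decision.NOTIFY
    if now - notified_at >= backoff_period:
        return Decision.ACT
    return Decision.WAIT
```

Keeping the decision a pure function of the preferences and the run state makes it easy to unit-test separately from the email and tagging side effects.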

SilinPavel commented Jul 18, 2024

Additional considerations about the action

Let's implement the following approach:

  • New API method POST /run/{id}/network/limit?boundary=<int>

    • This method will set a special tag on the run based on the boundary param: NETWORK_LIMIT: <boundary>
    • Only admins should be able to call it and set this tag
  • Scheduled daemon on the API that will perform actual limiting

    • The daemon should be active only on the API leader
    • The daemon will run every <system.pod.bandwidth.limit.daemon.timeout> (to reconfigure the daemon schedule, please reuse the Observable mechanism for SystemPreferences, e.g. see AbstractSchedulingManager)
    • If a run is marked as NETWORK_LIMIT: <boundary>, the daemon should invoke DockerContainerOperationManager, which will then execute an ssh command on the target node to actually limit the bandwidth
    • After the limiting script runs successfully, the daemon will set an additional tag NETWORK_LIMIT_<SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX>: <timestamp> to indicate that the limit was actually applied and when
    • If a run no longer has NETWORK_LIMIT: <boundary>, the daemon should remove the limit as well
  • Script to limit bandwidth on a node
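
One possible shape for the daemon's per-run step is sketched below in Python for illustration only. The function `reconcile_run`, its parameters, and the returned status strings are hypothetical. It assumes that the presence of the timestamp tag is what tells the daemon a limit is currently applied, and that the timestamp tag is dropped once the limit is removed; neither detail is stated in this issue. The `apply_limit` and `remove_limit` callbacks stand in for the call through DockerContainerOperationManager and the node-side script.

```python
from typing import Callable, Dict

NETWORK_LIMIT_TAG = "NETWORK_LIMIT"


def reconcile_run(tags: Dict[str, str],
                  date_tag: str,
                  apply_limit: Callable[[int], None],
                  remove_limit: Callable[[], None],
                  timestamp: str) -> str:
    """Bring one run's actual limit in line with its tags (hypothetical helper).

    tags      -- the run's tags; mutated in place
    date_tag  -- full name of the NETWORK_LIMIT_<date suffix> tag, passed in
                 because the suffix comes from a system preference
    timestamp -- value to store in date_tag after a successful limiting call
    """
    boundary = tags.get(NETWORK_LIMIT_TAG)
    applied = date_tag in tags  # assumption: this tag marks an active limit

    if boundary is not None and not applied:
        # If apply_limit raises, the timestamp tag is not set, so the
        # daemon retries on its next scheduled run.
        apply_limit(int(boundary))
        tags[date_tag] = timestamp
        return "limited"

    if boundary is None and applied:
        remove_limit()
        del tags[date_tag]  # assumption: drop the hint once the limit is gone
        return "released"

    return "unchanged"
```

With this shape, a run whose boundary is changed while a limit is already applied is left as-is; handling that case would need the applied boundary to be stored somewhere, which this issue does not describe.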

SilinPavel added a commit that referenced this issue Aug 22, 2024
…tionOnNetworkConsumingRun to reuse existing method (#3648)
SilinPavel added a commit that referenced this issue Aug 23, 2024
…idthMonitoringServiceCore for special tag NETWORK_LIMIT + SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX
SilinPavel added a commit that referenced this issue Aug 28, 2024
…tionOnNetworkConsumingRun to reuse existing method (#3648)

(cherry picked from commit 26ef381)
SilinPavel added a commit that referenced this issue Aug 28, 2024
SilinPavel added a commit that referenced this issue Aug 28, 2024
…idthMonitoringServiceCore for special tag NETWORK_LIMIT + SystemPreferences.SYSTEM_RUN_TAG_DATE_SUFFIX

(cherry picked from commit 66b5ba7)
SilinPavel added a commit that referenced this issue Aug 28, 2024
NShaforostov added a commit that referenced this issue Sep 2, 2024
* (Issue #3619) 'Runs archiving' doc
* (Issue #3573) 'Container limits' doc
* (Issue #3568) 'Compose a Dockerfile' doc
* (Issue #3576) 'GPU statistics monitor' doc
* (Issue #3602) 'Pod network consumption alert and restriction' doc