
About failures due to exceeded resource quota #1913

Closed
CermakM opened this issue Jan 7, 2020 · 30 comments · Fixed by #2385

Comments

@CermakM
Contributor

CermakM commented Jan 7, 2020

Motivation

Hello there! Some time back I asked a question on the Slack channel and still haven't got any advice on the problem.

I need some help and advice about the way Argo handles resource quotas. We're hitting the problem repeatedly in our namespace: a workflow fails because of quota limits and is not retried later on.

An example Workflow result:

pods "workflow-test-1578387522-82e06118" is forbidden: exceeded quota: diamand-quota, requested: limits.cpu=2, used: limits.cpu=23750m, limited: limits.cpu=24

Is there any advice with respect to Workflow reconciliation? Any existing solutions? Does (or should) the workflow-controller take care of that?

Summary

All in all, I need to know:

a) whether the problem is on our side
b) whether there is an easy way to work around Workflows failing due to resource quotas
c) whether somebody else is hitting the issue
d) whether there are any plans on the Argo side regarding this and/or how I can contribute

I am ready and willing to go ahead and work on the implementation myself; I'm just not experienced enough to tell whether this is something that can be implemented and how to go about it. Again, any pointers are welcome! :)

Cheers,
Marek

@simster7
Member

simster7 commented Jan 7, 2020

As far as I'm aware, resource quotas are a K8s concept that Argo does not know about. They are managed by a cluster admin and restrict how many resources a specific namespace can use. If running workflows produces this error, might it be that your specific namespace in your specific cluster is running out of quota? If so, this is not a problem that Argo could solve.
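
For reference, a ResourceQuota is a namespaced object created by a cluster admin; a minimal sketch matching the error in the issue description might look like the following (the namespace name is illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: diamand-quota
  namespace: my-namespace   # illustrative; use the namespace the Workflows run in
spec:
  hard:
    limits.cpu: "24"        # matches the limit reported in the error above

Once the quota is exhausted, the API server rejects new pod creation with the "exceeded quota" error shown above.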

@CermakM
Contributor Author

CermakM commented Jan 8, 2020

Hello @simster7 , thank you for the answer!

Correct, it is a K8s concept.

If running workflows is providing this error, might it be that your specific namespace in your specific cluster is running out of quotas?

That is precisely the issue: the namespace is temporarily out of quota, for example because lots of pods are currently running or memory is scarce.

this is not a problem that Argo could solve

I disagree. I believe the correct behaviour would be to wait for the resource quota to become available before executing the next step of a Workflow. Argo is, after all, a workload management engine, isn't it? How am I supposed to use Argo for container orchestration if it doesn't let me follow the very basic Kubernetes rule (that pods are created when there are resources to do so) and fails instantly?

Consider the following example:

I have a workflow in which I only submit resources. In a Workflow step, I submit a resource (let's say a Job) and the Workflow immediately fails because it is not able to create the Pod for it. However, one second later the Job is run by K8s because the quota has been freed for it. The Workflow nevertheless stays failed. That, in my opinion, is quite a problem, and it makes it nearly impossible to use Argo Workflows in an environment with strict quotas. I do want to use it, though, because Argo would be wicked if it weren't for this damn thing! :D

Cheers,
Marek

@simster7
Member

simster7 commented Jan 8, 2020

(I edited your comment above to distinguish the quotes and your responses)

Ah, I see what you're saying now. If that is the case as you've described, I agree that we need some sort of spec/logic to fix this.

I haven't tried it yet, but would using retryStrategy here help at all? If not, what do you think we could add to it to solve this issue?
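
For reference, retryStrategy is set per template and looks roughly like this today (the limit value and the template/image names are illustrative):

templates:
- name: run-step
  retryStrategy:
    limit: 3                # retry the node up to 3 times
  container:
    image: my-image:latest  # illustrative image
    command: [my-command]   # illustrative command

As it stands, it retries on any failure, whether it came from the application itself or from a quota rejection.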

@CermakM
Contributor Author

CermakM commented Jan 8, 2020

Thanks for the edit! :)

I actually do see your point now as well, after giving it deeper thought, and I think you might be right in the sense that Argo is probably not to blame (but it might be the saviour :) ).

I'll get back to retryStrategy in a sec. Please correct me if I am wrong in my reasoning. Let's consider the case of k8s resources first (which is the trickier one, I suppose, because it's basically decoupled from the Workflow):

  • Workflow resource is created
  • A pod is created -> which executes the kubectl create command -> and here one of two things happens:
    a) the command fails right away because of the quota violation
    b) the command succeeds and submits the resource to the cluster, BUT the resource's condition (for example, with a Job) is temporarily set to Failed due to exceeded quotas

In both cases, the Workflow is considered failed (unless treated otherwise).

Now, in case of a)
This is, in my opinion, something that should be handled on the Argo side. The failure reason should be detected and there should be a mechanism (which may or may not be configurable) to wait for the required resources.

The b) case is where we're doomed, because it's completely decoupled, and I agree that in that particular case Argo has done its job and submitted the resource. However, what's missing here is the ability to act upon these failures. I think that successCondition and failureCondition are not sufficient, and neither is retryStrategy, because I have no way to detect why the failure happened and I end up retrying workflow steps whose failure was justified by application misbehaviour (also, it is quite confusing, even in the UI, because it looks like the step was failing for some time when in fact it is just waiting for the quota).
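
For reference, this is roughly what the existing conditions on a resource template look like today; the Job manifest and names below are only illustrative:

- name: submit-job
  resource:
    action: create
    successCondition: status.succeeded > 0
    failureCondition: status.failed > 0
    manifest: |
      apiVersion: batch/v1
      kind: Job
      metadata:
        generateName: quota-test-
      spec:
        template:
          spec:
            containers:
            - name: main
              image: busybox
              command: ["sh", "-c", "exit 0"]
            restartPolicy: Never

Neither condition can tell apart "failed because of a bug" from "failed because the quota was temporarily exceeded".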

I think a potential solution to combat the b) problem is to introduce a better decision mechanism. I can elaborate a little bit more if you like, but the rough idea would be to have onSuccess and onFailure branching like:

steps:
- - name: resource-submission
    onFailure:
      template: retry-if-exceeded-quota

and probably to improve successCondition and failureCondition, respectively, to allow more fine-grained control over these?

I am looking forward to your response and comments,
Mark

@simster7
Member

What sort of condition would {success, failure}Condition use to determine if the issue is caused by resource quotas?

I think this falls in the domain of retryStrategy, although I agree it might be a bit off because this is technically a failure while scheduling, not a failure of the pod. Any ideas for what would be a good change to retryStrategy to alleviate this?

@jamhed
Contributor

jamhed commented Jan 24, 2020

I second that. We use Argo Workflows with GPUs, and as of now having a limit on the namespace causes workflows to fail. To make it work, one needs to set the number of retries to unlimited, which is very bad -- suppose there is an error in the pod itself.
I'd like the Argo controller to be a bit more mindful with respect to resource allocation and to take pod requests and namespace limits into consideration.
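
For reference, the workaround described above is essentially an unbounded retryStrategy, roughly like this (template and image names are illustrative):

templates:
- name: train
  retryStrategy: {}         # no limit set: the node is retried indefinitely
  container:
    image: my-gpu-image:latest

which, as noted, also keeps retrying pods that fail because of a genuine error in the pod itself.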

@xogeny

xogeny commented Feb 3, 2020

I'm relatively new to Argo, but I've already seen similar issues. I'm running computational jobs, and I know that these will occupy, for example, 1 CPU each. If I have a cluster with, say, 10 CPUs but my Workflow will create 100 pods, I'd like to be able to specify, at the Workflow level in the spec, that the work described will require 1 CPU.

I see these resource-based constraints as important because the current constraints on parallelism (as far as I know) don't really address this issue. I can set a limit on the total number of Workflows, but that is difficult because I don't know what resources each workflow will require. I can set parallelism in the Workflow itself, but then I don't know what the cluster is capable of.

What is really needed to scale things is for the workflows themselves to specify their resource constraints and have the scheduler prioritize these nodes and then schedule what the cluster can handle. Is there some way to accomplish this currently?
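
For context, the two existing knobs referred to above are Workflow-level parallelism and per-container resource requests, roughly like this (names and values are illustrative):

spec:
  parallelism: 10           # caps how many pods of this Workflow run at once
  templates:
  - name: compute
    container:
      image: my-compute-image:latest
      resources:
        requests:
          cpu: "1"          # each pod asks the K8s scheduler for 1 CPU

Neither of these makes the controller hold steps back based on what the cluster or the namespace quota can currently accommodate.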

@jamhed
Contributor

jamhed commented Feb 3, 2020

@xogeny I'm going to implement it; if you'd like, you can be one of the beta-testers :) Basically the idea is to alter how pods are scheduled: not to fail when there are no resources, but rather to indicate a "Pending" state.

@xogeny

xogeny commented Feb 3, 2020

@jamhed I'd be very happy to try it out. This will be relatively important for us. BTW, I see you are in Prague. Turns out I'm in Prague this week. Let me know if you want to grab lunch this week and talk Argo, my treat. I'm near Karlin, BTW.

@jamhed
Contributor

jamhed commented Feb 3, 2020

@xogeny Sure, would love to, check your mail. Actually, I have more things I'd like to discuss in this regard :)

@CermakM
Contributor Author

CermakM commented Feb 4, 2020

@jamhed consider me in for the testing :)

@jamhed
Contributor

jamhed commented Feb 4, 2020

@CermakM sure :)

@jamhed
Contributor

jamhed commented Feb 13, 2020

@CermakM jamhed/workflow-controller:v2.4.3-1
If it can't schedule the pod, it just keeps it in Pending state.
Details: v2.4.3...jamhed:v2.4.3-1

@CermakM
Contributor Author

CermakM commented Feb 26, 2020

@jamhed that sounds interesting! Will give it a shot :)

@jamhed
Contributor

jamhed commented Feb 26, 2020

@CermakM, these are pretty much the changes you need to make in the Argo Helm chart:

images:
  namespace: jamhed
  tag: v2.4.3-1

@CermakM
Contributor Author

CermakM commented Mar 5, 2020

@jamhed Just tested it out. Works like a charm! What a relief to see that ... are there any plans for merging this upstream?!

@simster7 🙏

@jamhed
Contributor

jamhed commented Mar 5, 2020

@CermakM let me backport it to 2.6.1, and I'll open a pull request.

@jamhed
Contributor

jamhed commented Mar 7, 2020

@CermakM @simster7 #2385

alexec linked a pull request Mar 8, 2020 that will close this issue
@alexec
Contributor

alexec commented Aug 18, 2020

We are considering making this the default behaviour in v2.11. Thoughts?

@fridex

fridex commented Aug 18, 2020

We are considering making this the default behaviour in v2.11. Thoughts?

I would say it's much better behavior in comparison to the current one. So +1 on my side for this usability improvement.

@alexec
Contributor

alexec commented Aug 18, 2020

Thank you. I've created a new image for testing if you would like to try it: argoproj/workflow-controller:fix-3791 .

@alexec
Contributor

alexec commented Aug 22, 2020

I've created another test image: argoproj/workflow-controller:fix-3791.

Can you please try it out to confirm it fixes your problem?

@YourTechBud
Contributor

Hey, what's the current status on this? Has this feature been released? I can see the flag in the API reference, but the default behaviour isn't documented.

@fridex

fridex commented Aug 26, 2020

I've created another test image: argoproj/workflow-controller:fix-3791.

Can you please try it out to confirm it fixes your problem?

We've installed your build argoproj/workflow-controller:fix-3791 (sha256:2cc4166ce). I can confirm the workflow behaves much more stably with respect to resources in comparison to v2.9.5 (I haven't tested releases newer than that).

Is there anything to observe in the logs (I didn't see any relevant messages)?

@alexec
Contributor

alexec commented Aug 26, 2020

Thank you. I wanted to verify it worked better.

@fridex

fridex commented Aug 27, 2020

Thank you. I wanted to verify it worked better.

Thank you for this fix.

In what release of Argo can we expect this feature to be present?

@alexec
Contributor

alexec commented Aug 27, 2020

v2.11

@fridex

fridex commented Sep 2, 2020

Thank you. I've created a new image for testing if you would like to try it: argoproj/workflow-controller:fix-3791 .

@alexec After some time we spotted an issue. Workflows fail (interestingly, they do not get deleted based on the TTL strategy configuration) and stay in the cluster. I can see "pod deleted" as the message. This happens for a pod that requires a relatively large amount of resources that are not available because other workflows are using them.

Checking cluster events, there was nothing suspicious. The workflow controller produces the following log:

time="2020-09-01T08:43:41Z" level=info msg="Processing workflow" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Updated phase  -> Running" namespace=thoth-backend-stage workflow=adviser-c9267fd5
E0901 08:43:41.029779       1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c64000c72ab", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448328", FieldPath:""}, Reason:"WorkflowRunning", Message:"Workflow Running", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba04f41ab50ab, ext:508109826557890, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba04f41ab50ab, ext:508109826557890, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
time="2020-09-01T08:43:41Z" level=info msg="DAG node adviser-c9267fd5 initialized Running" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="All of node adviser-c9267fd5.advise dependencies [] completed" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Pod node adviser-c9267fd5-3669748408 initialized Pending" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Mark node adviser-c9267fd5.advise as Pending, due to: pods \"adviser-c9267fd5-3669748408\" is forbidden: exceeded quota: thoth-backend-stage-quota, requested: limits.memory=6400Mi, used: limits.memory=14824Mi, limited: limits.memory=20Gi" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="node adviser-c9267fd5-3669748408 message: Pending 27.047885ms" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Released all acquired locks" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:41Z" level=info msg="Workflow update successful" namespace=thoth-backend-stage phase=Running resourceVersion=195448332 workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Processing workflow" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=warning msg="pod adviser-c9267fd5-3669748408 deleted" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-2425184343 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-1685984280 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-1448319234 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Skipped node adviser-c9267fd5-3104014563 initialized Omitted (message: omitted: depends condition not met)" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Outbound nodes of adviser-c9267fd5 set to [adviser-c9267fd5-1685984280 adviser-c9267fd5-1448319234 adviser-c9267fd5-3104014563]" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="node adviser-c9267fd5 phase Running -> Error" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="node adviser-c9267fd5 finished: 2020-09-01 08:43:51.160005601 +0000 UTC" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Checking daemoned children of adviser-c9267fd5" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Updated phase Running -> Error" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Marking workflow completed" namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Checking daemoned children of " namespace=thoth-backend-stage workflow=adviser-c9267fd5
time="2020-09-01T08:43:51Z" level=info msg="Released all acquired locks" namespace=thoth-backend-stage workflow=adviser-c9267fd5
E0901 08:43:51.161571       1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c665bf6cacc", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"workflows.argoproj.io/node-name":"adviser-c9267fd5", "workflows.argoproj.io/node-type":"DAG"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448332", FieldPath:""}, Reason:"WorkflowNodeError", Message:"Error node adviser-c9267fd5", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c989c4cc, ext:508119958577099, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c989c4cc, ext:508119958577099, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
E0901 08:43:51.166772       1 event.go:263] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"adviser-c9267fd5.16309c665bf7e5f2", GenerateName:"", Namespace:"thoth-backend-stage", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Workflow", Namespace:"thoth-backend-stage", Name:"adviser-c9267fd5", UID:"b44d8e83-14f6-482e-b001-42b0c8e1a09b", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"195448332", FieldPath:""}, Reason:"WorkflowFailed", Message:"", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c98adff2, ext:508119958649585, loc:(*time.Location)(0x2994080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcba051c98adff2, ext:508119958649585, loc:(*time.Location)(0x2994080)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)
time="2020-09-01T08:43:51Z" level=info msg="Workflow update successful" namespace=thoth-backend-stage phase=Error resourceVersion=195448520 workflow=adviser-c9267fd5

Not sure if this line can have any impact on the issue (I suspect not?!):

'events is forbidden: User "system:serviceaccount:thoth-backend-stage:argo-server" cannot create resource "events" in API group "" in the namespace "thoth-backend-stage"' (will not retry!)

@alexec
Contributor

alexec commented Sep 2, 2020

Pods may be deleted manually, due to scale-down events in your cluster, or for other reasons. When this happens, unless you have `resubmitPendingPods: true`, the node fails. See #3918

@fridex

fridex commented Sep 3, 2020

Pods may be deleted manually, due to scale-down events in your cluster, or for other reasons. When this happens, unless you have `resubmitPendingPods: true`, the node fails. See #3918

Thanks! It looks like setting resubmitPendingPods: true on the template level did the trick.
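
For anyone else hitting this, the setting goes on the template, roughly like this (the template name and container are illustrative):

templates:
- name: advise
  resubmitPendingPods: true  # re-create the pod if it is deleted while still Pending
  container:
    image: my-image:latest

With it set, the controller resubmits a pod that was deleted while Pending instead of failing the node, as described in the comment above.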
