About failures due to exceeded resource quota #1913
As far as I'm aware, resource quotas are a K8s concept that Argo does not know about. They are managed by a cluster admin and restrict how many resources a specific namespace can use. If running workflows produces this error, might it be that your specific namespace in your specific cluster is running out of quota? If so, this is not a problem that Argo could solve.
Hello @simster7, thank you for the answer! Correct, it is a K8s concept.
That is precisely the issue: the namespace is temporarily out of quota, for example because lots of pods are currently running or memory is short.
I disagree. I believe the correct behaviour would be to wait for the resource quota to become available before executing the next step of a Workflow. Argo is, after all, a workload management engine, isn't it? How am I supposed to use Argo for container orchestration if it doesn't let me follow the very basic Kubernetes rule (that pods are created when there are resources to do so) and fails instantly?

Consider the following example: I have a workflow in which I only submit resources. In a Workflow step I submit a resource (say, a Job), and the Workflow immediately fails because it is not able to create the Pod for it. However, one second later the Job is run by K8s because the quota has been freed for it. The Workflow nevertheless stays failed. That, in my opinion, is quite a problem, and it makes it nearly impossible to use Argo Workflows in an environment with strict quotas. I do want to use it though, because Argo would be wicked if it weren't for this damn thing! :D

Cheers,
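For reference, the kind of restriction I'm talking about is a plain Kubernetes ResourceQuota on the namespace. A minimal, hypothetical example (all names and numbers made up) that makes the API server reject pod creation outright once the limits are hit:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota          # hypothetical quota name
  namespace: my-team        # hypothetical namespace
spec:
  hard:
    pods: "10"              # at most 10 pods may exist in the namespace
    requests.cpu: "8"       # total CPU all pods may request
    requests.memory: 16Gi   # total memory all pods may request
```

Once any of these limits is reached, further pod creations in the namespace are rejected immediately rather than queued, which is exactly what trips up the Workflow.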
(I edited your comment above to distinguish the quotes from your responses.) Ah, I see what you're saying now. If the case is as you've described, I agree that we need some sort of spec/logic to fix this. I haven't tried it here yet, but would using
Thanks for the edit! :) I actually do see your point now as well after giving it more thought, and I think you might be right in the sense that Argo is probably not to blame (but it might be the saviour :) ). I'll get back to the
In both cases, the Workflow is considered failed (unless treated otherwise). Now, in case of a) the

The b) case is where we're doomed, because it's completely decoupled, and I agree that in that particular case Argo has done its job and submitted the resource. However, what's missing here is the ability to act upon these failures. I think a potential solution to combat the b) problem would be to introduce a better decision mechanism. I can elaborate a little bit more if you will, but the rough idea would be to have a step like:

```yaml
steps:
  - - name: resource-submission
      onFailure:
        template: retry-if-exceeded-quota
```

and probably to improve the

I am looking forward to your response and comments,
What sort of condition would

I think this falls in the domain of
I second that. We use Argo Workflows with GPUs, and as of now a limit on the namespace causes workflows to fail. To make it work one needs to set the number of retries to unlimited, and this is very bad: suppose there is an error in the pod itself.
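A bounded `retryStrategy` with exponential backoff is one middle ground between failing instantly and retrying forever. A minimal sketch (values are illustrative, not a recommendation; the quota rejection still counts as a failure, it just gets retried with increasing delays):

```yaml
templates:
  - name: resource-submission
    retryStrategy:
      limit: 10                # give up eventually instead of retrying forever
      retryPolicy: Always      # retry on both errors and failures
      backoff:
        duration: "30s"        # wait before the first retry
        factor: 2              # double the wait on each subsequent retry
        maxDuration: "10m"     # cap the total time spent retrying
    container:
      image: alpine:3.12       # hypothetical step image
      command: [sh, -c, "echo submitting resource"]
```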
I'm relatively new to Argo, but I've already seen similar issues. I'm running computational jobs, and I know that each of these will occupy, for example, 1 CPU. If I have a cluster with, say, 10 CPUs but my Workflow will create 100 pods, I'd like to be able to specify that restriction at the Workflow level in the

I see these resource-based constraints as important because the current constraints on parallelism (as far as I know) don't really address this issue. I can set a limit on the total number of Workflows, but that is difficult because I don't know what resources each of the workflows will require. I can set a parallelism in the Workflow itself, but then I don't know what the cluster is capable of. What is really needed to scale things is for the workflows themselves to specify their resource constraints and have the scheduler prioritize these nodes and then schedule what the cluster can handle. Is there some way to accomplish this currently?
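To make that concrete, a hypothetical sketch of the closest approximation available today: a Workflow-level `parallelism` cap combined with per-pod resource requests, so the controller limits concurrency and the Kubernetes scheduler only places pods the cluster can actually fit (names, image, and numbers are made up):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: compute-
spec:
  entrypoint: fan-out
  parallelism: 10               # never run more than 10 pods of this Workflow at once
  templates:
    - name: fan-out
      steps:
        - - name: work
            template: compute
            withSequence:
              count: "100"      # fan out into 100 pods overall
    - name: compute
      container:
        image: python:3.9       # hypothetical compute image
        command: [python, -c, "print('working')"]
        resources:
          requests:
            cpu: "1"            # each pod asks the scheduler for 1 CPU
```

This caps concurrency per Workflow rather than per cluster, so it still doesn't answer the "what can the cluster handle" part of the question.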
@xogeny I'm going to implement it; if you'd like, you can be one of the beta-testers :) Basically the idea is to alter how pods are scheduled: not to fail when there are no resources, but rather to indicate a "Pending" state.
@jamhed I'd be very happy to try it out. This will be relatively important for us. BTW, I see you are in Prague. Turns out I'm in Prague this week. Let me know if you want to grab lunch this week and talk Argo, my treat. I'm near Karlin, BTW. |
@xogeny sure, would love to, check your mail. Actually, I have more things I'd like to discuss in this regard :)
@jamhed consider myself in for the testing :) |
@CermakM sure :) |
@CermakM jamhed/workflow-controller:v2.4.3-1 |
@jamhed that sounds interesting! Will give it a shot :) |
@CermakM, these are pretty much the changes you need to make in the Argo Helm chart:
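The exact Helm values keys differ between chart versions, but the change boils down to pointing the workflow-controller Deployment at the test image. A minimal strategic-merge patch, assuming the standard install's deployment and container names, would look roughly like this:

```yaml
# e.g. kubectl -n argo patch deployment workflow-controller --patch "$(cat controller-image-patch.yaml)"
spec:
  template:
    spec:
      containers:
        - name: workflow-controller                 # container name in the default manifests
          image: jamhed/workflow-controller:v2.4.3-1
```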
@CermakM let me backport it to 2.6.1, and I'll open a pull request.
We are considering making this default behaviour in v2.11. Thoughts? |
I would say it's much better behavior in comparison to the current one. So +1 on my side for this usability improvement. |
Thank you. I've created a new image for testing if you would like to try it: |
I've created another test image: argoproj/workflow-controller:fix-3791. Can you please try it out to confirm it fixes your problem? |
Hey, what's the current status on this? Has this feature been released? I can see the flag in the API reference, but the default behaviour isn't documented.
We've installed your build argoproj/workflow-controller:fix-3791 (sha256:2cc4166ce). I can confirm the workflow behaves much more stably with respect to resources in comparison to

Is there anything to observe in the logs (I didn't see any relevant messages)?
Thank you. I wanted to verify it worked better. |
Thank you for this fix. In what release of Argo can we expect this feature to be present?
v2.11 |
@alexec After some time we spotted an issue. Workflows fail (interestingly, they do not get deleted based on the TTL strategy configuration) and stay in the cluster. I can see

Checking cluster events, there was nothing suspicious. The workflow controller produces the following log:
Not sure if this line can have any impact on the issue (I suspect not?!):
Pods may be deleted manually, due to scale-down events in your cluster, or for other reasons. When this happens, unless you have `resubmitPendingPods: true`, the node fails. See #3918
Thanks! It looks like setting
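For anyone else landing here, a minimal, hypothetical sketch of where that flag would sit, assuming it is a boolean at the Workflow spec level (check the API reference for your version before relying on this):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: quota-sensitive-
spec:
  entrypoint: main
  resubmitPendingPods: true   # assumption: spec-level flag to recreate pods deleted while Pending
  templates:
    - name: main
      container:
        image: alpine:3.12    # hypothetical step image
        command: [sh, -c, "echo hello"]
```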
Motivation
Hello there! Some time back I asked a question on the Slack channel and still haven't got any advice on the problem.
I need some help and advice about the way Argo handles resource quotas. We're hitting the problem repeatedly in our namespace: a workflow fails because of quota limits and is not retried later on.
An example Workflow result:
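For illustration, the quota rejection coming back from the Kubernetes API typically looks something like this (values are made up, not the actual output):

```
pods "my-step-1234567890" is forbidden: exceeded quota: team-quota,
requested: requests.cpu=1, used: requests.cpu=8, limited: requests.cpu=8
```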
Is there any advice with respect to the Workflow reconciliation? Any existing solutions? Does / should workflow-controller take care of that?
Summary
All in all, I need to know
a) whether the problem is on our side
b) whether there is an easy way to get around failed Workflows due to resource quotas
c) whether somebody else is hitting the issue
d) whether there are any plans on the Argo side regarding this and/or how I can contribute
I am ready and willing to go ahead and see to the implementation myself, I'm just not experienced enough to tell whether this is something that can be implemented and how to go about it. Again, any pointers are welcome! :)
Cheers,
Marek