
Updating the workflow archive entry for a retried workflow fails with "current transaction is aborted" #2427

Closed
danxmoran opened this issue Mar 13, 2020 · 5 comments · Fixed by #2434

Comments

@danxmoran
Contributor

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
One of my workflows failed with a transient error. I retried it through the UI and all of its steps succeeded, but the root node (a DAG template) was still marked as an Error. I looked through the logs and found this error in the UPPERIO debug logs:

pq: duplicate key value violates unique constraint "argo_archived_workflows_pkey"

I'm not sure if this is the cause of the Error state, but either way it looks like a problem.

What you expected to happen:
I expected the workflow controller to perform an update-or-insert when adding info to argo_archived_workflows, and not fail with a PK error on workflow retries.

How to reproduce it (as minimally and precisely as possible):
I'm not sure if this depends on the archived workflow being in a failed state, but I'm able to reproduce this every time I retry a failed workflow that's already been archived.

Anything else we need to know?:
The log with the pq error shows that the controller is trying to run a SQL UPDATE, so I don't think the current implementation is too far off from what I'd expect. The SQL I see is:

UPDATE "argo_archived_workflows" SET "workflow" = $1, "phase" = $2, "startedat" = $3, "finishedat" = $4 WHERE ("clustername" = $5 AND "uid" = $6)

The arguments I see include the (large) workflow JSON. With that payload abbreviated, they are:

<workflow-json>, "Error", time.Time{wall:0x0, ext:63719642700, loc:(*time.Location)(0x257c120)}, time.Time{wall:0x18025983, ext:63719666144, loc:(*time.Location)(nil)}, "command-center", "cfc3c39f-8fc8-4a8c-b358-e80438392288"

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@alexec alexec self-assigned this Mar 13, 2020
@alexec alexec added this to the v2.7 milestone Mar 13, 2020
@alexec
Contributor

alexec commented Mar 13, 2020

I'll investigate.

@alexec
Contributor

alexec commented Mar 13, 2020

@danxmoran you are correct: the archive should insert a new record. If it gets a primary key violation, we assume that somehow we're trying to archive the workflow twice, so we need to do an update instead.

You haven't included the error message from the workflow. Could I please ask for it as it would really help narrow this down?
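The insert-then-fall-back-to-update flow described above can be sketched outside of Argo. This is a hypothetical illustration using Python's sqlite3, not the controller's actual Go code, and the table is a toy stand-in for argo_archived_workflows:

```python
import sqlite3

# Toy stand-in for the archive table (not Argo's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE argo_archived_workflows (
        clustername TEXT,
        uid         TEXT,
        phase       TEXT,
        workflow    TEXT,
        PRIMARY KEY (clustername, uid)
    )
""")

def archive_workflow(conn, clustername, uid, phase, workflow):
    """Try to INSERT; on a duplicate key, assume the workflow was already
    archived (e.g. it was retried) and UPDATE the existing row instead."""
    try:
        conn.execute(
            "INSERT INTO argo_archived_workflows"
            " (clustername, uid, phase, workflow) VALUES (?, ?, ?, ?)",
            (clustername, uid, phase, workflow),
        )
    except sqlite3.IntegrityError:
        # Duplicate primary key: fall back to an update.
        conn.execute(
            "UPDATE argo_archived_workflows SET phase = ?, workflow = ?"
            " WHERE clustername = ? AND uid = ?",
            (phase, workflow, clustername, uid),
        )
    conn.commit()

archive_workflow(conn, "command-center", "uid-1", "Error", "{}")
archive_workflow(conn, "command-center", "uid-1", "Succeeded", "{}")  # retry
row = conn.execute(
    "SELECT phase FROM argo_archived_workflows WHERE uid = ?", ("uid-1",)
).fetchone()
print(row[0])  # → Succeeded
```

Note that this fallback only works if the failed INSERT doesn't poison the surrounding transaction; as the rest of this thread shows, Postgres aborts the whole transaction block after the failed statement.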

@danxmoran
Contributor Author

I think the error in my workflow was a red herring. The WF spec was invalid, and the message was output parameters must have a valueFrom specified (we haven't figured out our CI story for linting the workflows we deploy with Helm yet).

Looking at the logs again, I see what you mean about first trying to INSERT, then performing an UPDATE. I think the UPDATE step is failing, though, because it's trying to use the same transaction as the failed INSERT.

I see this (abbreviated) in the logs:

Session ID: 00099
Transaction ID: 00097
Query:          INSERT INTO "argo_archived_workflows" ("clustername", "finishedat", "name", "namespace", "phase", "startedat", "uid", "workflow") VALUES ($1, $2, $3, $4, $5, $6, $7, $8) RETURNING "clustername", "uid"
Error:          pq: duplicate key value violates unique constraint "argo_archived_workflows_pkey"

Session ID: 00099
Transaction ID: 00097
Query: UPDATE "argo_archived_workflows" SET "workflow" = $1, "phase" = $2, "startedat" = $3, "finishedat" = $4 WHERE ("clustername" = $5 AND "uid" = $6)
Error: pq: current transaction is aborted, commands ignored until end of transaction block

time="2020-03-13T03:14:19Z" level=error msg="Failed to archive workflow" err="pq: current transaction is aborted, commands ignored until end of transaction block"
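One way to sidestep the "current transaction is aborted" state entirely is a single atomic upsert, so no statement in the transaction ever fails. This is a hedged sketch of that alternative, not the fix that actually landed in #2434; it again uses a toy sqlite3 table (Postgres 9.5+ and SQLite 3.24+ both accept this INSERT ... ON CONFLICT ... DO UPDATE syntax):

```python
import sqlite3

# Toy stand-in for the archive table (not Argo's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE argo_archived_workflows (
        clustername TEXT,
        uid         TEXT,
        phase       TEXT,
        workflow    TEXT,
        PRIMARY KEY (clustername, uid)
    )
""")

# A single statement handles both the first archive and any re-archive on
# retry, so there is no failed INSERT to abort the transaction block.
UPSERT = """
    INSERT INTO argo_archived_workflows (clustername, uid, phase, workflow)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (clustername, uid)
    DO UPDATE SET phase = excluded.phase, workflow = excluded.workflow
"""

conn.execute(UPSERT, ("command-center", "uid-1", "Error", "{}"))
conn.execute(UPSERT, ("command-center", "uid-1", "Succeeded", "{}"))  # retry
conn.commit()
phase = conn.execute(
    "SELECT phase FROM argo_archived_workflows WHERE uid = ?", ("uid-1",)
).fetchone()[0]
print(phase)  # → Succeeded
```

The alternative, keeping the separate INSERT/UPDATE statements, would be to issue a SAVEPOINT before the INSERT and ROLLBACK TO it on a duplicate-key error, so only the failed statement is discarded rather than the whole transaction.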

@danxmoran
Contributor Author

I'm going to update the issue title since the original behavior I saw is expected to happen in this case 😄

@danxmoran danxmoran changed the title DB duplicate key error on workflow retry Updating the workflow archive entry for a retried workflow fails with "current transaction is aborted" Mar 13, 2020
@alexec
Contributor

alexec commented Mar 13, 2020

So:

  1. Yes, there is a bug.
  2. It is an edge-case.
  3. It should be fixed.
