[CI] TasksIT testTasksCancellation failing #109686

Closed
piergm opened this issue Jun 13, 2024 · 3 comments · Fixed by #109929
Labels
:Distributed/Task Management (Issues for anything around the Tasks API - both persistent and node level.)
needs:risk (Requires assignment of a risk label (low, medium, blocker))
Team:Distributed (Meta label for distributed team)
>test-failure (Triaged test failures from CI)

Comments

@piergm
Member

piergm commented Jun 13, 2024

Build scan:
https://gradle-enterprise.elastic.co/s/ioneklonkjxya/tests/:server:internalClusterTest/org.elasticsearch.action.admin.cluster.node.tasks.TasksIT/testTasksCancellation

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.action.admin.cluster.node.tasks.TasksIT.testTasksCancellation" -Dtests.seed=1C3BE4591272AE4B -Dtests.locale=es-BO -Dtests.timezone=Africa/Bamako -Druntime.java=17

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.action.admin.cluster.node.tasks.TasksIT#testTasksCancellation

Failure excerpt:

java.lang.AssertionError: expected:<0> but was:<1>

  at __randomizedtesting.SeedInfo.seed([1C3BE4591272AE4B:5AA6B28DEC7007B0]:0)
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.failNotEquals(Assert.java:835)
  at org.junit.Assert.assertEquals(Assert.java:647)
  at org.junit.Assert.assertEquals(Assert.java:633)
  at org.elasticsearch.action.admin.cluster.node.tasks.TasksIT.testTasksCancellation(TasksIT.java:513)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

piergm added the :Distributed/Task Management, >test-failure, and Team:Distributed labels Jun 13, 2024
elasticsearchmachine added the needs:risk label Jun 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner self-assigned this Jun 15, 2024
@DaveCTurner
Contributor

I've seen this fail in both 8.14 and main (5e191a0), but it's very rare: it took ~15k iterations to hit it both times.

DaveCTurner added a commit that referenced this issue Jun 19, 2024
@DaveCTurner
Contributor

Ah, it's a consequence of #94865: there's now a chance that the task still exists even after the response has been received. This will reproduce it reliably:

diff --git a/server/src/main/java/org/elasticsearch/transport/TaskTransportChannel.java b/server/src/main/java/org/elasticsearch/transport/TaskTransportChannel.java
index d6a5b3436c9..ecc25c0eb77 100644
--- a/server/src/main/java/org/elasticsearch/transport/TaskTransportChannel.java
+++ b/server/src/main/java/org/elasticsearch/transport/TaskTransportChannel.java
@@ -43,6 +43,11 @@ public class TaskTransportChannel implements TransportChannel {
         try {
             channel.sendResponse(exception);
         } finally {
+            try {
+                Thread.sleep(500);
+            } catch (InterruptedException e) {
+                throw new RuntimeException(e);
+            }
             onTaskFinished.close();
         }
     }
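
For readers outside the codebase, the ordering problem can be sketched in isolation. The following is a minimal, self-contained illustration and not Elasticsearch code: `TaskRaceSketch`, `startTask`, and `observeCounts` are hypothetical names, and a `CountDownLatch` stands in for the `Thread.sleep(500)` above so the window between "response delivered" and "task unregistered" is held open deterministically.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a task registry; NOT the Elasticsearch TaskManager API.
class TaskRaceSketch {
    private final Set<Long> runningTasks = ConcurrentHashMap.newKeySet();

    int runningTaskCount() {
        return runningTasks.size();
    }

    // Registers a task, then on a worker thread completes the response FIRST and
    // unregisters the task afterwards. The latch plays the role of the
    // Thread.sleep(500) in the diff: it keeps the window open until released.
    Thread startTask(long id, CompletableFuture<String> response, CountDownLatch beforeUnregister) {
        runningTasks.add(id);
        Thread worker = new Thread(() -> {
            try {
                response.complete("done");
            } finally {
                try {
                    beforeUnregister.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                runningTasks.remove(id);
            }
        });
        worker.start();
        return worker;
    }

    // Returns {count seen right after the response, count after the worker exits}.
    static int[] observeCounts() {
        TaskRaceSketch registry = new TaskRaceSketch();
        CompletableFuture<String> response = new CompletableFuture<>();
        CountDownLatch gate = new CountDownLatch(1);
        Thread worker = registry.startTask(1L, response, gate);

        response.join();                                  // the caller now has its response
        int afterResponse = registry.runningTaskCount();  // the task is still registered here
        gate.countDown();                                 // let the worker unregister the task
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { afterResponse, registry.runningTaskCount() };
    }

    public static void main(String[] args) {
        int[] counts = observeCounts();
        // prints "after response: 1, after cleanup: 0"
        System.out.println("after response: " + counts[0] + ", after cleanup: " + counts[1]);
    }
}
```

An assertion of "0 running tasks" placed immediately after `response.join()` would fail in this sketch, which is the same shape as the `expected:<0> but was:<1>` failure in the report.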

DaveCTurner added a commit that referenced this issue Jun 19, 2024
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 19, 2024
The tasks are removed from the task manager _after_ sending the
response, so we cannot reliably assert they're done. With this commit we
wait for them to complete properly first.

Closes elastic#109686
elasticsearchmachine pushed a commit that referenced this issue Jun 19, 2024
The tasks are removed from the task manager _after_ sending the
response, so we cannot reliably assert they're done. With this commit we
wait for them to complete properly first.

Closes #109686
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 19, 2024
The tasks are removed from the task manager _after_ sending the
response, so we cannot reliably assert they're done. With this commit we
wait for them to complete properly first.

Closes elastic#109686
elasticsearchmachine pushed a commit that referenced this issue Jun 19, 2024
* Fix `TasksIT#testTasksCancellation` (#109929)

The tasks are removed from the task manager _after_ sending the
response, so we cannot reliably assert they're done. With this commit we
wait for them to complete properly first.

Closes #109686

* Introduce safeGet
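
The idea in the commit message, waiting for the tasks to finish before asserting on them, can be sketched generically as a bounded polling wait. This is an illustration under stated assumptions and not the actual change in #109929: `WaitForTasksSketch` and `waitUntil` are hypothetical names, and the `AtomicInteger` merely simulates a running-task count that drops shortly after the response arrives.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper class; not part of the Elasticsearch test framework.
class WaitForTasksSketch {

    // Polls `condition` until it holds or `timeoutMillis` elapses.
    // Returns true if the condition was observed to hold, false on timeout or interrupt.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // short back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated count of running tasks: one task that is cleaned up a little late.
        AtomicInteger runningTasks = new AtomicInteger(1);
        Thread cleanup = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            runningTasks.decrementAndGet();
        });
        cleanup.start();

        // Instead of asserting "0 tasks" immediately, wait (bounded) for that state.
        boolean drained = waitUntil(() -> runningTasks.get() == 0, 5_000);
        System.out.println("drained=" + drained + ", running=" + runningTasks.get());
    }
}
```

The bounded timeout keeps a genuinely stuck task visible as a failure, while tolerating the short delay between the response and the task's removal.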