
[FLINK-21346][build system] Adding timeout to all tasks #15546

Closed
wants to merge 6 commits

Conversation

dawidwys
Contributor

@dawidwys dawidwys commented Apr 9, 2021

What is the purpose of the change

I took over #14834

When a test gets stuck, it's usually killed after 10 min of inactivity. #13260 excluded all tests that still show log activity from this timeout to allow test runs that take longer than 10 min.

However, some stuck tests are not deadlocked but rather caught in some kind of livelock. These tests now run into the AZP timeout (4h) without any logs being uploaded. This commit adds a timeout to all build tasks in the same way the e2e tests are timeboxed. On reaching the timeout, the task logs are uploaded together with stack traces.
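The timeboxing idea can be sketched as follows. This is a minimal illustration with hypothetical names (`run_timeboxed`, `TASK_TIMEOUT_SECONDS`), assuming GNU coreutils `timeout` and the JDK tools `jps`/`jstack` are available; the actual logic lives in `tools/azure-pipelines/uploading_watchdog.sh`:

```shell
#!/usr/bin/env bash
# Sketch: run a build task under a hard timeout and, on expiry, dump JVM
# stack traces so the uploaded logs show where the task was stuck.
# Names and the timeout value are illustrative, not the real script.
TASK_TIMEOUT_SECONDS="${TASK_TIMEOUT_SECONDS:-2}"

run_timeboxed() {
  if timeout "$TASK_TIMEOUT_SECONDS" "$@"; then
    return 0
  fi
  # Task timed out (exit 124) or failed: collect stack traces of all
  # running JVMs, if any, so they end up next to the regular task logs.
  if command -v jps >/dev/null 2>&1; then
    jps -q | xargs -r -n1 jstack > stacktraces.log 2>&1 || true
  fi
  return 1
}
```

A task that finishes in time passes through unchanged; a livelocked one is killed after `TASK_TIMEOUT_SECONDS` instead of holding the agent until the 4h AZP limit.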

Verifying this change

Manually verified.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@dawidwys dawidwys requested review from rmetzger and AHeise April 9, 2021 08:45
@dawidwys
Contributor Author

dawidwys commented Apr 9, 2021

Hey, @AHeise @rmetzger Do you mind having a look when you have time? Thanks!

@flinkbot
Collaborator

flinkbot commented Apr 9, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 6213fbc (Fri May 28 09:08:20 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Apr 9, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@dawidwys
Contributor Author

dawidwys commented Apr 9, 2021

The Azure failure is expected; it shows the layout of the uploaded files. Once the change is reviewed I will remove the temp commits and rerun Azure.

@dawidwys
Contributor Author

Regarding manual compression:

I think it is safe to remove the manual compression, as it is done automatically by Azure:
without manual compression:

Content upload statistics:
Total Content: 116.6 MB
Physical Content Uploaded: 7.0 MB
Logical Content Uploaded: 58.3 MB
Compression Saved: 51.3 MB
Deduplication Saved: 58.3 MB
Number of Chunks Uploaded: 593

with manual compression:

Content upload statistics:
Total Content: 14.0 MB
Physical Content Uploaded: 6.6 MB
Logical Content Uploaded: 7.0 MB
Compression Saved: 0.4 MB
Deduplication Saved: 7.0 MB
Number of Chunks Uploaded: 89

Stats from: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16278&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=800a4f47-e103-5bd8-7e1e-053f424b0a53

Contributor

@AHeise AHeise left a comment

Thank you for picking that PR up!

Downloaded misc and there is a superfluous file in there.
Downloaded e2e and all looks good.
I like that we got rid of the double compression. Makes it easier to check.

@@ -81,7 +81,7 @@ jobs:
condition: and(succeeded(), not(eq(variables['MODE'], 'e2e')))
pool: ${{parameters.test_pool_definition}}
container: ${{parameters.container}}
timeoutInMinutes: 240
timeoutInMinutes: 60
Contributor

For the final PR this should be set to 2h or so (lower than the 4h we used before).

Contributor Author

Why should we lower it? We did not make the tests run faster...

Contributor

The tests didn't need 4h, and this way they run into a potential timeout more quickly. Just check the build times on recent builds (btw, 4h is the default).

tools/azure-pipelines/uploading_watchdog.sh Outdated Show resolved Hide resolved
@dawidwys
Contributor Author

Which superfluous file do you have in mind? Is it mvn-${sys:mvn.forkNumber}.log? It's an unrelated issue of some tests in that profile. Current setup also produces this file. See e.g. https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16410&view=results

displayName: Upload Logs
inputs:
targetPath: $(ARTIFACT_DIR)
artifact: logs-${{parameters.stage_name}}-e2e
Contributor

I added the stage name to the artifact name to avoid duplicate file issues in the nightly CI profiles. It seems that the artifact name is now derived from timestamps again, which risks artifacts not being uploaded because they already exist (IIRC it has happened that files had exactly the same timestamp, since builds are triggered at the same time).

Contributor Author

I reintroduced the stage name.

@rmetzger
Contributor

I think it is safe to remove the manual compression as it is done automatically by azure:

I wonder if the compression is only done for the transfer, or if the files are stored as archives. As an authenticated user, it looks like files are stored without compression:
(screenshot of the stored artifact files)

I'm not saying that we need to change it. The only concern I have is that we might run into a resource limitation if Azure starts enforcing limits at some point.

@rmetzger
Contributor

Besides these two comments, I didn't spot anything in the changes.

@dawidwys
Contributor Author

As for the compression, let's try to remove it for now as it makes it a bit easier to work with the files. It lets you also download a single file from a build. If we hit a problem with a resource limitation we can easily reintroduce the compression.
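With the manual tar step dropped, the publish step reduces to handing Azure the raw log directory. A sketch based on the fragment in this PR's diff; the task name and version (`PublishPipelineArtifact@1`) are an assumption about the surrounding pipeline definition:

```yaml
# Upload the raw, uncompressed logs; Azure's artifact upload applies
# chunk-level compression and deduplication on its own (see the upload
# statistics quoted above).
- task: PublishPipelineArtifact@1
  displayName: Upload Logs
  inputs:
    targetPath: $(ARTIFACT_DIR)
    artifact: logs-${{parameters.stage_name}}-e2e
```

This also makes individual files downloadable from the build page instead of forcing a download of one big archive.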

Arvid Heise added 2 commits April 14, 2021 09:41
…stacktraces.

The test/e2e task is killed before the timeout and artifacts are published in the regular way. If killed, jps traces and watchdog output are additionally attached to the mvn logs.
@dawidwys dawidwys force-pushed the FLINK-21234 branch 3 times, most recently from a9291f1 to 5141fb3 Compare April 15, 2021 11:54
Contributor

@AHeise AHeise left a comment

LGTM % squash fixups. Thank you for your contributions.

@dawidwys dawidwys closed this in 6b430e6 Apr 16, 2021
Contributor

@tillrohrmann tillrohrmann left a comment

I think this change breaks the local e2e test setup on my machine (MacOS). The problem is that $FLINK_DIR/logs does not exist.

@@ -32,30 +32,24 @@ if [ -z "$FLINK_DIR" ] ; then
exit 1
fi

if [ -z "$FLINK_LOG_DIR" ] ; then
export FLINK_LOG_DIR="$FLINK_DIR/logs"
Contributor

I think this change breaks the e2e setup on my machine (MacOS) because the directory does not exist.

@@ -46,6 +46,10 @@ if [ -z "$FLINK_DIR" ] ; then
exit 1
fi

if [ -z "$FLINK_LOG_DIR" ] ; then
export FLINK_LOG_DIR="$FLINK_DIR/logs"
Contributor

Same here.

@@ -26,6 +26,10 @@ if [[ -z $FLINK_DIR ]]; then
exit 1
fi

if [ -z "$FLINK_LOG_DIR" ] ; then
export FLINK_LOG_DIR="$FLINK_DIR/logs"
Contributor

And here.

@AHeise
Contributor

AHeise commented Aug 25, 2021

I think this change breaks the local e2e test setup on my machine (MacOS). The problem is that $FLINK_DIR/logs does not exist.

Can you please fix it? Dawid is MIA afaik.
The main question is: where are your logs? They should be somewhere after your run, even if there is a failure in job submission.
I see a log in my flink-dist directory, but this feels wrong somehow. Maybe that is the actual bug.

Or maybe we should run a mkdir -p "$FLINK_LOG_DIR" before starting things.
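That guard could look like this. A sketch against the e2e common scripts, assuming FLINK_DIR is normally set by the caller (the mktemp fallback is only so the sketch is self-contained):

```shell
#!/usr/bin/env bash
# Default the log directory the way the e2e scripts do, then create it up
# front so a fresh checkout (where the directory does not exist yet) cannot
# break log collection on MacOS or anywhere else.
FLINK_DIR="${FLINK_DIR:-$(mktemp -d)}"   # sketch fallback; normally set by the caller

if [ -z "$FLINK_LOG_DIR" ]; then
  export FLINK_LOG_DIR="$FLINK_DIR/log"
fi
mkdir -p "$FLINK_LOG_DIR"
```

mkdir -p is idempotent, so running it on an existing directory is harmless.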

@tillrohrmann
Contributor

Maybe I will just fix the problem by setting FLINK_LOG_DIR=$FLINK_DIR/log. Changing this to something else can be done in a follow-up ticket.
