Improve reliability of portal testnet tests #1166

Closed
KonradStaniec opened this issue Jul 19, 2022 · 9 comments

@KonradStaniec
Contributor

Currently the tests in https://github.com/status-im/nimbus-eth1/blob/master/fluffy/scripts/test_portal_testnet.nim are pretty flaky on CI and sometimes fail, mainly on i386 architectures.

This task is about:

  • making them non-flaky
  • researching whether something fishy is going on and whether fluffy nodes hang/crash under some conditions
@kdeme
Contributor

kdeme commented Jul 21, 2022

There is a test which timed out after 6 hours: https://github.com/status-im/nimbus-eth1/runs/7427491704?check_suite_focus=true

This is likely only possible now because of the added withRetries: https://github.com/status-im/nimbus-eth1/pull/1164/files#diff-395536eba9bf85209a5e8f6138c94bdb5074116e05f33a326c2ba7427b8e43faR50

It is possible that when all these requests fail (nodes hanging or crashed?), the test has to go through all the retries for (almost?) every request. This withRetries would then make the run last very long instead of failing fast like before.
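
For illustration, a minimal synchronous sketch of that failure mode (not the actual fluffy withRetries helper, which is async; doRequest, maxRetries and retryDelayMs are made-up names): if every request fails, each one burns maxRetries * retryDelayMs before giving up, so a test that makes many requests can crawl for hours instead of failing fast.

```nim
import std/os

proc withRetries(doRequest: proc (): bool, maxRetries = 5,
                 retryDelayMs = 1000): bool =
  ## Retries `doRequest` up to `maxRetries` times with a fixed delay.
  for attempt in 1 .. maxRetries:
    if doRequest():
      return true            # success, stop retrying
    sleep(retryDelayMs)       # wait before the next attempt
  return false                # give up only after exhausting all retries

when isMainModule:
  # Stand-in for a request to a node that hangs or crashed: it never succeeds,
  # so this single call already takes ~5 seconds before reporting failure.
  # Multiply that by every request the testnet test makes and the run balloons.
  echo withRetries(proc (): bool = false)
```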

@KonradStaniec
Contributor Author

KonradStaniec commented Jul 26, 2022

There is a test which timed out after 6 hours: https://github.com/status-im/nimbus-eth1/runs/7427491704?check_suite_focus=true

There was a bug in the retry code which did not increase the retry count when the provided check function failed. Fixed this in #1170.
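
For context, a hypothetical reconstruction of that bug class (not the actual diff in the linked PR; retryUntilCheck and its parameters are made-up names): the attempt counter was effectively not increased when the check function failed, so a check that kept returning false spun the loop forever and the CI job ran into the 6 hour timeout.

```nim
proc retryUntilCheck(check: proc (): bool, maxRetries: int): bool =
  ## Returns true as soon as `check` passes, false after `maxRetries` failures.
  var attempts = 0
  while attempts < maxRetries:
    if check():
      return true
    # The fix: count *every* failed attempt. In the buggy version this
    # increment was effectively skipped when the check function failed, so
    # `attempts` never grew and the loop never terminated.
    inc attempts
  return false

when isMainModule:
  # A check that never passes now fails cleanly after 5 attempts instead of
  # hanging the whole test run.
  doAssert not retryUntilCheck(proc (): bool = false, maxRetries = 5)
```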

@jangko
Contributor

jangko commented Jul 26, 2022

I've tried to debug this problem locally, but have not been able to reproduce it. The "only" difference is that I'm using Ubuntu 20.04 while the CI is using Ubuntu 18.04.

@KonradStaniec
Contributor Author

I am currently using a virtual machine with Linux 16.04.6 i386, 2 cores and 4 GB RAM (my host machine is macOS) to try to fix it, and I am able to reproduce it 50% of the time. What is curious is that, even though I have given the VM 2 cores, the rate of failures increases when I also restrict the performance of the host machine (for example by reducing the number of cores from 12 to 2).

@jangko
Contributor

jangko commented Jul 26, 2022

I see, maybe I should reduce the number of cores and RAM for my WSL2 Linux (Windows host machine). Currently I'm using 6 cores and 16 GB RAM on my Linux instance.

@KonradStaniec
Contributor Author

Yup, that is possible. I have never seen those tests fail on macOS in CI, only on Linux and Windows (especially the i386 versions), and according to https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources, the macOS machines in GitHub Actions are a bit beefier than the Linux and Windows ones (1 more core + an additional 7 GB of RAM).

@jangko
Contributor

jangko commented Jul 28, 2022

I guess there is something wrong besides Linux i386. Three times in a day, I got a failing test on Windows 64-bit, Linux 64-bit, and macOS.

@KonradStaniec
Contributor Author

In general, the probability of those tests failing increases as the number of cores on the machine decreases. I was able to trigger it on my Mac when restricting it to one core.

Failure comes from triggering a lot of different timeouts:

If enough of those timeouts trigger, it may happen that a piece of content never gets spread in the network and the test fails.

#1176 should improve the situation a bit by increasing the timeout for reading content, as it was super aggressive, especially taking into account how our uTP interacts with the offer-accept loop.

Some follow-up tasks I will do:

  • make those timeouts configurable at the fluffy level so we can increase them only on CI (see the sketch after this list)
  • make the timeouts on discovery also configurable
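
As a sketch of the first point only, with hypothetical names throughout: one way to expose such a knob is to read an override from the environment on CI while keeping a default elsewhere. The default value, the env variable and the proc name below are made up, and fluffy's actual configuration/CLI mechanism may differ.

```nim
import std/[os, strutils, times]

const defaultContentReadTimeoutMs = 15_000   # assumed default, not fluffy's real value

proc contentReadTimeout(): Duration =
  ## CI can set FLUFFY_CONTENT_READ_TIMEOUT_MS (hypothetical name) to relax
  ## the timeout on slow runners without touching the default used elsewhere.
  let raw = getEnv("FLUFFY_CONTENT_READ_TIMEOUT_MS")
  let ms = if raw.len > 0: parseInt(raw) else: defaultContentReadTimeoutMs
  result = initDuration(milliseconds = ms)

when isMainModule:
  echo "content read timeout: ", contentReadTimeout()
```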

@KonradStaniec
Contributor Author

Fixed by #1176 and #1178, which increased timeouts. For now the tests seem to be stable; if there are more issues, those will be addressed in a separate task/issue.
