Improve reliability of portal testnet tests #1166
Comments
There is a test which timed out after 6 hours: https://github.com/status-im/nimbus-eth1/runs/7427491704?check_suite_focus=true This is likely only possible now because of the added It is possible that when all these requests fail (nodes hung or crashed?), this has to go through all the retries for (almost?) every request. This |
There was a bug in the retry code which did not increase the retry count when provided |
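To illustrate the class of bug described above, here is a minimal sketch of a bounded retry loop. It is not the code from the test script: `call_with_retries`, `max_retries` and `delay` are hypothetical names chosen for illustration. The point is only that every failure path has to advance the counter, otherwise the loop can keep retrying indefinitely.

```python
import time


def call_with_retries(request, max_retries=3, delay=0.0):
    """Call `request` until it succeeds or `max_retries` retries are used up.

    Hypothetical helper for illustration only; not taken from the test script.
    """
    retries = 0
    while True:
        try:
            return request()
        except Exception:
            # The counter must advance on every failed attempt. If any
            # failure path skips this increment, the loop never terminates.
            if retries >= max_retries:
                raise
            retries += 1
            time.sleep(delay)
```

With `max_retries=3`, a request that always fails is attempted four times in total (one initial call plus three retries) before the last error is re-raised.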
I've tried to debug this problem locally, but have not been able to reproduce it. The "only" difference is that I'm using Ubuntu 20.04 while the CI is using Ubuntu 18.04. |
I am currently using a virtual machine with Linux 16.04.6 i386, 2 cores and 4 GB RAM (my host machine is macOS) to try to fix it, and I am able to reproduce it 50% of the time. What is curious is that, even though I have given the VM 2 cores, the failure rate increases when I also restrict the performance of the host machine (for example by reducing the number of cores from 12 to 2). |
I see, maybe I should reduce the number of cores and the RAM for my WSL2 Linux instance (Windows host machine). Currently I'm using 6 cores and 16 GB RAM on my Linux instance. |
Yup, that is possible. I have never seen those tests fail on macOS in CI, only on Linux and Windows (especially the i386 versions), and according to https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources, macOS machines in GitHub Actions are a bit beefier than the Linux and Windows ones (1 more core + an additional 7 GB of RAM). |
I guess there is something wrong besides Linux i386. Three times in one day, I got failing tests on Windows 64-bit, Linux 64-bit, and macOS. |
In general, the probability of those tests failing increases as the number of cores on the machine decreases. I was able to trigger it on my Mac when restricting it to one core. The failure comes from triggering a lot of different timeouts:
If enough of those timeouts trigger, it may happen that a piece of content will never be spread in the network and the test fails. #1176 should improve the situation a bit by increasing the timeout for reading content, as it was super aggressive, especially taking into account how our uTP interacts with the offer-accept loop. Some follow-up tasks I will do:
|
Currently https://github.com/status-im/nimbus-eth1/blob/master/fluffy/scripts/test_portal_testnet.nim tests are pretty flaky on CI and sometimes fail, mainly on i386 architectures.
This task is about: