Improve reliability of portal testnet tests #1166

Closed
KonradStaniec opened this issue Jul 19, 2022 · 9 comments

@KonradStaniec
Contributor

Currently the tests in https://github.com/status-im/nimbus-eth1/blob/master/fluffy/scripts/test_portal_testnet.nim are pretty flaky on CI and sometimes fail, mainly on i386 architectures.

This task is about:

  • making them non-flaky
  • researching whether something fishy is going on and whether fluffy nodes hang/crash under some conditions
@kdeme
Contributor

kdeme commented Jul 21, 2022

There is a test which timed out after 6 hours: https://github.com/status-im/nimbus-eth1/runs/7427491704?check_suite_focus=true

This is likely only possible now because of the added withRetries: https://github.com/status-im/nimbus-eth1/pull/1164/files#diff-395536eba9bf85209a5e8f6138c94bdb5074116e05f33a326c2ba7427b8e43faR50

It is possible that when all these requests fail (nodes hanging or crashed?), the test has to go through all the retries for (almost?) every request. This withRetries would then make the run last very long instead of failing fast like before.
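
For illustration, a minimal synchronous sketch of that failure mode (not the actual fluffy withRetries helper, which is async; doRequest, maxRetries and retryDelayMs are made-up names): if every request fails, each one burns maxRetries * retryDelayMs before giving up, so a test that makes many requests can crawl for hours instead of failing fast.

```nim
import std/os

proc withRetries(doRequest: proc (): bool, maxRetries = 5,
                 retryDelayMs = 1000): bool =
  ## Retries `doRequest` up to `maxRetries` times with a fixed delay.
  for attempt in 1 .. maxRetries:
    if doRequest():
      return true            # success, stop retrying
    sleep(retryDelayMs)       # wait before the next attempt
  return false                # give up only after exhausting all retries

when isMainModule:
  # Stand-in for a request to a node that hangs or crashed: it never succeeds,
  # so this single call already takes ~5 seconds before reporting failure.
  # Multiply that by every request the testnet test makes and the run balloons.
  echo withRetries(proc (): bool = false)
```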

@KonradStaniec
Contributor Author

KonradStaniec commented Jul 26, 2022

There is a test which timed out after 6 hours: https://github.com/status-im/nimbus-eth1/runs/7427491704?check_suite_focus=true

There was a bug in the retry code which did not increase the retry count when the provided check function failed. Fixed this in #1170.
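
For context, a hypothetical reconstruction of that bug class (not the actual diff in the linked PR; retryUntilCheck and its parameters are made-up names): the attempt counter was effectively not increased when the check function failed, so a check that kept returning false spun the loop forever and the CI job ran into the 6 hour timeout.

```nim
proc retryUntilCheck(check: proc (): bool, maxRetries: int): bool =
  ## Returns true as soon as `check` passes, false after `maxRetries` failures.
  var attempts = 0
  while attempts < maxRetries:
    if check():
      return true
    # The fix: count *every* failed attempt. In the buggy version this
    # increment was effectively skipped when the check function failed, so
    # `attempts` never grew and the loop never terminated.
    inc attempts
  return false

when isMainModule:
  # A check that never passes now fails cleanly after 5 attempts instead of
  # hanging the whole test run.
  doAssert not retryUntilCheck(proc (): bool = false, maxRetries = 5)
```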

@jangko
Contributor

jangko commented Jul 26, 2022

I've tried to debug this problem locally, but have not been able to reproduce it. The "only" difference is that I'm using Ubuntu 20.04 while the CI is using Ubuntu 18.04.

@KonradStaniec
Contributor Author

I am currently using a virtual machine with Linux 16.04.6 i386, 2 cores and 4 GB RAM (my host machine is macOS) to try to fix it, and I am able to reproduce it 50% of the time. What is curious is that, even though I have given the VM 2 cores, the rate of failures increases when I also restrict the performance of the host machine (for example by reducing the number of cores from 12 to 2).

@jangko
Contributor

jangko commented Jul 26, 2022

I see, maybe I should reduce the number of cores and RAM for my WSL2 Linux (Windows host machine). Currently I'm using 6 cores and 16 GB RAM on my Linux instance.

@KonradStaniec
Contributor Author

Yup, that is possible. I have never seen those tests fail on macOS in CI, only on Linux and Windows (especially the i386 versions), and according to https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources, the macOS machines in GitHub Actions are a bit beefier than the Linux and Windows ones (1 more core + an additional 7 GB of RAM).

@jangko
Contributor

jangko commented Jul 28, 2022

I guess there is something wrong besides Linux i386. Three times in a day, I got a failing test on Windows 64-bit, Linux 64-bit, and macOS.

@KonradStaniec
Contributor Author

In general, the probability of those tests failing increases as the number of cores on the machine decreases. I was able to trigger it on my Mac when restricting it to one core.

Failure comes from triggering a lot of different timeouts:

If enough of those timeouts trigger, it may happen that a piece of content never gets spread in the network and the test fails.

#1176 should improve the situation a bit by increasing the timeout for reading content, as it was super aggressive, especially taking into account how our uTP interacts with the offer-accept loop.

Some follow-up tasks I will do:

  • make those timeouts configurable at the fluffy level so we can increase them only on CI (see the sketch after this list)
  • make the timeouts on discovery also configurable
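
As a sketch of the first point only, with hypothetical names throughout: one way to expose such a knob is to read an override from the environment on CI while keeping a default elsewhere. The default value, the env variable and the proc name below are made up, and fluffy's actual configuration/CLI mechanism may differ.

```nim
import std/[os, strutils, times]

const defaultContentReadTimeoutMs = 15_000   # assumed default, not fluffy's real value

proc contentReadTimeout(): Duration =
  ## CI can set FLUFFY_CONTENT_READ_TIMEOUT_MS (hypothetical name) to relax
  ## the timeout on slow runners without touching the default used elsewhere.
  let raw = getEnv("FLUFFY_CONTENT_READ_TIMEOUT_MS")
  let ms = if raw.len > 0: parseInt(raw) else: defaultContentReadTimeoutMs
  result = initDuration(milliseconds = ms)

when isMainModule:
  echo "content read timeout: ", contentReadTimeout()
```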

@KonradStaniec
Contributor Author

Fixed by #1176 and #1178, which increased timeouts. For now the tests seem to be stable; if there are more issues, those will be addressed in a separate task/issue.
