
Hard fault in networking stack. #136

Closed
fenconsultants opened this issue Jan 21, 2020 · 5 comments

Comments

@fenconsultants

Running the uIP based web server on an STM32F4 with Ethernet connection, I get errors and eventually a hard fault provoked by loading multiple pages (opening multiple connections on port 80) in quick succession, e.g. by loading a web page that includes a couple of css files, a couple of js files and a couple of images.

The errors are:

tcp_datahandler: ERROR: Failed to create new I/O buffer chain

This is from iob_tryalloc failing to allocate a buffer. It is repeated multiple times per page load. More rarely I see:

tcp_datahandler: ERROR: Failed to add data to the I/O buffer chain: -12

Initially this resulted in most of the connections failing and httpd processes hanging, but after setting CONFIG_NET_TCPBACKLOG_CONNS to 8 the page almost always fully loads (connections succeed) despite the errors.

However within a minute or two of repeatedly reloading pages, I get a hard fault.

up_hardfault: PANIC!!! Hard fault: 40000000
up_assert: Assertion failed at file:armv7-m/up_hardfault.c line: 148 task: <pthread>
up_registerdump: R0: 00000000 200153e0 20010820 00000000 00000000 20012a60 00000000 ffffffff
up_registerdump: R8: ffffffff 00000000 2001bd62 00000002 000012c0 200163d8 080197eb 08012860
up_registerdump: xPSR: 00000000 PRIMASK: 00000001 CONTROL: 00000000
up_registerdump: EXC_RETURN: fffffff9
up_dumpstate: sp:     200014a8
up_dumpstate: IRQ stack:
up_dumpstate:   base: 20001500
up_dumpstate:   size: 00000800
up_stackdump: 200014a0: 20001500 08000cd7 00000001 00000000 00000002 000012c0 200163d8 080197eb
up_stackdump: 200014c0: 08012860 08021178 00000003 0800110d 080010cd 2000d8c0 2001638c 080014fd
up_stackdump: 200014e0: 00000001 080011f5 00000001 2001638c 20012a60 00000000 ffffffff 08000dc3
up_dumpstate: sp:     200163d8
up_dumpstate: User stack:
up_dumpstate:   base: 20016478
up_dumpstate:   size: 00000ffc
up_stackdump: 200163c0: 20010820 00000000 000012c0 080197eb 08012860 00000000 2000bba8 20012ab8
up_stackdump: 200163e0: 20012ab8 0800f275 20012ab8 20012a60 2001add6 2001ac70 08028030 0800e399
up_stackdump: 20016400: 2000bba8 20012ab8 2001add6 0800e4e7 2001bda4 08014087 2001bda4 08007ff9
up_stackdump: 20016420: 20015360 08019255 00000000 0000000b 2001be34 2001adb8 00000000 20015360
up_stackdump: 20016440: 20015460 20012880 2001288c 00000000 00000000 00000000 00000000 08018aa9
up_stackdump: 20016460: 00000000 00000000 00000000 00000000 00000000 00000000 000101aa 00000001

R15 is inside _net_timedwait and R14 is inside nxsem_wait.

I've attached the map.

System.map.txt

Also my config:

defconfig.txt

@acassis
Contributor

acassis commented Jan 21, 2020

Hi, try to increase the CONFIG_IOB_NBUFFERS and CONFIG_IOB_NCHAINS and let us know if it helped.
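For reference, a sketch of the relevant configuration fragment. The option names are the real NuttX Kconfig symbols mentioned in this thread, but the values here are purely illustrative and should be tuned to the actual traffic load, not taken as recommendations:

```
# Illustrative values only -- tune for your workload.
CONFIG_IOB_NBUFFERS=64          # number of pre-allocated I/O buffers
CONFIG_IOB_NCHAINS=16           # number of I/O buffer chain heads
CONFIG_NET_TCPBACKLOG_CONNS=8   # pending connections per listener (set earlier in this thread)
```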

@patacongo
Contributor

patacongo commented Jan 21, 2020

"R15 is inside _net_timedwait and R14 is inside nxsem_wait."
That is a classic example of memory corruption. It usually occurs because of a stack overrun, or sometimes from a write through a bad memory location.

In this case, you are looking at the VICTIM of the corruption, not the CULPRIT that caused the corruption. This classic failure occurs like this:

  • Task A, the VICTIM, calls nxsem_wait() and is suspended, perhaps for a long period of time.
  • During that long period of time, Task B, the CULPRIT, runs and exceeds its stack limit, clobbering task resources of the VICTIM.
  • When the VICTIM is restarted in nxsem_wait(), it crashes immediately because its stack or some other resource has been corrupted.

The usual fix, which works 80% of the time, is to increase the stack size of Task B, the CULPRIT, or sometimes of Task A, the VICTIM.

Task A's stack must be increased in the case where Task A is both the VICTIM and the CULPRIT. This happens when Task A's stack is too small: when it is suspended, the critical stack area holding its current state lies outside its stack limits, in some other task's memory space. The failure scenario is the same as above except that Task B is not the CULPRIT; it innocently clobbers Task A's stack.

@fenconsultants
Author

I guess I shouldn't look at bugs before having my coffee. PC and LR clearly make no sense in that order. I'll colour the stacks, check for depth and close the bug if it was that. Thanks.

@patacongo
Contributor

patacongo commented Jan 21, 2020

I assumed that you just reversed them. They do make perfect sense in the opposite order.

But if the stack was corrupted when the task restarted, nothing in the register dump or the stack dump may be correct.

@fenconsultants
Author

I've had no hard faults in the last 20 hours or so, which is good. After increasing the number of IOB buffers (which was the first thing I tried before posting the ticket) I still get a lot of "Failed to create new I/O buffer chain" errors while loading a page, but most of the time the content does appear to load correctly. Also, NuttX generates a lot of nerr output when sockets are closed from under it (e.g. by closing a web browser or hitting refresh before a page loads). Those messages are strictly accurate, but they seem like a heavy response to normal behaviour, and with that much debug output going on they could mask other things that are happening.

I'll close the bug later today. Thanks for the help.
