
16KB cache + 48KB IRAM (IRAM) doesn't seem to make extra RAM available #9033

TD-er opened this issue Nov 17, 2023 · 15 comments

@TD-er
Contributor

TD-er commented Nov 17, 2023

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [latest git hash & 3.0.0]
  • Development Env: [PlatformIO]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Generic ESP8266 Module|Wemos D1 mini r2|Nodemcu|other]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Higher Bandwidth]
  • CPU Frequency: [80MHz]
  • Upload Using: [SERIAL]
  • Upload Speed: [115200] (serial upload only)

Tried with these in PlatformIO:

                            -DPIO_FRAMEWORK_ARDUINO_MMU_CUSTOM
                            -DMMU_IRAM_SIZE=0xC000 
                            -DMMU_ICACHE_SIZE=0x4000

or PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48

Also tried with these in the build section of the board definition:

    "mmu_iram_size": "0xC000",
    "mmu_icache_size": "0x4000"

Problem Description

The latest version of the ESP8266 Arduino core (3.1.2) doesn't leave enough free RAM for my application to run stably.

I also spent quite a lot of time implementing the 2nd heap.
This gives some relief in the amount of free RAM, but whether it is usable seems to depend on the board.

On a Wemos D1 mini board (really old one, so might be one of those "true" Wemos boards) it runs quite OK.
But on a Sonoff POW r2 it runs like what we would say in Dutch: "Like a drunk lad on roller skates".

So I assumed it might have something to do with the extra burden on the flash due to the cache misses?
But using the same PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48_SECHEAP_SHARED option without actually allocating anything on the 2nd heap, it runs as stable as you can expect with the limited amount of free memory.

Therefore I tried to see whether switching to 16k cache and 48k iRAM would be useable, but I don't seem to get any extra free memory.

I found this comment where a user showed an incredible amount of free DRAM.
I used the same platform_packages as he did for that post, but I don't seem to get any extra free memory.

So what am I doing wrong here?
Or isn't this memory made available to the normal allocator because it can perhaps only be addressed with 32-bit aligned access?

It would already be a great help if you could store fairly static allocations there, like the buffer for the MQTT PubSubClient or a frame buffer for a display.

MCVE Sketch

See: https://community.platformio.org/t/esp8266-mmu-increase-heap/21488/3

@mhightower83
Contributor

On a Wemos D1 mini board (really old one, so might be one of those "true" Wemos boards) it runs quite OK.
But on a Sonoff POW r2 it runs like what we would say in Dutch: "Like a drunk lad on roller skates".

Looking at boards.txt the Wemos boards use DIO and we always assume DOUT for the Sonoff two-layer boards. This could give the Wemos an advantage.

So I assumed it might have something to do with the extra burden on the flash due to the cache misses?
But using the same PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48_SECHEAP_SHARED option without actually allocating anything on the 2nd heap, it runs as stable as you can expect with the limited amount of free memory.

I think the exception-handling overhead of non-32-bit data accesses to IRAM may play a big role in the performance, or lack of it.
If all data accesses are 32-bit, the exception overhead is gone. However, umm_malloc accessing Heap headers will cause a hit, but those should be very infrequent.

Therefore I tried to see whether switching to 16k cache and 48k iRAM would be useable, but I don't seem to get any extra free memory.

That option does not create a 2nd heap (at least, as it is defined for the Arduino IDE). You get 48 KB of IRAM and no 2nd Heap. This was intended for those who just need a bigger IRAM space for code and do not want any extra code space used for handling a 2nd Heap.
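
In other words, the extra IRAM in that option is meant for code, for example functions that must execute from IRAM (a trivial illustration, not taken from this issue):

    #include <Arduino.h>

    // IRAM_ATTR places the function body in IRAM so it can run while the flash
    // cache is unavailable (e.g. in ISRs); the 48KB option simply gives such
    // code more room, it does not feed the malloc heap.
    void IRAM_ATTR onPinChange() {
      // keep ISR work short; avoid touching flash-resident code or data here
    }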

@TD-er
Contributor Author

TD-er commented Nov 17, 2023

I did want to have more memory, and since the 2nd heap isn't working as stably as I would like, I wanted to try the 48k option, but I don't see more available free memory when it is enabled.

So is this option intended to provide more available memory?
Or is it only intended for those who need more room for ISR code and other interrupt handling?

Edit:
Just checked: for ESP8266 I always use DOUT for flash access.
So the Wemos should not have faster flash access because of that.
Still, there can be a difference between flash brands.

@mhightower83
Contributor

Or is it only intended for those who need more room for ISR code and other interrupt handling?

Yes, 16k cache and 48k iRAM is to provide larger code space for iRAM-hungry ISRs.

Assuming they both use the same flash speed, I don't know why they would perform differently.

I did want to have more memory and since the 2nd heap isn't working as stable as I would like,

By "stable" are you referring to performance issues or crashing?

@TD-er
Contributor Author

TD-er commented Nov 17, 2023

I had another look at the uint8_t mmu_get_uint8(const void *p8) code.
I also took another dive into some of the topics from when I first looked into this together with @mcspr (implementing the 2nd heap and the quest for more free RAM).
Related discussions:

Anyway back to the mmu_get_uint8 function.
Shouldn't those 32-bit aligned pointers be marked as volatile ?
Like this:

  volatile void *v32 = (void *)((uintptr_t)p8 & ~(uintptr_t)3u);

N.B. this is about the idea of using the 2nd heap, but if it can be done much more simply by just having 48k available, then that would be the preferred option.
But I get the feeling I have the wrong idea about what this "16k + 48k" option really does, and if so, the 2nd heap issue becomes relevant again.

@TD-er
Contributor Author

TD-er commented Nov 17, 2023

By "stable" are you referring to performance issues or crashing?

Crashing.
All crashes I've seen were Watchdog crashes.
It seems like the client isn't fetching the data sent by the ESP and thus the webserver buffers fill up.

I've seen something similar happen on a test node that was hardly doing anything and suddenly seemed to stop outputting serial data. This is usually an indication of a pending watchdog reboot, but on this node the serial log eventually recovered and flushed its log data, and the memory usage of the default heap seemed to recover as well.

Right now I have some stuff running on this Wemos unit with 2nd heap (and I can now also let ESPEasy draw these nice charts :) )

image

As you can see, the typical action of loading a page adds something like 1 or 2 TCP packets of RAM usage to the default heap (enforced to use the DRAM heap, not IRAM).
But whenever the client stalls while fetching this data, it of course adds up quickly, and this seems to happen quite a bit more often when using the 2nd heap compared to running a single heap (even with the 16k cache, 48k heap + 2nd heap config).

My current implementation is already quite a bit more stable now that I only switch to the 2nd heap allocator very briefly and make sure the web server class is explicitly forced to use only DRAM.
But it still feels a bit hit-or-miss, as some seemingly unrelated code changes can cause a lot more crashes, while other builds are quite hard to get to crash.

Edit:
Happened again.
A single page load took 6021.568 msec, during which everything seemed to have stopped responding (each sample is at a 1-second interval).
image

@mhightower83
Contributor

Shouldn't those 32-bit aligned pointers be marked as volatile ?

I don't see the need unless you use it from an ISR. All references/accesses are in plain view to the compiler.

But I get the feeling I have the wrong idea about what this "16k + 48k" option really does, and if so, the 2nd heap issue becomes relevant again.

Okay, I may need to reword some of this. Below is from the doc/mmu.rst - where do you get confused?

Option Summary

The Arduino IDE Tools menu option, MMU has the following selections:

  1. 32KB cache + 32KB IRAM (balanced)
    • This is the legacy ratio.
    • Try this option 1st.
  2. 16KB cache + 48KB IRAM (IRAM)
    • With just 16KB cache, execution of code out of flash may be slowed by more cache misses when compared to 32KB. The slowness will vary with the sketch.
    • Use this if you need a little more IRAM space, and you have enough DRAM space.
  3. 16KB cache + 48KB IRAM and 2nd Heap (shared)
    • This option builds on the previous option and creates a 2nd Heap made with IRAM.
    • The 2nd Heap size will vary with free IRAM.
    • This option is flexible. IRAM usage for code can overflow into the additional 16KB IRAM region, shrinking the 2nd Heap below 16KB. Or IRAM can be under 32KB, allowing the 2nd Heap to be larger than 16KB.
    • Installs a Non-32-Bit Access handler for IRAM. This allows for byte and 16-bit aligned short access.
    • This 2nd Heap is supported by the standard malloc APIs.
    • Heap selection is handled through a HeapSelect class. This allows a specific heap selection for the duration of a scope.
    • Use this option, if you are still running out of DRAM space after you have moved as many of your constant strings/data elements that you can to PROGMEM.
  4. 16KB cache + 32KB IRAM + 16KB 2nd Heap (not shared)
    • Not managed by the umm_malloc heap library
    • If required, non-32-Bit Access for IRAM must be enabled separately.
    • Enables a 16KB block of unmanaged IRAM memory
    • Data persist across reboots, but not deep sleep.
    • Works well for when you need a simple large chunk of memory. This option will reduce the resources required to support a shared 2nd Heap.
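
For option 3 (the shared 2nd Heap), heap selection with the HeapSelect classes looks roughly like this (a minimal sketch, assuming the umm_malloc/umm_heap_select.h header from the core; the buffer name and size are made up):

    #include <Arduino.h>
    #include <umm_malloc/umm_heap_select.h>   // HeapSelectIram / HeapSelectDram

    static char *bigBuffer = nullptr;

    void setup() {
      {
        HeapSelectIram ephemeral;              // allocations in this scope come from the IRAM 2nd heap
        bigBuffer = (char *)malloc(8 * 1024);
      }                                        // scope ends: back to the previously selected heap
      // bigBuffer stays valid afterwards, but byte/short accesses to it are
      // serviced by the non-32-bit exception handler (or mmu_get_uint8/mmu_set_uint8).
    }

    void loop() {}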

All crashes I've seen were Watchdog crashes.

Hardware, Software, or both?

I can only offer guesses here.

My first thought is stack overflow. If the stack space is already tight, the extra space needed to handle the exception could push things over the top. Access from an ISR would be even worse.

There are about 94 places in the SDK (v3.0.5) that produce a deliberate Software WDT reset. I saw one of these 13 days ago that was not preceded by OOM. Debug reported the last gasp as pm 2060. I am guessing, or concerned, that intense use of the exception handler may be slowing things down such that the SDK does not get to run often enough to properly service the WiFi. For me these are rare, and I cannot associate them with anything other than traffic stress.

None of these offers a strong suggestion as to why there is a difference between the Wemos and the Sonoff ... hmm, maybe the Sonoff is having more WiFi issues, falling behind until the SDK runs out of internal resources. A lot of guessing. :(

@TD-er
Contributor Author

TD-er commented Nov 18, 2023

The only reason I can think of why the Sonoff POW r2 may be acting differently is that it receives a lot of continuous serial data at the quite low bitrate of 4800 baud.
So this adds to the number of hardware interrupts being handled, which may just push the ESP over some limit?

I don't have the logs/dumps anymore, so I'm not 100% sure what was the reported reboot reason of the Sonoff.
On another unit which has GPS @ 9600 baud on HW Serial0 and a SenseAir module on SW serial (also 9600 baud), the last reboot reason was HW Watchdog. (running 2nd heap build, uptime is now 2 days on that one)

What would trigger the exception handler, by the way?
All I try to do is store large strings in the 2nd heap, some frame buffer data, and the buffer of the MQTT PubSubClient.
When serving data to the web server that isn't a String, I do use the mmu inline functions to access the data.

And the reason I was thinking about declaring the pointer volatile is that there is now such an odd direct call to __builtin_memcpy just to make sure there aren't any compiler optimizations for memcpy.
But if you just declare the void *v32 as volatile, you shouldn't have to worry about this, should you?

Another thing I changed in my code over the last 2 days is that I try to keep the 2nd heap active for as short a time as possible, by only activating it in the few functions that move or reserve data on the 2nd heap.
The idea behind this is that you have no control over when a callback function will fire, and thus it might fire while the 2nd heap is active.
Or is there some protection against this?

At least the builds have become quite a bit more stable since I changed this. (even on the non-Sonoff units)

About the unclear description:

16KB cache + 48KB IRAM (IRAM)
With just 16KB cache, execution of code out of flash may be slowed by more cache misses when compared to 32KB. The slowness will vary with the sketch.
Use this if you need a little more IRAM space, and you have enough DRAM space.

The 16k cache remark is perfectly clear.
But the 2nd one isn't.
Especially when you look at the reported output in the comment I mentioned before (and could not reproduce myself), where the available heap size appears to be much larger.
Even more than 48k, now that I think of it...

@TD-er
Contributor Author

TD-er commented Nov 19, 2023

I installed a 2nd heap build on one of the oldest NodeMCU boards I have, and it also crashes quite often without much to do (nothing accessing the web interface, just some MQTT traffic).
The reboots are all HW watchdog.
So I guess the revision of the ESP may have something to do with it.
I have even set it to DIO mode instead of DOUT. (Vendor: 0x20 Device: 0x4016)

@TD-er
Contributor Author

TD-er commented Nov 19, 2023

Did some more testing and also removed the PubSubClient buffer from the 2nd heap as it really only accesses the memory per byte.
I may want to take a look at it later to see if I can make it behave more nicely with 32-bit access and using 4-byte aligned buffer.

So I only store arrays of floats in this 2nd heap and large strings.
These arrays of float should always be 4-byte aligned, so that's no issue.
However the String functions do not seem to be really aware of the 2nd heap's memory access requirements.

For example, reserve() doesn't seem to allocate with 4-byte alignment.
The newSize is rounded to a multiple of 16 bytes, but umm_malloc_core doesn't seem to 'round up' to a multiple of 4 bytes when on the 2nd heap, so there is no guarantee that nothing else is stored there causing an alignment offset.

I have no idea whether memcpy, memmove, or memset access this memory as they should.

@TD-er
Contributor Author

TD-er commented Nov 19, 2023

Ah it seems like memmove_P doesn't handle iram correctly:
https://github.com/earlephilhower/newlib-xtensa/blob/e1db641ecaddb1fe9310d64e613f3fb20229ce00/newlib/libc/sys/xtensa/string_pgmspace.c#L184-L190

If I'm not mistaken, the iram starts at:

#define XCHAL_INSTRAM1_VADDR		0x40100000

So I think the test should be something like this:

    if ( ((const char *)src >= (const char *)0x40000000) || ((const char *)dest >= (const char *)0x40000000) )

or use the inFlash as suggested here: #8671
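
For illustration, that test could be wrapped in a helper along these lines (a hypothetical sketch built around the IRAM base quoted above, not the core's actual macros):

    #include <stdint.h>   // uintptr_t

    // IRAM starts at XCHAL_INSTRAM1_VADDR (0x40100000) and the memory-mapped
    // flash (PROGMEM/ICACHE) window starts at 0x40200000; neither region
    // tolerates byte/short loads and stores without help from the exception
    // handler, so only 32-bit accesses are safe.
    static inline bool needs_32bit_access(const void *p) {
      return (uintptr_t)p >= 0x40100000u;
    }
    // memmove_P-style routines would then take the byte-safe path whenever
    // needs_32bit_access(src) || needs_32bit_access(dest) is true.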

Also there are some left-over memmove and memcpy calls in WString.cpp which I think should be memmove_P and memcpy_P.
Maybe we also need a similar variant for memset?

@TD-er
Contributor Author

TD-er commented Nov 19, 2023

Hmm, quite a lot of the functions in xtensa/string_pgmspace.c could be made quite a bit more efficient, and they may also need some TLC when it comes to 32-bit access.

strnlen_P still checks each byte of uint32_t w = *pmem;
But you could use the 'magic' used in strncpy_P: https://github.com/earlephilhower/newlib-xtensa/blob/ebc967552ce827f21fc579fd8c437037c1b472ab/newlib/libc/sys/xtensa/string_pgmspace.c#L244-L246
Then all that is left in that function is the already present per-byte check.
Worst case, you may need to perform one extra pgm_read_byte, but execution speed and code size will improve.
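
For readers who don't know that 'magic': it is the classic word-at-a-time zero-byte test, roughly this (a sketch of the idea, not the newlib code):

    #include <stdint.h>

    // Non-zero exactly when one of the four bytes in w is zero. This lets a
    // string routine read 4 bytes at a time (a single 32-bit access, safe for
    // flash/IRAM) and only fall back to per-byte checks near the NUL terminator.
    static inline bool word_has_zero_byte(uint32_t w) {
      return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
    }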

memcpy_P should check whether it is writing to IRAM and, if so, use mmu_set_uint8.
This could of course be made a lot faster by using a stack-allocated buffer to match alignments, but that would make the code a bit larger.
On the other hand, if malloc only hands out 4-byte aligned addresses on the 2nd heap, the existing code may already be fast enough.
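
A minimal sketch of that idea (a hypothetical helper, not the newlib implementation), using the core's byte accessors so every store into IRAM becomes a 32-bit read-modify-write:

    #include <Arduino.h>     // pgm_read_byte
    #include <mmu_iram.h>    // mmu_set_uint8 (header name assumed from the core sources)

    // Copy n bytes from PROGMEM/flash into an IRAM destination using accessors
    // that are safe for both memory regions.
    static void copy_P_to_iram(void *dest, const void *src_P, size_t n) {
      uint8_t *d = (uint8_t *)dest;
      const uint8_t *s = (const uint8_t *)src_P;
      for (size_t i = 0; i < n; i++) {
        mmu_set_uint8(d + i, pgm_read_byte(s + i));
      }
    }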

memcmp_P is just slow and not using mmu_get_uint8 when buf1 is on 2nd heap.

memccpy_P is also not accessing the dest buffer correctly when on 2nd heap.

memmem_P is also not accessing buf in the correct manner.

strncpy_P is also not accessing write (actually dest) in the correct manner for 2nd heap.

strncat_P is also not accessing write (actually dest) in the correct manner for 2nd heap.

strncmp_P is not reading in the correct way from str1

strncasecmp_P ditto.

@mhightower83
Contributor

However the String functions do not seem to be really aware of the 2nd heap's memory access requirements.

String, byte, and short types rely on the exception handler for accessing IRAM heap data. This consumes an additional 256+ bytes of the stack.

For example reserve() doesn't seem to allocate with 4-byte alignment.

umm_malloc inherently will handle alignments. Its headers express a block count. Each block is 8 bytes. Thus, it has to work with 64-bit blocks of aligned memory and always returns a 32-bit aligned address.

The idea behind this is that you have no control over when a callback function will fire, and thus it might fire while the 2nd heap is active.
Or is there some protection against this?

Yes, the Arduino sketch needs to ensure the heap selection in callbacks.
However, the SDK and lwIP heap requests are always satisfied with DRAM. They use the pvPortMalloc family of Heap calls which are always served DRAM. (except for SDK v3.0.5 which can explicitly request 2nd heap.)
Also, when we yield back to the SDK, the Heap context is set to DRAM. Thus, callbacks performed by the SDK will default to DRAM.
Allocations from an ISR default to DRAM. An exception would be realloc, which is not supported in an ISR context; however, it is coded to resize memory matching the original allocation's Heap context.
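
As a usage sketch (a hypothetical handler, assuming umm_malloc/umm_heap_select.h): because the selection is scope-based, a callback that must not allocate from IRAM can simply pin itself to DRAM:

    #include <Arduino.h>
    #include <umm_malloc/umm_heap_select.h>

    // Hypothetical web-server handler: force DRAM for everything allocated in
    // this scope, no matter which heap the caller had selected.
    void handleRoot() {
      HeapSelectDram ephemeral;
      String page;
      page.reserve(2048);   // the String buffer lands in DRAM, so byte access stays cheap
      // ... build and send the page ...
    }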

And the reason I was thinking about declaring the pointer volatile is that there is now such an odd direct call to __builtin_memcpy just to make sure there aren't any compiler optimizations for memcpy.
But if you just declare the void *v32 as volatile, you shouldn't have to worry about this, should you?

This concern was handled by

// Use an empty ASM to reference the 32-bit value. This will block the
// compiler from immediately optimizing to an 8-bit or 16-bit load instruction
// against IRAM memory. (This approach was inspired by
// https://github.com/esp8266/Arduino/pull/7780#discussion_r548303374)
// This issue was seen when using a constant address with the GCC 10.3
// compiler.
// As a general practice, I think referencing by way of Extended ASM R/W
// output register will stop the compiler from reloading the value later
// as 8-bit load from IRAM.
asm volatile ("" :"+r"(val)); // inject 32-bit dependency
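
Put together, the accessor is roughly the following (a simplified sketch of what mmu_get_uint8 does, not the exact core code):

    #include <stdint.h>

    static inline uint8_t get_iram_uint8(const void *p8) {   // hypothetical name
      // Load the whole 32-bit word containing the byte (4-byte aligned address).
      uint32_t val = *(const uint32_t *)((uintptr_t)p8 & ~(uintptr_t)3u);
      // Empty ASM with a read/write register constraint: keeps the compiler
      // from narrowing this to an 8-bit load against IRAM.
      asm volatile ("" : "+r"(val));
      // Shift the wanted byte (little-endian) down and truncate.
      return (uint8_t)(val >> (((uintptr_t)p8 & 3u) * 8u));
    }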

But the 2nd one isn't.
Especially when you look at the reported output in the comment I mentioned before (and could not reproduce myself), where the available heap size appears to be much larger.
Even more than 48k, now that I think of it...

The example you referenced is building with the 16KB cache + 48KB IRAM and 2nd Heap (shared) option. I tried it and got similar results.

Does this make more sense?

  1. 16KB cache + 48KB IRAM and 2nd Heap (shared)
    • This option builds on the previous option and creates a 2nd Heap made from unused IRAM.
    • The 2nd Heap size will vary with free IRAM.
      • As the IRAM code size increases the available 2nd Heap size decreases.
      • And, likewise when the IRAM code size shrinks, more space is available in the 2nd Heap.
    • This option is flexible. IRAM usage for code can overflow into the additional 16KB IRAM region, shrinking the 2nd Heap below 16 KB. Or IRAM can be under 32KB, allowing the 2nd Heap to be larger than 16 KB.
    • Installs a Non-32-Bit Access handler for IRAM. This allows for byte and 16-bit aligned short access.
    • This 2nd Heap is supported by the standard malloc APIs.
    • Heap selection is handled through a HeapSelect class. This allows a specific heap selection for the duration of a scope.
    • Use this option, if you are still running out of DRAM space after moving as many of your constant strings/data elements that you can to PROGMEM.

memcmp_P is just slow and not using mmu_get_uint8 when buf1 is on 2nd heap.

Hmm, the PROGMEM family of functions was written to read from flash/ICACHE not to write to IRAM. So, writing to buf1 is handled through the exception handler. This will be true of all the PROGMEM APIs. These APIs are able to read from PROGMEM or IRAM; however, they will have to rely on the exception handler to write to IRAM.

@TD-er
Contributor Author

TD-er commented Nov 19, 2023

Hmm, the PROGMEM family of functions was written to read from flash/ICACHE not to write to IRAM. So, writing to buf1 is handled through the exception handler. This will be true of all the PROGMEM APIs. These APIs are able to read from PROGMEM or IRAM; however, they will have to rely on the exception handler to write to IRAM.

Yep, but the String class does use these PROGMEM functions. So if you try to store large String objects on the 2nd heap, you will see those access patterns.

Is it OK to try to extend the String class to properly use the 2nd heap and make a PR for it?
Or is the 2nd heap not intended to be used for this?

@CurlyMoo

2nd heap works fine here. Maybe you can use it for inspiration:
https://github.com/CurlyMoo/rules/

@TD-er
Contributor Author

TD-er commented Jan 22, 2024

It highly depends on the flash chip / ESP module you're using.
I do have a node here that has been running for weeks without a crash, and it is quite packed with SW serial devices, a HW serial GPS, several I2C devices, logging to flash, and uploading of recorded data.
And another node can hardly run idle for more than a few hours, while it runs perfectly stably on builds without the 2nd heap.
