Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTA of large files results in device hangs due to bootloader compare error #7458

Closed
arendst opened this issue Jul 15, 2020 · 31 comments
Closed

Comments

@arendst
Copy link

arendst commented Jul 15, 2020

After a recent core change Tasmota fails to load binaries over 560k due to new bootloader functionality.

After the file is loaded this is reported:

load 0x4010f000, len 3664, room 16 
tail 0
chksum 0xee
csum 0xee
v39c79d9b
@cp:0
cmp:
 ets Jan  8 2013,rst cause:3, boot mode:(3,7)

ets_main.c 

It seems the bootloader fails to compare the copied binary with the uploaded binary (for obvious reasons as part of the uploaded file are being overwritten by the bootloader copy routine).

A recent commit 51daecc introduces this cmp functionality not present in previous bootloader versions and leads to the device hangs.

Pls remove the compare functionality as it breaks OTA functionality

EDIT: Using core 2.7.1 the OTA uploads work fine (as it has no compare functionality)

@earlephilhower
Copy link
Collaborator

Oh poop, I think I see what you're saying.

You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 100K-300K and copied down to 0-200K...overwriting the 1st 100K of the original OTA space. Is that the general gist of the issue?

@arendst
Copy link
Author

arendst commented Jul 15, 2020

From what I see here:

    if (cmd.action == ACTION_COPY_RAW) {
        ets_putc('c'); ets_putc('p'); ets_putc(':');

        ets_wdt_disable();
        res = copy_raw(cmd.args[0], cmd.args[1], cmd.args[2], false);
        ets_wdt_enable();

        ets_putc('0'+res); ets_putc('\n');

        // Verify the copy
        ets_putc('c'); ets_putc('m'); ets_putc('p'); ets_putc(':');
        if (res == 0) {
            ets_wdt_disable();
            res = copy_raw(cmd.args[0], cmd.args[1], cmd.args[2], true);
            ets_wdt_enable();
            }

it first does a copy_raw loop moving the downloaded data to it's destination. It then calls the copy_raw loop again comparing the data.

@arendst
Copy link
Author

arendst commented Jul 15, 2020

Yes. That's what I meant

@arendst
Copy link
Author

arendst commented Jul 16, 2020

I suggest to get rid of the verify anyway. If verify fails it will stall as it does now. Without verify it may run but terminate with exceptions allowing to re-ota a correct binary. It provides additional functionality which a simple verify doesn't.

BTW until now, three years in the process, I never observed any bootloader binary failure so I think there is no need for this functionality at all.

As an additional advantage it also reboots faster without the verify ;-)

If you still want verify why not make a hash of the source binary while copying it to it's final destination and then re-read the copied locations and compare to the hash.

@mcspr
Copy link
Collaborator

mcspr commented Jul 16, 2020 via email

@devyte
Copy link
Collaborator

devyte commented Jul 16, 2020

I never observed any bootloader binary failure

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and after lots of pain we figured out it had to do with the chip brand.

why not make a hash

Because the hash algorithm would need to be in the eboot code, and there's almost no more space (4KB size max). In theory we could go above 4KB, but that would jump to 8KB of bootloader, and most of it would be empty. And we're not sure it will work.

@arendst
Copy link
Author

arendst commented Jul 16, 2020

It feels like this also could be immediate read of the written data? By changing ‘verify’ option to select between ‘just write the data’ and ‘write the data, read it back, compare’

It could but then it might be you're reading the ram part (cache) of the flash (if it exists) instead of the real flash area. The hash is a better verify process I think. It shouldn't be that tricky something like:

uint32_t GetCfgCrc32(uint8_t *bytes, uint32_t size)
{
  // https://create.stephan-brumme.com/crc32/#bitwise
  uint32_t crc = 0;

  while (size--) {
    crc ^= *bytes++;
    for (uint32_t j = 0; j < 8; j++) {
      crc = (crc >> 1) ^ (-int(crc & 1) & 0xEDB88320);
    }
  }
  return ~crc;
}

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and after lots of pain we figured out it had to do with the chip brand.

I have met Puya but they never failed me on restart. They missed the binary OTA upload indeed but then they still rebooted as the current implementation will halt waiting for a power cycle.

@devyte
Copy link
Collaborator

devyte commented Jul 16, 2020

The point is that the verify is supposed to detect other potential cases like that which are currently unknown. New flash brands are showing up every so often, they can have quirks like this, and we've already bled twice as a result, so removing all verification is not an option.
Having said that, overwriting the beginning of the source image in the empty space may be valid. HOWEVER: you should be aware that after such an OTA upgrade, the next OTA must be a smaller image, and you may not be able to OTA again if you don't prepare for that (see below).

About what @earlephilhower said:

You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 100K-200K, and copied down to 0-200K

The received image is end-aligned in the empty space area, so what would happen is something like this:
"You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 800K-1000K (minus FS + config block, etc), and copied down to 0-200K"

So, in your case:

  • 1MB flash size
  • 560K bin size
  • assume no FS and negligible config etc
  • assume 420KB current bin size (just some number)
  • assume no compression (to simplify)

Then:

  • Your current bin is placed at 0-420KB
  • Your new image is received at 440-1000KB (highest 560KB)
  • New image is copied to 0-560KB (overwrites beginning of empty area image 440-560KB => verify fails)
  • Your new empty space is 560-1000KB = 440KB

So for your next OTA you're constrained to 440KB size, down from 560KB. Can you handle that? ln general, hitting this case means that future updates must be done with a 2-step process by first flashing a small stub app, in which case you won't have this verify problem anyways.

@earlephilhower
Copy link
Collaborator

@devyte , FWIW there are some hashing functions in ROM (SHA1, MD5) which could be used. Might still be issues with fitting in the code on our end, but it's not clear it's going to cross the 4k boundary.

@arendst
Copy link
Author

arendst commented Jul 16, 2020

@devyte Yes I can handle that.

Tasmota uses OTAMagic. This detects that the requested image won't fit in the space allocated and switches to an intermediate binary called tasmota-minimal which then in turn loads the requested binary. All without user intervention.

This mechanism worked fine for over a year until the bootloader verify option came in play; with the minimal binary active all requested OTA binaries would be larger and were bound to fail.

@mcspr
Copy link
Collaborator

mcspr commented Jul 16, 2020

It could but then it might be you're reading the ram part (cache) of the flash (if it exists) instead of the real flash area. The hash is a better verify process I think. It shouldn't be that tricky something like:

Just to be on the same page, the idea is for eboot to continuously update hash based on write buffer? Doesn't this still hit the caching problem when re-reading the data to calculate the hash for a 2nd time? Speed is another question though, as right now it is only a single memcmp.
I have tested my own suggestion with write->read->check with 1MB board, so at least that works. I would agree that new behavior is not really user friendly :/

As a sidenote, I was wondering about caching mechanisms, it looks like SDK rom .ld has something that looks like related to the flash operations - Cache_Read_{Enable,Disable}. There are also same functions with _2 suffix, without arguments, but those are not listed in headers, but are included in the spi_flash.o of the SDK. Cache_Read_Enable(u8, u8, u8) needs some specific arguments which does not seem to be documented anywhere, and _2-suffixed also don't really work with bootleg version I c/p from the RTOS sources

Even then, is caching even enabled by default (i.e. is it a hw property) or does sdk 'enable' it only after the app is loaded?

@mhightower83
Copy link
Contributor

As a sidenote, I was wondering about caching mechanisms, ...

@mcspr, As I understand it, ICACHE is out of the picture at boot time. You have 64K of IRAM with eboot sitting at the end. At least until Cache_Read_Enable is called by the NONOS-SDK initialization. By the time the SDK calls user_init in the core, Cache_Read_Enable has been called and IRAM is now 32K and ICACHE is 32K. (At least for now 😃)

After this, to do flash read/writes you need to use SDK system APIs to handle flash read/writes. Direct ROM flash API calls will fail. The SDK system calls use calls to Cache_Read_Enable_2 and Cache_Read_Disable_2 as needed.

My memory is a little fuzzy, Cache_Read_Enable_2 and Cache_Read_Disable_2 are more like a suspend/resume kind of operation for ICACHE. Where Cache_Read_Enable and Cache_Read_Disable have more logic dedicated to setup and teardown.

If you want to see some examples using it, I have 3 active PRs that turn on Cache_Read_Enable early, as in before the SDK runs and turns it off before the SDK is launched. I try not to surprise the SDK. Method 3 in PoC for handling Erase WiFi Setting after OTA #6965 uses Cache_Read_Enable_2 and Cache_Read_Disable_2.

One note, Cache_Read_Disable is not a 100% restore to the previous state. The cache window size does not shrink; however, it can grow. If you were at 32K when you disable you cannot go back to 16K at a new enable. For this reason, I only use the 16K cache size option. All is properly restored with a reboot.

In summary, while eboot is running there should not be any cache interference while accessing the flash with ROM API calls.

@Jason2866
Copy link
Contributor

@devyte

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and
after lots of pain we figured out it had to do with the chip brand.

Yes, problems in normal use of the flash. NEVER when writting with the bootloader routines.
See the revert of the uneeded XMC PR for eboot.

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 18, 2020

Is it sufficient to revert #7317 from 2.7.2 for a clean 2.7.3 ?

d-a-v added a commit that referenced this issue Jul 18, 2020
@arendst
Copy link
Author

arendst commented Jul 18, 2020

Fine with me.

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 18, 2020

Can someone with an XMC flash chip can check if a4b3105 is working with OTA (where 2.7.2 is failing) ?

That one is ready to be published and is 2.7.2 without #7317.

@devyte
Copy link
Collaborator

devyte commented Jul 18, 2020

As discussed internally, reverting #7317 entirely is not an option. Only the verify step added in that merge must be disabled.

devyte added a commit that referenced this issue Jul 18, 2020
Meant for #7458 , but still requires a recompiled eboot.elf.
@d-a-v
Copy link
Collaborator

d-a-v commented Jul 18, 2020

2.7.3 branch has been force-updated.
Testers are welcome before publishing the hotfix release !

Edit: link to the changes (including recompiled eboot.elf)

@mcspr
Copy link
Collaborator

mcspr commented Jul 19, 2020

No issues so far, OTA or otherwise. Both xmc d1mini clone and esp8285 board

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 19, 2020

@mcspr did you try the new 2.7.3 branch (not the link provided above which I just "disabled " ?

@mcspr
Copy link
Collaborator

mcspr commented Jul 19, 2020

Last one posted, where 59bbfc7 is the last commit.

$ git describe --dirty --abbrev=40
2.7.2-4-g59bbfc70610c68e2728175cd5ac9f0519bb10992

@Jason2866
Copy link
Contributor

Jason2866 commented Jul 19, 2020

@d-a-v the branch 2.7.3 need to set build flags for the compiler version.

build_unflags        -std=c17 -std=gnu++17

build_flags            -std=gnu99 -std=c++11

@Jason2866
Copy link
Contributor

Jason2866 commented Jul 19, 2020

The branch generates warnings which are not in core 2.7.2.

In file included from /home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.cpp:26:0:
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h: In member function 'void Ticker::attach(float, void (*)(TArg), TArg)':
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h:78:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wcast-function-type"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h: In member function 'void Ticker::attach_ms(uint32_t, void (*)(TArg), TArg)':
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h:89:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wcast-function-type"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/gdb_hooks.cpp:37:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wattribute-alias"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/gdb_hooks.cpp:38:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wmissing-attributes"
                                ^

@Jason2866
Copy link
Contributor

Sorry guys. Forget my posts above. I messed up something in my fork with git commands...
Everything is working without a warning!

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 19, 2020

With your two confirmations we are going to release it.
It's an unusual (not automated) release process so let's cross fingers.

@ChocolateFrogsNuts
Copy link
Contributor

A few thoughts - yes just commenting the verify call is the way to go at least for a quick fix.
Having some form of verify is probably a good idea though.... how about generating a hash from the write buffer and then generating a second hash from reading the flash and comparing. ROM code could be used, or there were about 200 bytes of eboot space gained from removing the XMC code from eboot (which was never actually needed in the end). Its not much but might be enough.
The alternative is to store a hash of the uncompressed data in the downloaded image reducing the workload during the write, however about 90% of the write time is actually spent waiting for the flash chip to store the data so I see no real gain in that.

@ChocolateFrogsNuts
Copy link
Contributor

Alternately - what about simply flagging that data was written beyond the start address of the compressed image, and skipping the verify when that happens?

@Tech-TX
Copy link
Contributor

Tech-TX commented Jul 20, 2020

Sorry, I've been offline a few days. I'll test it this evening, as I have a bunch of modules with XMC flash.

Haven't had OTA issues since the last XMC fix, but I'm not up to current build.

@Tech-TX
Copy link
Contributor

Tech-TX commented Jul 21, 2020

Earle rebuilt the eboot.elf with current git + #7468 and I've tested it with two different D1 Minis with XMC, both at 80 & 160 and going back and forth. No errors with my test code, which is nowhere near as complex as Tasmota is. Mine's only 326032 bytes, 31% of 1meg program storage space on 4meg XMC flash.

I also tested with d-a-v's eboot.elf, same results: no problem with 2 boards at any compile speed.

devyte added a commit that referenced this issue Jul 21, 2020
* Comment out verify step in eboot.c

Meant for #7458 , but still requires a recompiled eboot.elf.

* Rebuild eboot.elf from changed source

Co-authored-by: Earle F. Philhower, III <[email protected]>
@earlephilhower
Copy link
Collaborator

2.7.3 is now released with the patch in #7468 (it's also in master), so I'm closing this. If there still is any issue, please let us know.

@arendst
Copy link
Author

arendst commented Jul 29, 2020

Thx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants