Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using TRICE_RING_BUFFER still crashes if "Tricing a lot" #462

Closed
KammutierSpule opened this issue May 15, 2024 · 34 comments
Closed

Using TRICE_RING_BUFFER still crashes if "Tricing a lot" #462

KammutierSpule opened this issue May 15, 2024 · 34 comments
Assignees
Labels
quality improve code quality

Comments

@KammutierSpule
Copy link

KammutierSpule commented May 15, 2024

I'm using TRICE_RING_BUFFER
however I notice if I flood with Trice's my software will crash.

I was assuming TRICE_RING_BUFFER was buffer overflow safe (as opposite to TRICE_DOUBLE_BUFFER) or Am I mistake?

What could be wrong?

Edit: I found if I use TRICE_FRAMING_COBS instead of TRICE_FRAMING_TCOBS it takes long until I get a crash.

@rokath
Copy link
Owner

rokath commented May 15, 2024

There is no protection against overflow. If you produce in the average more data than transmittable or have a data burst, which exceeds the provided buffer size, you will end up with a crash. Adding some overflow protection would lead to the question, which data to throw away or how to slow down the code. Also it would increase the code size and decrease the performance. The recommended way is to firstly make the buffer size reasonable big and after running a while to check the TriceRingBufferDepthMax value for example. This needs TRICE_DIAGNOSTICS to be 1 in your triceConfig.h file. At the end of trice/examples/exampleData/triceCheck.c file you can find function TriceLogDiagnosticValues() as example, which you could call cyclically every minute or so. Finally you know the needs and can adjust the buffer size.

@KammutierSpule
Copy link
Author

but if TRICE_RING_BUFFER is circular, what is making it to overflow?

@rokath
Copy link
Owner

rokath commented May 15, 2024

The ring buffer overflow does not mean, that other memory regions are destroyed. Inside the ring buffer older messages not transmitted yet, are overwritten. This could lead to incomplete messages or simply missed messages. About missed messages the trice tool will complain with a cycle counter mismatch and incomplete messages will cause also error messages.

@KammutierSpule
Copy link
Author

The ring buffer overflow does not mean, that other memory regions are destroyed.

In this case, clarifying my original question, I'm experience my firmware crashing when flooding with Trices (eg calling Trice() inside some regular loop in the code that will work regular when not calling a Trice line).

If not ring buffer does not corrupt other memory regions, could be other case?

@rokath
Copy link
Owner

rokath commented May 15, 2024

Please have a look at issue #294. Could that be the reason? What is the buffer size and what does the max depth value say? How many date bytes you produce per time unit and how much you can transmit (what is your baud rate)? A trice without time stamp and without values is only 4 bytes. But if you have a loop of 1000 for example you produce 4KB date in nearly no time. Can you reduce the data amount just experimental to 1/10 or 1/100 just to understand what happens? You could check the buffer borders too (maybe there is a bug in the code). That is a bit tricky because the linker is free to arrange the memory regions. Use your debug tool for that. AND: Do not forget about your interrupts!

@KammutierSpule
Copy link
Author

Please have a look at issue #294. Could that be the reason?

I don't think it is the case here.

What is the buffer size and what does the max depth value say? How many date bytes you produce per time unit and how much you can transmit (what is your baud rate)?

On this test I'm deliberating flooding, i.e: with the propose to exceed the size of the buffer.
I'm testing the case if "something goes wrong in the code" at some point that could flood the trice buffer.

So, a part from this deliberated test, Trice is working ok.

I will debug for buffer overflow.

@rokath
Copy link
Owner

rokath commented May 15, 2024

I like that you do such harsh tests.

Maybe the trice tool -s CLI switch can help a bit. There is also a -debug switch for the trice tool.

@KammutierSpule
Copy link
Author

I did this:

//! TriceRingBuffer is a kind of heap for trice messages.
uint32_t TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2) + 16] = {0};

Then I initialized the buffer marks:
Screenshot from 2024-05-15 14-17-57

Please note that (TRICE_DEFERRED_BUFFER_SIZE>>2) is 1 position out of the buffer size (0 indexes array)

I place a breakpoint if the memory position outsize buffer was changed.

I got a breakpoint on tcobsv1Encode here:
OUTB( b_1 ) // aa, -- xx. yy ...

Then I checked and add a break point, analyze memory:
Screenshot from 2024-05-15 14-21-45

It looks there are 3 writes outsize range.
This was testing without flooding.

Also, I don't understand this line of code here:
TriceBufferWritePosition = (TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) <= triceRingBufferLimit ? TriceBufferWritePosition : TriceRingBuffer; \

Doesn't it means that it will be (almost?) all the time?
I can't understand from the code, where (if it should?) it is "ringing" the buffer. I was expected it to be circular.

@rokath
Copy link
Owner

rokath commented May 15, 2024

Just to explain this for now:

TriceBufferWritePosition = (TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) <= triceRingBufferLimit ? TriceBufferWritePosition : TriceRingBuffer; \

The TriceBufferWritePosition stays unchanged, when there is enough space for the next trice at the current write position. Because the the size of the next trice message is unknown here, the biggest value is assumed, that is TRICE_BUFFER_SIZE bytes. If this space is not given anymore, the TriceBufferWritePosition is reset to the start address of the ring buffer. This implementation is a bit different than a ring buffer is usually implemented.

This explanation is now added to the source code.

@rokath
Copy link
Owner

rokath commented May 15, 2024

Unfortunately I have no time right now for looking deeper and tomorrow I am mostly offline. The problem you found could be a part of the TCOBS code. So one shot could be, that you switch for investigation to COBS. Just change the setting in triceConfig.h and apply the appropriate trice tool CLI switch.

@rokath
Copy link
Owner

rokath commented May 15, 2024

Could it be, that you mixed TRICE_DEFERRED_BUFFER_SIZE with TRICE_BUFFER_SIZE in your investigations?

I would do it like this:

Change

uint32_t TriceRingBuffer[TRICE_DEFERRED_BUFFER_SIZE>>2] ;

into

uint32_t TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+4] ;

and initialize

TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+0] = 0xdeadb33f ;
TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+1] = 0xdeadb33f ;
TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+2] = 0xdeadb33f ;
TriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+3] = 0xdeadb33f ;

and watch if any of these change.

To check that complete you need to do that also at the ring buffer start.

Maybe like this:

uint32_t checkedTriceRingBuffer[4+(TRICE_DEFERRED_BUFFER_SIZE>>2)+4] ;
uint32_t* TriceRingBuffer = &checkedTriceRingBuffer[4];

and initialize

checkedTriceRingBuffer[0] = 0xdeadb33f ;
checkedTriceRingBuffer[1] = 0xdeadb33f ;
checkedTriceRingBuffer[2] = 0xdeadb33f ;
checkedTriceRingBuffer[3] = 0xdeadb33f ;
checkedTriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+0] = 0xdeadb33f ;
checkedTriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+1] = 0xdeadb33f ;
checkedTriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+2] = 0xdeadb33f ;
checkedTriceRingBuffer[(TRICE_DEFERRED_BUFFER_SIZE>>2)+3] = 0xdeadb33f ;

and watch if any of these change.

THIS CODE IS COMPLETELY UNTESTED! IT IS JUST AN IDEA.

@rokath
Copy link
Owner

rokath commented May 16, 2024

You wrote: "Please note that (TRICE_DEFERRED_BUFFER_SIZE>>2) is 1 position out of the buffer size (0 indexes array)". I cannot clearly understand what you mean. Please explain it a bit more.

@KammutierSpule
Copy link
Author

KammutierSpule commented May 16, 2024

Firstly, sorry I mistake TRICE_BUFFER_SIZE vs TRICE_DEFERRED_BUFFER_SIZE, they have completely different means :)

Now I changed for the correct debug:

void CheckBuffer()
{
    if( (TriceRingBuffer[0 + 0] != 0xDEADB33F) ||
        (TriceRingBuffer[0 + 1] != 0xDEADB33F) || 
        (TriceRingBuffer[0 + 2] != 0xDEADB33F) ||
        (TriceRingBuffer[0 + 3] != 0xDEADB33F) ||
        (TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + 0] != 0xDEADB33F) ||
        (TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + 1] != 0xDEADB33F) ||
        (TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + 2] != 0xDEADB33F) ||
        (TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + 3] != 0xDEADB33F) ) {
        const size_t dummy = strlen( TriceRingBuffer ); // breakpoint here
    }
}
init:
    TriceRingBuffer[0 + 0] = 0xDEADB33F;
    TriceRingBuffer[0 + 1] = 0xDEADB33F;
    TriceRingBuffer[0 + 2] = 0xDEADB33F;
    TriceRingBuffer[0 + 3] = 0xDEADB33F;

    for( int i = 0; i < 128; i++) {
    TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + i] = 0xDEADB33F;
    }
+        CheckBuffer();
         TriceTransfer(); // serve deferred output
+        CheckBuffer();
 -   TriceBufferWritePosition = (TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) <= triceRingBufferLimit ? TriceBufferWritePosition : TriceRingBuffer; \
+    TriceBufferWritePosition = (TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) <= triceRingBufferLimit ? TriceBufferWritePosition : (TriceRingBuffer+4); \
//! TriceRingBuffer is a kind of heap for trice messages.
-uint32_t TriceRingBuffer[TRICE_DEFERRED_BUFFER_SIZE>>2] = {0};
uint32_t TriceRingBuffer[4 + (TRICE_DEFERRED_BUFFER_SIZE>>2) + 128] = {0};

//! TriceBufferWritePosition is used by the TRICE_PUT macros.
-uint32_t* TriceBufferWritePosition = TriceRingBuffer; 
uint32_t* TriceBufferWritePosition = TriceRingBuffer + 4;

//! triceBufferWriteLimit is the first address behind TriceRingBuffer. 
-uint32_t* const triceRingBufferLimit = &TriceRingBuffer[TRICE_DEFERRED_BUFFER_SIZE>>2];
+uint32_t* const triceRingBufferLimit = &TriceRingBuffer[4 + TRICE_DEFERRED_BUFFER_SIZE>>2];

//! to ga correct value for the very first call of triceNextRingBufferRead
-uint32_t* TriceRingBufferReadPosition = TriceRingBuffer - (TRICE_DATA_OFFSET>>2); //lint !e428 Warning 428: negative subscript (-4) in operator 'ptr-int'
+uint32_t* TriceRingBufferReadPosition = TriceRingBuffer + 4 - (TRICE_DATA_OFFSET>>2); //lint !e428 Warning 428: negative subscript (-4) in operator 'ptr-int'
//ARM5 #pragma  pop

    if( (TriceRingBufferReadPosition + (TRICE_BUFFER_SIZE>>2)) > triceRingBufferLimit ){
-        TriceRingBufferReadPosition = TriceRingBuffer;
+        TriceRingBufferReadPosition = TriceRingBuffer + 4;
    }

@rokath
Copy link
Owner

rokath commented May 16, 2024

This looks good. Maybe you define the dummy value as a global variable, initialize it with 0 and set it inside the check to 1 in case of an overwritten memory. This would not allow the compiler to optimise it out. Also you could run the test without debug for a long time.

@KammutierSpule
Copy link
Author

I'm debugging with TRICE_FRAMING_NONE and I experience it writing (way) outside the buffer on this

    #elif TRICE_DEFERRED_OUT_FRAMING == TRICE_FRAMING_NONE
    memmove( enc, buf, len );
    encLen = len;
    #else

so the issue must come before this

@KammutierSpule
Copy link
Author

KammutierSpule commented May 17, 2024

Some more experiments,
If I change this, it improved (but not sure if still crashes):

-uint32_t* const triceRingBufferLimit = &TriceRingBuffer[TRICE_DEFERRED_BUFFER_SIZE>>2];
+uint32_t* const triceRingBufferLimit = &TriceRingBuffer[TRICE_DEFERRED_BUFFER_SIZE>>2 - 1]; // note: >>(2-1), this is actually the same as SIZE/2

My code won't crash and I get overflow message while using diagnostic:

triceSingleDepthMax = 104 of 144, SingleTricesRingCountMax = 5, triceRingBufferDepthMax =4294967268 of 256 (overflow!)

4294967268 == FFFFFFE4

@KammutierSpule
Copy link
Author

Two questions arise me:

  • On trice write, if it reaches the end of the buffer, it will wrap to TriceRingBuffer begin and continue. If it overflow (i.e the write is > than the read position) it will corrupt the current (and the next) read position. I believe the Write should also detect overflows and update the read position for the next "good read" (that is still safe for a write to fill the maximum allowed).
  • On a trice read, it gets the triceNextRingBufferRead from the lastWordCount. lastWordCount it comes from TriceIDAndBuffer. If the read data is corrupted, it will give a triceType invalid which results on a random wordCount (as int wordCount; it not initialized). Also the data corruption could lead to triceDataLen to get a len that will write out of the buffer.

I believe that once the buffer gets corrupted, it can't recover anymore?

@rokath
Copy link
Owner

rokath commented May 20, 2024

Hopefully I will find time to look a bit deeper into this within the next 2 days. Just to be curious: There is a note in https://github.com/rokath/trice/blob/master/docs/TriceOverRTT.md#10-possible-issues. When performing excessive duration tests (within the transfer capabilities), some of the STM32 evaluation boards flashed with OB J-Link started to not work properly anymore over RTT, even after re-flashing the firmware image. I could get them back to normal operation only with a power cycle. My assumption is, that the OB J-Link software went somehow in a instability, probably due to internal buffer overruns, what required a OB J-Link reset. This is probably not related to this issue, as I understood, that you use an UART. But maybe some other software problem exists.

@KammutierSpule
Copy link
Author

Thanks for having a look on this.
In order to exclude other possible issue sources, one way would be to run and debug Trice on PC. This way you can just analyze the buffers and you don't event need output.

@KammutierSpule
Copy link
Author

I found another issue to address:
TriceDeferredEncode, it uses the same buffer to write the encoded data.

  • Is it safe that it knows that should wrap around the ring buffer? or safe that won't write out of the buffer. eg: if the current read position is almost at end of the buffer?
  • Also related to previous points mention here: Using TRICE_RING_BUFFER still crashes if "Tricing a lot" #462 (comment), as there are no checks on the write, it will overwrite and corrupt a possible "on going" encoded data/transmission.

@rokath
Copy link
Owner

rokath commented May 23, 2024

My very first assumption is, that the code seems to work correctly, when and only when less data are produced then transmittable. In the case you investigate, more data are produced than transmitted. That finally produces data garbage inside the ring buffer and the results of function TriceIDAndBuffer are not predictable what explains your observations. The only fixing idea I have for now, is to throw away new Trices until there is enough transmitted to accept new Trice data.

@KammutierSpule
Copy link
Author

KammutierSpule commented May 24, 2024

The only fixing idea I have for now, is to throw away new Trices until there is enough transmitted to accept new Trice data.

const bool isUnderLimit = (TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) <= triceRingBufferLimit; \
const bool willOverflow = (isUnderLimit && ((TriceBufferWritePosition <= TriceRingBufferReadPosition) && \
((TriceBufferWritePosition + (TRICE_BUFFER_SIZE>>2)) >= TriceRingBufferReadPosition)))|| \
((!isUnderLimit) && (TriceRingBufferReadPosition <= (TriceRingBuffer + (TRICE_BUFFER_SIZE>>2)))); \
if(!willOverflow) { \
TriceBufferWritePosition =  isUnderLimit? TriceBufferWritePosition : TriceRingBuffer; \

Does this makes sense?

@rokath
Copy link
Owner

rokath commented May 24, 2024

I asume this will not work. The check needs to be done before TRICE_ENTER is reached an then the whole new trice needs to be discarded. What needs to be evaluated is the depth of the ring buffer. Please give me a few days for a fix in a new branch. This will also cover the double buffer option.

@KammutierSpule
Copy link
Author

The check needs to be done before TRICE_ENTER

//! TRICE_ENTER is the start of TRICE macro.
#define TRICE_ENTER \
    TRICE_ENTER_CRITICAL_SECTION { \

On any case, the check would need to be "thread safe" (this is why I thought put after the critical section, after the trice enter)

@rokath
Copy link
Owner

rokath commented May 25, 2024

There is now branch experimentalProtectOption containing a first idea how to add a protection option. I was unable to test this so far but will do asap. I think, you can already do your own tests with it. To activate the fix, please add #define TRICE_PROTECT to your triceConfig.h file. This will result in a bit more code and a bit lack of speed but the user is free to enable it.

@KammutierSpule
Copy link
Author

KammutierSpule commented May 27, 2024

Thanks for the update, I tested only the ring buffer,
I believe this should be the correct check:

const bool isEnoughSpace = (depth32 + (TRICE_SINGLE_MAX_SIZE >> 2)) < ((TRICE_DEFERRED_BUFFER_SIZE >> 2)-(TRICE_SINGLE_MAX_SIZE >> 2));

it must be < (instead of <=) and it must give room to the current ongoing transmission.

I also added this for diagnostic:

    if( !isEnoughSpace ){
        TriceOverflowCounter++;
    }

trice.c
//! Overflow counter
unsigned TriceOverflowCounter = 0;

For Doublebuffer (I didn't test) but the comparison should be < too

@rokath
Copy link
Owner

rokath commented May 27, 2024

Probably the branch implementation is correct. depth is the actual used space and when adding the max possible Trice size, it is allowed to reach the buffer size. But testing will give the right answer. BTW: You do not need to create plenty of data. Just slow down the Trice data transmission.

@KammutierSpule
Copy link
Author

But testing will give the right answer.

On my tests it was showing errors the Trice log (meaning that trice buffer is become corrupted).

@rokath
Copy link
Owner

rokath commented May 27, 2024

Concerning the TriceOverflowCounter l am not sure. The embedded code should only contain really needed stuff. The Trice tool will detect cycle counter errors in case of dropped Trices and does count these events automatically.

@rokath
Copy link
Owner

rokath commented Jun 1, 2024

Your proposed check will probably work and you can relax it a bit also. Please refer to the comment in #468 or directly to the comments in the triceEnoughSpace function on the experimental branch. For the double buffer the check is adapted too, but it looks different there.

@rokath
Copy link
Owner

rokath commented Jun 1, 2024

About the TriceOverflowCounter proposal:

If the device runs a long long time without the connted trice tool, most Trices are blown in the air and who cares then about potentially missed ones? If the trice tool is connected, it will report and count cycle counter errors. If you mistrust the overflow protection, you can add such counter to your test code. Or are there any other arguments ?

EDIT:

Yes, TriceOverflowCounter proposal is a GOOD idea! Tonight I could do some first tests and figured out, that there is no cycle counter error in case some Trices are discarded. That is, because, if a trice is not produced, it cannot cause a cycle error. The cycle counter is not touched in such a case. Incrementing it would be a possibility, but a separate TriceOverflowCounter is a better choice.

@rokath
Copy link
Owner

rokath commented Jun 3, 2024

In experimentaProtectOption branch in example project trice/examples/vsCode_NucleoF030R8_instrumented is now successful tested overflow code including an TriceOverflowCounter.

@rokath rokath added the quality improve code quality label Jun 3, 2024
@rokath rokath self-assigned this Jun 3, 2024
@KammutierSpule
Copy link
Author

Or are there any other arguments ?

TriceOverflowCounter would be used for diagnostic, so we know that we need to slowdown or do something else to fix it. Should be #conditional build

@rokath
Copy link
Owner

rokath commented Jul 16, 2024

Solved in main branch.

@rokath rokath closed this as completed Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
quality improve code quality
Projects
None yet
Development

No branches or pull requests

2 participants