Fix #1073, software bus locking #1092

jphickey · 2021-01-13T01:57:22Z

Describe the contribution

The root cause of the mysterious message delivery issues was tracked down to inconsistent locking. When running on multi core systems a variety of buffer corruption issues were observed. This fix is really a rework of all CFE SB API implementations toward the goal of more consistent locking patterns to solve this issue.

Because SB APIs make heavy use of event IDs, this presents a challenge because events can only be sent while unlocked. In order to ensure that all events are generated and all counters are incremented consistently, the functions should run through to the end and not return early, and use a "PendingEventID" register to record what event ID should be sent per the current operation, rather than simply sending the event at the time the condition is identified.

The general pattern becomes:

Initial checks without lock
Lock SB
Perform query/modification of SB pipe/routing tables as needed while locked
Unlock SB
Call into other subsystems as needed (e.g. OSAL)
(Optional) Re-lock SB global to update based on the result of (5), and unlock again
Send pending event ID.

This also employs the CFE_ES_ResourceID_t type and related patterns for managing the SB Pipe IDs. In particular this (intentionally) makes it not possible to use this directly an array index, and will break code which directly accessed these items without going through the lookup function.

Fixes #985
Fixes #1073
Fixes #1096

Testing performed
Build and sanity check CFE
Run all unit tests
Also Let CFE run for approx 72 hours, and observed no more "invalid msgid" or memory pool errors.

Expected behavior changes

The CFE_SB_PipeId_t type is no longer usable as a direct array index. It is also increased in size from 8 bits to 32 bits. But as a result is now consistent with all other ID types in both behavior and size. As long as the typedef is used and PipeIds are used only as intended,, this should not be noticeable to apps.
The "pipe stats" structure in the Pipe TLM is also changed. This structure contained a CFE_SB_PipeId_t value, hence why it had to be updated because the type is now bigger. The spare bytes are also moved to the end of the struct.

Neither of these are expected to be particularly problematic - just stuff to be aware of.

System(s) tested on
Ubuntu 20.04

Additional context
Not sure if the change to CFE_SB_PipeId_t size (and the TLM message that contains this type) might require an update to cFS-GroundSystem.

Contributor Info - All information REQUIRED for consideration of pull request
Joseph Hickey, Vantage Systems, Inc.

EDIT - Fix #948 - eliminates identified double locks.

jphickey · 2021-01-13T13:58:34Z

Note this also fixes #1078 -as it removes the problematic routine entirely.

astrogeco · 2021-01-13T17:59:52Z

CCB:2021-01-13

Fixes concurrency issues in SB.

Will probably follow up with moving this to a module since there's multiple ways to implement the Resource ID functionality.

Accessing tables using PipeID will no longer be available to external access.

ResourceIDs are no longer recycled and removes risk of PipeID aliasing

Update cFS-GroundSystem to reflect changes in structure sizes.

@jphickey Create "retroactive" issue to document bugs with event and error messages inconsistencies due to lock handling.

Remove lingering unnecessary debug artifacts

Link this to the "remove pipename from pipe structure" #287

CDKnightNASA · 2021-01-13T18:18:12Z

This is excellent work, @jphickey ! I agree with all of your changes. Given the scope of these changes, how should we test and roll this out? SBN touches a fair bit of SB, and I've been trying to bring SBN up to "caelum". I may try this branch against my code but do we plan to merge into main soon?

(As an aside, I started at NASA as a contractor for a company called Caelum.)

jphickey · 2021-01-13T20:11:07Z

BTW- regarding discussion of old versions that are affected by this bug.... I wanted to add one important detail:

Although some general race conditions certainly have existed across SB for quite some time, this particular issue (or at the least the specific race condition that was actually causing failures described in #1073) was introduced as part of the SB API refactor that was done a month or two ago - that is in the transition to CFE_SB_TransmitMsg() instead of CFE_SB_SendMsg(). Specifically the call to CFE_SB_GetBufferFromPool() to get the buffer was not serialized as it needed to be. In the old CFE_SB_SendMsgFull() implementation in bootes this actually was inside a locked section, but after migrating to CFE_SB_TransmitMsg() it was not invoked from a locked section anymore. I suspect (though I cannot confirm) that getting the buffer from the pool was the actual problem area causing the bad msg ID errors - because if two tasks call this simultaneously, there is a possibility that they both get the same buffer.

Of course there were a ton of other race conditions as well, such as accessing the pipe/routing table - but these didn't seem to manifest themselves in a noticeable way.

The other race conditions fixed by this PR address problem areas with Creating/Deleting SB pipes and route subscribe/unsubscribe ops. While Bootes probably has problems there if you push it hard enough, most FSW doesn't do repetitive subscribe/unsubscribe operations enough to expose these problems.

So in other words - I wouldn't necessarily worry about Bootes for normal steady-state message passing. The specific issue that likely manifested itself as #1073 was introduced more recently.

jphickey · 2021-01-13T20:19:23Z

@CDKnightNASA -

how should we test and roll this out? SBN touches a fair bit of

Yes SBN is an important use case that I was not really able to test as part of this - I don't really have any test setups that use SBN. I'd be hoping that you (or someone!) can pull this into an SBN setup.

Note that aside from the scary-looking refactoring, it actually doesn't change that much in the way of actual logic. It gets rid of some redundant struct members and consolidates access to the structs into a locked section, but tries hard not to actually change the behavior of the way these things actually operated.

Move certain definitions related to the CFE_ES_ResourceID_t type into the global CFE include files. This introduces two new headers: cfe_resourceid.h (public) cfe_resourceid_internal.h (private to CFE core apps) This allows other CFE core apps, such as SB, to use the CFE_ES_ResourceID_t using the same manipulators.

Significant refactor of many SB API calls to address inconsistencies with respect to locking and unlocking of global data structures. First this updates the definition of CFE_SB_PipeId_t to use the CFE_ES_ResourceID_t base type, and a new ID range. Notably this prevents direct access to the CFE_SB.PipeTbl global, forcing code to go through the proper lookup routine, which should only be done while locked. All API implementations follow the same general pattern: - Initial checks/queries while unlocked - Lock SB global - Lookups and/or modifications to the pipe table/routing info - Unlock SB global - Invoke other subsystems (e.g. OSAL) - Re-lock SB global (if needed) do final update, and unlock again - Send all events All error counters should be updated at the end, while still locked. All event processing is deferred to the end of each function, after all other processing is done.

jphickey · 2021-01-15T21:34:08Z

Rebased to "main" branch (2021-01-12 baseline) and resolved conflicts

astrogeco · 2021-01-25T03:47:09Z

CCB:2021-01-13 APPROVED

jphickey force-pushed the fix-1073-sb-locking branch 2 times, most recently from 982db6e to c3ecf28 Compare January 13, 2021 02:05

jphickey added the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Jan 13, 2021

jphickey linked an issue Jan 13, 2021 that may be closed by this pull request

CFE_SB_ReceiveBuffer() -- switch timeout to int32 #1078

Closed

astrogeco added the CCB:2021-01-13 label Jan 13, 2021

jphickey mentioned this pull request Jan 13, 2021

Inconsistent Pipe ID reporting in SB events #1093

Open

astrogeco removed the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Jan 13, 2021

jphickey mentioned this pull request Jan 13, 2021

Possible race conditions in SB Pipe operations #1096

Closed

skliper linked an issue Jan 14, 2021 that may be closed by this pull request

Possible race conditions in SB Pipe operations #1096

Closed

jphickey added 2 commits January 15, 2021 16:26

jphickey force-pushed the fix-1073-sb-locking branch from c3ecf28 to b17cd1e Compare January 15, 2021 21:32

jphickey mentioned this pull request Jan 15, 2021

Fix #982, separate pipeinfo file data structure #1102

Merged

This was referenced Jan 19, 2021

Eliminating recursive locks in SB could allow for using more efficient resource #948

Closed

Confirm no recursive locking and transition to fast mutex use #1105

Open

skliper linked an issue Jan 19, 2021 that may be closed by this pull request

Eliminating recursive locks in SB could allow for using more efficient resource #948

Closed

jphickey mentioned this pull request Jan 19, 2021

Split "ResourceId" type into a separate module #1108

Closed

astrogeco added the IC:2021-01-19 label Jan 25, 2021

astrogeco changed the base branch from main to integration-candidate January 25, 2021 03:46

astrogeco merged commit 95ec2da into nasa:integration-candidate Jan 25, 2021

astrogeco added a commit to nasa/cFS that referenced this pull request Jan 25, 2021

Add nasa/cFE#1092

b69f2ab

astrogeco mentioned this pull request Jan 25, 2021

cFE Integration candidate: 2021-01-19 #1109

Merged

jphickey mentioned this pull request Jan 29, 2021

"PipeName" is unset for debug event #1141

Closed

jphickey deleted the fix-1073-sb-locking branch February 1, 2021 19:46

skliper added this to the 7.0.0 milestone Sep 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #1073, software bus locking #1092

Fix #1073, software bus locking #1092

jphickey commented Jan 13, 2021 •

edited by skliper

Loading

jphickey commented Jan 13, 2021

astrogeco commented Jan 13, 2021

CDKnightNASA commented Jan 13, 2021

jphickey commented Jan 13, 2021

jphickey commented Jan 13, 2021

jphickey commented Jan 15, 2021

astrogeco commented Jan 25, 2021 •

edited

Loading

Fix #1073, software bus locking #1092

Fix #1073, software bus locking #1092

Conversation

jphickey commented Jan 13, 2021 • edited by skliper Loading

jphickey commented Jan 13, 2021

astrogeco commented Jan 13, 2021

CDKnightNASA commented Jan 13, 2021

jphickey commented Jan 13, 2021

jphickey commented Jan 13, 2021

jphickey commented Jan 15, 2021

astrogeco commented Jan 25, 2021 • edited Loading

jphickey commented Jan 13, 2021 •

edited by skliper

Loading

astrogeco commented Jan 25, 2021 •

edited

Loading