Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1073, software bus locking #1092

Merged
merged 2 commits into from
Jan 25, 2021

Conversation

jphickey
Copy link
Contributor

@jphickey jphickey commented Jan 13, 2021

Describe the contribution

The root cause of the mysterious message delivery issues was tracked down to inconsistent locking. When running on multi core systems a variety of buffer corruption issues were observed. This fix is really a rework of all CFE SB API implementations toward the goal of more consistent locking patterns to solve this issue.

Because SB APIs make heavy use of event IDs, this presents a challenge because events can only be sent while unlocked. In order to ensure that all events are generated and all counters are incremented consistently, the functions should run through to the end and not return early, and use a "PendingEventID" register to record what event ID should be sent per the current operation, rather than simply sending the event at the time the condition is identified.

The general pattern becomes:

  1. Initial checks without lock
  2. Lock SB
  3. Perform query/modification of SB pipe/routing tables as needed while locked
  4. Unlock SB
  5. Call into other subsystems as needed (e.g. OSAL)
  6. (Optional) Re-lock SB global to update based on the result of (5), and unlock again
  7. Send pending event ID.

This also employs the CFE_ES_ResourceID_t type and related patterns for managing the SB Pipe IDs. In particular this (intentionally) makes it not possible to use this directly an array index, and will break code which directly accessed these items without going through the lookup function.

Fixes #985
Fixes #1073
Fixes #1096

Testing performed
Build and sanity check CFE
Run all unit tests
Also Let CFE run for approx 72 hours, and observed no more "invalid msgid" or memory pool errors.

Expected behavior changes

  • The CFE_SB_PipeId_t type is no longer usable as a direct array index. It is also increased in size from 8 bits to 32 bits. But as a result is now consistent with all other ID types in both behavior and size. As long as the typedef is used and PipeIds are used only as intended,, this should not be noticeable to apps.
  • The "pipe stats" structure in the Pipe TLM is also changed. This structure contained a CFE_SB_PipeId_t value, hence why it had to be updated because the type is now bigger. The spare bytes are also moved to the end of the struct.

Neither of these are expected to be particularly problematic - just stuff to be aware of.

System(s) tested on
Ubuntu 20.04

Additional context
Not sure if the change to CFE_SB_PipeId_t size (and the TLM message that contains this type) might require an update to cFS-GroundSystem.

Contributor Info - All information REQUIRED for consideration of pull request
Joseph Hickey, Vantage Systems, Inc.

EDIT - Fix #948 - eliminates identified double locks.

@jphickey jphickey force-pushed the fix-1073-sb-locking branch 2 times, most recently from 982db6e to c3ecf28 Compare January 13, 2021 02:05
@jphickey jphickey added the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Jan 13, 2021
@jphickey
Copy link
Contributor Author

Note this also fixes #1078 -as it removes the problematic routine entirely.

@jphickey jphickey linked an issue Jan 13, 2021 that may be closed by this pull request
@astrogeco
Copy link
Contributor

CCB:2021-01-13

Fixes concurrency issues in SB.

Will probably follow up with moving this to a module since there's multiple ways to implement the Resource ID functionality.

Accessing tables using PipeID will no longer be available to external access.

ResourceIDs are no longer recycled and removes risk of PipeID aliasing

Update cFS-GroundSystem to reflect changes in structure sizes.

@jphickey Create "retroactive" issue to document bugs with event and error messages inconsistencies due to lock handling.

Remove lingering unnecessary debug artifacts

Link this to the "remove pipename from pipe structure" #287

@CDKnightNASA
Copy link
Contributor

This is excellent work, @jphickey ! I agree with all of your changes. Given the scope of these changes, how should we test and roll this out? SBN touches a fair bit of SB, and I've been trying to bring SBN up to "caelum". I may try this branch against my code but do we plan to merge into main soon?

(As an aside, I started at NASA as a contractor for a company called Caelum.)

@astrogeco astrogeco removed the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Jan 13, 2021
@jphickey
Copy link
Contributor Author

BTW- regarding discussion of old versions that are affected by this bug.... I wanted to add one important detail:

Although some general race conditions certainly have existed across SB for quite some time, this particular issue (or at the least the specific race condition that was actually causing failures described in #1073) was introduced as part of the SB API refactor that was done a month or two ago - that is in the transition to CFE_SB_TransmitMsg() instead of CFE_SB_SendMsg(). Specifically the call to CFE_SB_GetBufferFromPool() to get the buffer was not serialized as it needed to be. In the old CFE_SB_SendMsgFull() implementation in bootes this actually was inside a locked section, but after migrating to CFE_SB_TransmitMsg() it was not invoked from a locked section anymore. I suspect (though I cannot confirm) that getting the buffer from the pool was the actual problem area causing the bad msg ID errors - because if two tasks call this simultaneously, there is a possibility that they both get the same buffer.

Of course there were a ton of other race conditions as well, such as accessing the pipe/routing table - but these didn't seem to manifest themselves in a noticeable way.

The other race conditions fixed by this PR address problem areas with Creating/Deleting SB pipes and route subscribe/unsubscribe ops. While Bootes probably has problems there if you push it hard enough, most FSW doesn't do repetitive subscribe/unsubscribe operations enough to expose these problems.

So in other words - I wouldn't necessarily worry about Bootes for normal steady-state message passing. The specific issue that likely manifested itself as #1073 was introduced more recently.

@jphickey
Copy link
Contributor Author

@CDKnightNASA -

how should we test and roll this out? SBN touches a fair bit of

Yes SBN is an important use case that I was not really able to test as part of this - I don't really have any test setups that use SBN. I'd be hoping that you (or someone!) can pull this into an SBN setup.

Note that aside from the scary-looking refactoring, it actually doesn't change that much in the way of actual logic. It gets rid of some redundant struct members and consolidates access to the structs into a locked section, but tries hard not to actually change the behavior of the way these things actually operated.

Move certain definitions related to the CFE_ES_ResourceID_t type
into the global CFE include files.  This introduces two new headers:

cfe_resourceid.h (public)
cfe_resourceid_internal.h (private to CFE core apps)

This allows other CFE core apps, such as SB, to use the
CFE_ES_ResourceID_t using the same manipulators.
Significant refactor of many SB API calls to address inconsistencies
with respect to locking and unlocking of global data structures.

First this updates the definition of CFE_SB_PipeId_t to use the
CFE_ES_ResourceID_t base type, and a new ID range.  Notably this
prevents direct access to the CFE_SB.PipeTbl global, forcing
code to go through the proper lookup routine, which should only
be done while locked.

All API implementations follow the same general pattern:

- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed) do final update, and unlock again
- Send all events

All error counters should be updated at the end, while still locked.
All event processing is deferred to the end of each function, after
all other processing is done.
@jphickey
Copy link
Contributor Author

Rebased to "main" branch (2021-01-12 baseline) and resolved conflicts

@astrogeco
Copy link
Contributor

astrogeco commented Jan 25, 2021

CCB:2021-01-13 APPROVED

@astrogeco astrogeco merged commit 95ec2da into nasa:integration-candidate Jan 25, 2021
astrogeco added a commit to nasa/cFS that referenced this pull request Jan 25, 2021
@jphickey jphickey deleted the fix-1073-sb-locking branch February 1, 2021 19:46
@skliper skliper added this to the 7.0.0 milestone Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment