Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipc_transport_structured: Investigate: Msg_out - move ctor impl mystery (past instability). #20

Open
ygoldfeld opened this issue Mar 12, 2024 · 0 comments
Labels
from-akamai-pre-open Issue origin is Akamai, before opening source question Further information is requested

Comments

@ygoldfeld
Copy link
Contributor

Filed by @ygoldfeld pre-open-source:

Description follows. Environment (the only one, with many others tried!) where problem was actually observed:

Linux 5.4.... machine
CMake-built open-source ipc meta-project
clang-17 installed via apt-get = /usr/bin/clang => clang due to PATH
libc++ (LLVM-10) installed (as opposed to using GNU stdc++, which is also installed but not used here) via apt-get
compiler selection:
-DCMAKE_CXX_COMPILER=clang++
compiler flags:
-DCMAKE_CXX_FLAGS_RELEASE='-stdlib=libc++ -I/usr/lib/llvm-10/include/c++/v1 -O3 -DNDEBUG -fno-omit-frame-pointer'
linker flags:
-DCMAKE_EXE_LINKER_FLAGS_RELEASE="-L/usr/lib/llvm-10/lib -lc++ -lc++abi -fno-omit-frame-pointer -O3"
-DCMAKE_BUILD_TYPE=Release
Plus CMake settings:
CFG_ENABLE_TEST_SUITE:BOOL=ON
CFG_NO_LTO:BOOL=OFF <-- Important
Lastly (important):
Problem reproduced without ASAN (-fsanitize=address in CMAKE_*_FLAGS_RELEASE)
Problem not reproduced with it (+ ASAN detects no errors)

Sufficient to build just libipc_unit_test.exec: make -j32 libipcunit_test.exec
then go (from build dir) to test/suite/unit_test
and run the test in question: ./libipc_unit_test.exec --gtest_filter=Shm_session_test.In_process_array

Expected behavior: it passes; no seg-fault. Observed behavior as-is: it passes; no seg-fault.

The thing to investigate:

  • Edit ipc_transport_structured/src/ipc/transport/struc/msg.hpp.
  • Find Msg_out move ctor impl.
  • Replace its ~4 line + long comment impl (where it invokes the move-assignment operator) with 1-line impl:
    = default;
  • Build libipc_unit_test.exec as above again.
  • Run it as above again.

Expected behavior: it passes; no seg-fault. Observed behavior (with the above code change): it seg-faults in the move ctor. (In backtrace seen by gdb libipc_unit_test.exec core it is shown as being within Client_session_impl::mdt_builder() due to inlining.)

(The mdt_builder() code tickling the problem is this:

Master_channel_req_ptr req_ptr(new Master_channel_req{ mdt_root, std::move(req_msg) });

The moving of Msg_out req_msg into the Master_channel_req member => seg-fault.)

--

Now as to WTF this is all about, the comment in the code should explain it.

TEMPLATE_STRUCT_MSG_OUT
CLASS_STRUCT_MSG_OUT::Msg_out(Msg_out&& src) :
  Msg_out()
{
  operator=(std::move(src));

  /* (This comment is by ygoldfel, original author of this code -- here and all of this class template and file.)
   * This implementation is clearly correct (functionally); one can see a similar pattern used in many places in,
   * e.g., STL/Boost.  However the original, and all else being equal (which it's not; read on), my preferred
   * implementation of this move ctor is simply: `= default;`.  In that original state I tested this guy and
   * ran into the following.
   *
   * As of this writing, to test this (and all of Msg_out and much else besides), I've been using the
   * transport_test.exec exercise-mode section; and the unit_test.exec test suite (both in the `ipc` meta-project,
   * though most of the relevant .../test/... code for unit_test.exec is in `ipc_shm_arena_lend` project therein).
   * (Normally we don't speak of test code inside source code proper like this, but read on to see why the
   * exception.)  transport_test has passed in all configurations (build/run envs) without issue.
   * unit_test.exec has passed in at least the following (all Linux) except the *asterisked* one:
   *   - gcc-9, -O0 or: (-O3 + LTO disabled or LTO (with fat-object-generation on) enabled (for all of libflow
   *     and libipc_* and the test code proper)).
   *   - clang-17 + libc++ (LLVM-10) (note: not GNU stdc++), -O0 or: (-O3 + LTO disabled or LTO (-flto=thin) enabled*
   *     (for all of libflow and libipc_* and the test code proper)).
   *
   * The *asterisk* there denotes the one config where a problem was observed.  Namely,
   * Shm_session_test.In_process_array unit_test failed, seg-faulting before the test could complete (no actual
   * test-failed assertions had triggered up to the seg-fault point).  The seg-fault was completely consistent
   * (at least given a particular machine): in session::Client_session_impl::mdt_builder(), just near the end
   * of the method, a Msg_out was being move-cted (presumably using this ctor here) from another Msg_out
   * that had just been constructed a few lines above.  Granted, that source Msg_out was constructed and
   * populated (rather straightforwardly) in another thread, but a synchronization mechanism was used to ensure
   * this happened synchronously before the code that seg-faulted (this move-ctor) would proceed to be called.
   * (I verified quite studiously that nothing untoward was happening there; clearly the source object was
   * created cleanly before signaling the thread waiting to proceed with this move-ct to quit waiting and proceed.)
   * (In addition note that this Msg_out ctor was, as seen in debugger, clearly getting auto-inlined; since the prob
   * occurred with LTO but not without -- though gcc's LTO had no issue, but I digress -- this may be significant.)
   *
   * First I labored to isolate where the problem occurred; there was no corrupted memory or anything strange like
   * that; and it fairly clearly really was in the move ctor (which, again, was just `= default;` at the time).
   * Since the move ctor was auto-generated, it was somewhat challenging to see where the problem happened, though
   * admittedly due to time pressure I did not delve into the move ctors *that* would've invoked: for
   * m_builder, m_body_root, m_hndl_or_null (doing so might be a good idea; but keep reading).
   *
   * At that point, just to see what's up, I "jiggled" the implementation into its current form.  Empirically
   * speaking the problem went away, and everything passed swimmingly with no instability.  Hence I left the
   * work-around in place.  So where does it leave us?
   *
   * 1, in terms of correctness and generated-code quality: As noted, the new impl is clearly correct.  In terms of
   * generated-code quality, it is potentially a few instructions slower than the auto-generated ctor:
   * the three m_* are first initialized to their null states and then move-assigned; plus
   * store_native_handle_or_null() checks m_hndl_or_null for null (which it is, so that `if` no-ops).
   * It's simple/quick stuff, and it might even get optimized away with -O3, but nevertheless skipping to
   * move-ction of the members would be more certain to not do that unneeded zeroing stuff.  Subjectively: it's
   * probably fine (undoubtedly there are far heavier perf concerns than a few extra zeroings).
   *
   * 2, there's the mystery.  I.e., sure, the replacement code is fine and is unlikely to be a perf problem;
   * but why isn't the preferred `= default;` the one in actual use?  Why did it cause the failure, though only
   * in a very particular build/run environment (clang-17, LLVM-10 libc++, with thin-LTO), when similar ones
   * (such as same but without LTO) were fine?  At this stage I honestly do not know and will surely file a
   * ticket to investigate/solve.  Briefly the culprit candidates are:
   *   - Msg_out code itself.  Is there some unseen uninitialized datum?  Some basic assumption being ignored or
   *     C++ rule being broken? ###
   *   - Something in capnp-generated move-ctor (as of this writing capnp version = 1.0.1, from late 2023) or
   *     m_builder move-ctor.
   *   - Something in clang-17 + thin-LTO optimizer improperly reordering instructions.
   *
   * I cannot speculate much about which it is; can only say after a few hours of contemplating possibilities:
   * no candidates for ### (see above) being at fault has jumped out at me.  That said, no run-time sanitizer
   * has run through this code as of this writing (though the code analyzer, Coverity, has not complained);
   * that could help.  Sanitizer or not, generally:
   * given 0-2 days of investigation by an experienced person surely we can narrow this down to a
   * minimal repro case, etc. etc.  So it is solvable: just needs to be done.
   *
   * Until then: (1) this remains a mystery; and (2) there is an acceptable work-around.  (Though the mystery
   * is personally disconcerting to me; e.g., as of this writing, I cannot name another such mystery in this
   * entire code base.) */
} // CLASS_STRUCT_MSG_OUT::Msg_out(Msg_out&&)

Please note: ASAN passes, as of this writing, throughout this unit test and all known tests (unit_test ones or transport_test ones or otherwise). It "feels" like there could be a gremlin from another thread, but further experiments and work seem to suggest otherwise. So one way or another it's a matter of investigating as noted in the above comment.

(Aside: It's true that as of the time of that writing no sanitizer had run from the code. As pointed out here, though, since then it absolutely has -- and passed. So the plot thickens a little.)

That said, there are no known functional or correctness problems in the actual committed code -- merely the mystery of why the change had to be made. Still I don't like such mysteries.

@ygoldfeld ygoldfeld added question Further information is requested from-akamai-pre-open Issue origin is Akamai, before opening source labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from-akamai-pre-open Issue origin is Akamai, before opening source question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant