Fix websocket crash from connection pointer dangling #530

Merged
8 commits merged on Oct 9, 2022

Conversation

kang-sw
Contributor

@kang-sw kang-sw commented Aug 28, 2022

On Windows, with error 10053, a websocket instance is sometimes deleted before its posted/dispatched asynchronous I/O events are invoked, which makes the completion handler refer to the dangling pointer 'this'.

We can detect whether the 'this' pointer has already been deleted by checking whether a captured weak_ptr has expired. Although this introduces a bit of overhead on each message post/dispatch, it makes concurrent access to a single websocket connection instance safe.
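
For illustration, a minimal self-contained sketch of that guard: anchor_ mirrors the member name in this PR's diff, while the surrounding class and the direct call to asio::post are assumptions rather than the exact Crow code.

```cpp
#include <asio.hpp>
#include <memory>

struct connection_sketch
{
    asio::io_context& io_;
    // Lifetime marker: destroyed together with the connection instance.
    std::shared_ptr<void> anchor_ = std::make_shared<int>(0);

    explicit connection_sketch(asio::io_context& io) : io_(io) {}

    template <typename Handler>
    void post(Handler handler)
    {
        std::weak_ptr<void> watch = anchor_;
        asio::post(io_, [this, watch, handler]() mutable {
            // If the connection was deleted before this handler ran, the
            // weak_ptr has expired and `this` must not be dereferenced.
            auto anchor = watch.lock();
            if (anchor == nullptr)
                return;
            handler();
        });
    }
};
```

Because anchor_ is destroyed together with the connection, a handler that runs after deletion sees an expired weak_ptr and returns before touching this.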

@dranikpg
Member

Hi! Thanks for the fix! Sorry for the long wait 😓

Let's ask someone who might have Windows on hand, @The-EDev, to see if he can reproduce the original bug.

Otherwise I'll check this out in a few days.

Member

@dranikpg dranikpg left a comment


Clever solution 👑

See my comment regarding await correctness.

Besides, if we approach #529 and refactor the websocket core parts, we might avoid this problem altogether. What do you think?

Comment on lines +123 to +132
// Do not modify anchor_ here since writing shared_ptr is not atomic.
auto watch = std::weak_ptr<void>{anchor_};

// Wait until all unhandled asynchronous operations have joined.
// As the deletion occurs inside 'check_destroy()', which already locks the
// anchor, the use count can be 1 in a valid deletion context.
while (watch.use_count() > 2) // 1 for the 'check_destroy()' routine, 1 for 'this->anchor_'
{
    std::this_thread::yield();
}
Member


If I understand correctly, dispatch/post capture just a weak_ptr. Once the weak_ptr.lock() != nullptr check has succeeded, we cannot drop the instance as long as the handler scope is active, i.e. the weak_ptr is in scope.

So we should probably wait for the weak count to be 1 (only our local ref exists). What's more, we would have to create an atomic barrier, because a weak_ptr.lock() != nullptr check could happen after our loop has finished running...

Contributor Author

@kang-sw kang-sw Sep 11, 2022


Actually, it does not depend on the weak count. Instead, in the actual async_write scope you can see a new variable being declared from the captured weak_ptr: auto anchor = watch.lock();
The anchor variable is non-null if this instance has not yet become dangling, and it keeps the use_count above 1 for the duration of its scope (one for this->anchor_, one for the local variable anchor).
The loop in the destructor waits until all other anchor instances have been destroyed by repeatedly checking the use_count.
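
For reference, a rough sketch of this counting, using illustrative names and free functions rather than the exact Crow code, and assuming (as the discussion below confirms) that check_destroy() keeps its own locked anchor alive across the deletion:

```cpp
#include <memory>
#include <thread>

struct connection_sketch
{
    std::shared_ptr<void> anchor_ = std::make_shared<int>(0);

    ~connection_sketch()
    {
        std::weak_ptr<void> watch = anchor_;
        // Baseline of 2: one count for `anchor_` itself, one for the locked
        // copy held by the check_destroy()-style caller doing the deletion.
        while (watch.use_count() > 2)
            std::this_thread::yield();
    }
};

// Stand-in for check_destroy(): it keeps its own locked anchor alive across
// the deletion, which is why the destructor's baseline above is 2, not 1.
void destroy(connection_sketch* self)
{
    auto guard = self->anchor_; // the extra count the destructor expects
    delete self;
}

// Stand-in for a guarded completion handler (e.g. async_write's):
void on_write_done(std::weak_ptr<void> watch, connection_sketch* self)
{
    auto anchor = watch.lock(); // non-null only while *self is still alive
    if (anchor == nullptr)
        return;
    // Safe to touch *self here; `anchor` keeps use_count() above 2 until this
    // scope ends, which is exactly what the destructor above waits for.
    (void)self;
}
```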


And I think resolving #529 is not directly related to this issue. The primary purpose of this routine is to detect dangling during the deferred execution of events that were posted (or dispatched).

Dangling occurs when I capture connection references to use them outside of the callback functions (onmessage, etc.). By using userdata (in a slightly tricky way :|), I can detect whether the connection instance itself has dangled when invoking send_data on the captured reference. However, if the websocket has already been closed and deleted during a previous event invocation, the 'this' pointer of an event queued to the executor is dangling, since the executor of adapter_ lives longer than the websocket connection itself.

This PR provides a way to detect whether a posted event has already expired and its 'this' pointer is no longer usable. Of course, this is a pretty improvised way to resolve the issue; I think an overall review of the websocket lifetime management logic is required. IMHO, deleting this seems a bit risky in the current situation.

Member


"IMHO, deleting this seems a bit risky in the current situation"

It is for sure 😃. I'm in no way opposed to this fix; it's just that

"I think an overall review of the websocket lifetime management logic is required"

This is definitely true.

I see now; I totally forgot that check_destroy destroys itself with the upgraded anchor still being there. One last question left: why did you decide to additionally guard only this path? check_destroy is called numerous times.

Contributor Author

@kang-sw kang-sw Sep 12, 2022


Well, the primary reason was simply that all those crashes were fixed by applying the guard in those positions. 🤣 Usually the crash occurred when there was continuous outbound traffic on a websocket connection and a sudden shutdown happened.

[crash screenshot]

The crash always occurs when dereferencing the dangling pointer this in send_binary()'s deferred event.

This is just my guess: it seems that for most of a websocket connection's lifetime, an async_read request made inside do_read() is active. Thus a call to close(), or the remote endpoint's shutdown signal, reaches the async_read completion token first, with a valid asio::error_code, which makes the check_destroy() routine in async_read()'s completion handler valid in most cases.

However, a call to send_data() is usually (in my case) injected from outside the generic flow (e.g. do_read() or the onmessage() callback), and its async_write request is usually made after an async_read completion token has already been registered. So in any shutdown situation it receives its asio::error_code AFTER async_read's completion token receives its error code.

I think that is why the other calls to check_destroy() didn't cause any issues. As this is literally a hot-fix to prevent consistent crashing on Windows, further investigation is required to make sure that the async_read completion token really always receives the error_code first.

                                                                     SHUTDOWN!!!
                                                                        │
                                                                        │
                                                                        ▼
                  ┌──────────►                                    ┌►asynchronous write ───────────────────┐error_code
                  │                                               │     │                                 │
                  │                                               │     ▼                                 │
  I/O Threads    ─┴──────────►       ┌───►asynchronous read───────┼──────────┐error_code                  │
                                     │                            │          │                            │
                                     │                            │          ▼                            ▼
  Executor Thread ────► do_read()────┘            ┌───►do_write()─┘        do_read()──►check_destroy()   do_write::completion_handler ───► check_destroy() CRASH!!!
                                                  │
                                                  │
  User Thread    ────────────►    send_data()─────┘

Member


Seems reasonable. Well, if you were the first one to report this issue, then let's assume this takes enough precautions, as long as it solves it.

@The-EDev
Member

@dranikpg I haven't used Windows in ages. Though this might be related to a problem I ran into in an early implementation of #269: I kept attempting to access websocket connections after they had been deleted and couldn't figure out how or why.

Comment on lines 155 to 162
[this,
 handler = std::forward<CompletionHandler>(handler),
 watch = std::weak_ptr<void>{anchor_}] {
    auto anchor = watch.lock();
    if (anchor == nullptr) { return; }

    handler();
});
Member


Almost forgot when testing: C++11 😓.

I guess you can make a custom functor with this behaviour to wrap the handler. But please, don't add over-engineered code...
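
For illustration, a minimal C++11-compatible sketch of such a wrapper: a small functor that owns the weak_ptr and the wrapped handler, so no C++14 init-captures are needed. The names here are hypothetical, not from the final diff.

```cpp
#include <memory>
#include <utility>

// Calls the wrapped handler only if the watched connection is still alive.
template <typename Handler>
struct weak_guarded_handler
{
    std::weak_ptr<void> watch;
    Handler handler;

    weak_guarded_handler(std::weak_ptr<void> w, Handler h)
        : watch(std::move(w)), handler(std::move(h)) {}

    void operator()()
    {
        // Skip the call entirely if the owning connection is already gone.
        auto anchor = watch.lock();
        if (anchor == nullptr)
            return;
        handler();
    }
};

template <typename Handler>
weak_guarded_handler<Handler> make_weak_guarded(std::weak_ptr<void> w, Handler h)
{
    return weak_guarded_handler<Handler>(std::move(w), std::move(h));
}
```

Posting make_weak_guarded(anchor_, std::move(handler)) to the executor would then give roughly the same behaviour as the C++14 lambda above.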

Comment on lines 659 to 660
[&, watch = std::weak_ptr<void>{anchor_}](const asio::error_code& ec, std::size_t /*bytes_transferred*/) {
    if (!ec && !close_connection_)
Member

@dranikpg dranikpg Sep 14, 2022


Here as well (and at the top).
Just make sure it compiles without any warnings with the -std=c++11 flag.

Contributor Author

@kang-sw kang-sw Sep 17, 2022


Okay. However, as this pull request conflicts with #535 due to the modifications to websocket::Connection<>::post and dispatch, I'll make these changes after a decision is made on #535.

@kang-sw
Contributor Author

kang-sw commented Sep 24, 2022

  1. Resolved the merge conflict with the master branch
  2. Resolved a compilation warning caused by C++14 features

@dangerden

@dranikpg @kang-sw hey guys, is anything missing for this PR to move forward?

@dranikpg dranikpg merged commit ac757ff into CrowCpp:master Oct 9, 2022
@dangerden

oh, thank you for the speedy response! @dranikpg

@dangerden

I would suggest documenting a thread-safe, or at least a "recommended", way of using websocket::Connection outside of the asio handler flow. In a way it is related to #258, but given the nature of asynchronous websocket communication it might be even more urgent. One way or another, a server that needs to send a message to a connection has to capture a reference to the connection received much earlier, in accept or in previous messages. @dranikpg @kang-sw

Besides, I'm looking forward to #529 and the refactor of the websocket core parts. Crow is awesome :-) Thank you!
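
For what it's worth, a hedged sketch of one such pattern: keep live connections in a mutex-guarded set, insert in onopen, erase in onclose, and only dereference them while holding the lock. The helper functions below are hypothetical, and wiring them into the route callbacks is omitted because callback signatures differ slightly between Crow versions.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

#include "crow.h"

std::mutex conns_mtx;
std::unordered_set<crow::websocket::connection*> conns;

void register_conn(crow::websocket::connection& c)   // call from onopen
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    conns.insert(&c);
}

void unregister_conn(crow::websocket::connection& c) // call from onclose
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    conns.erase(&c);
}

// Called from any user thread: only pointers still present in the set are
// used, so a connection already removed in onclose is never dereferenced.
void broadcast_text(const std::string& msg)
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    for (auto* c : conns)
        c->send_text(msg);
}
```

Even with such a registry, the weak_ptr guard from this PR still matters: it protects events that were already queued to the executor when the connection went away.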

@dranikpg
Member

You're right, it's not documented so far. Websockets are increasingly popular in Crow, but unfortunately still pretty rough.

I wouldn't say it's related to the async handlers issue, because lifetime management needs to be handled differently.

Thanks for using Crow in spite of its small flaws :)
