Fix websocket crash from connection pointer dangling #530
Conversation
Hi! Thanks for the fix! Sorry for the long wait 😓 Let's ask someone who might have Windows on hand @The-EDev to see if he can reproduce the original bug. Else I'll check this out in a few days.
Clever solution 👑
See my comment regarding await correctness.
Besides, if we approach #529 and refactor the websocket core parts, we might avoid this problem altogether. What do you think?
```cpp
// Do not modify anchor_ here since writing shared_ptr is not atomic.
auto watch = std::weak_ptr<void>{anchor_};

// Wait until all unhandled asynchronous operations join.
// As the deletion occurs inside 'check_destroy()', which already locks
// the anchor, use_count can be 1 in a valid deletion context.
while (watch.use_count() > 2) // 1 for the 'check_destroy()' routine, 1 for 'this->anchor_'
{
    std::this_thread::yield();
}
```
If I understand correctly, dispatch/post capture just a `weak_ptr`. Once the `weak_ptr.lock() != nullptr` check succeeds, we cannot drop the instance as long as the handler scope is active, i.e. while the `weak_ptr` is in scope.
So we should probably wait for the weak count to be 1 (only our local ref exists). What's more, we would have to create an atomic barrier, because a `weak_ptr.lock() != nullptr` check could happen after our loop finished running...
Actually, it does not depend on the weak count. Instead, in the actual `async_write` scope, you may see a new variable declared from the captured `weak_ptr`: `auto anchor = watch.lock();`.
The `anchor` variable is non-null if this instance has not yet become dangling, and it keeps the `use_count` value above 1 during its scope (one for `this->anchor_`, one for the local variable `anchor`).
The loop in the destructor waits until all other `anchor` instances have been destroyed by constantly checking its `use_count`.
And I think resolving #529 is not directly related to this issue. The primary purpose of this routine is detecting dangling during deferred execution of events that were posted (or dispatched).
Dangling occurs when I capture connection references to use outside of the callback functions (`onmessage`, etc.). By using userdata (in a bit of a tricky way :|), I can detect whether the connection instance itself has dangled when invoking `send_data` on a captured reference. However, if the websocket has already been closed and deleted during a previous event invocation, the `this` pointer of an event queued to the executor is dangling, since the executor of `adapter_` lives longer than the websocket connection itself.
This PR provides a way to detect whether a posted event has already expired and the `this` pointer is no longer usable. Of course, this is a pretty improvised way to resolve the issue. I think an overall review of the websocket lifetime management logic is required. IMHO, deleting `this` seems a bit risky in the current situation.
> IMHO, deleting `this` seems a bit risky in the current situation

It is for sure 😃. I'm in no way opposed to this fix - it's just that

> I think an overall review of the websocket lifetime management logic is required

this is definitely true.
I see now, I totally forgot that `check_destroy` destroys itself with the upgraded anchor still being there. One last question left - why did you decide to additionally guard only this path? `check_destroy` is called numerous times.
Well, the primary reason is just that all those crashes were fixed by applying the guard at those positions. 🤣 Usually, the crash occurred when there was consistent outbound transfer over a websocket connection and an instant shutdown occurred. The crash always happens when dereferencing the dangling `this` pointer in `send_binary()`'s deferred event.
This is just my guess. It seems that for most of a websocket connection's lifetime, an `async_read` request made inside `do_read()` is active. Thus, calling `close()`, or a remote endpoint's shutdown signal, reaches the `async_read` completion token first, with a valid `asio::error_code`, which makes the `check_destroy()` routine in `async_read()`'s completion handler valid in most cases.
However, any call to `send_data()` is usually (in my case) injected outside of the generic flow (e.g. `do_read()` or the `onmessage()` callback), and its `async_write` request is usually made after an `async_read` completion token has already been registered. Thus, in any shutdown situation, it receives its `asio::error_code` AFTER `async_read`'s completion token receives the error code.
I think that is why the other calls to `check_destroy()` didn't cause any issue. As this is literally a hot-fix to prevent consistent crashing on Windows, further investigation is required to assure that the `async_read` completion token really always receives the `error_code` first.
```text
SHUTDOWN!!!
│
│
▼
┌──────────► ┌►asynchronous write ───────────────────┐error_code
│            │                                       │
│            │              ▼                        │
I/O Threads ─┴──────────► ┌───►asynchronous read───────┼──────────┐error_code │
                          │                            │          │
                          │                            ▼          ▼
Executor Thread ────► do_read()────┘ ┌───►do_write()─┘ do_read()──►check_destroy() do_write::completion_handler ───► check_destroy() CRASH!!!
                                     │
                                     │
User Thread ────────────► send_data()─────┘
```
Seems reasonable. Well, if you were the first one to report this issue, then let's assume this is enough of a precaution if it solves it.
include/crow/websocket.h
Outdated
```cpp
[this,
 handler = std::forward<CompletionHandler>(handler),
 watch = std::weak_ptr<void>{anchor_}] {
    auto anchor = watch.lock();
    if (anchor == nullptr) { return; }

    handler();
});
```
Almost forgot when testing - C++11 😓 . (The init-captures above are C++14.)
I guess you can make a custom functor with this behaviour that will wrap the handler. But please, don't add over-engineered code...
include/crow/websocket.h
Outdated
```cpp
[&, watch = std::weak_ptr<void>{anchor_}](const asio::error_code& ec, std::size_t /*bytes_transferred*/) {
    if (!ec && !close_connection_)
```
Here as well (and at the top).
Just make sure it compiles without any warnings with the `-std=c++11` flag.
# Conflicts: # include/crow/websocket.h
oh, thank you for the speedy response! @dranikpg
I would suggest explaining a thread-safe, or at least a "recommended", way of using websocket::Connection outside of the asio handler flow. In a way it is related to #258, but taking into account the nature of websocket async communication it might be even more urgent. In one way or another, a server that needs to send a message to a connection has to capture a reference to the connection received much earlier, in accept or in previous messages. @dranikpg @kang-sw Besides, I'm looking forward to #529 and the refactor of websocket core parts. Crow is awesome :-) Thank you!
You're right, it's not documented so far. Websockets are increasingly popular in Crow, but unfortunately still pretty rough. I wouldn't say it's related to the async handlers issue, because lifetime management has to be handled differently. Thanks for using Crow in spite of its small flaws :)
On Windows, with error 10053, a websocket instance is sometimes deleted before the invocation of post-ed/dispatch-ed asynchronous I/O events, which makes the completion handler refer to the dangling pointer `this`.
We can detect whether the `this` pointer has already been deleted by checking whether the captured `weak_ptr` is expired. Though this introduces a bit of overhead on each message posting/dispatching, it makes concurrent access to a single websocket connection instance safe.