Fix websocket crash from connection pointer dangling #530

Merged
8 commits merged on Oct 9, 2022

Conversation

kang-sw
Contributor

@kang-sw kang-sw commented Aug 28, 2022

On Windows, with error 10053, a websocket instance is sometimes deleted before its posted/dispatched asynchronous I/O events are invoked, which makes the completion handler refer to the dangling pointer 'this'.

We can detect whether the 'this' pointer has already been deleted by checking whether a captured weak_ptr has expired. Although this introduces a bit of overhead on each message post/dispatch, it makes concurrent access to a single websocket connection instance safe.
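
For illustration, a minimal self-contained sketch of that guard: anchor_ mirrors the member name in this PR's diff, while the surrounding class and the direct call to asio::post are assumptions rather than the exact Crow code.

```cpp
#include <asio.hpp>
#include <memory>

struct connection_sketch
{
    asio::io_context& io_;
    // Lifetime marker: destroyed together with the connection instance.
    std::shared_ptr<void> anchor_ = std::make_shared<int>(0);

    explicit connection_sketch(asio::io_context& io) : io_(io) {}

    template <typename Handler>
    void post(Handler handler)
    {
        std::weak_ptr<void> watch = anchor_;
        asio::post(io_, [this, watch, handler]() mutable {
            // If the connection was deleted before this handler ran, the
            // weak_ptr has expired and `this` must not be dereferenced.
            auto anchor = watch.lock();
            if (anchor == nullptr)
                return;
            handler();
        });
    }
};
```

Because anchor_ is destroyed together with the connection, a handler that runs after deletion sees an expired weak_ptr and returns before touching this.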

@dranikpg
Member

Hi! Thanks for the fix! Sorry for the long wait 😓

Let's ask someone who might have Windows on hand, @The-EDev, to see if he can reproduce the original bug.

Otherwise I'll check this out in a few days.

Member

@dranikpg dranikpg left a comment


Clever solution 👑

See my comment regarding await correctness.

Besides, if we approach #529 and refactor the websocket core parts, we might avoid this problem altogether. What do you think?

Comment on lines +123 to +132
// Do not modify anchor_ here since writing shared_ptr is not atomic.
auto watch = std::weak_ptr<void>{anchor_};

// Wait until all unhandled asynchronous operations have joined.
// As the deletion occurs inside 'check_destroy()', which already locks the
// anchor, the use count can be 1 in a valid deletion context.
while (watch.use_count() > 2) // 1 for the 'check_destroy()' routine, 1 for 'this->anchor_'
{
    std::this_thread::yield();
}
Member


If I understand correctly, dispatch/post capture just a weak_ptr. Once the weak_ptr.lock() != nullptr check has succeeded, we cannot drop the instance as long as the handler scope is active, i.e. the weak_ptr is in scope.

So we should probably wait for the weak count to be 1 (only our local ref exists). What's more, we would have to create an atomic barrier, because a weak_ptr.lock() != nullptr check could happen after our loop has finished running...

Contributor Author

@kang-sw kang-sw Sep 11, 2022


Actually, it does not depend on the weak count. Instead, in the actual async_write scope you can see a new variable being declared from the captured weak_ptr: auto anchor = watch.lock();
The anchor variable is non-null if this instance has not yet become dangling, and it keeps the use_count above 1 for the duration of its scope (one for this->anchor_, one for the local variable anchor).
The loop in the destructor waits until all other anchor instances have been destroyed by repeatedly checking the use_count.
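
For reference, a rough sketch of this counting, using illustrative names and free functions rather than the exact Crow code, and assuming (as the discussion below confirms) that check_destroy() keeps its own locked anchor alive across the deletion:

```cpp
#include <memory>
#include <thread>

struct connection_sketch
{
    std::shared_ptr<void> anchor_ = std::make_shared<int>(0);

    ~connection_sketch()
    {
        std::weak_ptr<void> watch = anchor_;
        // Baseline of 2: one count for `anchor_` itself, one for the locked
        // copy held by the check_destroy()-style caller doing the deletion.
        while (watch.use_count() > 2)
            std::this_thread::yield();
    }
};

// Stand-in for check_destroy(): it keeps its own locked anchor alive across
// the deletion, which is why the destructor's baseline above is 2, not 1.
void destroy(connection_sketch* self)
{
    auto guard = self->anchor_; // the extra count the destructor expects
    delete self;
}

// Stand-in for a guarded completion handler (e.g. async_write's):
void on_write_done(std::weak_ptr<void> watch, connection_sketch* self)
{
    auto anchor = watch.lock(); // non-null only while *self is still alive
    if (anchor == nullptr)
        return;
    // Safe to touch *self here; `anchor` keeps use_count() above 2 until this
    // scope ends, which is exactly what the destructor above waits for.
    (void)self;
}
```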


And I think resolving #529 is not directly related to this issue. The primary purpose of this routine is to detect dangling during the deferred execution of events that were posted (or dispatched).

Dangling occurs when I capture connection references to use them outside of the callback functions (onmessage, etc.). By using userdata (in a slightly tricky way :|), I can detect whether the connection instance itself has dangled when invoking send_data on the captured reference. However, if the websocket has already been closed and deleted during a previous event invocation, the 'this' pointer of an event queued to the executor is dangling, since the executor of adapter_ lives longer than the websocket connection itself.

This PR provides a way to detect whether a posted event has already expired and its 'this' pointer is no longer usable. Of course, this is a pretty improvised way to resolve the issue; I think an overall review of the websocket lifetime management logic is required. IMHO, deleting this seems a bit risky in the current situation.

Member


"IMHO, deleting this seems a bit risky in the current situation"

It is for sure 😃. I'm in no way opposed to this fix; it's just that

"I think an overall review of the websocket lifetime management logic is required"

This is definitely true.

I see now; I totally forgot that check_destroy destroys itself with the upgraded anchor still being there. One last question left: why did you decide to additionally guard only this path? check_destroy is called numerous times.

Contributor Author

@kang-sw kang-sw Sep 12, 2022


Well, the primary reason was simply that all those crashes were fixed by applying the guard in those positions. 🤣 Usually the crash occurred when there was continuous outbound traffic on a websocket connection and a sudden shutdown happened.

[crash screenshot]

The crash always occurs when dereferencing the dangling pointer this in send_binary()'s deferred event.

This is just my guess: it seems that for most of a websocket connection's lifetime, an async_read request made inside do_read() is active. Thus a call to close(), or the remote endpoint's shutdown signal, reaches the async_read completion token first, with a valid asio::error_code, which makes the check_destroy() routine in async_read()'s completion handler valid in most cases.

However, a call to send_data() is usually (in my case) injected from outside the generic flow (e.g. do_read() or the onmessage() callback), and its async_write request is usually made after an async_read completion token has already been registered. So in any shutdown situation it receives its asio::error_code AFTER async_read's completion token receives its error code.

I think that is why the other calls to check_destroy() didn't cause any issues. As this is literally a hot-fix to prevent consistent crashing on Windows, further investigation is required to make sure that the async_read completion token really always receives the error_code first.

                                                                     SHUTDOWN!!!
                                                                        │
                                                                        │
                                                                        ▼
                  ┌──────────►                                    ┌►asynchronous write ───────────────────┐error_code
                  │                                               │     │                                 │
                  │                                               │     ▼                                 │
  I/O Threads    ─┴──────────►       ┌───►asynchronous read───────┼──────────┐error_code                  │
                                     │                            │          │                            │
                                     │                            │          ▼                            ▼
  Executor Thread ────► do_read()────┘            ┌───►do_write()─┘        do_read()──►check_destroy()   do_write::completion_handler ───► check_destroy() CRASH!!!
                                                  │
                                                  │
  User Thread    ────────────►    send_data()─────┘

Member


Seems reasonable. Well, if you were the first one to report this issue, then let's assume this takes enough precautions, as long as it solves it.

@The-EDev
Member

@dranikpg I haven't used Windows in ages. Though this might be related to a problem I ran into in an early implementation of #269: I kept attempting to access websocket connections after they had been deleted and couldn't figure out how or why.

Comment on lines 155 to 162
[this,
 handler = std::forward<CompletionHandler>(handler),
 watch = std::weak_ptr<void>{anchor_}] {
    auto anchor = watch.lock();
    if (anchor == nullptr) { return; }

    handler();
});
Member


Almost forgot when testing: C++11 😓.

I guess you can make a custom functor with this behaviour to wrap the handler. But please, don't add over-engineered code...
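
For illustration, a minimal C++11-compatible sketch of such a wrapper: a small functor that owns the weak_ptr and the wrapped handler, so no C++14 init-captures are needed. The names here are hypothetical, not from the final diff.

```cpp
#include <memory>
#include <utility>

// Calls the wrapped handler only if the watched connection is still alive.
template <typename Handler>
struct weak_guarded_handler
{
    std::weak_ptr<void> watch;
    Handler handler;

    weak_guarded_handler(std::weak_ptr<void> w, Handler h)
        : watch(std::move(w)), handler(std::move(h)) {}

    void operator()()
    {
        // Skip the call entirely if the owning connection is already gone.
        auto anchor = watch.lock();
        if (anchor == nullptr)
            return;
        handler();
    }
};

template <typename Handler>
weak_guarded_handler<Handler> make_weak_guarded(std::weak_ptr<void> w, Handler h)
{
    return weak_guarded_handler<Handler>(std::move(w), std::move(h));
}
```

Posting make_weak_guarded(anchor_, std::move(handler)) to the executor would then give roughly the same behaviour as the C++14 lambda above.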

Comment on lines 659 to 660
[&, watch = std::weak_ptr<void>{anchor_}](const asio::error_code& ec, std::size_t /*bytes_transferred*/) {
    if (!ec && !close_connection_)
Member

@dranikpg dranikpg Sep 14, 2022


Here as well (and at the top).
Just make sure it compiles without any warnings with the -std=c++11 flag.

Contributor Author

@kang-sw kang-sw Sep 17, 2022


Okay. However, as this pull request conflicts with #535 due to the modifications to websocket::Connection<>::post and dispatch, I'll make these changes after a decision is made on #535.

@kang-sw
Contributor Author

kang-sw commented Sep 24, 2022

  1. Resolved the merge conflict with the master branch
  2. Resolved a compilation warning caused by C++14 features

@dangerden

@dranikpg @kang-sw hey guys, is anything missing for this PR to move forward?

@dranikpg dranikpg merged commit ac757ff into CrowCpp:master Oct 9, 2022
@dangerden

oh, thank you for the speedy response! @dranikpg

@dangerden

I would suggest documenting a thread-safe, or at least a "recommended", way of using websocket::Connection outside of the asio handler flow. In a way it is related to #258, but given the nature of asynchronous websocket communication it might be even more urgent. One way or another, a server that needs to send a message to a connection has to capture a reference to the connection received much earlier, in accept or in previous messages. @dranikpg @kang-sw

Besides, I'm looking forward to #529 and the refactor of the websocket core parts. Crow is awesome :-) Thank you!
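
For what it's worth, a hedged sketch of one such pattern: keep live connections in a mutex-guarded set, insert in onopen, erase in onclose, and only dereference them while holding the lock. The helper functions below are hypothetical, and wiring them into the route callbacks is omitted because callback signatures differ slightly between Crow versions.

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

#include "crow.h"

std::mutex conns_mtx;
std::unordered_set<crow::websocket::connection*> conns;

void register_conn(crow::websocket::connection& c)   // call from onopen
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    conns.insert(&c);
}

void unregister_conn(crow::websocket::connection& c) // call from onclose
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    conns.erase(&c);
}

// Called from any user thread: only pointers still present in the set are
// used, so a connection already removed in onclose is never dereferenced.
void broadcast_text(const std::string& msg)
{
    std::lock_guard<std::mutex> lk(conns_mtx);
    for (auto* c : conns)
        c->send_text(msg);
}
```

Even with such a registry, the weak_ptr guard from this PR still matters: it protects events that were already queued to the executor when the connection went away.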

@dranikpg
Member

You're right, it's not documented so far. Websockets are increasingly popular in Crow, but unfortunately still pretty rough.

I wouldn't say it's related to the async handlers issue, because lifetime management needs to be handled differently.

Thanks for using Crow in spite of its small flaws :)
