-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feedback: Fix/API Eventshandler thread leak includes #5416 #5419
Conversation
Fix/api deadlock
This could be done nicer with a waitlock..
Better version, using locks (while at it...)
More time for slow devices (i.e. aNag on slow networks) Also less checking the entire Http array, no need to rush it.. worst cast scenario: a zombie thread stays `alive` for 89 seconds...
curl and CTRL-C white the API is sending makes for a good new test!
Don't think SSL with a local lock will work correctly, keeping the whole thing locked for now.
I think the timer is needed at all but for now I'll set this to 1 hour, this doesn't get fired for anything except for my Netscalers that use keep-alive on the TCP socket to check if it's up and running. Also made the message a warning so it's more visible when it does happen. Timer checks every 10 minutes instead of every 15 seconds.
While waiting for the lock it may try disconnecting again, forgot to change this on git
While waiting to close and the httpserver timer is still implemented it could theoretically fire Disconnect twice.
Added fix for Icinga#5377
Fixing the rest, don't like the fact you can get a remote stack.. Hackers would be so happy to know where what in memory is located... Ps. This generates a thread leak also, going to debug that later!
metas, attrs and joins have all a custom message when it fails.
Should be mutable for sure, can't do changes in const function.
Thanks for your help 👍 We'll take a look at the whole issue after the 2.7.0 release. Changes to our threading are always complex to do and test, so we really appreciate you invested the time! |
"Invalid type for 'metas' attribute specified. Array type is required.", Empty); | ||
return true; | ||
} | ||
|
||
bool allJoins = HttpUtility::GetLastParameter(params, "all_joins"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move that into a separate PR.
@@ -134,6 +134,9 @@ class I2_REMOTE_API ApiListener : public ObjectImpl<ApiListener> | |||
WorkQueue m_RelayQueue; | |||
WorkQueue m_SyncQueue; | |||
|
|||
mutable boost::mutex m_HttpLock; | |||
mutable boost::mutex m_JsonLock; | |||
|
|||
boost::mutex m_LogLock; | |||
Stream::Ptr m_LogFile; | |||
size_t m_LogMessageCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll split the locks into a separate commit.
I haven't been able to reproduce the issue so far. It does not happen inside a CentOS 7 Vagrant box nor locally on macOS. It doesn't matter if there are 100 services or 9000 services, with checker/notification/ido-mysql enabled or disabled.
Firing these queries against it in an endless loop won't let thread count rise that much.
|
Got some crawler/port-scanner doing nasty things on Icinga2-port recently, crashing the icinga2 agent on alot of servers... Seems like this is something that should be milestoned for 2.7.1 :) |
I'm still reviewing the patches. After they are merged, a test window is required which will presumingly be after 2.7.1. |
Summary: I'm not able to reproduce it on RHEL/macOS, only on Debian Jessie in #5148. Still, the proposed patches do not solve the underlaying issue. I could think of a boost specific problem here. As this requires further analysis, this was removed for 2.8. It is an ongoing process and problem. |
I'm closing this in favour of #5148 |
See #6361. |
The problem is that the httpserverconnection get's unloaded (attempt) before the eventhandler and things get stuck.. We need the thread to wait until the handler knows the socket died so the response can finish.
A. Make it use a waitlock for processing
or
B. Make eventshandler set a bool in http server so it knows it needs to wait for the handler to die.
edit: Changed the patch to choice A; it fixes some more bugs and was easy enough and works instantly with eventshandler. (maybe there is no need to close socket before waiting for the condition this way?)
Making changed locally, testing them and then edit on github is not the best way to do this I know...
edit2: Found another way to test; fixed a few more thread lockups when quickly opening/closing connections (simulating unstable connection with connection drops)
Status; still stable. I have a hard time getting it to crash again with all the patches.
![image](https://user-images.githubusercontent.com/28925335/28326506-3b604e30-6be1-11e7-9c46-46ba74c8d3b4.png)