
Feedback: Fix/API Eventshandler thread leak includes #5416 #5419

Closed · 25 commits

Conversation

@Stefar77 Stefar77 commented Jul 16, 2017

The problem is that the HttpServerConnection gets unloaded (or an attempt is made) before the events handler, and things get stuck. We need the thread to wait until the handler knows the socket died, so the response can finish. Either:
A. Make it use a wait lock for processing,
or
B. Have the events handler set a bool in the HTTP server so it knows it needs to wait for the handler to die.

edit: Changed the patch to choice A; it fixes some more bugs, was easy enough, and works instantly with the events handler. (Maybe there is no need to close the socket before waiting for the condition this way?)

Making changes locally, testing them, and then editing on GitHub is not the best way to do this, I know...

edit2: Found another way to test; fixed a few more thread lockups when quickly opening/closing connections (simulating an unstable connection with connection drops).

Status: still stable. I have a hard time getting it to crash again with all the patches.
[screenshot]

More time for slow devices (i.e. aNag on slow networks).
Also less checking of the entire HTTP array; no need to rush it. Worst case scenario: a zombie thread stays `alive` for 89 seconds...
curl and CTRL-C while the API is sending makes for a good new test!
Don't think SSL with a local lock will work correctly; keeping the whole thing locked for now.
I don't think the timer is needed at all, but for now I'll set this to 1 hour; it doesn't get fired for anything except my Netscalers, which use keep-alive on the TCP socket to check that it's up and running.
Also made the message a warning so it's more visible when it does happen.
The timer checks every 10 minutes instead of every 15 seconds.
While waiting for the lock it may try disconnecting again; forgot to change this on git.
While waiting to close, and with the HTTP server timer still implemented, it could theoretically fire Disconnect twice.
@Stefar77 Stefar77 changed the title Feedback: Fix/API Eventshandler thread leak Feedback: Fix/API Eventshandler thread leak includes #5416 Jul 17, 2017
Fixing the rest; don't like the fact that you can get a remote stack trace...
Hackers would be so happy to know what is located where in memory...

Ps. This generates a thread leak also, going to debug that later!
@Crunsher Crunsher added the area/api REST API label Jul 18, 2017
metas, attrs and joins all have a custom message when they fail.
Should be mutable for sure; can't make changes in a const function.
@dnsmichi dnsmichi self-requested a review July 20, 2017 08:52
@Stefar77 (Contributor, Author)
32694 icinga 32 20 0 409M 265M nanslp 16 121:46 7.24% icinga2

With concurrency set to 2 (on a 24-core machine) it seems to get the active check timing better.
[screenshot]

Notice me restarting Icinga many times last night, making the checks go wild.

@Crunsher (Contributor)
Thanks for your help 👍

We'll take a look at the whole issue after the 2.7.0 release. Changes to our threading are always complex to do and test, so we really appreciate that you invested the time!

"Invalid type for 'metas' attribute specified. Array type is required.", Empty);
return true;
}

bool allJoins = HttpUtility::GetLastParameter(params, "all_joins");

I'll move that into a separate PR.

@@ -134,6 +134,9 @@ class I2_REMOTE_API ApiListener : public ObjectImpl<ApiListener>
WorkQueue m_RelayQueue;
WorkQueue m_SyncQueue;

mutable boost::mutex m_HttpLock;
mutable boost::mutex m_JsonLock;

boost::mutex m_LogLock;
Stream::Ptr m_LogFile;
size_t m_LogMessageCount;

I'll split the locks into a separate commit.

@dnsmichi (Contributor)

I haven't been able to reproduce the issue so far. It happens neither inside a CentOS 7 Vagrant box nor locally on macOS. It doesn't matter whether there are 100 services or 9000, or whether checker/notification/ido-mysql are enabled or disabled.

watch -n 1 'for pid in $(pidof icinga2); do ps -T -p $pid | grep -F icinga2; done | wc -l'

Firing these queries against it in an endless loop won't make the thread count rise much:

while true; do curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/services' >/dev/null; done

while true; do curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/process-check-result' -d '{ "type": "Service", "filter": true, "plugin_output": "api stress", "exit_status": 1 }' >/dev/null && sleep 0.1; done

while true; do /usr/local/sbin/check_http -H localhost -p 5665 -S -e '401 Unauthorized' -N > /dev/null; done

dupondje commented Sep 3, 2017

Got some crawler/port-scanner doing nasty things on the Icinga2 port recently, crashing the icinga2 agent on a lot of servers...

Seems like this is something that should be milestoned for 2.7.1 :)
Please merge!

dnsmichi commented Sep 4, 2017

I'm still reviewing the patches. After they are merged, a test window is required, which will presumably be after 2.7.1.

@dnsmichi dnsmichi added this to the 2.8.0 milestone Sep 7, 2017
@dnsmichi dnsmichi self-assigned this Sep 7, 2017
@dnsmichi dnsmichi added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Sep 7, 2017
dnsmichi pushed a commit that referenced this pull request Sep 18, 2017
This was split from #5416 and #5419.

More patches from #5419 are pending.

refs #5419
refs #5418
refs #5416

refs #5408
refs #5148
refs #5007
refs #4968
refs #4910
@gunnarbeutner gunnarbeutner modified the milestones: 2.8.0, 2.9.0 Oct 16, 2017
dnsmichi commented Nov 2, 2017

Summary: I'm not able to reproduce it on RHEL/macOS, only on Debian Jessie in #5148.

Still, the proposed patches do not solve the underlying issue.

I could imagine a Boost-specific problem here. As this requires further analysis, this was removed from the 2.8 milestone. It is an ongoing process and problem.

#5148 (comment)

@dnsmichi (Contributor)

I'm closing this in favour of #5148

@dnsmichi dnsmichi closed this Apr 18, 2018
@dnsmichi dnsmichi removed this from the 2.9.0 milestone Apr 18, 2018
@dnsmichi (Contributor)

See #6361.
