
Feedback: Fix/API Eventshandler thread leak includes #5416 #5419

Closed · 25 commits

Conversation

@Stefar77 Stefar77 commented Jul 16, 2017

The problem is that the HttpServerConnection gets unloaded (or an attempt is made) before the events handler, and things get stuck. We need the thread to wait until the handler knows the socket died, so the response can finish. Either:
A. Make it use a wait lock for processing,
or
B. Have the events handler set a bool in the HTTP server so it knows it needs to wait for the handler to die.

edit: Changed the patch to choice A; it fixes some more bugs, was easy enough, and works instantly with the events handler. (Maybe there is no need to close the socket before waiting for the condition this way?)

Making changes locally, testing them, and then editing on GitHub is not the best way to do this, I know...

edit2: Found another way to test; fixed a few more thread lockups when quickly opening/closing connections (simulating an unstable connection with connection drops).

Status: still stable. I have a hard time getting it to crash again with all the patches.
[screenshot]

More time for slow devices (i.e. aNag on slow networks).
Also less checking of the entire HTTP array; no need to rush it. Worst case scenario: a zombie thread stays `alive` for 89 seconds...
curl and CTRL-C while the API is sending makes for a good new test!
Don't think SSL with a local lock will work correctly; keeping the whole thing locked for now.
I don't think the timer is needed at all, but for now I'll set this to 1 hour; it doesn't get fired for anything except my Netscalers, which use keep-alive on the TCP socket to check that it's up and running.
Also made the message a warning so it's more visible when it does happen.
The timer checks every 10 minutes instead of every 15 seconds.
While waiting for the lock it may try disconnecting again; forgot to change this on git.
While waiting to close, and with the HTTP server timer still implemented, it could theoretically fire Disconnect twice.
@Stefar77 Stefar77 changed the title Feedback: Fix/API Eventshandler thread leak Feedback: Fix/API Eventshandler thread leak includes #5416 Jul 17, 2017
Fixing the rest; don't like the fact that you can get a remote stack trace...
Hackers would be so happy to know what is located where in memory...

Ps. This generates a thread leak also, going to debug that later!
@Crunsher Crunsher added the area/api REST API label Jul 18, 2017
metas, attrs and joins all have a custom message when they fail.
Should be mutable for sure; can't make changes in a const function.
@dnsmichi dnsmichi self-requested a review July 20, 2017 08:52
@Stefar77 (Contributor, Author)
32694 icinga 32 20 0 409M 265M nanslp 16 121:46 7.24% icinga2

With concurrency set to 2 (on a 24-core machine) it seems to get the active check timing better.
[screenshot]

Notice me restarting Icinga many times last night, making the checks go wild.

@Crunsher (Contributor)
Thanks for your help 👍

We'll take a look at the whole issue after the 2.7.0 release. Changes to our threading are always complex to do and test, so we really appreciate that you invested the time!

"Invalid type for 'metas' attribute specified. Array type is required.", Empty);
return true;
}

bool allJoins = HttpUtility::GetLastParameter(params, "all_joins");

I'll move that into a separate PR.

@@ -134,6 +134,9 @@ class I2_REMOTE_API ApiListener : public ObjectImpl<ApiListener>
WorkQueue m_RelayQueue;
WorkQueue m_SyncQueue;

mutable boost::mutex m_HttpLock;
mutable boost::mutex m_JsonLock;

boost::mutex m_LogLock;
Stream::Ptr m_LogFile;
size_t m_LogMessageCount;

I'll split the locks into a separate commit.

@dnsmichi (Contributor)

I haven't been able to reproduce the issue so far. It happens neither inside a CentOS 7 Vagrant box nor locally on macOS. It doesn't matter whether there are 100 services or 9000, or whether checker/notification/ido-mysql are enabled or disabled.

watch -n 1 'for pid in $(pidof icinga2); do ps -T -p $pid | grep -F icinga2; done | wc -l'

Firing these queries against it in an endless loop won't make the thread count rise much:

while true; do curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/services' >/dev/null; done

while true; do curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/process-check-result' -d '{ "type": "Service", "filter": true, "plugin_output": "api stress", "exit_status": 1 }' >/dev/null && sleep 0.1; done

while true; do /usr/local/sbin/check_http -H localhost -p 5665 -S -e '401 Unauthorized' -N > /dev/null; done

dupondje commented Sep 3, 2017

Got some crawler/port-scanner doing nasty things on the Icinga2 port recently, crashing the icinga2 agent on a lot of servers...

Seems like this is something that should be milestoned for 2.7.1 :)
Please merge!

dnsmichi commented Sep 4, 2017

I'm still reviewing the patches. After they are merged, a test window is required, which will presumably be after 2.7.1.

@dnsmichi dnsmichi added this to the 2.8.0 milestone Sep 7, 2017
@dnsmichi dnsmichi self-assigned this Sep 7, 2017
@dnsmichi dnsmichi added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) labels Sep 7, 2017
dnsmichi pushed a commit that referenced this pull request Sep 18, 2017
This was split from #5416 and #5419.

More patches from #5419 are pending.

refs #5419
refs #5418
refs #5416

refs #5408
refs #5148
refs #5007
refs #4968
refs #4910
@gunnarbeutner gunnarbeutner modified the milestones: 2.8.0, 2.9.0 Oct 16, 2017
dnsmichi commented Nov 2, 2017

Summary: I'm not able to reproduce it on RHEL/macOS, only on Debian Jessie in #5148.

Still, the proposed patches do not solve the underlying issue.

I could imagine a Boost-specific problem here. As this requires further analysis, this was removed from the 2.8 milestone. It is an ongoing process and problem.

#5148 (comment)

@dnsmichi (Contributor)

I'm closing this in favour of #5148

@dnsmichi dnsmichi closed this Apr 18, 2018
@dnsmichi dnsmichi removed this from the 2.9.0 milestone Apr 18, 2018
@dnsmichi (Contributor)

See #6361.
