-
Notifications
You must be signed in to change notification settings - Fork 574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove -and notify- expired downtimes immediately, not every 60s II #9726
Conversation
Short storyref/IP/43624 BeforeAfterMedium storyDon't look for expired downtimes in a timer fired every 60s, but fire one timer per downtime once at expire time. Long story
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already looks way better in general compared to #9614 👍
TODO @Al2Klimov
|
git ls-files -z |xargs -0 perl -pi -e 's/\bnew Timer\b/Timer::Create/g' ex. in Timer::Create() itself.
to prevent the case: Timer callback destroys parent object -> destroys Timer -> ~Timer() -> Stop(true) -> waits for the Timer callback to finish -> deadlock.
via new Timer#InternalRescheduleUnlocked()
so that only the first one changes l_AliveTimers (as in Timer#Stop()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat bad news: given the changes to Timer
, more places would now have to explicitly use Stop(true)
, for example (m_StatsTimer
isn't used elsewhere):
icinga2/lib/icingadb/icingadb.cpp
Lines 118 to 121 in 7d49921
m_StatsTimer = Timer::Create(); | |
m_StatsTimer->SetInterval(1); | |
m_StatsTimer->OnTimerExpired.connect([this](const Timer * const&) { PublishStatsTimerHandler(); }); | |
m_StatsTimer->Start(); |
Relying on when a object managed by a shared_ptr
/intrusive_ptr
is destroyed already looked quite fragile before (i.e. the point of these types is that there is no single owning reference to the object, yet, Timer::Ptr
was used like that in many places). We could make that more explicit with something like a scoped timer class that wraps a Timer::Ptr
and in its destructor calls Stop(true)
on that pointer.
What exactly did my changes cause here? |
The callback for If I haven't overlooked anything:
|
Good catch. What about a flag instead which explicitly enables the behavior I introduced? In this case just for Downtime cleanup? |
I'm not sure how that would look like exactly but it sounds like a flag I might want to remove immediately afterwards. The problem with the existing use of |
Please cross-check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In tests, I've noticed that sometimes (haven't noticed a pattern and also not very often), downtimes are being deleted immediately and the schedule-downtime action even fails with an internal server error:
[2023-04-14 14:10:23 +0200] information/ApiListener: New client connection from [::ffff:172.20.0.1]:46650 (no client certificate)
[2023-04-14 14:10:23 +0200] information/Downtime: Triggering downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' for checkable 'x-dummy-1!random-1'.
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Created and activated object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Deleted object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/Downtime: Removed downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' from checkable 'x-dummy-1!random-1' (Reason: expired at 2023-04-14 14:15:23 +0200).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: Request: POST /v1/actions/schedule-downtime (from [::ffff:172.20.0.1]:46650), user: root, agent: curl/8.0.1, status: Internal Server Error).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: HTTP client disconnected (from [::ffff:172.20.0.1]:46650)
Also not how "expired at 2023-04-14 14:15:23 +0200" is a time 5 minutes after the timestamp of that log message.
I was able to reproduce this with a loop repeatedly scheduling a flexible downtime with a window beginning now, ending in 5 minutes, with a duration of 1 minute:
while curl --fail -isSku root:icinga https://localhost:5665/v1/actions/schedule-downtime --json '{ "type": "Service", "service": "x-dummy-1!random-1", "start_time": '$(date +%s)', "end_time": '$(date -d 5min +%s)', "author": "icingaadmin", "comment": "flexible", "fixed": false, "duration": 60, "pretty": true }'; do :; done
HTTP response body showed this error message:
icinga2/lib/icinga/downtime.cpp
Line 340 in 72af7df
BOOST_THROW_EXCEPTION(std::runtime_error("Could not create downtime object.")); |
before permitting their parent objects' destruction. For the cases where the handlers have raw pointers to these objects.
Don't look for expired downtimes in a timer fired every 60s, but fire one timer per downtime once at expire time.
to actually reflect its purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can still reproduce #9726 (review) with the updated version.
Yep.
|
the last state change could be a long time ago. If it's longer than the new downtime's duration, the downtime expires immediately. trigger time + duration < now
I was also able to reproduce the incorrect trigger time on the current master. Basically run the same test against master and then filter for downtimes with a too early trigger time:
$ date -d @1682322676.989451
Mon 24 Apr 2023 09:51:16 CEST
$ date -d @1682322676
Mon 24 Apr 2023 09:51:16 CEST
$ date -d @1682322146.121111
Mon 24 Apr 2023 09:42:26 CEST |
closes #9614