
Remove -and notify- expired downtimes immediately, not every 60s II #9726

Merged (14 commits) May 2, 2023

Conversation


@Al2Klimov Al2Klimov commented Mar 21, 2023

closes #9614

@Al2Klimov Al2Klimov self-assigned this Mar 21, 2023
@cla-bot cla-bot bot added the cla/signed label Mar 21, 2023
@Al2Klimov Al2Klimov changed the title WIP Remove -and notify- expired downtimes immediately, not every 60s II Mar 21, 2023
@Al2Klimov Al2Klimov removed their assignment Mar 21, 2023
@Al2Klimov
Member Author

Short story

ref/IP/43624

Before

(Screenshot 2022-12-14 at 15:31:47)

After

(Screenshot 2023-03-21 at 16:13:55)

Medium story

Don't look for expired downtimes in a timer that fires every 60s; instead, fire one timer per downtime, once, at its expiry time.

Long story

  • 310dea2 is your friend: whichever node is responsible for a Downtime runs a timer at its expiry time which removes the Downtime. IMAO pretty clear.
  • As the above changed what the existing timer does, 8502fab changes its name. Dead simple.
  • As I introduced non-periodic timers, 907050d makes sure system clock jumps don't break them – see Remove -and notify- expired downtimes immediately, not every 60s #9614 (review).
  • Now we have a problem: the Downtime owns its destruction timer, i.e. the timer destroys the Downtime, which destroys the timer, which waits for the timer run to complete – and that deadlocks – see Remove -and notify- expired downtimes immediately, not every 60s #9614 (comment). My solution is to keep the timer alive during its run via e78fd53: the additional keepalive pointer outlives both the main one in the Downtime and the whole Timer#Call(), so the timer is destroyed in the thread pool which ran it (see the sketch after this list).
  • The keepalive pointer has to come from somewhere; I chose weak_ptr. This required switching from intrusive pointers to shared ones via c749d38.
  • That approach required a factory method (5df7cd0), used everywhere (06eaef7) as the only way of construction (e6170a3), because a weak_ptr requires an existing shared_ptr, which requires a fully constructed object.
  • bcd7342 avoids redundant locking via RAII from the new continue; code path.
  • a0a0d61 prevents intrusive pointer usage.
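
A minimal sketch of the keep-alive idea, not the actual Icinga 2 code: Timer::Create and Timer#Call mirror the names mentioned above, while the use of enable_shared_from_this here is an assumption (the real keepalive pointer comes from a weak_ptr). The point is that dropping the last external reference inside the callback only destroys the timer after Call() returns, on the thread-pool thread that ran it:

#include <functional>
#include <memory>

class Timer : public std::enable_shared_from_this<Timer>
{
public:
	// Factory method: obtaining a weak_ptr/shared_from_this() requires an
	// existing shared_ptr, which requires a fully constructed object, hence
	// no public constructor.
	static std::shared_ptr<Timer> Create()
	{
		return std::shared_ptr<Timer>(new Timer());
	}

	void Call(const std::function<void()>& callback)
	{
		// Keep-alive reference for the duration of the callback. Even if the
		// callback drops the last external reference (e.g. by removing the
		// Downtime that owns this timer), destruction happens only after
		// Call() returns, in the thread that ran the callback.
		auto keepAlive = shared_from_this();

		callback();
	}

private:
	Timer() = default;
};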

@Al2Klimov Al2Klimov marked this pull request as ready for review March 21, 2023 15:54
@Al2Klimov Al2Klimov added bug Something isn't working area/notifications Notification events ref/IP labels Mar 21, 2023
@Al2Klimov Al2Klimov added this to the 2.14.0 milestone Mar 21, 2023
Contributor

@julianbrost julianbrost left a comment


Already looks way better in general compared to #9614 👍

Resolved review threads: lib/base/timer.cpp (×2), lib/icinga/downtime.cpp (×2, one outdated).
@Al2Klimov
Member Author

Al2Klimov commented Apr 3, 2023

TODO @Al2Klimov

  • rebase
  • test

Commit message excerpts (truncated):

  • git ls-files -z |xargs -0 perl -pi -e 's/\bnew Timer\b/Timer::Create/g'
  • ex. in Timer::Create() itself.
  • to prevent the case: Timer callback destroys parent object -> destroys Timer -> ~Timer() -> Stop(true) -> waits for the Timer callback to finish -> deadlock.
  • via new Timer#InternalRescheduleUnlocked()
  • so that only the first one changes l_AliveTimers (as in Timer#Stop()).
Contributor

@julianbrost julianbrost left a comment


Somewhat bad news: given the changes to Timer, more places would now have to explicitly use Stop(true), for example (m_StatsTimer isn't used elsewhere):

m_StatsTimer = Timer::Create();
m_StatsTimer->SetInterval(1);
m_StatsTimer->OnTimerExpired.connect([this](const Timer * const&) { PublishStatsTimerHandler(); });
m_StatsTimer->Start();

Relying on when an object managed by a shared_ptr/intrusive_ptr is destroyed already looked quite fragile before (i.e. the point of these types is that there is no single owning reference to the object, yet Timer::Ptr was used like that in many places). We could make that more explicit with something like a scoped timer class that wraps a Timer::Ptr and in its destructor calls Stop(true) on that pointer.
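
A hypothetical sketch of that suggestion (the class name and details are illustrative, not part of this PR; Timer::Ptr and Stop(true) are the existing Icinga 2 API): a non-copyable owner whose destructor stops the timer and waits for a running callback, so raw pointers captured by the callback, such as this, cannot be used after the owner is gone:

#include "base/timer.hpp"

#include <utility>

class ScopedTimer
{
public:
	explicit ScopedTimer(icinga::Timer::Ptr timer) : m_Timer(std::move(timer)) { }

	// Single owner: copying would blur who stops and waits for the timer.
	ScopedTimer(const ScopedTimer&) = delete;
	ScopedTimer& operator=(const ScopedTimer&) = delete;

	~ScopedTimer()
	{
		if (m_Timer)
			m_Timer->Stop(true); // stop and wait for any running callback
	}

	const icinga::Timer::Ptr& operator->() const { return m_Timer; }

private:
	icinga::Timer::Ptr m_Timer;
};

With such a wrapper, m_StatsTimer above could become a ScopedTimer member, and the IcingaDB destructor would then implicitly wait for PublishStatsTimerHandler() to finish.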

@Al2Klimov
Member Author

What exactly did my changes cause here?

@julianbrost
Contributor

The callback for m_StatsTimer captures this.

If I haven't overlooked anything:

  • Without this PR: when deleting an IcingaDB object, that m_StatsTimer member should have been the last reference to that Timer, so when that Timer::Ptr is destroyed, it should have called ~Timer(), which waited for any running callback to finish, which ensures that this is not accessed afterwards.
  • With this PR: a running callback extends the lifespan of the Timer object and m_StatsTimer might be only the second-to-last reference, so the deletion of the IcingaDB might finish while the callback is still running on this (see the toy example below).
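
A toy illustration of the two cases using plain std::shared_ptr (names made up, not Icinga 2 code): once a running callback holds an extra reference, resetting the m_StatsTimer-like member no longer destroys the timer, so the "wait in ~Timer()" guarantee no longer protects this:

#include <iostream>
#include <memory>

struct Timer { ~Timer() { std::cout << "Timer destroyed\n"; } };

int main()
{
	auto statsTimer = std::make_shared<Timer>();

	// Without this PR: statsTimer is the last reference, so resetting it
	// destroys the Timer (and the real ~Timer() waits for running callbacks).
	std::cout << "refs: " << statsTimer.use_count() << "\n"; // refs: 1

	// With this PR: a running callback holds a keep-alive reference.
	auto keepAlive = statsTimer;

	statsTimer.reset(); // the "owner" is gone, but the Timer still lives
	std::cout << "Timer still alive: " << std::boolalpha
		<< static_cast<bool>(keepAlive) << "\n"; // true

	return 0;
} // "Timer destroyed" is printed only here, when keepAlive goes out of scope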

@Al2Klimov
Member Author

Good catch. What about a flag instead which explicitly enables the behavior I introduced? In this case just for Downtime cleanup?

@julianbrost
Contributor

I'm not sure what that would look like exactly, but it sounds like a flag I might want to remove immediately afterwards.

The problem with the existing use of Timer::Ptr is that the type itself models shared ownership of the timer, but sometimes it's used as if it were unique ownership, and there is really nothing that prevents you from accidentally breaking this assumption (basically, that's what this PR did in its current state). So I like the idea of having an explicit owning reference type instead (no refcounting; just delete this object and the timer will be stopped and waited for, always).

@Al2Klimov
Member Author

Please cross-check.

Contributor

@julianbrost julianbrost left a comment


In tests, I've noticed that sometimes (I haven't noticed a pattern, and it doesn't happen very often) downtimes are being deleted immediately and the schedule-downtime action even fails with an internal server error:

[2023-04-14 14:10:23 +0200] information/ApiListener: New client connection from [::ffff:172.20.0.1]:46650 (no client certificate)
[2023-04-14 14:10:23 +0200] information/Downtime: Triggering downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' for checkable 'x-dummy-1!random-1'. 
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Created and activated object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Deleted object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/Downtime: Removed downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' from checkable 'x-dummy-1!random-1' (Reason: expired at 2023-04-14 14:15:23 +0200).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: Request: POST /v1/actions/schedule-downtime (from [::ffff:172.20.0.1]:46650), user: root, agent: curl/8.0.1, status: Internal Server Error).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: HTTP client disconnected (from [::ffff:172.20.0.1]:46650)

Also note how "expired at 2023-04-14 14:15:23 +0200" is a time 5 minutes after the timestamp of that log message.

I was able to reproduce this with a loop repeatedly scheduling a flexible downtime with a window beginning now, ending in 5 minutes, with a duration of 1 minute:

while curl --fail -isSku root:icinga https://localhost:5665/v1/actions/schedule-downtime --json '{ "type": "Service", "service": "x-dummy-1!random-1", "start_time": '$(date +%s)', "end_time": '$(date -d 5min +%s)', "author": "icingaadmin", "comment": "flexible", "fixed": false, "duration": 60, "pretty": true }'; do :; done

HTTP response body showed this error message:

BOOST_THROW_EXCEPTION(std::runtime_error("Could not create downtime object."));

Commit message excerpts (truncated):

  • before permitting their parent objects' destruction.
  • For the cases where the handlers have raw pointers to these objects.
  • Don't look for expired downtimes in a timer fired every 60s, but fire one timer per downtime once at expire time.
  • to actually reflect its purpose.
Contributor

@julianbrost julianbrost left a comment


I can still reproduce #9726 (review) with the updated version.

@Al2Klimov
Member Author

Yep.

include <itl>

object CheckerComponent "cc" { }

object ApiListener "api" { }

object ApiUser "root" {
	password = "123456"
	permissions = [ "*" ]
}

object Endpoint NodeName { }

object Zone NodeName {
	endpoints = [ NodeName ]
}

for (s in range(4)) {
	object Host s {
		check_command = "dummy"
		vars.dummy_state = 0
	}
}

for (i in range(1000)) {
	apply Service i {
		check_command = "dummy"
		vars.dummy_state = host.name
		assign where true
	}
}

@Al2Klimov Al2Klimov assigned Al2Klimov and unassigned Al2Klimov Apr 18, 2023
Commit message excerpt (truncated): the last state change could be a long time ago. If it's longer than the new downtime's duration, the downtime expires immediately:

trigger time + duration < now
@julianbrost
Contributor

I was also able to reproduce the incorrect trigger time on the current master. Basically, run the same test against master and then filter for downtimes with a too-early trigger time:

$ curl -sSku root:icinga https://localhost:5665/v1/objects/downtimes -X GET --json '{"filter":"downtime.trigger_time < downtime.start_time", "pretty": true}'
[...]
        {
            "attrs": {
                "__name": "x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067",
                "active": true,
                "author": "icingaadmin",
                "authoritative_zone": "",
                "comment": "flexible",
                "config_owner": "",
                "config_owner_hash": "",
                "duration": 60,
                "end_time": 1682322976,
                "entry_time": 1682322676.989451,
                "fixed": false,
                "ha_mode": 0,
                "host_name": "x-dummy-6",
                "legacy_id": 513,
                "name": "05d098ea-5e79-4185-94d2-f38cdc1cf067",
                "original_attributes": null,
                "package": "_api",
                "parent": "",
                "paused": false,
                "remove_time": 0,
                "scheduled_by": "",
                "service_name": "random-3",
                "source_location": {
                    "first_column": 0,
                    "first_line": 1,
                    "last_column": 69,
                    "last_line": 1,
                    "path": "/var/lib/icinga2/api/packages/_api/77e78991-06d4-4589-aa0c-ee60e1d7724e/conf.d/downtimes/x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067.conf"
                },
                "start_time": 1682322676,
                "templates": [
                    "05d098ea-5e79-4185-94d2-f38cdc1cf067"
                ],
                "trigger_time": 1682322146.121111,
                "triggered_by": "",
                "triggers": [],
                "type": "Downtime",
                "version": 1682322676.989506,
                "was_cancelled": false,
                "zone": "master"
            },
            "joins": {},
            "meta": {},
            "name": "x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067",
            "type": "Downtime"
        }
[...]

entry_time:

$ date -d @1682322676.989451
Mon 24 Apr 2023 09:51:16 CEST

start_time:

$ date -d @1682322676
Mon 24 Apr 2023 09:51:16 CEST

trigger_time:

$ date -d @1682322146.121111
Mon 24 Apr 2023 09:42:26 CEST
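
Putting those timestamps together: trigger_time (09:42:26) plus the 60 s duration is 09:43:26, roughly eight minutes before entry_time/start_time (09:51:16), so the "trigger time + duration < now" condition quoted above already holds when the downtime is created, and it expires immediately.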

Labels
area/notifications Notification events bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP