
Remove -and notify- expired downtimes immediately, not every 60s II #9726

Merged (14 commits) May 2, 2023

Conversation


@Al2Klimov Al2Klimov commented Mar 21, 2023

closes #9614

@Al2Klimov Al2Klimov self-assigned this Mar 21, 2023
@cla-bot cla-bot bot added the cla/signed label Mar 21, 2023
@Al2Klimov Al2Klimov changed the title WIP Remove -and notify- expired downtimes immediately, not every 60s II Mar 21, 2023
@Al2Klimov Al2Klimov removed their assignment Mar 21, 2023
@Al2Klimov
Member Author

Short story

ref/IP/43624

Before

(Screenshot 2022-12-14 at 15:31:47)

After

(Screenshot 2023-03-21 at 16:13:55)

Medium story

Don't look for expired downtimes in a timer that fires every 60s; instead, fire one timer per downtime, once, at its expiry time.

Long story

  • 310dea2 is your friend: whichever node is responsible for a Downtime runs a timer at its expiry time which removes the Downtime. IMAO pretty clear.
  • As the above changed what the existing timer does, 8502fab changes its name. Dead simple.
  • As I introduced non-periodic timers, 907050d makes sure system clock jumps don't break them – see Remove -and notify- expired downtimes immediately, not every 60s #9614 (review).
  • Now we have a problem: the Downtime owns its destruction timer, i.e. the timer destroys the Downtime, which destroys the timer, which waits for the timer run to complete – and that deadlocks – see Remove -and notify- expired downtimes immediately, not every 60s #9614 (comment). My solution is to keep the timer alive during its run via e78fd53: the additional keepalive pointer outlives both the main one in the Downtime and the whole Timer#Call(), so the timer is destroyed in the thread pool which ran it (see the sketch after this list).
  • The keepalive pointer has to come from somewhere; I chose weak_ptr. This required switching from intrusive pointers to shared ones via c749d38.
  • That approach required a factory method (5df7cd0), used everywhere (06eaef7) as the only way of construction (e6170a3), because a weak_ptr requires an existing shared_ptr, which requires a fully constructed object.
  • bcd7342 avoids redundant locking via RAII from the new continue; code path.
  • a0a0d61 prevents intrusive pointer usage.
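
A minimal sketch of the keep-alive idea, not the actual Icinga 2 code: Timer::Create and Timer#Call mirror the names mentioned above, while the use of enable_shared_from_this here is an assumption (the real keepalive pointer comes from a weak_ptr). The point is that dropping the last external reference inside the callback only destroys the timer after Call() returns, on the thread-pool thread that ran it:

#include <functional>
#include <memory>

class Timer : public std::enable_shared_from_this<Timer>
{
public:
	// Factory method: obtaining a weak_ptr/shared_from_this() requires an
	// existing shared_ptr, which requires a fully constructed object, hence
	// no public constructor.
	static std::shared_ptr<Timer> Create()
	{
		return std::shared_ptr<Timer>(new Timer());
	}

	void Call(const std::function<void()>& callback)
	{
		// Keep-alive reference for the duration of the callback. Even if the
		// callback drops the last external reference (e.g. by removing the
		// Downtime that owns this timer), destruction happens only after
		// Call() returns, in the thread that ran the callback.
		auto keepAlive = shared_from_this();

		callback();
	}

private:
	Timer() = default;
};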

@Al2Klimov Al2Klimov marked this pull request as ready for review March 21, 2023 15:54
@Al2Klimov Al2Klimov added bug Something isn't working area/notifications Notification events ref/IP labels Mar 21, 2023
@Al2Klimov Al2Klimov added this to the 2.14.0 milestone Mar 21, 2023
Contributor

@julianbrost julianbrost left a comment


Already looks way better in general compared to #9614 👍

Resolved review threads: lib/base/timer.cpp (×2), lib/icinga/downtime.cpp (×2, one outdated).
@Al2Klimov
Member Author

Al2Klimov commented Apr 3, 2023

TODO @Al2Klimov

  • rebase
  • test

Commit message excerpts (truncated):

  • git ls-files -z |xargs -0 perl -pi -e 's/\bnew Timer\b/Timer::Create/g'
  • ex. in Timer::Create() itself.
  • to prevent the case: Timer callback destroys parent object -> destroys Timer -> ~Timer() -> Stop(true) -> waits for the Timer callback to finish -> deadlock.
  • via new Timer#InternalRescheduleUnlocked()
  • so that only the first one changes l_AliveTimers (as in Timer#Stop()).
Contributor

@julianbrost julianbrost left a comment


Somewhat bad news: given the changes to Timer, more places would now have to explicitly use Stop(true), for example (m_StatsTimer isn't used elsewhere):

m_StatsTimer = Timer::Create();
m_StatsTimer->SetInterval(1);
m_StatsTimer->OnTimerExpired.connect([this](const Timer * const&) { PublishStatsTimerHandler(); });
m_StatsTimer->Start();

Relying on when an object managed by a shared_ptr/intrusive_ptr is destroyed already looked quite fragile before (i.e. the point of these types is that there is no single owning reference to the object, yet Timer::Ptr was used like that in many places). We could make that more explicit with something like a scoped timer class that wraps a Timer::Ptr and in its destructor calls Stop(true) on that pointer.
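
A hypothetical sketch of that suggestion (the class name and details are illustrative, not part of this PR; Timer::Ptr and Stop(true) are the existing Icinga 2 API): a non-copyable owner whose destructor stops the timer and waits for a running callback, so raw pointers captured by the callback, such as this, cannot be used after the owner is gone:

#include "base/timer.hpp"

#include <utility>

class ScopedTimer
{
public:
	explicit ScopedTimer(icinga::Timer::Ptr timer) : m_Timer(std::move(timer)) { }

	// Single owner: copying would blur who stops and waits for the timer.
	ScopedTimer(const ScopedTimer&) = delete;
	ScopedTimer& operator=(const ScopedTimer&) = delete;

	~ScopedTimer()
	{
		if (m_Timer)
			m_Timer->Stop(true); // stop and wait for any running callback
	}

	const icinga::Timer::Ptr& operator->() const { return m_Timer; }

private:
	icinga::Timer::Ptr m_Timer;
};

With such a wrapper, m_StatsTimer above could become a ScopedTimer member, and the IcingaDB destructor would then implicitly wait for PublishStatsTimerHandler() to finish.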

@Al2Klimov
Member Author

What exactly did my changes cause here?

@julianbrost
Contributor

The callback for m_StatsTimer captures this.

If I haven't overlooked anything:

  • Without this PR: when deleting an IcingaDB object, that m_StatsTimer member should have been the last reference to that Timer, so when that Timer::Ptr is destroyed, it should have called ~Timer(), which waited for any running callback to finish, which ensures that this is not accessed afterwards.
  • With this PR: a running callback extends the lifespan of the Timer object and m_StatsTimer might be only the second-to-last reference, so the deletion of the IcingaDB might finish while the callback is still running on this (see the toy example below).
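
A toy illustration of the two cases using plain std::shared_ptr (names made up, not Icinga 2 code): once a running callback holds an extra reference, resetting the m_StatsTimer-like member no longer destroys the timer, so the "wait in ~Timer()" guarantee no longer protects this:

#include <iostream>
#include <memory>

struct Timer { ~Timer() { std::cout << "Timer destroyed\n"; } };

int main()
{
	auto statsTimer = std::make_shared<Timer>();

	// Without this PR: statsTimer is the last reference, so resetting it
	// destroys the Timer (and the real ~Timer() waits for running callbacks).
	std::cout << "refs: " << statsTimer.use_count() << "\n"; // refs: 1

	// With this PR: a running callback holds a keep-alive reference.
	auto keepAlive = statsTimer;

	statsTimer.reset(); // the "owner" is gone, but the Timer still lives
	std::cout << "Timer still alive: " << std::boolalpha
		<< static_cast<bool>(keepAlive) << "\n"; // true

	return 0;
} // "Timer destroyed" is printed only here, when keepAlive goes out of scope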

@Al2Klimov
Member Author

Good catch. What about a flag instead which explicitly enables the behavior I introduced? In this case just for Downtime cleanup?

@julianbrost
Contributor

I'm not sure what that would look like exactly, but it sounds like a flag I might want to remove immediately afterwards.

The problem with the existing use of Timer::Ptr is that the type itself models shared ownership of the timer, but sometimes it's used as if it were unique ownership, and there is really nothing that prevents you from accidentally breaking this assumption (basically, that's what this PR did in its current state). So I like the idea of having an explicit owning reference type instead (no refcounting; just delete this object and the timer will be stopped and waited for, always).

@Al2Klimov
Member Author

Please cross-check.

Contributor

@julianbrost julianbrost left a comment


In tests, I've noticed that sometimes (I haven't noticed a pattern, and it doesn't happen very often) downtimes are being deleted immediately and the schedule-downtime action even fails with an internal server error:

[2023-04-14 14:10:23 +0200] information/ApiListener: New client connection from [::ffff:172.20.0.1]:46650 (no client certificate)
[2023-04-14 14:10:23 +0200] information/Downtime: Triggering downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' for checkable 'x-dummy-1!random-1'. 
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Created and activated object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/ConfigObjectUtility: Deleted object 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' of type 'Downtime'.
[2023-04-14 14:10:23 +0200] information/Downtime: Removed downtime 'x-dummy-1!random-1!15175d57-46f6-4ff5-9d91-41f6aaadb81c' from checkable 'x-dummy-1!random-1' (Reason: expired at 2023-04-14 14:15:23 +0200).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: Request: POST /v1/actions/schedule-downtime (from [::ffff:172.20.0.1]:46650), user: root, agent: curl/8.0.1, status: Internal Server Error).
[2023-04-14 14:10:23 +0200] information/HttpServerConnection: HTTP client disconnected (from [::ffff:172.20.0.1]:46650)

Also note how "expired at 2023-04-14 14:15:23 +0200" is a time 5 minutes after the timestamp of that log message.

I was able to reproduce this with a loop repeatedly scheduling a flexible downtime with a window beginning now, ending in 5 minutes, with a duration of 1 minute:

while curl --fail -isSku root:icinga https://localhost:5665/v1/actions/schedule-downtime --json '{ "type": "Service", "service": "x-dummy-1!random-1", "start_time": '$(date +%s)', "end_time": '$(date -d 5min +%s)', "author": "icingaadmin", "comment": "flexible", "fixed": false, "duration": 60, "pretty": true }'; do :; done

HTTP response body showed this error message:

BOOST_THROW_EXCEPTION(std::runtime_error("Could not create downtime object."));

Commit message excerpts (truncated):

  • before permitting their parent objects' destruction.
  • For the cases where the handlers have raw pointers to these objects.
  • Don't look for expired downtimes in a timer fired every 60s, but fire one timer per downtime once at expire time.
  • to actually reflect its purpose.
Contributor

@julianbrost julianbrost left a comment


I can still reproduce #9726 (review) with the updated version.

@Al2Klimov
Member Author

Yep.

include <itl>

object CheckerComponent "cc" { }

object ApiListener "api" { }

object ApiUser "root" {
	password = "123456"
	permissions = [ "*" ]
}

object Endpoint NodeName { }

object Zone NodeName {
	endpoints = [ NodeName ]
}

for (s in range(4)) {
	object Host s {
		check_command = "dummy"
		vars.dummy_state = 0
	}
}

for (i in range(1000)) {
	apply Service i {
		check_command = "dummy"
		vars.dummy_state = host.name
		assign where true
	}
}

@Al2Klimov Al2Klimov assigned Al2Klimov and unassigned Al2Klimov Apr 18, 2023
Commit message excerpt (truncated): the last state change could be a long time ago. If it's longer than the new downtime's duration, the downtime expires immediately:

trigger time + duration < now
@julianbrost
Contributor

I was also able to reproduce the incorrect trigger time on the current master. Basically, run the same test against master and then filter for downtimes with a too-early trigger time:

$ curl -sSku root:icinga https://localhost:5665/v1/objects/downtimes -X GET --json '{"filter":"downtime.trigger_time < downtime.start_time", "pretty": true}'
[...]
        {
            "attrs": {
                "__name": "x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067",
                "active": true,
                "author": "icingaadmin",
                "authoritative_zone": "",
                "comment": "flexible",
                "config_owner": "",
                "config_owner_hash": "",
                "duration": 60,
                "end_time": 1682322976,
                "entry_time": 1682322676.989451,
                "fixed": false,
                "ha_mode": 0,
                "host_name": "x-dummy-6",
                "legacy_id": 513,
                "name": "05d098ea-5e79-4185-94d2-f38cdc1cf067",
                "original_attributes": null,
                "package": "_api",
                "parent": "",
                "paused": false,
                "remove_time": 0,
                "scheduled_by": "",
                "service_name": "random-3",
                "source_location": {
                    "first_column": 0,
                    "first_line": 1,
                    "last_column": 69,
                    "last_line": 1,
                    "path": "/var/lib/icinga2/api/packages/_api/77e78991-06d4-4589-aa0c-ee60e1d7724e/conf.d/downtimes/x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067.conf"
                },
                "start_time": 1682322676,
                "templates": [
                    "05d098ea-5e79-4185-94d2-f38cdc1cf067"
                ],
                "trigger_time": 1682322146.121111,
                "triggered_by": "",
                "triggers": [],
                "type": "Downtime",
                "version": 1682322676.989506,
                "was_cancelled": false,
                "zone": "master"
            },
            "joins": {},
            "meta": {},
            "name": "x-dummy-6!random-3!05d098ea-5e79-4185-94d2-f38cdc1cf067",
            "type": "Downtime"
        }
[...]

entry_time:

$ date -d @1682322676.989451
Mon 24 Apr 2023 09:51:16 CEST

start_time:

$ date -d @1682322676
Mon 24 Apr 2023 09:51:16 CEST

trigger_time:

$ date -d @1682322146.121111
Mon 24 Apr 2023 09:42:26 CEST
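
Putting those timestamps together: trigger_time (09:42:26) plus the 60 s duration is 09:43:26, roughly eight minutes before entry_time/start_time (09:51:16), so the "trigger time + duration < now" condition quoted above already holds when the downtime is created, and it expires immediately.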

Labels
area/notifications Notification events bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP