
Disallow config modifications via API during reload #9445

Merged (3 commits, Apr 13, 2023)
Conversation

@Al2Klimov (Member) commented Jul 15, 2022

Once the new main process has read the config,
it misses subsequent modifications from the old process otherwise.

fixes #9365
fixes #9366
fixes #9638

@Al2Klimov Al2Klimov added this to the 2.14.0 milestone Jul 15, 2022
@cla-bot cla-bot bot added the cla/signed label Jul 15, 2022
@icinga-probot icinga-probot bot added area/api REST API area/checks Check execution and results area/configuration DSL, parser, compiler, error handling bug Something isn't working labels Jul 15, 2022
@Al2Klimov (Member, Author)

Tests

Disable the checker. Add enough hosts so that committing items during reload takes long enough (run with -x debug) for you to trigger a 503 (where you would otherwise get a 200):

{"results":[{"code":503,"errors":["Icinga is reloading"],"status":"Object could not be created."}]}

@julianbrost (Contributor)

Once the new main process has read the config

Not only once it's done doing that: as soon as it starts reading the config, it may miss new files that (dis)appear due to changes by the old worker. But the implementation looks like it blocks modifications before the new worker process is started, so that seems fine.

Disable checker.

Why is this relevant/necessary here?

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

Also, to me this feels similar to some already existing functionality: after all, this is used to communicate state between the processes, like it's already done with the SIGUSR2-based signaling. So it would be nice to have a common mechanism for all of this.

And there's also some general issue remaining: you only do this for the /v1/objects API, but there are more things writing config: some /v1/actions (add/remove comment/downtime) as well as internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes.

@Al2Klimov (Member, Author)

Al2Klimov commented Jul 18, 2022

Why is this relevant/necessary here?

Not to blow up your WS.

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

It's guaranteed to synchronise threads. Under Linux a thread is just a process sharing its whole memory with other processes. So: yes.

Also, to me this feels like similar to some already existing functionality: after all, this is used to communicate some state between the processes, like it's already done with the SIGUSR2 based signaling. So it would be nice to have a common mechanism for all of this.

As a template, Ipc should be generic enough to be the common mechanism for not-yet-merged stuff, too. (Btw., shall I simplify #8429 down to Ipc<std::atomic<double>>?)

And there's also some general issue remaining: you only do this for the /v1/objects API, but there are more things writing config: some /v1/actions (add/remove comment/downtime) as well as internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes.

Let’s get back to this once we’ve agreed upon the implementation in general.

@julianbrost (Contributor)

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

It's guaranteed to synchronise threads. Under Linux a thread is just a process sharing its whole memory with other processes. So: yes.

It's not that simple. For example, if it uses a pthread mutex internally, these have to be configured specifically to work over shared memory, see pthread_mutexattr_setpshared.

@Al2Klimov (Member, Author)

OK, shall I just use that type instead?

@Al2Klimov (Member, Author)

Wait!

Mutexes created with this attributes object are to be shared only among threads in the same process that initialized the mutex. This is the default value for the process-shared mutex attribute.

Why did my test succeed, then?

@julianbrost (Contributor)

Why did my test succeed, then?

Does boost::shared_mutex actually use pthreads on your system? And even then, things might be implemented slightly differently between systems and versions and don't have to look obviously broken either. I just brought that up as an example that things aren't as simple as just putting stuff in shared memory.

OK, shall I just use that type instead?

Based on the documentation, that sounds more suitable for this particular use case.

@Al2Klimov (Member, Author)

internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes

Fortunately we won’t lose anything here.

@julianbrost (Contributor)

internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes

Fortunately we won’t lose anything here.

I think this could possibly lead to duplicate notifications/history though. But probably still better than continuing checks but not ending downtimes for example (that would probably be the alternative).

@julianbrost (Contributor) left a comment

Also, to me this feels like similar to some already existing functionality: after all, this is used to communicate some state between the processes, like it's already done with the SIGUSR2 based signaling. So it would be nice to have a common mechanism for all of this.

As a template Ipc should be generic enough for being the common mechanism for not yet merged stuff. (Btw. shall I simplify #8429 down to Ipc<std::atomic<double>>?)

My comment wasn't about other PRs, so please don't do anything there yet.

The thing is, in the current state of this PR, the reload works in a way that once the new worker has finished loading the configuration, one of the following happens:

  1. The config was loaded successfully, the new worker is told to continue by SIGUSR2 (and the old one is killed).
  2. The config failed to load, the old worker is told to continue by releasing a mutex (and the new worker already exited by itself).

Basically that's the same notification to either one of the workers, just using totally different mechanisms. I'd rather try to use a common mechanism for all signaling between umbrella and workers.

We should first go back a step and think about how we want things to work in the end, possible options:

  1. Keep using signals. Maybe some process-local state in the old worker + sending back a SIGUSR2 in case the old worker should continue would suffice. I do understand that signals are annoying and at some point there's a limit how far you can get with them.
  2. Do all the state synchronization between umbrella and worker processes using shared memory. Would require some more reworking, but also a chance to get rid of complex signal handling code.
  3. Something completely different, first possible alternatives that come to my mind would be (anonymous) pipes or sockets.

lib/base/ipc.hpp (outdated thread, resolved)
lib/base/ipc.hpp (outdated), comment on lines 32 to 35:
inline ~Ipc()
{
	GetAddress()->~T();
}
julianbrost (Contributor):

This looks incredibly dangerous. After all, the point of this class is to share memory between processes. With this, when one process exits, it will destruct the object here even though it might still be in use by other processes.

Due to the overall nature of this class, it might be a good idea to restrict it to trivially copyable and destructible. Trivially copyable as that basically states you can copy the object using memcpy, which could serve as an approximation to what happens by sharing it into a second process by memory.

Al2Klimov (Member, Author):

Fortunately the given mutex type is both trivially copyable and destructible. /s

julianbrost (Contributor):

Can this be put into a static_assert?

Al2Klimov (Member, Author):

Fortunately the given mutex type is both trivially copyable and destructible. /s

It's not! /s stands for sarcasm.

@Al2Klimov (Member, Author)

I'd rather try to use a common mechanism for all signaling between umbrella and workers.

We should first go back a step and think about how we want things to work in the end

No, I don’t think so. The concept behind this PR is just fine. At the same time, we have no reason to wake sleeping dogs by touching the already existing, working and well-tested IPC mechanisms such as signals or pipe FD passing between the main process and the spawner. Just as we had no reason to rewrite every var type to auto along with writing the very first auto var.

@Al2Klimov (Member, Author)

This is already done for deploys:

if (m_RunningPackageUpdates.exchange(true)) {
	return HttpUtility::SendJsonError(response, params, 423,
		"Conflicting request, there is already an ongoing package update in progress. Please try it again later.");
}

No reason not to do it here as well. Oh, even better: there is a reason to do it – the same one #9267 was done for. And backported.

@julianbrost (Contributor) left a comment

  1. Keep using signals. Maybe some process-local state in the old worker + sending back a SIGUSR2 in case the old worker should continue would suffice. I do understand that signals are annoying and at some point there's a limit how far you can get with them.
  2. Do all the state synchronization between umbrella and worker processes using shared memory. Would require some more reworking, but also a chance to get rid of complex signal handling code.
  3. Something completely different, first possible alternatives that come to my mind would be (anonymous) pipes or sockets.

Coming back to these ideas:

  • (1) That would mean the umbrella has to signal the worker that it wants to reload; the worker would then have to block new API config changes, wait for the already running ones, and then signal back to the umbrella, and so on. That would probably get quite messy using just signals.
  • (3) Would basically have to do the same as (1), but instead of using signals, it would communicate using less annoying means. But in the end, it would probably amount to implementing a mutex on top of message-based communication.

So that leaves us with (2), which this PR might be a first step of. As much as I hate it, this approach might be the most pragmatic one here. In general, it would have been easier if this config inconsistency issue came up when initially implementing that reload mechanism rather than having to do something about this years later.

What might the next steps be? For #8429, instead of creating a completely separate memory allocation, I'm thinking of extending this allocation into a bigger one for all the umbrella-worker communication so that it's in one place. Further steps could then be expanding it even further and replace some signals with it. But that shouldn't affect this PR, let's finish it first, and then we'll see how to extend in the next one.

lib/remote/configobjectslock.cpp (outdated thread, resolved)
@julianbrost (Contributor)

Oh, and what about Windows? I think the exclusive lock shouldn't exist there at all (it's only used in the non-Windows reload code) and the shared lock should be a no-op (if there can't be an exclusive lock, you can basically have infinite shared ones).

lib/remote/configobjectslock.hpp (outdated thread, resolved)
lib/remote/deleteobjecthandler.cpp (outdated thread, resolved)