
Disallow config modifications via API during reload #9445

Merged (3 commits, Apr 13, 2023)
Conversation

@Al2Klimov (Member) commented Jul 15, 2022

Once the new main process has read the config,
it misses subsequent modifications from the old process otherwise.

fixes #9365
fixes #9366
fixes #9638

@Al2Klimov Al2Klimov added this to the 2.14.0 milestone Jul 15, 2022
@cla-bot cla-bot bot added the cla/signed label Jul 15, 2022
@icinga-probot icinga-probot bot added area/api REST API area/checks Check execution and results area/configuration DSL, parser, compiler, error handling bug Something isn't working labels Jul 15, 2022
@Al2Klimov (Member, Author)

Tests

Disable the checker. Add enough hosts so that committing items during reload takes long enough (run with -x debug) for you to trigger a 503 (where you would otherwise get a 200):

{"results":[{"code":503,"errors":["Icinga is reloading"],"status":"Object could not be created."}]}

@julianbrost (Contributor)

Once the new main process has read the config

Not only once it's done doing that: as soon as it starts reading the config, it may miss new files that (dis)appear due to changes by the old worker. But the implementation looks like it blocks modifications before the new worker process is started, so that seems fine.

Disable checker.

Why is this relevant/necessary here?

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

Also, to me this feels similar to some already existing functionality: after all, this is used to communicate state between the processes, like it's already done with the SIGUSR2-based signaling. So it would be nice to have a common mechanism for all of this.

And there's also some general issue remaining: you only do this for the /v1/objects API, but there are more things writing config: some /v1/actions (add/remove comment/downtime) as well as internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes.

@Al2Klimov (Member, Author)

Al2Klimov commented Jul 18, 2022

Why is this relevant/necessary here?

Not to blow up your WS.

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

It's guaranteed to synchronise threads. Under Linux a thread is just a process sharing its whole memory with other processes. So: yes.

Also, to me this feels like similar to some already existing functionality: after all, this is used to communicate some state between the processes, like it's already done with the SIGUSR2 based signaling. So it would be nice to have a common mechanism for all of this.

As a template, Ipc should be generic enough to be the common mechanism for not-yet-merged stuff, too. (Btw., shall I simplify #8429 down to Ipc<std::atomic<double>>?)

And there's also some general issue remaining: you only do this for the /v1/objects API, but there are more things writing config: some /v1/actions (add/remove comment/downtime) as well as internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes.

Let’s get back to this once we’ve agreed upon the implementation in general.

@julianbrost (Contributor)

Is boost::shared_mutex guaranteed to be safe to be used over shared memory?

It's guaranteed to synchronise threads. Under Linux a thread is just a process sharing its whole memory with other processes. So: yes.

It's not that simple. For example, if it uses a pthread mutex internally, these have to be configured specifically to work over shared memory, see pthread_mutexattr_setpshared.

@Al2Klimov (Member, Author)

OK, shall I just use that type instead?

@Al2Klimov (Member, Author)

Wait!

Mutexes created with this attributes object are to be shared only among threads in the same process that initialized the mutex. This is the default value for the process-shared mutex attribute.

Why did my test succeed, then?

@julianbrost (Contributor)

Why did my test succeed, then?

Does boost::shared_mutex actually use pthreads on your system? And even then, things might be implemented slightly differently between systems and versions and don't have to look obviously broken either. I just brought that up as an example that things aren't as simple as just putting stuff in shared memory.

OK, shall I just use that type instead?

Based on the documentation, that sounds more suitable for this particular use case.

@Al2Klimov (Member, Author)

internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes

Fortunately we won’t lose anything here.

@julianbrost (Contributor)

internal functionality like scheduled downtimes creating the actual downtime object or expiry of comments/downtimes

Fortunately we won’t lose anything here.

I think this could possibly lead to duplicate notifications/history though. But probably still better than continuing checks but not ending downtimes for example (that would probably be the alternative).

@julianbrost (Contributor) left a comment

Also, to me this feels like similar to some already existing functionality: after all, this is used to communicate some state between the processes, like it's already done with the SIGUSR2 based signaling. So it would be nice to have a common mechanism for all of this.

As a template Ipc should be generic enough for being the common mechanism for not yet merged stuff. (Btw. shall I simplify #8429 down to Ipc<std::atomic<double>>?)

My comment wasn't about other PRs, so please don't do anything there yet.

The thing is, in the current state of this PR, the reload works in a way that once the new worker has finished loading the configuration, one of the following happens:

  1. The config was loaded successfully, the new worker is told to continue by SIGUSR2 (and the old one is killed).
  2. The config failed to load, the old worker is told to continue by releasing a mutex (and the new worker already exited by itself).

Basically that's the same notification to either one of the workers, just using totally different mechanisms. I'd rather try to use a common mechanism for all signaling between umbrella and workers.

We should first go back a step and think about how we want things to work in the end, possible options:

  1. Keep using signals. Maybe some process-local state in the old worker + sending back a SIGUSR2 in case the old worker should continue would suffice. I do understand that signals are annoying and at some point there's a limit how far you can get with them.
  2. Do all the state synchronization between umbrella and worker processes using shared memory. Would require some more reworking, but also a chance to get rid of complex signal handling code.
  3. Something completely different, first possible alternatives that come to my mind would be (anonymous) pipes or sockets.

lib/base/ipc.hpp (outdated thread, resolved)
lib/base/ipc.hpp (outdated), comment on lines 32 to 35:
inline ~Ipc()
{
	GetAddress()->~T();
}
julianbrost (Contributor):

This looks incredibly dangerous. After all, the point of this class is to share memory between processes. With this, when one process exits, it will destruct the object here even though it might still be in use by other processes.

Due to the overall nature of this class, it might be a good idea to restrict it to trivially copyable and destructible. Trivially copyable as that basically states you can copy the object using memcpy, which could serve as an approximation to what happens by sharing it into a second process by memory.

Al2Klimov (Member, Author):

Fortunately the given mutex type is both trivially copyable and destructible. /s

julianbrost (Contributor):

Can this be put into a static_assert?

Al2Klimov (Member, Author):

Fortunately the given mutex type is both trivially copyable and destructible. /s

It's not! /s stands for sarcasm.

@Al2Klimov (Member, Author)

I'd rather try to use a common mechanism for all signaling between umbrella and workers.

We should first go back a step and think about how we want things to work in the end

No, I don’t think so. The concept behind this PR is just fine. At the same time, we have no reason to wake sleeping dogs by touching the already existing, working and well-tested IPC mechanisms such as signals or pipe FD passing between the main process and the spawner. Just as we had no reason to rewrite every var type to auto along with writing the very first auto var.

@Al2Klimov (Member, Author)

This is already done for deploys:

if (m_RunningPackageUpdates.exchange(true)) {
	return HttpUtility::SendJsonError(response, params, 423,
		"Conflicting request, there is already an ongoing package update in progress. Please try it again later.");
}

No reason not to do it here as well. Oh, even better: there is a reason to do it – the same one #9267 was done for. And backported.

@julianbrost (Contributor) left a comment

  1. Keep using signals. Maybe some process-local state in the old worker + sending back a SIGUSR2 in case the old worker should continue would suffice. I do understand that signals are annoying and at some point there's a limit how far you can get with them.
  2. Do all the state synchronization between umbrella and worker processes using shared memory. Would require some more reworking, but also a chance to get rid of complex signal handling code.
  3. Something completely different, first possible alternatives that come to my mind would be (anonymous) pipes or sockets.

Coming back to these ideas:

  • (1) That would mean the umbrella has to signal the worker that it wants to reload; the worker would then have to block new API config changes, wait for the already running ones, and then signal back to the umbrella, and so on. That would probably get quite messy using just signals.
  • (3) Would basically have to do the same as (1), but instead of using signals, it would communicate using less annoying means. But in the end, it would probably amount to implementing a mutex on top of message-based communication.

So that leaves us with (2), which this PR might be a first step of. As much as I hate it, this approach might be the most pragmatic one here. In general, it would have been easier if this config inconsistency issue came up when initially implementing that reload mechanism rather than having to do something about this years later.

What might the next steps be? For #8429, instead of creating a completely separate memory allocation, I'm thinking of extending this allocation into a bigger one for all the umbrella-worker communication so that it's in one place. Further steps could then be expanding it even further and replace some signals with it. But that shouldn't affect this PR, let's finish it first, and then we'll see how to extend in the next one.

lib/remote/configobjectslock.cpp (outdated thread, resolved)
@julianbrost (Contributor)

Oh, and what about Windows? I think the exclusive lock shouldn't exist there at all (it's only used in the non-Windows reload code) and the shared lock should be a no-op (if there can't be an exclusive lock, you can basically have infinite shared ones).

lib/remote/configobjectslock.hpp (outdated thread, resolved)
lib/remote/deleteobjecthandler.cpp (outdated thread, resolved)