Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: adds affinity and other scheduling to the operator #29977

Merged
merged 1 commit into from
Jul 3, 2024

Conversation

shawkins
Copy link
Contributor

closes: #29258

Draft of a scheduling spec with the options that are most likely to be requested - affinity, topologyspread, priorityclass, and tolerations.

The handling is similar to other constructs - if the user has populated the field in the unsupported spec, that takes presedence.

A new label, app.kubernetes.io/component, is added to distinguish the server pods.

The default node anti-affinity is added if no other affinity has been specified. This mechanism may not have great usability. It could be better to have a withDefaults field on the scheduling spec.

Similarly I'm proposing to go a step further and handle the default affinity or spread - depending upon whether an external cache is in use. Issues with doing this:

  • cache configuration is currently done though additional properties. However we do have a cache spec where we could have first-class properties.
  • the cache configuration could be baked into the image, which we currently don't know how to account for. At worst the user would need to supply their own spread constraint(s).

@ahus1 @vmuzikar @abelmatos WDYT?

Copy link

@keycloak-github-bot keycloak-github-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreported flaky test detected, please review

@keycloak-github-bot
Copy link

Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.broker.KcOidcBrokerTest#testPostBrokerLoginFlowWithOTP_bruteForceEnabled

Keycloak CI - Java Distribution IT (windows-latest - temurin - 19)

java.lang.AssertionError: expected:<Invalid authenticator code.> but was:<null>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:120)
	at org.junit.Assert.assertEquals(Assert.java:146)
...

Report flaky test

@ahus1
Copy link
Contributor

ahus1 commented Jun 3, 2024

Thank you for this PR. What I'm reading here is that you're about to expose the Kubernetes constructs as is. When doing this, we'll offer a non-opinionated way to schedule Pods, which is IMHO ok as we don't seem to have an opinionated and agreed-upon solution.

Eventually, more coarse grained options might emerge as we learn about how this can be used in different environments. I'd say even then exposing this as a supported feature makes sense as we won't be able to foresee any possible scenario.

Even if the deployment uses an external Infinispan, there are still the embedded Infinispans for now, and therefore I'd advise the Operator not to take any caching configuration into account - at least for now. @pruivo is working an a PR which is planned for KC26 which would allow running Keycloak without its embedded distributed caches, but this is still some time in the future.

So for now, I'd say

  • have a default node anti-affinity,
  • ignore the caching options, and
  • expose the scheduling as is.

cc: @ryanemerson

@shawkins
Copy link
Contributor Author

shawkins commented Jun 3, 2024

@ahus1 I'm fine with removing the defaults around spread constraints and moving that to a topic for the scaling guide.

@ahus1
Copy link
Contributor

ahus1 commented Jun 3, 2024

I'm fine with removing the defaults around spread constraints and moving that to a topic for the scaling guide.

No, please add a default best-effort spread. I'm just against making the default settings dependent on the cache configuration.

@shawkins
Copy link
Contributor Author

shawkins commented Jun 3, 2024

No, please add a default best-effort spread. I'm just against making the default settings dependent on the cache configuration.

The default spread is just an affinity for the same zone. The pr has been updated with that.

Should just need some docs.

@vmuzikar WDYT?

Copy link

@keycloak-github-bot keycloak-github-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreported flaky test detected, please review

@keycloak-github-bot
Copy link

Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.oauth.TokenIntrospectionTest#testUnsupportedToken

Keycloak CI - Base IT (6)

java.lang.NullPointerException: Cannot invoke "com.fasterxml.jackson.databind.JsonNode.asBoolean()" because the return value of "com.fasterxml.jackson.databind.JsonNode.get(String)" is null
	at org.keycloak.testsuite.oauth.TokenIntrospectionTest.testUnsupportedToken(TokenIntrospectionTest.java:313)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
...

Report flaky test

Copy link
Contributor

@vmuzikar vmuzikar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawkins Thank you for the changes!

@shawkins
Copy link
Contributor Author

@vmuzikar added the docs, should be fully ready for review now.

Copy link
Contributor

@mabartos mabartos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawkins Great work! I've just put a little brain dump below. And there's also some changes conflict.

nit: I also like when we have test-covered the CR serialization as they're running a few ms and may bring additional test coverage - but in this case the PodTemplate test should be sufficient enough.

docs/guides/operator/advanced-configuration.adoc Outdated Show resolved Hide resolved
@@ -204,6 +204,25 @@ It is achieved by providing certain JVM options.

For more details, see <@links.server id="containers" />.

=== Scheduling

You may control several aspects of the server Pod scheduling via the Keycloak CR. The scheduling stanza exposes the standard Kubernetes affinity, tolerations, topology spread constraints, and the priority class name to fine tune the scheduling and placement of your server Pods. By default if you do not specify a custom affinity your Pods will have an affinity for the same zone to prevent stretch clusters and an anti-affinity for the same node to improve availability.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be good for users to also briefly explain why we want to prevent the stretched clusters? When we support the affinity configuration, we should mention the cons of the different AZ and the embedded Infinispan, correct? We will avoid some future confusion when users use different affinity settings.

@shawkins WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a little more on this. I'd defer to @ryanemerson if we want a more detailed explanation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we have for now is fine. I think we need a dedicated guide explaining different architectures, stretch vs non-stretch etc, as it's a complex topic.

@shawkins
Copy link
Contributor Author

@ryanemerson @ahus1 @vmuzikar in the docs should we call out specifically recommending an affinity for the same zone as the database?

@ahus1
Copy link
Contributor

ahus1 commented Jun 28, 2024

@shawkins

should we call out specifically recommending an affinity for the same zone as the database?

The good thing about pod affinity is that it is now (1) supported by the Operator and no longer part of the unsupported pod template and (2) we have a reasonable default.

Deploying Keycloak on the same region as the database seems to be a good idea, still I didn't tested this how much it actually impacts Keycloak. And if the database is running outside of Kubernetes, you would start coding the zone manually in the affinity rules, and not use the "try to run all pods in the same zone but I don't care which".

So I'd say it is probably good enough for now.

Once we work with this a bit more, I could see that we provide some predefined scheduling templates to hide the complexity and to promote some well-tested and well-understood scheduling methods. This would then be a follow-up PR.

@ryanemerson
Copy link
Contributor

+1 to Alexander's comment. I think that more advanced/opinionated setting examples should exist as part of the Keycloak HA guide or similar, where we explain in detail the trade-offs of different architectures and how they apply to specific use-cases.

Copy link
Contributor

@mabartos mabartos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawkins Just last minor thing about the example formatting, otherwise LGTM

docs/guides/operator/advanced-configuration.adoc Outdated Show resolved Hide resolved
@shawkins
Copy link
Contributor Author

shawkins commented Jul 2, 2024

@shawkins Just last minor thing about the example formatting, otherwise LGTM

Should be good now.

Copy link
Contributor

@vmuzikar vmuzikar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one minor comment. LGTM otherwise.

docs/guides/operator/advanced-configuration.adoc Outdated Show resolved Hide resolved
Copy link
Contributor

@vmuzikar vmuzikar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawkins LGTM, thank you.

@vmuzikar
Copy link
Contributor

vmuzikar commented Jul 3, 2024

@mabartos I believe your comments has been resolved. I'm overriding your review request so we can merge. Once you're back, let me know if you have any further concerns, we can address them as a follow-up.

@vmuzikar vmuzikar dismissed mabartos’s stale review July 3, 2024 18:06

Review comments has been resolved

@vmuzikar vmuzikar merged commit a7ae90c into keycloak:main Jul 3, 2024
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support pod affinity settings in the Keycloak Operator
5 participants