Introduce Leader Election for Gloo #6926
Conversation
Visit the preview URL for this PR (updated for commit 81a3ac5): https://gloo-edge--pr6926-ha-part-3-g0d9pb36.web.app (expires Mon, 22 Aug 2022 18:36:04 GMT) 🔥 via Firebase Hosting GitHub Action 🌎
// Create the resource Lock interface necessary for leader election.
// Controller runtime requires an event handler provider, but that package is
// internal so for right now we pass a noop handler.
resourceLock, err := leaderelection.NewResourceLock(f.restCfg, NewNoopProvider(), leOpts)
what is the effect of using a noop handler? will we be missing some functionality?
A noop handler is just the implementation for leader election that does nothing (i.e. what we have today, where we only support a single replica). Is there anything I can do to help make that clearer?
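To illustrate that answer, here is a minimal sketch of what a do-nothing election implementation looks like; the type and function names are hypothetical, not Gloo's actual API:

```go
package example

import "context"

// Identity answers whether this instance currently holds leadership.
// These names are illustrative stand-ins, not Gloo's actual types.
type Identity interface {
	IsLeader() bool
}

// noopIdentity never talks to Kubernetes and always reports leadership,
// which matches today's single-replica behavior.
type noopIdentity struct{}

func (noopIdentity) IsLeader() bool { return true }

// StartNoopElection "elects" the only replica immediately; nothing is lost,
// we simply skip the coordination that multiple replicas would require.
func StartNoopElection(_ context.Context) Identity {
	return noopIdentity{}
}
```

A multi-replica deployment would swap this for an implementation backed by a Kubernetes lock.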
if podNamespace := os.Getenv("POD_NAMESPACE"); podNamespace != "" {
	return podNamespace
}
return "gloo-system"
i think we already have a const defined for this somewhere
Yeah, the challenge I ran into is that the constant is defined under projects/gloo/pkg/utils, which causes a circular dependency. I would like in the future (and I tried in my other cleanup PR) to define constants at the top level, without dependencies, so they can be imported easily.
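As a sketch of that follow-up idea (the package path and names are hypothetical, not existing Gloo code), a dependency-free top-level package could hold such defaults:

```go
// Hypothetical top-level package with no internal dependencies, so any
// project (gloo, gateway, ...) could import it without creating a cycle.
package defaults

// GlooSystem is the namespace to fall back to when POD_NAMESPACE is unset.
const GlooSystem = "gloo-system"
```

The POD_NAMESPACE fallback above could then return defaults.GlooSystem instead of the string literal.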
	config.OnStoppedLeading()
},
OnNewLeader: func(identity string) {
	contextutils.LoggerFrom(ctx).Debugf("New Leader Elected with Identity: %s", identity)
will this print any info that helps identify which pod is the leader? looks like identity only contains a bool
Since this is the base implementation, I tried to keep it as lightweight as possible. I could certainly add information about the identity of the leader here, or we could configure Gloo elections to define their own "OnNewLeader" callback to print this information. What do you think?
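For reference, here is a sketch of the second option, assuming client-go's leaderelection package is driving the election; the function is illustrative rather than Gloo's actual wiring, and the durations are just commonly used client-go values:

```go
package example

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runElection shows a caller supplying its own callbacks so the elected
// identity gets logged; this is a sketch, not Gloo's actual code.
func runElection(ctx context.Context, lock resourcelock.Interface) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Commonly used client-go timings; these could be made configurable.
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("started leading")
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership")
			},
			OnNewLeader: func(identity string) {
				// identity is a string (typically "<pod-name>_<uuid>" when the
				// lock identity is derived from the hostname), not a bool, so
				// this log line does identify the leading pod.
				log.Printf("new leader elected: %s", identity)
			},
		},
	})
}
```

Since LeaseDuration, RenewDeadline, and RetryPeriod are plain fields, exposing them as settings later would be straightforward.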
@@ -129,6 +129,9 @@ settings:
  replaceInvalidRoutes: true
gateway:
  persistProxySpec: true
gloo:
  deployment:
    replicas: 2
will you be adding some e2e testing around making sure that leader election works with the 2 replicas?
Yeah, first I want to make sure that the existing regression tests all work with multiple replicas.
Ideally I'd like to come up with a more explicit strategy to test that leader election is working, but I'm still thinking about that.
if s.identity.IsLeader() {
	// Only leaders will write reports
i would think only leaders would do translation etc. too? otherwise we may use a bunch of CPU / RAM for no reason. also, a bunch of checks like this all over the codebase will be very hard to maintain compared to a single check at the top level that disables everything else until elected as leader.
should we even be serving validation requests? do we really want watches duplicated everywhere (could that stress out k8s or other dependencies, or require them to scale?) simultaneously? i thought most of the value of having the follower ready was to ensure that the pod was ready / already scheduled.
open to more discussion here; there are some valid points against but i think it will be easier and perhaps better to just short circuit everything at the top level
Being able to scale validation seems pretty nice, but the point about watches is a good one; it might even be behavior one could opt into. That being said, if the watches aren't in place, is the pod really ready to go?
that's a fair point; probably good to have the watches ready and warmed before being marked as ready and thus available for failover. i may just be paranoid; it depends on how much strain gloo alone puts on the k8s apiserver, and thus whether scaling it could become an issue. might be nice to get input from the field here to determine if this is warranted / should be configurable behavior
Everyone should be doing translation and validation. Envoy/k8s will reach out to the service and the request will be load balanced, so all replicas must be ready to serve the latest config.
are we solving for scaling gloo, or just HA? I agree, if we want to scale gloo and load balance validation / xds requests, we need to run translation (really everything but status reporting) on every pod (or have followers talk to the leader to get state, similar to how zookeeper HA works). that said, i'm also nervous about scaling gloo given how many requests it can make, particularly in the consul case. I don't think having 3-5 gloo pods will be all that aggressive on the k8s apiserver because of the natural watch api, but for backends like consul we implement watches by querying for every single endpoint individually; this puts a lot of strain on the consul server, and scaling gloo to multiple replicas could cause problems there when we may just want to solve for HA / ensuring pods can be scheduled.
really feels like more of a product requirement decision here than an engineering one to me. is the customer concerned about just scheduling pods? or about scaling gloo itself (is CPU / RAM pegged?)
also i don't know enough about the leader election libraries used here, but have we reasoned about what happens in error scenarios? usually in failure there can be (briefly) two or zero leaders..
It's possible for there to be 2 (fencing is not implemented by the library) or 0 (the leader dies and there is a delay before a new leader is elected). In both cases, status reporting is affected, but it will not cause issues pushing configuration to Envoy.
given that, I think we're ok in both scenarios
before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following the best practices there
Good call. I had looked too quickly and seen that the library exposes metrics, but the default is a noop. I'll update to include metrics. Do you think lease expiration should just be configurable, given the differences in users' environments?
I think the defaults are sane; i don't think we will hit multi-second issues unless we have deadlock, kernel errors, almost total network failure... the kinds of things that should result in a new leader regardless.
I am content with the current state.
* add leaderelector module * use single replica as placeholder * delete unused * kill gloo on lost leadership * add support in e2e testss * include metrics, skip statuses * set the leader election factory based on the settings * fix bad test * add kube lease rbac * update rbac and tests * fix rbac" " * generated code * set 2 replicas in proper place * skip config map with leader election annotation * move go.mod * add changelog * codegen * bad comment * udate consistent state check to allow for more jitter, which occurs during startup * Adding changelog file to new location * Deleting changelog file from old location * make metrics check more consistent * move leader election to setup so it runs once * preoprly quit on lost leadership, update comment to reflect * go mod tidy * fix kube2e setupfunc * bump timeout for proxy creation * print debug logs on gateway kube2e test failure * if leader election could not be started, error * different prefail handler * realseoNcancel * add informative description to expect * remove ReleaseOnCancel * fix timeout * add metrics provider * change metric name * go mod tidy * codegen * increase status timeout Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com> Co-authored-by: changelog-bot <changelog-bot>
* Introduce Leader Election for Gloo (#6926) * add leaderelector module * use single replica as placeholder * delete unused * kill gloo on lost leadership * add support in e2e testss * include metrics, skip statuses * set the leader election factory based on the settings * fix bad test * add kube lease rbac * update rbac and tests * fix rbac" " * generated code * set 2 replicas in proper place * skip config map with leader election annotation * move go.mod * add changelog * codegen * bad comment * udate consistent state check to allow for more jitter, which occurs during startup * Adding changelog file to new location * Deleting changelog file from old location * make metrics check more consistent * move leader election to setup so it runs once * preoprly quit on lost leadership, update comment to reflect * go mod tidy * fix kube2e setupfunc * bump timeout for proxy creation * print debug logs on gateway kube2e test failure * if leader election could not be started, error * different prefail handler * realseoNcancel * add informative description to expect * remove ReleaseOnCancel * fix timeout * add metrics provider * change metric name * go mod tidy * codegen * increase status timeout Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com> Co-authored-by: changelog-bot <changelog-bot> * fix(tls_inspector): added AggregateListener support "envoy.filters.listener.tls_inspector" is an essential listener_filter when attempting to use TLS. For _most_ gateways, we would previously always consider logic on whether or not to add it based on the presence of a user-provided SslConfig field. For the new _AggregateListener_, we neglected to add support, originally. This caused TLS to be unable to function when using the "isolateVirtualHostsBySslConfig" feature flag. related #6677 * changelog: condense * leader: Attempt to see if leader lease time is the breaking part * Revert "fix(tls_inspector): added AggregateListener support" This reverts commit 278187d. * Revert "Revert "fix(tls_inspector): added AggregateListener support"" This reverts commit 4d38e2b. Co-authored-by: Sam Heilbron <[email protected]> Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com> Co-authored-by: Gunnar <[email protected]>
@@ -0,0 +1,37 @@
package kube
@sam-heilbron Do you remember why we manually vendored this file from the controller-runtime or client-go repository?
Description
Introduce leader election to the Gloo component.
Context
Introduce leader election if using Kubernetes as the source of truth for resources.
Solution
Initially I placed the leader election code inside the Gloo SetupFunc, which is executed each time the Settings change. The effect was that each time the SetupFunc was called, the election process was cancelled (since we use a new context for each setup), a new leader was elected, and inconsistent behavior was observed.
I moved this election logic to the main setup portion so that it happens once during startup, and if leadership is lost, we kill the application.
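A rough sketch of that startup-level flow is below; the Elector interface and function names are illustrative stand-ins, not the actual leaderelector module's API:

```go
package example

import (
	"context"
	"os"

	"github.com/solo-io/go-utils/contextutils"
)

// Elector is a hypothetical stand-in for whatever the leaderelector module exposes.
type Elector interface {
	// Start begins campaigning and returns a channel that is closed if
	// leadership is acquired and subsequently lost.
	Start(ctx context.Context) (lostLeadership <-chan struct{}, err error)
}

// runOnce starts the election a single time, outside the Settings-driven
// setup loop, so re-running the SetupFunc never cancels and restarts the election.
func runOnce(ctx context.Context, elector Elector, runSetupLoop func(ctx context.Context) error) error {
	lost, err := elector.Start(ctx)
	if err != nil {
		// If the election cannot be started at all, fail fast rather than
		// running in an undefined, half-elected state.
		return err
	}
	go func() {
		<-lost
		// Losing leadership mid-flight would leave this replica writing
		// statuses it no longer owns, so exiting is the simplest safe reaction;
		// Kubernetes restarts the pod and it rejoins as a follower.
		contextutils.LoggerFrom(ctx).Error("lost leadership, shutting down")
		os.Exit(1)
	}()
	return runSetupLoop(ctx)
}
```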
Technical Debt
This is called out loudly in a code comment in https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/api/converters/kube/artifact_converter.go, where the debt is incurred.
We need to ignore the configmap (or whatever kube resource maintains the state of the leader) during translation. Since it is updated on an interval (every 2 seconds), letting Gloo controllers process it would resync the entire state of the world continually.
Ideally, we would ignore configmaps with a particular label, but that isn't supported in solo-kit. The faster solution is to ignore it explicitly in code for now and handle the more robust solution in a follow-up.
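A sketch of that interim explicit check, assuming the lock object is a configmap annotated with client-go's leader-election record annotation; Gloo's converter may key off a different marker in practice:

```go
package kube

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// skipDuringTranslation reports whether a configmap looks like the
// leader-election lock and should be ignored, so its ~2s renewals don't
// trigger a full resync of the state of the world.
func skipDuringTranslation(cm *corev1.ConfigMap) bool {
	// client-go stores the leader election record under this annotation
	// ("control-plane.alpha.kubernetes.io/leader") on configmap-based locks.
	_, isLock := cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]
	return isLock
}
```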
Follow Up Work
Enterprise Follow-Up
To support this in Enterprise, the following work will be required: https://github.com/solo-io/solo-projects/pull/3974
Checklist:
Run make -B install-go-tools generated-code to ensure there will be no code diff.