
Introduce Leader Election for Gloo #6926

Merged
merged 49 commits into master from ha/part-3 on Aug 15, 2022

Conversation

sam-heilbron
Contributor

@sam-heilbron sam-heilbron commented Aug 10, 2022

Description

Introduce leader election to Gloo component

Context

Introduce leader election if using Kubernetes as the source of truth for resources.

Solution

Initially I placed the leader election code inside the Gloo SetupFunc, which is executed each time Settings change. The effect was that each time the SetupFunc was called, the election process was cancelled (since we use a new context for each setup), a new leader was elected, and inconsistent behavior resulted.

I moved this election logic to the main setup portion so that it happens once during startup, and if leadership is lost, we kill the application.
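For readers unfamiliar with the pattern, the sketch below shows roughly what "elect once at startup, exit on lost leadership" looks like using client-go's leaderelection package. It is a minimal illustration, not the PR's exact wiring (the PR goes through controller-runtime's resource lock, as shown in a review snippet further down); the helper name, lock name, and timings here are assumptions.

package setup

import (
    "context"
    "os"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// startLeaderElection is a hypothetical helper: it is called once from the main
// setup path, runs the election in the background, and exits the process if
// leadership is ever lost after being acquired.
func startLeaderElection(ctx context.Context, cfg *rest.Config, namespace, identity string) error {
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }

    // Use a Lease in the install namespace as the lock; "gloo-leader" is an illustrative name.
    lock, err := resourcelock.New(
        resourcelock.LeasesResourceLock,
        namespace,
        "gloo-leader",
        client.CoreV1(),
        client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: identity},
    )
    if err != nil {
        return err
    }

    go leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second,
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Flip internal identity to leader; only the leader writes statuses.
            },
            OnStoppedLeading: func() {
                // Losing an acquired lease leaves us in an ambiguous state, so kill
                // the process and let Kubernetes reschedule the pod.
                os.Exit(1)
            },
            OnNewLeader: func(id string) {
                // Observability hook: log the identity of the current leader.
            },
        },
    })
    return nil
}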

Technical Debt

This is called out loudly in a code comment in https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/api/converters/kube/artifact_converter.go, where the debt is incurred.

We need to ignore the configmap (or whatever kube resource maintains the leader state) during translation. It is updated on an interval (roughly every 2 seconds), so if it is processed by Gloo controllers we will continually resync the entire state of the world.

Ideally we would ignore configmaps with a particular label, but that isn't supported in solo-kit. The faster solution is to ignore it explicitly in code for now and handle the more robust solution in a follow-up.
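As an illustration of the explicit skip described above (the real check lives in projects/gloo/pkg/api/converters/kube/artifact_converter.go), something along these lines filters the leader-election record out before translation; the annotation key and function name here are hypothetical, not the ones used in the PR.

package kube

import corev1 "k8s.io/api/core/v1"

// hypothetical marker; the actual converter keys off the leader-election resource's metadata
const leaderElectionAnnotation = "gloo.solo.io/leader-election"

// isLeaderElectionRecord reports whether a configmap exists only to hold leader-election
// state. Such records are renewed every couple of seconds, so feeding them into the Gloo
// controllers would trigger a full resync of the world on every renewal.
func isLeaderElectionRecord(cm *corev1.ConfigMap) bool {
    _, ok := cm.Annotations[leaderElectionAnnotation]
    return ok
}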

Follow Up Work

  1. Solo-kit enhancements to resource filtering
  2. Helm hardening around other HA features

Enterprise Follow-Up

To support this in Enterprise, the following work will be required: https://github.com/solo-io/solo-projects/pull/3974

Checklist:

  • I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/master/changelogutils) which references the issue that is resolved.
  • If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
  • I followed guidelines laid out in the Gloo Edge contribution guide
  • I opened a draft PR or added the work in progress label if my PR is not ready for review
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@github-actions github-actions bot added the "keep pr updated" label (signals bulldozer to keep the PR up to date with the base branch) on Aug 10, 2022
@sam-heilbron sam-heilbron marked this pull request as ready for review August 11, 2022 02:56
@github-actions

github-actions bot commented Aug 11, 2022

Visit the preview URL for this PR (updated for commit 81a3ac5):

https://gloo-edge--pr6926-ha-part-3-g0d9pb36.web.app

(expires Mon, 22 Aug 2022 18:36:04 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

@sam-heilbron sam-heilbron requested a review from a team as a code owner August 11, 2022 05:25
go.mod (outdated review thread, resolved)
// Create the resource Lock interface necessary for leader election.
// Controller runtime requires an event handler provider, but that package is
// internal so for right now we pass a noop handler.
resourceLock, err := leaderelection.NewResourceLock(f.restCfg, NewNoopProvider(), leOpts)
Contributor

what is the effect of using a noop handler? will we be missing some functionality?

Contributor Author

A noop handler is just the implementation for leader election that does nothing (ie what we have today, where we only support a single replica). Is there anything I can do to help make that clearer?
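For context, a "no-op" election here means the single-replica behavior where the running pod simply treats itself as the leader. A rough sketch of that idea follows; the type and constructor names are illustrative, and only the IsLeader method name appears later in the PR diff.

package leaderelector

// Identity answers the only question most callers have.
type Identity interface {
    IsLeader() bool
}

// singleReplicaIdentity is the no-op case: with one replica there is nothing to elect,
// so the pod always considers itself the leader (today's behavior).
type singleReplicaIdentity struct{}

func (singleReplicaIdentity) IsLeader() bool { return true }

// NewSingleReplicaIdentity would be used when leader election is disabled or the
// settings source is not Kubernetes.
func NewSingleReplicaIdentity() Identity { return singleReplicaIdentity{} }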

if podNamespace := os.Getenv("POD_NAMESPACE"); podNamespace != "" {
return podNamespace
}
return "gloo-system"
Contributor

i think we already have a const defined for this somewhere

Contributor Author

Yeah, the challenge I ran into is that the constant is defined in projects/gloo/pkg/utils, which causes a circular dependency. I would like in the future (and I tried in my other cleanup PR) to define constants at the top level, without dependencies, so they can be imported easily.

config.OnStoppedLeading()
},
OnNewLeader: func(identity string) {
contextutils.LoggerFrom(ctx).Debugf("New Leader Elected with Identity: %s", identity)
Contributor

will this print any info that helps identify which pod is the leader? looks like identity only contains a bool

Contributor Author

Since this is the base implementation, I tried to keep it as lightweight as possible. I could certainly add information about the identity of the leader in here, or we could configure Gloo elections to define their own "OnNewLeader" callback to print this information. What do you think?

@@ -129,6 +129,9 @@ settings:
replaceInvalidRoutes: true
gateway:
persistProxySpec: true
gloo:
deployment:
replicas: 2
Contributor

will you be adding some e2e testing around making sure that leader election works with the 2 replicas?

Contributor Author

Yeah, first I want to make sure that the existing regression tests all work with multiple replicas.

Ideally I'd like to come up with a more explicit strategy to test that leader election is working, but I'm still thinking about that.

@solo-changelog-bot

Issues linked to changelog:
#5795

@sam-heilbron sam-heilbron changed the title from "[DNM] Introduce Leader Election for Gloo" to "Introduce Leader Election for Gloo" on Aug 11, 2022
Comment on lines +105 to +106
if s.identity.IsLeader() {
// Only leaders will write reports
Contributor

i would think only leaders would do translation etc too? otherwise we may use a bunch of CPU / RAM for no reason? also a bunch of checks like this all over the codebase will be very very hard to maintain rather than a single check at the top level that disables everything else until elected as leader

should we even be serving validation requests either? do we really want to have watches duplicated everywhere (can we stress out or require k8s or dependencies to scale?) simultaneously? i thought most of the value of having the follower ready was to ensure that the pod was ready / was already scheduled

Contributor

open to more discussion here; there are some valid points against but i think it will be easier and perhaps better to just short circuit everything at the top level

Contributor

Being able to scale validation seems pretty nice, but the point about watches is a good one; it might even be behavior one could opt into. That being said, if the watches aren't in place, is the pod really ready to go?

Contributor

that's a fair point; probably good to have the watches ready and warmed before being marked as ready and thus available for failover. i may just be paranoid; depends on how much strain gloo alone puts on the k8s apiserver itself and thus scaling it could become an issue. might be nice to get input from the field here to determine if this is warranted / should be configurable behavior

Member

Everyone should be doing translation and validation. Envoy/k8s will reach out to the service and the request will be load balanced, so all replicas must be ready to serve the latest state.
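Concretely, the pattern the diff snippet above points at is that translation runs on every replica while status writes are gated on the elected identity. A hedged sketch of that shape follows; only the IsLeader name comes from the PR diff, the rest is illustrative.

package reporter

import "context"

// Identity is satisfied by the elected identity discussed above.
type Identity interface {
    IsLeader() bool
}

type statusSyncer struct {
    identity Identity
}

// Sync runs on every replica, but only the leader writes statuses back to the
// cluster, so followers stay warm (watches, translation) without racing the
// leader on writes.
func (s *statusSyncer) Sync(ctx context.Context, writeStatuses func(context.Context) error) error {
    if !s.identity.IsLeader() {
        // Followers skip the write entirely.
        return nil
    }
    return writeStatuses(ctx)
}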

Contributor

@kdorosh kdorosh Aug 15, 2022

are we solving for scaling gloo, or just HA? I agree, if we want to scale gloo and load balance validation / xds requests we need to run translation (really everything but status reporting) on every pod (or have followers talk to the leader to get state, similar to how zookeeper HA works). That said, I'm also nervous about scaling gloo given how many requests it can make, particularly in the consul case. I don't think having 3-5 gloo pods will be all that aggressive on the k8s apiserver because of the natural watch api, but for backends like consul we implement watches by querying for every single endpoint individually, which puts a lot of strain on the consul server, and scaling gloo to multiple replicas could cause problems there when we may just want to solve for HA / ensuring pods can be scheduled.

This really feels like more of a product-requirements decision than an engineering one to me. Is the customer concerned about just scheduling pods, or about scaling gloo itself (is CPU / RAM pegged)?

Contributor

also i don't know enough about the leader election libraries used here, but have we reasoned about what happens in error scenarios? usually in failure there can be (briefly) two or zero leaders..

Contributor Author

@sam-heilbron sam-heilbron Aug 15, 2022

It's possible for there to be two leaders (fencing is not implemented by the library) or zero leaders (the leader dies and there is a delay before a new leader is elected). In both cases status reporting is affected, but pushing configuration to Envoy is not.
Given that, I think we're OK in both scenarios.

@kdorosh
Contributor

kdorosh commented Aug 15, 2022

before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following best practices here, in particular:

  • add metrics for leaders
  • reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a cpu intensive application)

@sam-heilbron
Contributor Author

before we merge i'd recommend reading https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/ and following best practices here, in particular:

  • add metrics for leaders
  • reason about lease expiration, especially in the case of garbage collection (which is a real concern for us as a cpu intensive application)

Good call. I had looked too quickly and seen that the library exposes metrics, but the default is a noop. I'll update to include metrics.

Do you think lease expiration should just be configurable, given the differences between users' environments?
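The PR later registers a metrics provider with the election library (see the "add metrics provider" commit below). As an illustration only, the simplest useful signal is a gauge flipped from the election callbacks; the metric name here is hypothetical and this is not the PR's implementation.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// leaderGauge reports 1 while this replica holds the leader lease, 0 otherwise.
var leaderGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "gloo_leader_election_is_leader", // hypothetical metric name
    Help: "1 if this replica currently holds the leader lease, 0 otherwise.",
})

func init() {
    prometheus.MustRegister(leaderGauge)
}

// MarkLeader is called from OnStartedLeading.
func MarkLeader() { leaderGauge.Set(1) }

// MarkFollower is called from OnStoppedLeading and at startup.
func MarkFollower() { leaderGauge.Set(0) }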

@kdorosh
Contributor

kdorosh commented Aug 15, 2022

I think the defaults are sane; I don't think we will hit multi-second issues unless we have deadlock, kernel errors, or almost total network failure... the kinds of things that should result in a new leader regardless.
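For reference, the defaults under discussion are the conventional client-go/controller-runtime timings: a 15s lease duration, a 10s renew deadline, and a 2s retry period (the roughly 2-second write interval mentioned in the PR description). A sketch of surfacing them as configuration, assuming one wanted to make them tunable later; the struct and field names are illustrative.

package leaderelector

import "time"

// ElectionConfig groups the timing knobs leader election exposes.
type ElectionConfig struct {
    LeaseDuration time.Duration // how long a lease is valid before others may claim it
    RenewDeadline time.Duration // how long the leader keeps retrying renewal before giving up
    RetryPeriod   time.Duration // how often the lock is polled/renewed (~2s in this PR)
}

// DefaultElectionConfig returns the conventional client-go/controller-runtime defaults.
func DefaultElectionConfig() ElectionConfig {
    return ElectionConfig{
        LeaseDuration: 15 * time.Second,
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
    }
}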

@nfuden
Contributor

nfuden commented Aug 15, 2022

I am content with the current state.

@soloio-bulldozer soloio-bulldozer bot merged commit dd3ac2d into master Aug 15, 2022
@soloio-bulldozer soloio-bulldozer bot deleted the ha/part-3 branch August 15, 2022 19:57
nfuden pushed a commit that referenced this pull request Aug 15, 2022
* add leaderelector module

* use single replica as placeholder

* delete unused

* kill gloo on lost leadership

* add support in e2e tests

* include metrics, skip statuses

* set the leader election factory based on the settings

* fix bad test

* add kube lease rbac

* update rbac and tests

* fix rbac

* generated code

* set 2 replicas in proper place

* skip config map with leader election annotation

* move go.mod

* add changelog

* codegen

* bad comment

* update consistent state check to allow for more jitter, which occurs during startup

* Adding changelog file to new location

* Deleting changelog file from old location

* make metrics check more consistent

* move leader election to setup so it runs once

* properly quit on lost leadership, update comment to reflect

* go mod tidy

* fix kube2e setupfunc

* bump timeout for proxy creation

* print debug logs on gateway kube2e test failure

* if leader election could not be started, error

* different prefail handler

* releaseOnCancel

* add informative description to expect

* remove ReleaseOnCancel

* fix timeout

* add metrics provider

* change metric name

* go mod tidy

* codegen

* increase status timeout

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
nfuden added a commit that referenced this pull request Aug 16, 2022
* Introduce Leader Election for Gloo  (#6926)

* fix(tls_inspector): added AggregateListener support

"envoy.filters.listener.tls_inspector" is an essential listener_filter when attempting to use TLS.  For _most_ gateways, we would previously always consider logic on whether or not to add it based on the presence of a user-provided SslConfig field.  For the new _AggregateListener_, we neglected to add support, originally.  This caused TLS to be unable to function when using the "isolateVirtualHostsBySslConfig" feature flag.

related #6677

* changelog: condense

* leader: Attempt to see if leader lease time is the breaking part

* Revert "fix(tls_inspector): added AggregateListener support"

This reverts commit 278187d.

* Revert "Revert "fix(tls_inspector): added AggregateListener support""

This reverts commit 4d38e2b.

Co-authored-by: Sam Heilbron <[email protected]>
Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: Gunnar <[email protected]>
@@ -0,0 +1,37 @@
package kube
Contributor

@sam-heilbron Do you remember why we manually vendored this file from the c-r or client-go repository?
