
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different #14022

Merged
4 commits merged into knative:main, Jun 16, 2023

Conversation

@Gekko0114 (Contributor) commented May 27, 2023

Fixes #12734

Issue

When cc=1 and the number of podTrackers is not evenly divisible by the activatorCount (as in the conditions below), requests are sent to only some of the pods, not all of them.

  • 2 activators
  • no istio sidecar injection
  • 5 replica servers
  • CC: 1
  • TBC : -1

Solution

This issue occurs due to an incorrect calculation of targetCapacity.
Example:
If podTracker=3, assignedTracker=2, and activatorCount=2, the expected targetCapacity is 2. However, the current implementation calculates targetCapacity as 1.
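
For illustration, here is a rough sketch of the pre-fix arithmetic (the helper name and formula mirror the ones quoted later in this thread; this is not the actual throttler.go code):

package main

import "fmt"

// minOneOrValue mirrors the helper quoted later in this thread: never return less than 1.
func minOneOrValue(v int) int {
	if v < 1 {
		return 1
	}
	return v
}

func main() {
	cc := 1          // containerConcurrency
	podTrackers := 3 // all pods known to the throttler
	activatorCount := 2

	// Pre-fix: the total capacity is divided across activators with integer
	// division, so the remainder is silently dropped.
	targetCapacity := minOneOrValue(cc * podTrackers / minOneOrValue(activatorCount))
	fmt.Println(targetCapacity) // 1, even though this activator has 2 pods assigned
}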

I've fixed this issue with the following two steps.

Step 1: Change the assignSlice logic

Current behavior (main branch):
If the number of pods is not divisible by the number of activators, the remaining pods are shared across all activators.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4, pod5

Expected behavior:
Instead, each remaining pod should be allocated exclusively to a single activator.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4
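
A rough sketch of this exclusive assignment (simplified to plain strings; the helper name is made up for illustration, and the real logic lives in assignSlice in pkg/activator/net/throttler.go):

package main

import "fmt"

// assignSliceSketch gives each activator a contiguous chunk of trackers and lets
// the first `remnants` activators each pick up exactly one tracker from the tail,
// so no tracker is shared between activators.
func assignSliceSketch(trackers []string, selfIndex, numActivators int) []string {
	n := len(trackers)
	sliceSize := n / numActivators
	remnants := n % numActivators

	assigned := append([]string{}, trackers[selfIndex*sliceSize:(selfIndex+1)*sliceSize]...)
	if selfIndex < remnants {
		// Each tail tracker goes to exactly one activator instead of all of them.
		assigned = append(assigned, trackers[n-remnants+selfIndex])
	}
	return assigned
}

func main() {
	pods := []string{"pod1", "pod2", "pod3", "pod4", "pod5"}
	for i := 0; i < 2; i++ {
		fmt.Printf("Activator%d: %v\n", i+1, assignSliceSketch(pods, i, 2))
	}
	// Activator1: [pod1 pod2 pod5]
	// Activator2: [pod3 pod4]
}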

Step 2: Change calculateCapacity

After step 1, targetCapacity can be calculated as CC * the number of assigned pods.

I think this problem is specific to the cc=1 case. If cc is greater than 1, targetCapacity is already larger than the number of assigned pods under the current implementation.
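
A tiny sketch of the step-2 calculation for the 5-pod / 2-activator example (illustrative only; the real change is in calculateCapacity):

package main

import "fmt"

func main() {
	cc := 1
	// After step 1 the assignment is exclusive:
	// activator0 -> {pod1, pod2, pod5}, activator1 -> {pod3, pod4}.
	assignedPods := []int{3, 2}

	for idx, n := range assignedPods {
		targetCapacity := cc * n // step 2: capacity is CC * assigned pods
		fmt.Printf("activator%d: targetCapacity=%d\n", idx, targetCapacity)
	}
	// The total admitted concurrency across both activators is 5,
	// matching 5 pods with CC=1.
}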

Release Note

NONE

@knative-prow knative-prow bot requested review from kauana and krsna-m May 27, 2023 14:10
@knative-prow knative-prow bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2023
@knative-prow bot commented May 27, 2023

Hi @Gekko0114. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Gekko0114 (Contributor Author) commented:

Hi @dprotaso,
I've fixed issue #12734. Could you take a look?

@nak3 nak3 added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2023
@Gekko0114 force-pushed the serving branch 2 times, most recently from e07af1a to 934751f on May 29, 2023 14:11
@codecov bot commented May 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.01 ⚠️

Comparison is base (bf48e64) 86.19% compared to head (1c867e4) 86.18%.

❗ Current head 1c867e4 differs from pull request most recent head 6979f5e. Consider uploading reports for the commit 6979f5e to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14022      +/-   ##
==========================================
- Coverage   86.19%   86.18%   -0.01%     
==========================================
  Files         199      199              
  Lines       14767    14757      -10     
==========================================
- Hits        12728    12718      -10     
  Misses       1735     1735              
  Partials      304      304              
Impacted Files Coverage Δ
pkg/activator/net/throttler.go 87.74% <100.00%> (-0.27%) ⬇️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@dprotaso (Member) commented May 29, 2023

If podTracker=3 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

Since we have two activators each assuming a targetCapacity of '2', the total assumed capacity is 4 (2 activators * 2 targetCapacity). This means the two activators will send more requests than we have capacity for, causing some queuing in our proxy.

It would be great if we could instead figure out how to split the capacity between the number of activators, e.g. in this instance the two activators would have capacities of 2 and 3.

Doing this might be complex though, so I think I'd rather answer the question: when each activator has a capacity of 1 and the total capacity is 3 (off by one), how come a pod is not receiving traffic?

@Gekko0114 (Contributor Author) commented:

@dprotaso
I see.

It would be great if we could instead figure out how to split the capacity between the number of activators, e.g. in this instance the two activators would have capacities of 2 and 3.

It shouldn't be difficult to adjust the capacities of the two activators to 2 and 3, respectively.

In the current implementation, if the number of pods is not divisible by the number of activators, any remaining pods are shared across all activators.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4, pod5

Instead of this implementation, we can allocate one pod to each activator exclusively.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4

By doing so, we can calculate the targetCapacity as CC * the number of assigned pods.

What do you think?

@dprotaso (Member) commented:

In the current implementation,

I guess to clarify - you mean the current implementation in this PR?

Instead of this implementation, we can allocate one pod to each activator exclusively.

Yeah but be aware - I don't know the complexity of doing this ;)

@Gekko0114 (Contributor Author) commented:

Current implementation means the code on the main branch.
Thanks, I will implement it as described above and ask for feedback from others.

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 31, 2023
@Gekko0114 (Contributor Author) commented:

I've updated the PR.
targetCapacity is now calculated using assignedTracker instead of podTracker.
This lets us assign a precise targetCapacity to each activator.
Please take a look.

Thanks,

@skonto (Contributor) commented May 31, 2023

If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

@Gekko0114 hi, could you elaborate on which part causes the problem with the pod not receiving traffic and why targetCapacity is 1?

Is this because of:

targetCapacity = minOneOrValue(targetCapacity / minOneOrValue(activatorCount))

where we have 3/2 = 1 and one of the pods is then not considered due to that capacity? So using numTrackers provides the real capacity, cc*numTrackers=2. Is that the idea of the fix?

Here is my understanding. A capacity of one means we have at most 1 in-flight request at the breaker side, and given the firstAvailableLBPolicy each request is served by the first available pod. Probably everything happens too quickly and the second pod never gets a request while the current in-flight request is being served.
So with the fix the invariant is that we will never have less capacity than assigned_pods*CC, correct?

@Gekko0114 (Contributor Author) commented Jun 1, 2023

Hi @skonto,

Thanks for your comment!

Is this because of:
targetCapacity = minOneOrValue(targetCapacity / minOneOrValue(activatorCount))
where we have 3/2 = 1 and then one of the pod is not considered due to that capacity?

Yes. The problem arises when targetCapacity is not divisible by activatorCount, resulting in a value smaller than the actual number of assigned pods.
Initially, I considered using the rounded-up value when it's not divisible, but I received feedback from @dprotaso that this is not a good approach. Therefore, I decided to use the actual number of assigned pods as the targetCapacity.
Actually, I think this approach is better than the existing code on the main branch, since assigned_pods reflects each activator's real capacity.

So with the fix the invariant is that we will never have less capacity than assigned_pods*CC, correct?

Yes. By setting targetCapacity to assigned_pods * CC, there should be no situation where some pods receive no requests or receive more requests than they can handle.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 5, 2023
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2023
@dprotaso (Member) left a review comment:

just a few questions so I understand the implications

Comment on lines -389 to -400
// We shuffle the tail, to ensure that pods in the tail get better
// load distribution, since we sort the pods above, this puts more requests
// on the very first tail pod, than on the others.
rand.Shuffle(remnants, func(i, j int) {
tail[i], tail[j] = tail[j], tail[i]
})
// assignSlice is not called for CC=0 case.
dcc := int(math.Ceil(float64(cc) / float64(numActivators)))
// This is basically: x = append(x, trackers[len(trackers)-remnants:]...)
// But we need to update the capacity.
for _, t := range tail {
t.UpdateConcurrency(dcc)
@dprotaso (Member) commented:

Can you comment on the deletions here (in GitHub) and document the code a bit.

I need to validate my understanding. The way I read the change, we are no longer distributing the tail elements across all activators; instead, some activators pick up an additional element.

eg.

  1. we have 20 pods and 3 activators -> we'd get 2 remnants so activator with index 0,1 would each pick up a unique tracker
  2. we have 24 pods and 5 activators -> we'd get 4 remnants so the activator 0,1,2,3 would each pick up a unique tracker
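
For those two examples the split works out as follows (a quick sketch with plain integers, not the actual assignSlice code):

package main

import "fmt"

func main() {
	cases := []struct{ pods, activators int }{
		{20, 3}, // 2 remnants: activators 0 and 1 each pick up one extra tracker
		{24, 5}, // 4 remnants: activators 0-3 each pick up one extra tracker
	}
	for _, c := range cases {
		sliceSize := c.pods / c.activators
		remnants := c.pods % c.activators
		fmt.Printf("pods=%d activators=%d -> base slice %d, remnants %d\n",
			c.pods, c.activators, sliceSize, remnants)
	}
}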

What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost as to why the prior code called UpdateConcurrency on the tail trackers but not on the picked indices.

@dprotaso (Member) commented:

What's the reason for dropping the t.UpdateConcurrency(dcc)

So the test change where the total capacity changed from 15 to 20 highlighted that even though a tracker was shared amongst multiple activators before, the container concurrency was reduced for each such tracker.

I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.

I wonder if there's actually nothing wrong with our capacity calculation and the problem is in the LB implementation instead: for cc=1 it uses firstAvailableLBPolicy.

@Gekko0114 (Contributor Author) commented Jun 8, 2023:

Thanks for your comments!

Can you comment on the deletions here (in GitHub) and document the code a bit.

Sure, I will add comments on the code.

I need to validate my understanding. The way I read the change is we are no longer distributing tail elements across all activators and instead having activators pick up an additional element

Yes, your understanding matches mine.

What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost why the code prior called UpdateConcurrency on the tail trackers but not the picked indicies?

I think the reason the prior code called UpdateConcurrency was to ensure that the total capacity of each tail tracker, summed across all activators, matched the desired cc.
However, in this PR each remnant tracker is now assigned to a unique activator, so there is no need to call UpdateConcurrency anymore.

I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.

I wonder if there's nothing wrong with our capacity calculation but actually the LB implementation - for cc=1 it uses firstAvailableLBPolicy

While the UpdateConcurrency call here may not have functioned as intended, it is not the root cause of this issue. The root cause is the activator's own capacity calculation.

If I have misunderstood anything or if you have further questions, please let me know.

@dprotaso (Member) commented:

If I have misunderstood anything or if you have further questions, please let me know.

Bear with me, I'm also learning this code, so I'm walking through it to understand the implications of your changes.

If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible

That's why I was wondering: if we're still including the excess destinations in all activators (and reducing their container concurrency), it seems like there's an issue with the load balancing.

With the old code I added a test to see the output in your original situation (cc=1, pods=5, activators=2):

func TestBlah(t *testing.T) {
	logger := TestLogger(t)
	cc := 1
	numActivators := 2

	// Build each activator's view of the revision: 5 pod trackers with cc=1.
	for i := 0; i < numActivators; i++ {
		rt := &revisionThrottler{
			logger:               logger,
			breaker:              queue.NewBreaker(testBreakerParams),
			containerConcurrency: cc,
		}
		rt.numActivators.Store(int32(numActivators))
		rt.activatorIndex.Store(int32(i))
		rt.podTrackers = makeTrackers(5, 1)
		rt.breaker = newInfiniteBreaker(logger)

		rt.updateCapacity(5)
		got := rt.breaker.Capacity()
		t.Logf("Activator breaker capacity = %d", got)

		// Log which pods this activator was assigned and each pod's capacity.
		for _, tracker := range rt.assignedTrackers {
			t.Logf("dest: %s, Capacity = %d", tracker.dest, tracker.Capacity())
		}
	}
}

This gives the output

=== RUN   TestBlah
    logger.go:130: 2023-06-09T11:28:39.139-0400 DEBUG   net/throttler.go:300    Trackers 0/2: assignment: [0 1 4]
    logger.go:130: 2023-06-09T11:28:39.139-0400 INFO    net/throttler.go:318    Set capacity to 2 (backends: 5, index: 0/2)
    throttler_test.go:245: Activator breaker capacity = 1
    throttler_test.go:248: dest: 0, Capacity = 1
    throttler_test.go:248: dest: 1, Capacity = 1
    throttler_test.go:248: dest: 4, Capacity = 1
    logger.go:130: 2023-06-09T11:28:39.140-0400 DEBUG   net/throttler.go:300    Trackers 1/2: assignment: [2 3 4]
    logger.go:130: 2023-06-09T11:28:39.140-0400 INFO    net/throttler.go:318    Set capacity to 2 (backends: 5, index: 1/2)
    throttler_test.go:245: Activator breaker capacity = 1
    throttler_test.go:248: dest: 2, Capacity = 1
    throttler_test.go:248: dest: 3, Capacity = 1
    throttler_test.go:248: dest: 4, Capacity = 1
--- PASS: TestBlah (0.00s)

So in theory dest 4 should be receiving requests from both activators which would result in some requests being queued at the queue proxy.

Given this, I wondered why that pod is not receiving traffic, and I think it's because of the load-balancing policy, which just iterates over the assigned trackers and picks the first one that has capacity:

case containerConcurrency <= 3:
// For very low CC values use first available pod.
revBreaker = queue.NewBreaker(breakerParams)
lbp = firstAvailableLBPolicy

return rt.lbPolicy(ctx, rt.assignedTrackers)
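
For readers following along, here is a simplified, self-contained sketch of how such a first-available policy behaves (the type and function below are stand-ins for illustration; the real policy operates on podTrackers and returns a release callback):

package main

import "fmt"

// tracker is a stand-in for the throttler's podTracker with a single slot (cc=1).
type tracker struct {
	dest  string
	inUse bool
}

// tryReserve claims the tracker's single slot if it is free.
func (t *tracker) tryReserve() bool {
	if t.inUse {
		return false
	}
	t.inUse = true
	return true
}

// firstAvailable walks the assigned trackers in order and picks the first one
// with spare capacity; later trackers are only reached when earlier ones are busy.
func firstAvailable(trackers []*tracker) *tracker {
	for _, t := range trackers {
		if t.tryReserve() {
			return t
		}
	}
	return nil
}

func main() {
	assigned := []*tracker{{dest: "pod0"}, {dest: "pod1"}, {dest: "pod4"}}
	fmt.Println(firstAvailable(assigned).dest) // pod0
	fmt.Println(firstAvailable(assigned).dest) // pod1 (pod0 is busy)
}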

Thus I don't think this change will solve your original problem. I think if you add artificial waits to your service's request handling you would eventually see all pods getting traffic. Or set your container concurrency to 4, which changes the load balancing to round robin.

So I think there's a separate fix for the original issue, and we need to think about whether changing the assignment still makes sense (I think it does, I just need to think about the implications a bit).

@Gekko0114 (Contributor Author) commented Jun 10, 2023:

Thank you for your comment! I see what you mean.
However, your understanding differs from mine, so let me explain.

When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible

This is not correct.
When CC >= 1, the infinite breaker is not used. Instead, queue.NewBreaker is called.
https://github.com/knative/serving/blob/main/pkg/activator/net/throttler.go#L179-L190

Let's consider the scenario where there are multiple activators and CC=1.
With CC=1, the capacity of each pod is 1.
There is no mechanism for activators to share which pods they are sending requests to.
If both activators were assigned all pods, the total number of requests from the two activators could exceed CC. Therefore, each activator is assigned a unique set of pods.
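
To make the breaker/policy selection concrete, here is a small summary program based on the snippet quoted earlier and the linked throttler.go lines (a paraphrase for illustration, not the actual source):

package main

import "fmt"

// breakerFor summarizes which breaker and LB policy the activator picks for a
// given containerConcurrency, per the discussion in this thread.
func breakerFor(cc int) string {
	switch {
	case cc == 0:
		return "infinite breaker (unlimited concurrency)"
	case cc <= 3:
		return "queue.NewBreaker + first-available LB policy"
	default:
		return "queue.NewBreaker + round-robin LB policy"
	}
}

func main() {
	for _, cc := range []int{0, 1, 3, 4} {
		fmt.Printf("cc=%d -> %s\n", cc, breakerFor(cc))
	}
}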

@Gekko0114 (Contributor Author) commented:

@dprotaso

Thus I don't think your change this will solve your original problem. I think if you add artificial waits to your services handling requests you would eventually see all pods getting traffic. Or set your container concurrency to 4 which changes the load balancing to round robin.

So I think there's a separate fix to the original issue and we need to think about whether changing the assignment still makes sense (I think it does - just need to think about the implications a bit)

As you pointed out, the issue could also be resolved by changing the load-balancing mechanism to round-robin.
However, in my opinion it is better to fix the activator's own capacity: the problem in this issue is not the load-balancing logic but the activators' capacity.

In the case of cc=1, pods=5, and activators=2:
Each activator has a capacity of 2.
The assignment of activators is as follows: activator1: pod0, pod1, pod4; activator2: pod2, pod3, pod4.

When the hey command is executed:
hey -t 0 -n 10000 -c 100 <address>
the activator is designed not to process requests beyond its capacity.
If a request is sent to activator1, it first tries to forward it to pod0. If pod0 has available capacity, the request is sent there; otherwise it is sent to pod1.
So what happens if neither pod0 nor pod1 can receive the request?
In that case the activator has already reached its capacity limit of 2, so it won't even attempt to send the request to a pod.
Only once pod0 or pod1 has available capacity again will it forward the request. That's why pod4 never gets any traffic from activator1 or activator2 (you can observe this by reproducing the issue).

Changing to round-robin would likely fix this bug as well. However, even with round-robin the problem with the activator's capacity remains unresolved, so I believe it is better to fix the activator's capacity itself, which is what this PR does.
Also, if we switched to round-robin even when cc <= 3, there would still be no mechanism for multiple activators to coordinate which pods they are sending requests to, so pod4 would end up receiving more requests than its capacity allows.

Sorry for long explanation.
If I have misunderstood anything or if you have further questions, please let me know.
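
To make the capacity argument above concrete, here is a toy simulation (not Knative code) of a single activator with capacity 2 doing a first-available pick over its three assigned pods; pod4 is never selected:

package main

import "fmt"

func main() {
	pods := []string{"pod0", "pod1", "pod4"}
	busy := map[string]bool{}
	hits := map[string]int{}
	var inFlight []string // FIFO of requests currently being served
	const activatorCapacity = 2

	for req := 0; req < 1000; req++ {
		// The activator never admits more than its capacity: a new request is
		// only let in once an in-flight one completes and frees its pod.
		if len(inFlight) == activatorCapacity {
			done := inFlight[0]
			inFlight = inFlight[1:]
			busy[done] = false
		}
		// First-available pick over the assigned trackers, in order.
		for _, p := range pods {
			if !busy[p] {
				busy[p] = true
				inFlight = append(inFlight, p)
				hits[p]++
				break
			}
		}
	}
	fmt.Println(hits) // pod0 and pod1 split all the requests; pod4 never appears
}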

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 8, 2023
@Gekko0114 (Contributor Author) commented:

@dprotaso,

Thanks for your comments! I've updated the PR and replied to your comments.

@dprotaso (Member) commented:

This is not correct. When CC >= 1, the infinite breaker is not used. Instead, queue.NewBreaker is called.

Good catch - I had a bad assumption.

@Gekko0114 thanks for all the help and clarification here.

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2023
@knative-prow bot commented Jun 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, Gekko0114

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2023
@knative-prow knative-prow bot merged commit f085b30 into knative:main Jun 16, 2023
61 checks passed
@Gekko0114 (Contributor Author) commented:

Thank you so much for your review.
I learned a lot from you!

@Gekko0114 Gekko0114 deleted the serving branch June 17, 2023 00:06
arsenetar pushed a commit to coreweave/serving that referenced this pull request Jul 3, 2023
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different (knative#14022)

* fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTrackerCount are different
* each activator has correct capacity
* rebase
* update_docs
skonto pushed a commit to skonto/serving that referenced this pull request Oct 5, 2023
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different (knative#14022)

* fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTrackerCount are different

* each activator has correct capacity

* rebase

* update_docs
arsenetar added a commit to coreweave/serving that referenced this pull request Dec 19, 2023
- Backport activator fixes from knative#14303 and knative#14347 from 1.12
- Add custom patches for logs and probe durations
- Update to go 1.20
- Add patch from knative#14022
- Add custom CI workflows
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/autoscale
area/networking
lgtm: Indicates that a PR is ready to be merged.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Activator LB doesn't work well
5 participants