fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different #14022
Conversation
Hi @Gekko0114. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from e07af1a to 934751f.
Codecov Report
Patch coverage:

@@           Coverage Diff            @@
##             main   #14022    +/-   ##
==========================================
- Coverage   86.19%   86.18%   -0.01%
==========================================
  Files         199      199
  Lines       14767    14757      -10
==========================================
- Hits        12728    12718      -10
  Misses       1735     1735
  Partials      304      304

☔ View full report in Codecov by Sentry.
Since we have two activators each assuming a targetCapacity of '2' - the total assumed capacity is 4 (2 activators * 2 targetCapacity). This means the two activators will send more requests than what we have capacity for - causing some queuing in our proxy. It would be great if we could instead figure out how to split the capacity between the number of activators, e.g. in this instance for two activators each would have capacity 2 and 3. Doing this might be complex though - so I think I'd rather answer the question: when each activator has a capacity of 1 and the total capacity is 3 (off by one) - how come a pod does not receive traffic?
@dprotaso
It shouldn't be difficult to adjust the capacities of the two activators to 2 and 3, respectively. In the current implementation, if the number of pods is not divisible by the number of activators, any remaining pods are shared across all activators.
Instead of this implementation, we can allocate one pod to each activator exclusively.
By doing so, we can calculate the targetCapacity as CC * the number of assigned pods. What do you think?
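A minimal sketch of this proposal, assuming pods are assigned to activators exclusively; assignExclusive and the pod-name slice are illustrative names, not the actual knative assignSlice code:

package main

import "fmt"

// assignExclusive gives each activator an even share of the pods and lets the
// first (len(pods) % numActivators) activators take one extra pod, so no pod
// is shared between activators.
func assignExclusive(pods []string, numActivators, selfIndex, cc int) ([]string, int) {
	share := len(pods) / numActivators
	remnants := len(pods) % numActivators

	start := selfIndex*share + minInt(selfIndex, remnants)
	end := start + share
	if selfIndex < remnants {
		end++ // this activator picks up one of the leftover pods
	}
	assigned := pods[start:end]
	// With exclusive assignment, capacity is simply cc per assigned pod.
	return assigned, cc * len(assigned)
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	pods := []string{"pod0", "pod1", "pod2"}
	for i := 0; i < 2; i++ {
		assigned, capacity := assignExclusive(pods, 2, i, 1)
		fmt.Printf("activator %d -> pods %v, targetCapacity %d\n", i, assigned, capacity)
	}
	// activator 0 -> pods [pod0 pod1], targetCapacity 2
	// activator 1 -> pods [pod2], targetCapacity 1
}

With 3 pods, 2 activators and cc=1, this yields per-activator capacities of 2 and 1, which sum to the real capacity of 3.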
I guess to clarify - you mean the current implementation in this PR?
Yeah but be aware - I don't know the complexity of doing this ;)
Current implementation means the code on the main branch.
I've updated the PR. Thanks!
@Gekko0114 hi, could you elaborate on which part causes the problem with the pod not receiving traffic and why targetCapacity is 1? Is this because of:
where we have 3/2 = 1 and then one of the pods is not considered due to that capacity? So using numTrackers provides the real capacity cc*numTrackers=2. Is that the idea of the fix? Here is my understanding: a capacity of one means we have a maximum of 1 in-flight request at the breaker side, and given the firstAvailableLBPolicy each request is served by the first available pod. Probably everything happens too quickly and the second pod never gets a request while the current in-flight request is being served.
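To spell out the arithmetic behind that question (as I read the thread): with cc=1, 3 pods and 2 activators, dividing backends by activators in integer math gives 3/2 = 1, so the activator announces a capacity of 1 even though it holds 2 assigned trackers; computing cc * numTrackers = 1 * 2 = 2 instead matches the number of pods this activator can actually reach.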
Hi @skonto, Thanks for your comment!
Yes. The problem arises when
Yes, by setting
Just a few questions so I understand the implications.
// We shuffle the tail, to ensure that pods in the tail get better
// load distribution, since we sort the pods above, this puts more requests
// on the very first tail pod, than on the others.
rand.Shuffle(remnants, func(i, j int) {
	tail[i], tail[j] = tail[j], tail[i]
})
// assignSlice is not called for CC=0 case.
dcc := int(math.Ceil(float64(cc) / float64(numActivators)))
// This is basically: x = append(x, trackers[len(trackers)-remnants:]...)
// But we need to update the capacity.
for _, t := range tail {
	t.UpdateConcurrency(dcc)
Can you comment on the deletions here (in GitHub) and document the code a bit.
I need to validate my understanding. The way I read the change is we are no longer distributing tail elements across all activators and instead having activators pick up an additional element
eg.
- we have 20 pods and 3 activators -> we'd get 2 remnants, so the activators with index 0 and 1 would each pick up a unique tracker
- we have 24 pods and 5 activators -> we'd get 4 remnants, so activators 0, 1, 2 and 3 would each pick up a unique tracker
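A tiny sketch of the remnant rule described in the bullets above, under the assumption that leftovers are handed out by activator index (illustrative only, not the real assignSlice):

// remnantOwner reports whether the activator at activatorIndex picks up one of
// the leftover (remnant) trackers, following the rule in the bullets above.
func remnantOwner(numPods, numActivators, activatorIndex int) bool {
	return activatorIndex < numPods%numActivators
}

// remnantOwner(20, 3, i) is true for i = 0, 1        (20 % 3 = 2 remnants)
// remnantOwner(24, 5, i) is true for i = 0, 1, 2, 3  (24 % 5 = 4 remnants)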
What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost why the prior code called UpdateConcurrency on the tail trackers but not the picked indices?
What's the reason for dropping the t.UpdateConcurrency(dcc)
So the test change, where the total capacity changed from 15 to 20, highlighted that even though the tracker was shared amongst multiple activators before, the container concurrency was reduced for each tracker.
I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.
I wonder if there's actually nothing wrong with our capacity calculation and the problem is in the LB implementation - for cc=1 it uses firstAvailableLBPolicy
Thanks for your comments!
Can you comment on the deletions here (in GitHub) and document the code a bit.
Sure, I will add comments on the code.
I need to validate my understanding. The way I read the change is we are no longer distributing tail elements across all activators and instead having activators pick up an additional element
Yes, I think your understanding is the same as mine.
What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost why the code prior called UpdateConcurrency on the tail trackers but not the picked indicies?
I think the reason why the code prior called UpdateConcurrency was to ensure that the total capacity of each tail tracker across all activators matched the desired cc.
However, in this PR, each remnant tracker is now assigned to a unique activator. As a result, there is no need to call UpdateConcurrency anymore.
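As a concrete illustration of the shared case (my reading, with cc=1 and 2 activators): dcc = ceil(1/2) = 1, so both activators keep a local capacity of 1 for the same shared tail tracker and could, in the worst case, send 2 concurrent requests to a pod configured for 1. Once each remnant tracker belongs to exactly one activator, that adjustment has nothing left to do.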
I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.
I wonder if there's actually nothing wrong with our capacity calculation and the problem is in the LB implementation - for cc=1 it uses firstAvailableLBPolicy
While the call to UpdateConcurrency here may not function as intended, the root cause of this issue is not there. Instead, the activator's own capacity calculation is the cause.
If I have misunderstood anything or if you have further questions, please let me know.
If I have misunderstood anything or if you have further questions, please let me know.
Bear with me - I'm also learning the code, so I'm walking through it to make sure I understand the implications of your changes.
If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.
When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible
That's why I was wondering: if we're still including excess destinations in all activators (and reducing the container concurrency), it seems like there's an issue with the load balancing.
With the old code I added a test to see the output in your original situation (cc=1, pods=5, activators=2)
func TestBlah(t *testing.T) {
	logger := TestLogger(t)
	cc := 1
	numActivators := 2
	for i := 0; i < numActivators; i++ {
		rt := &revisionThrottler{
			logger:               logger,
			breaker:              queue.NewBreaker(testBreakerParams),
			containerConcurrency: cc,
		}
		rt.numActivators.Store(int32(numActivators))
		rt.activatorIndex.Store(int32(i))
		rt.podTrackers = makeTrackers(5, 1)
		rt.breaker = newInfiniteBreaker(logger)
		rt.updateCapacity(5)
		got := rt.breaker.Capacity()
		t.Logf("Activator breaker capacity = %d", got)
		for _, tracker := range rt.assignedTrackers {
			t.Logf("dest: %s, Capacity = %d", tracker.dest, tracker.Capacity())
		}
	}
}
This gives the output
=== RUN TestBlah
logger.go:130: 2023-06-09T11:28:39.139-0400 DEBUG net/throttler.go:300 Trackers 0/2: assignment: [0 1 4]
logger.go:130: 2023-06-09T11:28:39.139-0400 INFO net/throttler.go:318 Set capacity to 2 (backends: 5, index: 0/2)
throttler_test.go:245: Activator breaker capacity = 1
throttler_test.go:248: dest: 0, Capacity = 1
throttler_test.go:248: dest: 1, Capacity = 1
throttler_test.go:248: dest: 4, Capacity = 1
logger.go:130: 2023-06-09T11:28:39.140-0400 DEBUG net/throttler.go:300 Trackers 1/2: assignment: [2 3 4]
logger.go:130: 2023-06-09T11:28:39.140-0400 INFO net/throttler.go:318 Set capacity to 2 (backends: 5, index: 1/2)
throttler_test.go:245: Activator breaker capacity = 1
throttler_test.go:248: dest: 2, Capacity = 1
throttler_test.go:248: dest: 3, Capacity = 1
throttler_test.go:248: dest: 4, Capacity = 1
--- PASS: TestBlah (0.00s)
So in theory dest 4 should be receiving requests from both activators which would result in some requests being queued at the queue proxy.
Given this, I wondered why we are not receiving traffic, and I think it's because of the load balancing policy, which just iterates over the assigned trackers and picks the first one that has capacity.
serving/pkg/activator/net/throttler.go
Lines 182 to 185 in 5a90438
case containerConcurrency <= 3:
	// For very low CC values use first available pod.
	revBreaker = queue.NewBreaker(breakerParams)
	lbp = firstAvailableLBPolicy
serving/pkg/activator/net/throttler.go
Line 213 in 5a90438
return rt.lbPolicy(ctx, rt.assignedTrackers)
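For readers following along, a minimal sketch of the "first available" behaviour described above; the tracker struct and firstAvailable function here are illustrative only, not the real firstAvailableLBPolicy:

package main

import "fmt"

// tracker is a simplified stand-in for a pod tracker: a destination with a
// fixed capacity and a count of in-flight requests.
type tracker struct {
	dest     string
	capacity int
	inFlight int
}

// firstAvailable walks the assigned trackers in order and picks the first one
// with spare capacity, returning a callback that releases the slot.
func firstAvailable(trackers []*tracker) (*tracker, func()) {
	for _, t := range trackers {
		if t.inFlight < t.capacity {
			t.inFlight++
			return t, func() { t.inFlight-- }
		}
	}
	return nil, nil // no capacity anywhere
}

func main() {
	trackers := []*tracker{
		{dest: "pod0", capacity: 1},
		{dest: "pod1", capacity: 1},
		{dest: "pod4", capacity: 1},
	}
	// Two concurrent requests (matching the activator capacity of 2 seen in the
	// test output above), neither released yet: pod4 is never chosen.
	for i := 0; i < 2; i++ {
		t, _ := firstAvailable(trackers)
		fmt.Printf("request %d -> %s\n", i, t.dest)
	}
}

Because iteration always starts at the front of the slice, the tail tracker (pod4 here) only sees traffic when every earlier tracker is saturated, which matches the behaviour described in the original issue.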
Thus I don't think this change will solve your original problem. I think if you add artificial waits to your services handling requests you would eventually see all pods getting traffic. Or set your container concurrency to 4, which changes the load balancing to round robin.
So I think there's a separate fix to the original issue and we need to think about whether changing the assignment still makes sense (I think it does - just need to think about the implications a bit)
Thank you for your comment! I got what you meant.
Your opinion differs from my understanding, so let me explain.
When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible
This is not correct.
When CC >= 1, the infinite breaker is not used. Instead, queue.NewBreaker is called.
https://github.com/knative/serving/blob/main/pkg/activator/net/throttler.go#L179-L190
Let's consider the scenario where there are multiple activators and CC=1.
In the case of CC=1, the capacity of each pod becomes 1.
There is no mechanism for activators to share which pods they are sending requests to.
If both activators are assigned to all pods, there is a possibility that the total requests from all activators will exceed CC. Therefore, each activator is assigned to unique pods.
Thus I don't think this change will solve your original problem. I think if you add artificial waits to your services handling requests you would eventually see all pods getting traffic. Or set your container concurrency to 4, which changes the load balancing to round robin.
So I think there's a separate fix to the original issue and we need to think about whether changing the assignment still makes sense (I think it does - just need to think about the implications a bit)
As you pointed out, the issue can also be resolved by changing the load balancing mechanism to round-robin.
However, in my opinion, it would be better to fix the capacity of the activator itself. The problem in this issue is not the load balancing logic, but the capacity of the activators.
In the case of cc=1, pods=5, and activators=2:
Each activator has a capacity of 2.
The assignment of activators is as follows: activator1: pod0, pod1, pod4; activator2: pod2, pod3, pod4.
When the hey command is executed:
hey -t 0 -n 10000 -c 100 <address>
The activator is designed not to process requests beyond its capacity.
If a request is sent to activator1, it first tries to send the request to pod0. If pod0 has available capacity, the request is sent there. If pod0 doesn't have available capacity, the request is sent to pod1.
So what happens if neither pod0 nor pod1 can receive the request?
In this case, the activator has reached its capacity limit of 2, so it won't attempt to send the request to the pod in the first place.
When there is available capacity in pod0 or pod1 again, it will attempt to send the request to the pod. That's why pod4 wouldn't get any traffic from activator1 and activator2 (you can observe it by reproducing the issue).
Changing to round-robin will likely solve this bug as well. However, even with round-robin, the issue of activator capacity remains unresolved. Therefore, I believe it would be better to fix the activator's capacity itself, which is the modification in my PR.
Also, if we change to round-robin even when cc <= 3, there is no mechanism for multiple activators to share which pods they are sending requests to, so it will result in sending requests exceeding the capacity to pod4.
Sorry for long explanation.
If I have misunderstood anything or if you have further questions, please let me know.
Thanks for your comments! I've updated the PR and replied to your comments.
Good catch - I had a bad assumption. @Gekko0114 thanks for all the help and clarification here. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dprotaso, Gekko0114. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thank you so much for your review.
…ivatorCount and podTracker is different (knative#14022)
* fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTrackerCount are different
* each activator has correct capacity
* rebase
* update_docs
- Backport activator fixes from knative#14303 and knative#14347 from 1.12
- Add custom patches for logs and probe durations
- Update to go 1.20
- Add patch from knative#14022
- Add custom CI workflows
Fixes #12734
Issue

When cc=1 and the parity of activatorCount and podTracker is different, like the conditions below, requests are sent to some pods, but not all.

Solution

This issue occurs due to an incorrect calculation of targetCapacity.

Example:
If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

I've fixed this issue in the following two steps.

Step 1. Change the assignSlice logic

Actual:
If the number of pods is not divisible by the number of activators, any remaining pods are shared across all activators.

Expected:
Instead of this implementation, we can allocate one pod to each activator exclusively.

Step 2. Change calculateCapacity

With step 1 in place, we can calculate the targetCapacity as CC * the number of assigned pods.

I think this problem is specific to the cc=1 case. If cc is greater than 1, targetCapacity should be larger than assignedPods according to the current implementation.

Release Note