
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different #14022

Merged
4 commits merged into knative:main, Jun 16, 2023

Conversation

@Gekko0114 (Contributor) commented May 27, 2023

Fixes #12734

Issue

When cc=1 and the number of podTrackers is not evenly divisible by the activatorCount (as in the conditions below), requests are sent to only some of the pods, not all of them.

  • 2 activators
  • no istio sidecar injection
  • 5 replica servers
  • CC: 1
  • TBC : -1

Solution

This issue occurs due to an incorrect calculation of targetCapacity.
Example:
If podTracker=3, assignedTracker=2, and activatorCount=2, the expected targetCapacity is 2. However, the current implementation calculates targetCapacity as 1.
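
For illustration, here is a rough sketch of the pre-fix arithmetic (the helper name and formula mirror the ones quoted later in this thread; this is not the actual throttler.go code):

package main

import "fmt"

// minOneOrValue mirrors the helper quoted later in this thread: never return less than 1.
func minOneOrValue(v int) int {
	if v < 1 {
		return 1
	}
	return v
}

func main() {
	cc := 1          // containerConcurrency
	podTrackers := 3 // all pods known to the throttler
	activatorCount := 2

	// Pre-fix: the total capacity is divided across activators with integer
	// division, so the remainder is silently dropped.
	targetCapacity := minOneOrValue(cc * podTrackers / minOneOrValue(activatorCount))
	fmt.Println(targetCapacity) // 1, even though this activator has 2 pods assigned
}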

I've fixed this issue with the following two steps.

Step 1: Change the assignSlice logic

Current behavior (main branch):
If the number of pods is not divisible by the number of activators, the remaining pods are shared across all activators.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4, pod5

Expected behavior:
Instead, each remaining pod should be allocated exclusively to a single activator.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4
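
A rough sketch of this exclusive assignment (simplified to plain strings; the helper name is made up for illustration, and the real logic lives in assignSlice in pkg/activator/net/throttler.go):

package main

import "fmt"

// assignSliceSketch gives each activator a contiguous chunk of trackers and lets
// the first `remnants` activators each pick up exactly one tracker from the tail,
// so no tracker is shared between activators.
func assignSliceSketch(trackers []string, selfIndex, numActivators int) []string {
	n := len(trackers)
	sliceSize := n / numActivators
	remnants := n % numActivators

	assigned := append([]string{}, trackers[selfIndex*sliceSize:(selfIndex+1)*sliceSize]...)
	if selfIndex < remnants {
		// Each tail tracker goes to exactly one activator instead of all of them.
		assigned = append(assigned, trackers[n-remnants+selfIndex])
	}
	return assigned
}

func main() {
	pods := []string{"pod1", "pod2", "pod3", "pod4", "pod5"}
	for i := 0; i < 2; i++ {
		fmt.Printf("Activator%d: %v\n", i+1, assignSliceSketch(pods, i, 2))
	}
	// Activator1: [pod1 pod2 pod5]
	// Activator2: [pod3 pod4]
}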

Step 2: Change calculateCapacity

After step 1, targetCapacity can be calculated as CC * the number of assigned pods.

I think this problem is specific to the cc=1 case. If cc is greater than 1, targetCapacity is already larger than the number of assigned pods under the current implementation.
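
A tiny sketch of the step-2 calculation for the 5-pod / 2-activator example (illustrative only; the real change is in calculateCapacity):

package main

import "fmt"

func main() {
	cc := 1
	// After step 1 the assignment is exclusive:
	// activator0 -> {pod1, pod2, pod5}, activator1 -> {pod3, pod4}.
	assignedPods := []int{3, 2}

	for idx, n := range assignedPods {
		targetCapacity := cc * n // step 2: capacity is CC * assigned pods
		fmt.Printf("activator%d: targetCapacity=%d\n", idx, targetCapacity)
	}
	// The total admitted concurrency across both activators is 5,
	// matching 5 pods with CC=1.
}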

Release Note

NONE

@knative-prow knative-prow bot requested review from kauana and krsna-m May 27, 2023 14:10
@knative-prow knative-prow bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2023
@knative-prow bot commented May 27, 2023

Hi @Gekko0114. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Gekko0114 (Contributor Author) commented:

Hi @dprotaso,
I've fixed issue #12734. Could you take a look?

@nak3 nak3 added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2023
@Gekko0114 force-pushed the serving branch 2 times, most recently from e07af1a to 934751f on May 29, 2023 14:11
@codecov bot commented May 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.01 ⚠️

Comparison is base (bf48e64) 86.19% compared to head (1c867e4) 86.18%.

❗ Current head 1c867e4 differs from pull request most recent head 6979f5e. Consider uploading reports for the commit 6979f5e to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14022      +/-   ##
==========================================
- Coverage   86.19%   86.18%   -0.01%     
==========================================
  Files         199      199              
  Lines       14767    14757      -10     
==========================================
- Hits        12728    12718      -10     
  Misses       1735     1735              
  Partials      304      304              
Impacted Files Coverage Δ
pkg/activator/net/throttler.go 87.74% <100.00%> (-0.27%) ⬇️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@dprotaso (Member) commented May 29, 2023

If podTracker=3 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

Since we have two activators each assuming a targetCapacity of '2', the total assumed capacity is 4 (2 activators * 2 targetCapacity). This means the two activators will send more requests than we have capacity for, causing some queuing in our proxy.

It would be great if we could instead figure out how to split the capacity between the number of activators, e.g. in this instance the two activators would have capacities of 2 and 3.

Doing this might be complex though, so I think I'd rather answer the question: when each activator has a capacity of 1 and the total capacity is 3 (off by one), how come a pod is not receiving traffic?

@Gekko0114 (Contributor Author) commented:

@dprotaso
I see.

It would be great if we could instead figure out how to split the capacity between the number of activators, e.g. in this instance the two activators would have capacities of 2 and 3.

It shouldn't be difficult to adjust the capacities of the two activators to 2 and 3, respectively.

In the current implementation, if the number of pods is not divisible by the number of activators, any remaining pods are shared across all activators.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4, pod5

Instead of this implementation, we can allocate one pod to each activator exclusively.

Activator1: pod1, pod2, pod5
Activator2: pod3, pod4

By doing so, we can calculate the targetCapacity as CC * the number of assigned pods.

What do you think?

@dprotaso (Member) commented:

In the current implementation,

I guess to clarify - you mean the current implementation in this PR?

Instead of this implementation, we can allocate one pod to each activator exclusively.

Yeah but be aware - I don't know the complexity of doing this ;)

@Gekko0114 (Contributor Author) commented:

Current implementation means the code on the main branch.
Thanks, I will implement it as described above and ask for feedback from others.

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 31, 2023
@Gekko0114 (Contributor Author) commented:

I've updated the PR.
targetCapacity is now calculated using assignedTracker instead of podTracker.
This lets us assign a precise targetCapacity to each activator.
Please take a look.

Thanks,

@skonto (Contributor) commented May 31, 2023

If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

@Gekko0114 hi, could you elaborate on which part causes the problem with the pod not receiving traffic and why targetCapacity is 1?

Is this because of:

targetCapacity = minOneOrValue(targetCapacity / minOneOrValue(activatorCount))

where we have 3/2 = 1 and one of the pods is then not considered due to that capacity? So using numTrackers provides the real capacity, cc*numTrackers=2. Is that the idea of the fix?

Here is my understanding. A capacity of one means we have at most 1 in-flight request at the breaker side, and given the firstAvailableLBPolicy each request is served by the first available pod. Probably everything happens too quickly and the second pod never gets a request while the current in-flight request is being served.
So with the fix the invariant is that we will never have less capacity than assigned_pods*CC, correct?

@Gekko0114 (Contributor Author) commented Jun 1, 2023

Hi @skonto,

Thanks for your comment!

Is this because of:
targetCapacity = minOneOrValue(targetCapacity / minOneOrValue(activatorCount))
where we have 3/2 = 1 and then one of the pod is not considered due to that capacity?

Yes. The problem arises when targetCapacity is not divisible by activatorCount, resulting in a value smaller than the actual number of assigned pods.
Initially, I considered using the rounded-up value when it's not divisible, but I received feedback from @dprotaso that this is not a good approach. Therefore, I decided to use the actual number of assigned pods as the targetCapacity.
Actually, I think this approach is better than the existing code on the main branch, since assigned_pods reflects each activator's real capacity.

So with the fix the invariant is that we will never have less capacity than assigned_pods*CC, correct?

Yes. By setting targetCapacity to assigned_pods * CC, there should be no situation where some pods receive no requests or receive more requests than they can handle.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 5, 2023
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2023
@dprotaso (Member) left a review comment:

just a few questions so I understand the implications

Comment on lines -389 to -400
// We shuffle the tail, to ensure that pods in the tail get better
// load distribution, since we sort the pods above, this puts more requests
// on the very first tail pod, than on the others.
rand.Shuffle(remnants, func(i, j int) {
tail[i], tail[j] = tail[j], tail[i]
})
// assignSlice is not called for CC=0 case.
dcc := int(math.Ceil(float64(cc) / float64(numActivators)))
// This is basically: x = append(x, trackers[len(trackers)-remnants:]...)
// But we need to update the capacity.
for _, t := range tail {
t.UpdateConcurrency(dcc)
@dprotaso (Member) commented:

Can you comment on the deletions here (in GitHub) and document the code a bit.

I need to validate my understanding. The way I read the change, we are no longer distributing the tail elements across all activators; instead, some activators pick up an additional element.

eg.

  1. we have 20 pods and 3 activators -> we'd get 2 remnants so activator with index 0,1 would each pick up a unique tracker
  2. we have 24 pods and 5 activators -> we'd get 4 remnants so the activator 0,1,2,3 would each pick up a unique tracker
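
For those two examples the split works out as follows (a quick sketch with plain integers, not the actual assignSlice code):

package main

import "fmt"

func main() {
	cases := []struct{ pods, activators int }{
		{20, 3}, // 2 remnants: activators 0 and 1 each pick up one extra tracker
		{24, 5}, // 4 remnants: activators 0-3 each pick up one extra tracker
	}
	for _, c := range cases {
		sliceSize := c.pods / c.activators
		remnants := c.pods % c.activators
		fmt.Printf("pods=%d activators=%d -> base slice %d, remnants %d\n",
			c.pods, c.activators, sliceSize, remnants)
	}
}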

What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost as to why the prior code called UpdateConcurrency on the tail trackers but not on the picked indices.

@dprotaso (Member) commented:

What's the reason for dropping the t.UpdateConcurrency(dcc)

So the test change where the total capacity changed from 15 to 20 highlighted that even though a tracker was shared amongst multiple activators before, the container concurrency was reduced for each such tracker.

I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.

I wonder if there's actually nothing wrong with our capacity calculation and the problem is in the LB implementation instead: for cc=1 it uses firstAvailableLBPolicy.

@Gekko0114 (Contributor Author) commented Jun 8, 2023:

Thanks for your comments!

Can you comment on the deletions here (in GitHub) and document the code a bit.

Sure, I will add comments on the code.

I need to validate my understanding. The way I read the change is we are no longer distributing tail elements across all activators and instead having activators pick up an additional element

Yes, your understanding matches mine.

What's the reason for dropping the t.UpdateConcurrency(dcc) call? I'm a bit lost why the code prior called UpdateConcurrency on the tail trackers but not the picked indicies?

I think the reason the prior code called UpdateConcurrency was to ensure that the total capacity of each tail tracker, summed across all activators, matched the desired cc.
However, in this PR each remnant tracker is now assigned to a unique activator, so there is no need to call UpdateConcurrency anymore.

I'm surprised this triggered LB issues because that remnant pod tracker should still have a cc=1.

I wonder if there's nothing wrong with our capacity calculation but actually the LB implementation - for cc=1 it uses firstAvailableLBPolicy

While the UpdateConcurrency call here may not have functioned as intended, it is not the root cause of this issue. The root cause is the activator's own capacity calculation.

If I have misunderstood anything or if you have further questions, please let me know.

@dprotaso (Member) commented:

If I have misunderstood anything or if you have further questions, please let me know.

Bear with me, I'm also learning this code, so I'm walking through it to understand the implications of your changes.

If podTracker=3, assignedTracker=2 and activatorCount=2, the expected targetCapacity should be 2. However, in the current implementation targetCapacity is calculated as 1.

When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible

That's why I was wondering: if we're still including the excess destinations in all activators (and reducing their container concurrency), it seems like there's an issue with the load balancing.

With the old code I added a test to see the output in your original situation (cc=1, pods=5, activators=2):

func TestBlah(t *testing.T) {
	logger := TestLogger(t)
	cc := 1
	numActivators := 2

	// Build each activator's view of the revision: 5 pod trackers with cc=1.
	for i := 0; i < numActivators; i++ {
		rt := &revisionThrottler{
			logger:               logger,
			breaker:              queue.NewBreaker(testBreakerParams),
			containerConcurrency: cc,
		}
		rt.numActivators.Store(int32(numActivators))
		rt.activatorIndex.Store(int32(i))
		rt.podTrackers = makeTrackers(5, 1)
		rt.breaker = newInfiniteBreaker(logger)

		rt.updateCapacity(5)
		got := rt.breaker.Capacity()
		t.Logf("Activator breaker capacity = %d", got)

		// Log which pods this activator was assigned and each pod's capacity.
		for _, tracker := range rt.assignedTrackers {
			t.Logf("dest: %s, Capacity = %d", tracker.dest, tracker.Capacity())
		}
	}
}

This gives the output

=== RUN   TestBlah
    logger.go:130: 2023-06-09T11:28:39.139-0400 DEBUG   net/throttler.go:300    Trackers 0/2: assignment: [0 1 4]
    logger.go:130: 2023-06-09T11:28:39.139-0400 INFO    net/throttler.go:318    Set capacity to 2 (backends: 5, index: 0/2)
    throttler_test.go:245: Activator breaker capacity = 1
    throttler_test.go:248: dest: 0, Capacity = 1
    throttler_test.go:248: dest: 1, Capacity = 1
    throttler_test.go:248: dest: 4, Capacity = 1
    logger.go:130: 2023-06-09T11:28:39.140-0400 DEBUG   net/throttler.go:300    Trackers 1/2: assignment: [2 3 4]
    logger.go:130: 2023-06-09T11:28:39.140-0400 INFO    net/throttler.go:318    Set capacity to 2 (backends: 5, index: 1/2)
    throttler_test.go:245: Activator breaker capacity = 1
    throttler_test.go:248: dest: 2, Capacity = 1
    throttler_test.go:248: dest: 3, Capacity = 1
    throttler_test.go:248: dest: 4, Capacity = 1
--- PASS: TestBlah (0.00s)

So in theory dest 4 should be receiving requests from both activators which would result in some requests being queued at the queue proxy.

Given this, I wondered why that pod is not receiving traffic, and I think it's because of the load-balancing policy, which just iterates over the assigned trackers and picks the first one that has capacity:

case containerConcurrency <= 3:
// For very low CC values use first available pod.
revBreaker = queue.NewBreaker(breakerParams)
lbp = firstAvailableLBPolicy

return rt.lbPolicy(ctx, rt.assignedTrackers)
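
For readers following along, here is a simplified, self-contained sketch of how such a first-available policy behaves (the type and function below are stand-ins for illustration; the real policy operates on podTrackers and returns a release callback):

package main

import "fmt"

// tracker is a stand-in for the throttler's podTracker with a single slot (cc=1).
type tracker struct {
	dest  string
	inUse bool
}

// tryReserve claims the tracker's single slot if it is free.
func (t *tracker) tryReserve() bool {
	if t.inUse {
		return false
	}
	t.inUse = true
	return true
}

// firstAvailable walks the assigned trackers in order and picks the first one
// with spare capacity; later trackers are only reached when earlier ones are busy.
func firstAvailable(trackers []*tracker) *tracker {
	for _, t := range trackers {
		if t.tryReserve() {
			return t
		}
	}
	return nil
}

func main() {
	assigned := []*tracker{{dest: "pod0"}, {dest: "pod1"}, {dest: "pod4"}}
	fmt.Println(firstAvailable(assigned).dest) // pod0
	fmt.Println(firstAvailable(assigned).dest) // pod1 (pod0 is busy)
}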

Thus I don't think this change will solve your original problem. I think if you add artificial waits to your service's request handling you would eventually see all pods getting traffic. Or set your container concurrency to 4, which changes the load balancing to round robin.

So I think there's a separate fix for the original issue, and we need to think about whether changing the assignment still makes sense (I think it does, I just need to think about the implications a bit).

@Gekko0114 (Contributor Author) commented Jun 10, 2023:

Thank you for your comment! I see what you mean.
However, your understanding differs from mine, so let me explain.

When CC=1 the activator uses an infinite breaker which always reports a capacity of 1 when you update the breaker's capacity >= 1. If the capacity is 1 then it will let in as many requests as possible

This is not correct.
When CC >= 1, the infinite breaker is not used. Instead, queue.NewBreaker is called.
https://github.com/knative/serving/blob/main/pkg/activator/net/throttler.go#L179-L190

Let's consider the scenario where there are multiple activators and CC=1.
With CC=1, the capacity of each pod is 1.
There is no mechanism for activators to share which pods they are sending requests to.
If both activators were assigned all pods, the total number of requests from the two activators could exceed CC. Therefore, each activator is assigned a unique set of pods.
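
To make the breaker/policy selection concrete, here is a small summary program based on the snippet quoted earlier and the linked throttler.go lines (a paraphrase for illustration, not the actual source):

package main

import "fmt"

// breakerFor summarizes which breaker and LB policy the activator picks for a
// given containerConcurrency, per the discussion in this thread.
func breakerFor(cc int) string {
	switch {
	case cc == 0:
		return "infinite breaker (unlimited concurrency)"
	case cc <= 3:
		return "queue.NewBreaker + first-available LB policy"
	default:
		return "queue.NewBreaker + round-robin LB policy"
	}
}

func main() {
	for _, cc := range []int{0, 1, 3, 4} {
		fmt.Printf("cc=%d -> %s\n", cc, breakerFor(cc))
	}
}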

@Gekko0114 (Contributor Author) commented:

@dprotaso

Thus I don't think your change this will solve your original problem. I think if you add artificial waits to your services handling requests you would eventually see all pods getting traffic. Or set your container concurrency to 4 which changes the load balancing to round robin.

So I think there's a separate fix to the original issue and we need to think about whether changing the assignment still makes sense (I think it does - just need to think about the implications a bit)

As you pointed out, the issue could also be resolved by changing the load-balancing mechanism to round-robin.
However, in my opinion it is better to fix the activator's own capacity: the problem in this issue is not the load-balancing logic but the activators' capacity.

In the case of cc=1, pods=5, and activators=2:
Each activator has a capacity of 2.
The assignment of activators is as follows: activator1: pod0, pod1, pod4; activator2: pod2, pod3, pod4.

When the hey command is executed:
hey -t 0 -n 10000 -c 100 <address>
the activator is designed not to process requests beyond its capacity.
If a request is sent to activator1, it first tries to forward it to pod0. If pod0 has available capacity, the request is sent there; otherwise it is sent to pod1.
So what happens if neither pod0 nor pod1 can receive the request?
In that case the activator has already reached its capacity limit of 2, so it won't even attempt to send the request to a pod.
Only once pod0 or pod1 has available capacity again will it forward the request. That's why pod4 never gets any traffic from activator1 or activator2 (you can observe this by reproducing the issue).

Changing to round-robin would likely fix this bug as well. However, even with round-robin the problem with the activator's capacity remains unresolved, so I believe it is better to fix the activator's capacity itself, which is what this PR does.
Also, if we switched to round-robin even when cc <= 3, there would still be no mechanism for multiple activators to coordinate which pods they are sending requests to, so pod4 would end up receiving more requests than its capacity allows.

Sorry for long explanation.
If I have misunderstood anything or if you have further questions, please let me know.
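
To make the capacity argument above concrete, here is a toy simulation (not Knative code) of a single activator with capacity 2 doing a first-available pick over its three assigned pods; pod4 is never selected:

package main

import "fmt"

func main() {
	pods := []string{"pod0", "pod1", "pod4"}
	busy := map[string]bool{}
	hits := map[string]int{}
	var inFlight []string // FIFO of requests currently being served
	const activatorCapacity = 2

	for req := 0; req < 1000; req++ {
		// The activator never admits more than its capacity: a new request is
		// only let in once an in-flight one completes and frees its pod.
		if len(inFlight) == activatorCapacity {
			done := inFlight[0]
			inFlight = inFlight[1:]
			busy[done] = false
		}
		// First-available pick over the assigned trackers, in order.
		for _, p := range pods {
			if !busy[p] {
				busy[p] = true
				inFlight = append(inFlight, p)
				hits[p]++
				break
			}
		}
	}
	fmt.Println(hits) // pod0 and pod1 split all the requests; pod4 never appears
}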

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 8, 2023
@Gekko0114 (Contributor Author) commented:

@dprotaso,

Thanks for your comments! I've updated the PR and replied to your comments.

@dprotaso (Member) commented:

This is not correct. When CC >= 1, the infinite breaker is not used. Instead, queue.NewBreaker is called.

Good catch - I had a bad assumption.

@Gekko0114 thanks for all the help and clarification here.

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2023
@knative-prow bot commented Jun 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, Gekko0114

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2023
@knative-prow knative-prow bot merged commit f085b30 into knative:main Jun 16, 2023
61 checks passed
@Gekko0114 (Contributor Author) commented:

Thank you so much for your review.
I learned a lot from you!

@Gekko0114 Gekko0114 deleted the serving branch June 17, 2023 00:06
arsenetar pushed a commit to coreweave/serving that referenced this pull request Jul 3, 2023
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different (knative#14022)

* fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTrackerCount are different
* each activator has correct capacity
* rebase
* update_docs
skonto pushed a commit to skonto/serving that referenced this pull request Oct 5, 2023
fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTracker is different (knative#14022)

* fix: requests are sent to all pods even if cc=1 and the parity of activatorCount and podTrackerCount are different

* each activator has correct capacity

* rebase

* update_docs
arsenetar added a commit to coreweave/serving that referenced this pull request Dec 19, 2023
- Backport activator fixes from knative#14303 and knative#14347 from 1.12
- Add custom patches for logs and probe durations
- Update to go 1.20
- Add patch from knative#14022
- Add custom CI workflows
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/autoscale
area/networking
lgtm: Indicates that a PR is ready to be merged.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Activator LB doesn't work well
5 participants