Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pickfirst: Use one subConn per address #7384

Closed

Conversation

arjan-bal
Copy link
Contributor

@arjan-bal arjan-bal commented Jul 3, 2024

As part of the Dual Stack implementation, we would like to move the pickfirst algorithm out of the subConn and in the lb policy

We would like to take this opportunity to move toward a more uniform cross-language architecture. Also, moving pick_first up to the LB policy layer in Java and Go will have the nice effect of making their backoff work per-address instead of across all addresses, which is what C-core does and what the (poorly specified) connection backoff spec seems to have originally envisioned.

This change moves the algorithm into the lb policy and ensures the LB creates one subConn per address, but doesn't remove the the pickfirst logic from the subConn. Happy eyeballs is not implemented.

Inspiration is taken from the c-core implementation:

Other changes

  • Add an ExitIdle() method on stateRecordingBalancer. This is required as the fallback logic of re-connecting all the subConns no longer works. This is because the LB policy closes the active subConn once the connection is dropped and does a pick first over all the addresses when ExitIdle is called. When all the subConns are shutdown, re-connecting will have no effect.
  • Update tests with new assertions based on the new behaviour.

Behaviour Changes

  • Pickfirst now creates one subchannel per address.
  • An exponential back-off is performed per server address when the subchannel enters transient failure.
  • After every address has been tried one, pickfirst attempts to re-connect to the addresses in the order in which they complete their back-off, this is not deterministic.

RELEASE NOTES:

  • Pickfirst creates one subchannel per address.

@arjan-bal arjan-bal marked this pull request as ready for review July 3, 2024 14:55
@arjan-bal arjan-bal marked this pull request as draft July 3, 2024 14:56
Copy link

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 82.91457% with 34 lines in your changes missing coverage. Please review.

Project coverage is 81.22%. Comparing base (ee62e56) to head (cb21828).
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7384      +/-   ##
==========================================
- Coverage   81.50%   81.22%   -0.28%     
==========================================
  Files         348      350       +2     
  Lines       26752    26975     +223     
==========================================
+ Hits        21804    21911     +107     
- Misses       3766     3845      +79     
- Partials     1182     1219      +37     
Files Coverage Δ
balancer/pickfirst/pickfirst.go 82.21% <82.91%> (-1.53%) ⬇️

... and 31 files with indirect coverage changes

@arjan-bal arjan-bal closed this Jul 3, 2024
@arjan-bal arjan-bal reopened this Jul 3, 2024
@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Type: Behavior Change Behavior changes not categorized as bugs labels Jul 3, 2024
@arjan-bal arjan-bal added this to the 1.66 Release milestone Jul 3, 2024
@arjan-bal arjan-bal requested a review from easwars July 3, 2024 19:15
@arjan-bal arjan-bal marked this pull request as ready for review July 3, 2024 19:17
@arjan-bal arjan-bal requested a review from dfawley July 3, 2024 19:18
@arjan-bal arjan-bal assigned arjan-bal and unassigned easwars Jul 8, 2024
@arjan-bal arjan-bal force-pushed the pickfirst_one_address_per_subconn branch from f5d8242 to 88ec4ab Compare July 8, 2024 21:13
@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jul 8, 2024
balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
})
if err != nil {
if b.logger.V(2) {
b.logger.Infof("Ignoring failure, could not create a subConn for address %q due to error: %v", addr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it safe to ignore the error here? In our existing code, if NewSubConn fails, we set the connectivity state to TransientFailure, return a picker that always returns an error (and includes the error returned from NewSubConn), and we also return balancer.ErrBadResolverState.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what c-core seems to be doing. It throws an error only when we can't create a subConn for all the addresses: https://github.com/grpc/grpc/blob/a3a91a5a3cb6136d8a8d74e87b395f0ab450fd63/src/core/load_balancing/pick_first/pick_first.cc#L1017C5-L1026

I'll change it to fail fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking a little more, we use the next address when a connection to an address fails. By ignoring errors during creation of a subchannel for an address, we're treating the failure as a similar issue to a connection failure and using another address.
If we return an error early, we would miss the chance of using another address that we could create a subchannel for and use.

balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
Comment on lines 415 to 424
if len(b.subConnList.subConns) == 0 {
b.state = connectivity.TransientFailure
b.cc.UpdateState(balancer.State{
ConnectivityState: connectivity.TransientFailure,
Picker: &picker{err: fmt.Errorf("empty address list")},
})
b.unsetSelectedSubConn()
b.subConnList.close()
return balancer.ErrBadResolverState
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as an error handling step that should have happened when creating the subConnList ran into errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the error handling logic into the refreshSubConnList method. I can't move it into the newSubConnList method since its used to create an empty subConnList in the builder where is it expected for the list to be empty initially.

balancer/pickfirst/pickfirst.go Outdated Show resolved Hide resolved
}
b.subConnList.startConnectingIfNeeded()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually work?

We return early from startConnectingIfNeeded if state is not subConnListPending.

Copy link
Contributor Author

@arjan-bal arjan-bal Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In goIdle() we create a subConnList but don't call startConnectingIfNeeded until the idlePicker is called. So the subConnList will remain in pending. If ExitIdle() is called before the picker is used, this will cause the connection attempt to begin.

Comment on lines 151 to 153
if sl == nil {
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some subConnList methods have a nil receiver check while others don't. When is it possible/legal to call methods on a nil subConnList?

Copy link
Contributor Author

@arjan-bal arjan-bal Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subConnList is nil only when the balancer hasn't received an update from the resolver yet. I've ensured that we set an empty subConnList in the builder and removed the nil receiver checks.

@easwars easwars assigned arjan-bal and unassigned easwars Jul 8, 2024
@arjan-bal arjan-bal force-pushed the pickfirst_one_address_per_subconn branch from 2f937d0 to ff84677 Compare July 12, 2024 10:00
@arjan-bal arjan-bal requested a review from easwars July 12, 2024 12:31
@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jul 12, 2024
@arjan-bal arjan-bal force-pushed the pickfirst_one_address_per_subconn branch from cf2ee3a to 60bfbdd Compare July 13, 2024 08:32
@arjan-bal
Copy link
Contributor Author

Closing while I figure out the correct sequence of making of the changes for Dual Stack.

@arjan-bal arjan-bal closed this Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Behavior Change Behavior changes not categorized as bugs Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants