lrs: add Store.Stats() to report loads for multiple clusters #3905
Conversation
…lusters
- unexport PerClusterStore.Stats()
- add Store.Stats(clusterNames) to report loads for the given clusters
xds/internal/client/load/store.go
Outdated
// a slice with no specific order.
//
// If no clusterName is given (an empty slice), all data for all known clusters
// are returned.
s/are/is/
Done
xds/internal/client/load/store.go
Outdated
// If no clusterName is given (an empty slice), all data for all known clusters
// are returned.
//
// If a cluster's Data is empty (no load to report), it's not append to the
s/append/appended/
Done
xds/internal/client/load/store.go
Outdated
func appendClusterStats(ret []*Data, cluster map[string]*perClusterStore) []*Data {
	for _, d := range cluster {
		data := d.stats()
		if data.TotalDrops == 0 && len(data.Drops) == 0 && len(data.LocalityStats) == 0 {
How about adding a method on `Data` to return whether it's empty?
I changed `d.stats()` to return nil if the data doesn't contain any information.
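For illustration, a minimal sketch of that shape (stand-in types so it compiles on its own; the real per-cluster counter bookkeeping is elided): the nil return is what lets the caller drop the explicit field checks.

```go
// Stand-in types mirroring the fields referenced in this thread.
type LocalityData struct{}

type Data struct {
	TotalDrops    uint64
	Drops         map[string]uint64
	LocalityStats map[string]LocalityData
}

type perClusterStore struct {
	// accumulated counters elided
}

// stats snapshots the accumulated load and returns nil when there is nothing
// to report, so Store.Stats can simply skip nil entries instead of inspecting
// every field at the call site.
func (p *perClusterStore) stats() *Data {
	data := &Data{
		Drops:         make(map[string]uint64),
		LocalityStats: make(map[string]LocalityData),
	}
	// ...copy drop and per-locality counters from p into data...
	if data.TotalDrops == 0 && len(data.Drops) == 0 && len(data.LocalityStats) == 0 {
		return nil
	}
	return data
}
```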
// the new slice.
//
// Data is only appended to ret if it's not empty.
func appendClusterStats(ret []*Data, cluster map[string]*perClusterStore) []*Data {
This could be a minor thing. I feel that instead of `appendClusterStats`, which takes a slice, appends to it, and returns the appended slice, we could simply have `fetchClusterStats`, which just returns a slice of `Data` for the specified cluster. The caller can do what it wants with the returned slice.
Or another option would be to get rid of this function and fold it into `Stats`. And in there, first figure out the actual list of cluster names we are interested in (either all of them, or the provided list minus the ones for which there is no entry in the `clusters` map).
I didn't find a cleaner way to remove this function while handling the special case `len(clusters) == 0`.
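One way this can look, keeping the empty-slice special case out of the append helper, is sketched below. This is illustrative only: the `Store` layout (cluster name → service name → `*perClusterStore`), the `sync` import, and the `Data`/`perClusterStore` types are assumed from the surrounding diff, not copied from the PR.

```go
// Assumed layout: outer map keyed by cluster name, inner map keyed by service.
type Store struct {
	mu       sync.Mutex
	clusters map[string]map[string]*perClusterStore
}

// Stats returns the load data for the given clusters, or for all known
// clusters when clusterNames is empty. Empty per-cluster data is skipped.
func (s *Store) Stats(clusterNames []string) []*Data {
	var ret []*Data
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(clusterNames) == 0 {
		for _, c := range s.clusters {
			ret = appendClusterStats(ret, c)
		}
		return ret
	}
	for _, name := range clusterNames {
		if c, ok := s.clusters[name]; ok {
			ret = appendClusterStats(ret, c)
		}
	}
	return ret
}

// appendClusterStats appends the non-nil stats of every per-cluster store in
// cluster to ret and returns the extended slice.
func appendClusterStats(ret []*Data, cluster map[string]*perClusterStore) []*Data {
	for _, p := range cluster {
		if data := p.stats(); data != nil {
			ret = append(ret, data)
		}
	}
	return ret
}
```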
xds/internal/client/load/store.go
Outdated
	clusters map[storeKey]*PerClusterStore
	// mu only protects the map (2 layers). The read/write to *perClusterStore
	// doesn't need to hold the mu.
	mu sync.RWMutex
Is a read-write mutex really required here?
I see that the `loadStoreWrapper` in `xds_client_wrapper.go` actually invokes the `PerCluster` method for each of its `Call*` methods. Could we change that to instead store the `PerClusterReporter` locally as part of its `update` method?
Updated the caller to store `PerClusterReporter` locally.
Kept the RWMutex, because it could still be used in some cases.
The RWMutex just makes the `PerCluster()` method harder to read. Unless there is a good reason for using it, we probably shouldn't.
Changed to just `Mutex`. Since `perCluster` is cached, we won't need to take this lock often.
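Under that decision, `PerCluster` might read roughly as below: a plain `sync.Mutex` around the two-layer map, taken only on the (rare) lookups since callers cache the returned store. The `Store` fields follow the sketch above and are assumptions, not the PR's exact code.

```go
// PerCluster returns the perClusterStore for (clusterName, serviceName),
// creating the two map layers on first use. The whole lookup is guarded by a
// plain mutex; callers are expected to cache the returned value.
func (s *Store) PerCluster(clusterName, serviceName string) *perClusterStore {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.clusters[clusterName]
	if !ok {
		c = make(map[string]*perClusterStore)
		s.clusters[clusterName] = c
	}
	if p, ok := c[serviceName]; ok {
		return p
	}
	p := &perClusterStore{}
	c[serviceName] = p
	return p
}
```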
@@ -87,16 +145,20 @@ func (ls *Store) PerCluster(clusterName, serviceName string) *PerClusterStore {
 // RWMutex.
 // Neither of these conditions are met here, and we should transition to a
 // regular map with a mutex for better type safety.
-type PerClusterStore struct {
+type perClusterStore struct {
 	cluster, service string
Storing things like `cluster`, `service`, and `lastLoadReportAt` does not seem right here. I find no reason for the `perClusterStore` or the returned `Data` to contain these things. These should ideally be kept track of by the client which is eventually reporting stuff to the LRS server.
This is just for convenience. `Store` already has a map like what you said, but we would need another wrapper to keep `lastLoadReportAt`.
Another difference is whether `Store.Stats()` returns a map (`map[cluster]map[service]*Data`) or a list of key-value pairs (`[]*Data`).
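To make the trade-off concrete: with a flat `[]*Data` return, each element has to carry its own identity, roughly as in the sketch below. The field names and comments are assumptions based on this thread (a `LocalityData` type and the `time` package are assumed), not the final API.

```go
// Data is one entry of the flat slice returned by Store.Stats. Because the
// slice has no keys, the cluster/service identity travels with the data; with
// a nested map[cluster]map[service]*Data return, these two fields would live
// in the map keys instead.
type Data struct {
	Cluster        string                  // cluster this snapshot is for
	Service        string                  // EDS service this snapshot is for
	TotalDrops     uint64                  // total number of dropped requests
	Drops          map[string]uint64       // drops per category
	LocalityStats  map[string]LocalityData // per-locality load
	ReportInterval time.Duration           // time since the last report for this pair, derived from lastLoadReportAt
}
```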
// with service name as the key. Each value (perClusterStore) contains data
// for a (cluster, service) pair.
//
// Note that new entries are added to this map, but never removed. This is
Is there a way to avoid this? Just like how the users of the `xdsClient` need to cancel their watches, they are also required to call the `cancel` function returned from `ReportLoad`. Could we use that somehow?
There's no "correct" time to cancel reporting for a cluster,service
pair. Because even when cluster,service
is updated in the parent balancer, we cannot cancel reporting for the old pair until we know the sub-balancers has processed the new addresses and updated pickers to stop using the old subconns.
This is impossible with the current structure with sub-balancers. But will be possible when we pass attach ``cluster,service` to each address, and do all load reporting work inside CDS.
We may have a way to do the cleanup when CDS does all the tracking and load reporting.
//
// If a cluster's Data is empty (no load to report), it's not append to the
// returned slice.
func (s *Store) Stats(clusterNames []string) []*Data {
I'm wondering if this top-level `Store` should export any methods at all, except for the ones to retrieve the per-cluster store.
Also, I'm wondering whether we should fix the xdsClient API to support reporting loads to multiple clusters in a better way. Things off the top of my head (without a lot of serious thought):
- `ReportLoad`, which takes `(cluster, service)`, should return the `PerClusterStore` and a `cancel` function
- Callers use the returned `PerClusterStore` to report loads, and eventually invoke the `cancel` function
- Internally, the `xdsClient` keeps track of every `cluster+service` combination for which it is reporting load, and the associated `PerClusterStore`
- When it is time to report load (every `loadReporting` interval), it traverses the list of `PerClusterStore`, fetches stats from them, creates the load report, and sends it to the server (a rough sketch of this shape follows the list)
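A sketch of that shape, purely illustrative and not the actual xdsClient API: the type name, the keying scheme, and the lack of reference counting are all assumptions made only to show the proposed flow.

```go
// clientLoadReporting is a stand-in for the load-reporting bookkeeping that
// would live inside the xdsClient under this proposal.
type clientLoadReporting struct {
	mu     sync.Mutex
	active map[string]*PerClusterStore // keyed by "cluster/service"
}

// ReportLoad returns the PerClusterStore for (cluster, service) plus a cancel
// func. The client keeps every active pair so its report loop can walk them
// each interval and build the LRS request. Real code would also need
// reference counting for repeated registrations of the same pair.
func (c *clientLoadReporting) ReportLoad(cluster, service string) (*PerClusterStore, func()) {
	key := cluster + "/" + service
	c.mu.Lock()
	defer c.mu.Unlock()
	p, ok := c.active[key]
	if !ok {
		p = &PerClusterStore{}
		c.active[key] = p
	}
	return p, func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		delete(c.active, key)
	}
}
```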
This is similar to what I have in the follow-up PR, see https://github.com/menghanl/grpc-go/pull/43/files#diff-5f29997be1764e4c5a06ac2f4d3d1bb7R51
Differences:
- `ReportLoad` doesn't take `cluster, service`. It only takes the server address. The store for each server address can be reused to get the per-cluster load reporter
- There's no "correct" time to cancel the load report for the `cluster, service` combo. But we can just stop reporting to the old server when EDS contains a different server name.
}

func (lsw *loadStoreWrapper) update(store *load.Store, service string) {
	lsw.mu.Lock()
	defer lsw.mu.Unlock()
	if store == lsw.store && service == lsw.service {
Not convinced we need this check here and in lrs/balancer.go.
Do you mean we don't need this check? Or that we should move this check somewhere else?
The purpose of this check is to use the cached `perCluster` to avoid taking the lock inside `store`. This update can happen frequently, with the same parameters.
I thought we didn't need this check because I assumed this update would not happen frequently. Why would this update happen frequently with the same parameters?
This is called on each received CDS response (not that frequently compared to RPCs).
A normal unnecessary update (from LRS's point of view) would be: something else in the CDS response (the clusters) changed, but the lrs-server string didn't.
A not-so-normal unnecessary update is when the server sends the same CDS response multiple times.
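Concretely, the caching described in this thread might look like the sketch below: `update` only refreshes the cached reporter when the inputs actually change, and the per-RPC path just reads the cache. The field names, the `load.PerClusterReporter` type for the cached value, the `cluster` field, and the `CallDropped` hot path are assumptions for illustration.

```go
type loadStoreWrapper struct {
	mu         sync.RWMutex
	cluster    string // assumed: the cluster name this wrapper reports for
	service    string
	store      *load.Store
	perCluster load.PerClusterReporter // cached reporter, refreshed only in update
}

// update re-resolves the cached per-cluster reporter only when the inputs
// change. It runs on every CDS response, which often repeats the same values,
// so the early return avoids re-taking the Store lock for nothing.
func (lsw *loadStoreWrapper) update(store *load.Store, service string) {
	lsw.mu.Lock()
	defer lsw.mu.Unlock()
	if store == lsw.store && service == lsw.service {
		return
	}
	lsw.store = store
	lsw.service = service
	lsw.perCluster = nil
	if lsw.store != nil {
		lsw.perCluster = lsw.store.PerCluster(lsw.cluster, lsw.service)
	}
}

// CallDropped shows the payoff: the per-RPC path only reads the cached value.
func (lsw *loadStoreWrapper) CallDropped(category string) {
	lsw.mu.RLock()
	defer lsw.mu.RUnlock()
	if lsw.perCluster != nil {
		lsw.perCluster.CallDropped(category)
	}
}
```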
xds/internal/client/load/store.go
Outdated
 	ls.mu.Lock()
 	defer ls.mu.Unlock()
-	p, ok := ls.clusters[k]
+	p, ok := c[serviceName]
Maybe:

	if p, ok := c[serviceName]; ok {
		return p
	}
	p = &perClusterStore{ ... }
	c[serviceName] = p
	return p
Done.
- unexport `perClusterStore` and its `stats()`
- add `Store.Stats(clusterNames)` to report loads for the given clusters
- move `lastLoadReportAt` from the client to the load store, because a client can now have multiple clusters, each with a different `lastLoadReportAt`
- `ReportInterval` when comparing `Data`