
lrs: add Store.Stats() to report loads for multiple clusters #3905

Merged
menghanl merged 4 commits into grpc:master from lrs_stream on Oct 7, 2020

Conversation

menghanl (Contributor):

  • unexport perClusterStore and its stats()
  • add Store.Stats(clusterNames) to report loads for the given clusters
    • refactor the store's map into a two-layer map
  • move lastLoadReportAt from the client to the load store, because a client can now have multiple clusters, each with a different lastLoadReportAt
    • all tests now ignore ReportInterval when comparing Data
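
For orientation, a minimal sketch of the two-layer map this PR introduces (field names are assumed for illustration and may differ from the merged code):

// Store aggregates loads for multiple (cluster, service) pairs.
type Store struct {
	// mu protects the two-layer map; reads and writes to the individual
	// *perClusterStore values use atomics and don't need to hold mu.
	mu sync.Mutex
	// clusters is keyed by cluster name; each inner map is keyed by service name.
	clusters map[string]map[string]*perClusterStore
}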

…lusters

- unexport PerClusterStore.Stats()
- add Store.Stats(clusterNames) to report loads for the given clusters
@menghanl added the "no release notes" and "Type: Feature (New features or improvements in behavior)" labels on Sep 25, 2020
// a slice with no specific order.
//
// If no clusterName is given (an empty slice), all data for all known clusters
// are returned.
Contributor:

s/are/is/

Contributor Author:

Done

// If no clusterName is given (an empty slice), all data for all known clusters
// are returned.
//
// If a cluster's Data is empty (no load to report), it's not append to the
Contributor:

s/append/appended/

Contributor Author:

Done

func appendClusterStats(ret []*Data, cluster map[string]*perClusterStore) []*Data {
	for _, d := range cluster {
		data := d.stats()
		if data.TotalDrops == 0 && len(data.Drops) == 0 && len(data.LocalityStats) == 0 {
Contributor:

how about adding a method on Data to return whether it's empty?

Contributor Author:

I changed d.stats() to return nil if data doesn't contain any information.
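
Roughly what that looks like (a sketch based on this comment; the Cluster/Service fields on Data are assumed for illustration):

// stats returns a snapshot of the loads collected since the last call, or nil
// if the snapshot carries no information, so callers can skip empty entries.
func (ls *perClusterStore) stats() *Data {
	sd := &Data{Cluster: ls.cluster, Service: ls.service}
	// ... swap out and copy drop and per-locality counters into sd ...
	if sd.TotalDrops == 0 && len(sd.Drops) == 0 && len(sd.LocalityStats) == 0 {
		return nil
	}
	return sd
}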

// the new slice.
//
// Data is only appended to ret if it's not empty.
func appendClusterStats(ret []*Data, cluster map[string]*perClusterStore) []*Data {
Contributor:

This could be a minor thing. I feel that instead of appendClusterStats, which takes a slice, appends to it, and returns the appended slice, we could simply have a fetchClusterStats which just returns a slice of Data for the specified cluster. The caller can do what it wants with the returned slice.

Or another option would be to get rid of this function and fold it into Stats. And in there, first figure out the actual list of cluster names we are interested in (either all, or the provided list minus the ones for which there is no entry in the clusters map).

Contributor Author:

I didn't find a cleaner way to remove this function while handling the special case of len(clusters) == 0.
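
For context, the special case can be handled roughly like this (a sketch of the approach being discussed, not a verbatim copy of the final code):

func (s *Store) Stats(clusterNames []string) []*Data {
	var ret []*Data
	s.mu.Lock()
	defer s.mu.Unlock()
	// An empty clusterNames means "report everything we know about".
	if len(clusterNames) == 0 {
		for _, c := range s.clusters {
			ret = appendClusterStats(ret, c)
		}
		return ret
	}
	// Otherwise, only report the clusters that were asked for and exist.
	for _, name := range clusterNames {
		if c, ok := s.clusters[name]; ok {
			ret = appendClusterStats(ret, c)
		}
	}
	return ret
}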

clusters map[storeKey]*PerClusterStore
// mu only protects the map (2 layers). The read/write to *perClusterStore
// doesn't need to hold the mu.
mu sync.RWMutex
Contributor:

Is a read-write mutex really required here?

I see that the loadStoreWrapper in xds_client_wrapper.go actually invokes the PerCluster method for each of its Call* methods. Could we change that to instead store the PerClusterReporter locally as part of its update method?

Contributor Author:

Updated the caller to store PerClusterReporter locally.

Kept the RWMutex, because it could still be used in some cases.

Contributor:

The RWMutex just makes the PerCluster() method harder to read. Unless there is a good reason for using it, we probably shouldn't.

Contributor Author:

Changed to a plain Mutex. Since perCluster is cached, we won't need to take this lock often.

@@ -87,16 +145,20 @@ func (ls *Store) PerCluster(clusterName, serviceName string) *PerClusterStore {
// RWMutex.
// Neither of these conditions are met here, and we should transition to a
// regular map with a mutex for better type safety.
type PerClusterStore struct {
type perClusterStore struct {
cluster, service string
Contributor:

Storing things like cluster, service and lastLoadReportAt does not seem right here. I find no reason for the perClusterStore or the returned Data to contain these things. These should ideally be kept track of by the client, which is eventually reporting stuff to the LRS server.

Contributor Author:

This is just for convenience. Store already has a map like the one you describe, but we would need another wrapper to keep lastLoadReportAt.

Another difference is whether Store.Stats() returns a map (map[cluster]map[service]*Data), or a list of key-value pairs ([]*Data).
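
To make the flat []*Data option self-describing, each Data entry carries the pair it belongs to, roughly along these lines (a sketch; field names are assumed for illustration, and LocalityData is defined elsewhere in the package):

// Data contains all load data reported for a single (cluster, service) pair
// since the previous snapshot.
type Data struct {
	// Cluster and Service identify which pair this Data belongs to.
	Cluster, Service string
	// TotalDrops is the total number of dropped requests.
	TotalDrops uint64
	// Drops is the number of dropped requests per drop category.
	Drops map[string]uint64
	// LocalityStats is the per-locality load data, keyed by locality.
	LocalityStats map[string]LocalityData
	// ReportInterval is the duration since the previous report; tests in this
	// PR ignore it when comparing Data.
	ReportInterval time.Duration
}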

// with service name as the key. Each value (perClusterStore) contains data
// for a (cluster, service) pair.
//
// Note that new entries are added to this map, but never removed. This is
Contributor:

Is there a way to avoid this? Just like how the users of the xdsClient need to cancel their watches, they are also required to call the cancel function returned from ReportLoad. Could we use that somehow?

Contributor Author:

There's no "correct" time to cancel reporting for a cluster,service pair. Because even when cluster,service is updated in the parent balancer, we cannot cancel reporting for the old pair until we know the sub-balancers has processed the new addresses and updated pickers to stop using the old subconns.
This is impossible with the current structure with sub-balancers. But will be possible when we pass attach ``cluster,service` to each address, and do all load reporting work inside CDS.

We may have a way to do the cleanup when CDS does all the tracking and load reporting.

//
// If a cluster's Data is empty (no load to report), it's not append to the
// returned slice.
func (s *Store) Stats(clusterNames []string) []*Data {
Contributor:

I'm wondering if this top-level Store should export any methods at all, except for the ones to retrieve the per-cluster store.

Also, I'm wondering whether we should fix the xdsClient API to support reporting loads to multiple clusters in a better way. Things off the top of my head (without a lot of serious thought):

  • ReportLoad which takes (cluster, service) should return the PerClusterStore and a cancel function
  • Callers use the returned PerClusterStore to report loads, and eventually invoke the cancel function
  • Internally, the xdsClient keeps track of every cluster+service combination for which it is reporting load, and the associated PerClusterStore
  • When it is time to report load (every load reporting interval), it traverses the list of PerClusterStore, fetches stats from each, creates the load report, and sends it to the server.
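
An illustrative shape for that proposal (hypothetical; this is not the actual xdsClient API):

// loadReporter sketches the proposed xdsClient surface for load reporting.
type loadReporter interface {
	// ReportLoad starts load reporting for the given (cluster, service) pair.
	// It returns the reporter to record loads on, and a cancel function that
	// the caller invokes when reporting for that pair should stop.
	ReportLoad(cluster, service string) (reporter load.PerClusterReporter, cancel func())
}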

Contributor Author:

This is similar to what I have in the follow up PR, see https://github.com/menghanl/grpc-go/pull/43/files#diff-5f29997be1764e4c5a06ac2f4d3d1bb7R51

Difference:

  • ReportLoad doesn't take cluster, service. It only takes the server address. The store for each server address can be reused to get the per-cluster load reporter
  • There's no "correct" time to cancel the load report for the cluster, service combo. But we can just stop reporting to the old server when EDS contains a different server name.

@easwars assigned menghanl and unassigned easwars on Sep 29, 2020
@dfawley changed the title from "lrs: add Stats() to store to report loads for multiple clusters" to "lrs: add Stats() to Store to report loads for multiple clusters" on Oct 1, 2020
@dfawley assigned easwars and unassigned menghanl on Oct 1, 2020
}

func (lsw *loadStoreWrapper) update(store *load.Store, service string) {
	lsw.mu.Lock()
	defer lsw.mu.Unlock()
	if store == lsw.store && service == lsw.service {
Contributor:

Not convinced we need this check here and in lrs/balancer.go.

Contributor Author:

Do you mean we don't need this check? Or we should move this check somewhere else?

The purpose of this check is to use the cached perCluster to avoid getting the lock inside store. This update can happen frequently, with the same parameters.

Contributor:

I thought we didn't need this check because I assumed this update would not happen frequently. Why would this update happen frequently with the same parameters?

Contributor Author:

This is called on each received CDS response. (Not that frequent compared to RPCs.)

A normal unnecessary (from LRS's point of view) update would be: something else in the CDS response (the clusters) changed, but the lrs-server string didn't.
A not-so-normal unnecessary update is when the server sends the same CDS response multiple times.
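
A sketch of the check and the caching being described (simplified; the real wrapper has more fields, and lsw.cluster is assumed to be set elsewhere):

func (lsw *loadStoreWrapper) update(store *load.Store, service string) {
	lsw.mu.Lock()
	defer lsw.mu.Unlock()
	// Skip the work (and the lock inside the store) when nothing relevant
	// changed; this happens whenever a CDS update leaves the LRS config as-is.
	if store == lsw.store && service == lsw.service {
		return
	}
	lsw.store = store
	lsw.service = service
	lsw.perCluster = nil
	if lsw.store != nil {
		// Cache the per-cluster reporter so the Call* methods don't have to
		// look it up on every RPC.
		lsw.perCluster = lsw.store.PerCluster(lsw.cluster, lsw.service)
	}
}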


ls.mu.Lock()
defer ls.mu.Unlock()
p, ok := ls.clusters[k]
p, ok := c[serviceName]
Contributor:

Maybe:

if p, ok := c[serviceName]; ok {
  return p
}
p = &perClusterStore { ... }
c[serviceName] = p
return p

Contributor Author:

Done.
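
After this change, the two-layer lookup reads roughly as follows (a sketch, not a verbatim copy of the merged code):

func (ls *Store) PerCluster(clusterName, serviceName string) *perClusterStore {
	if ls == nil {
		return nil
	}
	ls.mu.Lock()
	defer ls.mu.Unlock()
	c, ok := ls.clusters[clusterName]
	if !ok {
		c = make(map[string]*perClusterStore)
		ls.clusters[clusterName] = c
	}
	if p, ok := c[serviceName]; ok {
		return p
	}
	p := &perClusterStore{cluster: clusterName, service: serviceName}
	c[serviceName] = p
	return p
}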

@easwars assigned menghanl and unassigned easwars on Oct 5, 2020
@menghanl changed the title from "lrs: add Stats() to Store to report loads for multiple clusters" to "lrs: add Store.Stats() to report loads for multiple clusters" on Oct 7, 2020
@menghanl merged commit 5af6040 into grpc:master on Oct 7, 2020
@menghanl deleted the lrs_stream branch on October 7, 2020 17:40
The github-actions bot locked this conversation as resolved and limited it to collaborators on May 25, 2021
Labels: Type: Feature (New features or improvements in behavior)
Participants: 2