
Add keepalive to agents #2294

Merged
merged 2 commits into harness:master from keepalive on Jan 8, 2018

Conversation

@jmccann commented Jan 8, 2018

This adds keepalive to agents so they send pings back to the server during build inactivity.

This has helped me address issues between server and agent when there are no builds for a period of time and firewalls kill the inactive connection. It may be helpful for other server<>agent communication issues as well.

I'm in the process of testing this specific iteration. I had tested with hardcoded values and it worked beautifully. Wanted to get this out there for review at least. :)

@@ -80,6 +80,16 @@ func main() {
Name: "healthcheck",
Usage: "enables the healthcheck endpoint",
},
cli.StringFlag{


you can use DurationFlag here and then remove the time.ParseDuration
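For readers following along, a minimal sketch of the suggested change, assuming urfave/cli (the EnvVar name below is an assumption for illustration, not taken from the PR):

package main

import (
	"fmt"
	"os"

	"github.com/urfave/cli"
)

func main() {
	app := cli.NewApp()
	app.Flags = []cli.Flag{
		// DurationFlag parses values like "30s" or "5m" on its own,
		// so the manual time.ParseDuration call can be removed.
		cli.DurationFlag{
			Name:   "keepalive-time",
			EnvVar: "DRONE_KEEPALIVE_TIME", // assumed name, for illustration only
			Usage:  "period after which the agent pings an idle connection",
		},
	}
	app.Action = func(c *cli.Context) error {
		// c.Duration returns the parsed time.Duration, or 0 if the flag is unset.
		fmt.Println("keepalive-time:", c.Duration("keepalive-time"))
		return nil
	}
	app.Run(os.Args)
}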

@jmccann (Author)

Will implement this

conn, err := grpc.Dial(
c.String("server"),
grpc.WithInsecure(),
grpc.WithPerRPCCredentials(&credentials{
username: c.String("username"),
password: c.String("password"),
}),
grpc.WithKeepaliveParams(agentKeepalive),

@bradrydzewski commented Jan 8, 2018

you should be able to simplify like this:

grpc.WithKeepaliveParams(
  keepalive.ClientParameters{
    Timeout: c.Duration("keepalive-timeout"),
    Time:    c.Duration("keepalive-time"),
  },
),

If the flag is unset, c.Duration returns the zero value, so this snippet here:

agentKeepalive := keepalive.ClientParameters{}
if dur := c.Duration("keepalive-time"); dur != 0 {
  agentKeepalive.Time = dur
}

should behave the same as this snippet here:

agentKeepalive := keepalive.ClientParameters{
  Time: c.Duration("keepalive-time"),
}

:)
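
As an illustrative aside, a quick way to convince yourself of the zero-value point:

package main

import (
	"fmt"

	"google.golang.org/grpc/keepalive"
)

func main() {
	// An unset time.Duration field is simply 0, so the empty struct
	// literal and the explicit zero are identical values.
	a := keepalive.ClientParameters{}
	b := keepalive.ClientParameters{Time: 0}
	fmt.Println(a == b) // true
}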

@jmccann (Author)

My only concern is that I'm trying to maintain the default values for Time (infinity) and Timeout (20s).

grpc.WithKeepaliveParams(
  keepalive.ClientParameters{
    Timeout: c.Duration("keepalive-timeout"),
    Time: c.Duration("keepalive-time"),
  },
),

would set both to 0 instead of the defaults if the user doesn't specify any keepalive config. Let me know if I'm wrong and I can implement it that way. Otherwise I think I'll change what I have to:

agentKeepalive := keepalive.ClientParameters{}
if dur := c.Duration("keepalive-time"); dur != 0 {
  agentKeepalive.Time = dur
}

@bradrydzewski

I think this will still work as expected. There is no nil value for time.Duration, so the grpc code will likely use infinity if the value is 0. I will look through the code to verify, though.

@bradrydzewski commented Jan 8, 2018

Ok, looks like if c.Duration returns a zero value, the grpc defaults will kick in:

var defaultClientKeepaliveTime = infinity

...

	if kp.Time == 0 {
		kp.Time = defaultClientKeepaliveTime
	}
	if kp.Timeout == 0 {
		kp.Timeout = defaultClientKeepaliveTimeout
	}
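
Putting the two review threads together, the end state can be sketched roughly like this (a sketch only, not the merged diff; the per-RPC credentials from the original dial call are omitted for brevity):

package agent

import (
	"github.com/urfave/cli"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dial(c *cli.Context) (*grpc.ClientConn, error) {
	// c.Duration returns 0 for an unset flag, and grpc-go replaces zero
	// values with its own defaults (Time = infinity, Timeout = 20s), so
	// the flag values can be passed straight through without any guards.
	return grpc.Dial(
		c.String("server"),
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    c.Duration("keepalive-time"),
			Timeout: c.Duration("keepalive-timeout"),
		}),
	)
}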

@bradrydzewski bradrydzewski merged commit fc22c5b into harness:master Jan 8, 2018
@jmccann jmccann deleted the keepalive branch January 8, 2018 19:00
@evert0n commented Jan 10, 2018

@jmccann awesome! This one has been haunting me for a couple of weeks, ever since 0.8.x.

@Sharsie commented Jan 14, 2018

@jmccann Thanks for this. From the short time I have been testing it, it seems to work just fine across two swarm networks (server running on one swarm, the agents running on another).

There's just one thing. I think it's purely cosmetic, but I thought I would point it out anyway in case it has another side effect I am not aware of:
/api/info/queue returns

{
    "pending": null,
    "running": null,
    "stats": {
        "worker_count": 43,
        "pending_count": 0,
        "running_count": 0,
        "completed_count": 0
    }
}

I have three workers/agents running, and the keepalives seem to behave as if new agents kept joining the server.

edit: Well, I seem to have hit a sweet spot where none of the workers were connected at the time. I now have the worker count at 52 (and still increasing) but a build stuck in pending.

@bradrydzewski commented

> Well, I seem to have hit a sweet spot where none of the workers were connected at the time. I now have the worker count at 52 (and still increasing) but a build stuck in pending.

This would tell me that something is terminating your TCP connections (load balancers, etc.) without sending a TCP RST. The server therefore thinks the connection is still open, as it has never received a close event.

There is a server-side keepalive that can be configured as well. That would enable bi-directional keepalive pings and would likely prevent this sort of problem: grpc/grpc-go#1119
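
For reference, the server-side half lives in the same grpc-go keepalive package; a minimal sketch of enabling server-initiated pings (the intervals are illustrative, not values from this PR):

package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		// With server-side pings enabled, dead connections are detected
		// even if the client-side keepalive never fires.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    5 * time.Minute,  // illustrative: ping after 5m of inactivity
			Timeout: 20 * time.Second, // illustrative: wait 20s for the ack
		}),
	)
}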

@krasi-georgiev commented

@bradrydzewski shouldn't this also close #2246?

@Sharsie commented Mar 14, 2018

I did not manage to get it working on a swarm behind an nginx proxy using the keepalives from #2297.

So if what @bradrydzewski mentioned above is not configurable in the current build, I do not think the issue is resolved.

In the end I migrated the drone agent and server outside of the swarm.
