
fix docker logs tailer with file rotation #4860

Merged: 7 commits merged into master from clamoriniere/fixdockerlogsrotation on Feb 13, 2020

Conversation

clamoriniere (Contributor)

What does this PR do?

This PR changes the way the docker logs tailer handles the io.EOF error. Previously this error did not trigger any recovery logic, since it was assumed it could only happen when the container was stopped.
With this change, the tailer treats io.EOF as a "normal" error: it stops tailing the file and returns an error that asks the docker.Launcher to recreate the tailer (NewTailer).
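
To make the new behaviour concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual tailer code) of what "treating io.EOF as a normal error" means: instead of silently stopping, the read loop surfaces an error so the launcher can rebuild the tailer.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errReaderClosed is a hypothetical sentinel telling the launcher that the
// log stream ended and the tailer should be recreated.
var errReaderClosed = errors.New("log stream reached EOF, tailer restart required")

// readOnce mimics the read-loop decision: io.EOF is no longer assumed to
// mean "the container stopped"; it becomes a recoverable error.
func readOnce(r io.Reader, buf []byte) error {
	_, err := r.Read(buf)
	switch {
	case err == nil:
		return nil
	case err == io.EOF:
		// Before: stop tailing silently. Now: report it so the launcher
		// can create a new tailer (e.g. after a log file rotation).
		return errReaderClosed
	default:
		return err
	}
}

func main() {
	// An empty reader returns io.EOF immediately, standing in for a
	// rotated log file whose stream has ended.
	fmt.Println(readOnce(strings.NewReader(""), make([]byte, 64)))
}
```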

Motivation

Avoid stopping to tail the docker log file in some specific configurations when the log file is rotated.

Additional Notes

Anything else we should know when reviewing?

Signed-off-by: cedric lamoriniere <[email protected]>
vboulineau previously approved these changes Feb 10, 2020
@olivielpeau (Member) left a comment

This fix makes sense, thanks!

The fix looks relatively harmless (it basically makes the tailer handle io.EOF like other "unknown" errors), although it's not entirely clear to me if/how it changes the behavior for short-lived containers introduced in #4085, especially since there are no tests on #4085. An e2e test was worked on in #3981 but was never merged. So let's make sure this doesn't introduce a regression there; cc @ajacquemot if you could help determine there's no regression.

Could you describe how you're testing this change as well? An automated test would be ideal (but can be planned later in a separate PR).

Two review threads on pkg/logs/input/docker/tailer.go (outdated, resolved)
@ajacquemot (Contributor) left a comment

Fix looks good to me 💪

Please keep in mind that this case is triggered:

  • on file rotation
  • on container stop
  • when no log was output

So we need to make sure that we test those 3 scenarios thoroughly during the QA to ensure no regression.

Also we need to take extra care of the Since value because it can lead to data loss or duplicates (see the sketch after this comment):

  • on file rotation we should start from the beginning
  • on container stop we should start from the last committed offset, or from the end if there is no offset
  • on no output we should start from the beginning

Finally, we could probably cover these scenarios with unit tests, but it would mean refactoring the code. I started to think about it: we could probably pass a provider to construct the reader and mock the different behaviours. Up to you to decide :-)
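
As a hedged illustration of the Since handling described in the list above (the reason names and the helper function are assumptions made for this sketch, not the PR's API):

```go
package main

import (
	"fmt"
	"time"
)

// restartReason is a hypothetical enum covering the three cases above.
type restartReason int

const (
	fileRotation restartReason = iota
	containerStop
	noOutput
)

// sinceFor picks the timestamp handed to the docker logs API so that the
// tailer neither loses nor duplicates lines: rotation and "no output"
// restart from the beginning, container stop restarts from the last
// committed offset (or from the end when there is no offset).
func sinceFor(reason restartReason, lastCommitted time.Time) time.Time {
	switch reason {
	case fileRotation, noOutput:
		return time.Time{} // zero value: read from the beginning
	case containerStop:
		if lastCommitted.IsZero() {
			return time.Now() // no committed offset: start from the end
		}
		return lastCommitted
	default:
		return time.Now()
	}
}

func main() {
	fmt.Println(sinceFor(fileRotation, time.Now()))   // zero time: from the beginning
	fmt.Println(sinceFor(containerStop, time.Time{})) // now: from the end
}
```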

Three review threads on pkg/logs/input/docker/tailer.go (outdated, resolved)
Signed-off-by: cedric lamoriniere <[email protected]>
olivielpeau previously approved these changes Feb 10, 2020
@olivielpeau (Member) left a comment

LGTM.

Again, let's make sure this doesn't introduce a regression for short-lived containers

ajacquemot previously approved these changes Feb 10, 2020
@ajacquemot (Contributor) left a comment

🚀

@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch 4 times, most recently from 74f0c9b to f2b6654 on February 10, 2020 22:07
@ajacquemot (Contributor) left a comment

Thanks for the unit tests, that's super nice. I left two comments; we can sit down together if needed.

Review threads on pkg/logs/auditor/auditor.go and pkg/logs/input/docker/tailer.go (outdated, resolved)
@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch from f2b6654 to c0e94f9 on February 11, 2020 17:55
@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch 7 times, most recently from 1a75f1c to 7c33479 on February 12, 2020 16:11
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
* implement backoff in `safereader`
* implement restartReader()

docker log tailer, add restartReader() with backoff policy

Signed-off-by: cedric lamoriniere <[email protected]>
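
The commit message above mentions a backoff policy for restartReader(); a minimal sketch of such a policy (attempt count, durations and signatures are illustrative, not the merged implementation) could look like this:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// restartReader retries setupReader with exponential backoff so a flaky
// docker socket is not hammered in a tight loop.
func restartReader(ctx context.Context, setupReader func() error) error {
	backoff := 500 * time.Millisecond
	const maxAttempts = 5

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := setupReader(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
	return errors.New("could not restart the reader, giving up")
}

func main() {
	attempts := 0
	err := restartReader(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return errors.New("socket not ready")
		}
		return nil
	})
	fmt.Println(err, "after", attempts, "attempts") // <nil> after 3 attempts
}
```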
@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch from 7c33479 to 499089a on February 12, 2020 18:22
@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch from 499089a to fb97bc1 on February 13, 2020 08:39
In pkg/logs/auditor/auditor.go:

	a.registry = a.recoverRegistry()
	a.cleanupRegistry()
	go a.run()
}

// Stop stops the Auditor
func (a *Auditor) Stop() {
	a.mu.Lock()
Contributor: https://github.com/DataDog/datadog-agent/pull/4860/files#diff-5142632c5dd9990aa6feb0918f9334f7L85 is in the critical section, but I think it should not be, because any method used by the run method that acquires the mutex can be called concurrently.

Contributor: You could probably create setup and reset methods that are protected by the mutex.

Member: I guess there's technically a chance of a deadlock if one of these tickers ticks right after Stop here has locked the mutex, since cleanupRegistry and flushRegistry both need to hold the lock. The chances of running into it are likely extremely low, so I'm not sure it's worth addressing. To avoid this we could have a separate mutex for the registry.
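
A minimal sketch of the separate-mutex idea (field and method names are assumptions for illustration, not the auditor's actual layout):

```go
package main

import (
	"fmt"
	"sync"
)

// auditor keeps the lifecycle lock (mu) separate from the registry lock, so
// a ticker calling cleanupRegistry cannot deadlock against Stop, which only
// holds mu.
type auditor struct {
	mu sync.Mutex // protects start/stop state

	registryMu sync.Mutex // protects registry only
	registry   map[string]int64
}

func (a *auditor) cleanupRegistry() {
	a.registryMu.Lock()
	defer a.registryMu.Unlock()
	for id, offset := range a.registry {
		if offset < 0 {
			delete(a.registry, id)
		}
	}
}

func (a *auditor) Stop() {
	a.mu.Lock()
	defer a.mu.Unlock()
	// Stopping no longer blocks registry readers/writers on the same lock.
}

func main() {
	a := &auditor{registry: map[string]int64{"c1": 42, "c2": -1}}
	a.cleanupRegistry()
	a.Stop()
	fmt.Println(a.registry) // map[c1:42]
}
```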

Contributor (Author): I was also thinking of moving it into the New() function.

Contributor (Author): Makes sense. I added the new mutex.

In pkg/logs/input/docker/tailer.go:

@@ -185,19 +205,25 @@ func (t *Tailer) readForever() {
			// This error is raised when the agent is stopping
			return
		case err == io.EOF:
Contributor: Here I would have reused the same logic as in case isContextCanceled(err): and reused the wait() method to avoid hammering the docker socket and driving the CPU usage up.

@clamoriniere (Contributor, Author), Feb 13, 2020: Should we try only once to set up the reader and then ask the launcher to restart the tailer?

Member: In general, does it make sense to have different retry strategies depending on the error (isContextCanceled, io.EOF, and default)? If so, why? (This is an honest question, I don't know these different types of errors well enough to answer.)

Otherwise I feel all errors that require a retry could use the same overall logic: retry setting up the reader, and if that fails ask the launcher to restart the tailer.

Contributor (Author): I have the same feeling. PR updated in this direction.
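
An illustrative sketch of that single retry path (error names and helpers here are hypothetical, not the code that was merged): every recoverable error tries to restart the reader, and a failed restart asks the launcher to recreate the tailer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// errRestartTailer is a hypothetical sentinel returned to the launcher when
// the reader could not be brought back, so it recreates the whole tailer.
var errRestartTailer = errors.New("reader could not be restarted, recreate the tailer")

// handleReadError funnels every recoverable error through the same logic.
func handleReadError(ctx context.Context, err error, restartReader func(context.Context) error) error {
	if err == nil {
		return nil
	}
	// Context cancellation, io.EOF (rotation, container stop, no output)
	// and unknown errors all share one path: retry setting up the reader,
	// and if that fails, hand the problem back to the launcher.
	if restartErr := restartReader(ctx); restartErr != nil {
		return errRestartTailer
	}
	return nil
}

func main() {
	err := handleReadError(context.Background(), io.EOF, func(context.Context) error {
		return errors.New("docker socket unavailable")
	})
	fmt.Println(err) // reader could not be restarted, recreate the tailer
}
```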

Contributor: Agreed.

@olivielpeau (Member) left a comment

Left a couple of comments, but overall the changes look good to me if they fix the issue 👍


@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch 2 times, most recently from 6705ce4 to 874c508 on February 13, 2020 14:49
@clamoriniere force-pushed the clamoriniere/fixdockerlogsrotation branch from 874c508 to ea5d600 on February 13, 2020 14:56
@olivielpeau (Member) left a comment

LGTM, thanks!

@ajacquemot (Contributor) left a comment

Looks good, thanks for the fix 👌

@clamoriniere merged commit 61a866b into master on Feb 13, 2020
@clamoriniere deleted the clamoriniere/fixdockerlogsrotation branch on February 13, 2020 18:47
clamoriniere added a commit that referenced this pull request Feb 13, 2020
* fix docker logs tailer with file rotation
* Improve log reader EOF detection, update unit-tests
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
clamoriniere added a commit that referenced this pull request Feb 13, 2020
* fix docker logs tailer with file rotation
* Improve log reader EOF detection, update unit-tests
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor

Signed-off-by: cedric lamoriniere <[email protected]>