fix docker logs tailer with file rotation #4860
Conversation
Signed-off-by: cedric lamoriniere <[email protected]>
Signed-off-by: cedric lamoriniere <[email protected]>
This fix makes sense, thanks!
The fix looks relatively harmless (it basically makes the tailer handle `io.EOF` like other "unknown" errors), although it's not entirely clear to me if/how it changes the behavior for short-lived containers introduced in #4085, especially since there are no tests on #4085. An e2e test was worked on in #3981 but was never merged. So let's make sure this doesn't introduce a regression there; cc @ajacquemot if you could help determine there's no regression.
Could you describe how you're testing this change as well? An automated test would be ideal (but it can be planned later in a separate PR).
releasenotes/notes/logs_fix_docker_tailer_with_file_rotation-536675ed2caa7e6e.yaml
Fix looks good to me 💪
Please keep in mind that this case is triggered:
- on file rotation
- on container stop
- when no log was output
So we need to make sure we test those three scenarios thoroughly during QA to ensure no regression.
Also we need to take extra care with the `Since` value, because it can lead to data loss or duplicates (see the sketch after this list):
- on file rotation we should start from the beginning
- on container stop we should start from the last committed offset, or from the end if there is no offset
- when there was no output we should start from the beginning
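For illustration, those rules could be encoded roughly like this (a sketch only; `restartCause` and `computeSince` are hypothetical names, not the agent's actual code):

```go
package tailer

import "time"

// restartCause names the three scenarios discussed above.
type restartCause int

const (
	fileRotation restartCause = iota
	containerStop
	noOutputYet
)

// computeSince returns the timestamp to resume tailing from. The zero time
// means "from the beginning"; `now` stands for "from the end".
func computeSince(cause restartCause, committedOffset time.Time, hasOffset bool, now time.Time) time.Time {
	switch cause {
	case containerStop:
		if hasOffset {
			return committedOffset // resume from the last committed offset
		}
		return now // no offset recorded: start from the end
	default: // fileRotation, noOutputYet
		return time.Time{} // start from the beginning
	}
}
```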
Finally, we could probably cover these scenarios with unit tests, but that would mean refactoring the code. I started to think about it: we could probably pass a provider to construct the reader and mock the different behaviours (sketched below); up to you to decide :-)
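One possible shape for that provider (a sketch only; every name here is hypothetical, not the agent's actual API):

```go
package tailer

import (
	"context"
	"io"
	"time"
)

// ReaderProvider abstracts how the tailer obtains its log stream, so unit
// tests can inject readers that return io.EOF, fail, or simulate rotation
// without a real Docker socket.
type ReaderProvider interface {
	NewReader(ctx context.Context, containerID string, since time.Time) (io.ReadCloser, error)
}

// scriptedProvider hands out pre-built readers in order, one per (re)start,
// letting a test script a sequence like "EOF, then a fresh stream".
type scriptedProvider struct {
	readers []io.ReadCloser
}

func (p *scriptedProvider) NewReader(context.Context, string, time.Time) (io.ReadCloser, error) {
	if len(p.readers) == 0 {
		return nil, io.EOF
	}
	r := p.readers[0]
	p.readers = p.readers[1:]
	return r, nil
}
```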
Signed-off-by: cedric lamoriniere <[email protected]>
LGTM.
Again, let's make sure this doesn't introduce a regression for short-lived containers
🚀
47e49c3
Force-pushed from 74f0c9b to f2b6654
Thanks for the unit tests, that's super nice. I left two comments; we can sit down together if needed.
Force-pushed from f2b6654 to c0e94f9
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 1a75f1c to 7c33479
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
* implement backoff in `safereader`
* implement restartReader()
docker log tailer, add restartReader() with backoff policy
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 7c33479 to 499089a
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 499089a to fb97bc1
pkg/logs/auditor/auditor.go
	a.registry = a.recoverRegistry()
	a.cleanupRegistry()
	go a.run()
}

// Stop stops the Auditor
func (a *Auditor) Stop() {
	a.mu.Lock()
https://github.com/DataDog/datadog-agent/pull/4860/files#diff-5142632c5dd9990aa6feb0918f9334f7L85 is in the critical path, but I think it should not be, because any method that acquires the mutex can be called concurrently with the run method.
You could probably create setup and reset methods which are protected by the mutex.
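For illustration, a minimal sketch of that split, reusing the names from the hunk above (the `setup` helper itself is hypothetical, and it assumes `cleanupRegistry` does not take this same mutex):

```go
// setup performs the registry initialization under the mutex, so Start
// itself stays out of the critical path.
func (a *Auditor) setup() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.registry = a.recoverRegistry()
	a.cleanupRegistry()
}

// Start recovers the registry and launches the run loop.
func (a *Auditor) Start() {
	a.setup()
	go a.run()
}
```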
I guess there's technically a chance of a deadlock if one of these tickers ticks right after `Stop` has locked the mutex, since `cleanupRegistry` and `flushRegistry` both need to hold the lock. The chances of running into it are likely extremely low, so I'm not sure it's worth addressing. To avoid this we could have a separate mutex for `registry`.
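A rough sketch of that alternative (type and field names are approximations, not the agent's actual code):

```go
package auditor

import "sync"

// RegistryEntry is a stand-in for the agent's registry entry type.
type RegistryEntry struct{ Offset string }

// Auditor with a dedicated mutex for registry access: the run-loop tickers
// only ever take registryMu, so they cannot deadlock against the lifecycle
// mutex that Stop holds.
type Auditor struct {
	mu         sync.Mutex // guards Start/Stop lifecycle
	registryMu sync.Mutex // guards registry only
	registry   map[string]*RegistryEntry
}

func (a *Auditor) cleanupRegistry() {
	a.registryMu.Lock()
	defer a.registryMu.Unlock()
	// drop expired entries from a.registry ...
}

func (a *Auditor) flushRegistry() {
	a.registryMu.Lock()
	defer a.registryMu.Unlock()
	// persist a.registry to disk ...
}
```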
I was also thinking of moving it into the New() function.
Makes sense. I added the new mutex.
@@ -185,19 +205,25 @@ func (t *Tailer) readForever() {
		// This error is raised when the agent is stopping
		return
	case err == io.EOF:
Here I would have reused the same logic as in `case isContextCanceled(err):` and reused the `wait()` method, to avoid hammering the Docker socket and driving CPU usage up.
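For reference, a generic shape such a `wait()` could take, assuming a simple exponential backoff (a sketch; the agent's real implementation may differ):

```go
package tailer

import "time"

const (
	minBackoff = 500 * time.Millisecond
	maxBackoff = 30 * time.Second
)

// Tailer is a stand-in holding only the backoff state used here.
type Tailer struct {
	backoff time.Duration
}

// wait sleeps for the current backoff and doubles it (capped), so repeated
// io.EOF errors do not spin on the Docker socket and burn CPU.
func (t *Tailer) wait() {
	if t.backoff < minBackoff {
		t.backoff = minBackoff
	}
	time.Sleep(t.backoff)
	t.backoff *= 2
	if t.backoff > maxBackoff {
		t.backoff = maxBackoff
	}
}
```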
Should we try only once to set up the reader, then ask the launcher to restart the tailer?
In general, does it make sense to have different retry strategies depending on the error (`isContextCanceled`, `io.EOF`, and `default`)? If so, why? (This is an honest question; I don't know these different types of errors well enough to answer.)
Otherwise I feel all errors that require a retry could use the same overall logic: retry setting up the reader, and if that fails, ask the launcher to restart the tailer.
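In code, that unified policy could look roughly like this (a sketch; `handleReadError` and `errReaderGone` are hypothetical names):

```go
package tailer

import "errors"

// errReaderGone tells the launcher the tailer should be stopped and recreated.
var errReaderGone = errors.New("log reader closed, restart the tailer")

// handleReadError applies one policy to every retryable read error: try to
// rebuild the reader once via setup; if that also fails, surface an error so
// the launcher restarts the whole tailer.
func handleReadError(readErr error, setup func() error) error {
	if readErr == nil {
		return nil
	}
	if err := setup(); err != nil {
		return errReaderGone
	}
	return nil // reader rebuilt, keep tailing
}
```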
I have the same feeling. PR updated in this direction.
agreed
Left a couple of comments, but overall the changes look good to me if they fix the issue 👍
Force-pushed from 6705ce4 to 874c508
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 874c508 to ea5d600
LGTM, thanks!
Looks good, thanks for the fix 👌
* fix docker logs tailer with file rotation
* Improve log reader EOF detection, update unit-tests
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
Signed-off-by: cedric lamoriniere <[email protected]>
What does this PR do?
This PR changes the way the Docker logs tailer handles the `io.EOF` error. Previously, this error did not trigger any recovery logic, since it was assumed it could happen only when the container was stopped. With this change, the tailer treats `io.EOF` as a "normal" error: it stops tailing the file and returns an error that asks the `docker.Launcher` to create a new Tailer.
Motivation
Avoid the agent silently stopping to tail a Docker log file in some specific configurations when the log file is rotated.
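For reviewers, a simplified sketch of the behavioral change (the agent's actual `readForever` loop has more cases; the types and helpers here are stand-ins, not the real code):

```go
package tailer

import (
	"context"
	"io"
)

// Tailer is a minimal stand-in for the agent's Docker log tailer.
type Tailer struct {
	reader io.Reader
	stop   func(error) // reports the error; the launcher then recreates the tailer
}

func isContextCanceled(err error) bool { return err == context.Canceled }

func (t *Tailer) readForever(buf []byte, forward func([]byte)) {
	for {
		n, err := t.reader.Read(buf)
		switch {
		case isContextCanceled(err):
			// The agent is stopping.
			return
		case err == io.EOF:
			// Before this PR: io.EOF meant "container stopped" and tailing
			// ended silently. Now it is treated like any unknown error: stop
			// and report, so a new tailer is created on the rotated file.
			t.stop(err)
			return
		case err != nil:
			t.stop(err)
			return
		default:
			forward(buf[:n])
		}
	}
}
```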
Additional Notes
Anything else we should know when reviewing?