fix docker logs tailer with file rotation #4860
Conversation
Signed-off-by: cedric lamoriniere <[email protected]>
Signed-off-by: cedric lamoriniere <[email protected]>
This fix makes sense, thanks!
The fix looks relatively harmless (it basically makes the tailer handle `io.EOF` like other "unknown" errors), although it's not entirely clear to me if/how it changes the behavior for short-lived containers introduced in #4085, especially since there are no tests on #4085. An e2e test was worked on in #3981 but was never merged. So let's make sure this doesn't introduce a regression there; cc @ajacquemot if you could help determine there's no regression.
Could you describe how you're testing this change as well? An automated test would be ideal (but it can be planned later in a separate PR).
releasenotes/notes/logs_fix_docker_tailer_with_file_rotation-536675ed2caa7e6e.yaml
Fix looks good to me 💪
Please keep in mind that this case is triggered:
- on file rotation
- on container stop
- when no log was output
So we need to make sure we test those three scenarios thoroughly during QA to ensure no regression.
Also we need to take extra care with the `Since` value, because it can lead to data loss or duplicates (see the sketch after this list):
- on file rotation we should start from the beginning
- on container stop we should start from the last committed offset, or from the end if there is no offset
- when there was no output we should start from the beginning
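For illustration, those rules could be encoded roughly like this (a sketch only; `restartCause` and `computeSince` are hypothetical names, not the agent's actual code):

```go
package tailer

import "time"

// restartCause names the three scenarios discussed above.
type restartCause int

const (
	fileRotation restartCause = iota
	containerStop
	noOutputYet
)

// computeSince returns the timestamp to resume tailing from. The zero time
// means "from the beginning"; `now` stands for "from the end".
func computeSince(cause restartCause, committedOffset time.Time, hasOffset bool, now time.Time) time.Time {
	switch cause {
	case containerStop:
		if hasOffset {
			return committedOffset // resume from the last committed offset
		}
		return now // no offset recorded: start from the end
	default: // fileRotation, noOutputYet
		return time.Time{} // start from the beginning
	}
}
```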
Finally, we could probably cover these scenarios with unit tests, but that would mean refactoring the code. I started to think about it: we could probably pass a provider to construct the reader and mock the different behaviours (sketched below); up to you to decide :-)
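One possible shape for that provider (a sketch only; every name here is hypothetical, not the agent's actual API):

```go
package tailer

import (
	"context"
	"io"
	"time"
)

// ReaderProvider abstracts how the tailer obtains its log stream, so unit
// tests can inject readers that return io.EOF, fail, or simulate rotation
// without a real Docker socket.
type ReaderProvider interface {
	NewReader(ctx context.Context, containerID string, since time.Time) (io.ReadCloser, error)
}

// scriptedProvider hands out pre-built readers in order, one per (re)start,
// letting a test script a sequence like "EOF, then a fresh stream".
type scriptedProvider struct {
	readers []io.ReadCloser
}

func (p *scriptedProvider) NewReader(context.Context, string, time.Time) (io.ReadCloser, error) {
	if len(p.readers) == 0 {
		return nil, io.EOF
	}
	r := p.readers[0]
	p.readers = p.readers[1:]
	return r, nil
}
```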
Signed-off-by: cedric lamoriniere <[email protected]>
LGTM.
Again, let's make sure this doesn't introduce a regression for short-lived containers
🚀
47e49c3
Force-pushed from 74f0c9b to f2b6654
Thanks for the unit tests, that's super nice. I left two comments; we can sit down together if needed.
Force-pushed from f2b6654 to c0e94f9
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 1a75f1c to 7c33479
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
* implement backoff in `safereader`
* implement restartReader()
docker log tailer, add restartReader() with backoff policy
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 7c33479 to 499089a
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 499089a to fb97bc1
pkg/logs/auditor/auditor.go
	a.registry = a.recoverRegistry()
	a.cleanupRegistry()
	go a.run()
}

// Stop stops the Auditor
func (a *Auditor) Stop() {
	a.mu.Lock()
https://github.com/DataDog/datadog-agent/pull/4860/files#diff-5142632c5dd9990aa6feb0918f9334f7L85 is in the critical path, but I think it should not be, because any method that acquires the mutex can be called concurrently with the run method.
You could probably create setup and reset methods which are protected by the mutex.
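For illustration, a minimal sketch of that split, reusing the names from the hunk above (the `setup` helper itself is hypothetical, and it assumes `cleanupRegistry` does not take this same mutex):

```go
// setup performs the registry initialization under the mutex, so Start
// itself stays out of the critical path.
func (a *Auditor) setup() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.registry = a.recoverRegistry()
	a.cleanupRegistry()
}

// Start recovers the registry and launches the run loop.
func (a *Auditor) Start() {
	a.setup()
	go a.run()
}
```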
I guess there's technically a chance of a deadlock if one of these tickers ticks right after `Stop` has locked the mutex, since `cleanupRegistry` and `flushRegistry` both need to hold the lock. The chances of running into it are likely extremely low, so I'm not sure it's worth addressing. To avoid this we could have a separate mutex for `registry`.
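A rough sketch of that alternative (type and field names are approximations, not the agent's actual code):

```go
package auditor

import "sync"

// RegistryEntry is a stand-in for the agent's registry entry type.
type RegistryEntry struct{ Offset string }

// Auditor with a dedicated mutex for registry access: the run-loop tickers
// only ever take registryMu, so they cannot deadlock against the lifecycle
// mutex that Stop holds.
type Auditor struct {
	mu         sync.Mutex // guards Start/Stop lifecycle
	registryMu sync.Mutex // guards registry only
	registry   map[string]*RegistryEntry
}

func (a *Auditor) cleanupRegistry() {
	a.registryMu.Lock()
	defer a.registryMu.Unlock()
	// drop expired entries from a.registry ...
}

func (a *Auditor) flushRegistry() {
	a.registryMu.Lock()
	defer a.registryMu.Unlock()
	// persist a.registry to disk ...
}
```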
I was also thinking of moving it into the New() function.
Makes sense. I added the new mutex.
@@ -185,19 +205,25 @@ func (t *Tailer) readForever() {
		// This error is raised when the agent is stopping
		return
	case err == io.EOF:
Here I would have reused the same logic as in `case isContextCanceled(err):` and reused the `wait()` method, to avoid hammering the Docker socket and driving CPU usage up.
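For reference, a generic shape such a `wait()` could take, assuming a simple exponential backoff (a sketch; the agent's real implementation may differ):

```go
package tailer

import "time"

const (
	minBackoff = 500 * time.Millisecond
	maxBackoff = 30 * time.Second
)

// Tailer is a stand-in holding only the backoff state used here.
type Tailer struct {
	backoff time.Duration
}

// wait sleeps for the current backoff and doubles it (capped), so repeated
// io.EOF errors do not spin on the Docker socket and burn CPU.
func (t *Tailer) wait() {
	if t.backoff < minBackoff {
		t.backoff = minBackoff
	}
	time.Sleep(t.backoff)
	t.backoff *= 2
	if t.backoff > maxBackoff {
		t.backoff = maxBackoff
	}
}
```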
Should we try only once to set up the reader, then ask the launcher to restart the tailer?
In general, does it make sense to have different retry strategies depending on the error (`isContextCanceled`, `io.EOF`, and `default`)? If so, why? (This is an honest question; I don't know these different types of errors well enough to answer.)
Otherwise I feel all errors that require a retry could use the same overall logic: retry setting up the reader, and if that fails, ask the launcher to restart the tailer.
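In code, that unified policy could look roughly like this (a sketch; `handleReadError` and `errReaderGone` are hypothetical names):

```go
package tailer

import "errors"

// errReaderGone tells the launcher the tailer should be stopped and recreated.
var errReaderGone = errors.New("log reader closed, restart the tailer")

// handleReadError applies one policy to every retryable read error: try to
// rebuild the reader once via setup; if that also fails, surface an error so
// the launcher restarts the whole tailer.
func handleReadError(readErr error, setup func() error) error {
	if readErr == nil {
		return nil
	}
	if err := setup(); err != nil {
		return errReaderGone
	}
	return nil // reader rebuilt, keep tailing
}
```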
I have the same feeling. PR updated in this direction.
agreed
Left a couple of comments, but overall the changes look good to me if they fix the issue 👍
Force-pushed from 6705ce4 to 874c508
Signed-off-by: cedric lamoriniere <[email protected]>
Force-pushed from 874c508 to ea5d600
LGTM, thanks!
Looks good, thanks for the fix 👌
* fix docker logs tailer with file rotation
* Improve log reader EOF detection, update unit-tests
* fix map race access in docker logs launcher
* fix map race access in docker logs auditor
Signed-off-by: cedric lamoriniere <[email protected]>
What does this PR do?
This PR changes the way the Docker logs tailer handles the `io.EOF` error. Previously, this error did not trigger any recovery logic, since it was assumed it could happen only when the container was stopped. With this change, the tailer treats `io.EOF` as a "normal" error: it stops tailing the file and returns an error that asks the `docker.Launcher` to create a new Tailer.
Motivation
Avoid the agent silently stopping to tail a Docker log file in some specific configurations when the log file is rotated.
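For reviewers, a simplified sketch of the behavioral change (the agent's actual `readForever` loop has more cases; the types and helpers here are stand-ins, not the real code):

```go
package tailer

import (
	"context"
	"io"
)

// Tailer is a minimal stand-in for the agent's Docker log tailer.
type Tailer struct {
	reader io.Reader
	stop   func(error) // reports the error; the launcher then recreates the tailer
}

func isContextCanceled(err error) bool { return err == context.Canceled }

func (t *Tailer) readForever(buf []byte, forward func([]byte)) {
	for {
		n, err := t.reader.Read(buf)
		switch {
		case isContextCanceled(err):
			// The agent is stopping.
			return
		case err == io.EOF:
			// Before this PR: io.EOF meant "container stopped" and tailing
			// ended silently. Now it is treated like any unknown error: stop
			// and report, so a new tailer is created on the rotated file.
			t.stop(err)
			return
		case err != nil:
			t.stop(err)
			return
		default:
			forward(buf[:n])
		}
	}
}
```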
Additional Notes
Anything else we should know when reviewing?