Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Log] make glog flush and RAY_LOG thread-safe #11002

Merged
merged 14 commits into from
Sep 28, 2020

Conversation

ashione
Copy link
Member

@ashione ashione commented Sep 24, 2020

Current RAY_LOG will make duplicated output in log files.
For example, all logs of RAY_LOG(ERROR) will be flushed three times to destination file since we use a same sink file for INFO, WARN and ERROR level.

More important thing is this implementation is not thread-safe, so we'd better use glog default logobject.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ashione
Copy link
Member Author

ashione commented Sep 24, 2020

It's runnning as same as previous version in my laptop. @mehrdadn Could you double check the log destination file and directory? We'd better make the proper log files end up in the proper folder. (also works well Windows)

Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, GLOG has already done locking here, so we don't need to worry about thread-safety in custom loggers.

src/ray/util/logging.cc Show resolved Hide resolved
@rkooo567 rkooo567 self-assigned this Sep 24, 2020
@ashione
Copy link
Member Author

ashione commented Sep 24, 2020

AFAIK, GLOG has already done locking here, so we don't need to worry about thread-safety in custom loggers.

Glog guarantees base logger thread safe by holding muted lock inside. But we cannot assert it works well definitely. For example, other thread is logging somethings to destination file when stream logger singleton has been closed in ShutdownRayLog before invoking glog shutdown.

@kfstorm
Copy link
Member

kfstorm commented Sep 24, 2020 via email

@rkooo567 rkooo567 changed the title make glog flush and RAY_LOG thread-safe [Log] make glog flush and RAY_LOG thread-safe Sep 24, 2020
@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

@kfstorm @rkooo567 @mehrdadn Any comments about this PR? Streaming ci failure in travis will be unrelated with this.

Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wait for @mehrdadn's comments since you are trying to remove StreamLogger.

thirdparty/patches/glog-suffix-log.patch Show resolved Hide resolved
@kfstorm
Copy link
Member

kfstorm commented Sep 25, 2020

@mehrdadn Could you explain the motivation of replacing the default logger with StreamLogger?

@mehrdadn
Copy link
Contributor

@kfstorm It was just to combine messages of different severities into the same log files (#9230), because my previous changes had resulted in them going into separate log files. The previous changes were trying to make sure that the log files were in the correct log directory (because in some cases logs used to be created in the wrong folder).

Copy link
Contributor

@mehrdadn mehrdadn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashione To be honest glog was quite a headache for me to figure out and I only touched it incidentally (and at this point I've forgotten some details already). Also, I've actually stopped working on Ray. You might want to reach out to others for reviewing haha.

@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

@ashione To be honest glog was quite a headache for me to figure out and I only touched it incidentally (and at this point I've forgotten some details already). Also, I've actually stopped working on Ray. You might want to reach out to others for reviewing haha.

Sorry to hear that. But thanks a lot for your response.

@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

I find these comments in glog/logging.cc

 // If no base filename for logs of this severity has been set, use a
      // default base filename of
      // "<program name>.<hostname>.<user name>.log.<severity level>.".  So
      // logfiles will have names like
      // webserver.examplehost.root.log.INFO.19990817-150000.4354, where
      // 19990817 is a date (1999 August 17), 150000 is a time (15:00:00),
      // and 4354 is the pid of the logging process.  The date & time reflect
      // when the file was created for output.
      //
      // Where does the file get put?  Successively try the directories
      // "/tmp", and "."

So raylet's log will be printed in /tmp with ...log.. format.
@kfstorm

@kfstorm
Copy link
Member

kfstorm commented Sep 25, 2020

@ashione I see. So to fix the wrong log folder issue, can we just apply a patch to the GLOG source code to disable the "default base filename" feature? Then we can revert logging.cc to the version several months ago.

@kfstorm
Copy link
Member

kfstorm commented Sep 25, 2020

I must say that the logging to "/tmp" by default behavior is a really bad design. And it even can't be turned off. 🤷‍♂️

@kfstorm
Copy link
Member

kfstorm commented Sep 25, 2020

I find these comments in glog/logging.cc

 // If no base filename for logs of this severity has been set, use a
      // default base filename of
      // "<program name>.<hostname>.<user name>.log.<severity level>.".  So
      // logfiles will have names like
      // webserver.examplehost.root.log.INFO.19990817-150000.4354, where
      // 19990817 is a date (1999 August 17), 150000 is a time (15:00:00),
      // and 4354 is the pid of the logging process.  The date & time reflect
      // when the file was created for output.
      //
      // Where does the file get put?  Successively try the directories
      // "/tmp", and "."

So raylet's log will be printed in /tmp with ...log.. format.
@kfstorm

@mehrdadn is this "in some cases logs used to be created in the wrong folder" you met earlier?

@mehrdadn
Copy link
Contributor

@kfstorm Yeah I meant that's what it was like prior to PR #9025, where I first tried to fix it.

@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

I find these comments in glog/logging.cc

 // If no base filename for logs of this severity has been set, use a
      // default base filename of
      // "<program name>.<hostname>.<user name>.log.<severity level>.".  So
      // logfiles will have names like
      // webserver.examplehost.root.log.INFO.19990817-150000.4354, where
      // 19990817 is a date (1999 August 17), 150000 is a time (15:00:00),
      // and 4354 is the pid of the logging process.  The date & time reflect
      // when the file was created for output.
      //
      // Where does the file get put?  Successively try the directories
      // "/tmp", and "."

So raylet's log will be printed in /tmp with ...log.. format.
@kfstorm

@mehrdadn is this "in some cases logs used to be created in the wrong folder" you met earlier?

It's only extra log file on linux machine. Finally I fix it by mapping all severities to destination.

@mehrdadn
Copy link
Contributor

By the way, not sure if this is relevant to what you're doing, but here's a repro I had for an issue where different severities would end up in different log files (and it explains why I used StreamLogger)... might be helpful to read: #9063 (comment)

@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

By the way, not sure if this is relevant to what you're doing, but here's a repro I had for an issue where different severities would end up in different log files (and it explains why I used StreamLogger)... might be helpful to read: #9063 (comment)

Just set up all other levels of logging to empty destination. So there should be no such log files left by now.

@ashione
Copy link
Member Author

ashione commented Sep 25, 2020

But none of logging items can be redirected to raylet.out. Glog will print all of logs in stderr.

@ashione
Copy link
Member Author

ashione commented Sep 27, 2020

Also, I tested locally, and there are several issues.

Seems like raylet logs are not properly flushed.

Seems like there's weird file name for core worker (pid 010000).

I can't open the second snapshot figure. Could you upload it again?

@ashione
Copy link
Member Author

ashione commented Sep 27, 2020

@ashione I doubt the weird pid is unrelated to this PR because I cannot find it when I run the same code in other branch. Can you also try

import ray
ray.init()
import time
time.sleep(100)

for this branch & nightly, and see if both do or don't have that weird pid file?

Perphas it does not stand for PID. I find it in code of core_worker.cc

app_name << "-" << global_worker_id_;

.
So that's 4 bytes jobid.

@ashione
Copy link
Member Author

ashione commented Sep 27, 2020

@rkooo567 A new pytest case for checking raylet.out/raylet.err has been added.

src/ray/util/logging.cc Outdated Show resolved Hide resolved
Copy link
Member

@kfstorm kfstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better check for worker logs as well.

python/ray/tests/test_raylet_output.py Outdated Show resolved Hide resolved
python/ray/tests/test_raylet_output.py Outdated Show resolved Hide resolved
python/ray/tests/test_raylet_output.py Outdated Show resolved Hide resolved
Comment on lines +46 to +48
# Make subprocess happy in bazel.
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LANG"] = "en_US.UTF-8"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need these? I can't find them in

if __name__ == "__main__":
import pytest
sys.exit(pytest.main(["-v", __file__]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is needed for ray_start_regular (because of some click dependency issues)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just paste them from test_ouput.py.

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 27, 2020

Seems like there are memory corruption (look at asan test) and lint failure!

Once you resolved these, I will test locally again and approve!

@ashione
Copy link
Member Author

ashione commented Sep 28, 2020

Seems like there are memory corruption (look at asan test) and lint failure!

Once you resolved these, I will test locally again and approve!

It's fixed in latest commit.

src/ray/util/logging.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me test locally one last time before we merge.

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 28, 2020

Btw, this is a really nice fix :). I always wondered why we have multiple Error logs haha

@ashione
Copy link
Member Author

ashione commented Sep 28, 2020

Btw, this is a really nice fix :). I always wondered why we have multiple Error logs haha

Because INFO and more high level logs will be written to info-log-file, and WARN and its more higher level also could be printed in warn-log-file, warn-logs have been recorded twice when info-log-file and warn-log-file are same destination.

@rkooo567
Copy link
Contributor

Okay. Confirm it looks great :)

@rkooo567
Copy link
Contributor

Please add the test_ok tag when it is ready to be merged.

@ashione ashione added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 28, 2020
@ashione
Copy link
Member Author

ashione commented Sep 28, 2020

I'll check travis failures.

@ashione
Copy link
Member Author

ashione commented Sep 28, 2020

@kfstorm @rkooo567 Could you merge it?

@kfstorm kfstorm merged commit 27e1f51 into ray-project:master Sep 28, 2020
@kfstorm kfstorm deleted the fix-unsafe-glog branch September 29, 2020 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants