Tags: bytedance/byteps
launcher: join workers as they exit (#429) Check worker exit status in the order the workers exit. This way failed workers are discovered early, and the entire job is terminated as soon as possible. Signed-off-by: yulu.jia <[email protected]>
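The idea of joining workers in exit order can be sketched as below. This is a minimal illustration, not the launcher's actual code: the function name `join_in_exit_order`, the polling approach, and the `poll_interval` parameter are all assumptions made for the example.

```python
import subprocess
import sys
import time


def join_in_exit_order(procs, poll_interval=0.05):
    """Wait for processes in the order they finish; stop the rest on failure.

    Returns the first non-zero exit code seen, or 0 if every process succeeded.
    (Hypothetical helper, not part of BytePS.)
    """
    pending = list(procs)
    while pending:
        for proc in list(pending):
            code = proc.poll()  # None while the process is still running
            if code is None:
                continue
            pending.remove(proc)
            if code != 0:
                # A worker failed: terminate the remaining ones right away
                # instead of waiting for them in launch order.
                for other in pending:
                    other.terminate()
                for other in pending:
                    other.wait()
                return code
        time.sleep(poll_interval)
    return 0
```

With a plain in-order join, a failure in the last worker would only be noticed after every earlier worker had finished. Polling all workers lets the failure surface as soon as it happens. For example, passing one `subprocess.Popen` that sleeps and one that exits with status 3 returns 3 almost immediately and terminates the sleeping process.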
ps-lite: disable ucx error handling by default (#424) disable ucx signal handlers so that some faulty user code can still run even if some child process of the program encounters a segfault. Signed-off-by: Yulu Jia <[email protected]>
ps-lite: update ps-lite (#423) update ps-lite to the latest commit Signed-off-by: Yulu Jia <[email protected]>
tensorflow: fix bug in broadcast_variables (#416) When there's only one rank in total, broadcast_variables should still return a tf operation. Signed-off-by: Yulu Jia <[email protected]>
server: improve thread safety (#412) protect update_buf_ with a lock.
update doc for core affinity envs (#407) change semicolon-separated to colon-separated Signed-off-by: Yulu Jia <[email protected]>
fix bool env, disable avx512 (#399) Fix bool env parsing in server.cc. Disable avx512 when compiling: enabling avx512 may cause a tensorflow extension build failure, as avx512 support in Eigen is likely not stable yet. Signed-off-by: yulu.jia <[email protected]>
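The commit message does not say what the parsing bug in server.cc was. As a generic illustration only, strict boolean parsing of an environment variable might look like the sketch below. The helper name `env_bool`, the accepted spellings, and the variable name `EXAMPLE_FLAG` are assumptions, not BytePS behavior.

```python
import os

# Accepted spellings are an assumption made for this sketch.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off", ""}


def env_bool(name, default=False, environ=None):
    """Parse an environment variable as a boolean (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default  # variable not set at all
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Reject unknown spellings instead of silently treating them as true.
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")
```

A common pitfall this avoids is treating any non-empty string as true, which would make `EXAMPLE_FLAG=0` enable the flag.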
tf: skip bcast if there's only one worker (#385) Skip broadcasting variables if there's only one worker Signed-off-by: Yulu Jia <[email protected]>
torch: fix hang after int tensor push_pull (#358) mark task done after averaging an int tensor. This fixes a bug introduced in 46944e8. Signed-off-by: Yulu Jia <[email protected]>
build: skip installing disabled extensions (#354) If an extension is explicitly disabled via export BYTEPS_WITHOUT_MXNET=1, export BYTEPS_WITHOUT_PYTORCH=1, or export BYTEPS_WITHOUT_TENSORFLOW=1, do not try to install it. Signed-off-by: Yulu Jia <[email protected]>
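A build script could honor these variables roughly as follows. This is a sketch under stated assumptions rather than the project's setup.py: the three variable names come from the commit message, while the function name `extensions_to_install`, the extension labels, and the rule that only the value `1` disables an extension are illustrative choices.

```python
import os

# Variable names are taken from the commit message; the labels on the left
# are illustrative and not necessarily the names used in setup.py.
DISABLE_FLAGS = {
    "mxnet": "BYTEPS_WITHOUT_MXNET",
    "pytorch": "BYTEPS_WITHOUT_PYTORCH",
    "tensorflow": "BYTEPS_WITHOUT_TENSORFLOW",
}


def extensions_to_install(environ=None):
    """Return the extensions that were not explicitly disabled.

    Assumption: an extension counts as disabled only when its variable
    is set to exactly "1".
    """
    environ = os.environ if environ is None else environ
    return [
        name
        for name, flag in DISABLE_FLAGS.items()
        if environ.get(flag) != "1"
    ]
```

For instance, with only `BYTEPS_WITHOUT_MXNET=1` set, the function returns the PyTorch and TensorFlow entries, so the build would never attempt the MXNet extension.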