
Fix bug with number of evaluation steps #384

Merged: 1 commit into main from sdtblck-patch-1 on Jul 30, 2021
Conversation

@sdtblck (Contributor) commented on Jul 30, 2021

We were running way too many evaluation steps when the model is pipe parallel and has gradient accumulation steps (g.a.s.) enabled, because of this line:

            for _ in range(neox_args.gradient_accumulation_steps):

Changing this to 1 when the model is pipe parallel fixes the issue, as .eval_batch() already takes gradient accumulation steps into account:

for _ in range(1 if neox_args.is_pipe_parallel else neox_args.gradient_accumulation_steps):

This also relates to this issue, and is probably the reason we were seeing a StopIteration error.
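
For reference, a minimal sketch of the corrected evaluation loop. Only the quoted `range(...)` expression comes from this PR; the surrounding names (`evaluate`, `eval_iters`, `forward_step_fn`, `data_iterator`) are illustrative assumptions, not the exact NeoX code.

```python
import torch

def evaluate(neox_args, model, data_iterator, forward_step_fn):
    """Illustrative sketch of the fixed evaluation loop; not the exact NeoX code."""
    model.eval()
    losses = []
    with torch.no_grad():
        for _ in range(neox_args.eval_iters):
            # With pipeline parallelism, model.eval_batch() already consumes
            # gradient_accumulation_steps micro-batches internally, so the
            # inner loop must run only once per evaluation iteration.
            inner_steps = (
                1
                if neox_args.is_pipe_parallel
                else neox_args.gradient_accumulation_steps
            )
            for _ in range(inner_steps):
                if neox_args.is_pipe_parallel:
                    loss = model.eval_batch(data_iterator)
                else:
                    loss = forward_step_fn(data_iterator, model)
                losses.append(loss)
    model.train()
    return sum(losses) / len(losses)
```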

@sdtblck requested a review from a team as a code owner on July 30, 2021 at 11:05
@sweinbach (Contributor) commented:

Tested in a running training as well...

@sdtblck merged commit 54e622b into main on Jul 30, 2021
@sdtblck deleted the sdtblck-patch-1 branch on July 30, 2021 at 11:12
sdtblck added a commit that referenced this pull request Aug 21, 2021
* optimize data preprocessing

semaphore is a little too small and slows down tokenizing

* Make killall.sh less bruteforce

* [temporary] fix to index errors

* [temporary] fix to index errors

* print sizes of tensors when inspecting checkpoint (#382)

Co-authored-by: Samuel Weinbach <[email protected]>

* Use lru_cache for GPT2Tokenizer.bpe (#383)

GPT2Tokenizer currently uses an unbounded cache, which causes very
high memory usage with tools/preprocess_data.py (a minimal sketch of the bounded-cache idea appears after this commit list)

* Fix bug with number of evaluation steps (#384)

we were running way too many evaluation steps if the model is pipe parallel + has g.a.s on because of this line

```python
            for _ in range(neox_args.gradient_accumulation_steps):
```

- fixing this to 1 if the model is pipe parallel fixes the issue, as .eval_batch() already takes gradient accumulation steps into account.

* Create CITATION.cff

* Update CITATION.cff

* Update documentation (#392)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add info about installing fused kernels

* Update README.md

* Update README.md

* sparsity + minor typos

add the instructions to install triton

* change path to ssd-1

* typo

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Shivanshu Purohit <[email protected]>

Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Samuel Weinbach <[email protected]>
Co-authored-by: Samuel Weinbach <[email protected]>
Co-authored-by: iczero <[email protected]>
Co-authored-by: Shivanshu Purohit <[email protected]>
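
Regarding the lru_cache change quoted above (#383): a minimal sketch of the bounded-cache idea. The stand-alone `bpe` function below is a simplified placeholder, not the actual GPT2Tokenizer.bpe implementation, which applies the same idea to its real byte-pair merge logic.

```python
from functools import lru_cache

@lru_cache(maxsize=65536)  # bounded cache: hot tokens stay fast, memory stays capped
def bpe(token: str) -> str:
    # Placeholder for the real byte-pair-encoding merge loop; this sketch
    # just spaces out the characters so the example stays self-contained.
    return " ".join(token)

# Repeated calls with the same token hit the cache instead of recomputing,
# while eviction keeps memory from growing without bound during long
# tools/preprocess_data.py runs.
print(bpe("hello"))
print(bpe("hello"))
```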