
Fix bug with number of evaluation steps #384

Merged: 1 commit into main from sdtblck-patch-1 on Jul 30, 2021
Conversation

@sdtblck (Contributor) commented on Jul 30, 2021

We were running way too many evaluation steps when the model is pipe parallel and has gradient accumulation steps (g.a.s.) enabled, because of this line:

            for _ in range(neox_args.gradient_accumulation_steps):

Changing this to 1 when the model is pipe parallel fixes the issue, as .eval_batch() already takes gradient accumulation steps into account:

for _ in range(1 if neox_args.is_pipe_parallel else neox_args.gradient_accumulation_steps):

This also relates to this issue, and is probably the reason we were seeing a StopIteration error.
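
For reference, a minimal sketch of the corrected evaluation loop. Only the quoted `range(...)` expression comes from this PR; the surrounding names (`evaluate`, `eval_iters`, `forward_step_fn`, `data_iterator`) are illustrative assumptions, not the exact NeoX code.

```python
import torch

def evaluate(neox_args, model, data_iterator, forward_step_fn):
    """Illustrative sketch of the fixed evaluation loop; not the exact NeoX code."""
    model.eval()
    losses = []
    with torch.no_grad():
        for _ in range(neox_args.eval_iters):
            # With pipeline parallelism, model.eval_batch() already consumes
            # gradient_accumulation_steps micro-batches internally, so the
            # inner loop must run only once per evaluation iteration.
            inner_steps = (
                1
                if neox_args.is_pipe_parallel
                else neox_args.gradient_accumulation_steps
            )
            for _ in range(inner_steps):
                if neox_args.is_pipe_parallel:
                    loss = model.eval_batch(data_iterator)
                else:
                    loss = forward_step_fn(data_iterator, model)
                losses.append(loss)
    model.train()
    return sum(losses) / len(losses)
```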

@sdtblck requested a review from a team as a code owner on July 30, 2021 at 11:05
@sweinbach (Contributor) commented:

Tested in a running training as well...

@sdtblck merged commit 54e622b into main on Jul 30, 2021
@sdtblck deleted the sdtblck-patch-1 branch on July 30, 2021 at 11:12
sdtblck added a commit that referenced this pull request Aug 21, 2021
* optimize data preprocessing

semaphore is a little too small and slows down tokenizing

* Make killall.sh less bruteforce

* [temporary] fix to index errors

* [temporary] fix to index errors

* print sizes of tensors when inspecting checkpoint (#382)

Co-authored-by: Samuel Weinbach <[email protected]>

* Use lru_cache for GPT2Tokenizer.bpe (#383)

GPT2Tokenizer currently uses an unbounded cache, which causes very
high memory usage with tools/preprocess_data.py (a minimal sketch of the bounded-cache idea appears after this commit list)

* Fix bug with number of evaluation steps (#384)

we were running way too many evaluation steps if the model is pipe parallel + has g.a.s on because of this line

```python
            for _ in range(neox_args.gradient_accumulation_steps):
```

- fixing this to 1 if the model is pipe parallel fixes the issue, as .eval_batch() already takes gradient accumulation steps into account.

* Create CITATION.cff

* Update CITATION.cff

* Update documentation (#392)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add info about installing fused kernels

* Update README.md

* Update README.md

* sparsity + minor typos

add the instructions to install triton

* change path to ssd-1

* typo

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Shivanshu Purohit <[email protected]>

Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Samuel Weinbach <[email protected]>
Co-authored-by: Samuel Weinbach <[email protected]>
Co-authored-by: iczero <[email protected]>
Co-authored-by: Shivanshu Purohit <[email protected]>
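
Regarding the lru_cache change quoted above (#383): a minimal sketch of the bounded-cache idea. The stand-alone `bpe` function below is a simplified placeholder, not the actual GPT2Tokenizer.bpe implementation, which applies the same idea to its real byte-pair merge logic.

```python
from functools import lru_cache

@lru_cache(maxsize=65536)  # bounded cache: hot tokens stay fast, memory stays capped
def bpe(token: str) -> str:
    # Placeholder for the real byte-pair-encoding merge loop; this sketch
    # just spaces out the characters so the example stays self-contained.
    return " ".join(token)

# Repeated calls with the same token hit the cache instead of recomputing,
# while eviction keeps memory from growing without bound during long
# tools/preprocess_data.py runs.
print(bpe("hello"))
print(bpe("hello"))
```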