
Bnb/dev #117

Merged: 8 commits merged into main from bnb/dev on Nov 19, 2022

Conversation

@bnb32 (Collaborator) commented Nov 18, 2022

Added a check for the forward pass output chunk size. I used a padding that put the output chunks above 1.5 GB and didn't find out the output was constant until after collection! C'mon, TensorFlow.

@grantbuster (Member) left a comment


I want to chat more about the OOM tensor issue...

Also, unrelated: can you change this log message to INFO? It's the most useful monitoring message IMO and should always be present in the logs: https://github.com/NREL/sup3r/blob/main/sup3r/pipeline/forward_pass.py#L1829

       ' result in constant model output.')
if self.output_chunk_mem / 1e9 > 1.5:
    logger.warning(msg)
    warnings.warn(msg)
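For context, a minimal standalone sketch of this kind of size check (the 1.5 GB threshold and the message wording come from the diff above; the function name, the shape-based memory estimate, and the float32 output dtype are assumptions):

```python
import logging
import warnings

import numpy as np

logger = logging.getLogger(__name__)


def check_output_chunk_size(output_chunk_shape, dtype=np.float32,
                            threshold_gb=1.5):
    """Warn if a forward pass output chunk is large enough that a silent
    TensorFlow memory error (and constant output) becomes a risk."""
    # Estimate chunk memory from its shape and dtype (assumed approach).
    output_chunk_mem = np.prod(output_chunk_shape) * np.dtype(dtype).itemsize
    if output_chunk_mem / 1e9 > threshold_gb:
        msg = (f'Output chunk is ~{output_chunk_mem / 1e9:.2f} GB, which may '
               'result in constant model output.')
        logger.warning(msg)
        warnings.warn(msg)
```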
@grantbuster (Member) commented Nov 18, 2022

I have mixed feelings about this. Do we know what's actually causing it? Is this a documented issue? I can't find anything online.

Most importantly, the size of the output chunk at 1.5 GB is likely not the cause of this problem. It's more likely a memory overflow mid-network where the tensor is much larger and totally dependent on model architecture. If we can confirm that tensorflow basically does a silent OOM error, we should just include a psutil check in the model.generate() method between layers.
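For illustration, a hedged sketch of what a psutil check between layers could look like (the layer-by-layer loop and interface here are assumptions, not sup3r's actual model.generate() implementation):

```python
import psutil


def apply_layers_with_mem_check(layers, hi_res, min_free_gb=2.0):
    """Apply generator layers one at a time and check available system
    memory between them, so an impending OOM raises a loud error instead
    of silently producing constant output. Layer interface assumed."""
    for i, layer in enumerate(layers):
        free_gb = psutil.virtual_memory().available / 1e9
        if free_gb < min_free_gb:
            raise MemoryError(
                f'Only {free_gb:.2f} GB of memory available before layer {i} '
                f'({layer.__class__.__name__}); aborting the forward pass.')
        hi_res = layer(hi_res)
    return hi_res
```

Note that this only covers host memory; a GPU OOM would need a different check, which is part of why the uniformity check sketched below may be more robust.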

OR maybe the best option would be to check the output values for uniformity and if so, raise a hard error. This would be the most flexible and accurate way of doing things. You'd get a hard stop and you'd know with certainty that the output is garbage.
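A minimal sketch of that kind of uniformity check on a completed output chunk (the function name and call site are assumptions):

```python
import numpy as np


def check_constant_output(out_chunk, chunk_index):
    """Raise a hard error if a forward pass output chunk is constant,
    treating uniform output as a symptom of a silent memory error."""
    out = np.asarray(out_chunk)
    if out.size and np.all(out == out.flat[0]):
        raise RuntimeError(
            f'Forward pass output for chunk {chunk_index} is constant '
            f'(value={out.flat[0]}). This likely indicates a silent memory '
            'error; stopping instead of writing garbage output.')
```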

@bnb32 (Collaborator, Author) commented Nov 18, 2022

I haven't been able to find anything either. I guess you're right that it's probably not dependent on the output size but on some intermediate size. I'd prefer we do a check before the forward pass rather than waiting to see if the output is bad. It's very clear that it's related to memory, though, since just increasing the padding above a threshold suddenly gives constants.

@grantbuster (Member):

But if you do the check before the first forward pass, you're assuming A) the threshold you've set is applicable to any model architecture (bad assumption) and B) the user is going to be parsing the log files for warning statements (unlikely if you're running overnight).

If you raise an exception in response to a bad chunk output, you will 1) save a lot of compute time because the job won't continue to run, 2) save the human time of having to monitor a huge log file for really bad messages, and 3) have a check that works for any model architecture.

@bnb32 (Collaborator, Author):

Yeah, I definitely agree on the first part. I'm hoping there is a model-agnostic pre-check we could do. If we only check the output, then the only compute we're avoiding is the h5 write.

@grantbuster (Member) commented Nov 18, 2022

I think there's a misunderstanding: we should check every chunk output, so the exception would be raised after the first forward pass chunk completes and none of the other chunks would run. You'd waste one chunk's forward pass, but you wouldn't run many subsequent chunks.

@bnb32 (Collaborator, Author):

Yeah, this makes sense. I was thinking that not every chunk is guaranteed to fail, but nearly all will, except possibly the ones on the edges.

@@ -389,7 +389,7 @@ def interpolate_data(self, feature, low_res):
         self.save_cache(var_itp, file_name)
         return var_itp

-    def get_stats(self, var, interp=False):
+    def get_stats(self, var, interp=False, period=None):
@grantbuster (Member):

add period to docstring?
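For reference, a hedged sketch of what that docstring addition could look like (only the signature comes from the diff; the class stub and parameter descriptions here are assumptions):

```python
class Interpolator:
    """Minimal stand-in class; only the docstring pattern matters here."""

    def get_stats(self, var, interp=False, period=None):
        """Get statistics for a variable.

        Parameters
        ----------
        var : str
            Name of the variable to compute statistics for.
        interp : bool
            Whether to compute statistics on the interpolated data.
        period : str | None
            Optional period to restrict the statistics to. If None,
            statistics are computed over the full record. (Wording assumed.)
        """
        raise NotImplementedError('Sketch only; see sup3r for the real code.')
```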

        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
@grantbuster (Member):

This property+setter is functionally identical to just having a public self.failed_chunks attribute without the methods. Any reason for doing the property+setter?
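For reference, a minimal sketch of the two equivalent patterns being compared (class names here are just for illustration):

```python
class WithProperty:
    """Property + setter: behaves the same as a plain attribute unless the
    getter/setter add validation, logging, or computation."""

    def __init__(self):
        self._failed_chunks = False

    @property
    def failed_chunks(self):
        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
        self._failed_chunks = failed


class WithAttribute:
    """Plain public attribute: functionally identical in this case."""

    def __init__(self):
        self.failed_chunks = False
```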

@bnb32 (Collaborator, Author):

Haha, I just figured we should make it clear there's a dedicated thing to check for failures rather than adding to the big list of attrs in __init__.

@grantbuster (Member):

Gotcha, sounds good!

Commit (title truncated): "…eturning constant output with relu). Does this tell us something?"
@bnb32 bnb32 merged commit 856f95a into main Nov 19, 2022
@bnb32 bnb32 deleted the bnb/dev branch November 19, 2022 16:19
github-actions bot pushed a commit that referenced this pull request Nov 19, 2022
Added checks for constant output from forward passes. This appears to be a quiet memory error from TensorFlow.