Bnb/dev #117
Conversation
I want to chat more about the OOM tensor issue...
Also, unrelated, can you change this log message to INFO? It's the most useful monitoring message IMO and should be always present in the logs: https://github.com/NREL/sup3r/blob/main/sup3r/pipeline/forward_pass.py#L1829
sup3r/pipeline/forward_pass.py (Outdated)

                   ' result in constant model output.')
            if self.output_chunk_mem / 1e9 > 1.5:
                logger.warning(msg)
                warnings.warn(msg)
I have mixed feelings about this. Do we know what's actually causing this? Is this a documented issue? I can't find anything online.
Most importantly, the size of the output chunk at 1.5 GB is likely not the cause of this problem. It's more likely a memory overflow mid-network where the tensor is much larger and totally dependent on model architecture. If we can confirm that tensorflow basically does a silent OOM error, we should just include a psutil check in the model.generate() method between layers.
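For the psutil idea, a check between layers could look something like this minimal sketch. The function name, the `headroom` factor, and the call site are all assumptions for illustration, not sup3r's or tensorflow's API:

```python
# Hypothetical sketch: fail fast if system memory is too low to hold
# an intermediate tensor. Not sup3r's actual API; names are illustrative.
import psutil


def check_available_memory(required_bytes, headroom=1.5):
    """Raise MemoryError if available RAM is below required_bytes * headroom.

    Intended to be called between layers in something like
    model.generate(), before allocating the next intermediate tensor.
    """
    avail = psutil.virtual_memory().available
    if avail < required_bytes * headroom:
        raise MemoryError(
            f'Need ~{required_bytes * headroom / 1e9:.2f} GB with headroom '
            f'but only {avail / 1e9:.2f} GB is available.')
```

This turns a silent failure into a loud one, but it requires knowing each intermediate tensor's size up front, which is model-architecture dependent.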
OR maybe the best option would be to check the output values for uniformity and if so, raise a hard error. This would be the most flexible and accurate way of doing things. You'd get a hard stop and you'd know with certainty that the output is garbage.
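A uniformity check on the chunk output could be a small numpy helper along these lines. This is a hedged sketch only; the function name, tolerance, and error message are assumptions, not anything in sup3r:

```python
# Illustrative "hard error on constant output" check, per the suggestion
# above. Tolerance and naming are assumptions, not sup3r's API.
import numpy as np


def check_constant_output(chunk, rtol=1e-6):
    """Raise if a forward-pass output chunk is (nearly) constant.

    Constant output is treated here as a symptom of a silent
    out-of-memory failure during the forward pass.
    """
    chunk = np.asarray(chunk, dtype=float)
    spread = np.nanmax(chunk) - np.nanmin(chunk)
    if spread <= rtol * max(abs(np.nanmax(chunk)), 1e-12):
        raise RuntimeError(
            'Forward pass produced constant output, a possible symptom '
            'of a silent out-of-memory failure.')
```

The upside of this design is that it is model-agnostic: it only looks at the output values, so no threshold tied to a particular architecture is needed.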
I haven't been able to find anything either. I guess you're right that it's probably not dependent on output size but on some intermediate size. I'd prefer we do a check before rather than waiting to see if the output is bad. It's very clear that it's related to memory, though, since just increasing the padding above a threshold suddenly gives constants.
But if you do the check before the first forward pass you're assuming A) this threshold you've set is going to be applicable to any model architecture (bad assumption) and B) the user is going to be parsing the log files for warning statements (unlikely if you're running over night).
If you raise an exception in response to a bad chunk output you will 1) save a lot of compute time because the job won't continue to run, 2) save the human time of having to monitor a huge log file for really bad messages, and 3) have a check that works for any model architecture.
Yeah, I definitely agree on the first part. I'm hoping there is a model-agnostic pre-check we could do. If we do a check on output then the only compute we're avoiding is the h5 write.
I think there's a misunderstanding - we should check every chunk output, so the exception would be raised after the first forward pass chunk is complete and all other chunks would not be run. So you're wasting one chunk forward pass, but you're not running many subsequent chunks.
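The control flow being described - check each chunk as it completes and hard-stop on the first bad one - could be sketched like this. `run_forward_pass`, `chunks`, and `is_constant` are hypothetical stand-ins for sup3r's pipeline objects:

```python
# Illustrative control flow only: raise on the first constant chunk so
# no subsequent chunks are run. Names are hypothetical stand-ins.
def run_all_chunks(chunks, run_forward_pass, is_constant):
    """Run each chunk's forward pass, failing hard on constant output."""
    outputs = []
    for i, chunk in enumerate(chunks):
        out = run_forward_pass(chunk)
        if is_constant(out):
            # Hard stop: one chunk's compute is wasted, but nothing
            # bad is written and no later chunks are run.
            raise RuntimeError(f'Chunk {i} produced constant output.')
        outputs.append(out)
    return outputs
```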
Yeah this makes sense. I was thinking that not every chunk was guaranteed to fail but nearly all will, except possibly the ones on the edges.
@@ -389,7 +389,7 @@ def interpolate_data(self, feature, low_res):
         self.save_cache(var_itp, file_name)
         return var_itp

-    def get_stats(self, var, interp=False):
+    def get_stats(self, var, interp=False, period=None):
add period to docstring?
        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
This property+setter is functionally identical to just having a public `self.failed_chunks` attribute without the methods. Any reason for doing the property+setter?
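For reference, a pass-through property/setter pair like the one in the diff behaves exactly like a plain attribute. A minimal sketch of the pattern, with the surrounding class name assumed:

```python
# Sketch of the pass-through property pattern under discussion.
# The class name is an assumption; only the property names come
# from the diff.
class Pipeline:
    def __init__(self):
        self._failed_chunks = False

    @property
    def failed_chunks(self):
        """Dedicated flag to check for chunk failures."""
        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
        self._failed_chunks = failed
```

The property does add a natural place to hang a docstring (or later validation) without changing callers, which is the usual argument for it.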
Haha just figured we should make it clear there is this dedicated thing to check for failures rather than adding to the big list of attrs in init.
Gotcha, sounds good!
…eturning constant output with relu). Does this tell us something?
added checks for constant output from forward passes. This appears to be a quiet memory error from tensorflow.
Added check for forward pass output chunk size. I used a padding that put the output chunks above 1.5 GB and didn't find out the output was constant until after collection! C'mon tensorflow.