Bnb/dev #117
Conversation
I want to chat more about the OOM tensor issue...
Also, unrelated, can you change this log message to INFO? It's the most useful monitoring message IMO and should be always present in the logs: https://github.com/NREL/sup3r/blob/main/sup3r/pipeline/forward_pass.py#L1829
sup3r/pipeline/forward_pass.py (Outdated)

                   ' result in constant model output.')
            if self.output_chunk_mem / 1e9 > 1.5:
                logger.warning(msg)
                warnings.warn(msg)
I have mixed feelings about this. Do we know what's actually causing this? Is this a documented issue? I can't find anything online.
Most importantly, the size of the output chunk at 1.5 GB is likely not the cause of this problem. It's more likely a memory overflow mid-network where the tensor is much larger and totally dependent on model architecture. If we can confirm that tensorflow basically does a silent OOM error, we should just include a psutil check in the model.generate() method between layers.
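For the psutil idea, a check between layers could look something like this minimal sketch. The function name, the `headroom` factor, and the call site are all assumptions for illustration, not sup3r's or tensorflow's API:

```python
# Hypothetical sketch: fail fast if system memory is too low to hold
# an intermediate tensor. Not sup3r's actual API; names are illustrative.
import psutil


def check_available_memory(required_bytes, headroom=1.5):
    """Raise MemoryError if available RAM is below required_bytes * headroom.

    Intended to be called between layers in something like
    model.generate(), before allocating the next intermediate tensor.
    """
    avail = psutil.virtual_memory().available
    if avail < required_bytes * headroom:
        raise MemoryError(
            f'Need ~{required_bytes * headroom / 1e9:.2f} GB with headroom '
            f'but only {avail / 1e9:.2f} GB is available.')
```

This turns a silent failure into a loud one, but it requires knowing each intermediate tensor's size up front, which is model-architecture dependent.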
OR maybe the best option would be to check the output values for uniformity and if so, raise a hard error. This would be the most flexible and accurate way of doing things. You'd get a hard stop and you'd know with certainty that the output is garbage.
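A uniformity check on the chunk output could be a small numpy helper along these lines. This is a hedged sketch only; the function name, tolerance, and error message are assumptions, not anything in sup3r:

```python
# Illustrative "hard error on constant output" check, per the suggestion
# above. Tolerance and naming are assumptions, not sup3r's API.
import numpy as np


def check_constant_output(chunk, rtol=1e-6):
    """Raise if a forward-pass output chunk is (nearly) constant.

    Constant output is treated here as a symptom of a silent
    out-of-memory failure during the forward pass.
    """
    chunk = np.asarray(chunk, dtype=float)
    spread = np.nanmax(chunk) - np.nanmin(chunk)
    if spread <= rtol * max(abs(np.nanmax(chunk)), 1e-12):
        raise RuntimeError(
            'Forward pass produced constant output, a possible symptom '
            'of a silent out-of-memory failure.')
```

The upside of this design is that it is model-agnostic: it only looks at the output values, so no threshold tied to a particular architecture is needed.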
I haven't been able to find anything either. I guess you're right that it's probably not dependent on output size but on some intermediate size. I'd prefer we do a check before rather than waiting to see if the output is bad. It's very clear that it's related to memory, though, since just increasing the padding above a threshold suddenly gives constants.
But if you do the check before the first forward pass you're assuming A) this threshold you've set is going to be applicable to any model architecture (bad assumption) and B) the user is going to be parsing the log files for warning statements (unlikely if you're running over night).
If you raise an exception in response to a bad chunk output you will 1) save a lot of compute time because the job won't continue to run, 2) save the human time of having to monitor a huge log file for really bad messages, and 3) have a check that works for any model architecture.
Yeah, I definitely agree on the first part. I'm hoping there is a model-agnostic pre-check we could do. If we do a check on output then the only compute we're avoiding is the h5 write.
I think there's a misunderstanding - we should check every chunk output, so the exception would be raised after the first forward pass chunk is complete and all other chunks would not be run. So you're wasting one chunk forward pass, but you're not running many subsequent chunks.
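The control flow being described - check each chunk as it completes and hard-stop on the first bad one - could be sketched like this. `run_forward_pass`, `chunks`, and `is_constant` are hypothetical stand-ins for sup3r's pipeline objects:

```python
# Illustrative control flow only: raise on the first constant chunk so
# no subsequent chunks are run. Names are hypothetical stand-ins.
def run_all_chunks(chunks, run_forward_pass, is_constant):
    """Run each chunk's forward pass, failing hard on constant output."""
    outputs = []
    for i, chunk in enumerate(chunks):
        out = run_forward_pass(chunk)
        if is_constant(out):
            # Hard stop: one chunk's compute is wasted, but nothing
            # bad is written and no later chunks are run.
            raise RuntimeError(f'Chunk {i} produced constant output.')
        outputs.append(out)
    return outputs
```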
Yeah this makes sense. I was thinking that not every chunk was guaranteed to fail but nearly all will, except possibly the ones on the edges.
@@ -389,7 +389,7 @@ def interpolate_data(self, feature, low_res):
         self.save_cache(var_itp, file_name)
         return var_itp

-    def get_stats(self, var, interp=False):
+    def get_stats(self, var, interp=False, period=None):
add period to docstring?
        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
This property+setter is functionally identical to just having a public `self.failed_chunks` attribute without the methods. Any reason for doing the property+setter?
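For reference, a pass-through property/setter pair like the one in the diff behaves exactly like a plain attribute. A minimal sketch of the pattern, with the surrounding class name assumed:

```python
# Sketch of the pass-through property pattern under discussion.
# The class name is an assumption; only the property names come
# from the diff.
class Pipeline:
    def __init__(self):
        self._failed_chunks = False

    @property
    def failed_chunks(self):
        """Dedicated flag to check for chunk failures."""
        return self._failed_chunks

    @failed_chunks.setter
    def failed_chunks(self, failed):
        self._failed_chunks = failed
```

The property does add a natural place to hang a docstring (or later validation) without changing callers, which is the usual argument for it.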
Haha just figured we should make it clear there is this dedicated thing to check for failures rather than adding to the big list of attrs in init.
Gotcha, sounds good!
…eturning constant output with relu). Does this tell us something?
added checks for constant output from forward passes. This appears to be a quiet memory error from tensorflow.
Added check for forward pass output chunk size. I used a padding that put the output chunks above 1.5 GB and didn't find out the output was constant until after collection! C'mon tensorflow.