added simple feature to allow for incremental progress if fwp chunks fail #93
Conversation
"""This routine runs forward passes on all spatiotemporal chunks for | ||
the given node index""" | ||
for chunk_index in strategy.node_chunks[node_index]: | ||
fwp = cls(strategy, chunk_index, node_index) | ||
fwp.run_chunk() | ||
out_file = strategy.out_files[chunk_index] |
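The change being discussed amounts to skipping any chunk whose output file already exists, so a resubmitted job only redoes the chunks that failed. A minimal sketch of that guard, assuming a hypothetical `increment` flag and a simple existence check (the actual diff isn't fully shown here):

```python
import os

def run_node(cls, strategy, node_index, increment=True):
    """Run forward passes for one node, optionally skipping finished chunks.

    `increment` is the hypothetical flag from this PR: when True, chunks
    whose output file already exists are skipped, so a resubmitted job
    only redoes the chunks that failed.
    """
    for chunk_index in strategy.node_chunks[node_index]:
        out_file = strategy.out_files[chunk_index]
        if increment and os.path.exists(out_file):
            continue  # output already written by an earlier run
        fwp = cls(strategy, chunk_index, node_index)
        fwp.run_chunk()
```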
Just thought about adding this this morning! Good call.
Let's add this arg to the fwp CLI, just in case we want to overwrite easily.
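For reference, a hedged sketch of what exposing this through a CLI option could look like, assuming a click-based CLI (the flag name and wiring here are hypothetical, not the actual fwp CLI):

```python
import click

@click.command()
@click.option('--increment/--no-increment', default=True,
              help='Skip chunks whose output file already exists; '
                   'pass --no-increment to overwrite everything.')
def fwp(increment):
    # Hypothetical wiring: forward the flag to the forward-pass strategy.
    click.echo(f'increment={increment}')
```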
Yeah, thanks. I've been having a lot of random job failures, I think due to too many concurrent reads, but it's hard to tell what's going on. I think Lustre is generally struggling with too much parallel IO. This should help, but reducing our reliance on cache files will also be good.
Yeah, I've had a few as well, since I moved my env to Lustre (hmmm). Basically the only overlapping IO is to the conda env, right? There's no overlapping cache read/write between fwp calls (except for the time index files).
Yeah, I think in my case I just had a few random job failures when writing cache files, possibly due to writing too many small cache files to a single OST in parallel. Removing the cache pattern input and clearing the cache dir fixed my jobs.
Sweet! What's an OST, btw?
Object storage target (I think). When you stripe a directory, it gets distributed across many OSTs (up to 30ish on Eagle?). Single OSTs can get overloaded by parallel IO, I think.
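As an aside, striping can be adjusted with Lustre's `lfs` tool; here's a hypothetical helper for striping a cache directory across multiple OSTs (assumes the `lfs` CLI is on PATH; a stripe count of -1 means use all available OSTs):

```python
import subprocess

def stripe_dir(path, stripe_count=-1):
    """Set the Lustre stripe count on a directory so new files in it are
    spread across multiple OSTs instead of hammering a single one."""
    subprocess.run(['lfs', 'setstripe', '-c', str(stripe_count), path],
                   check=True)
```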
Ah, gotcha.
Just the option to provide increment=False through the config. All good otherwise!
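For anyone reading along, a hypothetical config fragment (the exact fwp config schema isn't shown in this thread; the key name is taken from the comment above):

```python
# Hypothetical fwp config entry:
fwp_config = {
    'increment': False,  # re-run every chunk, overwriting existing outputs
}
```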
"""This routine runs forward passes on all spatiotemporal chunks for | ||
the given node index""" | ||
for chunk_index in strategy.node_chunks[node_index]: | ||
fwp = cls(strategy, chunk_index, node_index) | ||
fwp.run_chunk() | ||
out_file = strategy.out_files[chunk_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add this arg to fwp cli, just in case we want to overwrite easily
"""This routine runs forward passes on all spatiotemporal chunks for | ||
the given node index""" | ||
for chunk_index in strategy.node_chunks[node_index]: | ||
fwp = cls(strategy, chunk_index, node_index) | ||
fwp.run_chunk() | ||
out_file = strategy.out_files[chunk_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I've had a few as well, since I moved my env to lustre (hmmm). Basically the only overlapping io is to the conda env right? There's no overlapping cache read/write between fwp calls (except for the time index files).