
added simple feature to allow for incremental progress if fwp chunks fail #93

Merged: grantbuster merged 2 commits into main from gb/fwp_partial_progress on Sep 19, 2022

Conversation

grantbuster (Member) opened this pull request. The inline review comments below refer to this snippet from the diff:

"""This routine runs forward passes on all spatiotemporal chunks for
the given node index"""
for chunk_index in strategy.node_chunks[node_index]:
fwp = cls(strategy, chunk_index, node_index)
fwp.run_chunk()
out_file = strategy.out_files[chunk_index]
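
For context, the gist of the incremental behavior being discussed is roughly the following. This is a hedged sketch rather than the PR's actual code: the run_node wrapper name is illustrative, while the increment flag and the existing-output check are inferred from the conversation below.

import os

def run_node(cls, strategy, node_index, increment=True):
    """Run all spatiotemporal chunks for a node, optionally skipping
    chunks whose output files already exist from a previous run."""
    for chunk_index in strategy.node_chunks[node_index]:
        out_file = strategy.out_files[chunk_index]
        # With increment=True, a chunk whose output file is already on disk
        # is assumed to have finished in an earlier (partially failed) job
        # and is skipped instead of being recomputed.
        if increment and out_file is not None and os.path.exists(out_file):
            continue
        fwp = cls(strategy, chunk_index, node_index)
        fwp.run_chunk()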
bnb32 (Collaborator):
Just thought about adding this this morning! Good call

bnb32 (Collaborator):

Let's add this arg to the fwp CLI, just in case we want to overwrite easily.

grantbuster (Member, Author), Sep 16, 2022:

Yeah, thanks. I've been having a lot of random job failures, I think due to too many concurrent reads? But it's hard to tell what's going on. I think Lustre is generally struggling with too much parallel I/O. This should help, but reducing our reliance on cache files will also be good.

bnb32 (Collaborator):

Yeah, I've had a few as well, since I moved my env to Lustre (hmmm). Basically the only overlapping I/O is to the conda env, right? There's no overlapping cache read/write between fwp calls (except for the time index files).

grantbuster (Member, Author):

Yeah, I think in my case I just had a few random job failures when writing cache files, possibly due to writing too many small cache files to a single OST in parallel. Removing the cache pattern input and clearing the cache dir fixed my jobs.

bnb32 (Collaborator):

Sweet! What's an OST, btw?

grantbuster (Member, Author):

Object storage target (I think). When you stripe a directory it distributes files across many OSTs (up to 30ish on Eagle?). Single OSTs can get overloaded by parallel I/O, I think.

bnb32 (Collaborator):

Ah, gotcha.

bnb32 (Collaborator) left a review:

Just the option to provide increment=False through the config. All good otherwise!
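
As a rough illustration of what that could look like, here is a hedged sketch of a forward-pass config carrying the flag. The key names other than increment are illustrative placeholders, not necessarily the actual sup3r config schema.

import json

# Hypothetical forward-pass config: "increment": true means reruns skip
# chunks whose output files already exist; false forces a full overwrite.
config = {
    "file_paths": "./source_data*.h5",         # illustrative input glob
    "out_pattern": "./out/fwp_{file_id}.h5",    # illustrative output pattern
    "increment": True,
}

with open("config_fwp.json", "w") as f:
    json.dump(config, f, indent=2)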

"""This routine runs forward passes on all spatiotemporal chunks for
the given node index"""
for chunk_index in strategy.node_chunks[node_index]:
fwp = cls(strategy, chunk_index, node_index)
fwp.run_chunk()
out_file = strategy.out_files[chunk_index]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add this arg to fwp cli, just in case we want to overwrite easily

"""This routine runs forward passes on all spatiotemporal chunks for
the given node index"""
for chunk_index in strategy.node_chunks[node_index]:
fwp = cls(strategy, chunk_index, node_index)
fwp.run_chunk()
out_file = strategy.out_files[chunk_index]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've had a few as well, since I moved my env to lustre (hmmm). Basically the only overlapping io is to the conda env right? There's no overlapping cache read/write between fwp calls (except for the time index files).

grantbuster merged commit a65e8e0 into main on Sep 19, 2022.
grantbuster deleted the gb/fwp_partial_progress branch on September 19, 2022 at 15:14.
github-actions bot pushed a commit that referenced this pull request on Sep 19, 2022: added simple feature to allow for incremental progress if fwp chunks fail