
parallelize sequential flag with drmaa or multiple cores #133

Open · wants to merge 3 commits into main

Conversation

@Terf commented Aug 9, 2021

Submitting validator jobs to the cluster in batches provides a massive speedup; running on multiple cores provides a more modest one. One drawback is that the progress bar is essentially meaningless, since jobs run asynchronously. Another is that qsub-ing jobs requires creating a bunch of tempfiles in a network-mounted directory (the bids_dir is used) so they're visible to the exec nodes, which is a bit hacky.
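To make the batching idea concrete, here is a minimal sketch of how a queue of validator calls could be grouped into fixed-size batches before submission. The function name `build_batches`, the batch size, and the subject labels are illustrative assumptions, not CuBIDS's actual API.

```python
def build_batches(queue, batch_size=10):
    """Split a list of pending validator calls into fixed-size batches.

    Each batch would then be submitted as one grid job, so the number
    of scheduler submissions is len(queue) / batch_size rather than
    len(queue).
    """
    return [queue[i:i + batch_size] for i in range(0, len(queue), batch_size)]

# Example: 25 subjects with batch_size=10 -> 3 batches of sizes 10, 10, 5
batches = build_batches([f"sub-{i:02d}" for i in range(25)], batch_size=10)
```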

@Terf (Author) commented Aug 9, 2021

Also, when running on CUBIC you'll see the output

User-specified core binding strategy: linear
User-specified core binding type: set
User-specified core binding amount: 1

for each job submitted, which rather floods your terminal, but Mark said it's just because he was debugging the scheduler and will take those lines out.

call = build_validator_call(tmpdirname,
                            nifti_head,
                            subj_consist)
# TMPDIR isn't networked (not available on both login + exec nodes), so use bids_dir
Contributor

will this be ok if bids_dir is in datalad?

Author

Hmm, good point, I hadn't thought about needing to unlock files. I admit it's very hacky, and it almost made me think this isn't a good problem to submit to the grid, since it requires so many temporary files that need to live on a network drive (not $TMPDIR), but I'm not sure what the best solution would be. Maybe we could use the user's home directory, say ~/.cubids, as the tmpdir?
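The fallback being discussed could look something like the sketch below: prefer the BIDS directory when it exists, otherwise fall back to a scratch directory under the user's home. Neither the function name `pick_tmpdir` nor the `~/.cubids` location is part of CuBIDS; this only illustrates the suggestion in the comment above.

```python
import os
from pathlib import Path


def pick_tmpdir(bids_dir=None):
    """Pick a directory for job tempfiles that exec nodes can see.

    Prefer the (network-mounted) BIDS directory; otherwise fall back
    to a hypothetical ~/.cubids scratch directory, which on a cluster
    is typically also on a networked filesystem.
    """
    if bids_dir is not None and os.path.isdir(bids_dir):
        return Path(bids_dir)
    scratch = Path.home() / ".cubids"
    scratch.mkdir(parents=True, exist_ok=True)
    return scratch
```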

Contributor

Is there a way to get a tmpdir on the compute node and copy the files into that?

@Terf (Author) Aug 9, 2021

I think it'd be possible to move more of the logic into the grid job so scripts don't have to be written to a networked drive. But since it's impossible to connect the stdout of a grid job to the main process, the output will ultimately have to be written to some file, and that file needs to be on a networked drive unless all the jobs, including the main process, run on the same exec node.
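The "output has to come back through files" point can be sketched as follows: each grid job writes its result to a per-job file in a shared directory, and the main process collects them after the jobs finish. The directory layout and the `.out` suffix are illustrative assumptions, not what the PR actually does.

```python
from pathlib import Path


def collect_job_outputs(out_dir):
    """Gather per-job output files from a networked directory.

    Each grid job is assumed (hypothetically) to have written its
    validator output to <out_dir>/<job>.out; the main process reads
    them back once the scheduler reports the jobs finished.
    """
    return [p.read_text() for p in sorted(Path(out_dir).glob("*.out"))]
```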

cubids/cli.py (outdated; resolved)
if ret.returncode != 0:
    logger.error("Errors returned "
                 "from validator run, parsing now")
Contributor

this may break flake8

@Terf (Author) Aug 9, 2021

Is there a particular formatter, e.g. black or autopep8, you're using for the project?

jids = []

for batch in build_drmaa_batch(queue):
    tmp = tempfile.NamedTemporaryFile(delete=False, dir=opts.bids_dir,
                                      prefix=".", suffix=".sh")
Contributor

Is this something a user can customize? Or will they need to customize it? Does this work out of the box on CUBIC?

Author

Not sure what would need to be customized? It does work out of the box on CUBIC. LSF also supports DRMAA, but PMACS set it up in a weird way and sounded uninterested in changing that when I asked :(
