
Time and Space Limits for Deep Learning on UoS HPCs #18

Open
TheLostLambda opened this issue Feb 18, 2023 · 1 comment

Comments

@TheLostLambda

Hello! I'm looking to retrain / train a larger model inspired by https://github.com/nadavbra/protein_bert and was wondering:

  1. Where should I store the training data? It's >1TB, but it seems that most storage locations of that size on the Sheffield HPCs are temporary only?
  2. Could that data be persisted, and could I reserve GPU usage for a few weeks to a month? The original model was trained for a month, but likely on a lower-powered GPU!

The ProteinBERT GitHub repository seems to have some pretty nice instructions for the retraining; I'm just wondering whether I'll run into any issues using that much disk space and that much GPU time!

Thanks a ton,
Brooks

@willfurnass

I suggest the /fastdata areas (Lustre filesystem), as they don't have quotas and they're the most performant areas when working with larger files. Files in those areas are deleted after 60 days; is that too little time?
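
A quick sketch of one way to spot files approaching that purge window; note that the /fastdata/$USER layout is an assumption based on common site conventions, and the purge may key off access time rather than modification time, so check the cluster docs:

```python
import os
import time

# Hedged sketch: flag files in a /fastdata area that are nearing the
# 60-day purge window mentioned above. The /fastdata/$USER path is an
# assumption; the purge policy may use atime rather than mtime.
fastdata = os.path.join("/fastdata", os.environ["USER"])
purge_days = 60
warn_days = purge_days - 7  # flag anything within a week of deletion

now = time.time()
for root, _dirs, files in os.walk(fastdata):
    for name in files:
        path = os.path.join(root, name)
        age_days = (now - os.path.getmtime(path)) / 86400
        if age_days > warn_days:
            print(f"{path}: {age_days:.0f} days old")
```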

Can the training process easily be paused and resumed? If so, that gives you more flexibility in how the workload can be run: it could potentially be split across multiple jobs, possibly using free CPU/GPU cycles (see https://docs.hpc.shef.ac.uk/en/latest/hpc/scheduler/index.html#preemptable-jobs).
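
For the pause/resume part, here's a minimal sketch of periodic checkpointing with TensorFlow's `tf.train.CheckpointManager`, since ProteinBERT is Keras/TensorFlow-based. The model, paths, and save interval below are placeholders, not the actual ProteinBERT training code:

```python
import os
import tensorflow as tf

# Hypothetical checkpoint location under /fastdata (adjust to your area).
ckpt_dir = os.path.join("/fastdata", os.environ["USER"], "protein_bert_ckpts")

# Stand-in model; substitute the real ProteinBERT network here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
step = tf.Variable(0, dtype=tf.int64)

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(checkpoint, ckpt_dir, max_to_keep=3)

# On job start-up, resume from the latest checkpoint if one exists,
# so a preempted or time-limited job picks up where it left off.
if manager.latest_checkpoint:
    checkpoint.restore(manager.latest_checkpoint)
    print(f"Resumed from {manager.latest_checkpoint} at step {int(step)}")
else:
    print("No checkpoint found; starting from scratch")

for _ in range(100):
    # train_one_step(model, optimizer, batch)  # placeholder training step
    step.assign_add(1)
    if int(step) % 50 == 0:
        manager.save(checkpoint_number=int(step))
```

With checkpoints like this, the month-long run can be chopped into many shorter jobs (including preemptable ones), each of which restores the latest state before continuing.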

If you want to discuss your GPU resource needs in more detail then do get in touch via [email protected].
