Opening an MIB dataset with one file per diffraction pattern is very slow #837
Comments
It only reads the headers to use the image sequence numbers, to make absolutely sure we get the file ordering right. But you are right, performance-wise it is not much different from reading all the files, since seek time dominates on a slow HDD (a few ms per file!). Generally, it is a very bad idea to record only one image per file: it also means more file-system overhead (especially under Windows), and the HDD struggles even when recording such data sets.
We could instead rely on the file names for ordering, which would give much better initialization performance in this case. There could be an option …
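For scale, a few milliseconds of seek time per file adds up quickly at these file counts. A back-of-envelope sketch (the 5 ms average seek time is an assumed figure, and the 65k file count is taken from the test later in this thread):

```python
# Rough estimate of initialization cost when each per-frame file's
# header must be read from a spinning disk. Both numbers here are
# assumptions for illustration, not measured values.
n_files = 65_000
seek_ms = 5.0

total_s = n_files * seek_ms / 1000.0
print(f"~{total_s:.0f} s (~{total_s / 60:.1f} min) spent seeking alone")
```

That is several minutes of pure seek latency before any pixel data is read, which matches the "slow but not impossible" experience reported below.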
Ok, using the file names for very large datasets sounds like a good idea since we can be reasonably sure that they were generated by the Merlin software or generally follow a pattern that natsort would understand. Maybe one can switch to lax mode only if there are way too many files?
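A filename-based fallback could look something like the following minimal sketch. The threshold, function names, and warning text are all illustrative assumptions, not LiberTEM's actual API; the natural-sort key mimics what `natsort` would do, using only the standard library:

```python
import re
import warnings

# Hypothetical cutoff above which reading every header is deemed too slow.
LAX_THRESHOLD = 10_000

def natural_key(name):
    # Split the name into text and integer chunks so that
    # "frame_2.mib" sorts before "frame_10.mib".
    return [int(p) if p.isdigit() else p.lower()
            for p in re.split(r"(\d+)", name)]

def order_mib_files(filenames, read_header_seq=None):
    """Order files by header sequence number when feasible; fall back
    to natural filename order ("lax" mode) for huge file counts."""
    if read_header_seq is not None and len(filenames) <= LAX_THRESHOLD:
        # Strict mode: trust the image sequence number in each header.
        return sorted(filenames, key=read_header_seq)
    # Lax mode: trust the filename pattern, but tell the user about it.
    warnings.warn(
        "Ordering files by name instead of header sequence numbers; "
        "make sure the filenames reflect acquisition order."
    )
    return sorted(filenames, key=natural_key)
```

The warning doubles as the user-facing hint discussed below: it fires exactly when the loader stops verifying the ordering against the headers.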
Yes, fully agreed. Maybe we could show some sort of info or warning when opening such a dataset and/or switching to lax mode? Apparently it does happen that users record data like this, and we can perhaps help to direct users in the right way.
Also not sure if we keep this for 0.6 or bump to 0.7, since it doesn't feel like a show-stopper.
I would agree that bumping to 0.7 is okay, maybe adding a warning in 0.6 already, at least for use from the Python API. Maybe we can add some info about this to the documentation, in the MIB format docs?
👍 👍 👍
Warning added in #840, further changes pushed to 0.7
@matbryan52 do you think #1561 closes this issue?
I just tried this on a busy Windows machine, with 65k files. Times:
So I'd say slow but not impossible to use. |
Reading a dataset that was saved with one file per diffraction pattern seems to read the entire dataset during the detection/initialization routine.
Version: LiberTEM current master
Expected behavior: Is there a more efficient way to load such a dataset? It is clear that performance won't be as good as with multiple frames per file, but it should at least be usable if possible.