Opening an MIB dataset with one file per diffraction pattern is very slow #837

Open
uellue opened this issue Jun 29, 2020 · 8 comments

@uellue
Member

uellue commented Jun 29, 2020

Reading a dataset that was saved with one file per pattern seems to read the entire dataset in the detect or initialization routine.

Version: LiberTEM current master

Expected behavior: Is there a more efficient way to load such a dataset? It is clear that performance won't be as good as multiple frames per file, but at least it should be usable if possible.

@uellue uellue added the UX/DX label Jun 29, 2020
@uellue uellue added this to the 0.6 milestone Jun 29, 2020
@sk1p
Member

sk1p commented Jun 29, 2020

> seems to read the entire dataset

It only reads the headers to use the image sequence numbers, to make absolutely sure we got the file ordering right. But you are right, performance-wise it is not much different from reading all files, as the seek time dominates on a slow HDD (a few ms per file!). Generally, it is a very bad idea to record only one image per file, as it also means more file system overhead (especially under Windows), and the HDD struggles even when recording such data sets.

> Is there a more efficient way to load such a dataset

We could instead rely on the file names for ordering, which would give much better initialization performance in this case. There could be an option pedantic (or similar) that does some extra checks to validate the ordering. Or, the other way around, as with the DM data set, there could be an option to be more lax in the checks performed.
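A minimal sketch of what name-based ordering with an optional pedantic check could look like. This is an illustration, not LiberTEM's implementation: `natural_key`, `order_files` and the `read_sequence_number` callback are hypothetical names, and the natural sort is done with the standard library here (the `natsort` package would be an alternative).

```python
import re
from typing import Callable, Iterable, List, Optional


def natural_key(name: str) -> list:
    """Sort key that compares digit runs numerically ("scan2" < "scan10")."""
    # re.split with a capturing group alternates text and digit runs:
    # even positions hold text, odd positions hold the captured digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def order_files(
    names: Iterable[str],
    read_sequence_number: Optional[Callable[[str], int]] = None,
) -> List[str]:
    """Order files by name; optionally verify against header sequence numbers.

    `read_sequence_number` is a hypothetical callback that opens one file and
    returns the sequence number stored in its header. Passing it corresponds
    to a "pedantic" mode and costs one file access per file.
    """
    ordered = sorted(names, key=natural_key)
    if read_sequence_number is not None:
        sequence = [read_sequence_number(name) for name in ordered]
        if sequence != sorted(sequence):
            raise ValueError(
                "file name order disagrees with header sequence numbers"
            )
    return ordered
```

Without the callback no file is opened at all, so initialization cost no longer scales with per-file seek time.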

@uellue
Member Author

uellue commented Jun 29, 2020

Ok, using the file names for very large datasets sounds like a good idea since we can be reasonably sure that they were generated by the Merlin software or generally follow a pattern that natsort would understand. Maybe one can switch to lax mode only if there are way too many files?

> Generally, it is a very bad idea to record only one image per file, as it also means more file system overhead (especially under Windows), and the HDD struggles even when recording such data sets.

Yes, fully agreed. Maybe we could show some sort of info or warning when opening such a dataset and/or switching to lax mode? Apparently it does happen that users record data like this, and we can perhaps help to direct users in the right way.
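As a rough sketch of the idea, switching to lax mode and warning the user could be combined in one small helper. The function name `choose_ordering_mode`, the mode strings and the cutoff of 1000 files are assumptions made for illustration; the thread does not specify any of them.

```python
import warnings

# Assumed cutoff for illustration only; the thread does not name a number.
MAX_FILES_FOR_HEADER_CHECK = 1000


def choose_ordering_mode(
    num_files: int, limit: int = MAX_FILES_FOR_HEADER_CHECK
) -> str:
    """Return "pedantic" for small file counts and "lax" otherwise.

    When falling back to lax mode, emit a warning that points the user
    towards recording multiple frames per file.
    """
    if num_files > limit:
        warnings.warn(
            f"Data set consists of {num_files} files; relying on file names "
            "for ordering. Recording multiple frames per file gives much "
            "better performance.",
            RuntimeWarning,
            stacklevel=2,
        )
        return "lax"
    return "pedantic"
```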

@uellue
Member Author

uellue commented Jun 29, 2020

Also not sure if we keep this for 0.6 or bump to 0.7 since it doesn't feel like a show stopper.

@sk1p
Member

sk1p commented Jun 29, 2020

I would agree that bumping to 0.7 is okay, maybe adding a warning in 0.6 already, at least for use from the Python API. Maybe we can add some info to the documentation about this, in the MIB format docs?

@uellue
Member Author

uellue commented Jun 29, 2020

> maybe adding a warning in 0.6 already, at least for use from the Python API. Maybe we can add some info to the documentation about this, in the MIB format docs?

👍 👍 👍

@sk1p
Member

sk1p commented Jun 30, 2020

Warning added in #840, further changes pushed to 0.7

@sk1p sk1p self-assigned this Jun 30, 2020
@uellue uellue modified the milestones: 0.7, 0.8 Apr 12, 2021
@sk1p sk1p modified the milestones: 0.8, backlog Aug 24, 2021
@sk1p
Member

sk1p commented Jan 23, 2024

@matbryan52 do you think #1561 closes this issue?

@matbryan52
Member

> @matbryan52 do you think #1561 closes this issue?

I just tried this on a busy Windows machine, with 65k files.

Times:

  • The file browser in the web interface: 15 seconds
  • Dataset detect: 12 seconds
  • Dataset load: 24 seconds

So I'd say slow but not impossible to use.
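To put these numbers in per-file terms, here is a small back-of-the-envelope calculation. It assumes that "65k" means roughly 65,000 files; the exact count is not given.

```python
# "65k" from the comment above, taken as 65,000 files (assumption).
NUM_FILES = 65_000

# Reported wall-clock times in seconds.
timings_s = {"file browser": 15, "detect": 12, "load": 24}


def per_file_ms(total_seconds: float, num_files: int = NUM_FILES) -> float:
    """Average time per file in milliseconds."""
    return total_seconds / num_files * 1000


for step, seconds in timings_s.items():
    print(f"{step}: {per_file_ms(seconds):.2f} ms per file")
# file browser: 0.23 ms per file
# detect: 0.18 ms per file
# load: 0.37 ms per file
```

That is well below the "few ms per file" quoted earlier in the thread for a slow HDD, although the storage used for this test is not stated.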
