Opening an MIB dataset with one file per diffraction pattern is very slow #837

uellue · 2020-06-29T13:50:01Z

Reading a dataset that was saved with one file per pattern seems to read the entire dataset in the detect or initialization routine.

Version: LiberTEM current master

Expected behavior: Is there a more efficient way to load such a dataset? It is clear that performance won't be as good as multiple frames per file, but at least it should be usable if possible.

sk1p · 2020-06-29T15:05:31Z

> seems to read the entire dataset

It only reads the headers to get the image sequence numbers, so that we can be absolutely sure we got the file ordering right. But you are right, performance-wise it is not much different from reading all files, as the seek time dominates on a slow HDD (a few ms per file!). Generally, it is a very bad idea to record only one image per file, as it also means more file system overheads (especially under Windows), and the HDD struggles even when recording such data sets.
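
As a rough back-of-the-envelope sketch of why the seeks dominate: the file count (a 256 x 256 scan with one file per pattern) and the 5 ms per seek below are assumed values for illustration, not measurements, and `header_scan_seconds` is a hypothetical helper.

```python
def header_scan_seconds(num_files: int, seek_ms: float) -> float:
    """Rough lower bound on seek time when one header is read per file."""
    return num_files * seek_ms / 1000.0


# Assumed values: 256 x 256 scan positions, one file each, 5 ms per seek.
estimate = header_scan_seconds(256 * 256, 5.0)
print(f"{estimate:.0f} s")  # prints "328 s"
```

Under these assumptions, reading just the headers already costs about five and a half minutes before any actual data is loaded.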

> Is there a more efficient way to load such a dataset

We could instead rely on the file names for ordering, which would give much better initialization performance in this case. There could be an option "pedantic" (or similar) which does some extra checks to validate the ordering. Or, the other way around, as with the DM data set, there could be an option to be more lax in the checks performed.
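
A minimal sketch of what ordering by file name with an optional strict check could look like. All names here (`natural_key`, `order_files`, `read_sequence_number`, the `pedantic` flag) are hypothetical and not part of the LiberTEM API; the header reader is passed in as a callable so the sketch does not depend on the MIB header layout.

```python
import re
from typing import Callable, Iterable, List, Optional


def natural_key(name: str) -> list:
    """Sort key that compares digit runs numerically ("f10" after "f9")."""
    # re.split with a capturing group alternates text and digit runs,
    # so keys of different names stay comparable element by element.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def order_files(
    names: Iterable[str],
    read_sequence_number: Optional[Callable[[str], int]] = None,
    pedantic: bool = False,
) -> List[str]:
    """Order files by name; in pedantic mode, cross-check against headers."""
    ordered = sorted(names, key=natural_key)
    if pedantic:
        if read_sequence_number is None:
            raise ValueError("pedantic mode needs a header reader")
        # This is the slow part: one header read (and one seek) per file.
        sequence = [read_sequence_number(name) for name in ordered]
        if sequence != sorted(sequence):
            raise ValueError("file name order disagrees with header order")
    return ordered


print(order_files(["scan_10.mib", "scan_2.mib", "scan_1.mib"]))
```

In the lax case no file is opened at all, so the cost is only that of listing and sorting the names.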

uellue · 2020-06-29T15:35:47Z

Ok, using the file names for very large datasets sounds like a good idea since we can be reasonably sure that they were generated by the Merlin software or generally follow a pattern that natsort would understand. Maybe one can switch to lax mode only if there are way too many files?

> Generally, it is a very bad idea to record only one image per file, as it also means more file system overheads (especially under Windows), and the HDD struggles even when recording such data sets.

Yes, fully agreed. Maybe we could show some sort of info or warning when opening such a dataset and/or switching to lax mode? Apparently it does happen that users record data like this, and we can perhaps help to direct users in the right way.
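
Something along these lines, as a sketch only: the function name `should_use_lax_mode` and the threshold of 1000 files are made up for illustration, and the cut-off would need to be chosen based on real measurements.

```python
import warnings

# Hypothetical cut-off; no specific number has been agreed on.
MANY_FILES_THRESHOLD = 1000


def should_use_lax_mode(num_files: int,
                        threshold: int = MANY_FILES_THRESHOLD) -> bool:
    """Warn about one-file-per-frame data and suggest skipping header checks."""
    if num_files <= threshold:
        return False
    warnings.warn(
        f"Data set consists of {num_files} files; saving many frames per "
        "file is much more efficient. Falling back to file name ordering.",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```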

uellue · 2020-06-29T15:37:10Z

Also not sure if we keep this for 0.6 or bump to 0.7 since it doesn't feel like a show stopper.

sk1p · 2020-06-29T18:25:10Z

I would agree that bumping to 0.7 is okay, maybe adding a warning in 0.6 already, at least for use from the Python API. Maybe we can add some info to the documentation about this, in the MIB format docs?

uellue · 2020-06-29T18:26:18Z

> maybe adding a warning in 0.6 already, at least for use from the Python API. Maybe we can add some info to the documentation about this, in the MIB format docs?

👍 👍 👍

sk1p · 2020-06-30T10:29:27Z

Warning added in #840, further changes pushed to 0.7

sk1p · 2024-01-23T17:57:54Z

@matbryan52 do you think #1561 closes this issue?

matbryan52 · 2024-01-24T09:30:39Z

> @matbryan52 do you think #1561 closes this issue?

I just tried this on a busy Windows machine, with 65k files.

Times:

The file browser in the web interface: 15 seconds
Dataset detect: 12 seconds
Dataset load: 24 seconds

So I'd say slow but not impossible to use.

uellue added the UX/DX label Jun 29, 2020

uellue added this to the 0.6 milestone Jun 29, 2020

sk1p mentioned this issue Jun 30, 2020

Add a warning if a MIB data set is saved in an inefficient way #840

Merged

3 tasks

sk1p modified the milestones: 0.6, 0.7 Jun 30, 2020

sk1p self-assigned this Jun 30, 2020

sk1p added the file formats and I/O label Jun 30, 2020

uellue modified the milestones: 0.7, 0.8 Apr 12, 2021

sk1p modified the milestones: 0.8, backlog Aug 24, 2021
