
Support more features of the mib reader #219

Open
2 of 4 tasks
ericpre opened this issue Jan 28, 2024 · 13 comments

Comments

ericpre (Member) commented Jan 28, 2024

Support for mib files (Quantum Detector) was added in #174; this is a list of features that would be worth adding:

magnunor (Contributor) commented:

I'm very interested in all of these; if anyone needs test files, we can definitely supply them. Maybe also a pull request if someone has time.

ericpre (Member, Author) commented Jan 29, 2024

As it says, the first one should be very easy; the other two would need a volunteer!

sk1p (Contributor) commented Feb 22, 2024

For the last point, we have (MIT-licensed) decoders for most raw MIB formats, for example in the LiberTEM-live project, which should be reusable.

That includes decoding files taken with quad arrangements, but not yet arbitrary arrangements (1x4 etc.). What's also missing is support for 24-bit raw files, where two consecutive frames are sent for the high and low bits (IIRC).

Feel free to take these, or take inspiration as needed.

M0hsend commented Mar 12, 2024

Many thanks for adding this! @ericpre
Just to note that you seem to have two functions to read mib files now, one file_reader and one load_mib_data. The file_reader applies an up-down flip on the diffraction patterns here. It would be good to add this to the metadata.
Also just wondering about the plans for adding the cross gap for quad configurations. Will this be re-added to pyxem?

CSSFrancis (Member) commented:

Many thanks for adding this! @ericpre Just to note that you seem to have two functions to read mib files now, one file_reader and one load_mib_data. The file_reader applies an up-down flip on the diffraction patterns here. It would be good to add this to the metadata. Also just wondering about the plans for adding the cross gap for quad configurations. Will this be re-added to pyxem?

@M0hsend We can definitely add this back to pyxem! I just need to understand what a good workflow would be and what information is/isn't available. I don't have/use a Merlin detector, so supporting some of these features is a bit difficult without a bit more direction :).

Is the cross gap information stored in the metadata? Is the cross gap constant, or could you arrange the detectors any way you want? I think we would want something like a map_to_detector method for pyxem which would add pixels with np.nan values where the cross is. We could think about creating a Detector class which defines a layout for a certain detector and then maps pixels from the original to the transformed dataset. That would give us much more freedom to start thinking about how to handle any possible detector shape.
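To illustrate the idea, here is a minimal sketch for a quad (2x2) frame, assuming a square frame split into four equally sized chips and a 3-pixel gap; insert_cross_gap is a hypothetical helper, not an existing pyxem/rosettasciio function:

```python
import numpy as np


def insert_cross_gap(frame, gap=3):
    """Hypothetical helper: pad a quad (2x2 chips) frame with a NaN-filled
    cross where the physical gap between the chips sits."""
    half_y, half_x = frame.shape[0] // 2, frame.shape[1] // 2
    out = np.full((frame.shape[0] + gap, frame.shape[1] + gap), np.nan)
    out[:half_y, :half_x] = frame[:half_y, :half_x]              # top-left chip
    out[:half_y, half_x + gap:] = frame[:half_y, half_x:]        # top-right chip
    out[half_y + gap:, :half_x] = frame[half_y:, :half_x]        # bottom-left chip
    out[half_y + gap:, half_x + gap:] = frame[half_y:, half_x:]  # bottom-right chip
    return out


# e.g. a 512x512 quad frame becomes 515x515 with the 3-pixel cross inserted
padded = insert_cross_gap(np.ones((512, 512)))
```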

@M0hsend
Copy link

M0hsend commented Mar 12, 2024

Hi @CSSFrancis, the detector configuration is stored in the header section. When you load quad Merlin data using data = file_reader(path, lazy=False), you get data[0]['original_metadata']['mib_properties']['detector_geometry'] as 2x2. The cross gap is a physical gap between the chips (3 pixels wide) that is not actively reading data, so if it is not included, the relative distances/angles in the diffraction patterns are not quite right. That is why I wonder whether it is better suited to be handled here, as I doubt anyone would want to read wrong data.
The pyxem method could be to somehow fill that cross gap with various interpolation algorithms?
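For reference, a minimal example of reading that metadata entry (the file name is hypothetical):

```python
from rsciio.quantumdetector import file_reader

data_list = file_reader("quad_scan.mib", lazy=False)  # hypothetical file name
# Detector layout as stored in the header, reported as 2x2 for a quad acquisition
geometry = data_list[0]["original_metadata"]["mib_properties"]["detector_geometry"]
print(geometry)
```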

ericpre (Member, Author) commented Mar 12, 2024

According to the manual, there is a setting on the acquisition computer, the Layout parameter in Config.ini, and the string can be one of [2x2, Nx1, 2x2G, Nx1G], where G indicates that the gap has been added.
I checked some data with the gap added and the G is not there... which means that we can't rely on this metadata! @matkraj, is it correct/expected?

The file_reader applies an up-down flip on the diffraction patterns here. It would be good to add this to the metadata.

@M0hsend, why do you think this would be good? Are there different image orientations possible?

One motivation to do that in pyxem is that adding the gap messes up the data chunking, and I would expect that it makes sense to do it at the same time as the correction - but I haven't checked whether this statement is correct or how bad the performance penalty is.

CSSFrancis (Member) commented Mar 14, 2024

From @matkraj in pyxem/pyxem#634

Hi all, is there any reason hyperspy/rosettasciio doesn't load Merlin data lazily by default? Merlin data is easily memory mapped, so it should not be a problem and is, in my opinion, the preferred behaviour. The only case I see where memory mapping will not work is if the data was binary and RAW - 1-bit packed.

CSSFrancis (Member) commented:

@matkraj I don't know if we want to set multiple defaults for whether a signal is lazy or not, as it starts to get confusing. Memory mapping, in my opinion, is easy to do, but you can fairly easily run into problems. You can see the discussion here #174 (review), but the gist is that memory mapping works best when reading linearly, and when you start to increase the size of the dataset you run into an increasing number of problems.

NumPy likes to keep things as memory-mapped arrays until it can't, and then it will pull the data into memory, which is hard to predict and requires an in-depth understanding of NumPy. For lazy data, I'd much rather just transform it to a dask array and have explicit control over some of these operations. In reality, no one should be using a format like .mib to actually do data processing. The first thing you should do is convert the data to .zspy (and then probably throw away the raw data, or compress it if you feel sentimental).
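A minimal sketch of that conversion, assuming HyperSpy with zarr installed for the .zspy format (file names are hypothetical):

```python
import hyperspy.api as hs

# Load the .mib file lazily (dask-backed), convert once to a chunked,
# compressed .zspy store, and do all further processing from that copy.
s = hs.load("scan.mib", lazy=True)
s.save("scan.zspy")
```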

CSSFrancis (Member) commented:

Just to give an extreme example, I often have .mrc files that are something like 400x256x256x128x128 (time, x, y, kx, ky), in which case reading along the time axis is painful because you have to jump around so much on the disk :) In that case it would be faster to create a dask array chunked in (y, kx, ky), save the data in a compressed format, load the data, rechunk it along the time axis, resave it, and then operate along the time axis. I realize that sounds crazy, but there isn't really a way around it :)
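A sketch of that two-pass rechunking with dask and zarr (file names, shapes and chunk sizes are hypothetical):

```python
import dask.array as da

# First pass: data saved chunked along the navigation/detector axes,
# e.g. shape (400, 256, 256, 128, 128) for (time, x, y, kx, ky).
data = da.from_zarr("scan_nav_chunked.zarr")

# Second pass: rechunk so that each chunk spans the full time axis, resave,
# and run the time-series analysis on the rechunked copy.
rechunked = data.rechunk({0: -1, 1: 32, 2: 32, 3: 128, 4: 128})
rechunked.to_zarr("scan_time_chunked.zarr")
```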

matkraj commented Mar 14, 2024

Hi @CSSFrancis, thanks for pointing this out for me. When I was working on GPU processing in the past, I preferred any geometry adjustments to be done after the data was processed, but I recognise this will depend on the type of processing needed. For template matching this is trivial, but ptychography might be heavily impacted by this.

I was wondering if the default behaviour should be able to capture more usage variants, e.g. not having enough RAM, being able to visualise the data quickly with .plot(), etc.

CSSFrancis (Member) commented:

Hi @CSSFrancis, thanks for pointing this out for me. When I was working on GPU processing in the past, I preferred any geometry adjustments to be done after the data was processed, but I recognise this will depend on the type of processing needed. For template matching this is trivial, but ptychography might be heavily impacted by this.

Yeah, that's an important point, and I think I've struggled to disseminate this idea in the community! In general, better education about lazy processing is something that is necessary in the community. Honestly, there are no great answers to questions such as:

  1. What is the ideal chunking for a 4-D STEM dataset?
  2. How should data be efficiently saved/loaded?
  3. How should data be processed efficiently?

Honestly, the best solution is to just save things using zarr, as compression fixes so many of these questions. Just keep buying more and more hard drives, set up an array, and at some point you get I/O speeds of around 1-10 GB/sec times whatever compression factor you get. At that point I/O is no longer your bottleneck :). Hard drives are cheap; 4D STEM datasets are large.
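As an illustration of the compression point, a hypothetical sketch with dask and zarr (array contents, shape and chunking are made up):

```python
import dask.array as da

# Counting-detector data is typically sparse and compresses very well; zarr
# applies chunk-wise compression by default, so the on-disk copy is small and
# the effective read bandwidth is multiplied by the compression ratio.
data = da.zeros((10, 32, 32, 128, 128), chunks=(1, 16, 16, 128, 128), dtype="uint8")
data.to_zarr("scan.zarr")
```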

I was wondering if the default behaviour should be able to capture more usage variants, e.g. not having enough RAM, being able to visualise the data quickly with .plot(), etc.

I feel like this goes with the above. I'd default to non-lazy operations, and then if people want lazy loading they can pass lazy=True. We could be better about how the lazy data is chunked, and you can add distributed loading if you want.

ericpre (Member, Author) commented Mar 16, 2024

Hi all, is there any reason hyperspy/rosettasciio doesn't load Merlin data lazily by default? Merlin data is easily memory mapped, so it should not be a problem and is, in my opinion, the preferred behaviour. The only case I see where memory mapping will not work is if the data was binary and RAW - 1-bit packed.

If you want to access the memmap directly, there is a lower-level function, https://hyperspy.org/rosettasciio/user_guide/supported_formats/quantumdetector.html#rsciio.quantumdetector.load_mib_data, that returns it.

Hi @CSSFrancis, thanks for pointing this out for me. When I was working on GPU processing in the past, I preferred any geometry adjustments to be done after the data was processed, but I recognise this will depend on the type of processing needed. For template matching this is trivial, but ptychography might be heavily impacted by this.

Just to note that navigation_shape can be used to specify the navigation shape and therefore load the data as a stack of images using navigation_shape=(number_of_frames,). It is a bit annoying because it is necessary to know the number of frames when reading the data, but maybe this can be fixed by not doing any data reshape for a given value of navigation_shape. I can't really think of what would be a good value for that, maybe -1?
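A minimal example of that lower-level access (the file name and number of frames are hypothetical):

```python
from rsciio.quantumdetector import load_mib_data

# Load the frames as a stack; the number of frames currently has to be known
# up front to build the navigation shape.
data = load_mib_data("scan.mib", navigation_shape=(65536,))
print(data.shape)  # (65536, detector_height, detector_width)
```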
