
Performing metadata equality via different methods provides different results in Iris v3.0.1 #4506

Open
ehogan opened this issue Jan 14, 2022 · 7 comments
Labels
Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info Type: Bug

Comments

@ehogan
Contributor

ehogan commented Jan 14, 2022

🐛 Bug Report

I have two cubes. When I use cube1.metadata.difference(cube2.metadata), the metadata shows a difference, even though the values look the same. However, when I use assert cube1.attributes == cube2.attributes I get no AssertionError.

I discovered this issue because:

  • with v2.4.0 (e.g. scitools/default-previous) I can run assert cube1.metadata == cube2.metadata and I get no AssertionError.
  • with v3.0.1 (e.g. scitools/default-current) I get an AssertionError when I run assert cube1.metadata == cube2.metadata.

How To Reproduce

% module load scitools
% python
>>> import iris
>>> import numpy as np
>>> cube1 = iris.cube.Cube(1, attributes={"value1": np.float64(89.5)})
>>> cube2 = iris.cube.Cube(1, attributes={"value1": 89.5})
>>> assert cube1.metadata == cube2.metadata
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> cube1.metadata.difference(cube2.metadata)
CubeMetadata(standard_name=None, long_name=None, var_name=None, units=None, attributes=({'value1': 89.5}, {'value1': 89.5}), cell_methods=None)
>>> assert cube1.attributes == cube2.attributes
>>>

Expected behaviour

I'm not sure what would be best?

Either cube1.metadata.difference(cube2.metadata) shows the types in the difference, so it's clear where the difference is (because the values look the same and this took longer than I care to admit to figure out) 😝

Or assert cube1.attributes == cube2.attributes fails

Or both?

@bjlittle
Member

bjlittle commented Jan 19, 2022

@ehogan Under the hood, for metadata attributes dictionary comparisons, we use python-xxhash (see xxHash).

The fact that you can put numpy scalars into the dictionary is the issue here. Although np.float64(89.5) is equivalent to 89.5, they are not the same thing, and we've chosen to discriminate based on that.
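To make that concrete, outside of iris entirely (plain Python and numpy, checkable in any session):

>>> import numpy as np
>>> np.float64(89.5) == 89.5
True
>>> type(np.float64(89.5)) is type(89.5)
False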

This is why assert cube1.metadata == cube2.metadata fails, and the difference method is consistent with that behaviour and returns that there is indeed a difference between the attributes dictionaries.
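As an illustration of that flavour of discrimination (a hypothetical sketch only, not iris's actual implementation), a hash that folds in the value's type is enough to tell the two apart; the attr_token helper below is invented for the example:

>>> import numpy as np
>>> import xxhash
>>> def attr_token(value):
...     # Hypothetical sketch: hash the type name together with the value's
...     # string form, so np.float64(89.5) and 89.5 yield different tokens
...     # even though they compare equal.
...     h = xxhash.xxh64()
...     h.update(type(value).__name__.encode())
...     h.update(str(value).encode())
...     return h.hexdigest()
...
>>> attr_token(np.float64(89.5)) == attr_token(89.5)
False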

Perhaps it might be more helpful to the user if the difference method highlighted this explicitly 🤔

You might reasonably ask why oh why do we care about such a difference?

... one concrete example is cube arithmetic, when the metadata for two operand cubes is combined into one resultant cube. Given your example above, which version of attributes["value1"] should be used in the resultant cube's metadata? The decision is not clear, hence the difference does kinda matter.

Note that cube1.attributes == cube2.attributes is a simple Python dictionary comparison... and in this case gives a different result, as the Python builtin comparison being performed is different.
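For example, plain dict equality happily coerces across the numpy/builtin boundary:

>>> import numpy as np
>>> {"value1": np.float64(89.5)} == {"value1": 89.5}
True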

I can't imagine that we'll subclass dict and specialise the __eq__ behaviour to align with metadata __eq__.

At the very least this behaviour should be highlighted and clarified... but do note that you are highlighting a special case here of using a numpy scalar. Your example doesn't extend to non-scalar numpy arrays, e.g.,

>>> import numpy as np
>>> a = dict(values=np.arange(3))
>>> b = dict(values=[0, 1, 2]) 
>>> a == b
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Also, note that the metadata combine, compare and difference behaviours avoid this type of ValueError, due to the approach we've adopted for the special treatment of attributes dictionaries.
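A minimal sketch of one way to sidestep the ambiguous truth value when attribute values may be arrays (illustrative only, not iris's actual code; attr_equal is invented for the example):

>>> import numpy as np
>>> def attr_equal(left, right):
...     # np.array_equal accepts scalars, lists and arrays alike and always
...     # returns a single bool, so no ValueError can be raised.
...     return np.array_equal(left, right)
...
>>> attr_equal(np.arange(3), [0, 1, 2])
True
>>> attr_equal(np.arange(3), [0, 1, 9])
False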

@bjlittle
Member

@ehogan Note also that this change in behaviour was introduced in iris 3.0.0, and was one of the many reasons for a major release of iris, i.e., it breaks backwards compatibility for some behaviours. Reference #3785

@larsbarring
Contributor

larsbarring commented Jan 27, 2022

And here is a view from someone who all too frequently (to say the least...) has to deal with all sorts of metadata (and data) imperfections. Model data that is supposed to be one homogeneous set, but was produced in different batches, on different machines, quality controlled by different people, using different [versions of] software tools, and goodness knows what else lies behind the imperfections. In the best of worlds such problems would be non-existent, or the data would simply be rejected/fixed by the producers. But I am afraid that I do not live in that world.

Last year we processed something like 10,000-15,000 files from a large online database, and something like 60% had metadata (or data) imperfections. Most of these were easy to spot and could be fixed with a few steps of preprocessing. That left something like 500 files making up about 30-50 supposedly homogeneous dataset units. Homogeneous in the sense that my_cube_list.concatenate_cube() should go down just like a sip of coffee. As it was impossible to guess where these imperfections were (in which file, in which metadata attribute, or which file(s) had coordinates that differed in the least significant bit), trial and error was the method. Not preferred, but at hand. A lot of coffee was consumed before that smooth sip finally went down.

So, what conclusions can be drawn from this with any bearing on iris?

  • Python itself does a very good job of working with different types, be they integers or float64, and iris could aim in the same direction. In the example above, which is the more general type: np.float64(89.5) or 89.5?
  • Define some general rules for how to deal with this, something like: "In the case of differences in [meta]data types iris will always attempt to convert to an equivalent numpy type and issue a warning. This warning can be muted by ...." In this way the program will continue and either chew on the next metadata imperfection or just happily run to the end (see the sketch after this list).
  • Of course there might be situations where such a general rule is of no help, and raising an error is the only way.
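A minimal sketch of the rule proposed above, assuming a hypothetical normalise_attr helper (not part of iris):

>>> import warnings
>>> import numpy as np
>>> def normalise_attr(value):
...     # Hypothetical: coerce numpy scalars to the equivalent builtin type
...     # and warn, instead of treating the type as a hard mismatch.
...     if isinstance(value, np.generic):
...         warnings.warn(f"converting {type(value).__name__} to a builtin type")
...         return value.item()
...     return value
...
>>> with warnings.catch_warnings():
...     warnings.simplefilter("ignore")  # mute the warning, per the proposal
...     normalise_attr(np.float64(89.5))
...
89.5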

I think that steps in this direction would be immensely useful to everyone dealing with data from various sources.

@rcomer
Member

rcomer commented Jan 27, 2022

@larsbarring have you looked at cube_helper? I've not used it myself, but I think it was written for CMIP ensembles. I think we're all agreed that Iris's own merge and concatenate functions should be more lenient, though.

@ehogan
Contributor Author

ehogan commented Jan 27, 2022

Thanks @bjlittle :) I'm happy with the discrimination between np.float64(89.5) and 89.5. I do think it would be super helpful if the output showed the types of the values, as it is not clear what the difference is in the output (the values look the same to me!). So instead of:

CubeMetadata(attributes=({'value1': 89.5}, {'value1': 89.5}))

could the output be:

CubeMetadata(attributes=({'value1': np.float64(89.5)}, {'value1': 89.5}))

or similar?
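For instance, a hypothetical formatter along these lines would make the type visible (just a sketch of the idea, not a proposal for the exact output format):

>>> import numpy as np
>>> def show(value):
...     # Hypothetical: prefix numpy scalars with their type so that
...     # np.float64(89.5) and plain 89.5 render distinguishably.
...     if isinstance(value, np.generic):
...         return f"np.{type(value).__name__}({value.item()!r})"
...     return repr(value)
...
>>> show(np.float64(89.5)), show(89.5)
('np.float64(89.5)', '89.5')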

Also, would it be worth adding to the documentation that assert cube1.attributes == cube2.attributes doesn't distinguish between types in the same way as cube1.metadata.difference(cube2.metadata)? If this is something you think is worth doing, I'd be happy to have a go at a PR? Or, alternatively, I'd be happy to review something? :)

@github-actions
Contributor

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

@github-actions github-actions bot added the Stale (a stale issue/pull-request) label Jun 12, 2023
@ehogan
Contributor Author

ehogan commented Jun 15, 2023

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

This issue is still important to me 😊

@github-actions github-actions bot removed the Stale (a stale issue/pull-request) label Jun 16, 2023
@ESadek-MO ESadek-MO added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label May 8, 2024