Simplify hash and equality operations #47

kalvdans · 2021-12-08T13:15:57Z

Also broaden the test coverage with set operations and equality comparison against an object of a different type.

Follow-up on #43

kalvdans · 2021-12-08T13:26:13Z

The (macos-latest, 3.6) integration test failure looks like an unrelated internet access failure.

lmmentel · 2021-12-08T22:50:41Z

Thanks for the help @kalvdans! I checked your version of __eq__ and __hash__ and I wonder what your motivation is for refactoring it that way. In other words, would you mind explaining why you think your solution should be used instead of the current one?

Also, please take into account that Python's hash returns the value itself when given a (small) integer, e.g.

>>> hash(1)
1
>>> hash(100)
100
>>>

It only computes the actual hash for other types of values:

>>> hash("a")
-2483041677650604276

In your case, calling hash(self.atomic_number) seems redundant.
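A quick check of the integer behaviour described above, as a sketch that assumes CPython: small positive integers hash to themselves, while -1 and values at or above sys.hash_info.modulus do not. Atomic numbers (1 to 118) sit well inside the range where hash(n) == n.

```python
import sys

# CPython reduces integer hashes modulo a fixed prime
# (2**61 - 1 on typical 64-bit builds).
modulus = sys.hash_info.modulus

# Every atomic number from 1 to 118 hashes to itself.
small_ints_hash_to_themselves = all(hash(n) == n for n in range(1, 119))

# -1 is reserved as an error code inside CPython, so it is remapped.
hash_of_minus_one = hash(-1)

# Values equal to the modulus wrap around to zero.
hash_of_modulus = hash(modulus)
```

So hash(self.atomic_number) and self.atomic_number give the same result for real elements, which is why the extra call adds nothing here.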

kalvdans · 2021-12-09T05:52:11Z

A hash algorithm is used in hash tables to allow lookups in constant time on average. It should not be used for anything else.

The old implementation of __eq__ carries a slight risk of collision. Since it hashes strings, it gets the random seed mixed in. On 32-bit platforms the hash is 32 bits wide. Let's say the hash is perfectly random. The probability of any two of the 118 existing elements having the same hash is then 1.62e-6 according to the birthday paradox. So in roughly one out of every million Python invocations, two different elements will compare equal.

On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.
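The birthday estimate above can be reproduced with a short calculation. This is a sketch under the same idealised assumption (hash values uniformly random over 2**bits possibilities); the function name is made up for illustration.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Probability that at least two of n_items share a hash value,
    assuming hashes are uniform over 2**hash_bits values."""
    space = 2 ** hash_bits
    # P(no collision) = product over k of (1 - k / space).
    # Summing logs with log1p keeps precision for tiny probabilities.
    log_no_collision = sum(math.log1p(-k / space) for k in range(n_items))
    return -math.expm1(log_no_collision)


p_32bit = birthday_collision_probability(118, 32)
p_64bit = birthday_collision_probability(118, 64)
```

For 118 elements the 32-bit result comes out at about 1.6e-6, in line with the figure quoted above, while the 64-bit result is many orders of magnitude smaller.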

kalvdans · 2021-12-09T05:55:37Z

> In your case, calling hash(self.atomic_number) seems redundant.

You are right. hash() automatically truncates the value returned by __hash__() to the machine word size. I will change that, hang on!

It is perfectly fine for __hash__() to return an integer.
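A small illustration of that point, as a sketch with two hypothetical stand-in classes (neither is mendeleev's Element): __hash__ may return a plain int, and hash() brings oversized results back into the platform's hash width.

```python
import sys


class SmallHash:
    """Hypothetical stand-in that returns a plain int from __hash__."""

    def __hash__(self):
        return 12


class HugeHash:
    """Hypothetical stand-in whose __hash__ result exceeds the word size."""

    def __hash__(self):
        return 2 ** 100


width = sys.hash_info.width  # number of bits used for hash values

small = hash(SmallHash())  # passed through unchanged
huge = hash(HugeHash())    # reduced by hash() to fit the hash width

fits_in_word = -(2 ** (width - 1)) <= huge < 2 ** (width - 1)
```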

lmmentel · 2021-12-10T08:45:22Z

> On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.

I'm not sure that I agree, since Python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

It seems that the solution you are presenting is a weaker equality condition than the one based on hashing, since it uses a single attribute for comparison instead of 69 attributes. I don't think that I would like to rely on a single attribute, since the odds of someone accidentally assigning a new value to Element.atomic_number or any other attribute may be higher than the probability of a hash collision (which is still pretty unlikely). A hash-based implementation would be able to detect that and say the elements are not equal.
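To make the trade-off concrete, here is a sketch with a hypothetical three-field stand-in class (the real Element has many more attributes, and these helper functions are not part of mendeleev). It contrasts comparing one attribute with comparing every attribute directly, without going through hash().

```python
from dataclasses import dataclass


@dataclass(eq=False)
class FakeElement:
    """Hypothetical stand-in for an element record."""
    atomic_number: int
    symbol: str
    atomic_weight: float


def equal_by_number(a: FakeElement, b: FakeElement) -> bool:
    # Single-attribute comparison: ignores every other field.
    return a.atomic_number == b.atomic_number


def equal_by_all_fields(a: FakeElement, b: FakeElement) -> bool:
    # Compares the full attribute dictionaries, no hashing involved.
    return vars(a) == vars(b)


magnesium = FakeElement(12, "Mg", 24.305)
tampered = FakeElement(12, "Mg", 99.0)  # same number, modified weight

same_by_number = equal_by_number(magnesium, tampered)
same_by_all_fields = equal_by_all_fields(magnesium, tampered)
```

The single-attribute check treats the tampered object as equal, while the full comparison notices the difference and cannot produce a false positive the way a hash comparison can.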

kalvdans · 2021-12-10T09:25:50Z

> I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

They use the hash only to rule out equality. To confirm equality they still call the __eq__ method, even if two hashes are the same.
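This can be observed directly. The sketch below uses a hypothetical key class whose instances all share one hash value; a dict still keeps them apart because it falls back to __eq__ whenever hashes match.

```python
class CollidingKey:
    """Hypothetical key type whose instances all share one hash value."""

    eq_calls = 0  # counts how often __eq__ is invoked

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # every instance collides on purpose

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        if not isinstance(other, CollidingKey):
            return NotImplemented
        return self.name == other.name


a = CollidingKey("a")
b = CollidingKey("b")

# Both keys hash to 42, yet the dict stores them as separate entries.
table = {a: "first", b: "second"}

# Lookup with a fresh but equal key succeeds via __eq__.
looked_up = table[CollidingKey("b")]
```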

> someone accidentally assigning

Yes, the documentation for __hash__ is clear that the object needs to be immutable. It is a bit tricky to enforce immutability on homemade Python objects, but that could be a future endeavour. Hashing all attributes of a mutable object will still be buggy if the object is modified while it is already inside a dict.
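A sketch of that failure mode, using a hypothetical mutable stand-in class that hashes all of its attributes (not the actual mendeleev implementation): once a key is modified after insertion, lookups by value stop finding it even though the entry is still stored.

```python
class MutableElement:
    """Hypothetical mutable stand-in that hashes all of its attributes."""

    def __init__(self, atomic_number, name):
        self.atomic_number = atomic_number
        self.name = name

    def _key(self):
        return (self.atomic_number, self.name)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        if not isinstance(other, MutableElement):
            return NotImplemented
        return self._key() == other._key()


mg = MutableElement(12, "Magnesium")
masses = {mg: 24.305}

mg.name = "Custom"  # mutate the key while it sits inside the dict

# The dict remembers the hash computed at insertion time, so an object
# equal to the old state matches the hash but fails __eq__ ...
found_by_old_value = MutableElement(12, "Magnesium") in masses
# ... and an object equal to the new state has a different hash.
found_by_new_value = MutableElement(12, "Custom") in masses

# The entry itself has not gone anywhere.
still_stored = any(key is mg for key in masses)
```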

lmmentel · 2021-12-12T20:59:50Z

I cannot see a strong case to get this merged. Would you like to continue iterating on this?

kalvdans · 2021-12-13T07:16:16Z

I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

lmmentel · 2021-12-19T20:39:45Z

> I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

Thanks for your concern. I cannot currently see enough reason to devote more attention to this issue. If you find a concrete example (as in https://stackoverflow.com/help/minimal-reproducible-example) that illustrates any vulnerabilities with the current implementation I would be happy to jump in and help address that.

kalvdans · 2021-12-20T07:38:03Z

@mattwthompson, can you have a look at this PR?

mattwthompson · 2021-12-20T15:58:11Z

mendeleev/models.py

- return hash(tuple(sorted(hashable)))
+ # Allow Element objects as keys in dictionaries. Same atomic
+ # number means all other properties are equal too.
+ return self.atomic_number


I'd personally be wary of hashing elements based on a single property, as it assumes there is no possible way for two elements to differ if their atomic numbers are the same. Because the Element class is mutable and can be used to generate custom, pseudo-/non-physical elements, it seems much more likely that there would be a collision based on this assumption than from the birthday paradox on 32-bit architectures.

kalvdans · 2021-12-21

I'm happy to remove the __hash__ function until we make the instances immutable. The hashable property wasn't part of #42 after all.

I've made an alternative pull request for that, see #51

mattwthompson · 2021-12-20T15:58:44Z

tests/test_element.py

+ assert element(12) != "sample string"
+
+
+def test_hashable():


👍 to having this tested, whether or not the changes to the source code are accepted

Simplify hash and equality operations

cf7196e

Also broaden the test coverage with set operation and equality to different object.

kalvdans mentioned this pull request Dec 8, 2021

Enable comparing Elements for equality #43

Merged

Remove unnecessary call to hash()

0ee8b15

It is perfectly fine for __hash__() to return an integer.
