Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify hash and equality operations #47

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

kalvdans
Copy link
Collaborator

@kalvdans kalvdans commented Dec 8, 2021

Also broaden the test coverage with set operation and equality to
different object.

Follow-up on #43

Also broaden the test coverage with set operation and equality to
different object.
@kalvdans
Copy link
Collaborator Author

kalvdans commented Dec 8, 2021

The (macos-latest, 3.6) integration test failure looks like an unrelated internet access failure.

@lmmentel
Copy link
Owner

lmmentel commented Dec 8, 2021

Thanks for the help @kalvdans! I checked your version of __eq__ and __hash__ and I wonder what is your motivation for refactoring it that way. In other words, would you mind explaining why you think your solution should be used instead of the current one?

Also please take into account that python's hash returns the value it get if it's of type integer, e.g.

>>> hash(1)
1
>>> hash(100)
100
>>> 

It only computes the actual hash for other types of values:

>>> hash("a")
-2483041677650604276

In your case the calling hash(self.atomic_number) seems redundant.

@kalvdans
Copy link
Collaborator Author

kalvdans commented Dec 9, 2021

A hash algorithm is used in hash tables to allow for constant time average lookup time. It should not be used for anything else.

The old implementation of __eq__ will have a slight risk of collission. Since it hashes strings, it gets the random seed mixed in. On 32-bit platsforms the hash is 32 bit. Let's say the hash is perfectly random. The probability of two elements out of the 118 existing ones having the same hash is then 1.62e-6 according to the birthday paradox. So once in every millionth Python invocation, two different elements will compare equal.

On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.

@kalvdans
Copy link
Collaborator Author

kalvdans commented Dec 9, 2021

In your case the calling hash(self.atomic_number) seems redundant.

You are right. hash() automatically truncates integers to machine precision. I will change that, hang on!

It is perfectly fine for __hash__() to return an integer.
@lmmentel
Copy link
Owner

On 64-bit architectures, the risk is smaller, but it is just fundamentally wrong to use hash for equality.

I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

It seems that the solution you are presenting is a weaker equality condition that the one based on hashing since it uses a single attribute for comparison instead of 69 attributes. I don't think that I would like to rely on a single attribute since the odds of someone accidentally assigning a new value to Element.atomic_number or any other attribute may be higher that probability of a hash collision (which is still pretty unlikely). In the case of hash based implementation it would be able to detect that and say the elements are not equal.

@kalvdans
Copy link
Collaborator Author

kalvdans commented Dec 10, 2021

I'm not sure that I agree since python dict and set operations are based on hashing, so if it's good enough for the standard library it's good enough for our use case.

They use the hash for inequality only, for equality they still call the __eq__ method, even if two hashes are the same.

someone accidentally assigning

Yes the documentation for __hash__ is clear that the object needs to be immutable. It is a bit tricky to enforce immutability on homemade Python objects, but that could be a future endeavour. Hashing all elements of a mutable object will still be buggy in case the object is already inside a dict.

@lmmentel
Copy link
Owner

I cannot see a strong case to get this merged, would like to continue iterating on this?

@kalvdans
Copy link
Collaborator Author

I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

@lmmentel
Copy link
Owner

I care a lot about the quality of the mendeleev package, please tag someone else on the PR to get a third opinion.

Thanks for your concern. I cannot currently see enough reason to devote more attention to this issue. If you find a concrete example (as in https://stackoverflow.com/help/minimal-reproducible-example) that illustrates any vulnerabilities with the current implementation I would be happy to jump in and help address that.

@kalvdans
Copy link
Collaborator Author

@mattwthompson, can you have a look at this PR?

return hash(tuple(sorted(hashable)))
# Allow Element objects as keys in dictionaries. Same atomic
# number means all other properties are equal too.
return self.atomic_number

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally be weary of hashing elements based on a single property, as it assumes there is no possible way for two elements to differ if their atomic numbers are the same. Because the Element class is mutable and can be used to generate custom, pesudo-/non-physical elements, it seems much more likely that there would be a collision based on this assumption than the birthday paradox on 32-bit architectures.

Copy link
Collaborator Author

@kalvdans kalvdans Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to remove the __hash__ function until we make the instances immutable. The hashable property wasn't part of #42 after all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made an alternative pull request for that, see #51

assert element(12) != "sample string"


def test_hashable():

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to having this tested, whether or not the changes to the source code are accepted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants