Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unpredictable md5 sum #14

Merged
merged 4 commits into from
Feb 10, 2021
Merged

Conversation

andriosrobert
Copy link
Contributor

  • removing json serialization will use edn for computing the hash
  • sorting the map will ensure predictable computations
  • should fix map->md5 is not predictive #11

- removing json serialization will use edn for computing the hash
- sorting the map will ensure predictable computations
- should fix #11
@andriosrobert andriosrobert self-assigned this Sep 12, 2020
@andriosrobert andriosrobert added the bug Something isn't working label Sep 12, 2020
@didiercrunch
Copy link

I doubt that will work in the general case.

Computing cryptographic hash code of general JSON object is super hard; it is a job for a cryptographer. The task is hard because you need to make sure of two things

  1. each different JSON object must produce a different hash code. In a none cryptographic implementation, usually {"a": "b"} and {"b": "a"} produces the same output; that would be something bad for cryptographic hash.
  2. each representation of the same JSON object must produce the same outcome. {"a": "b"} and {"a": "b"} must produce the same output. This is why you cannot rely on a "regular" JSON library like clojure.core.json.. they are not made for that.

I believe you have two options. I would vote strongly for the second option.

  1. find a library that does cryptographic hashing of json objects.
  2. hash your data as typed tuple. If you want to hash the map {:foo 900 :bar "sss" :toto 34.33} you should hash tuple [900, "sss", 34.33] knowing the types are [Long, utf-8 String, Double]. Your hashing library supports all these types.

@andriosrobert
Copy link
Contributor Author

andriosrobert commented Sep 14, 2020

Thanks for the detailed explanation!

Not sure if you checked it, but I completely removed the JSON serialization from the process. I'm sticking to the EDN format using a data structure that ensures the keys will be always ordered.

@didiercrunch
Copy link

I have looked into your change but it will not change the underlying issue. You use serialization as canonical representation of data structure; that will result into bugs in the general case.

The bellow example still shows the problem..

(let [m {:foo 300 :toto 600}
      mm {:foo m}
      t (assoc (assoc {} :toto 600) :foo 300)
      tt {:foo t}]
  (println mm (map->md5 mm)) ;; {:foo {:foo 300, :toto 600}} d8a50edd5aeb65be51deb87eefc48f93

  (println tt (map->md5 tt))) ;; {:foo {:foo 300, :toto 600}} 52a197cb51d65d93a047df1b7fa994aa

Actually, the bellow code worked with the json implementation but not the prn-str.

user=> (map->md5 {:foo '(1)})
"d53b5f3fd7e4d2d7d6ce5795f308ef7e"
user=> (map->md5 {:foo [1]})
"4dafdc7833269033d683da9aeff61ed9"

I still believe you should only hash things data structure you know exactly the format.

@lnmunhoz
Copy link

lnmunhoz commented Oct 8, 2020

Any updates on this? @andriosr @Jumanjii

@andriosrobert
Copy link
Contributor Author

Hey @didiercrunch - taking the idea of the tuple and trying to make it easier for others to make this check, I decided to merge the fields in a specific order using a String.

Any thoughts?

@andriosrobert
Copy link
Contributor Author

Merging this as the test cases are covered and we are using a single primitive type (String), which removes the problem of organizing a data structure in different ways when hashing

@andriosrobert andriosrobert merged commit f968a88 into master Feb 10, 2021
@andriosrobert andriosrobert deleted the fix/unpredictable-md5-sum branch February 10, 2021 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

map->md5 is not predictive
3 participants