Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename jaro_distance to jaro_similarity #122

Closed
stephantul opened this issue Jul 17, 2019 · 4 comments
Closed

rename jaro_distance to jaro_similarity #122

stephantul opened this issue Jul 17, 2019 · 4 comments

Comments

@stephantul
Copy link

The jaro_distance function returns a similarity (1 for exact matches) instead of a distance (0 for exact matches), which is the opposite of what the name implies. Would it be possible to rename the function to read jaro_similarity instead to avoid confusion?

@Niecke
Copy link

Niecke commented Aug 17, 2019

To be honest, I was confused by this a few weeks ago. So maybe renaming it would be helpful.

EDIT:
Ok jaro_distance is correct since 0 means absolute dissimilar and 1 absolute similar. For the jaro similarity it should be vice versa.

@stephantul
Copy link
Author

stephantul commented Aug 20, 2019

@Niecke I don't think that's true. Here's a simple test that shows that the logic is inverted.

import numpy as np
from jellyfish import levenshtein_distance, jaro_distance

a = "aabb"
b = "abab"

# We expect the output to be 1, since a has a smaller distance to itself than to something else
levenshtein = np.argmin([levenshtein_distance(a, b), levenshtein_distance(a, a)])
# Jaro gives the opposite result.
jaro = np.argmin([jaro_distance(a, b), jaro_distance(a, a)])
# But inverting it gives us the correct answer
jaro_inv = np.argmin([1 - jaro_distance(a, b), 1 - jaro_distance(a, a)])

This shows that the jaro_distance function is, in fact, doing the opposite of what it says it is doing, and is calculating a similarity. Effectively, switching levenshtein_distance to jaro_distance requires you to switch around all logic statements (i.e., every min becomes max and vice versa).

@jamesturk
Copy link
Owner

looking at ways to do this without breaking everyone, it'll probably require a major version upgrade to be final, but I could add the renamed function in the interim

@jamesturk
Copy link
Owner

as of 0.8.2 this is now the correct name, the old names still function but raise a warning

I will fix the _distance variants in 1.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants