Lin's similarity measure might be implemented incorrectly in GOATOOLS - outputs negative values #120

Closed
tanghaibao opened this issue Feb 13, 2019 · 4 comments

Comments

@tanghaibao
Owner

Received the following issue by direct email:


Dear Haibao,

Thank you for GOATOOLS --> very nice tool!

I looked through your example of how to calculate semantic similarity using GOATOOLS:
https://github.com/tanghaibao/goatools/blob/master/notebooks/semantic_similarity.ipynb

And it looks as if you have implemented Lin's similarity incorrectly, as you get negative values as output.

Kind regards,

Alexander Grønning

@dvklopfenstein
Collaborator

I will write a test for this.

But I do not have time to look into it further, as other important issues need my attention:

  1. Fully implementing evidence codes
  2. Supporting reading associations while using relationships.

@alex-wave, can you add to the conversation regarding Lin's similarity score?

@dvklopfenstein
Collaborator

dvklopfenstein commented Feb 14, 2019

The test which shows a negative Lin score is https://github.com/tanghaibao/goatools/blob/master/tests/semantic_i88.py

This plot shows the two user-specified GO terms (green) and their deepest common ancestor (DCA), biological process (blue):

The numbers next to the "i" in the GO Term boxes show the information content:

  • i3.30 GO:0008150 biological_process
  • i5.48 GO:0032501 multicellular organismal process
  • i7.56 GO:0048364 root development

The information content values appear to make sense: the least annotated term, root development (7.56), has a higher score than the middle one, multicellular organismal process (5.48), and both have higher scores than their DCA, biological process (3.30).
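As a rough illustration of why less-annotated terms score higher: information content is commonly defined as the negative log of a term's annotation frequency. The sketch below assumes the natural log and uses made-up annotation counts. The function name, the term labels, and the counts are all hypothetical and are not taken from GOATOOLS or from the test above.

```python
import math

def information_content(term_count, total_count):
    """IC = -log(p), where p is the fraction of annotations at or below the term."""
    if term_count <= 0 or total_count <= 0:
        raise ValueError("counts must be positive")
    return -math.log(term_count / total_count)

# Hypothetical annotation counts (NOT from the GOATOOLS test data)
total = 10_000
counts = {"broad_term": 4_000, "middle_term": 400, "specific_term": 40}

# Fewer annotations -> smaller frequency -> larger information content
ics = {name: information_content(n, total) for name, n in counts.items()}
for name, ic in ics.items():
    print(f"{name}: {ic:.2f}")
```

Under this definition the values are never negative, because a frequency can never exceed 1.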

Resnik's similarity score is defined as the information content of the DCA, which is 3.30 in our case.

Lin's score is coded as -1 * 2 * Resnik/(info_score(GO:0032501) + info_score(GO:0048364)), which evaluates to -0.505 in this example.

That Lin's score is a fraction makes sense. I am just not sure why there is a '-1' in the equation if the information content values are always positive. Perhaps this is the issue?
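The arithmetic can be checked by hand with the rounded information content values quoted in this comment. This is only a sketch of the expression as described here, not the GOATOOLS source, and the variable names are illustrative. Because the inputs are rounded to two decimals, the result may differ slightly in the third decimal from the -0.505 reported by the test.

```python
# Rounded information content values quoted in this comment
ic = {
    "GO:0008150": 3.30,  # biological_process (the DCA)
    "GO:0032501": 5.48,  # multicellular organismal process
    "GO:0048364": 7.56,  # root development
}

resnik = ic["GO:0008150"]  # Resnik similarity = IC of the DCA
denominator = ic["GO:0032501"] + ic["GO:0048364"]

lin_as_coded = -1 * 2 * resnik / denominator  # expression as described above
lin_without_sign = 2 * resnik / denominator   # same expression with the -1 dropped

print(f"as coded:     {lin_as_coded:.3f}")
print(f"without '-1': {lin_without_sign:.3f}")
```

With every input positive, the only way the result can be negative is the leading -1.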

@alex-wave
Contributor

@dvklopfenstein there does appear to be a problem with the Lin similarity scores from GOATOOLS. The "-1.0" should not be there - Lin's similarity measure is defined as 2 * Resnik(t1, t2) / (IC(t1) + IC(t2)) (https://www.sciencedirect.com/science/article/pii/S0169023X06000875)
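Written out, using the thread's terminology (IC for information content, DCA for deepest common ancestor), the definition and its expected range look as follows. The bound assumes that a common ancestor's IC never exceeds that of either descendant, which holds when annotation counts are propagated up to ancestors; that assumption is stated here for illustration and is not something verified against GOATOOLS.

```latex
\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2)
  = \frac{2 \cdot \mathrm{sim}_{\mathrm{Resnik}}(t_1, t_2)}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)}
  = \frac{2 \cdot \mathrm{IC}\bigl(\mathrm{DCA}(t_1, t_2)\bigr)}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)},
\qquad
0 \le \mathrm{IC}\bigl(\mathrm{DCA}(t_1, t_2)\bigr) \le \min\bigl(\mathrm{IC}(t_1), \mathrm{IC}(t_2)\bigr)
\;\Rightarrow\;
0 \le \mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) \le 1 .
```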

I've created a pull request to fix this #121

@dvklopfenstein
Collaborator

Thank you very much, Dr. Warwick Vesztrocy, for taking the time to look at this and open a pull request.

I have merged your pull request and will close this issue. Thank you, Dr. Grønning, for bringing this to our attention.
