Lin's similarity measure might be implemented incorrectly in GOATOOLS - outputs negative values #120

tanghaibao · 2019-02-13T06:33:52Z

Received the following issue by direct email:

Dear Haibao,

Thank you for GOATOOLS --> very nice tool!

I looked through your example of how to calculate semantic similarity using GOATOOLS:
https://github.com/tanghaibao/goatools/blob/master/notebooks/semantic_similarity.ipynb

And it looks as if you have implemented Lin's similarity incorrectly, as you get negative values as output.

Kind regards,

Alexander Grønning

dvklopfenstein · 2019-02-14T02:58:34Z

I will write a test for this.

But I do not have time to look into it further, as other important issues need my attention:

Fully implementing evidence codes
Supporting reading associations while using relationships

@alex-wave, Can you add to the conversation regarding Lin's similarity score?

dvklopfenstein · 2019-02-14T17:36:56Z

The test which shows a negative Lin score is https://github.com/tanghaibao/goatools/blob/master/tests/semantic_i88.py

This plot shows the two user-specified GO terms (green) and their deepest common ancestor (DCA), biological process (blue):

The numbers next to the "i" in the GO Term boxes show the information content:

i3.30 GO:0008150 biological_process
i5.48 GO:0032501 multicellular organismal process
i7.56 GO:0048364 root development

The information content values appear to make sense: the least annotated term, root development (7.56), has a higher score than the intermediate term, multicellular organismal process (5.48), and both have higher information content than their DCA, biological process (3.30).

Resnik's similarity score is defined as the information content of the DCA, which is 3.30 in our case.

Lin's score is coded as -1 * 2 * Resnik/(info_score(GO:0032501) + info_score(GO:0048364)), which evaluates to -0.505 in this example.

That Lin's score is a fraction makes sense. I am just not sure why there is a '-1' in the equation if the information content values are always positive. Perhaps this is the issue?

alex-wave · 2019-02-21T11:40:39Z

@dvklopfenstein there does appear to be a problem with the Lin similarity scores from GOATOOLS. The "-1.0" should not be there - Lin's similarity measure is defined as 2 * Resnik(t1, t2) / (IC(t1) + IC(t2)) (https://www.sciencedirect.com/science/article/pii/S0169023X06000875)
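To make the arithmetic concrete, here is a minimal sketch of the formula above applied to the rounded information content values quoted earlier in this thread. The function name `lin_sim` and its arguments are hypothetical, chosen for illustration; this is not the GOATOOLS code. Returning `None` when both terms have zero information content is also just an assumption about how to handle that edge case.

```python
def lin_sim(ic_t1, ic_t2, ic_ancestor):
    """Lin similarity: 2 * Resnik(t1, t2) / (IC(t1) + IC(t2)).

    ic_ancestor is the Resnik score, i.e. the information content
    of the common ancestor used for the comparison.
    Hypothetical helper for illustration, not the GOATOOLS API.
    """
    denom = ic_t1 + ic_t2
    if denom == 0:
        # Assumption: treat the undefined 0/0 case as "no score".
        return None
    return 2.0 * ic_ancestor / denom


# Rounded values quoted in this thread.
ic_dca = 3.30  # GO:0008150 biological_process (the DCA)
ic_t1 = 5.48   # GO:0032501 multicellular organismal process
ic_t2 = 7.56   # GO:0048364 root development

score = lin_sim(ic_t1, ic_t2, ic_dca)
print(f"Lin without the -1 factor: {score:.3f}")
print(f"Lin with the -1 factor:    {-1 * score:.3f}")
```

With these rounded inputs the magnitude comes out near 0.506; the -0.505 reported above presumably comes from unrounded information content values. Without the "-1" factor, the score is non-negative whenever the information content values are non-negative.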

I've created a pull request to fix this #121

dvklopfenstein · 2019-02-21T19:24:46Z

Thank you very much, Dr. Warwick Vesztrocy, for taking the time to look at this and submit a pull request.

I have merged your pull request and will close this issue. Thank you, Dr. Grønning, for bringing this to our attention.

dvklopfenstein added a commit that referenced this issue Feb 14, 2019

Added more info for Lin's similarity score #120

8269c2c

dvklopfenstein closed this as completed Feb 21, 2019
