Fix issue 2396 #3089

alvaroabascar · 2018-12-21T23:20:03Z

Description

This PR fixes issue #2396: Doc.get_lca_matrix had a bug in this line which caused wrong outputs
for some pairs of tokens.

Types of change

Bug fix

Explanation of the bug

I borrow the following example from @slavi (#2396):

s = "He created a test for spacy"
doc = nlp(s)
doc.get_lca_matrix()[3][5]

Expected output: 3
Actual output: 1

The above mentioned line is:

...
            else:
                lca_index = __pairwise_lca(token_j.head, token_k.head, lca_matrix)
...

Here, for tokens 3 ("test") and 5 ("spacy"), we are calling __pairwise_lca recursively with their heads ("created" and "for" respectively). But in doing so, we are skiping "test", which is the actual lowest common ancestor.

Solution

The reimplementation consists on this simple algorihtm. It works as expected and is 3 times faster.

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

honnibal · 2018-12-27T13:57:39Z

Really nice first PR, thanks!

I noticed some ugliness with the original code as well --- unfortunately the implementation appears to be duplicated in the spacy/tokens/span.pyx module. It would probably be good to move the code in spacy/tokens/doc.pyx into functions so that span.pyx can call into it. If that ends up being tricky, we can also just punt on that and just apply the same fix to the span.pyx code, and let the implementations continue in parallel for now.

alvaroabascar · 2018-12-27T23:48:31Z

True! I hadn't seen the duplicated implementation inside span.pyx. I can either:

a) Separate the function and call it from both Span.get_lca_matrix and Doc.get_lca_matrix.
b) Implement the code in Span.get_lca_matrix, and from Doc.get_lca_matrix just call self[:].get_lca_matrix().

b) seems cleaner to me. But in case you'd prefer a), where would you place this auxiliary function?

honnibal · 2018-12-28T15:15:04Z

Delegating to a span of the whole document is a clever solution, but I think I probably prefer a helper function in doc.pyx. I probably wouldn't expect to find the implementation within span.pyx, since the Span is a view of the doc, so normally the delegation goes the other way.

Moving this out into a helper function also lets us fix that closure, which I always found a bit ugly about the previous implementation. So, we want the function to be something like:


def _get_lca_matrix(Doc doc, int start, int end):

We should keep it as a private function so that we don't have to worry about validating the start and end offsets within it. From the span, if we call _get_lca_matrix(self.doc, self.start, self.end) we know the start and end are valid.

…om Doc.get_lca_matrix and Span.get_lca_matrix

alvaroabascar · 2018-12-28T19:13:43Z

Done! I first tried to just define a helper _get_lca_matrix inside doc.pyx, but could not import it from span.pyx. It finally worked when I declared it in doc.pxd. I'm not experienced in Cython so I hope the approach is right.

I also removed that closure within _get_lca_matrix, and I defined it as another helper _get_tokens_lca, which made me wonder whether that helper should go in doc.pyx or token.pyx.

Finally, I guess that both helpers could be further "cythonized" resorting to TokenC, but this would mean renouncing to token.ancestors, which makes everything pretty clean.

Let me know what you think :)

honnibal · 2018-12-29T14:06:02Z

Looks great, thanks! I think keeping that helper in the doc.pyx is fine. It's nice to have the helper close to where it's used. I think you're right that keeping it as Python so we can use the ancestors property is better.

honnibal · 2018-12-29T14:07:39Z

Hm, the span test is failing, though. Is the test wrong or the code?

…ay instead of memoryview

alvaroabascar · 2018-12-29T15:36:23Z

It was a mistake in the code, but it should be fine now.

honnibal · 2018-12-29T17:02:31Z

Sweet, merging.

* Test on #2396: bug in Doc.get_lca_matrix() * reimplementation of Doc.get_lca_matrix(), (closes #2396) * reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix() * tests Span.get_lca_matrix() as well as Doc.get_lca_matrix() * implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix * use memory view instead of np.ndarray in _get_lca_matrix (faster) * fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview * cleaner conditional, add comment

This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](#3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

alvaroabascar added 2 commits December 22, 2018 00:08

Test on explosion#2396: bug in Doc.get_lca_matrix()

953944b

reimplementation of Doc.get_lca_matrix(), (closes explosion#2396)

7ab0684

alvaroabascar mentioned this pull request Dec 22, 2018

get_lca_matrix is incorrect #2396

Closed

ines added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Dec 27, 2018

alvaroabascar added 2 commits December 28, 2018 06:35

reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

b7a52e9

tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

8fafe6b

implement _get_lca_matrix as a helper function in doc.pyx; call it fr…

a5429e7

…om Doc.get_lca_matrix and Span.get_lca_matrix

use memory view instead of np.ndarray in _get_lca_matrix (faster)

3ccdc58

fix bug when calling Span.get_lca_matrix; return lca matrix as np.arr…

8bef5d4

…ay instead of memoryview

cleaner conditional, add comment

94bd099

honnibal merged commit 6fe276f into explosion:master Dec 29, 2018

alvaroabascar deleted the issue_2396 branch December 29, 2018 22:16

alvaroabascar restored the issue_2396 branch January 1, 2019 15:58

This was referenced Jan 1, 2019

Fix bug in get_lca_matrix, improve test #3109

Closed

Bugfix/get lca matrix #3110

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue 2396 #3089

Fix issue 2396 #3089

alvaroabascar commented Dec 21, 2018

honnibal commented Dec 27, 2018

alvaroabascar commented Dec 27, 2018

honnibal commented Dec 28, 2018

alvaroabascar commented Dec 28, 2018 •

edited

Loading

honnibal commented Dec 29, 2018

honnibal commented Dec 29, 2018

alvaroabascar commented Dec 29, 2018

honnibal commented Dec 29, 2018

Fix issue 2396 #3089

Fix issue 2396 #3089

Conversation

alvaroabascar commented Dec 21, 2018

Description

Types of change

Explanation of the bug

Solution

Checklist

honnibal commented Dec 27, 2018

alvaroabascar commented Dec 27, 2018

honnibal commented Dec 28, 2018

alvaroabascar commented Dec 28, 2018 • edited Loading

honnibal commented Dec 29, 2018

honnibal commented Dec 29, 2018

alvaroabascar commented Dec 29, 2018

honnibal commented Dec 29, 2018

alvaroabascar commented Dec 28, 2018 •

edited

Loading