Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 2396 #3089

Merged
merged 8 commits into from
Dec 29, 2018
Merged

Fix issue 2396 #3089

merged 8 commits into from
Dec 29, 2018

Conversation

alvaroabascar
Copy link
Contributor

Description

This PR fixes issue #2396: Doc.get_lca_matrix had a bug in this line which caused wrong outputs
for some pairs of tokens.

Types of change

Bug fix

Explanation of the bug

I borrow the following example from @slavi (#2396):

He created a test for spacy

s = "He created a test for spacy"
doc = nlp(s)
doc.get_lca_matrix()[3][5]
  • Expected output: 3
  • Actual output: 1

The above mentioned line is:

...
            else:
                lca_index = __pairwise_lca(token_j.head, token_k.head, lca_matrix)
...

Here, for tokens 3 ("test") and 5 ("spacy"), we are calling __pairwise_lca recursively with their heads ("created" and "for" respectively). But in doing so, we are skiping "test", which is the actual lowest common ancestor.

Solution

The reimplementation consists on this simple algorihtm. It works as expected and is 3 times faster.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects labels Dec 27, 2018
@honnibal
Copy link
Member

Really nice first PR, thanks!

I noticed some ugliness with the original code as well --- unfortunately the implementation appears to be duplicated in the spacy/tokens/span.pyx module. It would probably be good to move the code in spacy/tokens/doc.pyx into functions so that span.pyx can call into it. If that ends up being tricky, we can also just punt on that and just apply the same fix to the span.pyx code, and let the implementations continue in parallel for now.

@alvaroabascar
Copy link
Contributor Author

True! I hadn't seen the duplicated implementation inside span.pyx. I can either:

a) Separate the function and call it from both Span.get_lca_matrix and Doc.get_lca_matrix.
b) Implement the code in Span.get_lca_matrix, and from Doc.get_lca_matrix just call self[:].get_lca_matrix().

b) seems cleaner to me. But in case you'd prefer a), where would you place this auxiliary function?

@honnibal
Copy link
Member

Delegating to a span of the whole document is a clever solution, but I think I probably prefer a helper function in doc.pyx. I probably wouldn't expect to find the implementation within span.pyx, since the Span is a view of the doc, so normally the delegation goes the other way.

Moving this out into a helper function also lets us fix that closure, which I always found a bit ugly about the previous implementation. So, we want the function to be something like:


def _get_lca_matrix(Doc doc, int start, int end):

We should keep it as a private function so that we don't have to worry about validating the start and end offsets within it. From the span, if we call _get_lca_matrix(self.doc, self.start, self.end) we know the start and end are valid.

…om Doc.get_lca_matrix and Span.get_lca_matrix
@alvaroabascar
Copy link
Contributor Author

alvaroabascar commented Dec 28, 2018

Done! I first tried to just define a helper _get_lca_matrix inside doc.pyx, but could not import it from span.pyx. It finally worked when I declared it in doc.pxd. I'm not experienced in Cython so I hope the approach is right.

I also removed that closure within _get_lca_matrix, and I defined it as another helper _get_tokens_lca, which made me wonder whether that helper should go in doc.pyx or token.pyx.

Finally, I guess that both helpers could be further "cythonized" resorting to TokenC, but this would mean renouncing to token.ancestors, which makes everything pretty clean.

Let me know what you think :)

@honnibal
Copy link
Member

Looks great, thanks! I think keeping that helper in the doc.pyx is fine. It's nice to have the helper close to where it's used. I think you're right that keeping it as Python so we can use the ancestors property is better.

@honnibal
Copy link
Member

Hm, the span test is failing, though. Is the test wrong or the code?

@alvaroabascar
Copy link
Contributor Author

It was a mistake in the code, but it should be fine now.

@honnibal honnibal merged commit 6fe276f into explosion:master Dec 29, 2018
@honnibal
Copy link
Member

Sweet, merging.

honnibal pushed a commit that referenced this pull request Dec 29, 2018
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
@alvaroabascar alvaroabascar deleted the issue_2396 branch December 29, 2018 22:16
@alvaroabascar alvaroabascar restored the issue_2396 branch January 1, 2019 15:58
ines pushed a commit that referenced this pull request Jan 6, 2019
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](#3089) (sorry!).

## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.

### Types of change
Bug fix

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants