-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi-document support for span arrays #170
Conversation
Initial pass through Span class Fix typo in type hint
…r-pandas into branch-multidoc
@BryanCutler can you have a look at this design for multi-doc support and see if you notice any gotchas on either Arrow integration or passing the full Pandas ExtensionArray suite? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick look and seems like a good approach to me. It shouldn't be a problem to update the arrow conversion to support it. I'll do a more thorough review soon.
Merged with changes from master, found a bunch of regressions, and fixed them. |
@BryanCutler how would you recommend we encode |
I was thinking the best thing to do would be a DictionaryArray, but I'm not clear on how that would work with Pandas or if it's even implemented, so I'll have to try it out. If not, we could do something different, but might not be a clean as a dictionary. |
@frreiss I did a preliminary test of using a nested DictionaryArray through saving/loading a feather file and it works, so we should definitely go that route. It also might work that we could use the same DictionaryArray across different columns and only write to the file once, to avoid duplicating the document text. I'll have test more for that, but in the meantime we should proceed. Do you want to finish up with this PR and then I can follow with the Arrow changes? |
Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the |
It might be a good idea, that would make it so there isn't a need to copy documents when doing IO. Basically you would just need to change |
Updated token-based spans to support multiple documents. This change required refactoring of the previous changes to character-based spans. In the process, I changed the API for creating span arrays from a constructor call to a factory method. That is, Updated Feather support is still not implemented. I've disabled the tests of Feather for now. All the other regression tests are now working. Still need to update the notebooks to reflect the constructor changes. |
Sounds good @frreiss , as soon as this is merged I'll work on the arrow changes and enable the tests again. |
Updated the notebooks and fixed a few additional bugs. |
@BryanCutler I think these changes are about ready to merge, with the exception of revamping the Arrow/Feather support. Can you review these changes, please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frreiss I have a few questions on some of the high-level designs first before I do a more detailed review.
Pushed changes that simplify the way |
Pushed some additional changes to HTML rendering so that spans will render correctly in JupyterLab dark mode and in VSCode notebooks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple questions - I think I answered some myself as I went along :)
The code changes look good to me, I didn't look at the notebooks but I'm assuming they are updating the apis, so feel free to merge.
Thanks for the careful review, @BryanCutler ! I'm going to merge these changes into master now. |
Work in progress, do not merge yet.
This PR contains the first step towards implementing #73. There's quite a bit of additional work to be done, but I'm sharing this WIP PR to get some feedback on the current design.
Here's what has been implemented so far:
StringTable
, for efficiently managing the connection between document texts and spans.SpanArray
class to support mixing spans from multiple documents.test_span.py
pass.Major things still remaining:
TokenSpanArray
class to also support multiple documentsspanner
package to support multiple documents per span array