Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-document support for span arrays #170

Merged
merged 22 commits into from
Mar 3, 2021

Conversation

frreiss
Copy link
Member

@frreiss frreiss commented Jan 26, 2021

Work in progress, do not merge yet.

This PR contains the first step towards implementing #73. There's quite a bit of additional work to be done, but I'm sharing this WIP PR to get some feedback on the current design.

Here's what has been implemented so far:

  • Added a new class, StringTable, for efficiently managing the connection between document texts and spans.
  • Modified the SpanArray class to support mixing spans from multiple documents.
  • The existing unit tests in test_span.py pass.

Major things still remaining:

  • Modify the TokenSpanArray class to also support multiple documents
  • Update the Arrow integration to support multiple documents per array
  • Update the algorithms under the spanner package to support multiple documents per span array
  • Get all existing regression tests to pass
  • Add new regression tests for multi-document support
  • Update the example notebooks as needed

@frreiss frreiss marked this pull request as draft January 26, 2021 23:27
@frreiss
Copy link
Member Author

frreiss commented Jan 26, 2021

@BryanCutler can you have a look at this design for multi-doc support and see if you notice any gotchas on either Arrow integration or passing the full Pandas ExtensionArray suite?

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick look and seems like a good approach to me. It shouldn't be a problem to update the arrow conversion to support it. I'll do a more thorough review soon.

@frreiss
Copy link
Member Author

frreiss commented Feb 3, 2021

Merged with changes from master, found a bunch of regressions, and fixed them.

@frreiss
Copy link
Member Author

frreiss commented Feb 4, 2021

@BryanCutler how would you recommend we encode SpanArrays with multiple target texts in Arrow? Should we directly serialize the string table and an array of integer offsets into the table? Should we use an Arrow Dictionary array? Something else?

@BryanCutler
Copy link
Member

I was thinking the best thing to do would be a DictionaryArray, but I'm not clear on how that would work with Pandas or if it's even implemented, so I'll have to try it out. If not, we could do something different, but might not be a clean as a dictionary.

@BryanCutler
Copy link
Member

@frreiss I did a preliminary test of using a nested DictionaryArray through saving/loading a feather file and it works, so we should definitely go that route. It also might work that we could use the same DictionaryArray across different columns and only write to the file once, to avoid duplicating the document text. I'll have test more for that, but in the meantime we should proceed. Do you want to finish up with this PR and then I can follow with the Arrow changes?

@frreiss
Copy link
Member Author

frreiss commented Feb 12, 2021

Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?

@BryanCutler
Copy link
Member

Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?

It might be a good idea, that would make it so there isn't a need to copy documents when doing IO. Basically you would just need to change _id_to_str to a pa.Array of type string. Also, I think a numpy array of strings would work fine too, if you don't want to add an arrow dependency there.

@frreiss
Copy link
Member Author

frreiss commented Feb 22, 2021

Updated token-based spans to support multiple documents. This change required refactoring of the previous changes to character-based spans.

In the process, I changed the API for creating span arrays from a constructor call to a factory method. That is, SpanArray.create() instead of SpanArray().

Updated Feather support is still not implemented. I've disabled the tests of Feather for now.

All the other regression tests are now working.

Still need to update the notebooks to reflect the constructor changes.

@BryanCutler
Copy link
Member

Sounds good @frreiss , as soon as this is merged I'll work on the arrow changes and enable the tests again.

@frreiss
Copy link
Member Author

frreiss commented Feb 23, 2021

Updated the notebooks and fixed a few additional bugs.

@frreiss
Copy link
Member Author

frreiss commented Feb 23, 2021

@BryanCutler I think these changes are about ready to merge, with the exception of revamping the Arrow/Feather support. Can you review these changes, please?

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frreiss I have a few questions on some of the high-level designs first before I do a more detailed review.

text_extensions_for_pandas/array/span.py Show resolved Hide resolved
text_extensions_for_pandas/array/span.py Outdated Show resolved Hide resolved
text_extensions_for_pandas/array/token_span.py Outdated Show resolved Hide resolved
@frreiss
Copy link
Member Author

frreiss commented Feb 27, 2021

Pushed changes that simplify the way TokenSpanArray manages multiple tokenizations and restore the conventional constructor APIs. @BryanCutler can you please have a look when you get a chance?

@frreiss frreiss changed the title [WIP] Add multi-document support for span arrays Add multi-document support for span arrays Mar 2, 2021
@frreiss
Copy link
Member Author

frreiss commented Mar 2, 2021

Pushed some additional changes to HTML rendering so that spans will render correctly in JupyterLab dark mode and in VSCode notebooks.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple questions - I think I answered some myself as I went along :)
The code changes look good to me, I didn't look at the notebooks but I'm assuming they are updating the apis, so feel free to merge.

text_extensions_for_pandas/array/span.py Outdated Show resolved Hide resolved
text_extensions_for_pandas/array/span.py Outdated Show resolved Hide resolved
text_extensions_for_pandas/array/span.py Outdated Show resolved Hide resolved
text_extensions_for_pandas/array/span.py Outdated Show resolved Hide resolved
text_extensions_for_pandas/array/span.py Show resolved Hide resolved
@frreiss frreiss marked this pull request as ready for review March 3, 2021 18:23
@frreiss
Copy link
Member Author

frreiss commented Mar 3, 2021

Thanks for the careful review, @BryanCutler ! I'm going to merge these changes into master now.

@frreiss frreiss merged commit 5e3a11e into CODAIT:master Mar 3, 2021
@frreiss frreiss deleted the branch-multidoc branch October 29, 2021 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants