Add multi-document support for span arrays #170

frreiss · 2021-01-26T23:27:15Z

Work in progress, do not merge yet.

This PR contains the first step towards implementing #73. There's quite a bit of additional work to be done, but I'm sharing this WIP PR to get some feedback on the current design.

Here's what has been implemented so far:

Added a new class, StringTable, for efficiently managing the connection between document texts and spans.
Modified the SpanArray class to support mixing spans from multiple documents.
The existing unit tests in test_span.py pass.
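
For readers unfamiliar with the idea, here is a minimal sketch of what a string table does: it stores each document text once and hands out a small integer ID that spans can carry in place of the full string. This is not the code in this PR. The class name `StringTableSketch` and the `intern`/`lookup` methods are hypothetical, and only the `_id_to_str` attribute name is taken from the discussion in this thread.

```python
from typing import Dict, List


class StringTableSketch:
    """Hypothetical, simplified stand-in for a string table (not the PR's class)."""

    def __init__(self) -> None:
        self._str_to_id: Dict[str, int] = {}  # assumed name, for illustration
        self._id_to_str: List[str] = []

    def intern(self, text: str) -> int:
        """Return the integer ID for `text`, registering it on first use."""
        existing = self._str_to_id.get(text)
        if existing is not None:
            return existing
        new_id = len(self._id_to_str)
        self._str_to_id[text] = new_id
        self._id_to_str.append(text)
        return new_id

    def lookup(self, text_id: int) -> str:
        """Return the document text registered under `text_id`."""
        return self._id_to_str[text_id]


table = StringTableSketch()
docs = ["Hello world", "Goodbye moon"]

# Each span is (begin, end, text_id); the document text is stored only once.
spans = [
    (0, 5, table.intern(docs[0])),
    (6, 11, table.intern(docs[0])),
    (0, 7, table.intern(docs[1])),
]

# Recover the covered text of every span through the table.
covered = [table.lookup(text_id)[begin:end] for begin, end, text_id in spans]
```

With this layout, a span array that mixes documents only needs one extra integer per span, regardless of how long the documents are.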

Major things still remaining:

Modify the TokenSpanArray class to also support multiple documents
Update the Arrow integration to support multiple documents per array
Update the algorithms under the spanner package to support multiple documents per span array
Get all existing regression tests to pass
Add new regression tests for multi-document support
Update the example notebooks as needed

frreiss · 2021-01-26T23:42:59Z

@BryanCutler can you have a look at this design for multi-doc support and see if you notice any gotchas on either Arrow integration or passing the full Pandas ExtensionArray suite?

BryanCutler

I took a quick look and it seems like a good approach to me. It shouldn't be a problem to update the Arrow conversion to support it. I'll do a more thorough review soon.

frreiss · 2021-02-03T23:58:15Z

Merged with changes from master, found a bunch of regressions, and fixed them.

frreiss · 2021-02-04T00:00:28Z

@BryanCutler how would you recommend we encode SpanArrays with multiple target texts in Arrow? Should we directly serialize the string table and an array of integer offsets into the table? Should we use an Arrow Dictionary array? Something else?
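
To make the options concrete: both "string table plus integer offsets" and a dictionary-encoded array come down to the same logical layout, namely one list of distinct target texts plus one integer index per span. The sketch below shows that layout in plain Python only. The function names are hypothetical and this is not pyarrow's API; an Arrow dictionary array pairs an indices array with a dictionary array in the same way.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split `values` into (distinct values in first-seen order, one index per row)."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary: Sequence[str], indices: Sequence[int]) -> List[str]:
    """Rebuild the per-row values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Target text of each span in a hypothetical four-span, two-document array.
target_texts = ["doc A text", "doc A text", "doc B text", "doc A text"]
dictionary, indices = dictionary_encode(target_texts)
round_trip = dictionary_decode(dictionary, indices)
```

Here `dictionary` holds two strings and `indices` holds four small integers, so each document text would be written once no matter how many spans point into it.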

BryanCutler · 2021-02-05T20:21:43Z

I was thinking the best thing to do would be a DictionaryArray, but I'm not clear on how that would work with Pandas or if it's even implemented, so I'll have to try it out. If not, we could do something different, but it might not be as clean as a dictionary.

BryanCutler · 2021-02-10T21:35:20Z

@frreiss I did a preliminary test of using a nested DictionaryArray through saving/loading a Feather file and it works, so we should definitely go that route. We might also be able to use the same DictionaryArray across different columns and only write it to the file once, to avoid duplicating the document text. I'll have to test more for that, but in the meantime we should proceed. Do you want to finish up with this PR and then I can follow with the Arrow changes?

frreiss · 2021-02-12T20:07:36Z

Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?

BryanCutler · 2021-02-12T20:59:31Z

> Makes sense, @BryanCutler. If we'll be using Arrow's native dictionary coding for serialization, do you think it would make sense to replace the guts of the StringTable class with Arrow's dictionary data structure?

It might be a good idea; that way there would be no need to copy documents when doing IO. Basically, you would just need to change _id_to_str to a pa.Array of type string. I think a numpy array of strings would work fine too, if you don't want to add an Arrow dependency there.

frreiss · 2021-02-22T23:25:14Z

Updated token-based spans to support multiple documents. This change required refactoring of the previous changes to character-based spans.

In the process, I changed the API for creating span arrays from a constructor call to a factory method. That is, SpanArray.create() instead of SpanArray().

Updated Feather support is still not implemented. I've disabled the tests of Feather for now.

All the other regression tests are now working.

Still need to update the notebooks to reflect the constructor changes.

BryanCutler · 2021-02-23T01:21:24Z

Sounds good, @frreiss. As soon as this is merged I'll work on the Arrow changes and enable the tests again.

frreiss · 2021-02-23T03:28:05Z

Updated the notebooks and fixed a few additional bugs.

frreiss · 2021-02-23T03:32:29Z

@BryanCutler I think these changes are about ready to merge, with the exception of revamping the Arrow/Feather support. Can you review these changes, please?

BryanCutler

@frreiss I have a few questions on some of the high-level designs first before I do a more detailed review.

text_extensions_for_pandas/array/span.py

text_extensions_for_pandas/array/token_span.py

text_extensions_for_pandas/array/token_table.py

frreiss · 2021-02-27T03:17:56Z

Pushed changes that simplify the way TokenSpanArray manages multiple tokenizations and restore the conventional constructor APIs. @BryanCutler can you please have a look when you get a chance?

frreiss · 2021-03-02T19:33:41Z

Pushed some additional changes to HTML rendering so that spans will render correctly in JupyterLab dark mode and in VSCode notebooks.

BryanCutler

Just a couple of questions - I think I answered some myself as I went along :)
The code changes look good to me. I didn't look at the notebooks, but I'm assuming they were just updated for the API changes, so feel free to merge.

text_extensions_for_pandas/array/span.py

frreiss · 2021-03-03T18:38:39Z

Thanks for the careful review, @BryanCutler! I'm going to merge these changes into master now.

frreiss added 3 commits January 25, 2021 17:30

Initial version of StringTable class

ab58815

Initial implementation of multi-doc SpanArrays.

a5e16de

Initial pass through Span class Fix typo in type hint

Merge branch 'master' of https://github.com/CODAIT/text-extensions-fo…

2aa58cd

…r-pandas into branch-multidoc

frreiss marked this pull request as draft January 26, 2021 23:27

frreiss requested a review from BryanCutler January 26, 2021 23:28

frreiss mentioned this pull request Jan 26, 2021

Extend CharSpanArray and TokenSpanArray to support multiple documents #73

Closed

BryanCutler reviewed Feb 3, 2021

frreiss added 2 commits February 3, 2021 11:35

Merge branch 'master' of https://github.com/CODAIT/text-extensions-fo…

692efa5

…r-pandas into branch-multidoc

Fixed a bunch of regressions

ba2bf9b

frreiss added 6 commits February 17, 2021 16:26

Pull base class for tables of things

19dfad7

Merge branch 'master' of https://github.com/CODAIT/text-extensions-fo…

759f9e0

…r-pandas into branch-multidoc

Additional refactoring of ThingTable

1ec4880

Adjust remainder of package for changes to span internals

4b09833

Fix bug introduced by pycharm auto-fix-up-imports

5d40dd8

Merge branch 'master' of https://github.com/CODAIT/text-extensions-fo…

68f21fd

…r-pandas into branch-multidoc

frreiss added 2 commits February 22, 2021 16:31

Fix bug in document_text property and add document_tokens property

d7f437f

Make combine_folds() return a multi-doc SpanArray instead of a series…

bd8c2e0

… of dtype object

Update and rerun notebooks

98dc855

BryanCutler reviewed Feb 23, 2021

frreiss added 4 commits February 24, 2021 09:30

Update and rerun notebooks under tutorials/corpus

cbc0d2b

Clean up constructors and remove token table

ce7d2ec

Rerun notebooks after code changes

48ddb9e

Rerun expensive notebooks after code changes

9cb2a74

frreiss added 2 commits March 2, 2021 11:29

Further updates to rendering to fix dark mode

59749f9

Rerun notebooks after rendering changes

5aa13ad

frreiss changed the title ~~[WIP] Add multi-document support for span arrays~~ Add multi-document support for span arrays Mar 2, 2021

Update test results to reflect HTML changes

fdfe2aa

BryanCutler approved these changes Mar 2, 2021

Make document_text property more user-friendly

0fdd13b

frreiss marked this pull request as ready for review March 3, 2021 18:23

frreiss merged commit 5e3a11e into CODAIT:master Mar 3, 2021

frreiss deleted the branch-multidoc branch October 29, 2021 20:17
