Extend CharSpanArray and TokenSpanArray to support multiple documents #73

frreiss · 2020-08-05T17:38:02Z

The current implementation of CharSpanArray and TokenSpanArray allows only a single target text for all of the spans in a given array. This restriction is fine as long as all the spans in a given DataFrame come from a single document, but it complicates use cases that combine information from multiple documents in a single DataFrame. Currently, the only way to have spans from multiple documents in one Series is to convert the CharSpanArray/TokenSpanArray arrays into arrays of type object containing individual CharSpan and TokenSpan objects.

We should extend our span array types to allow for multiple target texts per array. Key challenges to address:

- Memory-efficient representation for the span data
- Clean semantics for span comparison across documents. What happens if two documents in a corpus have the same text?
- Efficient implementations of the span operations under text_extensions_for_pandas.spanner with multiple target texts
- Serialization/deserialization to/from Arrow and Feather format
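
To make the first two challenges concrete, here is a minimal sketch of one possible layout: dictionary-encode the target texts so that each span stores only an integer text ID plus its begin and end offsets. This is an illustration only, not the design implemented in #170. The class `MultiDocSpans` and all of its methods are hypothetical names, and the sketch assumes that documents with identical text are treated as the same target text.

```python
import numpy as np


class MultiDocSpans:
    """Hypothetical sketch of a multi-document span container.

    Not part of text_extensions_for_pandas; names are illustrative only.
    """

    def __init__(self, texts, begins, ends):
        # Dictionary-encode the target texts: each distinct string is stored
        # once, and every span keeps a small integer ID pointing at it.
        # ASSUMPTION: documents with identical text share a single entry.
        table = {}
        ids = [table.setdefault(t, len(table)) for t in texts]
        self.text_table = list(table)  # distinct texts, in first-seen order
        self.text_ids = np.array(ids, dtype=np.int32)
        self.begins = np.array(begins, dtype=np.int32)
        self.ends = np.array(ends, dtype=np.int32)

    def covered_text(self):
        # Resolve each span against its own target text.
        return [
            self.text_table[int(i)][int(b):int(e)]
            for i, b, e in zip(self.text_ids, self.begins, self.ends)
        ]

    def same_span(self, i, j):
        # Two spans compare equal only if they have the same target text
        # and the same offsets.
        return bool(
            self.text_ids[i] == self.text_ids[j]
            and self.begins[i] == self.begins[j]
            and self.ends[i] == self.ends[j]
        )


doc_a = "Alice met Bob."
doc_b = "Carol met Dave."
spans = MultiDocSpans([doc_a, doc_b, doc_a], [0, 0, 10], [5, 5, 13])
covered = spans.covered_text()
```

With this layout the per-span overhead is three integers regardless of document length. The trade-off is that deduplicating by string value makes two distinct documents with identical text indistinguishable, which is exactly the comparison question raised above.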


frreiss · 2021-01-26T23:43:26Z

WIP PR at #170

frreiss · 2021-03-03T18:38:58Z

Closed by #170

frreiss · 2021-03-03T18:43:29Z

Correction: #170 implements everything but updated Arrow support. Opened #179 to cover that work.

frreiss added the enhancement New feature or request label Aug 25, 2020

frreiss changed the title ~~ENH: Extend CharSpanArray and TokenSpanArray to support multiple documents~~ Extend CharSpanArray and TokenSpanArray to support multiple documents Aug 25, 2020

frreiss mentioned this issue Jan 26, 2021

Add multi-document support for span arrays #170

Merged

frreiss self-assigned this Jan 26, 2021

frreiss closed this as completed Mar 3, 2021

frreiss mentioned this issue Mar 3, 2021

Reimplement Arrow conversion to cover multi-document SpanArrays/TokenSpanArrays #179

Closed
