Extend CharSpanArray and TokenSpanArray to support multiple documents #73

Closed
frreiss opened this issue Aug 5, 2020 · 3 comments

Comments

@frreiss
Member

frreiss commented Aug 5, 2020

The current implementation of CharSpanArray and TokenSpanArray allows only a single target text for all of the spans in a given array. This restriction is fine as long as all the spans in a given DataFrame come from a single document, but it complicates use cases that combine information from multiple documents in a single DataFrame. Currently, the only way to have spans from multiple documents in a Series is to convert the CharSpanArray/TokenSpanArray arrays into arrays of type Object containing individual CharSpan and TokenSpan objects.

We should extend our span array types to allow for multiple target texts per array. Key challenges to address:

  • Memory-efficient representation for the span data
  • Clean semantics for span comparison across documents. What happens if two documents in a corpus have the same text?
  • Efficient implementations of the span operations under text_extensions_for_pandas.spanner with multiple target texts
  • Serialization/deserialization to/from Arrow and Feather format
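
To make the first two challenges concrete, here is a minimal sketch of one possible layout. It is not the design that was eventually implemented, and it does not use the text_extensions_for_pandas API. The class `MultiDocSpanColumn` and its methods are hypothetical names made up for illustration. The sketch assumes each target text is stored once and each span carries an integer ID pointing at its document, so that two documents with identical text remain distinct when spans are compared.

```python
from array import array
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MultiDocSpanColumn:
    """Hypothetical column of character spans over several target texts."""

    # Each target text is stored exactly once, no matter how many spans use it.
    texts: List[str] = field(default_factory=list)
    # Parallel integer arrays: one entry per span.
    begins: array = field(default_factory=lambda: array("q"))
    ends: array = field(default_factory=lambda: array("q"))
    text_ids: array = field(default_factory=lambda: array("q"))
    # Maps a document name to its position in `texts`.
    _doc_index: Dict[str, int] = field(default_factory=dict)

    def add_document(self, doc_name: str, text: str) -> int:
        """Register a target text under a unique name and return its integer ID."""
        if doc_name in self._doc_index:
            raise ValueError(f"document {doc_name!r} already registered")
        self._doc_index[doc_name] = len(self.texts)
        self.texts.append(text)
        return self._doc_index[doc_name]

    def append(self, text_id: int, begin: int, end: int) -> None:
        """Add a span [begin, end) over the document with the given ID."""
        if not 0 <= text_id < len(self.texts):
            raise IndexError(f"unknown text_id {text_id}")
        if not 0 <= begin <= end <= len(self.texts[text_id]):
            raise ValueError("span offsets fall outside the target text")
        self.text_ids.append(text_id)
        self.begins.append(begin)
        self.ends.append(end)

    def covered_text(self, i: int) -> str:
        """Return the substring of the target text that span i covers."""
        return self.texts[self.text_ids[i]][self.begins[i]:self.ends[i]]

    def key(self, i: int) -> Tuple[int, int, int]:
        """Comparison key: document identity first, then offsets.

        Assumption: spans are compared by document ID, not by text content.
        """
        return (self.text_ids[i], self.begins[i], self.ends[i])


col = MultiDocSpanColumn()
doc_a = col.add_document("doc_a", "Hello world")
doc_b = col.add_document("doc_b", "Hello world")  # same text, different document
col.append(doc_a, 0, 5)
col.append(doc_b, 0, 5)
col.append(doc_a, 6, 11)
```

Under this assumption, the first two spans cover the same characters but compare as different because they belong to different documents. Comparing by text content instead would merge them, which is the ambiguity the second bullet asks about.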
@frreiss frreiss added the enhancement New feature or request label Aug 25, 2020
@frreiss frreiss changed the title ENH: Extend CharSpanArray and TokenSpanArray to support multiple documents Extend CharSpanArray and TokenSpanArray to support multiple documents Aug 25, 2020
@frreiss
Member Author

frreiss commented Jan 26, 2021

WIP PR at #170

@frreiss frreiss self-assigned this Jan 26, 2021
@frreiss
Member Author

frreiss commented Mar 3, 2021

Closed by #170

@frreiss
Member Author

frreiss commented Mar 3, 2021

Correction: #170 implements everything but updated Arrow support. Opened #179 to cover that work.
