Fix Arrow serialization for SpanArray multidoc support #181

BryanCutler · 2021-03-23T05:04:20Z

This changes Arrow serialization for SpanArray to store documents in a dictionary that is indexed by text ids. Also added support for saving to Parquet files.

From #179

BryanCutler · 2021-03-23T05:05:33Z

text_extensions_for_pandas/array/arrow_conversion.py

+ # Create a dictionary array from StringTable used in this span
+ dictionary = pa.array(list(char_span._string_table.things))
+ target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
+ # TODO: remove unused things and normalize text_ids?


Not sure if we should sanitize these before serializing, removing any unused documents and then resetting the text ids.

It would be a good idea to add that functionality to SpanArray itself, so that user code can trim down the in-memory footprint, and to invoke that method both before serializing and when copying a SpanArray.

But I think it's ok not to implement it for now.
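For reference, the trimming idea discussed in this thread could look roughly like the following. This is a minimal sketch in plain Python, not code from this PR: `compact_text_ids` is a hypothetical helper, and the lists `texts` and `text_ids` are stand-ins for the StringTable contents and the SpanArray's `_text_ids`.

```python
def compact_text_ids(texts, text_ids):
    """Drop texts that no span refers to and renumber the ids densely.

    Hypothetical helper for illustration only; not part of SpanArray.
    """
    # Ids that are actually referenced, in ascending order.
    used = sorted(set(text_ids))
    # Map each surviving old id to its new, dense id.
    remap = {old: new for new, old in enumerate(used)}
    new_texts = [texts[i] for i in used]
    new_ids = [remap[i] for i in text_ids]
    return new_texts, new_ids


texts = ["doc A", "doc B (unused)", "doc C"]
text_ids = [2, 0, 2, 0]
new_texts, new_ids = compact_text_ids(texts, text_ids)
```

Every span still resolves to the same document text after the renumbering; only the unused entry is gone.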

BryanCutler · 2021-03-23T05:07:18Z

text_extensions_for_pandas/array/arrow_conversion.py

- typ = ArrowSpanType(begins_array.type, char_span.target_text)
+ # Create a dictionary array from StringTable used in this span
+ dictionary = pa.array(list(char_span._string_table.things))
+ target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)


This uses the private char_span._string_table and ._text_ids. It didn't seem like we really needed accessors for these so I left it like this.

I guess this works. You might want to consider giving the StringTable class a to_arrow_dictionary method instead though.

Well, the dictionary in pyarrow is just an ordinary array; we just need to make sure that its indices match those in the _text_ids array, which I think they do. I believe I should unbox the "thing" first, though.

BryanCutler · 2021-03-23T05:08:27Z

text_extensions_for_pandas/array/arrow_conversion.py

+ target_text_dict_dtype = extension_array.field(ArrowSpanType.TARGET_TEXT_DICT_NAME).type
+ extension_array = pa.ExtensionArray.from_storage(
+     ArrowSpanType(index_dtype, target_text_dict_dtype),
+     extension_array)


For some reason, reading from Parquet doesn't return an extension array, only the struct array used as storage. That seems like a bug, so I'll follow up on it, but this workaround seemed ok for now.

ok, works for me.

BryanCutler · 2021-03-23T05:11:59Z

text_extensions_for_pandas/array/arrow_conversion.py

+ # Create target text StringTable and text_ids from dictionary array
+ target_text_dict_array = extension_array.storage.field(ArrowSpanType.TARGET_TEXT_DICT_NAME)
+ target_texts = [s.as_py() for s in target_text_dict_array.dictionary]
+ string_table, _ = StringTable.merge_things(target_texts)


This assumes that the text_ids will match the new ids returned from merge_things. I think it's safe to do this.

I think it would be safer to add a class method to ThingTable that initializes an instance directly from a list of <thing, ID> pairs. Future versions of merge_things() will probably use a different, faster algorithm that may produce different IDs.

yup, that's a good idea. I'll add that here.

It's a little more useful to have a method to create a table from just a list of strings, so I added that. If you think we should have just one that takes things and ids to make a table, I can change it.
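As an illustration of the list-of-strings variant, a constructor along these lines would pin each ID to the thing's position in the list, so the IDs cannot drift if merge_things() changes its algorithm. `ToyThingTable` and its method names are hypothetical stand-ins, not the actual ThingTable API.

```python
class ToyThingTable:
    """Hypothetical, simplified stand-in for ThingTable."""

    def __init__(self):
        self._id_to_thing = []
        self._thing_to_id = {}

    @classmethod
    def from_things(cls, things):
        """Create a table in which things[i] gets ID i."""
        table = cls()
        for thing in things:
            if thing in table._thing_to_id:
                raise ValueError(f"Duplicate thing: {thing!r}")
            table._thing_to_id[thing] = len(table._id_to_thing)
            table._id_to_thing.append(thing)
        return table

    def thing_to_id(self, thing):
        return self._thing_to_id[thing]

    def id_to_thing(self, thing_id):
        return self._id_to_thing[thing_id]


# Texts as they would come out of the Arrow dictionary, in index order.
target_texts = ["doc one", "doc two"]
table = ToyThingTable.from_things(target_texts)
```

With this shape, the indices stored in the Arrow dictionary array can be reused directly as text ids after deserialization.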

BryanCutler · 2021-03-23T05:14:42Z

@frreiss this seems like a good improvement for SpanArray serialization - much better to store in a dictionary batch rather than field metadata. If this looks ok, I'll get started on TokenSpanArray.

frreiss

Looking good. Some minor cleanup requested.

frreiss · 2021-03-24T21:44:56Z

text_extensions_for_pandas/array/arrow_conversion.py

- typ = ArrowSpanType(begins_array.type, char_span.target_text)
+ # Create a dictionary array from StringTable used in this span
+ dictionary = pa.array(list(char_span._string_table.things))
+ target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)


I guess this works. You might want to consider giving the StringTable class a to_arrow_dictionary method instead though.

frreiss · 2021-03-24T21:53:57Z

text_extensions_for_pandas/array/arrow_conversion.py

+ # Create a dictionary array from StringTable used in this span
+ dictionary = pa.array(list(char_span._string_table.things))
+ target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
+ # TODO: remove unused things and normalize text_ids?


It would be a good idea to add that functionality to SpanArray itself, so that user code can trim down the in-memory footprint, and to invoke that method both before serializing and when copying a SpanArray.

But I think it's ok not to implement it for now.

frreiss · 2021-03-24T21:54:13Z

text_extensions_for_pandas/array/arrow_conversion.py

+ target_text_dict_dtype = extension_array.field(ArrowSpanType.TARGET_TEXT_DICT_NAME).type
+ extension_array = pa.ExtensionArray.from_storage(
+     ArrowSpanType(index_dtype, target_text_dict_dtype),
+     extension_array)


ok, works for me.

frreiss · 2021-03-24T21:59:42Z

text_extensions_for_pandas/array/arrow_conversion.py

+ # Create target text StringTable and text_ids from dictionary array
+ target_text_dict_array = extension_array.storage.field(ArrowSpanType.TARGET_TEXT_DICT_NAME)
+ target_texts = [s.as_py() for s in target_text_dict_array.dictionary]
+ string_table, _ = StringTable.merge_things(target_texts)


I think it would be safer to add a class method to ThingTable that initializes an instance directly from a list of <thing, ID> pairs. Future versions of merge_things() will probably use a different, faster algorithm that may produce different IDs.

frreiss · 2021-03-24T22:03:25Z

text_extensions_for_pandas/array/test_span.py

@@ -455,17 +457,28 @@ def test_addition(self):

 class CharSpanArrayIOTests(ArrayTestBase):

-    @pytest.mark.skip("Temporarily disabled until Feather support reimplemented")
     def test_feather(self):
         arr = self._make_spans_of_tokens()


Would you mind modifying these tests so that they write out a SpanArray containing spans over two different document texts?

I added a separate test for multi-doc.

BryanCutler · 2021-03-25T23:14:28Z

I think I addressed everything and tests are passing. I'll go ahead and merge now and fix anything remaining in a follow-up or when I fix the TokenSpanArray Arrow conversion.

BryanCutler added 3 commits March 22, 2021 17:06

Have basic serialization of SpanArray done

ef5d3b1

Added test for span to parquet, needed workaround for reading

311764f

Cleanup StringTable creation

d34810a

BryanCutler commented Mar 23, 2021

BryanCutler requested a review from frreiss March 23, 2021 05:12

frreiss approved these changes Mar 24, 2021

Added new constructor for StringTable, new test for multi-doc serialization

7595cdd

BryanCutler merged commit 7d34e22 into CODAIT:master Mar 25, 2021

BryanCutler deleted the arrow-multidoc-support-179 branch March 25, 2021 23:15
