Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Arrow serializaiton for SpanArray multidoc support #181

Merged
merged 4 commits into from
Mar 25, 2021

Conversation

BryanCutler
Copy link
Member

@BryanCutler BryanCutler commented Mar 23, 2021

This changes Arrow serialization for SpanArray to store documents in a dictionary that is indexed by text ids. Also added support for saving to Parquet files.

From #179

# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
# TODO: remove unused things and normalize text_ids?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should sanitize these before serializing to remove any unused document, then reset the text ids?

Copy link
Member

@frreiss frreiss Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a good idea to add that functionality to SpanArray itself, so that user code can trim down the in-memory footprint. And to invoke that method before serializing. And to invoke that method when copying a SpanArray.

But I think it's ok not to implement it for now.

typ = ArrowSpanType(begins_array.type, char_span.target_text)
# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the private char_span._string_table and ._text_ids. It didn't seem like we really needed accessors for these so I left it like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this works. You might want to consider giving the StringTable class a to_arrow_dictionary method instead though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the dictionary in pyarrow is just an ordinary array, we just need to make sure that the indices match those in the _text_ids array, which I think they do. I believe I should unbox the "thing" first though.

target_text_dict_dtype = extension_array.field(ArrowSpanType.TARGET_TEXT_DICT_NAME).type
extension_array = pa.ExtensionArray.from_storage(
ArrowSpanType(index_dtype, target_text_dict_dtype),
extension_array)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, parquet doesn't return an extension array, but the struct array used as storage. Seems like a bug, so I'll follow up on it, but this workaround seemed ok for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, works for me.

# Create target text StringTable and text_ids from dictionary array
target_text_dict_array = extension_array.storage.field(ArrowSpanType.TARGET_TEXT_DICT_NAME)
target_texts = [s.as_py() for s in target_text_dict_array.dictionary]
string_table, _ = StringTable.merge_things(target_texts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the text_ids will match the new ids returned from merge_thing. I think it's safe to do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safer to add a class method to ThingTable that initializes an instance directly from a list of <thing, ID> pairs. Future versions of merge_things() will probably use a different, faster algorithm that may produced different IDs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that's a good idea. I'll add that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little more useful to have a method to create a table from just a list of strings, so I added that. If you think we should have just one that takes things and ids to make a table, I can change it.

@BryanCutler
Copy link
Member Author

@frreiss this seems like a good improvement for SpanArray serialization - much better to store in a dictionary batch rather than field metadata. If this looks ok, I'll get started on TokenSpanArray.

Copy link
Member

@frreiss frreiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Some minor cleanup requested.

typ = ArrowSpanType(begins_array.type, char_span.target_text)
# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this works. You might want to consider giving the StringTable class a to_arrow_dictionary method instead though.

# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
# TODO: remove unused things and normalize text_ids?
Copy link
Member

@frreiss frreiss Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a good idea to add that functionality to SpanArray itself, so that user code can trim down the in-memory footprint. And to invoke that method before serializing. And to invoke that method when copying a SpanArray.

But I think it's ok not to implement it for now.

target_text_dict_dtype = extension_array.field(ArrowSpanType.TARGET_TEXT_DICT_NAME).type
extension_array = pa.ExtensionArray.from_storage(
ArrowSpanType(index_dtype, target_text_dict_dtype),
extension_array)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, works for me.

# Create target text StringTable and text_ids from dictionary array
target_text_dict_array = extension_array.storage.field(ArrowSpanType.TARGET_TEXT_DICT_NAME)
target_texts = [s.as_py() for s in target_text_dict_array.dictionary]
string_table, _ = StringTable.merge_things(target_texts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safer to add a class method to ThingTable that initializes an instance directly from a list of <thing, ID> pairs. Future versions of merge_things() will probably use a different, faster algorithm that may produced different IDs.

@@ -455,17 +457,28 @@ def test_addition(self):

class CharSpanArrayIOTests(ArrayTestBase):

@pytest.mark.skip("Temporarily disabled until Feather support reimplemented")
def test_feather(self):
arr = self._make_spans_of_tokens()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind modifying these tests so that they write out a SpanArray containing spans over two different document texts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a separate test for multi-doc.

@BryanCutler
Copy link
Member Author

BryanCutler commented Mar 25, 2021

I think I addressed all and tests are passing. I'll go ahead and merge now and fix up anything with a followup or when I fix TokenSpanArray arrow conversion.

@BryanCutler BryanCutler merged commit 7d34e22 into CODAIT:master Mar 25, 2021
@BryanCutler BryanCutler deleted the arrow-multidoc-support-179 branch March 25, 2021 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants