Fix Arrow serialization for SpanArray multidoc support #181
Conversation
# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
# TODO: remove unused things and normalize text_ids?
Not sure if we should sanitize these before serializing to remove any unused documents, then reset the text ids?
It would be a good idea to add that functionality to SpanArray itself, so that user code can trim down the in-memory footprint, and to invoke that method both before serializing and when copying a SpanArray. But I think it's ok not to implement it for now.
typ = ArrowSpanType(begins_array.type, char_span.target_text)
# Create a dictionary array from StringTable used in this span
dictionary = pa.array(list(char_span._string_table.things))
target_text_dict_array = pa.DictionaryArray.from_arrays(char_span._text_ids, dictionary)
This uses the private char_span._string_table and ._text_ids. It didn't seem like we really needed accessors for these, so I left it like this.
I guess this works. You might want to consider giving the StringTable class a to_arrow_dictionary method instead, though.
Well, the dictionary in pyarrow is just an ordinary array; we just need to make sure that the indices match those in the _text_ids array, which I think they do. I believe I should unbox the "thing" first, though.
target_text_dict_dtype = extension_array.field(ArrowSpanType.TARGET_TEXT_DICT_NAME).type
extension_array = pa.ExtensionArray.from_storage(
    ArrowSpanType(index_dtype, target_text_dict_dtype),
    extension_array)
For some reason, parquet doesn't return an extension array, but the struct array used as storage. Seems like a bug, so I'll follow up on it, but this workaround seemed ok for now.
ok, works for me.
# Create target text StringTable and text_ids from dictionary array
target_text_dict_array = extension_array.storage.field(ArrowSpanType.TARGET_TEXT_DICT_NAME)
target_texts = [s.as_py() for s in target_text_dict_array.dictionary]
string_table, _ = StringTable.merge_things(target_texts)
This assumes that the text_ids will match the new ids returned from merge_things. I think it's safe to do this.
I think it would be safer to add a class method to ThingTable that initializes an instance directly from a list of <thing, ID> pairs. Future versions of merge_things() will probably use a different, faster algorithm that may produce different IDs.
yup, that's a good idea. I'll add that here.
It's a little more useful to have a method to create a table from just a list of strings, so I added that. If you think we should have just one that takes things and ids to make a table, I can change it.
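A constructor along those lines might be sketched as follows; this ThingTable is a hypothetical simplification (the real class and its method names are internal to the library):

```python
class ThingTable:
    """Hypothetical simplification of the internal ThingTable."""

    def __init__(self):
        self._things = []   # id -> thing
        self._ids = {}      # thing -> id

    @classmethod
    def from_things(cls, things):
        """Build a table whose ids are exactly the list positions, so
        deserialized text_ids stay valid without any remapping."""
        table = cls()
        for i, thing in enumerate(things):
            table._things.append(thing)
            table._ids[thing] = i
        return table

    def id_of(self, thing):
        return self._ids[thing]

    def thing_at(self, i):
        return self._things[i]
```

Because `from_things` assigns ids by list position, the ids baked into a serialized dictionary array remain correct regardless of how `merge_things()` assigns ids in future versions.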
@frreiss this seems like a good improvement for SpanArray serialization - much better to store in a dictionary batch rather than field metadata. If this looks ok, I'll get started on TokenSpanArray.
Looking good. Some minor cleanup requested.
@@ -455,17 +457,28 @@ def test_addition(self):

class CharSpanArrayIOTests(ArrayTestBase):

    @pytest.mark.skip("Temporarily disabled until Feather support reimplemented")
    def test_feather(self):
        arr = self._make_spans_of_tokens()
Would you mind modifying these tests so that they write out a SpanArray containing spans over two different document texts?
I added a separate test for multi-doc.
I think I addressed everything and tests are passing. I'll go ahead and merge now, and fix up anything remaining with a followup or when I fix TokenSpanArray arrow conversion.
This changes Arrow serialization for SpanArray to store documents in a dictionary that is indexed by text ids. Also adds support for saving to Parquet files.
From #179