[Format] Consider adding an official variant type to Arrow #42069

CurtHagenlocher · 2024-06-10T16:14:18Z

Describe the enhancement requested

This could be aligned with the new Spark variant type or it could not be.

Component(s)

Format

zeroshade · 2024-06-10T19:01:16Z

What would be the benefit of this over the current Union types? Is it just to alleviate the need to specify all the types up front?

CurtHagenlocher · 2024-06-10T19:26:52Z

What would be the benefit of this over the current Union types? Is it just to alleviate the need to specify all the types up front?

That's part of it, yes, but many sources also support variants containing semistructured types where you could conceivably need a combinatorial explosion of unions to reflect all the data. (See https://docs.snowflake.com/en/sql-reference/data-types-semistructured, for instance.)

wjones127 · 2024-06-12T01:29:31Z

I’m investigating using the Spark spec as an extension type in DataFusion. I’ll report back here whether it turns out to work well with Arrow layouts.

wjones127 · 2024-06-13T16:05:05Z

Notes from discussion with original developers

I talked to developers at Databricks who worked on adding this feature to Spark and Delta Lake. Here a few notes from that.

This is being added as a data type in Spark and Delta Lake. They intend to add this data type to Iceberg as well.
They called it the “Open Variant Data Type” with the intention that this data type would proliferate to other systems.
- They have a standalone Java library that implements the data type. That’s the Java code at https://github.com/apache/spark/tree/master/common/variant
The data is stored as two binary fields: one to hold a string dictionary, the other to hold the binary representation of the values. It is generally kept as binary data in memory, but engines are free to manipulate it as they wish.
The eventual plan is to support record shredding, where fields that have dense values will be split out into their own columns. This allows row group / page pruning to happen with normal Parquet statistics / indices.
- Record shredding will have to be the same per Parquet file, but could be different between files.
- Once in memory, variants will be either recombined into the two binary columns or else have been selected back into their fully shredded forms. This is because most engines will require a common schema across files. The good news here is that means by the time it might be exported into Arrow data, we wouldn’t have to worry about the shredding.
Performance justification: JSON and BSON are not designed for OLAP queries.
- The canonical pathological case is where you are extracting the last field in a large object. JSON has to do O(n) string comparisons, the variant form replaces them with integer comparisons.
- The main performance optimization is that object keys (and other common strings) are pulled out into a common string dictionary. This reduces the size, but also replaces all the string comparisons needed in field lookups with integer comparisons.

wjones127 · 2024-06-13T16:13:49Z

An Arrow extension type?

In the near term, I think this would make a good Arrow extension type. This would be:

struct<
  metadata: dictionary<binary>,
  data: binary
>

The metadata will usually be a single binary shared across all rows, but could be multiple. (Multiple might happen if two different batches are concatenated together, for example.) Either dictionary or REE encoded array would be appropriate.

The data could be either binary, large binary, or binary view.

Binary view isn’t widely supported right now, but could be very useful for this data type. This is because sub-objects can be sliced out of variants. From the spec ¹:

Another motivation for the representation is that (aside from metadata) each inner Variant value is contiguous and self-contained. For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.

Where could this be useful?

A few immediate places I think this extension type could be useful:

Roundtrip variant Arrow ↔ Spark
- Spark Connect (and any ADBC connector to that) would benefit from this
Extension type in PyArrow, roundtrip PySpark ↔ PyArrow
DataFusion function library (I’m experimenting with that now)
- There's been substantial interest in DataFusion community for a way to handle semi-structured data efficiently.

Extension type pitfalls

The main pitfall of using an extension type for this is the storage type is meaningless to users. They need to have special libraries to interpret the bytes if pulled into a system that doesn't understand the variant extension type.

In addition, most existing Arrow systems I've worked with don't have a way to customize how extension arrays are printed. I think this is something we should fix. A reasonable workaround in the meantime is providing functions that convert these back to JSON strings for the purpose of printing.

https://github.com/apache/spark/blob/master/common/variant/README.md ↩

emkornfield · 2024-06-14T01:19:58Z

I think this makes sense as a extension type. I think given subcolumnarization work happening one might also want to store a union in the type as well for columns that have been split out

emkornfield · 2024-06-14T03:55:46Z

One other thought, I think the variant type in spark has a more limited type surface then Arrow, that is potentially something that might need reconciling

CurtHagenlocher · 2024-06-14T14:58:47Z

Yeah, I think there are really two different requests possible here: an Arrow-native variant type and a Spark-compatible variant type. The surface area thing works both ways: like Parquet, the Spark variant supports 32-bit and 64-bit decimal values while Arrow does not.

wjones127 · 2024-06-15T04:39:44Z

Yeah, I think there are really two different requests possible here: an Arrow-native variant type and a Spark-compatible variant type.

I think it’s too early to say whether an Arrow-native one makes sense. The Spark / delta lake teams have intentions that their standard will proliferate to other engines. At which point it will not be a Spark specific thing and might make more sense to align with these types. If that succeeds, it would make sense for us to align with the format. The Open Variant Data Type has a version field, so it could be amenable to expansions of the types in there.

If this standard doesn’t proliferate to other engines and ends up being Spark specific while other engines maintain different standards, we will have to have a conversation about what kind of variant type would make a good interchange format. That would be a point where Arrow designing its own format would make sense.

Either way, it’s much too early to know which direction to go. Spark 4.0 isn’t even release yet. I think this is the stage where we should experiment with this type as a non-canonical extension type and keep an eye on the data types adoption in the wider ecosystem.

emkornfield · 2024-06-15T14:56:33Z

Yeah, fwiw there is an iceberg proposal to also support variant type and if IIUC the current incarnation is to support spark with iceberg types but it hasn't made it very far yet

CurtHagenlocher · 2024-06-15T15:06:32Z

Even if there were no Spark (or Iceberg) variant type there would still be variants stored in databases and it would be nice for ADBC to be able to return those in a somewhat-consistent fashion. I suppose ADBC could define its own extension type for this purpose.

CurtHagenlocher · 2024-06-15T15:12:02Z

For curious observers, there's a thread about the Iceberg proposal at https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 and the proposal itself at https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit#heading=h.rt0cvesdzsj7.

alamb · 2024-06-19T10:09:41Z

Binary view isn’t widely supported right now, but could be very useful for this data type. This is because sub-objects can be sliced out of variants. From the spec 1:

BTW we are actively working on implementing StringView / BinaryView support in arrow-rs apache/arrow-rs#5374 and DataFusion apache/datafusion#10918 and thanks to @XiangpengXao, @Weijun-H and other we are making good progress

CurtHagenlocher added the Type: enhancement label Jun 10, 2024

github-actions bot added the Component: Format label Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Format] Consider adding an official variant type to Arrow #42069

[Format] Consider adding an official variant type to Arrow #42069

CurtHagenlocher commented Jun 10, 2024

zeroshade commented Jun 10, 2024

CurtHagenlocher commented Jun 10, 2024

wjones127 commented Jun 12, 2024

wjones127 commented Jun 13, 2024 •

edited

Loading

wjones127 commented Jun 13, 2024

emkornfield commented Jun 14, 2024

emkornfield commented Jun 14, 2024

CurtHagenlocher commented Jun 14, 2024

wjones127 commented Jun 15, 2024

emkornfield commented Jun 15, 2024

CurtHagenlocher commented Jun 15, 2024

CurtHagenlocher commented Jun 15, 2024 •

edited

Loading

alamb commented Jun 19, 2024

[Format] Consider adding an official variant type to Arrow #42069

[Format] Consider adding an official variant type to Arrow #42069

Comments

CurtHagenlocher commented Jun 10, 2024

Describe the enhancement requested

Component(s)

zeroshade commented Jun 10, 2024

CurtHagenlocher commented Jun 10, 2024

wjones127 commented Jun 12, 2024

wjones127 commented Jun 13, 2024 • edited Loading

Notes from discussion with original developers

wjones127 commented Jun 13, 2024

An Arrow extension type?

Where could this be useful?

Extension type pitfalls

Footnotes

emkornfield commented Jun 14, 2024

emkornfield commented Jun 14, 2024

CurtHagenlocher commented Jun 14, 2024

wjones127 commented Jun 15, 2024

emkornfield commented Jun 15, 2024

CurtHagenlocher commented Jun 15, 2024

CurtHagenlocher commented Jun 15, 2024 • edited Loading

alamb commented Jun 19, 2024

wjones127 commented Jun 13, 2024 •

edited

Loading

CurtHagenlocher commented Jun 15, 2024 •

edited

Loading