Databricks // Trigger parquet FORMAT_OPTION merge-schema in loader, when schema drifts within the batch emitted by stream transformer. #1126

voropaevp opened this issue Nov 7, 2022 · 0 comments

voropaevp commented Nov 7, 2022

Problem

Some users highlighted issues with parquet loading:

  • Format options in PR #1083
  • Performance degradation during COPY TO reported via Discourse

The investigation concluded that Databricks distributes the load based on the number of parquet partitions: Spark creates a task per core until it runs out of partitions to load. A higher number of partitions is therefore generally good for performance, unless the partitions become so small that task-creation overhead starts to dominate.

FORMAT_OPTIONS functional tests

A set of experiments mixing parquet schemas within a batch showed that the result of combining them is non-deterministic (a varying number of columns is lost, sometimes none), and oftentimes results in the loss of one or more columns of data. The PR referenced above also shows that it is possible to crash the loading of the batch entirely.

COPY_OPTIONS benchmarks

COPY_OPTIONS were benchmarked as a part of a different exercise, showing no impact on loading performance.
The figure below shows the difference between COPY_OPTIONS and no COPY_OPTIONS.
[figure: load time comparison, COPY_OPTIONS vs no COPY_OPTIONS]

Averaged over 10,000 runs, the load time difference between COPY_OPTIONS and no COPY_OPTIONS is < 0.01 sec.

Partitioning/FORMAT_OPTIONS benchmark

Here are the results of our benchmarks (3 GB dataset) on a 3-node cluster:

No. of partitions | Heavily utilized cluster | Fresh cluster | FORMAT_OPTIONS on fresh cluster
1                 | 16m 51s                  | 2m 33s        | 2m 15s
2                 | 3m 53s                   | 56s           | 48s
2                 | 59s                      | 37s           | 51s

Conclusions

  • On large clusters, it is beneficial to decrease the number of items per partition while increasing the time window, which increases the number of parquet partitions and enables better load distribution.

  • FORMAT_OPTIONS is critical for loading the data consistently during schema drifts.

  • FORMAT_OPTIONS showed a performance decrease on datasets with many partitions. FORMAT_OPTIONS merges the schema across the parquet dataset, as opposed to COPY_OPTIONS, which merges the schema with the target table.
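To make the distinction between the two option sets concrete, here is a sketch of how the loader could render the COPY statement, attaching FORMAT_OPTIONS only when the batch was flagged as drifted. The function and the table/path names are illustrative, not from the actual codebase; the `COPY INTO` syntax follows the standard Databricks form with the `mergeSchema` option.

```python
# Sketch: build a Databricks COPY INTO statement, enabling the parquet
# mergeSchema format option only when schema drift was flagged for the batch.
# build_copy_statement and its arguments are hypothetical names.

def build_copy_statement(table: str, path: str, schema_drifted: bool) -> str:
    """Render a COPY INTO statement; FORMAT_OPTIONS only on drift."""
    stmt = [
        f"COPY INTO {table}",
        f"FROM '{path}'",
        "FILEFORMAT = PARQUET",
    ]
    if schema_drifted:
        # merges the schema across the parquet files in the batch
        stmt.append("FORMAT_OPTIONS ('mergeSchema' = 'true')")
    # merges the combined batch schema into the target table
    stmt.append("COPY_OPTIONS ('mergeSchema' = 'true')")
    return "\n".join(stmt)

print(build_copy_statement("events", "/transformed/run-1", schema_drifted=True))
```

This keeps the common (no-drift) path free of the FORMAT_OPTIONS performance penalty observed in the benchmarks above.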

Proposed Solution

Make the stream transformer detect schema drift within the window and flag such an event as an optional part of the ShreddingComplete message, so that the loader attaches FORMAT_OPTIONS ('mergeSchema' = 'true') to the COPY command only when necessary.
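The transformer-side check could look roughly like the following sketch: compare each event's parquet schema against the first schema seen in the window and set a drift flag. The class and method names are hypothetical, for illustration only; the real implementation would live in the stream transformer and carry the flag into ShreddingComplete.

```python
# Sketch (hypothetical names): track the parquet schema seen in a window
# and flag drift as soon as a later event's schema differs.

class SchemaDriftTracker:
    def __init__(self) -> None:
        self.schema = None    # schema of the first event in the window
        self.drifted = False  # would become the flag in ShreddingComplete

    def observe(self, schema: dict) -> None:
        """Record one event's schema (column name -> type)."""
        if self.schema is None:
            self.schema = schema
        elif schema != self.schema:
            self.drifted = True

tracker = SchemaDriftTracker()
tracker.observe({"event_id": "string", "price": "double"})
tracker.observe({"event_id": "string", "price": "double", "currency": "string"})
print(tracker.drifted)  # True: a column was added mid-window
```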
