
Update to arrow 32 and Switch to RawDecoder for JSON #5056

Merged: tustvold merged 14 commits into apache:master on Jan 31, 2023

Conversation

tustvold (Contributor):

Which issue does this PR close?

Closes #.

Rationale for this change

Integration test for apache/arrow-rs#3479 and preparation for the next arrow release (apache/arrow-rs#3584).

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Switch to RawDecoder for JSON
@github-actions github-actions bot added the core Core DataFusion crate label Jan 25, 2023
Ok(futures::stream::iter(reader).boxed())
}
GetResult::Stream(s) => {
let mut decoder = RawDecoder::try_new(schema, batch_size)?;
tustvold (Contributor, Author):
I think this interface is pretty cool: it avoids needing to scan the byte stream looking for newlines, and so should add some additional speedup on top of RawDecoder's generally faster decoding
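
To make the point concrete, here is a minimal sketch of the push-based pattern this enables, assuming arrow 32's `RawDecoder::decode` returns the number of bytes consumed and `flush` yields a `RecordBatch` once enough rows are buffered. The `next_chunk` closure and the `drain` function are illustrative stand-ins, not the PR's code:

```rust
use arrow::error::ArrowError;
use arrow::json::RawDecoder; // path as of arrow 32; re-exports may differ
use arrow::record_batch::RecordBatch;

// Feed arbitrary byte chunks straight into the decoder, with no newline
// scanning in between, flushing a RecordBatch whenever one is full.
fn drain(
    decoder: &mut RawDecoder,
    mut next_chunk: impl FnMut() -> Option<Vec<u8>>,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut batches = vec![];
    while let Some(chunk) = next_chunk() {
        let mut buf = chunk.as_slice();
        // decode() reports how many bytes it consumed; it can stop early
        // once batch_size rows are buffered, so loop until the chunk is done
        while !buf.is_empty() {
            let read = decoder.decode(buf)?;
            buf = &buf[read..];
            // a short read means a batch is full: emit it before continuing
            if !buf.is_empty() {
                if let Some(batch) = decoder.flush()? {
                    batches.push(batch);
                }
            }
        }
    }
    // a final flush emits any partially filled batch
    if let Some(batch) = decoder.flush()? {
        batches.push(batch);
    }
    Ok(batches)
}
```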

alamb (Contributor) commented Jan 25, 2023:

🥳 🦜

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 27, 2023
@tustvold tustvold marked this pull request as ready for review January 30, 2023 16:46
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner labels Jan 30, 2023
@alamb alamb added the api change Changes the API exposed to users of the crate label Jan 30, 2023
alamb (Contributor) left a comment:

Looks great @tustvold -- the only question I have is on feeding in empty buffers to the csv reader -- but perhaps I am misreading something

// walk the next level
root_error = source;
// remember the lowest datafusion error so far
if let Some(e) = root_error.downcast_ref::<DataFusionError>() {
last_datafusion_error = e;
} else if let Some(e) = root_error.downcast_ref::<Arc<DataFusionError>>() {
// As `Arc<T>::source()` calls through to `T::source()` we need to
Contributor:
👍
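
For context, here is a self-contained sketch of the pattern the snippet above comes from: walk the `source()` chain and remember the deepest `DataFusionError` seen, whether stored directly or behind an `Arc`. The `DataFusionError` stub and the function name are stand-ins for illustration; the walk itself mirrors the diff:

```rust
use std::error::Error;
use std::fmt;
use std::sync::Arc;

// Stand-in for datafusion::error::DataFusionError, just enough to compile
#[derive(Debug)]
struct DataFusionError(String);

impl fmt::Display for DataFusionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "DataFusionError: {}", self.0)
    }
}

impl Error for DataFusionError {}

// Walk the source() chain, keeping the lowest DataFusionError seen so far.
fn find_root(e: &(dyn Error + 'static)) -> Option<&DataFusionError> {
    let mut root = e;
    let mut last: Option<&DataFusionError> = None;
    loop {
        if let Some(e) = root.downcast_ref::<DataFusionError>() {
            last = Some(e);
        } else if let Some(e) = root.downcast_ref::<Arc<DataFusionError>>() {
            // Arc<T>::source() forwards to T::source() (std implements Error
            // for Arc<T> where T: Error), so the walk continues through
            // Arc-wrapped errors as well
            last = Some(e.as_ref());
        }
        match root.source() {
            Some(source) => root = source, // walk the next level
            None => return last,
        }
    }
}
```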

@@ -217,6 +217,9 @@ fn default_field_name(dt: &DataType) -> &str {
DataType::Union(_, _, _) => "union",
DataType::Dictionary(_, _) => "map",
DataType::Map(_, _) => unimplemented!("Map support not implemented"),
DataType::RunEndEncoded(_, _) => {
unimplemented!("RunEndEncoded support not implemented")
Contributor:

[image attachment]

@@ -21,7 +21,6 @@ use crate::datasource::file_format::file_type::FileCompressionType;
use crate::error::{DataFusionError, Result};
use crate::execution::context::{SessionState, TaskContext};
use crate::physical_plan::expressions::PhysicalSortExpr;
use crate::physical_plan::file_format::delimited_stream::newline_delimited_stream;
Contributor:

Can the corresponding newline_delimited_stream module be deleted too?

https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20newline_delimited_stream&type=code

tustvold (Contributor, Author):

Unfortunately it is still used by the schema inference logic; I'll see about resurrecting the PR that moves to the upstream implementation

}
let decoded = match decoder.decode(buffered.as_ref()) {
// Note: the decoder needs to be called with an empty
// array to delimt the final record
Contributor:

Suggested change:
- // array to delimt the final record
+ // array to delimit the final record

Contributor:

I must be missing how the code is called with an empty buffer. If all data in buffered was consumed and then the next poll was empty, won't that break out of the loop prior to calling decode() 🤔

tustvold (Contributor, Author):

You are quite correct, I'm investigating how this is working...
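
For reference, this is the sequence the code comment describes, as a sketch; the open question above is whether the loop actually reaches the final empty-slice call. `poll_next_chunk` is a placeholder for polling the byte stream, `Decoder` is arrow 32's push-based CSV decoder (type path assumed), and a full implementation must also respect the byte count `decode` returns and flush intermediate batches, elided here:

```rust
use arrow::csv::reader::Decoder; // push-based decoder, path as of arrow 32
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn decode_all(
    decoder: &mut Decoder,
    mut poll_next_chunk: impl FnMut() -> Option<Vec<u8>>,
) -> Result<Option<RecordBatch>, ArrowError> {
    // drain the stream first
    while let Some(chunk) = poll_next_chunk() {
        decoder.decode(&chunk)?;
    }
    // a final decode with an empty slice marks end-of-input, so a trailing
    // record with no terminating newline is still delimited
    decoder.decode(&[])?;
    decoder.flush()
}
```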

@@ -218,7 +218,7 @@ impl TryFrom<&DataType> for protobuf::arrow_type::ArrowTypeEnum {
DataType::Decimal256(_, _) => {
return Err(Error::General("Proto serialization error: The Decimal256 data type is not yet supported".to_owned()))
}
- DataType::Map(_, _) => {
+ DataType::Map(_, _) | DataType::RunEndEncoded(_, _) => {
Contributor:

I recommend either updating the error message here or adding a separate clause for RunEndEncoded
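
For instance, one way to act on that suggestion is to split the arm so the error names the right type; the message wording here is illustrative, not the PR's final text:

```rust
DataType::Map(_, _) => {
    return Err(Error::General(
        "Proto serialization error: The Map data type is not yet supported".to_owned(),
    ))
}
DataType::RunEndEncoded(_, _) => {
    return Err(Error::General(
        "Proto serialization error: The RunEndEncoded data type is not yet supported".to_owned(),
    ))
}
```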

@tustvold tustvold merged commit bb699eb into apache:master Jan 31, 2023
ursabot commented Jan 31, 2023:

Benchmark runs are scheduled for baseline = a218b70 and contender = bb699eb. bb699eb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
