Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-1223] Allow value escaping in CSV files. #187

Closed
wants to merge 3 commits into from
Closed

[FLINK-1223] Allow value escaping in CSV files. #187

wants to merge 3 commits into from

Conversation

jkirsch
Copy link
Contributor

@jkirsch jkirsch commented Nov 6, 2014

Strings can now contain " quoted characters
Skip trailing whitespace after quote

Strings can now contain " quoted characters
Skip trailing whitespace after quote
@StephanEwen
Copy link
Contributor

Looks good. I have one inline comment/question, though

@@ -30,7 +34,14 @@
private static final byte WHITESPACE_TAB = (byte) '\t';

private static final byte QUOTE_DOUBLE = (byte) '"';


private static final Set<Byte> trailingCheckSet = Sets.newHashSet(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parsers are often stressed badly when reading large CSV files, so we think a lot about performance here.

I am wondering if a hash set is the best choice to check for these three elements. Computing hash, table lookup, entry lookup, comparison, ...

Might be cheaper to just hardwire the check for those types...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use sets when comparing for unquoted characters
@jkirsch
Copy link
Contributor Author

jkirsch commented Nov 7, 2014

Thanks for having a look at the code

Indeed the Set approach is nicer to read but there is a performance penalty incurred

I just ran a simple caliper benchmark and on my machine the set approach is about 10x slower, so I fixed that.

https://microbenchmarks.appspot.com/runs/78f655b3-76f7-4c54-b110-1da29df71d53#r:scenario.benchmarkSpec.methodName

@rmetzger
Copy link
Contributor

Looks good.

@uce
Copy link
Contributor

uce commented Nov 11, 2014

OK. I'm merging this in the current batch.

@asfgit asfgit closed this in e855ef4 Nov 11, 2014
@jkirsch jkirsch deleted the csv branch January 22, 2015 15:58
tzulitai added a commit to tzulitai/flink that referenced this pull request Jan 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants