Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow escaping of non-printable characters in CSV output/input #124

Closed
hgschmie opened this issue Mar 27, 2019 · 9 comments
Closed

Allow escaping of non-printable characters in CSV output/input #124

hgschmie opened this issue Mar 27, 2019 · 9 comments
Labels
Milestone

Comments

@hgschmie
Copy link

We have a use case where we replace an existing CSV writer (based on Apache Commons CSV) with Jackson. The old CSV writer was configured to write "special characters" such as CR and LF as \r and \n. Jackson does not support this but adheres to RFC 4180 (where no escaping exists). This causes a lot of pain for our customers as the data we write contains often CR and LF characters.

Here is some test code:

public class CsvTest {

    public static final void main(String... args) throws Exception {

        CsvSchema schema = CsvSchema.emptySchema().withEscapeChar('\\');
        CsvFactory csvFactory = new SchemaAwareCsvFactory(schema);

        ObjectMapper csvMapper = new ObjectMapper(csvFactory);

        String [] line = new String [] { "a", "\n", "\"", ","};

        csvMapper.writeValue(new PrintWriter(System.out), line);

    }

    // Is there actually a better way to set the schema for a ObjectMapper? This seems painful.
    public static class SchemaAwareCsvFactory extends CsvFactory {
        SchemaAwareCsvFactory(CsvSchema schema) {
            super();
            this._schema = schema;
        }
    }
}

Which produces

a,"
","""",","

I can get it to produce

"a","
","\"",","

by adding

csvFactory.enable(CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS);
csvFactory.enable(Feature.ESCAPE_QUOTE_CHAR_WITH_ESCAPE_CHAR);

But what I am actually looking for is

"a","\n","\"",","

There seems to be no way to get the generator (and probably also the parser) to generate and parse control characters. Having CR and LF within quotes is legal from the RFC 4180 PoV, however most of the CSV that our systems produce get parsed by legacy ("brain dead") tools that assume that every LF is a record separator.

Apache Commons CSV has a nice summary on their Javadoc page for CSVFormat: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#DEFAULT (and below)

(god, CSV is such a mess. And that is the standard format for enterprise data???)

@cowtowncoder
Copy link
Member

Yes, I think this makes sense -- and yes, CSV is a mess, making this tricky.

The main challenge is really that of figuring API extensions that would be needed.
Aside from new CsvGenerator.Feature entries that could be added, jackson-core actually has CharacterEscapes interface for which support could perhaps be added.

Obviously there's also the problem that although escape character may be specified, there's no real standard for CSV on what escapes would be allowed, and thereby parser/decoder does not support handling either.

@hgschmie
Copy link
Author

Understood; there are different flavors of CSV. At least our customers would probably be more ok with a CSV output that at least had predictable lines, even though they would now parse the fields (or, more likely use some technology that interprets \n as a LF anyway). I don't know enough about the various pieces available in the jackson core; unfortunately most of the documentation that exists for Jackson is written in Java. :-)

@cowtowncoder
Copy link
Member

On plus side, CSV format backend is relatively simple... well, there's the state machine, but wrt quoting should be possible to see how that works.

So I think there are 3 approaches:

  1. Add one or more new CsvGenerator.Feature
  2. CsvSchema to indicate additional option
  3. Integrate CharacterEscapes: that's just a container of escaping information, mostly used by JSON backend, but not strictly tied to it.

I wish I had more time to work on these as this is highly doable thing, and very useful too.
But if anyone has time to work on it I do have time to code review & help, and this would make it in 2.10 -- which I plan on finalizing somewhat soon (Java 9 module info mostly done, need the "class name validator" for allow-list for polymorphic).

@hgschmie
Copy link
Author

Would you talk a change for Jackson 2.9 as well? We are still somewhat locked in the Java 8 world and upgrading to Java 9+ is a scary thought for a number of team. I would try my hand at it in that case.

hgschmie pushed a commit to hgschmie/jackson-dataformats-text that referenced this issue Apr 1, 2019
This is a proposed solution for FasterXML#124. It introduces a new Feature, `ESCAPE_CONTROL_CHARS_WITH_ESCAPE_CHAR`,
which will apply the standard ASCII escapes from JSON to all characters that the CSV generator writes.

If this solution is workable, I will add tests.
@hgschmie
Copy link
Author

hgschmie commented Apr 2, 2019

@cowtowncoder any chance to look at #125 ?

@cowtowncoder
Copy link
Member

@hgschmie Will check it out now.

As to 2.10: note that JDK baseline does NOT require Java 9: runtime minimum is still JDK 6, build now requires JDK 8. So 2.10 should be fine wrt Java 8, even with added module-info. That's the beauty of Moditect approach.

@cowtowncoder
Copy link
Member

Ok so my main concern is performance. I'll see if I can quickly see what effect it might have on jackson-benchmarks

@cowtowncoder
Copy link
Member

With quick serialization check I don't see a significant change (there might be 1-2% slowdown but that's within margin of error unless I run a longer test) so that's probably just fine.
Code looked fine but I'll go over it again just to be sure I didn't miss anything (I doubt I did).

I'll merge this in 2.9(.9) and it'll be sort of undocumented feature, officially included in 2.10.

@cowtowncoder cowtowncoder added this to the 2.9.9 milestone May 4, 2019
pjankovsky pushed a commit to fivetran/jackson-dataformats-text that referenced this issue Jun 25, 2019
This is a proposed solution for FasterXML#124. It introduces a new Feature, `ESCAPE_CONTROL_CHARS_WITH_ESCAPE_CHAR`,
which will apply the standard ASCII escapes from JSON to all characters that the CSV generator writes.

If this solution is workable, I will add tests.
@cowtowncoder
Copy link
Member

Seems I forgot to close this. Officially in 2.10, although technically already in 2.9.9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants