Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for escaping the delimiter like the python counterpart (Read) #233

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

leaty
Copy link

@leaty leaty commented May 28, 2021

Fixes #232

As mentioned in the issue, this is probably sub-optimal, and it might will probably break stuff. In fact this might even be possible already, just hidden in the docs. In any case I would just like to hear some ideas.

Using these options:

let mut csv = ReaderBuilder::new()
    .delimiter(b',')
    .quote(b'\\')
    .quoting(true)
    .escape(Some(b'\\'))
    .flexible(false)

One can now read this properly:

field1,field2,field3\, with\, commas,field4

Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change accomplishes what you want. You're right that it breaks everything. :-)

I'm also not sure whether this is worthwhile to add. The data you've given is not CSV. I know this parser goes out of its way to minimally support some kinds of invalid CSV, but I'm not sure how far down that path we want to go.

(Also, in order to get something like this merged aside from the above, there needs to be tests. And ideally a doc example showing its use in the main csv crate.)

csv-core/src/reader.rs Outdated Show resolved Hide resolved
csv-core/src/reader.rs Outdated Show resolved Hide resolved
@leaty
Copy link
Author

leaty commented May 28, 2021

Thanks for the quick reply. All of what you said makes sense, apologies for the poor (and quite clearly broken code). I just needed this to work quickly and I thought I'd share my current changes with a PR to spur a conversation.

Using your notes I thought of this:

  1. Add a new function to the builder which sets e.g. escapechar: Option<u8>. This would then be a true escape character (escapes any one character immediately after).
  2. Add a new state that is set after e.g. if self.escapechar = Some(c) while currently in the InField state.
  3. Exit this new state on the next character and return back to InField.

Would that be something that could be considered for this csv library? I know it's not standard csv, but it is technically still comma-separated values albeit with an escape character.

Again, apologies for the code.

@BurntSushi
Copy link
Owner

No need to apologize. I'm happy to receive lower effort stuff first: it lets us prove out the idea, and ultimately if we don't move forward with it, then not much has been lost.

Your plan sounds okay I think, although, I wonder if it makes sense to re-use the current escape option, and then add a new (disabled by default) boolean option that permits the escape character to apply outside of quoted fields. e.g., escape_outside_quoted_fields or something like that.

@leaty
Copy link
Author

leaty commented May 28, 2021

I wonder if it makes sense to re-use the current escape option

Yes, I'm all for reusing that. I also think it's better if it's only enabled when quoting is false and the escape character is set. In most if not all cases, you either run with quotes for escaping the value or an escape character to escape the actual delimiter. There shouldn't really be a good reason to use both ways at the same time.

From python std docs for escapechar in csv:

A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. On reading, the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.

Would that be manageable? The escape option is never used if quoting is disabled correct? If I'm right, then you could re-use the escape character for this, quite freely. Assuming quoting also disables double quote, it might look like this:

let mut csv = ReaderBuilder::new()
    .quoting(false)
    .escape(Some(b'\\'))
    ...

@BurntSushi
Copy link
Owner

Oh I see. Off hand, I believe that's right, so no new config knob is needed. Nice.

@leaty
Copy link
Author

leaty commented May 28, 2021

Sounds good, I'll give it a go this weekend 👍

@leaty leaty force-pushed the feature/escape-delimiter branch from 96b2a5b to e31d44f Compare May 30, 2021 16:28
@leaty
Copy link
Author

leaty commented May 30, 2021

I must be misinterpreting something, I really thought something like this would work. But it's still interpreting the , as a delimiter after a \ and giving me UnequalLengths. Any ideas?

EDIT: Also, let me know if you find it unwanted to modify the state numbers like this.

@leaty leaty force-pushed the feature/escape-delimiter branch 2 times, most recently from c4547f4 to 91998f6 Compare May 31, 2021 10:42
@leaty
Copy link
Author

leaty commented May 31, 2021

Solved by moving escape out from under quoting:

if self.quoting {
self.dfa.classes.add(self.quote);
}
if let Some(escape) = self.escape {
self.dfa.classes.add(escape);
}

However, while this doesn't give an error and partly works, it's removing the escaped , E.g.

field1,field2,field3\, with\, commas,field4

Becomes:

StringRecord(["field1", "field2", "field3 with commas", "field4"])

What am I missing?

@BurntSushi
Copy link
Owner

Not sure. I went and looked for the obvious thing (not using NfaInputAction::CopyToOutput), but it looks like you're using that. I'd have to debug it, and I doubt I'll have time to do that any time soon unfortunately.

I would suggest debugging it at the csv-core level. There are lots of tests in that crate. Add one there (you'll have to do that anyway), and see if you can track it down.

@leaty leaty force-pushed the feature/escape-delimiter branch 2 times, most recently from 7102087 to cb2bc96 Compare May 31, 2021 15:26
@leaty leaty force-pushed the feature/escape-delimiter branch from cb2bc96 to 938a372 Compare May 31, 2021 15:27
@leaty
Copy link
Author

leaty commented May 31, 2021

Some gdb later and I felt silly, I forgot to add it back to const NFA_STATES. After that the , remained as it should. I also had to add it to StartField in transition_nfa, otherwise it wouldn't play nice with fields that begin with an escape sequence e.g. field1,\,field2,field3.

One thing I'm unsure about is the Dfa struct, I put in_escape_sequence there because it felt right, but honestly I'm not entirely sure this is necessary in this case. This is the only reference I could find for the others, and it works without adding it:

if state == self.dfa.in_field || state == self.dfa.in_quoted {
self.dfa
.classes
.scan_and_copy(input, &mut nin, output, &mut nout);
}

On another note, if this is accepted as a valuable feature, could adding it to the writer be considered as well?

@leaty leaty requested a review from BurntSushi May 31, 2021 15:45
@leaty leaty changed the title Allow for escaping the delimiter like the python counterpart Allow for escaping the delimiter like the python counterpart (Read) Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow escaping the delimiter
2 participants