Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leading spaces in header fields #78

Closed
vmx opened this issue May 29, 2017 · 11 comments
Closed

Leading spaces in header fields #78

vmx opened this issue May 29, 2017 · 11 comments

Comments

@vmx
Copy link

vmx commented May 29, 2017

I came across a CSV file which had leading spaces in the header fields:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

When I deserialized it with Serde, each field has a space in front. Would it make sense to have an option to trim the header fields or to transform them in a more general way?

@BurntSushi
Copy link
Owner

BurntSushi commented May 29, 2017

If your struct is like this:

#[derive(Deserialize)]
struct Record {
  medallion: String,
  hack_license: String,
  ...
}

Then you could do:

#[derive(Deserialize)]
struct Record {
  #[serde(rename = " medallion")]
  medallion: String,
  #[serde(rename = " hack_license")]
  hack_license: String,
  ...
}

I'm not sure whether this is common enough to elevate this to a more convenient feature. Certainly, I've seen other CSV parsers have options like, "trim all whitespace around field values," which is something that could be feasibly added with some minor additional cost.

@vmx
Copy link
Author

vmx commented May 29, 2017

I'm doing a CSV to JSON conversion, hence I ended up with:

struct Fare {
    medallion: String,
    #[serde(rename(deserialize = " hack_license", serialize = "hack_license"))]
    hack_license: String,
    ...
}

Which isn't that nice.

@BurntSushi
Copy link
Owner

Here is what I'd suggest we do. In the csv crate, we should add a new Trim enum type and a new trim method on ReaderBuilder that accepts a value with type Trim. The Trim enum should be defined as follows:

pub enum Trim {
    None,
    Headers,
    /// Hints that destructuring should not be exhaustive.
    ///
    /// This enum may grow additional variants, so this makes sure clients
    /// don't count on exhaustive matching. (Otherwise, adding a new variant
    /// could break existing code.)
    #[doc(hidden)]
    __Nonexhaustive,
}

Trim::None should be the default, which matches existing behavior. The Headers option should correspond to trimming only header values. We can add other variants later that permit trimming values as well, but it seems easiest to start with headers and does solve the OP's problem.

I think probably the easiest place to implement this is in the set_headers_impl method, since that is only place that self.state.headers is modified.

I think it is OK to create a new record that corresponds to the previous record, but with its fields trimmed. This introduces an extra allocation, but since it's only for the header record, I think that's OK.

The hardest part of this will probably be trimming a ByteRecord, since it isn't guaranteed to be valid UTF-8. I think in that case, all we can do is trim ASCII space characters.

@casey
Copy link

casey commented Sep 8, 2017

I also encountered this, although it was with field values, not headers. A solution in my case would have been to allow a multi-character delimiter, like "; ", but support for trimming headers and fields would also work.

@medwards
Copy link
Contributor

medwards commented Nov 1, 2017

BTW I also would like this for rows as well as headers. In general it would be nice to have a trim trait or something (or for me preferably a dont_trim trait and the default is to trim).

I'm willing to write it but a rough roadmap of what files to look at and maybe a good write up on how to implement traits would be really helpful for me.

@BurntSushi
Copy link
Owner

@medwards This is where I'd start: #78 (comment) --- Note that I'm not sure why you're talking about traits here. I don't think implementing this feature should require any new traits.

@medwards
Copy link
Contributor

medwards commented Nov 1, 2017

Sorry, I misspoke. I think I meant attributes (ie #[serde(default = "default_locationtype", deserialize_with = "deserialize_locationtype")])

@BurntSushi
Copy link
Owner

I don't think this needs attributes either. We can't add new Serde attributes anyway. I'm thinking that this is a CSV reader configuration knob that is applied to every field (or just the header, or whatever).

@medwards
Copy link
Contributor

medwards commented Nov 1, 2017

Ok sounds good, I'll try to put some time in this weekend.

@BurntSushi
Copy link
Owner

@medwards Awesome, thanks! I'm burntsushi on the various Rust IRC channels. Please ping me if you get stuck! (Or even if you aren't stuck and just want to bounce ideas.)

medwards added a commit to medwards/rust-csv that referenced this issue Nov 5, 2017
@medwards
Copy link
Contributor

medwards commented Nov 5, 2017

I took a stab at it (not fully tested) but I want to rethink things. From what I can tell the only opportune place to create a new record for whitespace trimming is in set_headers_impl so technically I've accomplished that much but the same approach doesn't work for records because the Reader is just mutating a Record it received (at least thats my understanding of read_byte_record_impl).

I can mess with read_record_dfa/nfa which I think is kind of terrifying, or further mutate the record passed to read_byte_record_impl which means some awkward Vec::remove operations that have to be perfectly synchronized with changes to the record bounds. I don't really like either of these ideas (the former especially) so I thought I'd run them by you before I go too much further.

medwards added a commit to medwards/rust-csv that referenced this issue Nov 13, 2017
medwards added a commit to medwards/rust-csv that referenced this issue Nov 13, 2017
medwards added a commit to medwards/rust-csv that referenced this issue Nov 14, 2017
medwards added a commit to medwards/rust-csv that referenced this issue Nov 29, 2017
medwards added a commit to medwards/rust-csv that referenced this issue Nov 30, 2017
BurntSushi pushed a commit that referenced this issue Jan 30, 2018
This commit adds support for trimming CSV records. There are two levels
of support:

  1. Both `ByteRecord` and `StringRecord` have grown `trim` methods. A
     `ByteRecord` trims ASCII whitespace while a `StringRecord` trims
     Unicode whitespace.
  2. The CSV reader can now be configured to automatically trim all
     records that it reads. This is useful when using Serde to match
     header names with spaces (for example) to struct member names.

Fixes #78
BurntSushi pushed a commit that referenced this issue Jan 30, 2018
This commit adds support for trimming CSV records. There are two levels
of support:

  1. Both `ByteRecord` and `StringRecord` have grown `trim` methods. A
     `ByteRecord` trims ASCII whitespace while a `StringRecord` trims
     Unicode whitespace.
  2. The CSV reader can now be configured to automatically trim all
     records that it reads. This is useful when using Serde to match
     header names with spaces (for example) to struct member names.

Fixes #78
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants