Leading spaces in header fields #78

vmx · 2017-05-29T19:53:34Z

I came across a CSV file which had leading spaces in the header fields:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

When I deserialized it with Serde, each header field after the first had a leading space. Would it make sense to have an option to trim the header fields or to transform them in a more general way?

BurntSushi · 2017-05-29T19:59:03Z

If your struct is like this:

#[derive(Deserialize)]
struct Record {
    medallion: String,
    hack_license: String,
    ...
}

Then you could do:

#[derive(Deserialize)]
struct Record {
    // The first header has no leading space, so it needs no rename.
    medallion: String,
    #[serde(rename = " hack_license")]
    hack_license: String,
    ...
}

I'm not sure whether this is common enough to elevate this to a more convenient feature. Certainly, I've seen other CSV parsers have options like, "trim all whitespace around field values," which is something that could be feasibly added with some minor additional cost.

vmx · 2017-05-29T20:04:34Z

I'm doing a CSV to JSON conversion, hence I ended up with:

#[derive(Serialize, Deserialize)]
struct Fare {
    medallion: String,
    #[serde(rename(deserialize = " hack_license", serialize = "hack_license"))]
    hack_license: String,
    ...
}

Which isn't that nice.

BurntSushi · 2017-06-27T13:16:46Z

Here is what I'd suggest we do. In the csv crate, we should add a new Trim enum type and a new trim method on ReaderBuilder that accepts a value with type Trim. The Trim enum should be defined as follows:

pub enum Trim {
    None,
    Headers,
    /// Hints that destructuring should not be exhaustive.
    ///
    /// This enum may grow additional variants, so this makes sure clients
    /// don't count on exhaustive matching. (Otherwise, adding a new variant
    /// could break existing code.)
    #[doc(hidden)]
    __Nonexhaustive,
}

Trim::None should be the default, which matches existing behavior. The Headers option should correspond to trimming only header values. We can add other variants later that permit trimming values as well, but it seems easiest to start with headers and does solve the OP's problem.

I think the easiest place to implement this is probably the set_headers_impl method, since that is the only place where self.state.headers is modified.

I think it is OK to create a new record that corresponds to the previous record, but with its fields trimmed. This introduces an extra allocation, but since it's only for the header record, I think that's OK.
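The idea of building a trimmed copy of the header record can be sketched with the standard library alone. This is a minimal illustration, not the csv crate's code: the `Trim` enum below mirrors the proposal above without the hidden variant, and `prepare_headers` along with the `Vec<String>` representation of a header record are hypothetical stand-ins.

```rust
/// Mirrors the enum proposed above, minus the hidden variant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Trim {
    None,
    Headers,
}

/// Hypothetical helper: given the raw header fields, returns the header
/// record that would be stored. A new `Vec` is allocated either way,
/// which only happens once per reader.
fn prepare_headers(raw: &[String], trim: Trim) -> Vec<String> {
    match trim {
        // Existing behavior: keep the fields exactly as read.
        Trim::None => raw.to_vec(),
        // Build a new record whose fields are the trimmed originals.
        Trim::Headers => raw.iter().map(|h| h.trim().to_string()).collect(),
    }
}
```

With `Trim::Headers`, a raw header of `["medallion", " hack_license"]` would become `["medallion", "hack_license"]`, so struct field names match without any `rename` attributes.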

The hardest part of this will probably be trimming a ByteRecord, since it isn't guaranteed to be valid UTF-8. I think in that case, all we can do is trim ASCII space characters.
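Trimming only ASCII whitespace from bytes that may not be valid UTF-8 could look roughly like the sketch below. The function name `trim_ascii_bytes` is made up for illustration and is not part of the csv crate; it works on a single field given as a byte slice.

```rust
/// Trims leading and trailing ASCII whitespace from a byte slice.
/// No UTF-8 validation is performed, so arbitrary bytes are fine.
fn trim_ascii_bytes(field: &[u8]) -> &[u8] {
    // Index of the first non-whitespace byte, or the length if there is none.
    let start = field
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(field.len());
    // One past the last non-whitespace byte; falls back to `start`
    // when the field is entirely whitespace, yielding an empty slice.
    let end = field
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &field[start..end]
}
```

Because only bytes in the ASCII range are inspected, invalid UTF-8 sequences pass through untouched, and non-ASCII Unicode spaces are not removed.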

casey · 2017-09-08T22:16:03Z

I also encountered this, although it was with field values, not headers. A solution in my case would have been to allow a multi-character delimiter, like "; ", but support for trimming headers and fields would also work.

medwards · 2017-11-01T19:01:33Z

BTW I also would like this for rows as well as headers. In general it would be nice to have a trim trait or something (or for me preferably a dont_trim trait and the default is to trim).

I'm willing to write it but a rough roadmap of what files to look at and maybe a good write up on how to implement traits would be really helpful for me.

BurntSushi · 2017-11-01T19:07:53Z

@medwards This is where I'd start: #78 (comment)

Note that I'm not sure why you're talking about traits here. I don't think implementing this feature should require any new traits.

medwards · 2017-11-01T19:34:13Z

Sorry, I misspoke. I think I meant attributes (i.e. #[serde(default = "default_locationtype", deserialize_with = "deserialize_locationtype")])

BurntSushi · 2017-11-01T19:35:57Z

I don't think this needs attributes either. We can't add new Serde attributes anyway. I'm thinking that this is a CSV reader configuration knob that is applied to every field (or just the header, or whatever).

medwards · 2017-11-01T19:36:47Z

Ok sounds good, I'll try to put some time in this weekend.

BurntSushi · 2017-11-01T19:40:50Z

@medwards Awesome, thanks! I'm burntsushi on the various Rust IRC channels. Please ping me if you get stuck! (Or even if you aren't stuck and just want to bounce ideas.)

medwards · 2017-11-05T18:29:02Z

I took a stab at it (not fully tested), but I want to rethink things. From what I can tell, the only opportune place to create a new record for whitespace trimming is in set_headers_impl, so technically I've accomplished that much. But the same approach doesn't work for records, because the Reader is just mutating a Record it received (at least that's my understanding of read_byte_record_impl).

I can mess with read_record_dfa/nfa which I think is kind of terrifying, or further mutate the record passed to read_byte_record_impl which means some awkward Vec::remove operations that have to be perfectly synchronized with changes to the record bounds. I don't really like either of these ideas (the former especially) so I thought I'd run them by you before I go too much further.
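One way to mutate a record in place without repeated Vec::remove calls is a single compaction pass that moves each trimmed field left and rewrites its end offset as it goes. The sketch below assumes a simplified layout (all field bytes in one buffer plus a list of end offsets); `FlatRecord` and its methods are hypothetical and are not the crate's actual types.

```rust
/// Simplified stand-in for a record: all field bytes live in `data`,
/// and `ends[i]` is the offset one past the last byte of field `i`.
struct FlatRecord {
    data: Vec<u8>,
    ends: Vec<usize>,
}

impl FlatRecord {
    /// Builds a record from individual fields (helper for illustration).
    fn from_fields(fields: &[&str]) -> FlatRecord {
        let mut rec = FlatRecord { data: Vec::new(), ends: Vec::new() };
        for f in fields {
            rec.data.extend_from_slice(f.as_bytes());
            rec.ends.push(rec.data.len());
        }
        rec
    }

    /// Returns the bytes of field `i`.
    fn field(&self, i: usize) -> &[u8] {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.data[start..self.ends[i]]
    }

    /// Trims ASCII whitespace from every field in one left-to-right pass.
    fn trim_in_place(&mut self) {
        let mut read_start = 0; // start of the current field in the old layout
        let mut write = 0; // next free position in the compacted layout
        for end in self.ends.iter_mut() {
            let old_end = *end;
            let field = &self.data[read_start..old_end];
            let lead = field.iter().take_while(|b| b.is_ascii_whitespace()).count();
            let trail = field[lead..]
                .iter()
                .rev()
                .take_while(|b| b.is_ascii_whitespace())
                .count();
            let (from, to) = (read_start + lead, old_end - trail);
            // `write <= from` always holds, so this only ever moves bytes left.
            self.data.copy_within(from..to, write);
            write += to - from;
            *end = write; // bounds are updated in the same step as the bytes
            read_start = old_end;
        }
        self.data.truncate(write);
    }
}
```

Since the bytes and the bounds are updated together inside one loop, there is no separate step that has to stay synchronized, and the pass is linear in the record's size.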

This commit adds support for trimming CSV records. There are two levels of support:

1. Both `ByteRecord` and `StringRecord` have grown `trim` methods. A `ByteRecord` trims ASCII whitespace while a `StringRecord` trims Unicode whitespace.
2. The CSV reader can now be configured to automatically trim all records that it reads. This is useful when using Serde to match header names with spaces (for example) to struct member names.

Fixes #78
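The difference between ASCII and Unicode whitespace trimming mentioned in the commit message can be illustrated with plain standard-library string methods. This sketch does not call the csv crate; `trim_unicode` and `trim_ascii_only` are illustrative names chosen here.

```rust
/// Trims Unicode whitespace (the `White_Space` property), as `str::trim` does.
fn trim_unicode(field: &str) -> &str {
    field.trim()
}

/// Trims only ASCII whitespace, leaving other Unicode spaces in place.
fn trim_ascii_only(field: &str) -> &str {
    field.trim_matches(|c: char| c.is_ascii_whitespace())
}
```

For a field such as `"\u{2003}tip_amount "` (an EM SPACE in front, an ASCII space behind), the Unicode variant yields `"tip_amount"`, while the ASCII-only variant keeps the leading EM SPACE and removes just the trailing space.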

BurntSushi added enhancement question labels May 29, 2017

BurntSushi added help-wanted and removed question labels Jun 27, 2017

medwards added a commit to medwards/rust-csv that referenced this issue Nov 5, 2017

Accept whitespace trimming settings

d8f79d2

first stab at BurntSushi#78

medwards mentioned this issue Nov 6, 2017

Move to CSV crate georust/transitfeed#3

Closed

medwards added a commit to medwards/rust-csv that referenced this issue Nov 13, 2017

Accept whitespace trimming settings

5083635

Fixes BurntSushi#78

medwards added a commit to medwards/rust-csv that referenced this issue Nov 13, 2017

Accept whitespace trimming settings

98d40f6

Fixes BurntSushi#78

medwards mentioned this issue Nov 13, 2017

Accept whitespace trimming settings #97

Closed

medwards added a commit to medwards/rust-csv that referenced this issue Nov 14, 2017

Accept whitespace trimming settings

b9baeee

Fixes BurntSushi#78

medwards added a commit to medwards/rust-csv that referenced this issue Nov 29, 2017

Accept whitespace trimming settings

288acc1

Fixes BurntSushi#78

medwards added a commit to medwards/rust-csv that referenced this issue Nov 30, 2017

Accept whitespace trimming settings

38c8630

Fixes BurntSushi#78

BurntSushi mentioned this issue Jan 30, 2018

reading: provide trim functionality #106

Merged

BurntSushi closed this as completed in #106 Jan 30, 2018
