Use bytes when possible #198

Closed

Conversation

michalsrb
Contributor

The csv crate has the ability to work with records as strings or as bytes, and in many places allows mixing valid and invalid UTF-8 in one record, which is great. Unfortunately, when using the DeserializeRecordsIntoIter/DeserializeRecordsIter interface to deserialize types using serde, it always requires that all fields be valid UTF-8 strings.

This PR modifies DeserializeRecordsIntoIter/DeserializeRecordsIter to use ByteRecord internally and only convert to a string when the Deserializer asks for it. Fields that are skipped, or that are deserialized using deserialize_bytes/deserialize_byte_buf, therefore do not have to be valid UTF-8.

I did not benchmark, but this may also improve performance, since only the fields that are actually used are validated. A possible further step would be to use something like the atoi crate to parse numbers directly from byte slices (sketched below).
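For illustration, a minimal sketch of that last idea, assuming the atoi crate as a dependency (this is not part of the PR):

// Parse an integer straight from a byte slice with the atoi crate,
// with no UTF-8 validation and no intermediate &str.
fn main() {
    let field: &[u8] = b"12345";
    let n: Option<u32> = atoi::atoi(field); // None on malformed input
    assert_eq!(n, Some(12345));
}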

If we are going to deserialize the records using serde, we don't
necessarily need every field to be a valid UTF-8 string. Let's keep them
as bytes until the deserializer asks for a string.

When the deserializer skips unused fields, we don't have to convert them
to valid UTF-8 strings. This will prevent errors when unused fields are
not UTF-8 (maybe they are supposed to be bytes). Removing the check may
also improve performance.
@michalsrb
Contributor Author

PS: My use case is deserializing a CSV file in WINDOWS_1250 encoding which has a lot of fields, but I only care about a few fields that are not affected by the encoding. To make things worse, the field separator is a '\xC7' byte, so converting the whole thing to UTF-8 would create a two-byte separator, which is not supported.
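For context, here is a minimal sketch of reading such a file with the csv crate (the file name is hypothetical). ReaderBuilder takes the delimiter as a single u8, so the raw 0xC7 byte works as long as the input is never transcoded to UTF-8 first, and byte_records() leaves every field as raw bytes:

use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = ReaderBuilder::new()
        .delimiter(b'\xC7') // single-byte separator, used as-is
        .from_path("data.csv")?; // hypothetical file name
    for result in rdr.byte_records() {
        let record = result?; // fields stay &[u8]; no UTF-8 validation
        if let Some(field) = record.get(0) {
            println!("{:?}", field);
        }
    }
    Ok(())
}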

@BurntSushi
Owner

Hmm. There was a reason why I did it this way. It's because UTF-8 validation for an entire StringRecord is typically faster than doing validation for each field individually. Benchmarks seem to show this when comparing master with your branch:

$ cargo benchcmp master03.bench bytes03.bench --threshold 5
 name                              master03.bench ns/iter  bytes03.bench ns/iter  diff ns/iter  diff %  speedup
 count_game_deserialize_owned_str  21,437,916 (121 MB/s)   23,128,024 (112 MB/s)     1,690,108   7.88%   x 0.93
 count_mbta_deserialize_owned_str  3,499,860 (206 MB/s)    3,831,893 (188 MB/s)        332,033   9.49%   x 0.91
 count_pop_deserialize_owned_str   7,593,107 (125 MB/s)    8,449,337 (113 MB/s)        856,230  11.28%   x 0.90

These are exactly the benchmarks I'd expect to take a hit because they have several individual String fields.

Your use case can still be supported by not using the deserialize iterators. You just need to read ByteRecords manually and then call ByteRecord::deserialize. Given the nicheness of your use case, I'm more inclined to ask you to do this instead of taking a performance hit in a more common case.
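A minimal sketch of that workaround (the Row struct and file name are hypothetical): read ByteRecords in a loop and call ByteRecord::deserialize on each one, so only the fields serde actually touches need to be valid UTF-8:

use csv::Reader;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    id: u64,
    name: String, // only fields serde touches must be valid UTF-8
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = Reader::from_path("data.csv")?;
    let headers = rdr.byte_headers()?.clone();
    for result in rdr.byte_records() {
        let record = result?;
        // Match fields by header name; columns not named in Row are
        // skipped (per PR #203, skipped without UTF-8 validation).
        let row: Row = record.deserialize(Some(&headers))?;
        println!("{:?}", row);
    }
    Ok(())
}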

@michalsrb
Contributor Author

Alright, that alternative solution works for me.

@michalsrb michalsrb closed this May 22, 2020
BurntSushi pushed a commit that referenced this pull request Nov 27, 2020
When the deserializer skips unused fields, we don't have to convert
them to valid UTF-8 strings. This will prevent errors when unused fields
are not UTF-8. Removing the check may also improve performance.

See also: #198 

PR #203