Use bytes when possible #198

Closed

Conversation

michalsrb
Contributor

The csv crate has the ability to work with records as strings or as bytes, and in many places allows mixing valid and invalid UTF-8 in one record, which is great. Unfortunately, when using the DeserializeRecordsIntoIter/DeserializeRecordsIter interface to deserialize types using serde, it always requires that all fields be valid UTF-8 strings.

This PR modifies DeserializeRecordsIntoIter/DeserializeRecordsIter to use ByteRecord internally and only convert to a string when the Deserializer asks for it. Fields that are skipped, or that are deserialized using deserialize_bytes/deserialize_byte_buf, therefore do not have to be valid UTF-8.

I did not benchmark, but this may also improve performance, since only the fields that are actually used are validated. A possible further step would be to use something like the atoi crate to parse numbers directly from byte slices (sketched below).
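For illustration, a minimal sketch of that last idea, assuming the atoi crate as a dependency (this is not part of the PR):

// Parse an integer straight from a byte slice with the atoi crate,
// with no UTF-8 validation and no intermediate &str.
fn main() {
    let field: &[u8] = b"12345";
    let n: Option<u32> = atoi::atoi(field); // None on malformed input
    assert_eq!(n, Some(12345));
}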

If we are going to deserialize the records using serde, we don't
necessarily need every field to be a valid UTF-8 string. Let's keep them
as bytes until the deserializer asks for a string.

When the deserializer skips unused fields, we don't have to convert them
to valid UTF-8 strings. This will prevent errors when unused fields are
not UTF-8 (maybe they are supposed to be bytes). Removing the check may
also improve performance.
@michalsrb
Contributor Author

PS: My use case is deserializing a CSV file in WINDOWS_1250 encoding which has a lot of fields, but I only care about a few fields that are not affected by the encoding. To make things worse, the field separator is a '\xC7' byte, so converting the whole thing to UTF-8 would create a two-byte separator, which is not supported.
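For context, here is a minimal sketch of reading such a file with the csv crate (the file name is hypothetical). ReaderBuilder takes the delimiter as a single u8, so the raw 0xC7 byte works as long as the input is never transcoded to UTF-8 first, and byte_records() leaves every field as raw bytes:

use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = ReaderBuilder::new()
        .delimiter(b'\xC7') // single-byte separator, used as-is
        .from_path("data.csv")?; // hypothetical file name
    for result in rdr.byte_records() {
        let record = result?; // fields stay &[u8]; no UTF-8 validation
        if let Some(field) = record.get(0) {
            println!("{:?}", field);
        }
    }
    Ok(())
}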

@BurntSushi
Owner

Hmm. There was a reason why I did it this way. It's because UTF-8 validation for an entire StringRecord is typically faster than doing validation for each field individually. Benchmarks seem to show this when comparing master with your branch:

$ cargo benchcmp master03.bench bytes03.bench --threshold 5
 name                              master03.bench ns/iter  bytes03.bench ns/iter  diff ns/iter  diff %  speedup
 count_game_deserialize_owned_str  21,437,916 (121 MB/s)   23,128,024 (112 MB/s)     1,690,108   7.88%   x 0.93
 count_mbta_deserialize_owned_str  3,499,860 (206 MB/s)    3,831,893 (188 MB/s)        332,033   9.49%   x 0.91
 count_pop_deserialize_owned_str   7,593,107 (125 MB/s)    8,449,337 (113 MB/s)        856,230  11.28%   x 0.90

These are exactly the benchmarks I'd expect to take a hit because they have several individual String fields.

Your use case can still be supported by not using the deserialize iterators. You just need to read ByteRecords manually and then call ByteRecord::deserialize. Given the nicheness of your use case, I'm more inclined to ask you to do this instead of taking a performance hit in a more common case.
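A minimal sketch of that workaround (the Row struct and file name are hypothetical): read ByteRecords in a loop and call ByteRecord::deserialize on each one, so only the fields serde actually touches need to be valid UTF-8:

use csv::Reader;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    id: u64,
    name: String, // only fields serde touches must be valid UTF-8
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = Reader::from_path("data.csv")?;
    let headers = rdr.byte_headers()?.clone();
    for result in rdr.byte_records() {
        let record = result?;
        // Match fields by header name; columns not named in Row are
        // skipped (per PR #203, skipped without UTF-8 validation).
        let row: Row = record.deserialize(Some(&headers))?;
        println!("{:?}", row);
    }
    Ok(())
}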

@michalsrb
Contributor Author

Alright, that alternative solution works for me.

@michalsrb michalsrb closed this May 22, 2020
BurntSushi pushed a commit that referenced this pull request Nov 27, 2020
When the deserializer skips unused fields, we don't have to convert
them to valid UTF-8 strings. This will prevent errors when unused fields
are not UTF-8. Removing the check may also improve performance.

See also: #198 

PR #203