Skip to content

Commit

Permalink
perf: fix record copying performance bug
Browse files Browse the repository at this point in the history
If there happens to be an abnormally long record in a CSV file---where
the rest are short---this abnormally long record ends up causing a
performance loss while parsing subsequent records. Such a thing is
usually caused by a buffer being expanded, and then that expanded buffer
leading to extra cost that shouldn't be paid when parsing smaller
records. Indeed, this case is no exception.

In this case, the standard record iterators use an internal record for
copying CSV data into, and then clone this record as appropriate it the
iterator's `next` method. In this way, that record's memory can be
reused. This is a bit better than just allocating a fresh buffer every
time, since generally speaking, the length of each CSV row is usually
pretty similar to the length of prior rows.

However, in this case, when we come across an exceptionally long record,
the internal record is expanded to handle that record. When that
internal record is clone to give back to the caller, the record *and*
its excess capacity is also cloned. In the case of an abnormally long
record, this ends up copying that extra excess capacity for all
subsequent rows. This easily explains the performance bug.

So to fix it, we introduce a new private method that lets us copy a
record *without* excess capacity. (We could implement `Clone` more
intelligently, but I'm not sure whether it's appropriate to drop excess
capacity in a `Clone` impl. That might be unexpected.) We then use this
new method in the iterators instead of standard `clone`.

In the case where there is no abnormally long records, this shouldn't
have any impact.

Fixes #227
  • Loading branch information
BurntSushi committed Mar 7, 2021
1 parent 6623d87 commit 73cf38b
Show file tree
Hide file tree
Showing 3 changed files with 24 additions and 4 deletions.
12 changes: 12 additions & 0 deletions src/byte_record.rs
Original file line number Diff line number Diff line change
Expand Up @@ -497,6 +497,18 @@ impl ByteRecord {
&self.0.fields[..self.0.bounds.end()]
}

/// Clone this record, but only copy `fields` up to the end of bounds. This
/// is useful when one wants to copy a record, but not necessarily any
/// excess capacity in that record.
#[inline]
pub(crate) fn clone_truncated(&self) -> ByteRecord {
let mut br = ByteRecord::new();
br.0.pos = self.0.pos.clone();
br.0.bounds = self.0.bounds.clone();
br.0.fields = self.0.fields[..self.0.bounds.end()].to_vec();
br
}

/// Retrieve the underlying parts of a byte record.
#[inline]
pub(crate) fn as_parts(&mut self) -> (&mut Vec<u8>, &mut Vec<usize>) {
Expand Down
8 changes: 4 additions & 4 deletions src/reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2050,7 +2050,7 @@ impl<R: io::Read> Iterator for StringRecordsIntoIter<R> {
fn next(&mut self) -> Option<Result<StringRecord>> {
match self.rdr.read_record(&mut self.rec) {
Err(err) => Some(Err(err)),
Ok(true) => Some(Ok(self.rec.clone())),
Ok(true) => Some(Ok(self.rec.clone_truncated())),
Ok(false) => None,
}
}
Expand Down Expand Up @@ -2087,7 +2087,7 @@ impl<'r, R: io::Read> Iterator for StringRecordsIter<'r, R> {
fn next(&mut self) -> Option<Result<StringRecord>> {
match self.rdr.read_record(&mut self.rec) {
Err(err) => Some(Err(err)),
Ok(true) => Some(Ok(self.rec.clone())),
Ok(true) => Some(Ok(self.rec.clone_truncated())),
Ok(false) => None,
}
}
Expand Down Expand Up @@ -2126,7 +2126,7 @@ impl<R: io::Read> Iterator for ByteRecordsIntoIter<R> {
fn next(&mut self) -> Option<Result<ByteRecord>> {
match self.rdr.read_byte_record(&mut self.rec) {
Err(err) => Some(Err(err)),
Ok(true) => Some(Ok(self.rec.clone())),
Ok(true) => Some(Ok(self.rec.clone_truncated())),
Ok(false) => None,
}
}
Expand Down Expand Up @@ -2163,7 +2163,7 @@ impl<'r, R: io::Read> Iterator for ByteRecordsIter<'r, R> {
fn next(&mut self) -> Option<Result<ByteRecord>> {
match self.rdr.read_byte_record(&mut self.rec) {
Err(err) => Some(Err(err)),
Ok(true) => Some(Ok(self.rec.clone())),
Ok(true) => Some(Ok(self.rec.clone_truncated())),
Ok(false) => None,
}
}
Expand Down
8 changes: 8 additions & 0 deletions src/string_record.rs
Original file line number Diff line number Diff line change
Expand Up @@ -610,6 +610,14 @@ impl StringRecord {
self.0
}

/// Clone this record, but only copy `fields` up to the end of bounds. This
/// is useful when one wants to copy a record, but not necessarily any
/// excess capacity in that record.
#[inline]
pub(crate) fn clone_truncated(&self) -> StringRecord {
StringRecord(self.0.clone_truncated())
}

/// A safe function for reading CSV data into a `StringRecord`.
///
/// This relies on the internal representation of `StringRecord`.
Expand Down

0 comments on commit 73cf38b

Please sign in to comment.