Refactoring and performance improvements.
The major change here is that the zero-allocation reader has doubled
its performance. This also gives the record/decoder iterators a perf
boost, though a less dramatic one.

In doing this, I've refactored pieces of the code, which includes some
public-facing changes.

1. `ByteString` is no longer a newtype, since the wrapper no longer
   provided any benefit over `Vec<u8>`. It is now a type alias. Because
   `ByteString` deref'd to `Vec<u8>`, your code may need no changes at
   all. If you used `ByteString`-specific items (like its constructor),
   you'll need to replace them with the standard `Vec` equivalents.
   (See the first sketch after this list.)
2. Parse errors have been tweaked. Notably, line/column numbers are no
   longer recorded. Instead, record/field numbers are saved. (This was
   done for performance reasons.) See the documentation for the error's
   new structure.
3. The `index` sub-module has received some documentation love and some
   small naming tweaks. Notably, the `csv` method was removed in favor
   of `Deref`/`DerefMut` impls on `Indexed`. No changes to the format
   were made. (A stand-in sketch of the `Deref` pattern follows this
   list.)
4. The `quote` and `escape` methods have had their argument types
   tweaked. For the time being, it is no longer possible to specify
   "no quoting" to the parser. (Also covered in the first sketch below.)
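
A minimal migration sketch for points 1 and 4. Only the `ByteString` alias
is stated above; the old constructor named in the comments and the
builder-style `quote` call are assumptions about the API, not confirmed by
this commit:

```rust
extern crate csv;

fn main() {
    // Point 1: ByteString is now a plain alias for Vec<u8>, so ordinary
    // Vec/slice methods replace any ByteString-specific constructor.
    let field: csv::ByteString = b"north".to_vec();
    assert_eq!(&field[..], &b"north"[..]);

    // Point 4 (assumed signature): `quote` now takes a bare `u8` rather
    // than something like Option<u8>, which is why "no quoting" can no
    // longer be expressed.
    let _rdr = csv::Reader::from_string("a,b,c").quote(b'"');
}
```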
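
For point 3, a self-contained stand-in (every type below is local and
hypothetical, not the crate's API) showing why `Deref`/`DerefMut` impls can
replace an explicit `csv` accessor method:

```rust
use std::ops::{Deref, DerefMut};

// Local stand-in: Indexed derefs to the underlying reader, so the
// reader's methods become available directly on Indexed.
struct Reader { pos: u64 }
impl Reader {
    fn headers(&self) -> Vec<String> { vec![format!("pos{}", self.pos)] }
}

struct Indexed { rdr: Reader }
impl Deref for Indexed {
    type Target = Reader;
    fn deref(&self) -> &Reader { &self.rdr }
}
impl DerefMut for Indexed {
    fn deref_mut(&mut self) -> &mut Reader { &mut self.rdr }
}

fn main() {
    let idx = Indexed { rdr: Reader { pos: 0 } };
    // Before this commit: idx.csv().headers(); now Deref forwards it:
    println!("{:?}", idx.headers());
}
```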

[breaking-change]
BurntSushi committed Apr 5, 2015
1 parent 1c37d57 commit c05997d
Showing 16 changed files with 728 additions and 829 deletions.
.gitignore: 2 additions & 1 deletion
@@ -1,8 +1,9 @@
 .*.swp
 doc
 tags
-examples/data/ss10pusa.csv
+examples/ss10pusa.csv
 build
 target
 Cargo.lock
 scratch*
+bench_large/huge
Cargo.toml: 12 additions & 0 deletions
@@ -12,6 +12,14 @@ license = "Unlicense"
 
 [lib]
 name = "csv"
+bench = false
+
+[[bin]]
+name = "bench-large"
+path = "bench_large/huge.rs"
+test = false
+bench = false
+doc = false
 
 [dependencies]
 byteorder = "*"
@@ -23,3 +31,7 @@ regex = "*"
 [profile.bench]
 opt-level = 3
 lto = true # this doesn't seem to work... why?
+
+[profile.release]
+opt-level = 3
+lto = true
bench_large/README.md: 17 additions & 17 deletions
@@ -14,7 +14,7 @@ Then compile and run:
     go build -o huge-go
     time ./huge-go
 
-To run the huge benchmark for Rust, make sure `ss10pusa.csv` is in the same 
+To run the huge benchmark for Rust, make sure `ss10pusa.csv` is in the same
 location as above and run:
 
     rustc --opt-level=3 -Z lto -L ../target/release/ huge.rs -o huge-rust
@@ -23,43 +23,43 @@ location as above and run:
 To get libraries in `../target/release/`, run `cargo build --release` in the
 project root directory.
 
-(Please make sure that one CPU is pegged when running this benchmark. If it 
+(Please make sure that one CPU is pegged when running this benchmark. If it
 isn't, you're probably just testing the speed of your disk.)
 
 
 ### Results
 
-Benchmarks were run on an Intel i3930K. Note that the 
-'ns/iter' value is computed by each language's microbenchmark facilities. I 
+Benchmarks were run on an Intel i3930K. Note that the
+'ns/iter' value is computed by each language's microbenchmark facilities. I
 suspect the granularity is big enough that the values are comparable.
 
 For rust, --opt-level=3 was used.
 
 ```
-Go                    41033948 ns/iter
-Rust (decode)         24016498
-Rust (string)         17052713
-Rust (byte string)    14876428
-Rust (byte slice)     11932269
+Go                    41146322 ns/iter
+Rust (decode)         16341720
+Rust (string)         10959665
+Rust (byte string)     9228027
+Rust (byte slice)      5589359
 ```
 
-You'll note that none of the above benchmarks use a particularly large CSV 
-file. So I've also run a pretty rough benchmark on a huge CSV file (3.6GB). A 
-single large benchmark isn't exactly definitive, but I think we can use it as a 
+You'll note that none of the above benchmarks use a particularly large CSV
+file. So I've also run a pretty rough benchmark on a huge CSV file (3.6GB). A
+single large benchmark isn't exactly definitive, but I think we can use it as a
 ballpark estimate.
 
-The huge benchmark for both Rust and Go use buffering. The times are wall 
+The huge benchmark for both Rust and Go use buffering. The times are wall
 clock times. The file system cache was warm and no disk access occurred during
 the benchmark. Both use a negligible and constant amount of memory (~1KB).
 
 ```
-Go                   146 seconds
-Rust (byte slice)     32 seconds
+Go                   190 seconds
+Rust (byte slice)     19 seconds
 ```
 
-TODO: Fill in the other Rust access patterns for the huge benchmark. (The "byte 
+TODO: Fill in the other Rust access patterns for the huge benchmark. (The "byte
 slice" access pattern is the fastest.)
 
-TODO: Benchmark with Python. (Estimate: "byte slice" is faster by around 2x, 
+TODO: Benchmark with Python. (Estimate: "byte slice" is faster by around 2x,
 but the other access patterns are probably comparable.)

bench_large/huge.go: 7 additions & 3 deletions
@@ -2,22 +2,26 @@ package main
 
 import (
     "encoding/csv"
+    "fmt"
     "io"
     "log"
     "os"
 )
 
-func readAll(r io.Reader) {
+func readAll(r io.Reader) int {
+    fields := 0
     csvr := csv.NewReader(r)
     for {
-        _, err := csvr.Read()
+        row, err := csvr.Read()
         if err != nil {
            if err == io.EOF {
                break
            }
            log.Fatal(err)
        }
+        fields += len(row)
    }
+    return fields
}
 
 func main() {
@@ -28,5 +32,5 @@ func main() {
     if err != nil {
         log.Fatal(err)
     }
-    readAll(f)
+    fmt.Println(readAll(f))
 }
bench_large/huge.rs: 10 additions & 10 deletions
@@ -1,16 +1,16 @@
 extern crate csv;
 
-use std::path::Path;
-
 fn main() {
-    let huge = "../examples/data/ss10pusa.csv";
-    let mut rdr = csv::Reader::from_file(&Path::new(huge));
-    while !rdr.done() {
-        loop {
-            match rdr.next_field() {
-                None => break,
-                Some(f) => { f.unwrap(); }
-            }
+    let huge = ::std::env::args().nth(1).unwrap();
+    let mut rdr = csv::Reader::from_file(huge).unwrap();
+    let mut count = 0;
+    loop {
+        match rdr.next_bytes() {
+            csv::NextField::Error(err) => panic!("{:?}", err),
+            csv::NextField::EndOfCsv => break,
+            csv::NextField::EndOfRecord => {}
+            csv::NextField::Data(_) => { count += 1; }
         }
     }
+    println!("{}", count);
 }
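
As a usage note on the rewritten benchmark above: the same
`next_bytes`/`NextField` loop extends naturally to counting records as well
as fields. This sketch assumes only the API visible in the diff
(`Reader::from_file`, `next_bytes`, and the four `NextField` variants):

```rust
extern crate csv;

fn main() {
    let path = ::std::env::args().nth(1).expect("usage: count <file.csv>");
    let mut rdr = csv::Reader::from_file(path).unwrap();
    let (mut fields, mut records) = (0u64, 0u64);
    loop {
        match rdr.next_bytes() {
            // A parse error aborts the count.
            csv::NextField::Error(err) => panic!("{:?}", err),
            // All input has been consumed.
            csv::NextField::EndOfCsv => break,
            // A record boundary was crossed.
            csv::NextField::EndOfRecord => records += 1,
            // One field's raw bytes, borrowed from the reader's buffer.
            csv::NextField::Data(_) => fields += 1,
        }
    }
    println!("{} fields in {} records", fields, records);
}
```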
benches/bench.rs: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ fn raw_records(b: &mut Bencher) {
     b.iter(|| {
         let mut dec = reader(&mut data);
         while !dec.done() {
-            while let Some(r) = dec.next_field().into_iter_result() {
+            while let Some(r) = dec.next_bytes().into_iter_result() {
                 r.unwrap();
             }
         }
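
The truncated snippet doesn't show what `into_iter_result` does, so the
following is a guess at its shape, written against a local stand-in enum
(every name below is hypothetical): it presumably converts a `NextField`
into an `Option<Result<..>>` so the inner `while let` drains fields until a
record or end-of-CSV boundary.

```rust
// Local stand-in for csv::NextField, purely illustrative.
enum NextField<T> {
    Data(T),
    Error(String),
    EndOfRecord,
    EndOfCsv,
}

impl<T> NextField<T> {
    // Presumed behavior: fields and errors keep the inner loop going;
    // record/CSV boundaries end it by yielding None.
    fn into_iter_result(self) -> Option<Result<T, String>> {
        match self {
            NextField::Data(t) => Some(Ok(t)),
            NextField::Error(e) => Some(Err(e)),
            NextField::EndOfRecord | NextField::EndOfCsv => None,
        }
    }
}

fn main() {
    let nf: NextField<&[u8]> = NextField::Data(b"field");
    assert!(nf.into_iter_result().unwrap().is_ok());
}
```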
src/borrow_bytes.rs: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+use std::borrow::{Cow, ToOwned};
+use ByteString;
+
+/// A trait that permits borrowing byte vectors.
+///
+/// This is useful for providing an API that can abstract over Unicode
+/// strings and byte strings.
+pub trait BorrowBytes {
+    /// Borrow a byte vector.
+    fn borrow_bytes<'a>(&'a self) -> &'a [u8];
+}
+
+impl BorrowBytes for String {
+    fn borrow_bytes(&self) -> &[u8] { self.as_bytes() }
+}
+
+impl BorrowBytes for str {
+    fn borrow_bytes(&self) -> &[u8] { self.as_bytes() }
+}
+
+impl BorrowBytes for ByteString {
+    fn borrow_bytes(&self) -> &[u8] { &**self }
+}
+
+impl BorrowBytes for [u8] {
+    fn borrow_bytes(&self) -> &[u8] { self }
+}
+
+impl<'a, B: ?Sized> BorrowBytes for Cow<'a, B>
+        where B: BorrowBytes + ToOwned, <B as ToOwned>::Owned: BorrowBytes {
+    fn borrow_bytes(&self) -> &[u8] {
+        match *self {
+            Cow::Borrowed(v) => v.borrow_bytes(),
+            Cow::Owned(ref v) => v.borrow_bytes(),
+        }
+    }
+}
+
+impl<'a, T: ?Sized + BorrowBytes> BorrowBytes for &'a T {
+    fn borrow_bytes(&self) -> &[u8] { (*self).borrow_bytes() }
+}
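
Since this trait is the entire new module, a short usage sketch: a function
generic over `BorrowBytes` accepts owned strings, string slices, byte
vectors (`ByteString`), and byte slices alike. The `use csv::BorrowBytes;`
import assumes the trait is re-exported at the crate root:

```rust
extern crate csv;

use csv::BorrowBytes;

// Works for any type that can expose its raw bytes.
fn byte_len<B: BorrowBytes + ?Sized>(b: &B) -> usize {
    b.borrow_bytes().len()
}

fn main() {
    let unicode = String::from("naïve");
    let raw: Vec<u8> = b"na\xefve".to_vec(); // a csv::ByteString, i.e. Vec<u8>
    assert_eq!(byte_len(&unicode), 6); // "ï" is two bytes in UTF-8
    assert_eq!(byte_len(&raw[..]), 5); // one byte in Latin-1
    println!("ok");
}
```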
src/buffered.rs: 0 additions & 125 deletions

This file was deleted.
