Refactoring and performance improvements.
The major change here is that the zero-allocation reader has doubled
its performance. This also gives the record/decoder iterators a perf
boost, though a less dramatic one.

In doing this, I've refactored pieces of the code, which includes some
public-facing changes.

1. `ByteString` is no longer a newtype, since the wrapper no longer
   provided any benefit over `Vec<u8>`. It is now a type alias. Because
   `ByteString` deref'd to `Vec<u8>`, your code may need no changes at
   all. If you used `ByteString`-specific items (like its constructor),
   you'll need to replace them with the standard `Vec` equivalents.
   (See the first sketch after this list.)
2. Parse errors have been tweaked. Notably, line/column numbers are no
   longer recorded. Instead, record/field numbers are saved. (This was
   done for performance reasons.) See the documentation for the error's
   new structure.
3. The `index` sub-module has received some documentation love and some
   small naming tweaks. Notably, the `csv` method was removed in favor
   of `Deref`/`DerefMut` impls on `Indexed`. No changes to the format
   were made. (A stand-in sketch of the `Deref` pattern follows this
   list.)
4. The `quote` and `escape` methods have had their argument types
   tweaked. For the time being, it is no longer possible to specify
   "no quoting" to the parser. (Also covered in the first sketch below.)
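
A minimal migration sketch for points 1 and 4. Only the `ByteString` alias
is stated above; the old constructor named in the comments and the
builder-style `quote` call are assumptions about the API, not confirmed by
this commit:

```rust
extern crate csv;

fn main() {
    // Point 1: ByteString is now a plain alias for Vec<u8>, so ordinary
    // Vec/slice methods replace any ByteString-specific constructor.
    let field: csv::ByteString = b"north".to_vec();
    assert_eq!(&field[..], &b"north"[..]);

    // Point 4 (assumed signature): `quote` now takes a bare `u8` rather
    // than something like Option<u8>, which is why "no quoting" can no
    // longer be expressed.
    let _rdr = csv::Reader::from_string("a,b,c").quote(b'"');
}
```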
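
For point 3, a self-contained stand-in (every type below is local and
hypothetical, not the crate's API) showing why `Deref`/`DerefMut` impls can
replace an explicit `csv` accessor method:

```rust
use std::ops::{Deref, DerefMut};

// Local stand-in: Indexed derefs to the underlying reader, so the
// reader's methods become available directly on Indexed.
struct Reader { pos: u64 }
impl Reader {
    fn headers(&self) -> Vec<String> { vec![format!("pos{}", self.pos)] }
}

struct Indexed { rdr: Reader }
impl Deref for Indexed {
    type Target = Reader;
    fn deref(&self) -> &Reader { &self.rdr }
}
impl DerefMut for Indexed {
    fn deref_mut(&mut self) -> &mut Reader { &mut self.rdr }
}

fn main() {
    let idx = Indexed { rdr: Reader { pos: 0 } };
    // Before this commit: idx.csv().headers(); now Deref forwards it:
    println!("{:?}", idx.headers());
}
```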

[breaking-change]
BurntSushi committed Apr 5, 2015
1 parent 1c37d57 commit c05997d
Showing 16 changed files with 728 additions and 829 deletions.
.gitignore: 2 additions & 1 deletion
@@ -1,8 +1,9 @@
 .*.swp
 doc
 tags
-examples/data/ss10pusa.csv
+examples/ss10pusa.csv
 build
 target
 Cargo.lock
 scratch*
+bench_large/huge
Cargo.toml: 12 additions & 0 deletions
@@ -12,6 +12,14 @@ license = "Unlicense"
 
 [lib]
 name = "csv"
+bench = false
+
+[[bin]]
+name = "bench-large"
+path = "bench_large/huge.rs"
+test = false
+bench = false
+doc = false
 
 [dependencies]
 byteorder = "*"
@@ -23,3 +31,7 @@ regex = "*"
 [profile.bench]
 opt-level = 3
 lto = true # this doesn't seem to work... why?
+
+[profile.release]
+opt-level = 3
+lto = true
bench_large/README.md: 17 additions & 17 deletions
@@ -14,7 +14,7 @@ Then compile and run:
     go build -o huge-go
     time ./huge-go
 
-To run the huge benchmark for Rust, make sure `ss10pusa.csv` is in the same 
+To run the huge benchmark for Rust, make sure `ss10pusa.csv` is in the same
 location as above and run:
 
     rustc --opt-level=3 -Z lto -L ../target/release/ huge.rs -o huge-rust
@@ -23,43 +23,43 @@ location as above and run:
 To get libraries in `../target/release/`, run `cargo build --release` in the
 project root directory.
 
-(Please make sure that one CPU is pegged when running this benchmark. If it 
+(Please make sure that one CPU is pegged when running this benchmark. If it
 isn't, you're probably just testing the speed of your disk.)
 
 
 ### Results
 
-Benchmarks were run on an Intel i3930K. Note that the 
-'ns/iter' value is computed by each language's microbenchmark facilities. I 
+Benchmarks were run on an Intel i3930K. Note that the
+'ns/iter' value is computed by each language's microbenchmark facilities. I
 suspect the granularity is big enough that the values are comparable.
 
 For rust, --opt-level=3 was used.
 
 ```
-Go                    41033948 ns/iter
-Rust (decode)         24016498
-Rust (string)         17052713
-Rust (byte string)    14876428
-Rust (byte slice)     11932269
+Go                    41146322 ns/iter
+Rust (decode)         16341720
+Rust (string)         10959665
+Rust (byte string)     9228027
+Rust (byte slice)      5589359
 ```
 
-You'll note that none of the above benchmarks use a particularly large CSV 
-file. So I've also run a pretty rough benchmark on a huge CSV file (3.6GB). A 
-single large benchmark isn't exactly definitive, but I think we can use it as a 
+You'll note that none of the above benchmarks use a particularly large CSV
+file. So I've also run a pretty rough benchmark on a huge CSV file (3.6GB). A
+single large benchmark isn't exactly definitive, but I think we can use it as a
 ballpark estimate.
 
-The huge benchmark for both Rust and Go use buffering. The times are wall 
+The huge benchmark for both Rust and Go use buffering. The times are wall
 clock times. The file system cache was warm and no disk access occurred during
 the benchmark. Both use a negligible and constant amount of memory (~1KB).
 
 ```
-Go                   146 seconds
-Rust (byte slice)     32 seconds
+Go                   190 seconds
+Rust (byte slice)     19 seconds
 ```
 
-TODO: Fill in the other Rust access patterns for the huge benchmark. (The "byte 
+TODO: Fill in the other Rust access patterns for the huge benchmark. (The "byte
 slice" access pattern is the fastest.)
 
-TODO: Benchmark with Python. (Estimate: "byte slice" is faster by around 2x, 
+TODO: Benchmark with Python. (Estimate: "byte slice" is faster by around 2x,
 but the other access patterns are probably comparable.)

bench_large/huge.go: 7 additions & 3 deletions
@@ -2,22 +2,26 @@ package main
 
 import (
     "encoding/csv"
+    "fmt"
     "io"
     "log"
     "os"
 )
 
-func readAll(r io.Reader) {
+func readAll(r io.Reader) int {
+    fields := 0
     csvr := csv.NewReader(r)
     for {
-        _, err := csvr.Read()
+        row, err := csvr.Read()
         if err != nil {
            if err == io.EOF {
                break
            }
            log.Fatal(err)
        }
+        fields += len(row)
    }
+    return fields
}
 
 func main() {
@@ -28,5 +32,5 @@ func main() {
     if err != nil {
         log.Fatal(err)
     }
-    readAll(f)
+    fmt.Println(readAll(f))
 }
bench_large/huge.rs: 10 additions & 10 deletions
@@ -1,16 +1,16 @@
 extern crate csv;
 
-use std::path::Path;
-
 fn main() {
-    let huge = "../examples/data/ss10pusa.csv";
-    let mut rdr = csv::Reader::from_file(&Path::new(huge));
-    while !rdr.done() {
-        loop {
-            match rdr.next_field() {
-                None => break,
-                Some(f) => { f.unwrap(); }
-            }
+    let huge = ::std::env::args().nth(1).unwrap();
+    let mut rdr = csv::Reader::from_file(huge).unwrap();
+    let mut count = 0;
+    loop {
+        match rdr.next_bytes() {
+            csv::NextField::Error(err) => panic!("{:?}", err),
+            csv::NextField::EndOfCsv => break,
+            csv::NextField::EndOfRecord => {}
+            csv::NextField::Data(_) => { count += 1; }
         }
     }
+    println!("{}", count);
 }
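
As a usage note on the rewritten benchmark above: the same
`next_bytes`/`NextField` loop extends naturally to counting records as well
as fields. This sketch assumes only the API visible in the diff
(`Reader::from_file`, `next_bytes`, and the four `NextField` variants):

```rust
extern crate csv;

fn main() {
    let path = ::std::env::args().nth(1).expect("usage: count <file.csv>");
    let mut rdr = csv::Reader::from_file(path).unwrap();
    let (mut fields, mut records) = (0u64, 0u64);
    loop {
        match rdr.next_bytes() {
            // A parse error aborts the count.
            csv::NextField::Error(err) => panic!("{:?}", err),
            // All input has been consumed.
            csv::NextField::EndOfCsv => break,
            // A record boundary was crossed.
            csv::NextField::EndOfRecord => records += 1,
            // One field's raw bytes, borrowed from the reader's buffer.
            csv::NextField::Data(_) => fields += 1,
        }
    }
    println!("{} fields in {} records", fields, records);
}
```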
benches/bench.rs: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ fn raw_records(b: &mut Bencher) {
     b.iter(|| {
         let mut dec = reader(&mut data);
         while !dec.done() {
-            while let Some(r) = dec.next_field().into_iter_result() {
+            while let Some(r) = dec.next_bytes().into_iter_result() {
                 r.unwrap();
             }
         }
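
The truncated snippet doesn't show what `into_iter_result` does, so the
following is a guess at its shape, written against a local stand-in enum
(every name below is hypothetical): it presumably converts a `NextField`
into an `Option<Result<..>>` so the inner `while let` drains fields until a
record or end-of-CSV boundary.

```rust
// Local stand-in for csv::NextField, purely illustrative.
enum NextField<T> {
    Data(T),
    Error(String),
    EndOfRecord,
    EndOfCsv,
}

impl<T> NextField<T> {
    // Presumed behavior: fields and errors keep the inner loop going;
    // record/CSV boundaries end it by yielding None.
    fn into_iter_result(self) -> Option<Result<T, String>> {
        match self {
            NextField::Data(t) => Some(Ok(t)),
            NextField::Error(e) => Some(Err(e)),
            NextField::EndOfRecord | NextField::EndOfCsv => None,
        }
    }
}

fn main() {
    let nf: NextField<&[u8]> = NextField::Data(b"field");
    assert!(nf.into_iter_result().unwrap().is_ok());
}
```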
src/borrow_bytes.rs: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+use std::borrow::{Cow, ToOwned};
+use ByteString;
+
+/// A trait that permits borrowing byte vectors.
+///
+/// This is useful for providing an API that can abstract over Unicode
+/// strings and byte strings.
+pub trait BorrowBytes {
+    /// Borrow a byte vector.
+    fn borrow_bytes<'a>(&'a self) -> &'a [u8];
+}
+
+impl BorrowBytes for String {
+    fn borrow_bytes(&self) -> &[u8] { self.as_bytes() }
+}
+
+impl BorrowBytes for str {
+    fn borrow_bytes(&self) -> &[u8] { self.as_bytes() }
+}
+
+impl BorrowBytes for ByteString {
+    fn borrow_bytes(&self) -> &[u8] { &**self }
+}
+
+impl BorrowBytes for [u8] {
+    fn borrow_bytes(&self) -> &[u8] { self }
+}
+
+impl<'a, B: ?Sized> BorrowBytes for Cow<'a, B>
+        where B: BorrowBytes + ToOwned, <B as ToOwned>::Owned: BorrowBytes {
+    fn borrow_bytes(&self) -> &[u8] {
+        match *self {
+            Cow::Borrowed(v) => v.borrow_bytes(),
+            Cow::Owned(ref v) => v.borrow_bytes(),
+        }
+    }
+}
+
+impl<'a, T: ?Sized + BorrowBytes> BorrowBytes for &'a T {
+    fn borrow_bytes(&self) -> &[u8] { (*self).borrow_bytes() }
+}
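
Since this trait is the entire new module, a short usage sketch: a function
generic over `BorrowBytes` accepts owned strings, string slices, byte
vectors (`ByteString`), and byte slices alike. The `use csv::BorrowBytes;`
import assumes the trait is re-exported at the crate root:

```rust
extern crate csv;

use csv::BorrowBytes;

// Works for any type that can expose its raw bytes.
fn byte_len<B: BorrowBytes + ?Sized>(b: &B) -> usize {
    b.borrow_bytes().len()
}

fn main() {
    let unicode = String::from("naïve");
    let raw: Vec<u8> = b"na\xefve".to_vec(); // a csv::ByteString, i.e. Vec<u8>
    assert_eq!(byte_len(&unicode), 6); // "ï" is two bytes in UTF-8
    assert_eq!(byte_len(&raw[..]), 5); // one byte in Latin-1
    println!("ok");
}
```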
src/buffered.rs: 0 additions & 125 deletions

This file was deleted.
