Recovering from CSV read errors leads to infinite loops when caused by IO errors #207

Closed
Ten0 opened this issue Aug 12, 2020 · 4 comments · Fixed by #213

Comments

@Ten0 (Contributor) commented Aug 12, 2020

What version of the csv crate are you using?

1.1.3

Briefly describe the question, bug or feature request.

Recovering from CSV errors leads to infinite loops when caused by IO errors

Include a complete program demonstrating a problem.

use {
    csv::Reader,
    std::io::{self, Read},
};

#[derive(Debug, serde_derive::Deserialize)]
struct Deserialized {
    something: String,
}

fn main() {
    for record_result in Reader::from_reader(FailingRead).deserialize::<Deserialized>() {
        match record_result {
            Ok(deserialized) => println!("Got something: {}", deserialized.something),
            Err(_failed_line) => eprintln!("This record is broken :("),
        }
    }
}

struct FailingRead;
impl Read for FailingRead {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "reader just broke!"))
    }
}

What is the observed behavior of the code above?

Infinite loop of "This record is broken :(".

What is the expected or desired behavior of the code above?

This record is broken :(

program exits

@Ten0 (Contributor, Author) commented Nov 2, 2020

Hi again.
@BurntSushi This has just cost us thousands of euros: after an invalid file was transferred, the always-erroring reader generated tons of useless logs. Would you be open to a PR fixing this?

@BurntSushi (Owner) commented
I don't really understand what needs to be fixed in the csv crate here. You're getting an error and then explicitly continuing. If you don't want to continue, then stop the loop when an error occurs.

@Ten0 (Contributor, Author) commented Nov 2, 2020

We want to keep parsing the remaining lines when a record in the actual CSV is broken, as it's very common for the CSVs we handle to be broken in some way (e.g. a single line with not all fields set, broken escapes leading to too many fields, content that fails to deserialize...). So we need to iterate over all the records, handling the lines that are not broken and logging the errors: this is what the example code above is meant to do (though it is a much reduced example: in practice it's hundreds of lines of code generic over iterators of results, with multiple conversion steps, csv not being the only library we parse from).

The issue is that when the underlying reader breaks, csv::DeserializeRecordsIntoIter becomes an infinite iterator of I/O errors, and I believe it should not. This creates issues when handling possibly partly-broken CSVs, and I can't think of a scenario where an infinite iterator of I/O errors is what somebody who still attempts to read the next record after an error on one record would want.

I'm thinking about using some kind of wrapper around the reader, like the following, to avoid this issue:

/// Wrapper around a reader that simulates an end-of-file after it hits an error
///
/// This is so that consumers of the reader that recover from some errors (e.g. the csv reader) don't loop forever.
/// End-of-file in Rust is represented by `read` returning `Ok(0)`.
pub struct PretendEofAfterError<R: Read> {
	reader: R,
	has_errored: bool,
}

impl<R: Read> PretendEofAfterError<R> {
	pub fn new(reader: R) -> Self {
		Self {
			reader,
			has_errored: false,
		}
	}
}

impl<R: Read> Read for PretendEofAfterError<R> {
	fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
		if self.has_errored {
			Ok(0)
		} else {
			let res = self.reader.read(buf);
			if res.is_err() {
				self.has_errored = true;
			}
			res
		}
	}
}

(This is the workaround I implemented after opening this issue, but forgetting to use it in one place is what eventually cost us thousands of euros.)
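A quick usage sketch of that wrapper (the `FailingRead` reader is the one from the issue description, and the wrapper is restated compactly here so the example is self-contained): the first read surfaces the error, and every subsequent read looks like end-of-file, so a consumer that retries cannot loop forever.

```rust
use std::io::{self, Read};

/// The always-failing reader from the issue description.
struct FailingRead;
impl Read for FailingRead {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "reader just broke!"))
    }
}

/// The wrapper from the comment above, restated compactly.
struct PretendEofAfterError<R: Read> {
    reader: R,
    has_errored: bool,
}

impl<R: Read> Read for PretendEofAfterError<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.has_errored {
            Ok(0) // simulate end-of-file forever after the first error
        } else {
            let res = self.reader.read(buf);
            if res.is_err() {
                self.has_errored = true;
            }
            res
        }
    }
}

fn main() {
    let mut reader = PretendEofAfterError { reader: FailingRead, has_errored: false };
    let mut buf = [0u8; 16];
    // The first read surfaces the underlying error once.
    assert!(reader.read(&mut buf).is_err());
    // Every read after that reports EOF (Ok(0)) instead of erroring again.
    assert_eq!(reader.read(&mut buf).unwrap(), 0);
    println!("second read reported EOF");
}
```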

@BurntSushi (Owner) commented Nov 2, 2020

Thanks for elaborating. That's a much better bug report. There is some subtlety to this though. It sounds like the iterator's behavior needs to depend on the kind of error that occurred. If an I/O error occurs, then yeah, probably the iterator should not yield anything else. But if a deserialization error occurs, then the CSV reader should advance to the next record. The latter behavior exists today:

#[derive(Debug, serde::Deserialize)]
struct Record {
    field1: String,
    field2: i32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_path("weird.csv")?;
    for result in rdr.deserialize::<Record>() {
        match result {
            Ok(record) => println!("OK: {:?}", record),
            Err(err) => println!("ERROR: {:?}", err),
        }
    }
    Ok(())
}
Cargo.toml:

[package]
name = "csv-207"
version = "0.1.0"
authors = ["Andrew Gallant <[email protected]>"]
edition = "2018"

[dependencies]
csv = "1"
serde = { version = "1", features = ["derive"] }

weird.csv:

field1,field2
foo,2
bar,2.5
baz,3
$ cargo run
   Compiling csv-207 v0.1.0 (/tmp/csv-207)
    Finished dev [unoptimized + debuginfo] target(s) in 0.48s
     Running `target/debug/csv-207`
OK: Record { field1: "foo", field2: 2 }
ERROR: Error(Deserialize { pos: Some(Position { byte: 20, line: 3, record: 2 }), err: DeserializeError { field: Some(1), kind: ParseInt(ParseIntError { kind: InvalidDigit }) } })
OK: Record { field1: "baz", field2: 3 }

I guess this is what you're saying: the "continuing" behavior of errors is correct for some kinds of errors but not for other kinds. That distinction wasn't obvious to me.

But yeah, I agree with you, if an iterator hits an I/O error, then the iterator should be marked as exhausted. And this should be clearly documented. It shouldn't be too hard to implement, but would be tedious probably. I don't know when I'll get around to it, so if you wanted to submit a PR, that would be great.

Ten0 added a commit to Ten0/rust-csv that referenced this issue Nov 2, 2020
Resolves BurntSushi#207

When reading and parsing a record fails, it's correct behaviour
in most cases to keep on trying to read and parse the next record.

However when the Error is caused by failing to read from the underlying
`impl Read`, the next read would almost always fail with the exact
same error.
In that scenario, reading from the underlying `impl Read` should not be
attempted when trying to deserialize the next record, as this may lead
to infinite loops.
Instead, the `Reader` will behave in the same way as if an end-of-file
had been reached.
Ten0 added a commit to Ten0/rust-csv that referenced this issue Nov 2, 2020
BurntSushi pushed a commit that referenced this issue Nov 3, 2020
This commit changes the behavior of the CSV reader (and its iterators)
to stop when an I/O error from the underlying reader is encountered.

When reading and parsing a record fails, correct behavior in most cases
to keep on trying to read and parse the next record. That behavior
remains the same after this commit.

However when the Error is caused by failing to read from the underlying
reader, the next read would almost always fail with the exact same
error. In that scenario, reading from the underlying reader should not
be attempted when trying to extract the next record, as this may lead
to infinite loops. Instead, the reader will behave in the same way as
if an end-of-file had been reached.

Fixes #207