Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match bug #1319

Closed
BurntSushi opened this issue Jul 5, 2019 · 5 comments
Closed

match bug #1319

BurntSushi opened this issue Jul 5, 2019 · 5 comments
Labels
bug A bug.

Comments

@BurntSushi
Copy link
Owner

BurntSushi commented Jul 5, 2019

$ rg --version
ripgrep 11.0.1 (rev 7bf7ceb5d3)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

This matches:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | egrep 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

But this doesn't:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | rg 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'

To minimize, this doesn't match:

$ rg 'TTGAGTCCAGGAG[ATCG]{2}C' /tmp/subject

But this does:

$ rg 'TGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
1:CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

The only difference between the latter two is that the latter removes the first
T from the regex.

From inspecting the --trace output, I note that from the former regex, it
says this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:105: required literals found: [Complete(TTGAGTCCAGGAGAC), Complete(TTGAGTCCAGGAGCC), Complete(TTGAGTCCAGGAGGC), Complete(TTGAGTCCAGGAGTC)]
TRACE|grep_regex::matcher|grep-regex/src/matcher.rs:52: extracted fast line regex: (?-u:TTGAGTCCAGGAGAC|TTGAGTCCAGGAGCC|TTGAGTCCAGGAGGC|TTGAGTCCAGGAGTC)

But in the latter regex (the one that works), we have this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(TGAGTCCAGGAGAAC), Complete(TGAGTCCAGGAGCAC), Complete(TGAGTCCAGGAGGAC), Complete(TGAGTCC
AGGAGTAC), Complete(TGAGTCCAGGAGACC), Complete(TGAGTCCAGGAGCCC), Complete(TGAGTCCAGGAGGCC), Complete(TGAGTCCAGGAGTCC), Complete(TGAGTCCAGGAGAGC), Complete(TGAGTCCAGGAGCGC), Complete(TGAGTCCAGGAGGGC)
, Complete(TGAGTCCAGGAGTGC), Complete(TGAGTCCAGGAGATC), Complete(TGAGTCCAGGAGCTC), Complete(TGAGTCCAGGAGGTC), Complete(TGAGTCCAGGAGTTC)], limit_size: 250, limit_class: 10 }

Therefore, this is almost certainly a bug in literal extraction. Moreover,
this Rust program correctly prints true:

fn main() {
    let pattern = r"CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C";
    let haystack = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";

    let re = regex::Regex::new(pattern).unwrap();
    println!("{:?}", re.is_match(haystack));
}

Which points the finger at grep-regex's inner literal extraction. Sigh.

@BurntSushi BurntSushi added the bug A bug. label Jul 5, 2019
@BurntSushi
Copy link
Owner Author

Oh, forgot to mention that this was originally reported in this StackOverflow question: https://stackoverflow.com/questions/56906725/ripgrep-missing-character-class-repetions

@jakubadamw
Copy link
Contributor

jakubadamw commented Sep 5, 2019

Replacing the line:

if max.map_or(true, |max| min < max) {

with

if max.map_or(true, |max| min <= max) {

or just dropping the whole if (as min > max never holds) condition fixes the issue for me. That condition looked suspicious to me at a first glance but as I do not understand that code very well, I sadly cannot substantiate why this is the right fix and I'm not even convinced it is.

@jakubadamw
Copy link
Contributor

jakubadamw commented Sep 6, 2019

I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if grep_matcher::RegexMatcher successfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.

@BurntSushi
Copy link
Owner Author

@jakubadamw That's awesome! Do you want to submit a PR? If not, I'll get to this eventually!

@jakubadamw
Copy link
Contributor

@BurntSushi, sure! 🙂

BurntSushi pushed a commit that referenced this issue Feb 17, 2020
This appears to be another transcription bug from copying this code from
the prefix literal detection from inside the regex crate. Namely, when
it comes to inner literals, we only want to treat counted repetition as
two separate cases: the case when the minimum match is 0 and the case
when the minimum match is more than 0. In the former case, we treat
`e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as
just `e`.

We could definitely do better here. e.g., This means regexes like
`(foo){10}` will only have `foo` extracted as a literal, where searching
for the full literal would likely be faster.

The actual bug here was that we were not implementing this logic
correctly. Namely, we weren't always "cutting" the literals in the
second case to prevent them from being expanded.

Fixes #1319, Closes #1367
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants