match bug #1319

BurntSushi · 2019-07-05T17:13:40Z

$ rg --version
ripgrep 11.0.1 (rev 7bf7ceb5d3)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

This matches:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | egrep 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

But this doesn't:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | rg 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'

To minimize, this doesn't match:

$ rg 'TTGAGTCCAGGAG[ATCG]{2}C' /tmp/subject

But this does:

$ rg 'TGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
1:CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

The only difference between the latter two is that the latter removes the first
T from the regex.

From inspecting the --trace output, I note that for the former regex (the one
that fails), it says this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:105: required literals found: [Complete(TTGAGTCCAGGAGAC), Complete(TTGAGTCCAGGAGCC), Complete(TTGAGTCCAGGAGGC), Complete(TTGAGTCCAGGAGTC)]
TRACE|grep_regex::matcher|grep-regex/src/matcher.rs:52: extracted fast line regex: (?-u:TTGAGTCCAGGAGAC|TTGAGTCCAGGAGCC|TTGAGTCCAGGAGGC|TTGAGTCCAGGAGTC)

But in the latter regex (the one that works), we have this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(TGAGTCCAGGAGAAC), Complete(TGAGTCCAGGAGCAC), Complete(TGAGTCCAGGAGGAC), Complete(TGAGTCCAGGAGTAC), Complete(TGAGTCCAGGAGACC), Complete(TGAGTCCAGGAGCCC), Complete(TGAGTCCAGGAGGCC), Complete(TGAGTCCAGGAGTCC), Complete(TGAGTCCAGGAGAGC), Complete(TGAGTCCAGGAGCGC), Complete(TGAGTCCAGGAGGGC), Complete(TGAGTCCAGGAGTGC), Complete(TGAGTCCAGGAGATC), Complete(TGAGTCCAGGAGCTC), Complete(TGAGTCCAGGAGGTC), Complete(TGAGTCCAGGAGTTC)], limit_size: 250, limit_class: 10 }
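
To make the difference between the two traces concrete, here is a minimal sketch using only the standard library. It checks whether any of the four "required literals" from the failing trace occurs in the subject line. The helper `any_literal_occurs` is a hypothetical name for illustration and is not how ripgrep's literal prefilter is implemented; the literals and the haystack are copied from the output above. Each reported literal carries one character from `[ATCG]` where the pattern asks for two, so none of them can occur in a line that the regex matches.

```rust
// Literals reported by --trace for the failing pattern (copied from the log).
const REQUIRED_LITERALS: [&str; 4] = [
    "TTGAGTCCAGGAGAC",
    "TTGAGTCCAGGAGCC",
    "TTGAGTCCAGGAGGC",
    "TTGAGTCCAGGAGTC",
];

// The subject line from the report.
const HAYSTACK: &str = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";

/// Returns true if at least one literal occurs somewhere in the haystack.
/// Hypothetical helper: a stand-in for "does the literal prefilter let
/// this line through", not ripgrep's actual code.
fn any_literal_occurs(haystack: &str, literals: &[&str]) -> bool {
    literals.iter().any(|lit| haystack.contains(lit))
}

fn main() {
    // None of the extracted literals is present, so a search that requires
    // one of them rejects the line before the full regex ever runs.
    println!("{}", any_literal_occurs(HAYSTACK, &REQUIRED_LITERALS));

    // The text the regex should match has two class characters ("TT").
    println!("{}", HAYSTACK.contains("TTGAGTCCAGGAGTTC"));
}
```

The first line printed is `false` and the second is `true`, which is consistent with the line being filtered out even though the regex itself matches it.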

Therefore, this is almost certainly a bug in literal extraction. Moreover,
this Rust program correctly prints true:

fn main() {
    let pattern = r"CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C";
    let haystack = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";

    let re = regex::Regex::new(pattern).unwrap();
    println!("{:?}", re.is_match(haystack));
}

Which points the finger at grep-regex's inner literal extraction. Sigh.

BurntSushi · 2019-07-06T14:58:25Z

Oh, forgot to mention that this was originally reported in this StackOverflow question: https://stackoverflow.com/questions/56906725/ripgrep-missing-character-class-repetions

jakubadamw · 2019-09-05T21:30:40Z

Replacing the line:

ripgrep/grep-regex/src/literal.rs

Line 217 in 4858267

if max.map_or(true, |max| min < max) {

with

if max.map_or(true, |max| min <= max) {

or just dropping the whole if condition (as min > max never holds) fixes the issue for me. That condition looked suspicious to me at first glance, but as I do not understand that code very well, I sadly cannot substantiate why this is the right fix, and I'm not even convinced it is.
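
As a small illustration of how the two conditions differ, the sketch below evaluates both for a few repetition bounds. It assumes `min` is a `u32` and `max` is an `Option<u32>` (the quoted line only shows that `max` is an `Option`), and the function names are hypothetical. It says nothing about what the guarded branch in literal.rs does; it only shows which inputs change the outcome.

```rust
/// The condition as quoted from literal.rs (types are an assumption).
fn original_condition(min: u32, max: Option<u32>) -> bool {
    max.map_or(true, |max| min < max)
}

/// The condition with the proposed `<=` change.
fn proposed_condition(min: u32, max: Option<u32>) -> bool {
    max.map_or(true, |max| min <= max)
}

fn main() {
    // (label, min, max) for a few counted repetitions.
    let cases = [
        ("e{2}", 2, Some(2)),   // exact count, as in `[ATCG]{2}`
        ("e{2,5}", 2, Some(5)), // bounded range
        ("e{2,}", 2, None),     // unbounded
    ];
    for (label, min, max) in cases {
        println!(
            "{}: original={} proposed={}",
            label,
            original_condition(min, max),
            proposed_condition(min, max)
        );
    }
}
```

Only the exact-count case `e{2}` differs between the two: the original condition is false there and the proposed one is true. For any valid repetition (`min <= max`), the proposed condition is always true, which is why dropping the `if` has the same effect.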

jakubadamw · 2019-09-06T20:59:51Z

I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if grep_matcher::RegexMatcher successfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.

BurntSushi · 2019-09-06T21:09:31Z

@jakubadamw That's awesome! Do you want to submit a PR? If not, I'll get to this eventually!

jakubadamw · 2019-09-06T21:46:12Z

@BurntSushi, sure! 🙂

This appears to be another transcription bug from copying this code from the prefix literal detection from inside the regex crate. Namely, when it comes to inner literals, we only want to treat counted repetition as two separate cases: the case when the minimum match is 0 and the case when the minimum match is more than 0. In the former case, we treat `e{0,n}` as `e*` and in the latter we treat `e{m,n}` where `m >= 1` as just `e`.

We could definitely do better here. For example, this means regexes like `(foo){10}` will only have `foo` extracted as a literal, where searching for the full literal would likely be faster.

The actual bug here was that we were not implementing this logic correctly. Namely, we weren't always "cutting" the literals in the second case to prevent them from being expanded.

Fixes #1319, Closes #1367
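
The role of "cutting" can be sketched with a toy model. Everything below is hypothetical and heavily simplified: the `Lit` type and the `cross`, `cut_all`, and `extract` functions are illustrative names, not the grep-regex or regex-syntax API. The model extracts literals for the small pattern `GAG[ATCG]{2}C`, treating `[ATCG]{2}` as a single `[ATCG]` as described above, with and without cutting afterwards.

```rust
/// A toy literal: its text, and whether it may still be extended.
#[derive(Clone, Debug, PartialEq)]
struct Lit {
    text: String,
    cut: bool,
}

/// Extends every uncut literal with each alternative; cut literals are kept as-is.
fn cross(lits: &[Lit], alts: &[&str]) -> Vec<Lit> {
    let mut out = Vec::new();
    for lit in lits {
        if lit.cut {
            out.push(lit.clone());
            continue;
        }
        for alt in alts {
            out.push(Lit { text: format!("{}{}", lit.text, alt), cut: false });
        }
    }
    out
}

/// Marks every literal as cut so later parts of the pattern cannot extend it.
fn cut_all(lits: &mut [Lit]) {
    for lit in lits.iter_mut() {
        lit.cut = true;
    }
}

/// Toy extraction for the pattern `GAG[ATCG]{2}C`.
fn extract(cut_after_repetition: bool) -> Vec<String> {
    let mut lits = vec![Lit { text: String::new(), cut: false }];
    lits = cross(&lits, &["GAG"]);
    // The minimum count is >= 1, so the repetition is treated as one `[ATCG]`.
    lits = cross(&lits, &["A", "T", "C", "G"]);
    if cut_after_repetition {
        cut_all(&mut lits);
    }
    // The trailing `C` only extends literals that were not cut.
    lits = cross(&lits, &["C"]);
    lits.into_iter().map(|lit| lit.text).collect()
}

fn main() {
    let subject = "GAGTTC"; // matched by `GAG[ATCG]{2}C`
    for cut in [false, true] {
        let lits = extract(cut);
        let found = lits.iter().any(|lit| subject.contains(lit.as_str()));
        println!("cut={} literals={:?} found_in_subject={}", cut, lits, found);
    }
}
```

Without the cut, the model yields `GAGAC`, `GAGTC`, `GAGCC`, and `GAGGC`, none of which occurs in `GAGTTC`; this has the same shape as the required literals in the trace from the original report. With the cut, it yields `GAGA`, `GAGT`, `GAGC`, and `GAGG`, and `GAGT` is found in the subject.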

BurntSushi added the bug label Jul 5, 2019

jakubadamw mentioned this issue Sep 6, 2019

Fix a bug in literal extraction from exact-count repetition patterns #1367

Closed

BurntSushi closed this as completed in b435eaa Feb 17, 2020
