Patterns may affect each other when multiple patterns are provided #2480

gofri · 2023-03-29T15:04:23Z

What version of ripgrep are you using?

13.0.0

How did you install ripgrep?

brew

What operating system are you using ripgrep on?

MacOS M1

Describe your bug.

When providing multiple patterns via either -e or -f,
specified flags affect all following patterns.

What are the steps to reproduce the behavior?

Text

What is the actual behavior?

rg respect the case-insensitive flag (?i) from the first pattern in the second pattern as well.

DEBUG|globset|crates/globset/src/lib.rs:421: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Text

Same goes for:
echo 'MyText' | rg -f <(printf '(?i)notintext\ntext')

What is the expected behavior?

Empty result (the case-insensitive flag should not affect the second pattern).

What do you think ripgrep should have done?

Treat each pattern independently.
Specifically, for these pattern, I'd expect the merged pattern to be:
'(?:(?i)notintext')|(?:'text')
rather than:
'(?i)notintext'|'text'

Side notes:

Works as expected if I manually wrap each pattern as (?:PATTERN)
Works as expected if I replace the first pattern with (?i:notintext)
Works as expected if I replace the order of the patterns (because the effect of the flag only applies "to the right")

The text was updated successfully, but these errors were encountered:

BurntSushi · 2023-03-29T15:11:19Z

Yeah so you are indeed correct that ripgrep just does a simple join with | as the delimiter:

ripgrep/crates/core/args.rs

Line 703 in 0415448

builder.build(&patterns.join("|"))

ripgrep could make the situation much better by wrapping each pattern in a (?:<pattern>). It would resolve your specific problem for example. And this would be easy to do I think. And it would work for PCRE2. So we should do that.

But it doesn't quite make things completely independent. Because you could still do things like -e foo)(?i)bar( to sneakily insert a (?i) (for example). Fixing that probably requires trying to parse each pattern individually and ensuring each pattern is itself valid. A little more work, but not a huge deal to do. However, that won't work for PCRE2 unless you go and actually try to compile every pattern individually. (Because PCRE2 doesn't expose its own parser.) That is potentially a lot more work. Another possibility is to just write a simple little mini parser that just makes sure parentheses are balanced. I think you'd just have to account for escapes and character classes, which should be... somewhat simple?

gofri · 2023-04-11T10:01:44Z

I wrote the following code that should act as the mini parser you mentioned.
If it looks legit to you (or somewhat legit) I can try to create a PR to use it there (with testing, documentation, etc.).

The guiding principles:

we only try to detect cases where invalid balance would affect later patterns, and may result in a valid joined pattern constructed from invalid patterns.
assume that any other pattern invalidity would fail the validity check of the combined pattern (i.e., including negative balance, e.g. ab)c / ab]c)
O(n); optimize for valid patterns (and simplicity); do not try to fail early for invalid patterns.

p.s. the code here is just for sanity check and we can discuss details in the PR, shall there be one.
p.s.2. The code for the kind/error is trivial so I left it out for this comment.

fn validate_subpattern_safety(pattern: &str) -> Result<(), BalanceError> {
    let (mut escape, mut round_depth, mut square_depth) = (false, 0, 0);

    for c in pattern.chars() {
        if escape {
            // escape takes precedence over all states; render the current char literal
            escape = false;
        } else if c == '\\' {
            // escape takes precedence over all states; move to escape state
            escape = true;
        } else if square_depth > 0 {
            // character classes take precedence over round brackets
            match c {
                '[' => square_depth += 1,
                ']' => square_depth -= 1,
                _ => {} // any char is literal inside character classes (balance-wise)
            }
        } else if round_depth > 0 && c == ')' {
            // only acconut for round bracket closures for balance correctness
            round_depth -= 1;
        } else {
            // the ground state: move to a higher state or ignore
            match c {
                '(' => round_depth += 1,
                '[' => square_depth += 1,
                _ => {} // ignore literal chars and bracket closers
            };
        }
    }

    let kind = if escape {
        Some(ErrUnbalanceKind::TrailingBackslash)
    } else if square_depth > 0 {
        Some(ErrUnbalanceKind::UnbalancedSquareBrackets)
    } else if round_depth > 0 {
        Some(ErrUnbalanceKind::UnbalanceRoundBrackets)
    } else {
        None
    };

    match kind {
        Some(kind) => BalanceError::new(kind, pattern.to_string()),
        None => Ok(()),
    }
}

fn join_patterns_safely(patterns: &[String]) -> Result<String, BalanceError> {
    match patterns.len() {
        0 => Ok("".to_string()),
        1 => Ok(patterns.first().unwrap().to_owned()),
        _ => patterns // note: can apply to all but the last
            .iter()
            .map(|p| {
                validate_subpattern_safety(p)?;
                Ok(format!("(?:{})", p))
            })
            .collect::<Result<Vec<String>, BalanceError>>()
            .map(|ps| ps.join("|")),
    }
}

BurntSushi · 2023-04-11T11:10:47Z

Thanks for the effort! Unfortunately, it's just not quite right. It doesn't account for the fact that [[] and [\[] are equivalent, for example. Honestly, there's probably other stuff. And especially so for PCRE2, which has a fair bit more syntax than the default engine.

After thinking on it some more, I think for now, I'd rather just add the (?:<pattern>) wrapping and call that good enough. If folks end up running into problems there, then we can re-evaluate later. But I might just end up marking it as wontfix.

Also, one small note: the perf for something like a routine for what you wrote is basically insignificant. Consider how much work a regex engine needs to do to compile a regex into a matcher. It's a lot more work than a simple parser like that. :-)

gofri · 2023-04-11T11:36:15Z

Thanks for the quick response!

It doesn't account for the fact that [[] and [[] are equivalent, for example.

I actually tested for unescaped brackets inside char-class and got a parse error:

echo '\[' | rg '[[]'
regex parse error:
    [[]
     ^^
error: unclosed character class

But I tested now with --engine pcre2 and it does work there, as you said.

I looked for other cases in PCRE2 now and didn't find a particular problem.
Specifically, since names can't include curly/round brackets, we can ignore <> because it's within the constraints (i.e. an invalid pattern can't lead to a valid combined pattern).

I feel you on going with the (?:<pattern>) wrapping, but I think that fixing the [[] issue shouldn't be too hard at this point.
So, let me know if you'd consider merging it if I fixed that part (or otherwise, if you can think of other uncovered cases that I missed, and I'll just quit the attempt anyway)

BurntSushi · 2023-04-11T11:42:27Z

Whoops, I meant that []] and [\]] are equivalent.

For now, I'd rather go with the (?:pattern) approach and just stop there. It's simple and fixes the most pressing problem.

This was originally fixed by using non-capturing groups when joining patterns in crates/core/args.rs, but before that landed, it ended up getting fixed via a refactor in the course of migrating to regex 1.9. Namely, it's now fixed by pushing pattern joining down into the regex layer, so that patterns can be joined in the most effective way possible. Still, #2488 contains a useful test, so we bring that in here. The test actually failed for `rg -e ')('`, since it expected the command to fail with a syntax error. But my refactor actually causes this command to succeed. And indeed, #2488 worked around this by special casing a single pattern. That work-around fixes it for the single pattern case, but doesn't fix it for the -w or -X or multi-pattern case. So for now, we're content to leave well enough alone. The only real way to fix this for real is to parse each regexp individual and verify that each is valid on its own. It's not clear that doing so is worth it. Fixes #2480, Closes #2488

14.0.2 (2023-11-27) =================== This is a patch release with a few small bug fixes. Bug fixes: * [BUG #2654](BurntSushi/ripgrep#2654): Fix `deb` release sha256 sum file. * [BUG #2658](BurntSushi/ripgrep#2658): Fix partial regression in the behavior of `--null-data --line-regexp`. * [BUG #2659](BurntSushi/ripgrep#2659): Fix Fish shell completions. * [BUG #2662](BurntSushi/ripgrep#2662): Fix typo in documentation for `-i/--ignore-case`. 14.0.1 (2023-11-26) =================== This a patch release meant to fix `cargo install ripgrep` on Windows. Bug fixes: * [BUG #2653](BurntSushi/ripgrep#2653): Include `pkg/windows/Manifest.xml` in crate package. 14.0.0 (2023-11-26) =================== ripgrep 14 is a new major version release of ripgrep that has some new features, performance improvements and a lot of bug fixes. The headlining feature in this release is hyperlink support. In this release, they are an opt-in feature but may change to an opt-out feature in the future. To enable them, try passing `--hyperlink-format default`. If you use [VS Code], then try passing `--hyperlink-format vscode`. Please [report your experience with hyperlinks][report-hyperlinks], positive or negative. [VS Code]: https://code.visualstudio.com/ [report-hyperlinks]: BurntSushi/ripgrep#2611 Another headlining development in this release is that it contains a rewrite of its regex engine. You generally shouldn't notice any changes, except for some searches may get faster. You can read more about the [regex engine rewrite on my blog][regex-internals]. Please [report your performance improvements or regressions that you notice][report-perf]. [report-perf]: BurntSushi/ripgrep#2652 Finally, ripgrep switched the library it uses for argument parsing. Users should not notice a difference in most cases (error messages have changed somewhat), but flag overrides should generally be more consistent. For example, things like `--no-ignore --ignore-vcs` work as one would expect (disables all filtering related to ignore rules except for rules found in version control systems such as `git`). [regex-internals]: https://blog.burntsushi.net/regex-internals/ **BREAKING CHANGES**: * `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`. Instead, they only partially override `-C`. Build process changes: * ripgrep's shell completions and man page are now created by running ripgrep with a new `--generate` flag. For example, `rg --generate man` will write a man page in `roff` format on stdout. The release archives have not changed. * The optional build dependency on `asciidoc` or `asciidoctor` has been dropped. Previously, it was used to produce ripgrep's man page. ripgrep now owns this process itself by writing `roff` directly. Performance improvements: * [PERF #1746](BurntSushi/ripgrep#1746): Make some cases with inner literals faster. * [PERF #1760](BurntSushi/ripgrep#1760): Make most searches with `\b` look-arounds (among others) much faster. * [PERF #2591](BurntSushi/ripgrep#2591): Parallel directory traversal now uses work stealing for faster searches. * [PERF #2642](BurntSushi/ripgrep#2642): Parallel directory traversal has some contention reduced. Feature enhancements: * Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo, Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V * [FEATURE #665](BurntSushi/ripgrep#665): Add a new `--hyperlink-format` flag that turns file paths into hyperlinks. * [FEATURE #1709](BurntSushi/ripgrep#1709): Improve documentation of ripgrep's behavior when stdout is a tty. * [FEATURE #1737](BurntSushi/ripgrep#1737): Provide binaries for Apple silicon. * [FEATURE #1790](BurntSushi/ripgrep#1790): Add new `--stop-on-nonmatch` flag. * [FEATURE #1814](BurntSushi/ripgrep#1814): Flags are now categorized in `-h/--help` output and ripgrep's man page. * [FEATURE #1838](BurntSushi/ripgrep#1838): An error is shown when searching for NUL bytes with binary detection enabled. * [FEATURE #2195](BurntSushi/ripgrep#2195): When `extra-verbose` mode is enabled in zsh, show extra file type info. * [FEATURE #2298](BurntSushi/ripgrep#2298): Add instructions for installing ripgrep using `cargo binstall`. * [FEATURE #2409](BurntSushi/ripgrep#2409): Added installation instructions for `winget`. * [FEATURE #2425](BurntSushi/ripgrep#2425): Shell completions (and man page) can be created via `rg --generate`. * [FEATURE #2524](BurntSushi/ripgrep#2524): The `--debug` flag now indicates whether stdin or `./` is being searched. * [FEATURE #2643](BurntSushi/ripgrep#2643): Make `-d` a short flag for `--max-depth`. * [FEATURE #2645](BurntSushi/ripgrep#2645): The `--version` output will now also contain PCRE2 availability information. Bug fixes: * [BUG #884](BurntSushi/ripgrep#884): Don't error when `-v/--invert-match` is used multiple times. * [BUG #1275](BurntSushi/ripgrep#1275): Fix bug with `\b` assertion in the regex engine. * [BUG #1376](BurntSushi/ripgrep#1376): Using `--no-ignore --ignore-vcs` now works as one would expect. * [BUG #1622](BurntSushi/ripgrep#1622): Add note about error messages to `-z/--search-zip` documentation. * [BUG #1648](BurntSushi/ripgrep#1648): Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail. * [BUG #1701](BurntSushi/ripgrep#1701): Fix bug where some flags could not be repeated. * [BUG #1757](BurntSushi/ripgrep#1757): Fix bug when searching a sub-directory didn't have ignores applied correctly. * [BUG #1891](BurntSushi/ripgrep#1891): Fix bug when using `-w` with a regex that can match the empty string. * [BUG #1911](BurntSushi/ripgrep#1911): Disable mmap searching in all non-64-bit environments. * [BUG #1966](BurntSushi/ripgrep#1966): Fix bug where ripgrep can panic when printing to stderr. * [BUG #2046](BurntSushi/ripgrep#2046): Clarify that `--pre` can accept any kind of path in the documentation. * [BUG #2108](BurntSushi/ripgrep#2108): Improve docs for `-r/--replace` syntax. * [BUG #2198](BurntSushi/ripgrep#2198): Fix bug where `--no-ignore-dot` would not ignore `.rgignore`. * [BUG #2201](BurntSushi/ripgrep#2201): Improve docs for `-r/--replace` flag. * [BUG #2288](BurntSushi/ripgrep#2288): `-A` and `-B` now only each partially override `-C`. * [BUG #2236](BurntSushi/ripgrep#2236): Fix gitignore parsing bug where a trailing `\/` resulted in an error. * [BUG #2243](BurntSushi/ripgrep#2243): Fix `--sort` flag for values other than `path`. * [BUG #2246](BurntSushi/ripgrep#2246): Add note in `--debug` logs when binary files are ignored. * [BUG #2337](BurntSushi/ripgrep#2337): Improve docs to mention that `--stats` is always implied by `--json`. * [BUG #2381](BurntSushi/ripgrep#2381): Make `-p/--pretty` override flags like `--no-line-number`. * [BUG #2392](BurntSushi/ripgrep#2392): Improve global git config parsing of the `excludesFile` field. * [BUG #2418](BurntSushi/ripgrep#2418): Clarify sorting semantics of `--sort=path`. * [BUG #2458](BurntSushi/ripgrep#2458): Make `--trim` run before `-M/--max-columns` takes effect. * [BUG #2479](BurntSushi/ripgrep#2479): Add documentation about `.ignore`/`.rgignore` files in parent directories. * [BUG #2480](BurntSushi/ripgrep#2480): Fix bug when using inline regex flags with `-e/--regexp`. * [BUG #2505](BurntSushi/ripgrep#2505): Improve docs for `--vimgrep` by mentioning footguns and some work-arounds. * [BUG #2519](BurntSushi/ripgrep#2519): Fix incorrect default value in documentation for `--field-match-separator`. * [BUG #2523](BurntSushi/ripgrep#2523): Make executable searching take `.com` into account on Windows. * [BUG #2574](BurntSushi/ripgrep#2574): Fix bug in `-w/--word-regexp` that would result in incorrect match offsets. * [BUG #2623](BurntSushi/ripgrep#2623): Fix a number of bugs with the `-w/--word-regexp` flag. * [BUG #2636](BurntSushi/ripgrep#2636): Strip release binaries for macOS.

BurntSushi added the bug A bug. label Mar 29, 2023

gofri mentioned this issue Apr 13, 2023

args: use non-capturing groups when joining multiple patterns #2488

Closed

BurntSushi closed this as completed in 36194c2 Jul 8, 2023

sankalp-khare mentioned this issue Nov 26, 2023

ripgrep 14.0.0 Homebrew/homebrew-core#155592

Merged

learnbyexample mentioned this issue Dec 6, 2023

Inline flag (?x) with comment doesn't work due to automatic wrapping of patterns inside non-capturing group #2677

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patterns may affect each other when multiple patterns are provided #2480

Patterns may affect each other when multiple patterns are provided #2480

gofri commented Mar 29, 2023

BurntSushi commented Mar 29, 2023

gofri commented Apr 11, 2023 •

edited

Loading

BurntSushi commented Apr 11, 2023

gofri commented Apr 11, 2023 •

edited

Loading

BurntSushi commented Apr 11, 2023

Patterns may affect each other when multiple patterns are provided #2480

Patterns may affect each other when multiple patterns are provided #2480

Comments

gofri commented Mar 29, 2023

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your bug.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

What do you think ripgrep should have done?

Side notes:

BurntSushi commented Mar 29, 2023

gofri commented Apr 11, 2023 • edited Loading

BurntSushi commented Apr 11, 2023

gofri commented Apr 11, 2023 • edited Loading

BurntSushi commented Apr 11, 2023

gofri commented Apr 11, 2023 •

edited

Loading

gofri commented Apr 11, 2023 •

edited

Loading