Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--smart-case detection of upper-case characters is unintuitive #717

Closed
okdana opened this issue Dec 17, 2017 · 4 comments
Closed

--smart-case detection of upper-case characters is unintuitive #717

okdana opened this issue Dec 17, 2017 · 4 comments
Labels
bug A bug.

Comments

@okdana
Copy link
Contributor

okdana commented Dec 17, 2017

The --smart-case option doesn't behave the way i expect in terms of determining whether the pattern contains upper-case characters. Specifically, a search like the following will fail:

rg -S 'foo\w' <<< FOOBAR

This is because it expands the class \w and discovers that it contains the range of upper-case characters A-Z.

ag uses a different metric: It simply checks the input pattern byte by byte to see if there's an upper-case character in the ASCII range. This works out in the foo\w case, but it will fail in the similarly obvious case of foo\S (which rg happens to get right) — not to mention multi-byte stuff like É and ß.

There are also more complicated cases where they both fail in different ways, like this:

rg -S '\p{Ll}' <<< FOOBAR

rg returns a match here, because \p{Ll} expands to only lower-case characters. But i think this should be treated as an 'explicit' range, like [A-Z] would be.

At the very least, though, i feel like something as basic as foo\w should probably work as expected with smart case.

Is there any way to determine through the AST the actual input that produced a class? Or would that be easy to add? I'm not sure i'm at the level yet where i should be messing with the regex crate, but that seems like it'd be a convenient way to sort it.

The easiest 'good enough' alternative that comes to mind would be to do a semi-naïve scan of the pattern that's just smart enough to ignore stuff like short-hand classes, but that feels kind of dumb (especially since i get the impression that you don't really care for this feature in the first place...).

@BurntSushi
Copy link
Owner

(especially since i get the impression that you don't really care for this feature in the first place...)

Haha yeah I am not a fan of it, but recognize that others like it. It should definitely work intuitively, and I basically agree with your analysis here.

I think you are on the right track with respect to fixing this. The problem is that the regex "AST" isn't a true faithful AST. The parser itself does various transformations that cannot be reversed, so there is no way to "see" that a \w was actually written as a \w instead of a [...].

I have been working on a rewrite of the parser for quite some time now, and with it, a proper AST. Once that is active (I'm not sure when that will be), then we can scan the AST and apply this heuristic with as much precision as we care to.

@BurntSushi BurntSushi added the bug A bug. label Dec 18, 2017
@BurntSushi
Copy link
Owner

BurntSushi commented Dec 18, 2017

The easiest 'good enough' alternative that comes to mind would be to do a semi-naïve scan of the pattern that's just smart enough to ignore stuff like short-hand classes

@okdana If you wanted to do this in the shorter term to get at least some simple cases right, then I'd be in favor of this. I believe the check is done in the grep crate in this repo.

okdana added a commit to okdana/ripgrep that referenced this issue Dec 18, 2017
Fixes BurntSushi#717 (partially)

The previous implementation of the smart-case feature was actually *too* smart,
in that it inspected the final character ranges in the AST to determine if the
pattern contained upper-case characters. This meant that patterns like `foo\w`
would not be handled case-insensitively, since `\w` includes the range of
upper-case characters A–Z.

As a medium-term solution to this problem, we now inspect the input pattern
itself for upper-case characters, ignoring any that immediately follow a `\`.
This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though
it still has problems with more complex features like `\p{Ll}`. Handling those
correctly will require improvements to the AST.
BurntSushi pushed a commit that referenced this issue Dec 18, 2017
Fixes #717 (partially)

The previous implementation of the smart-case feature was actually *too* smart,
in that it inspected the final character ranges in the AST to determine if the
pattern contained upper-case characters. This meant that patterns like `foo\w`
would not be handled case-insensitively, since `\w` includes the range of
upper-case characters A–Z.

As a medium-term solution to this problem, we now inspect the input pattern
itself for upper-case characters, ignoring any that immediately follow a `\`.
This neatly handles all of the most basic cases like `\w`, `\S`, and `É`, though
it still has problems with more complex features like `\p{Ll}`. Handling those
correctly will require improvements to the AST.
@BurntSushi
Copy link
Owner

BurntSushi commented Mar 10, 2018

@okdana With the new AST coming in soon, what do you think about this set of rules for smart case?

When the --smart-case flag is given, ripgrep chooses to match case insensitively if and only if there is at least one literal character in the pattern and that all such literals are not considered as uppercase. If the pattern contains no literals or at least one uppercase literal character, then normal case sensitive search is used.

Examples: abc, [a-z], abc\w and abc\p{Ll} are all searched case insensitively while aBc, [A-Z], aBc\w and \p{Ll} are searched case sensitively.

@okdana
Copy link
Contributor Author

okdana commented Mar 10, 2018

Hmm. I can't think of any case where that wouldn't make sense. Yeah, sounds cool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

No branches or pull requests

2 participants