Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ripgrep is slow when detecting whitespace literals, e.g., '[ \t]$' #1087

Closed
shlomif opened this issue Oct 19, 2018 · 3 comments
Closed

ripgrep is slow when detecting whitespace literals, e.g., '[ \t]$' #1087

shlomif opened this issue Oct 19, 2018 · 3 comments
Labels
enhancement An enhancement to the functionality of the software. rollup A PR that has been merged with many others in a rollup.

Comments

@shlomif
Copy link

shlomif commented Oct 19, 2018

What version of ripgrep are you using?

ripgrep 0.10.0
-SIMD -AVX (compiled)
+SIMD -AVX (runtime)

How did you install ripgrep?

cargo install ripgrep

What operating system are you using ripgrep on?

My system is Mageia Linux v7 x86-64 on a Core i3 Sandy Bridge machine.

rg is much slower than ag when searching for the '[ \t]$' regex on a
sample repo. With this benchmark case:

#! /bin/bash
#
# ag-rg-bench.bash
# Copyright (C) 2018 Shlomi Fish <[email protected]>
#
# Distributed under terms of the MIT license.
#

set -e -x
d=freecell-pro-0fc-deals

if ! test -d "$d"
then
    git clone https://github.com/shlomif/freecell-pro-0fc-deals
fi

rg --version
ag --version
locale
(
    cd "$d"
    set +e
    time ag '[ \t]$'
    time rg '[ \t]$'
    time ag '[ \t]$'
    time rg '[ \t]$'
    time rg '[ \t]$'
)

I am getting the following:

+ d=freecell-pro-0fc-deals
+ test -d freecell-pro-0fc-deals
+ rg --version
ripgrep 0.10.0
-SIMD -AVX (compiled)
+SIMD -AVX (runtime)
+ ag --version
ag version 2.1.0

Features:
  +jit +lzma +zlib
+ locale
LANG=en_GB.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_GB.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_GB.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_GB.UTF-8
LC_IDENTIFICATION=en_GB.UTF-8
LC_ALL=
+ cd freecell-pro-0fc-deals
+ set +e
+ ag '[ \t]$'

real	0m0.101s
user	0m0.303s
sys	0m0.025s
+ rg '[ \t]$'

real	0m1.249s
user	0m4.799s
sys	0m0.086s
+ ag '[ \t]$'

real	0m0.094s
user	0m0.305s
sys	0m0.016s
+ rg '[ \t]$'

real	0m1.272s
user	0m4.910s
sys	0m0.078s
+ rg '[ \t]$'

real	0m1.246s
user	0m4.828s
sys	0m0.074s

My system is Mageia Linux v7 x86-64 on a Core i3 Sandy Bridge machine.

@BurntSushi
Copy link
Owner

BurntSushi commented Oct 19, 2018

Thanks for the easy reproduction! Much appreciated.

This is unfortunately a case where ripgrep's literal optimizations end up making it slower rather than faster. The presence of literal optimizations virtually guarantees that cases like this exist, although we might consider some common sense heuristics to improve it. e.g., If all detected literals are just whitespace, then it's probably better not to use them. That would be a fairly easy change in grep-regex/src/literal.rs.

You can test this by trying ripgrep with PCRE2. For example, rg '[ \t]$' -P -U --no-pcre2-unicode --mmap runs faster on my machine than ag '[ \t]$'. In particular, the flags given to ripgrep here enable the same performance profile as ag, albeit with PCRE2 instead of PCRE1. Similarly, rg '\s$' runs quite a bit faster than ag '\s$' (although, rg -U '\s$' is the more accurate comparison, but still runs quite a bit faster) precisely because \s prevents literal detection from happening since there are too many of them.

@BurntSushi BurntSushi changed the title rg is much slower than ag when searching for the '[ \t]$' regex. ripgrep is slow when detecting whitespace literals, e.g., '[ \t]$' Oct 19, 2018
@BurntSushi BurntSushi added the enhancement An enhancement to the functionality of the software. label Oct 19, 2018
@shlomif
Copy link
Author

shlomif commented Oct 19, 2018 via email

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Mar 15, 2020
BurntSushi added a commit that referenced this issue Mar 15, 2020
If a literal is entirely whitespace, then it's quite likely that it is
very common. So when that case occurs, just don't do (inner) literal
optimizations at all.

The regex engine may still make sub-optimal decisions here, but that's a
problem for another day.

Fixes #1087
@shlomif
Copy link
Author

shlomif commented Mar 15, 2020

Thanks for fixing it, @BurntSushi ! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement to the functionality of the software. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

2 participants