Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should binary file detection be suppressed in aggregate views? #855

Closed
minusf opened this issue Mar 12, 2018 · 9 comments
Closed

should binary file detection be suppressed in aggregate views? #855

minusf opened this issue Mar 12, 2018 · 9 comments
Labels
bug A bug. question An issue that is lacking clarity on one or more points.

Comments

@minusf
Copy link

minusf commented Mar 12, 2018

What version of ripgrep are you using?

ripgrep 0.8.1
-SIMD -AVX

What operating system are you using ripgrep on?

macOS 13.13.3

Describe your question, feature request, or bug.

I have 4MB big whatsapp chat history, mostly text but lots of emojis.
Doing some manual stats, i wanted to count number of chats, etc.
The data spans 3 years, 2016-2018. Lines look like this:

23/09/2016, 21:29 - Sender: text

Nothing special, really. Trying to count all chats:

$ rg -c " 201., " chat.txt
427
$ grep -c "201., " chat.txt
58948

Searching for any other pattern stops at the same line in the file
(checked with actual output instead of -c):

$ rg -c " - Sender1: " chat.txt
244
$ grep -c " - Sender1: " chat.txt
30222

For the personal nature of the corpus i am not able to include it...
But none of the surrounding lines where rg stops even have emojis...

$ rg --debug -c "201., " chat.txt
DEBUG/grep::search/grep/src/search.rs:195: regex ast:
Concat(
    [
        Literal {
            chars: [
                '2',
                '0',
                '1'
            ],
            casei: false
        },
        AnyCharNoNL,
        Literal {
            chars: [
                ',',
                ' '
            ],
            casei: false
        }
    ]
)
DEBUG/grep::literals/grep/src/literals.rs:88: required literal found: "201"
427

If this is a bug, what are the steps to reproduce the behavior?

see above

If this is a bug, what is the actual behavior?

see above

If this is a bug, what is the expected behavior?

see above the grep counts

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 12, 2018

Could you please try to reproduce this bug on data that you can share? Otherwise, there's nothing I can do.

@BurntSushi BurntSushi added the question An issue that is lacking clarity on one or more points. label Mar 12, 2018
@minusf
Copy link
Author

minusf commented Mar 12, 2018

xx.gz

replaced all the text with 'x'. you can still search for dates.

$ rg -c '/201., ' xx
388
$ grep -c '/201., ' xx
58947

hope this helps

@BurntSushi
Copy link
Owner

This is a good one! It turns out your file contains NUL bytes, which trips ripgrep's binary file detection. That, in turn, causes ripgrep to stop searching the file.

The interesting bit here is that grep also does binary file detection (in basically the same way as ripgrep), so why the difference? Well, it looks like grep will disable its binary file skipping if it's showing counts instead of printing lines, where as ripgrep always stops searching if it finds a binary file.

This becomes evident if you actually try to look at the results:

$ grep '/201., ' xx
Binary file xx matches

You can make this work by passing the -a/--text flag to ripgrep, which will cause it to suppress its binary file detection (which is the same flag that grep supports).

I suspect that ripgrep should probably behave the same as grep in cases like this, however making the search inconsistent depending on whether --count is passed or not doesn't strike me as an obvious win, but ripgrep's current behavior isn't exactly intuitive either.

@abhichandra21 I wonder if this was your issue too with respect to incorrect results. (You never reported the results of using the -a/--text flag because it OOMed, but we fixed that by using --no-mmap. Could you try again with -a/--text and see if the results are correct?)

@BurntSushi
Copy link
Owner

@minusf Your NUL bytes are on line 477 of the file you gave me:

24/09/2016, 02:41 - xxx: 👍�👍�👍

There are NUL bytes after the first and second 👍.

@BurntSushi BurntSushi added the bug A bug. label Mar 12, 2018
@minusf
Copy link
Author

minusf commented Mar 12, 2018

yesss, sorry i was lazy to od the file :} but i was getting there.

indeed counting in grep works even for binary files, and i had to pass it -a to get results. didn't occur to me that rg has the same flag. another small fail.

slowly but steadily the distinction between text files and binary files is washing away. if we have a 4MB file that is mostly text and emojis sprinkled with some 0 bytes here and there (cause whatapp's exporter does ...) is it a text or a binary file? is it a valid dychotomy between textness and binaryness? cause i would argue that this is a text file in the end. a file is a file is a file. we are all one 😸

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 13, 2018

Let's please not re-litigate binary file detection. It's not productive without simultaneously focusing on implementation details, because those details matter, a lot. Detecting NUL bytes is obviously a heuristic, and it's a heuristic that grep has used for decades. The purpose of binary file detection is to prevent search tools from spewing arbitrary binary data to terminals, which can wreak havoc. Try cating a PDF in your terminal, for example. ripgrep also uses binary file detection as a type of filter.

Also, emojis have nothing to do with this. There is no emoji whose UTF-8 encoding contains a NUL byte.

@minusf
Copy link
Author

minusf commented Mar 13, 2018

sorry for the harmless semi philosophical nosense. i'll be more than happy when rg reports back binary file just like grep.

@BurntSushi
Copy link
Owner

i'll be more than happy when rg reports back binary file just like grep.

To be clear, that bug is tracked by #306.

What isn't clear to me is whether we should be matching the behavior of grep in the original case you reported with the -c/--count flag.

@minusf
Copy link
Author

minusf commented Mar 13, 2018

as the heuristics are very simple, indeed a -a flag is a must.

i don't think the grep behaviour is an oversight. the possibly disruptive text
is never printed, so in the end it doesn't matter if the file passes the heuristics or not.

while there is dedicated binary comparison utility, cmp, to complement diff,
i don't know about a dedicated binary search tool.

also my man pages says:

NAME
     grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher

@BurntSushi BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label Apr 23, 2018
@BurntSushi BurntSushi added this to the libripgrep milestone Apr 23, 2018
@BurntSushi BurntSushi changed the title missed results should binary file detection be suppressed in aggregate views? Apr 23, 2018
@BurntSushi BurntSushi removed the libripgrep An issue related to modularizing ripgrep into libraries. label Aug 19, 2018
@BurntSushi BurntSushi removed this from the libripgrep milestone Aug 19, 2018
BurntSushi added a commit that referenced this issue Apr 14, 2019
This commit attempts to surface binary filtering in a slightly more
user friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep detects
a NUL byte, then it will print a warning message indicating that the search
stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep to
stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`,
then no binary filtering is applied (just like no gitignore and no
hidden file filtering is applied). Instead, ripgrep behaves as if you
gave the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, where as now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actualy intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly what
is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. question An issue that is lacking clarity on one or more points.
Projects
None yet
Development

No branches or pull requests

2 participants