BOMs are not stripped consistently in --json output #1638

Closed
acheronfail opened this issue Jul 8, 2020 · 0 comments
Labels: bug, rollup

What version of ripgrep are you using?

ripgrep 12.0.1
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo install ripgrep

What operating system are you using ripgrep on?

Arch Linux 5.7.6

Describe your bug.

When using ripgrep's --json flag on a file encoded as "UTF-8 with BOM", the BOM is not accounted for in the reported submatch offsets (unlike other encodings, such as UTF-16).

What are the steps to reproduce the behavior?

UTF8

# Create a UTF8 encoded file (without BOM)
printf "\x66\x6f\x6f" > utf8
# Run ripgrep
rg foo ./utf8 --json

UTF8 BOM

# Create a UTF8 encoded file (with BOM)
printf "\xef\xbb\xbf\x66\x6f\x6f" > utf8bom
# Run ripgrep
rg foo ./utf8bom --json

UTF16

# Create a UTF16 encoded file (has BOM)
printf "\xff\xfe\x66\x00\x6f\x00\x6f\x00" > utf16
# Run ripgrep
rg foo ./utf16 --json
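
As a sanity check on the fixtures above, the following sketch recreates the three files and dumps their bytes. It uses octal escapes rather than `\x` escapes, since some `printf` implementations do not handle `\x`. The `hex` helper is hypothetical and not part of the original report.

```shell
# Recreate the three fixtures with octal escapes (portable across printf implementations)
printf 'foo' > utf8
printf '\357\273\277foo' > utf8bom
printf '\377\376f\000o\000o\000' > utf16

# Hypothetical helper: print a file's bytes as a single hex string
hex() { od -An -tx1 "$1" | tr -d ' \n'; }

hex utf8; echo      # 666f6f
hex utf8bom; echo   # efbbbf666f6f
hex utf16; echo     # fffe66006f006f00
```

The leading `ef bb bf` and `ff fe` bytes are the BOMs; everything after them is the encoded text `foo`.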

What is the actual behavior?

Here is the JSON output for the above three code blocks.

UTF8

{"type":"begin","data":{"path":{"text":"./utf8"}}}
{"type":"match","data":{"path":{"text":"./utf8"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":0,"end":3}]}}
{"type":"end","data":{"path":{"text":"./utf8"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":42600,"human":"0.000043s"},"searches":1,"searches_with_match":1,"bytes_searched":3,"bytes_printed":219,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.007219s","nanos":7218625,"secs":0},"stats":{"bytes_printed":219,"bytes_searched":3,"elapsed":{"human":"0.000043s","nanos":42600,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

UTF8 BOM

{"type":"begin","data":{"path":{"text":"./utf8bom"}}}
{"type":"match","data":{"path":{"text":"./utf8bom"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":3,"end":6}]}}
{"type":"end","data":{"path":{"text":"./utf8bom"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":47766,"human":"0.000048s"},"searches":1,"searches_with_match":1,"bytes_searched":6,"bytes_printed":228,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.007849s","nanos":7849144,"secs":0},"stats":{"bytes_printed":228,"bytes_searched":6,"elapsed":{"human":"0.000048s","nanos":47766,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

UTF16

{"type":"begin","data":{"path":{"text":"./utf16"}}}
{"type":"match","data":{"path":{"text":"./utf16"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":0,"end":3}]}}
{"type":"end","data":{"path":{"text":"./utf16"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":43947,"human":"0.000044s"},"searches":1,"searches_with_match":1,"bytes_searched":3,"bytes_printed":221,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.006400s","nanos":6399559,"secs":0},"stats":{"bytes_printed":221,"bytes_searched":3,"elapsed":{"human":"0.000044s","nanos":43947,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}
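
One way to read the three outputs: the only submatch that does not start at 0 is the UTF8 BOM one, and its start of 3 equals the length of the UTF-8 BOM in bytes. A small sketch checking that arithmetic follows. The match line is copied from the UTF8 BOM output above, and the variable names are illustrative only.

```shell
printf 'foo' > utf8
printf '\357\273\277foo' > utf8bom

# The size difference between the two files is the BOM length in bytes
bom_len=$(( $(wc -c < utf8bom) - $(wc -c < utf8) ))

# Match line as reported for ./utf8bom in this issue
match='{"type":"match","data":{"path":{"text":"./utf8bom"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":3,"end":6}]}}'

# Pull the submatch start offset out of the JSON (sed is enough for this fixed shape)
start=$(printf '%s\n' "$match" | sed 's/.*"start":\([0-9]*\).*/\1/')

echo "BOM length: $bom_len, reported start: $start"
```

Both numbers come out as 3, which is consistent with the offsets being counted from the start of the raw file, BOM included.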

What is the expected behavior?

I personally expected that ripgrep would strip the UTF8 BOM from the JSON report since that's what it does for UTF16 encodings. However, I'm not sure if this should be the case or not, considering that a UTF8 BOM is an optional file header.
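
For illustration only, here is a sketch of what "stripping the BOM" means at the byte level, done outside ripgrep. This is not how ripgrep handles it internally; the `utf8nobom` file name is made up for this example.

```shell
printf 'foo' > utf8
printf '\357\273\277foo' > utf8bom

# Skip the first three bytes (the UTF-8 BOM); tail -c +4 starts output at byte 4
tail -c +4 utf8bom > utf8nobom

# The stripped file is byte-for-byte identical to the BOM-less fixture
cmp utf8nobom utf8 && echo "identical after stripping"
```

Since `utf8nobom` has the same bytes as `utf8`, searching it should report the same offsets as the plain UTF8 case above.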

alessandroasm added a commit to alessandroasm/ripgrep that referenced this issue Oct 2, 2020
UTF-8 encoded files with BOM didn't sniff the BOM from results, regardless of
config.bom_sniffing; ripgrep already implemented this option for UTF-16 files
correctly.

Fixes BurntSushi#1638
BurntSushi added the bug and rollup labels on May 30, 2021
BurntSushi pushed a commit that referenced this issue May 30, 2021
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.

Fixes #1638, Closes #1697
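
The idea in the commit message, checking for a UTF-8 BOM in addition to the UTF-16 one, can be sketched at the byte level as follows. This is a simplified shell illustration, not ripgrep's actual code: `sniff_bom` is a hypothetical helper, and it ignores UTF-32, whose little-endian BOM also begins with `ff fe`.

```shell
printf 'foo' > utf8
printf '\357\273\277foo' > utf8bom
printf '\377\376f\000o\000o\000' > utf16

# Hypothetical helper: report which BOM, if any, a file starts with
sniff_bom() {
  # Read at most the first three bytes as a hex string
  lead=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
  case "$lead" in
    efbbbf*) echo utf-8 ;;
    fffe*)   echo utf-16le ;;
    feff*)   echo utf-16be ;;
    *)       echo none ;;
  esac
}

sniff_bom utf8      # none
sniff_bom utf8bom   # utf-8
sniff_bom utf16     # utf-16le
```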
BurntSushi pushed a commit that referenced this issue May 31, 2021
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.

Fixes #1638, Closes #1697
BurntSushi pushed a commit that referenced this issue Jun 1, 2021
Previously, we were only looking for the UTF-16 BOM for determining
whether to do transcoding or not. But we should also look for the UTF-8
BOM as well.

Fixes #1638, Closes #1697