BOMs are not stripped consistently in --json output #1638

acheronfail · 2020-07-08T11:54:22Z

What version of ripgrep are you using?

ripgrep 12.0.1
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo install ripgrep

What operating system are you using ripgrep on?

Arch Linux 5.7.6

Describe your bug.

When using ripgrep's --json flag on a file encoded as "UTF 8 with BOM" the BOM is not accounted for (as opposed to other encodings, such as UTF 16).

What are the steps to reproduce the behavior?

UTF8

# Create a UTF8 encoded file (without BOM)
printf "\x66\x6f\x6f" > utf8
# Run ripgrep
rg foo ./utf8 --json

UTF8 BOM

# Create a UTF8 encoded file (with BOM)
printf "\xef\xbb\xbf\x66\x6f\x6f" > utf8bom
# Run ripgrep
rg foo ./utf8bom --json

UTF16

# Create a UTF16 encoded file (has BOM)
printf "\xff\xfe\x66\x00\x6f\x00\x6f\x00" > utf16
# Run ripgrep
rg foo ./utf16 --json

What is the actual behavior?

Here is the JSON output for the above three code blocks.

UTF8

{"type":"begin","data":{"path":{"text":"./utf8"}}}
{"type":"match","data":{"path":{"text":"./utf8"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":0,"end":3}]}}
{"type":"end","data":{"path":{"text":"./utf8"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":42600,"human":"0.000043s"},"searches":1,"searches_with_match":1,"bytes_searched":3,"bytes_printed":219,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.007219s","nanos":7218625,"secs":0},"stats":{"bytes_printed":219,"bytes_searched":3,"elapsed":{"human":"0.000043s","nanos":42600,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

UTF8 BOM

{"type":"begin","data":{"path":{"text":"./utf8bom"}}}
{"type":"match","data":{"path":{"text":"./utf8bom"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":3,"end":6}]}}
{"type":"end","data":{"path":{"text":"./utf8bom"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":47766,"human":"0.000048s"},"searches":1,"searches_with_match":1,"bytes_searched":6,"bytes_printed":228,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.007849s","nanos":7849144,"secs":0},"stats":{"bytes_printed":228,"bytes_searched":6,"elapsed":{"human":"0.000048s","nanos":47766,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

UTF16

{"type":"begin","data":{"path":{"text":"./utf16"}}}
{"type":"match","data":{"path":{"text":"./utf16"},"lines":{"text":"foo"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"foo"},"start":0,"end":3}]}}
{"type":"end","data":{"path":{"text":"./utf16"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":43947,"human":"0.000044s"},"searches":1,"searches_with_match":1,"bytes_searched":3,"bytes_printed":221,"matched_lines":1,"matches":1}}}
{"data":{"elapsed_total":{"human":"0.006400s","nanos":6399559,"secs":0},"stats":{"bytes_printed":221,"bytes_searched":3,"elapsed":{"human":"0.000044s","nanos":43947,"secs":0},"matched_lines":1,"matches":1,"searches":1,"searches_with_match":1}},"type":"summary"}

What is the expected behavior?

I personally expected that ripgrep would strip the UTF8 BOM from the JSON report since that's what it does for UTF16 encodings. However, I'm not sure if this should be the case or not, considering that a UTF8 BOM is an optional file header.

The text was updated successfully, but these errors were encountered:

UTF-8 encoded files with BOM didn't sniff the BOM from results, regardless of config.bom_sniffing; ripgrep already implemented this option for UTF-16 files correctly. Fixes BurntSushi#1638

Previously, we were only looking for the UTF-16 BOM for determining whether to do transcoding or not. But we should also look for the UTF-8 BOM as well. Fixes #1638, Closes #1697

alessandroasm mentioned this issue Oct 2, 2020

UTF-8 BOM Sniffing #1697

Closed

BurntSushi added bug A bug. rollup A PR that has been merged with many others in a rollup. labels May 30, 2021

BurntSushi closed this as completed in 2295061 Jun 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BOMs are not stripped consistently in --json output #1638

BOMs are not stripped consistently in --json output #1638

acheronfail commented Jul 8, 2020

BOMs are not stripped consistently in --json output #1638

BOMs are not stripped consistently in --json output #1638

Comments

acheronfail commented Jul 8, 2020

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your bug.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?