Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect search results for malformed Char #48283

Open
stevengj opened this issue Jan 15, 2023 · 3 comments
Open

incorrect search results for malformed Char #48283

stevengj opened this issue Jan 15, 2023 · 3 comments
Labels
domain:search & find The find* family of functions domain:strings "Strings!"

Comments

@stevengj
Copy link
Member

stevengj commented Jan 15, 2023

I noticed in #48281 that various search routines are giving incorrect results because they check for ASCII chars via c ≤ '\x7f' rather than isascii(c), which can be wrong for a malformed Char:

julia> c = reinterpret(Char, 0x7f000000 - 0x01)
'\x7e\xff\xff\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> isascii(c)
false

julia> c  '\x7f'
true

julia> s = "foo~bar~\xff\xff\xff~baz"
"foo~bar~\xff\xff\xff~baz"

julia> readuntil(IOBuffer(s), c)
"foo"

julia> s[1:findfirst(==(c), s)]
"foo~"

julia> s[findlast(==(c), s):end]
"~baz"

These search results seem clearly wrong to me — they are treating c as if it were == Char(0x7e) == '~'.

It's not entirely clear to me what the correct behavior should be, however, since Julia never treats '\x7e\xff\xff\xff' as a character in a string: "$c" == "~\xff\xff\xff". Maybe these routines should throw an error if the character is malformed?

cc @StefanKarpinski, who wrote this code in #24999.

See also #44989, where such characters literals were allowed disallowed:
even though since it is impossible(?) to obtain such a Char by iterating a String.

@stevengj stevengj added kind:bug Indicates an unexpected problem or unintended behavior domain:strings "Strings!" domain:search & find The find* family of functions labels Jan 15, 2023
@jakobnissen
Copy link
Contributor

Using reinterpret is bound to be able to create instances that violate any assumptions, and thus can break code.
Is there any other way of creating such a Char, without using reinterpret? If not, we could simply consider using such a Char undefined behaviour?

@simeonschaub
Copy link
Member

See also #44989, where such characters literals were allowed:

julia> `'\x7e\xff\xff\xff'`
`'\x7e\xff\xff\xff'`

I advocated for allowing those, but it was eventually decided against. This just creates a command literal, if you try this with a character literal this will throw a syntax error:

julia> '\x7e\xff\xff\xff'
ERROR: syntax: character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ none:1

See also @StefanKarpinski's comment in #44765 (comment), which might also be relevant here:

Moreover, there is an implicit invariant for the Char type, which is that it only ever holds data for a single character as defined by string iteration. You can violate that invariant and construct Char values like 'abcd' or \x80\x80' by using reinterpret, but we really don't want to encourage it and we certainly shouldn't provide syntax to construct Char values like that.

@stevengj
Copy link
Member Author

If not, we could simply consider using such a Char undefined behaviour?

That's certainly an option. (It doesn't crash, it just gives a surprising result.)

It might be nicer to throw an exception, since it shouldn't be too costly in the context of a search routine to check for a malformed Char. But currently we don't seem to have a predicate to check for malformed Char values? (isvalid(c) is different: it returns false for anything that does not correspond to a valid UTF-8-encoded character.)

@stevengj stevengj removed the kind:bug Indicates an unexpected problem or unintended behavior label Jan 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:search & find The find* family of functions domain:strings "Strings!"
Projects
None yet
Development

No branches or pull requests

3 participants