RFC: Unicode strings and characters #13
Open: robin-aws wants to merge 15 commits into dafny-lang:master from robin-aws:unicode-strings
15 commits
f64cd44 First start (robin-aws)
098ccbd Finished front half (robin-aws)
efc0ba8 Progress (robin-aws)
2c5ae54 Most alternatives done (robin-aws)
1c9ecda Remaining alternatives todo (robin-aws)
9a636ed Everything filled in (robin-aws)
5f7d0e0 Trailing fix ups (robin-aws)
d347572 Apply suggestions from code review (robin-aws)
16f6a97 Half-open intervals (robin-aws)
04d2553 Clarified representation of string literals (robin-aws)
931d679 More details in Prior Art (robin-aws)
3b87491 Don’t support \uXXXX with \unicodeChar:1 (robin-aws)
30c7ea7 Addressing remaining comments (robin-aws)
cbb491e Lingering edits (robin-aws)
560f395 Apply suggestions from code review (robin-aws)
More details in Prior Art
commit 931d679901e83a46f61b99be0fbf2f9b201627cf
@@ -308,21 +308,29 @@ or could in the future:
 The `System.Text.Rune` struct is provided to represent any Unicode scalar value,
 and its API guarantees that invalid values (e.g. surrogates) will be rejected on construction.
 The method `String.EnumerateRunes()` produces the sequence of runes in a string via an `IEnumerator<Rune>`.
 Any invalid UTF-16 sequences are enumerated as the `U+FFFD` "Replacement Character" `Rune` value.
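
The replacement behavior described for `String.EnumerateRunes()` can be illustrated outside .NET. The following is a minimal Go analogue, not C# code: it relies on the `Decode` function from Go's standard `unicode/utf16` package, which also substitutes U+FFFD for unpaired surrogates. The helper name `decodeLossy` is hypothetical and exists only for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeLossy (hypothetical helper) decodes UTF-16 code units into scalar
// values, replacing every unpaired surrogate with U+FFFD.
func decodeLossy(units []uint16) []rune {
	return utf16.Decode(units)
}

func main() {
	// 'A', a lone high surrogate, 'B', then a valid pair encoding U+1F600.
	units := []uint16{0x0041, 0xD800, 0x0042, 0xD83D, 0xDE00}
	for _, r := range decodeLossy(units) {
		fmt.Printf("U+%04X\n", r)
	}
}
```

Under these assumptions the lone `0xD800` comes back as U+FFFD, so the caller only ever sees valid scalar values but can no longer recover the original ill-formed code unit.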

 ## Java:

 `char` is one of the eight primitive types in Java, and also represents a UTF-16 code unit.
 In recent versions of the Java Runtime Environment, the `java.lang.String` class supports
 encoding its data either in UTF-16 or in Latin-1, where the latter is an optimization for space
Review comment: Does it support ill-formed sequences?
-when all characters in the string are supported by this encoding.
+when all characters in the string are supported by this encoding.
+A `String` value may contain invalid sequences such as unpaired surrogates.

+Java does not include a dedicated type for Unicode scalar values
Review comment: How does the API behave when given an int outside of the range of valid scalar values?
robin-aws marked this conversation as resolved.
 and instead uses the 32-bit wide `int` primitive type.
+The `CharSequence.codePoints()` method produces an `IntStream` enumerating the UTF-32 code points in a valid String,
+but will enumerate unpaired surrogates directly as zero-extended `int` values.
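
The pass-through behavior described for `CharSequence.codePoints()` differs from a lossy decoder. As a rough sketch of those semantics, written in Go rather than Java, the hypothetical helper `codePointsLenient` below combines valid surrogate pairs and emits any unpaired surrogate unchanged as a zero-extended 32-bit value. It models the description above and is not the JDK implementation.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf16"
)

// codePointsLenient (hypothetical helper) walks UTF-16 code units, combining
// valid surrogate pairs and passing unpaired surrogates through unchanged,
// zero-extended to 32 bits.
func codePointsLenient(units []uint16) []int32 {
	var out []int32
	for i := 0; i < len(units); i++ {
		u := rune(units[i])
		if i+1 < len(units) {
			// DecodeRune returns U+FFFD unless (u, next) is a valid pair.
			if r := utf16.DecodeRune(u, rune(units[i+1])); r != unicode.ReplacementChar {
				out = append(out, r)
				i++ // consume the low surrogate as well
				continue
			}
		}
		out = append(out, u)
	}
	return out
}

func main() {
	// 'A', a lone high surrogate, 'B', then a valid pair encoding U+1F600.
	units := []uint16{0x0041, 0xD800, 0x0042, 0xD83D, 0xDE00}
	for _, cp := range codePointsLenient(units) {
		fmt.Printf("0x%X\n", cp)
	}
}
```

Here the lone `0xD800` survives in the output, so a consumer that needs true scalar values has to check each element itself.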

 ## Go:

 In Go a string is a read-only slice of bytes, which generally contains UTF-8 encoded bytes
-but may contain arbitrary bytes. The `rune` type is an alias for `int32`.
+but may contain arbitrary bytes: the Go compiler will reject invalid string literals,
+but it is still possible for strings to contain invalid UTF-8 bytes at runtime.
+The `rune` type is an alias for `int32`,
+and the Go compiler does not prevent casting out-of-range values such as `0x11_0000`
+as `rune` values.

 ## JavaScript:

@@ -332,13 +340,13 @@ There is no distinct type for representing individual characters.
 ## C++:

 The `char` type represents bytes, and the `std::string` class from the standard library
-operates on bytes as character, and generally does not produce correct results if used
+operates on bytes as characters, and generally does not produce correct results if used
 together with any encoding other than single-byte encodings such as ASCII.
 C++11 added two new character types, `char16_t` and `char32_t`,
 and two new corresponding `std::u16string` and `std::u32string` collection classes.
 It also provides three new kinds of string literals,
 `u8"..."`, `u"..."`, and `U"..."`,
-which produce binary values encoded with UTF-8, UTF-16, and UTF-32 respectively.
+which produce `char[N]`, `char16_t[N]`, and `char32_t[N]` values encoded with UTF-8, UTF-16, and UTF-32 respectively.
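
The array length `N` differs between the three literal kinds because each encoding needs a different number of code units for the same text. A small Go sketch (not C++) shows this by counting code units per encoding; the helper name `codeUnits` is hypothetical, and the counts exclude the terminating null element that the C++ array types include.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits (hypothetical helper) returns how many code units s occupies
// in UTF-8, UTF-16, and UTF-32. It assumes s is valid UTF-8.
func codeUnits(s string) (int, int, int) {
	runes := []rune(s)
	return len(s), len(utf16.Encode(runes)), len(runes)
}

func main() {
	// 'a' (U+0061), the euro sign (U+20AC), and an emoji (U+1F600).
	s := "a\u20ac\U0001F600"
	if !utf8.ValidString(s) {
		panic("sample text must be valid UTF-8")
	}
	u8, u16, u32 := codeUnits(s)
	fmt.Println(u8, u16, u32) // 8 4 3
}
```

The same three characters take 8, 4, and 3 code units respectively, which is why indexing into any of these representations by code unit does not generally correspond to indexing by character.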

 ## Python:

Review comment: How does this API behave when given a string that is not a valid UTF-16 string?
Reply: I've added detail on invalid data in several places now.