Correct code point format in Base/Char/show function (#33291)

* Correct code point format in Base/Char/show function Two minor changes (both on line 307) to conform to the Unicode Standard. Unicode code points currently display with: 1. Lowercase letters, a - f, when present 2. A leading 0 for 5-digit code point values (i.e. 10000 - 9ffff) However, the Unicode Standard specifies that when using the "U+" notation, you should use: 1. Uppercase letters 2. Leading zeros only when the code point would have fewer than four digits (i.e. 0000 - 0FFF) For reference, the Unicode Standard (two versions to show consistency over time) * [(Version 12.1, 2019) Appendix A: Notational Conventions ⇒ Code Points](http:https://www.unicode.org/versions/Unicode12.0.0/appA.pdf) * [(Version 4.0.0, 2003) Preface: Notational Conventions ⇒ Code Points](http:https://www.unicode.org/versions/Unicode4.0.0/Preface.pdf) states: > In running text, an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345. * Add tests for U+ syntax formatting * Update code point format to match change in show() function * Update code point format to match change in show() function * Update code point format to match change in show() function * Update code point format to match change in show() function
JuliaLang · Sep 18, 2019 · 493c797 · 493c797
1 parent da86a22
commit 493c797
Show file tree

Hide file tree

Showing 6 changed files with 24 additions and 15 deletions.
diff --git a/base/char.jl b/base/char.jl
@@ -304,7 +304,7 @@ function show(io::IO, ::MIME"text/plain", c::T) where {T<:AbstractChar}
  else
  u = codepoint(c)
  end
- h = string(u, base = 16, pad = u ≤ 0xffff ? 4 : 6)
+ h = uppercase(string(u, base = 16, pad = 4))
  print(io, (isascii(c) ? "ASCII/" : ""), "Unicode U+", h)
  else
  print(io, ": Malformed UTF-8")

diff --git a/base/io.jl b/base/io.jl
@@ -135,7 +135,7 @@ Read the entirety of `io`, as a `String`.
 julia> io = IOBuffer("JuliaLang is a GitHub organization");
 
 julia> read(io, Char)
-'J': ASCII/Unicode U+004a (category Lu: Letter, uppercase)
+'J': ASCII/Unicode U+004A (category Lu: Letter, uppercase)
 
 julia> io = IOBuffer("JuliaLang is a GitHub organization");
 

diff --git a/base/iostream.jl b/base/iostream.jl
@@ -100,7 +100,7 @@ julia> io = IOBuffer("JuliaLang is a GitHub organization.");
 julia> seek(io, 5);
 
 julia> read(io, Char)
-'L': ASCII/Unicode U+004c (category Lu: Letter, uppercase)
+'L': ASCII/Unicode U+004C (category Lu: Letter, uppercase)
 ```
 """
 function seek(s::IOStream, n::Integer)
@@ -122,12 +122,12 @@ julia> io = IOBuffer("JuliaLang is a GitHub organization.");
 julia> seek(io, 5);
 
 julia> read(io, Char)
-'L': ASCII/Unicode U+004c (category Lu: Letter, uppercase)
+'L': ASCII/Unicode U+004C (category Lu: Letter, uppercase)
 
 julia> seekstart(io);
 
 julia> read(io, Char)
-'J': ASCII/Unicode U+004a (category Lu: Letter, uppercase)
+'J': ASCII/Unicode U+004A (category Lu: Letter, uppercase)
 ```
 """
 seekstart(s::IO) = seek(s,0)

diff --git a/base/strings/basic.jl b/base/strings/basic.jl
@@ -107,7 +107,7 @@ julia> isvalid(str, 1)
 true
 
 julia> str[1]
-'α': Unicode U+03b1 (category Ll: Letter, lowercase)
+'α': Unicode U+03B1 (category Ll: Letter, lowercase)
 
 julia> isvalid(str, 2)
 false

diff --git a/doc/src/manual/strings.md b/doc/src/manual/strings.md
@@ -88,8 +88,8 @@ julia> isvalid(Char, 0x110000)
 false
 ```
 
-As of this writing, the valid Unicode code points are `U+00` through `U+d7ff` and `U+e000` through
-`U+10ffff`. These have not all been assigned intelligible meanings yet, nor are they necessarily
+As of this writing, the valid Unicode code points are `U+0000` through `U+D7FF` and `U+E000` through
+`U+10FFFF`. These have not all been assigned intelligible meanings yet, nor are they necessarily
 interpretable by applications, but all of these values are considered to be valid Unicode characters.
 
 You can input any Unicode character in single quotes using `\u` followed by up to four hexadecimal
@@ -107,7 +107,7 @@ julia> '\u2200'
 '∀': Unicode U+2200 (category Sm: Symbol, math)
 
 julia> '\U10ffff'
-'\U10ffff': Unicode U+10ffff (category Cn: Other, not assigned)
+'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)
 ```
 
 Julia uses your system's locale and language settings to determine which characters can be printed
@@ -173,10 +173,10 @@ julia> str[1]
 'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
 
 julia> str[6]
-',': ASCII/Unicode U+002c (category Po: Punctuation, other)
+',': ASCII/Unicode U+002C (category Po: Punctuation, other)
 
 julia> str[end]
-'\n': ASCII/Unicode U+000a (category Cc: Other, control)
+'\n': ASCII/Unicode U+000A (category Cc: Other, control)
 ```
 
 Many Julia objects, including strings, can be indexed with integers. The index of the first
@@ -192,7 +192,7 @@ a normal value:
 
 ```jldoctest helloworldstring
 julia> str[end-1]
-'.': ASCII/Unicode U+002e (category Po: Punctuation, other)
+'.': ASCII/Unicode U+002E (category Po: Punctuation, other)
 
 julia> str[end÷2]
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
@@ -223,7 +223,7 @@ Notice that the expressions `str[k]` and `str[k:k]` do not give the same result:
 
 ```jldoctest helloworldstring
 julia> str[6]
-',': ASCII/Unicode U+002c (category Po: Punctuation, other)
+',': ASCII/Unicode U+002C (category Po: Punctuation, other)
 
 julia> str[6:6]
 ","
@@ -416,7 +416,7 @@ julia> foreach(display, s)
 '\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
 '\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
 '\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
-'|': ASCII/Unicode U+007c (category Sm: Symbol, math)
+'|': ASCII/Unicode U+007C (category Sm: Symbol, math)
 
 julia> isvalid.(collect(s))
 4-element BitArray{1}:
@@ -429,7 +429,7 @@ julia> s2 = "\xf7\xbf\xbf\xbf"
 "\U1fffff"
 
 julia> foreach(display, s2)
-'\U1fffff': Unicode U+1fffff (category In: Invalid, too high)
+'\U1fffff': Unicode U+1FFFFF (category In: Invalid, too high)
 ```
 
 We can see that the first two code units in the string `s` form an overlong encoding of

diff --git a/test/char.jl b/test/char.jl
@@ -290,3 +290,12 @@ end
 @testset "broadcasting of Char" begin
  @test identity.('a') == 'a'
 end
+
+@testset "code point format of U+ syntax (PR 33291)" begin
+ @test repr("text/plain", '\n') == "'\\n': ASCII/Unicode U+000A (category Cc: Other, control)"
+ @test repr("text/plain", '/') == "'/': ASCII/Unicode U+002F (category Po: Punctuation, other)"
+ @test repr("text/plain", '\u10e') == "'Ď': Unicode U+010E (category Lu: Letter, uppercase)"
+ @test repr("text/plain", '\u3a2c') == "'㨬': Unicode U+3A2C (category Lo: Letter, other)"
+ @test repr("text/plain", '\U001f428') == "'🐨': Unicode U+1F428 (category So: Symbol, other)"
+ @test repr("text/plain", '\U010f321') == "'\\U10f321': Unicode U+10F321 (category Co: Other, private use)"
+end