export and document transcode #17323

stevengj · 2016-07-07T19:35:54Z

This exports and documents the transcode function from #16974. It also adds convenience methods transcode(String, x) == String(transcode(UInt8, x)) (an extremely common pattern in our own code) and transcode(T, s::String) == transcode(T, s.data) (since we'd prefer to avoid requiring that users muck with s.data).

I feel like this is a necessary prerequisite to #16590, in order to remove the need for the UTF16String and UTF32String types when calling C APIs.

For the same reason, I'd also prefer to add support for transcoding to/from UTF-32, for calling C APIs on systems where Cwchar_t == Int32. I'm a little confused about how that is supposed to work in #16590. It seems like a pretty basic implementation like transcode(::Type{Uint32}, s::String) = map(UInt32, s) should be fine.

stevengj · 2016-07-07T19:36:15Z

cc @tkelman, @StefanKarpinski

tkelman · 2016-07-07T19:38:05Z

base/c.jl

@@ -121,8 +121,21 @@ end

 # transcoding between data in UTF-8 and UTF-16 for Windows APIs

+"""
+ transcode(T, src)


needs rst signature so genstdlib can fill it in

Right, I was meaning to do that and forgot.

stevengj · 2016-07-07T20:27:46Z

Updated to include very basic UTF-32 conversions, especially for systems where Cwchar_t is Int32 or UInt32.

tkelman · 2016-07-07T20:29:19Z

base/c.jl

+transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
+transcode{T<:Union{Int32,UInt32}}(::Type{T}, src) = transcode(T, transcode(String, src))
+transcode{S<:Union{Int32,UInt32}}(T, src::Vector{S}) = transcode(T, transcode(String, src))
+transcode{S<:Union{Int32,UInt32}}(::Type{String}, src::Vector{S}) = string(map(Char, src)...)


does this splat every character?

Yes, is that a big problem? string has an optimized method for string(Char...) that I wanted to exploit.

Or should I create an IOBuffer and just print the characters one-by-one to it?

The splatting would still have to go through dispatch for all arguments to determine it can use that method, I think

Splatting performance confuses me. I feel like there should be an optimized path for foo(array...) when array::Vector{T} and foo has a foo(T...) method.

Anyway, I'll change it to use an IOBuffer.

tkelman · 2016-07-07T20:48:12Z

x-ref #17218 which we probably should have merged

nalimilan · 2016-07-08T16:37:40Z

In StringEncodings.jl, I have followed the naming from e.g. Python and used decode(x) == String(transcode(UInt8, x)) instead of adding a new method for transcode. There's also the reverse operation, called encode, which transforms a string into an array of bytes.

I think it would make sense to use this convention, and keep transcode for operations between arrays of UInt[8|16|32].

stevengj · 2016-07-08T17:20:11Z

@nalimilan, why? I'd strongly prefer having fewer symbols if the meaning is clear, as I think it is here.

stevengj · 2016-07-08T19:10:19Z

Okay, method ambiguities should be fixed.

tkelman · 2016-07-09T03:01:35Z

needs rebase (probably trivial end-of-file), and failing commits should be squashed out

nalimilan · 2016-07-09T08:52:04Z

@nalimilan, why? I'd strongly prefer having fewer symbols if the meaning is clear, as I think it is here.

Dunno, it feels more natural to me to use different functions for string -> bytes and bytes -> string. That approach is also used in the corresponding Rust package. OTOH, I couldn't find a language/package where transcode would do both operations depending on the input type. transcode usually takes an array of bytes and returns another one.

Note that the difference between the methods in this PR and those in StringEncodings.jl is that the latter take an encoding argument, since the integer size cannot be used to carry that information in all cases. This means the encoding argument would be in one case (bytes -> string) the source encoding, and in the other one the target encoding (string -> bytes).

nalimilan · 2016-07-09T08:54:46Z

base/c.jl

+"""
+ transcode(T, src)
+
+Convert string data between Unicode encodings. `src` is either a


I don't think we usually use double space between sentences? (Here and below.)

Doesn't the Markdown formatter ignore extra spaces when it formats the result?

Yes, it just ignores extra whitespace inside paragraphs. Shouldn't be causing any problems, though we should probably use consistent formatting throughout.

nalimilan · 2016-07-09T08:57:52Z

IOW: while I agree this feels natural here, where the scope is quite limited, I'm afraid it wouldn't be very natural when extending it to any encoding. Whether Base and StringEncodings should be completely consistent is up for debate, though.

stevengj · 2016-07-09T14:19:37Z

Part of the difference with Python and Rust, I think, is that those are OOP languages and the methods in that case are "owned" by an object. In that case it is more natural to have string.encode(...) and string.decode(...) or similar. Whereas the transcode function is not owned by any particular object, so it makes sense to me to have it more symmetrical (the same function regardless of which way you are transcoding).

StefanKarpinski · 2016-07-11T11:27:54Z

I merged #17218 so transcode is now documented and can be used by packages for calling Windows C APIs, alleviating some of the pressure on this issue.

This change significantly expands on the notion of transcode, which I specifically left very limited to translating between arrays of code units (and only the ones that were strictly necessary for Windows APIs). I'm not saying that we shouldn't make transcode more official and expand on it, as the PR does, but it's a move that bears discussion.

For example, the current transcode methods never raise errors. Instead they make a best-effort attempt to convert between different encodings, handling CESU-8, overlong encodings, and unpaired surrogates, as well as plain old invalid UTF-8 data (by interpreting it as Latin-1). That's a somewhat sane behavior for something as low level as passing String data to C functions: this will never raise any sort of encoding error on UNIX since no transcoding is involved, so it's arguably ok not raise an error on Windows either. The vision for the String type is that it be able to hold any data but treat it in a UTF-8-like manner, c.f. a proper UTF-8 string type which enforces valid encoding. That allows us to remove all Unicode-related machinery from Base, including error types.

If we add transcode(String, src) and transcode(T, ::String), however, then the logical conclusion is that we support transcode(<:AbstractString, src) and transcode(T, ::AbstractString) in general. This raises the question: which errors does this function throw and under what circumstances? If we have an answer to this, then I think we can proceed. Perhaps the answer is that transcode(String, src) never raises an error since String won't impose any restrictions on its data, so transcode simply does it's best to convert src to something UTF-8-like and sticks it in a String. It's less clear what to do about transcode methods that operate only on vectors of code units. Why don't those raise any errors?

More broadly, this raises and issue about errors and generic functions. Currently, we allow any method of a function raise any error. This makes it very hard to write generic code in a way that doesn't unexpected errors. At some point we'll want the errors that a function can throw to be part of the signature of the function in some sense so that we can know when generic code handles all the errors it can legitimately encounter. If we don't define encoding error types in Base, then it is in principle impossible do this for the transcode function without simply declaring that it may throw any kind of error at all. Perhaps we should have an EncodingError abstract type exception type that all string encoding errors subtype, so that we could say in base that the signature of transcode is something like this:

transcode{S<:AbstractString}(::Type{S}, src::AbstractString) throws EncodingError

Of course, that's not the whole signature since it only captures transcoding between strings.

stevengj · 2016-07-11T12:42:14Z

@StefanKarpinski, the basic motivation for transcode(String, a) = String(transcode(UInt8, array)) is that String(transcode(UInt8, array)) is pretty much the only way that this is called when converting to UInt8, and it seems silly to force every caller to type String(...).

I really don't understand your point about exceptions. What about that is specific to transcode? You might as well say that it is impossible to use sum in generic code because callers can override it to do anything.

We should treat this like any other method: we document the behavior in Base (we don't throw exceptions for converting between Unicode encodings), and we support String and UIntXX arguments. If a package does something different for some other string type or encoding, it is up to them to document it.

stevengj · 2016-07-11T12:50:41Z

@StefanKarpinski, I think you mean that you merged #17218. That adds a docstring for transcode, but does not document it for users. The docstring is purely internal (not in the manual), and the function is not exported.

StefanKarpinski · 2016-07-11T13:11:14Z

Ok, ignore the issue about generic functions and exceptions if that doesn't bother you. I completely understand the motivation for adding convenience methods to transcode. However, this PR effectively makes transcode an official way to transcode between different string types and encodings, in which case we need to think harder about the API. Can you explain what you think the general policy of transcode raising encoding exceptions should be?

stevengj · 2016-07-11T13:14:23Z

I think it should only raise an exception if it is impossible to losslessly represent the source data in the destination encoding.

StefanKarpinski · 2016-07-11T13:36:06Z

What does it mean to "losslessly represent" invalid string data? Does that mean that the round trip transcoding is guaranteed to be the identity? There are byte sequences which are invalid UTF-8 but which the current algorithm can round trip safely through UTF-16 anyway. There are other cases where UTF-8-like data can include non-canonical encodings that are useful and common (CESU-8, Modified UTF-8 and overlong encodings in general), but which do not round trip through UTF-16 – although they still encode the same sequence of code points. Is that ok? In still other cases, it's often reasonable to interpret invalid UTF-8 as Latin-1 and encode it as UTF-16 accordingly, but this doesn't round-trip, of course. Any sequence of UInt16 code points can be transcoded to UTF-8 and round-tripped back to UTF-16 identically, even invalid ones.

stevengj · 2016-07-11T14:09:34Z

I would think that as long as it encodes to the same sequence of code points and/or it round-trips, it should be accepted.

stevengj · 2016-07-11T15:13:21Z

How about something like:

The transcode function does not throw an error if the input data can be reasonably represented in the target encoding; it always succeeds for conversions between UTF-XX encodings, even for invalid Unicode data.

And then we leave it up to packages to decide on a reasonable meaning of "reasonably". (It's not like any other functions in Base impose formal constraints on user methods.)

stevengj · 2016-07-11T16:20:58Z

Added a clarification to the docs that this function is supposed to handle even invalid data.

tkelman · 2016-07-11T16:22:17Z

base/c.jl

-to return a `Vector{UIntXX}` of UTF-`XX` data. Only conversion
-to or from UTF-8 is currently supported.
+to return a `Vector{UIntXX}` of UTF-`XX` data. (The alias `Cwchar_t`
+also be used as the integer type for converting `wchar_t*` strings


can be used

StefanKarpinski · 2016-07-11T21:51:20Z

So here's another issue: as we've discussed before, we probably want encodings to be first class and write something like transcode(UTF8, src) which is somewhat at odds with using UInt8 as a way of expressing the UTF8 encoding is sort of sketchy. It's fine if this is not used for anything but interacting with Windows APIs, but it's not great as a more general API.

stevengj · 2016-07-11T23:56:34Z

If more encodings are needed, they can always be added later, no? (Probably this is best left to a package...I can't see supporting Latin1 or EBCDIC or Windows-1252 in Base.)

As time goes on, it seems like UTF-xx will be the only encodings that matter, so it is not crazy to use UIntxx as a shorthand for them.

StefanKarpinski · 2016-07-12T04:10:38Z

Here's a possible policy on raising transcoding errors, and it's simple: never return an invalid object of the requested type. Since String will be able to hold any kind of data, transcode(String, src) never raises an error – it just does it's best to transcode src as String. For something like transcode(UTF16String, src) on the other hand, if you would have to produce an invalid UTF16String object, you should raise an error. This same policy explains why transcode methods that return vectors of code units (e.g. transcode(UInt16, src)) doesn't raise an error even though transcode(UTF16String, src) might: because even if the result of transcoding is invalid UTF-16 data, it's still a perfectly valid Vector{UInt16}.

This raises another question: should we change transcode(UInt16, src) to transcode(Vector{UInt16}, src)? It's a bit more verbose, but it would be more consistent.

kmsquire · 2016-07-12T05:00:39Z

I like transcode(Vector{UInt16}, src). It has a nice parallel with convert ... Although, it almost seems that convert could be used instead?

tkelman · 2016-07-12T05:31:52Z

We went through the argument about convert before, not everyone is a fan of associating strings and vectors of unsigned integers so closely via the convert name.

edit: I also like referring to the container type rather than the element type, echoes of #11557

StefanKarpinski · 2016-07-12T16:07:40Z

Since convert(Vector{UInt16}, UInt8[1,2,3]) already means something, we clearly can't use that. Another argument for allowing transcode(UInt16, src) is that we're likely to have transcode(UTF16, src) where UTF16 is an encoding object, not a string type. Of course, we'd probably also allow transcode(EncodedString{UTF16}, src) or however we end up spelling encoded string types. But in that case, we probably want to allow transcode(Vector{UInt16}, src) as well as transcode(UInt16, src). For now, I think just having the version we actually need to use is probably the best move. We can round out the rest of the API in base as the API for encoded strings gets rounded out in packages.

StefanKarpinski · 2016-07-12T16:43:35Z

base/c.jl

+transcode{T<:Union{UInt8,UInt16,UInt32,Int32}}(::Type{T}, src::Vector{T}) = src
+transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
+transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::Vector{UInt8}) = transcode(T, String(src))
+function transcode{S<:Union{Int32,UInt32}}(::Type{UInt8}, src::Vector{S})


What's the rationale for supporting signed 32-bit code units (Int32) but only unsigned code units for others?

Cwchar_t can be Int32, so we need to support the signed type for UTF-32.

I suppose it wouldn't hurt to support the signed types for Int8 and Int32 as well.

Ah, I was arguing for not supporting Int32 either but the point about Cwchar_t makes sense. Let's leave it as is. I'd like to keep this minimal until we find we really need to do otherwise.

StefanKarpinski · 2016-07-12T17:23:05Z

Ok, I'm convinced. @stevengj – will you rebase and merge? Or I can do it since I merged the conflicting changes. Just let me know.

stevengj · 2016-07-12T17:24:31Z

Rebased; click "merge" when green.

stevengj · 2016-07-12T20:17:56Z

!#$&-ing NEWS.md conflicts.

…ing, x) and transcode(T, ::String) convenience methods

StefanKarpinski · 2016-07-12T20:19:57Z

Can we not run CI again for a NEWS conflict fix?

stevengj · 2016-07-12T20:21:16Z

Okay.

* export and document transcode from JuliaLang#16974, add transcode(String, x) and transcode(T, ::String) convenience methods * docs * support UTF-32 in transcode * don't use splatting for UTF-32 to String conversion * typo * eliminate method ambiguities * re-run genstdlib * doc clarification * typo

stevengj added the domain:unicode Related to unicode characters and encodings label Jul 7, 2016

tkelman reviewed Jul 7, 2016
View reviewed changes

kshyatt added the domain:docs This change adds or pertains to documentation label Jul 7, 2016

nalimilan reviewed Jul 9, 2016
View reviewed changes

stevengj force-pushed the export-transcode branch from 6b31fe3 to 9016ffc Compare July 9, 2016 14:20

StefanKarpinski added this to the 0.5.0 milestone Jul 11, 2016

stevengj force-pushed the export-transcode branch from 9016ffc to d57d866 Compare July 11, 2016 12:57

StefanKarpinski added the needs decision A decision on this change is needed label Jul 11, 2016

tkelman reviewed Jul 11, 2016
View reviewed changes

StefanKarpinski reviewed Jul 12, 2016
View reviewed changes

stevengj force-pushed the export-transcode branch from 76f789f to e48811c Compare July 12, 2016 17:24

stevengj and others added 9 commits July 12, 2016 16:18

export and document transcode from JuliaLang#16974, add transcode(Str…

8afd70d

…ing, x) and transcode(T, ::String) convenience methods

docs

7576748

support UTF-32 in transcode

aa371bc

don't use splatting for UTF-32 to String conversion

66f9155

typo

35e7b69

eliminate method ambiguities

fc74630

re-run genstdlib

bef4d19

doc clarification

c49099d

typo

d10a1e8

stevengj force-pushed the export-transcode branch from e48811c to d10a1e8 Compare July 12, 2016 20:19

stevengj merged commit 5d045df into JuliaLang:master Jul 12, 2016

stevengj deleted the export-transcode branch July 12, 2016 20:21

stevengj mentioned this pull request Jul 15, 2016

Fix for cwstring, cconvert, transcode when !is_windows() #17410

Closed

export and document transcode #17323

export and document transcode #17323

Conversation

stevengj commented Jul 7, 2016

stevengj commented Jul 7, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stevengj commented Jul 7, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tkelman commented Jul 7, 2016

nalimilan commented Jul 8, 2016

stevengj commented Jul 8, 2016 • edited Loading

stevengj commented Jul 8, 2016

tkelman commented Jul 9, 2016

nalimilan commented Jul 9, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan commented Jul 9, 2016

stevengj commented Jul 9, 2016

StefanKarpinski commented Jul 11, 2016 • edited Loading

stevengj commented Jul 11, 2016 • edited Loading

stevengj commented Jul 11, 2016 • edited Loading

StefanKarpinski commented Jul 11, 2016

stevengj commented Jul 11, 2016 • edited Loading

StefanKarpinski commented Jul 11, 2016 • edited Loading

stevengj commented Jul 11, 2016

stevengj commented Jul 11, 2016 • edited Loading

stevengj commented Jul 11, 2016

Choose a reason for hiding this comment

StefanKarpinski commented Jul 11, 2016

stevengj commented Jul 11, 2016

StefanKarpinski commented Jul 12, 2016

kmsquire commented Jul 12, 2016

tkelman commented Jul 12, 2016 • edited Loading

StefanKarpinski commented Jul 12, 2016

StefanKarpinski Jul 12, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanKarpinski commented Jul 12, 2016

stevengj commented Jul 12, 2016

stevengj commented Jul 12, 2016

StefanKarpinski commented Jul 12, 2016

stevengj commented Jul 12, 2016

stevengj commented Jul 8, 2016 •

edited

Loading

StefanKarpinski commented Jul 11, 2016 •

edited

Loading

stevengj commented Jul 11, 2016 •

edited

Loading

stevengj commented Jul 11, 2016 •

edited

Loading

stevengj commented Jul 11, 2016 •

edited

Loading

StefanKarpinski commented Jul 11, 2016 •

edited

Loading

stevengj commented Jul 11, 2016 •

edited

Loading

tkelman commented Jul 12, 2016 •

edited

Loading

StefanKarpinski Jul 12, 2016 •

edited

Loading