Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export and document transcode #17323

Merged
merged 9 commits into from
Jul 12, 2016
Merged

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Jul 7, 2016

This exports and documents the transcode function from #16974. It also adds convenience methods transcode(String, x) == String(transcode(UInt8, x)) (an extremely common pattern in our own code) and transcode(T, s::String) == transcode(T, s.data) (since we'd prefer to avoid requiring that users muck with s.data).

I feel like this is a necessary prerequisite to #16590, in order to remove the need for the UTF16String and UTF32String types when calling C APIs.

For the same reason, I'd also prefer to add support for transcoding to/from UTF-32, for calling C APIs on systems where Cwchar_t == Int32. I'm a little confused about how that is supposed to work in #16590. It seems like a pretty basic implementation like transcode(::Type{Uint32}, s::String) = map(UInt32, s) should be fine.

@stevengj
Copy link
Member Author

stevengj commented Jul 7, 2016

cc @tkelman, @StefanKarpinski

@stevengj stevengj added the domain:unicode Related to unicode characters and encodings label Jul 7, 2016
@@ -121,8 +121,21 @@ end

# transcoding between data in UTF-8 and UTF-16 for Windows APIs

"""
transcode(T, src)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs rst signature so genstdlib can fill it in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was meaning to do that and forgot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@stevengj
Copy link
Member Author

stevengj commented Jul 7, 2016

Updated to include very basic UTF-32 conversions, especially for systems where Cwchar_t is Int32 or UInt32.

transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src) = transcode(T, transcode(String, src))
transcode{S<:Union{Int32,UInt32}}(T, src::Vector{S}) = transcode(T, transcode(String, src))
transcode{S<:Union{Int32,UInt32}}(::Type{String}, src::Vector{S}) = string(map(Char, src)...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this splat every character?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, is that a big problem? string has an optimized method for string(Char...) that I wanted to exploit.

Or should I create an IOBuffer and just print the characters one-by-one to it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The splatting would still have to go through dispatch for all arguments to determine it can use that method, I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splatting performance confuses me. I feel like there should be an optimized path for foo(array...) when array::Vector{T} and foo has a foo(T...) method.

Anyway, I'll change it to use an IOBuffer.

@tkelman
Copy link
Contributor

tkelman commented Jul 7, 2016

x-ref #17218 which we probably should have merged

@kshyatt kshyatt added the domain:docs This change adds or pertains to documentation label Jul 7, 2016
@nalimilan
Copy link
Member

In StringEncodings.jl, I have followed the naming from e.g. Python and used decode(x) == String(transcode(UInt8, x)) instead of adding a new method for transcode. There's also the reverse operation, called encode, which transforms a string into an array of bytes.

I think it would make sense to use this convention, and keep transcode for operations between arrays of UInt[8|16|32].

@stevengj
Copy link
Member Author

stevengj commented Jul 8, 2016

@nalimilan, why? I'd strongly prefer having fewer symbols if the meaning is clear, as I think it is here.

@stevengj
Copy link
Member Author

stevengj commented Jul 8, 2016

Okay, method ambiguities should be fixed.

@tkelman
Copy link
Contributor

tkelman commented Jul 9, 2016

needs rebase (probably trivial end-of-file), and failing commits should be squashed out

@nalimilan
Copy link
Member

@nalimilan, why? I'd strongly prefer having fewer symbols if the meaning is clear, as I think it is here.

Dunno, it feels more natural to me to use different functions for string -> bytes and bytes -> string. That approach is also used in the corresponding Rust package. OTOH, I couldn't find a language/package where transcode would do both operations depending on the input type. transcode usually takes an array of bytes and returns another one.

Note that the difference between the methods in this PR and those in StringEncodings.jl is that the latter take an encoding argument, since the integer size cannot be used to carry that information in all cases. This means the encoding argument would be in one case (bytes -> string) the source encoding, and in the other one the target encoding (string -> bytes).

"""
transcode(T, src)

Convert string data between Unicode encodings. `src` is either a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we usually use double space between sentences? (Here and below.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the Markdown formatter ignore extra spaces when it formats the result?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it just ignores extra whitespace inside paragraphs. Shouldn't be causing any problems, though we should probably use consistent formatting throughout.

@nalimilan
Copy link
Member

IOW: while I agree this feels natural here, where the scope is quite limited, I'm afraid it wouldn't be very natural when extending it to any encoding. Whether Base and StringEncodings should be completely consistent is up for debate, though.

@stevengj
Copy link
Member Author

stevengj commented Jul 9, 2016

Part of the difference with Python and Rust, I think, is that those are OOP languages and the methods in that case are "owned" by an object. In that case it is more natural to have string.encode(...) and string.decode(...) or similar. Whereas the transcode function is not owned by any particular object, so it makes sense to me to have it more symmetrical (the same function regardless of which way you are transcoding).

@StefanKarpinski StefanKarpinski added this to the 0.5.0 milestone Jul 11, 2016
@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jul 11, 2016

I merged #17218 so transcode is now documented and can be used by packages for calling Windows C APIs, alleviating some of the pressure on this issue.

This change significantly expands on the notion of transcode, which I specifically left very limited to translating between arrays of code units (and only the ones that were strictly necessary for Windows APIs). I'm not saying that we shouldn't make transcode more official and expand on it, as the PR does, but it's a move that bears discussion.

For example, the current transcode methods never raise errors. Instead they make a best-effort attempt to convert between different encodings, handling CESU-8, overlong encodings, and unpaired surrogates, as well as plain old invalid UTF-8 data (by interpreting it as Latin-1). That's a somewhat sane behavior for something as low level as passing String data to C functions: this will never raise any sort of encoding error on UNIX since no transcoding is involved, so it's arguably ok not raise an error on Windows either. The vision for the String type is that it be able to hold any data but treat it in a UTF-8-like manner, c.f. a proper UTF-8 string type which enforces valid encoding. That allows us to remove all Unicode-related machinery from Base, including error types.

If we add transcode(String, src) and transcode(T, ::String), however, then the logical conclusion is that we support transcode(<:AbstractString, src) and transcode(T, ::AbstractString) in general. This raises the question: which errors does this function throw and under what circumstances? If we have an answer to this, then I think we can proceed. Perhaps the answer is that transcode(String, src) never raises an error since String won't impose any restrictions on its data, so transcode simply does it's best to convert src to something UTF-8-like and sticks it in a String. It's less clear what to do about transcode methods that operate only on vectors of code units. Why don't those raise any errors?

More broadly, this raises and issue about errors and generic functions. Currently, we allow any method of a function raise any error. This makes it very hard to write generic code in a way that doesn't unexpected errors. At some point we'll want the errors that a function can throw to be part of the signature of the function in some sense so that we can know when generic code handles all the errors it can legitimately encounter. If we don't define encoding error types in Base, then it is in principle impossible do this for the transcode function without simply declaring that it may throw any kind of error at all. Perhaps we should have an EncodingError abstract type exception type that all string encoding errors subtype, so that we could say in base that the signature of transcode is something like this:

transcode{S<:AbstractString}(::Type{S}, src::AbstractString) throws EncodingError

Of course, that's not the whole signature since it only captures transcoding between strings.

@stevengj
Copy link
Member Author

stevengj commented Jul 11, 2016

@StefanKarpinski, the basic motivation for transcode(String, a) = String(transcode(UInt8, array)) is that String(transcode(UInt8, array)) is pretty much the only way that this is called when converting to UInt8, and it seems silly to force every caller to type String(...).

I really don't understand your point about exceptions. What about that is specific to transcode? You might as well say that it is impossible to use sum in generic code because callers can override it to do anything.

We should treat this like any other method: we document the behavior in Base (we don't throw exceptions for converting between Unicode encodings), and we support String and UIntXX arguments. If a package does something different for some other string type or encoding, it is up to them to document it.

@stevengj
Copy link
Member Author

stevengj commented Jul 11, 2016

@StefanKarpinski, I think you mean that you merged #17218. That adds a docstring for transcode, but does not document it for users. The docstring is purely internal (not in the manual), and the function is not exported.

@StefanKarpinski StefanKarpinski added the needs decision A decision on this change is needed label Jul 11, 2016
@StefanKarpinski
Copy link
Sponsor Member

Ok, ignore the issue about generic functions and exceptions if that doesn't bother you. I completely understand the motivation for adding convenience methods to transcode. However, this PR effectively makes transcode an official way to transcode between different string types and encodings, in which case we need to think harder about the API. Can you explain what you think the general policy of transcode raising encoding exceptions should be?

@stevengj
Copy link
Member Author

stevengj commented Jul 11, 2016

I think it should only raise an exception if it is impossible to losslessly represent the source data in the destination encoding.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jul 11, 2016

What does it mean to "losslessly represent" invalid string data? Does that mean that the round trip transcoding is guaranteed to be the identity? There are byte sequences which are invalid UTF-8 but which the current algorithm can round trip safely through UTF-16 anyway. There are other cases where UTF-8-like data can include non-canonical encodings that are useful and common (CESU-8, Modified UTF-8 and overlong encodings in general), but which do not round trip through UTF-16 – although they still encode the same sequence of code points. Is that ok? In still other cases, it's often reasonable to interpret invalid UTF-8 as Latin-1 and encode it as UTF-16 accordingly, but this doesn't round-trip, of course. Any sequence of UInt16 code points can be transcoded to UTF-8 and round-tripped back to UTF-16 identically, even invalid ones.

@stevengj
Copy link
Member Author

I would think that as long as it encodes to the same sequence of code points and/or it round-trips, it should be accepted.

@stevengj
Copy link
Member Author

stevengj commented Jul 11, 2016

How about something like:

The transcode function does not throw an error if the input data can be reasonably represented in the target encoding; it always succeeds for conversions between UTF-XX encodings, even for invalid Unicode data.

And then we leave it up to packages to decide on a reasonable meaning of "reasonably". (It's not like any other functions in Base impose formal constraints on user methods.)

@stevengj
Copy link
Member Author

Added a clarification to the docs that this function is supposed to handle even invalid data.

to return a `Vector{UIntXX}` of UTF-`XX` data. Only conversion
to or from UTF-8 is currently supported.
to return a `Vector{UIntXX}` of UTF-`XX` data. (The alias `Cwchar_t`
also be used as the integer type for converting `wchar_t*` strings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be used

@StefanKarpinski
Copy link
Sponsor Member

So here's another issue: as we've discussed before, we probably want encodings to be first class and write something like transcode(UTF8, src) which is somewhat at odds with using UInt8 as a way of expressing the UTF8 encoding is sort of sketchy. It's fine if this is not used for anything but interacting with Windows APIs, but it's not great as a more general API.

@stevengj
Copy link
Member Author

If more encodings are needed, they can always be added later, no? (Probably this is best left to a package...I can't see supporting Latin1 or EBCDIC or Windows-1252 in Base.)

As time goes on, it seems like UTF-xx will be the only encodings that matter, so it is not crazy to use UIntxx as a shorthand for them.

@StefanKarpinski
Copy link
Sponsor Member

Here's a possible policy on raising transcoding errors, and it's simple: never return an invalid object of the requested type. Since String will be able to hold any kind of data, transcode(String, src) never raises an error – it just does it's best to transcode src as String. For something like transcode(UTF16String, src) on the other hand, if you would have to produce an invalid UTF16String object, you should raise an error. This same policy explains why transcode methods that return vectors of code units (e.g. transcode(UInt16, src)) doesn't raise an error even though transcode(UTF16String, src) might: because even if the result of transcoding is invalid UTF-16 data, it's still a perfectly valid Vector{UInt16}.

This raises another question: should we change transcode(UInt16, src) to transcode(Vector{UInt16}, src)? It's a bit more verbose, but it would be more consistent.

@kmsquire
Copy link
Member

I like transcode(Vector{UInt16}, src). It has a nice parallel with convert ... Although, it almost seems that convert could be used instead?

@tkelman
Copy link
Contributor

tkelman commented Jul 12, 2016

We went through the argument about convert before, not everyone is a fan of associating strings and vectors of unsigned integers so closely via the convert name.

edit: I also like referring to the container type rather than the element type, echoes of #11557

@StefanKarpinski
Copy link
Sponsor Member

Since convert(Vector{UInt16}, UInt8[1,2,3]) already means something, we clearly can't use that. Another argument for allowing transcode(UInt16, src) is that we're likely to have transcode(UTF16, src) where UTF16 is an encoding object, not a string type. Of course, we'd probably also allow transcode(EncodedString{UTF16}, src) or however we end up spelling encoded string types. But in that case, we probably want to allow transcode(Vector{UInt16}, src) as well as transcode(UInt16, src). For now, I think just having the version we actually need to use is probably the best move. We can round out the rest of the API in base as the API for encoded strings gets rounded out in packages.

transcode{T<:Union{UInt8,UInt16,UInt32,Int32}}(::Type{T}, src::Vector{T}) = src
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::Vector{UInt8}) = transcode(T, String(src))
function transcode{S<:Union{Int32,UInt32}}(::Type{UInt8}, src::Vector{S})
Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski Jul 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for supporting signed 32-bit code units (Int32) but only unsigned code units for others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cwchar_t can be Int32, so we need to support the signed type for UTF-32.

I suppose it wouldn't hurt to support the signed types for Int8 and Int32 as well.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was arguing for not supporting Int32 either but the point about Cwchar_t makes sense. Let's leave it as is. I'd like to keep this minimal until we find we really need to do otherwise.

@StefanKarpinski
Copy link
Sponsor Member

Ok, I'm convinced. @stevengj – will you rebase and merge? Or I can do it since I merged the conflicting changes. Just let me know.

@stevengj
Copy link
Member Author

Rebased; click "merge" when green.

@stevengj
Copy link
Member Author

!#$&-ing NEWS.md conflicts.

@StefanKarpinski
Copy link
Sponsor Member

Can we not run CI again for a NEWS conflict fix?

@stevengj stevengj merged commit 5d045df into JuliaLang:master Jul 12, 2016
@stevengj stevengj deleted the export-transcode branch July 12, 2016 20:21
@stevengj
Copy link
Member Author

Okay.

mfasi pushed a commit to mfasi/julia that referenced this pull request Sep 5, 2016
* export and document transcode from JuliaLang#16974, add transcode(String, x) and transcode(T, ::String) convenience methods

* docs

* support UTF-32 in transcode

* don't use splatting for UTF-32 to String conversion

* typo

* eliminate method ambiguities

* re-run genstdlib

* doc clarification

* typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:docs This change adds or pertains to documentation domain:unicode Related to unicode characters and encodings needs decision A decision on this change is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants