Require UTF-8 #1039

sideshowbarker · 2017-10-06T01:41:55Z

* Update spec to insist on UTF-8 * Fixes #1039

Fix #1039 * Update spec to insist on UTF-8 for all new content

r12a · 2018-05-31T17:07:45Z

I've been reviewing the changes as part of the i18n WG review of HTML 5.3, and while we wholeheartedly support the idea of moving the Web fully to UTF-8, we have some concerns with the spec text.

As i understand it, the spec is trying to do three things:

* advise content authors what to do – for which a statement such as 'you MUST use utf-8' is ok.
* indicate to the browser implementers how to parse the HTML and deal with pages that use legacy encodings – we are also happy to see that that text is still there, including the bits that prohibit use of certain legacy encodings. (There was some concern that the default encoding, in the absence of any other information, would be a non-UTF-8 encoding, but to my mind that is ok.)
* indicate what is acceptable syntax for HTML – here is, i think, where our issue lies...

4.2.5.4. Specifying the document’s character encoding
https://w3c.github.io/html/document-metadata.html#specifying-the-documents-character-encoding says

The only acceptable character encoding declaration for the modern web is UTF-8.

The following restrictions apply to character encoding declarations:

The first sentence has a qualifier, 'for the modern web', which helps, but that last sentence is missing what i think is an important rider that appears in the WHATWG version of the spec, which says:

The following restrictions apply to character encoding declarations for newly-created documents:

That, to me, sounds better (though not to Addison, who thinks it's untestable).

Otherwise, one wonders:

* What happens if i willfully ignore the requirement to use UTF-8?
* What about legacy documents that declare the page to be in other encodings?

The danger is that the current text implies that legacy documents no longer need to be supported by a browser, even if they use the encodings listed in the Encoding spec. That will cut off the use of many pages on the Web. We're looking for some kind of wording that clearly says that from now on only utf-8 is acceptable, but that also makes it clear that existing content will still work (as long as it doesn't use encodings that aren't interoperably implemented or encodings that carry significant security risks - those being indicated in the Encoding spec).
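The three outcomes described above (UTF-8 is conforming, legacy encodings from the Encoding spec keep working, and risky encodings are not decoded) can be sketched as a small classifier. This is only an illustration: the function name and the result strings are hypothetical, and the two sets are deliberately tiny subsets chosen for the example. The Encoding spec remains the authoritative list.

```python
# Illustrative subsets only (assumption: chosen for this example).
# The Encoding spec is the authoritative source for the full lists.
LEGACY_SUPPORTED = {"iso-8859-5", "windows-1252", "shift_jis", "euc-jp", "gbk", "big5"}
MAPPED_TO_REPLACEMENT = {"hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr"}


def classify_declared_encoding(label: str) -> str:
    """Return a hypothetical status string for a declared encoding name."""
    name = label.strip().lower()
    if name == "utf-8":
        return "conforming"             # what new content is expected to use
    if name in LEGACY_SUPPORTED:
        return "legacy-but-decodable"   # existing pages should keep working
    if name in MAPPED_TO_REPLACEMENT:
        return "not-decoded"            # treated as unsafe to decode
    return "unknown"                    # outcome depends on the browser


if __name__ == "__main__":
    for label in ("UTF-8", "ISO-8859-5", "iso-2022-kr", "x-made-up"):
        print(label, "->", classify_declared_encoding(label))
```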

Another suggestion from i18n WG participants was to say SHOULD but immediately follow it by text that explains why content authors REALLY SHOULD avoid anything but utf-8. There was some text to that effect previously in section 4.2.5.4. Specifying the document’s character encoding, but it was removed in this update.

Similar comments are relevant to section 12.1 text/html
https://w3c.github.io/html/iana.html#text-html which says

The parameter’s value must be an ASCII case-insensitive match for the string "utf-8".

Again, to my mind, adding 'for newly-created documents' would be more accurate here.
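For readers unfamiliar with the phrase, an "ASCII case-insensitive match" folds only the letters A to Z before comparing. A minimal sketch, assuming the parameter value has already been extracted from the Content-Type header (the helper names below are hypothetical, not from the spec):

```python
import string

# Fold only ASCII uppercase letters; other characters are left untouched.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lowercase(value: str) -> str:
    """Lowercase A-Z only, unlike str.lower(), which is Unicode-aware."""
    return value.translate(_ASCII_FOLD)


def is_utf8_charset_param(value: str) -> bool:
    """True if the charset parameter is an ASCII case-insensitive match for "utf-8"."""
    return ascii_lowercase(value) == "utf-8"


if __name__ == "__main__":
    for candidate in ("utf-8", "UTF-8", "Utf-8", "utf8", "iso-8859-5"):
        print(candidate, is_utf8_charset_param(candidate))
```

Note that "utf8" (no hyphen) fails this check: the requirement is a match for the exact string, not for any label that an encoding lookup would resolve to UTF-8.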

edent · 2018-06-01T08:52:47Z

Good points both. Let me answer.

> What happens if i willfully ignore the requirement to use UTF-8?

Madame Puff - Ada Lovelace's beloved cat - will forever hide dead birds in your chimney.

Which is to say, that will depend on the browser. If you use ISO 8859-5 strange things may happen. As the document says later on:

Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document’s character encoding by default.

I agree with your second question:

> What about legacy documents that declare the page to be in other encodings?

As we say in Rendering, "User agents are not required to present HTML documents in any particular way."

I think it would be helpful to add a section to either Rendering or Obsolete Features to say how browsers should treat legacy content.
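The form-submission point quoted earlier in this comment can be illustrated with a rough sketch: the same field value produces different bytes on the wire depending on the document's encoding. This uses Python's standard library codecs as a stand-in, which is an assumption (they are not guaranteed to match the Encoding spec in every detail), and `encode_query` is a hypothetical helper, not browser behaviour.

```python
from urllib.parse import urlencode


def encode_query(fields: dict, document_encoding: str) -> str:
    """Percent-encode form fields using the given document encoding."""
    return urlencode(fields, encoding=document_encoding)


if __name__ == "__main__":
    fields = {"q": "п"}  # CYRILLIC SMALL LETTER PE
    # Two bytes in UTF-8, one byte in ISO 8859-5.
    print(encode_query(fields, "utf-8"))       # q=%D0%BF
    print(encode_query(fields, "iso-8859-5"))  # q=%DF
```

A server that assumes UTF-8 and receives `%DF` cannot decode it as intended, which is one way the "unexpected results" arise.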

r12a · 2018-06-01T09:28:02Z

> Which is to say, that will depend on the browser. If you use ISO 8859-5 strange things may happen. As the document says later on:

Except that that isn't true, iiuc. The parsing of the input byte stream should detect that the intended encoding is ISO 8859-5 per section 8.2.2, and the Encoding spec then defines exactly how ISO 8859-5 should be supported by a browser. So no strange things should happen for that particular legacy encoding. Strange things could happen only for encodings that aren't defined in the Encoding spec, where the outcome depends on browser support (modulo, of course, browsers that don't support the standards, but that's a recipe for strange things happening whatever the topic).

If the upshot of the spec is now that strange things may happen to legacy encoded content that is covered by the provisions of the Encoding spec, then i think there's a problem, since we are putting in danger the future viability of that legacy content. Note that i agree that conversion to UTF-8 is an optimal solution, but it's not always possible, and unless there's a serious interop or security issue, i don't think we should abandon pages created in good faith in the past to future obscurity.
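The argument above, that a declared legacy encoding leads to a well-defined decode, can be sketched as follows. This is a heavily simplified stand-in for the encoding sniffing in section 8.2.2: it ignores BOMs, HTTP headers, and most of the real prescan rules. The function names, the 1024-byte window, and the windows-1252 fallback are assumptions made for the example, and Python's codecs stand in for the Encoding spec's decoders.

```python
import codecs
import re

# Simplified: look for charset=... inside a <meta> tag (not the full prescan).
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-:.]+)""", re.IGNORECASE
)


def sniff_declared_encoding(raw: bytes, default: str = "windows-1252") -> str:
    """Return a codec name for the declared encoding, or the default."""
    match = _META_CHARSET.search(raw[:1024])
    if match is None:
        return default
    label = match.group(1).decode("ascii")
    try:
        return codecs.lookup(label).name
    except LookupError:
        return default  # unknown label: fall back rather than guess


def decode_document(raw: bytes) -> str:
    """Decode the page using whatever encoding it declares."""
    return raw.decode(sniff_declared_encoding(raw), errors="replace")


if __name__ == "__main__":
    page = '<meta charset="ISO-8859-5"><p>привет</p>'.encode("iso-8859-5")
    print(decode_document(page))                   # Cyrillic text survives
    print(page.decode("utf-8", errors="replace"))  # ignoring the declaration garbles it
```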

edent · 2018-07-04T14:15:50Z

Catching up with this. I agree with everything you've written. See incoming PR.

One minor thing:

> The following restrictions apply to character encoding declarations for newly-created documents:

We don't need the "for newly-created documents" qualifier, as the restriction applies to legacy documents as well. The restrictions mentioned are about the serialisation of the declaration. They've always applied. However, I've added back in the comment about the older character encodings. Thanks for pointing that out.

Fixes #1039

nschonni mentioned this issue Oct 6, 2017

Is WET4 specifically married to UTF-8? wet-boew/wet-boew#8168

Closed

chaals added i18n-comment substantive labels Oct 10, 2017

chaals assigned siusin Dec 14, 2017

edent added a commit that referenced this issue Mar 1, 2018

UTF-8 All The Things

ea41151

* Update spec to insist on UTF-8 * Fixes #1039

edent mentioned this issue Mar 1, 2018

UTF-8 All The Things #1273

Merged

chaals closed this as completed in #1273 Mar 29, 2018

chaals pushed a commit that referenced this issue Mar 29, 2018

UTF-8 All The Things (#1273)

8d82a21

Fix #1039 * Update spec to insist on UTF-8 for all new content

r12a reopened this May 31, 2018

r12a mentioned this issue Jun 1, 2018

Require UTF-8 #1039 w3c/i18n-activity#574

Closed

LJWatson added the wide review label Jun 1, 2018

LJWatson mentioned this issue Jun 1, 2018

HTML5.3 wide review tracker #1415

Closed

chaals assigned edent and unassigned siusin Jun 19, 2018

chaals added this to the HTML5.3 WD5 milestone Jun 19, 2018

edent added a commit that referenced this issue Jul 4, 2018

Reinstate legacy encoding information

59edb9c

Fixes #1039

edent mentioned this issue Jul 4, 2018

Editorial: Reinstate legacy encoding information #1504

Merged

LJWatson closed this as completed in #1504 Jul 9, 2018

LJWatson pushed a commit that referenced this issue Jul 9, 2018

Reinstate legacy encoding information (#1504)

b7b1a01

Fixes #1039

siusin mentioned this issue Jul 20, 2018

Accept-charset attribute of form element maybe have a bug [was: Issue #1421] #1523

Closed

r12a added the alreq (Tracked by the arabic layout group) label Nov 16, 2018
