This repository has been archived by the owner on Jul 30, 2019. It is now read-only.

Require UTF-8 #1039

Closed
sideshowbarker opened this issue Oct 6, 2017 · 4 comments
Labels
alreq Tracked by the arabic layout group i18n-comment substantive wide review

@sideshowbarker
Contributor

See whatwg/html#3091

edent added a commit that referenced this issue Mar 1, 2018
* Update spec to insist on UTF-8
* Fixes #1039
chaals pushed a commit that referenced this issue Mar 29, 2018
Fix #1039 

* Update spec to insist on UTF-8 for all new content
@r12a

r12a commented May 31, 2018

I've been reviewing the changes as part of the i18n WG review of HTML 5.3, and while we wholeheartedly support the idea of moving the Web fully to UTF-8, we have some concerns with the spec text.

As i understand it, the spec is trying to do three things:

  1. advise content authors what to do – for which a statement such as 'you MUST use utf-8' is ok.
  2. indicate to the browser implementers how to parse the HTML and deal with pages that use legacy encodings – we are also happy to see that that text is still there, including the bits that prohibit use of certain legacy encodings. (There was some concern that the default encoding, in the absence of any other information, would be a non-UTF-8 encoding, but to my mind that is ok.)
  3. indicate what is acceptable syntax for HTML – here is, i think, where our issue lies...

4.2.5.4. Specifying the document’s character encoding
https://w3c.github.io/html/document-metadata.html#specifying-the-documents-character-encoding says

> The only acceptable character encoding declaration for the modern web is UTF-8.
>
> The following restrictions apply to character encoding declarations:

The first sentence has a qualifier, 'for the modern web', which helps, but the last sentence is missing what i think is an important rider that appears in the WHATWG version of the spec, which says:

> The following restrictions apply to character encoding declarations for newly-created documents:

That, to me, sounds better (though not to Addison, who thinks it's untestable).

Otherwise, one wonders:

  1. What happens if i willfully ignore the requirement to use UTF-8?
  2. What about legacy documents that declare the page to be in other encodings?

The danger is that the current text implies that legacy documents no longer need to be supported by a browser, even if they use the encodings listed in the Encoding spec. That will cut off the use of many pages on the Web. We're looking for some kind of wording that clearly says that from now on only utf-8 is acceptable, but that also makes it clear that existing content will still work (as long as it doesn't use encodings that aren't interoperably implemented or encodings that carry significant security risks - those being indicated in the Encoding spec).

Another suggestion from i18n WG participants was to say SHOULD but immediately follow it by text that explains why content authors REALLY SHOULD avoid anything but utf-8. There was some text to that effect previously in section 4.2.5.4. Specifying the document’s character encoding, but it was removed in this update.

Similar comments are relevant to section 12.1 text/html
https://w3c.github.io/html/iana.html#text-html which says

> The parameter’s value must be an ASCII case-insensitive match for the string "utf-8".

Again, to my mind, adding 'for newly-created documents' would be more accurate here.
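
To make the quoted requirement concrete, here is a minimal sketch of an "ASCII case-insensitive match" against the string "utf-8". The function names `ascii_lower` and `is_utf8_charset` are hypothetical and only illustrate the comparison; this is not how any particular validator or browser implements it. The sketch folds only the letters A-Z, on the assumption that "ASCII case-insensitive" means non-ASCII characters are compared as-is.

```python
def ascii_lower(s: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave every other character untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def is_utf8_charset(value: str) -> bool:
    """Return True if value is an ASCII case-insensitive match for "utf-8"."""
    return ascii_lower(value) == "utf-8"


print(is_utf8_charset("UTF-8"))       # True
print(is_utf8_charset("Utf-8"))       # True
print(is_utf8_charset("iso-8859-5"))  # False
```

Under this reading, a charset parameter of "UTF-8" or "Utf-8" satisfies the rule, while any legacy encoding label does not, which is why the "for newly-created documents" qualifier matters for existing pages.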

@edent
Member

edent commented Jun 1, 2018

Good points both. Let me answer.

> What happens if i willfully ignore the requirement to use UTF-8?

Madame Puff - Ada Lovelace's beloved cat - will forever hide dead birds in your chimney.

Which is to say, that will depend on the browser. If you use ISO 8859-5 strange things may happen. As the document says later on:

> Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document’s character encoding by default.
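
As a rough illustration of why the document's encoding matters for URL encoding, the sketch below percent-encodes the same string under UTF-8 and under ISO 8859-5. It uses Python's `urllib.parse.quote` purely as a stand-in; it is an assumption that this mirrors what a browser does on form submission, and the sample string is made up for the example.

```python
from urllib.parse import quote

text = "Привет"  # sample Cyrillic string, chosen only for illustration

# Percent-encode the same characters using two different character encodings.
as_utf8 = quote(text, encoding="utf-8")
as_8859_5 = quote(text, encoding="iso-8859-5")

print(as_utf8)    # %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82
print(as_8859_5)  # %BF%E0%D8%D2%D5%E2
```

The two results carry the same text but different bytes, so a server that assumes UTF-8 will misread a query string produced from an ISO 8859-5 page unless it knows which encoding was used.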

I agree with your second question:

> What about legacy documents that declare the page to be in other encodings?

As we say in Rendering "User agents are not required to present HTML documents in any particular way."

I think it would be helpful to add a section to either Rendering or Obsolete Features to say how browsers should treat legacy content.

@r12a

r12a commented Jun 1, 2018

> Which is to say, that will depend on the browser. If you use ISO 8859-5 strange things may happen. As the document says later on:

Except that that isn't true, iiuc. The parsing of the input byte stream should detect that the intended encoding is ISO 8859-5 per section 8.2.2, and the Encoding spec then defines exactly how ISO 8859-5 should be supported by a browser. So no strange things should happen for that particular legacy encoding. It's only for encodings that aren't defined in the Encoding spec that strange things could happen, and in that case what happens depends on browser support (modulo, of course, browsers that don't support the standards, but that's a recipe for strange things happening whatever the topic).

If the upshot of the spec is now that strange things may happen to legacy encoded content that is covered by the provisions of the Encoding spec, then i think there's a problem, since we are putting in danger the future viability of that legacy content. Note that i agree that conversion to UTF-8 is an optimal solution, but it's not always possible, and unless there's a serious interop or security issue, i don't think we should abandon pages created in good faith in the past to future obscurity.
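
The point about ISO 8859-5 being fully defined can be sketched as follows. The example uses Python's built-in `iso-8859-5` codec as a stand-in for the decoder defined in the Encoding spec; treating the two as equivalent for these bytes is an assumption, and the byte sequence is a made-up sample.

```python
# Six bytes as they might appear in a legacy ISO 8859-5 page (sample data).
legacy_bytes = bytes([0xBF, 0xE0, 0xD8, 0xD2, 0xD5, 0xE2])

# Decoded with the declared legacy encoding, each byte maps to a defined character.
decoded = legacy_bytes.decode("iso-8859-5")
print(decoded)  # Привет

# Decoded strictly as UTF-8, the same bytes are not a valid sequence.
try:
    legacy_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```

This is the sense in which the legacy content stays viable: as long as the declared encoding is honoured, decoding is deterministic, whereas dropping support and falling back to UTF-8 would leave the page unreadable.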

chaals assigned edent and unassigned siusin Jun 19, 2018
chaals added this to the HTML5.3 WD5 milestone Jun 19, 2018
@edent
Member

edent commented Jul 4, 2018

Catching up with this. I agree with everything you've written. See incoming PR.

One minor thing:

> The following restrictions apply to character encoding declarations for newly-created documents:

We don't need the "for newly-created documents" qualifier, as the restrictions apply to legacy documents as well. The restrictions mentioned are about the serialisation of the declaration, and they've always applied. However, I've added back in the comment about the older character encodings. Thanks for pointing that out.
