Require utf-8 when specifying character encoding #3091

sideshowbarker · 2017-10-03T13:12:09Z

This addresses #3006.

annevk · 2017-10-03T13:40:22Z

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

domenic · 2017-10-03T15:02:59Z

I am in support of doing this everywhere. E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

I haven't reviewed the commits yet, but will do so soon, under the assumption that we're gonna go all the way.

hsivonen · 2017-10-04T07:20:47Z

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

I think we should nudge authors towards making everything UTF-8. I am still a bit worried about authors reacting to an error in a silly way: making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.
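
To illustrate the worry above, here is a minimal sketch of what happens when only the label changes. The sample string and the helper name `is_valid_utf8` are assumptions made for illustration; nothing here comes from the validator or the patch.

```python
# Hypothetical sample: text saved in a legacy encoding (windows-1252).
text = "café"
legacy_bytes = text.encode("windows-1252")  # the bytes stay legacy-encoded


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Relabeling the resource as utf-8 does not make its bytes UTF-8.
label_only_fix = is_valid_utf8(legacy_bytes)

# Re-encoding the resource itself is the real fix.
real_fix = is_valid_utf8(text.encode("utf-8"))

# A lenient decoder that trusts the new label substitutes U+FFFD
# for the byte it cannot interpret.
garbled = legacy_bytes.decode("utf-8", errors="replace")
```

This is why a message that talks about the resource's encoding, rather than the attribute value, is likely to steer authors toward the right change.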

As for script vs. link, I think non-UTF-8 CSS is more harmful than non-UTF-8 JS, because style sheet encoding gets inherited into URL parsing (i.e. URLs become context-dependent and don't work in the URL bar) but JS encoding doesn't get inherited anywhere.
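
As a rough illustration of why encoding-dependent URLs are a problem, the sketch below percent-encodes the same text under two encodings using Python's `urllib.parse.quote`. It only approximates what a browser's URL parser does with a query string, and the sample value is an assumption.

```python
from urllib.parse import quote

# Hypothetical non-ASCII value that ends up in a URL query string.
query_value = "café"

# Percent-encoding when the surrounding context is UTF-8.
as_utf8 = quote(query_value, encoding="utf-8")

# Percent-encoding when the context is a legacy encoding instead.
as_legacy = quote(query_value, encoding="windows-1252")

# The two results differ, so the same source URL would point at
# different resources depending on where it was parsed.
same_url = as_utf8 == as_legacy
```

A URL typed into the URL bar has no style sheet to inherit an encoding from, which is why the legacy form does not round-trip there.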

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it. I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

zcorpan

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

zcorpan · 2017-10-04T07:16:07Z

source


- <p class="note">A character encoding declaration is required (either in the <span
+ <div class="note">
+ <p>A character encoding declaration is required (either in the <span


indent by a space

zcorpan · 2017-10-04T07:16:30Z

source

 data-x="Content-Type">Content-Type metadata</span> or explicitly in the file) even when all
 characters are in the ASCII range, because a character encoding is needed to process non-ASCII
 characters entered by the user in forms, in URLs generated by scripts, and so forth.</p>
+ <p>Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings,


Insert a blank line between paragraphs.

zcorpan · 2017-10-04T07:23:40Z

source

- data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
- encoding used must be an <span>ASCII-compatible encoding</span>.</p>
-
- <p>Authors should use <span>UTF-8</span>. Conformance checkers may advise authors against using


Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear (since the meta encoding declaration is itself optional and encoding could be specified in HTTP/BOM/XML decl).

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear

OK, 94517b3 attempts to do that

annevk · 2017-10-04T07:30:06Z

Yeah, most of #3006 ends up withdrawn. I need to open a separate PR for the minor things I fixed on the side.

sideshowbarker · 2017-10-04T07:30:47Z

When Encoding initially required this there was a little bit of fear it might be too soon.

I know — but that was nearly 5 years ago (January 2013). So finally requiring UTF-8 in HTML almost 5 years after Encoding initially required it doesn’t seem like we’re exactly rushing things…

So maybe we should split it out for <script charset> since it seems fine to start there.

I’m OK with just merging the <script charset> part for now if that’s all we can get agreement on at the moment, but if we were to do that, I wonder how we then decide what process we follow for deciding when to finally go all the way with the rest?

I assume we’d agree we don’t want to wait, say, another 5 years. But short of that it’s not clear to me how we can measure when it’s no longer too soon and we’re instead finally ready to go forward with it.

So it seems like instead we just need to choose some point at which to do it, and then finally just do it.

sideshowbarker · 2017-10-04T07:39:02Z

I am still a bit worried about authors reacting to an error in a silly way: making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

Yeah, agreed that would be a counterproductive outcome

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

Reviewing the patch:
- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>
It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, I’ll make that change.

I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

sideshowbarker · 2017-10-04T07:40:24Z

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

Yeah (as @annevk noted)

annevk · 2017-10-04T07:42:19Z

@sideshowbarker it seems that everyone who commented here is okay with going ahead with it, so let's (finally) do it.

sideshowbarker · 2017-10-04T08:25:04Z

Reviewing the patch:
- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>
It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, made it so in 769d6fe

hsivonen · 2017-10-04T09:25:05Z

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

hsivonen · 2017-10-04T09:27:06Z

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

I meant the same thing as @zcorpan meant in the comment right after mine.

sideshowbarker · 2017-10-04T09:28:42Z

OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

Aha yeah OK I’ll add a datatype checker for it that way to the validator sources
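
For readers unfamiliar with what such a check involves, here is a minimal sketch in Python. It is not the validator's actual datatype code; the function names and the message wording are assumptions, with the wording following the suggestion above that the message should be about the resource's encoding.

```python
from typing import Optional


def ascii_lowercase(value: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in value
    )


def check_charset_value(value: str) -> Optional[str]:
    """Return None if the value is acceptable, else an error message.

    Hypothetical checker: accepts only an ASCII case-insensitive
    match for the string "utf-8".
    """
    if ascii_lowercase(value) == "utf-8":
        return None
    return (
        "The resource must be encoded as UTF-8. "
        f'Found charset value "{value}". '
        "Re-encode the resource as UTF-8 instead of only changing this attribute."
    )
```

The comparison is deliberately limited to ASCII case folding, so `UTF-8` and `Utf-8` pass while other labels, including legacy ones, produce the message.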

zcorpan · 2017-10-04T10:23:26Z

What do we think of
https://html.spec.whatwg.org/multipage/forms.html#the-form-element:encoding-label
https://html.spec.whatwg.org/multipage/iana.html#text/html

annevk · 2017-10-04T11:20:47Z

We should update those too.

zcorpan · 2017-10-04T14:30:13Z

I agree about text/html. But I think we should probably separate accept-charset in order to do proper reasoning and compat analysis for that.

sideshowbarker

“Update text/html registration” change LGTM

domenic · 2017-10-05T04:36:32Z

Per #3006 (comment), I was thinking we should make charset="utf-8" on script elements obsolete but conforming (i.e. validators display a warning), since in a UTF-8 document it is redundant, and we've recently been making redundant script attributes obsolete but conforming. This would mean the charset attribute on script gets a treatment similar to type on style.

sideshowbarker · 2017-10-05T06:38:12Z

I was thinking we should make charset="utf-8" on script elements obsolete but conforming… This would mean the charset attribute on script gets a treatment similar to type on style.

Yes, will update the source on this branch to do that

domenic · 2017-10-05T20:35:09Z

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

sideshowbarker · 2017-10-05T23:37:12Z

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

Looks beautiful 🎉

This change adds a “must” requirement for UTF-8 in all but one of the places in the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required.

Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.

annevk

I found a couple more nits. Happy to fix these later today.

annevk · 2017-10-06T04:46:58Z

source

+ <p>The Encoding standard requires use of the <span>UTF-8</span> <span data-x="encoding">character
+ encoding</span> and requires use of the "<code data-x="">utf-8</code>" <span>encoding label</span>
+ to identify it. Those requirements necessitate that the document's <span>character encoding
+ declaration</span>, if it exists, specify an <span>encoding label</span> using an <span>ASCII


annevk · 2017-10-06T04:49:13Z

source

+ case-insensitive</span> match for the string "<code data-x="">utf-8</code>". Regardless of whether
+ a <span>character encoding declaration</span> is present or not, the actual <span
+ data-x="document's character encoding">character encoding</span> used to store or transmit the
+ document must be <span>UTF-8</span>. <ref spec=ENCODING></p>


Maybe say "to encode the document". Storage and transmission have little to do with text encoding.

annevk · 2017-10-06T04:53:12Z

source

- data-x="attr-script-async">async</code>, and <code data-x="attr-script-defer">defer</code>
- attributes. Authors should omit the attribute instead of redundantly setting it.</p></li>
+ <code data-x="attr-script-async">async</code> and <code data-x="attr-script-defer">defer</code>
+ attributes (as well as the legacy <code data-x="attr-script-charset">charset</code> attribute).


Other legacy attributes can influence the processing model as well, but we don't mention them here. Is this really needed?

annevk · 2017-10-06T04:54:25Z

source

- changes to the base URL also have no effect -->
+ <code data-x="attr-script-integrity">integrity</code> attributes dynamically has no direct effect;
+ these attributes are only used at specific times described below. (The same is true for the legacy
+ <code data-x="attr-script-charset">charset</code> attribute.</p> <!-- by implication, changes to


Again, I don't think we mention the other legacy attributes here.

annevk · 2017-10-06T04:56:08Z

source

- <code>script</code> element must <span>reflect</span> the element's
- <code data-x="attr-script-event">event</code> content attribute.</p>
+ <p>The <dfn><code data-x="dom-script-event">event</code></dfn> and <dfn><code
+ data-x="dom-script-charset">charset</code></dfn> IDL attributes of the <code>script</code> element


I think we do these in alphabetical order normally.

annevk

I think it looks good now, but someone should probably double check my edit.

annevk · 2017-12-14T15:55:58Z

Sure thing, done.

duerst · 2017-12-31T00:52:56Z

domenic commented on Oct 4:

E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

sideshowbarker · 2017-12-31T03:49:57Z

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

https://w3techs.com/technologies/history_overview/character_encoding/ms/y has up-to-date data:

Encoding	2012 Jan	2013 Jan	2014 Jan	2015 Jan	2016 Jan	2017 Jan	2017 Dec
UTF-8	68.0%	74.7%	78.7%	82.3%	86.0%	88.2%	90.5%
ISO-8859-1	17.2%	13.5%	10.8%	9.3%	6.9%	5.5%	4.3%
Windows‑1251	3.3%	2.8%	2.7%	2.2%	1.9%	1.7%	1.5%
Shift JIS	1.7%	1.4%	1.4%	1.3%	1.1%	1.0%	0.8%

So the 5-6 year trend is, UTF-8 usage has grown from 68% in January 2012 to over 90% now.

And while it does show the rate of increase leveling off a bit, over the last 3 years it’s still been growing at over 2 percentage points per year.
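
The arithmetic behind that reading of the table can be checked with a short sketch. The figures are the UTF-8 row quoted above; the variable names are illustrative.

```python
# UTF-8 share (percent of sites) from the table above.
utf8_share = [
    ("2012-01", 68.0),
    ("2013-01", 74.7),
    ("2014-01", 78.7),
    ("2015-01", 82.3),
    ("2016-01", 86.0),
    ("2017-01", 88.2),
    ("2017-12", 90.5),
]

# Gain in percentage points between consecutive samples.
# The last interval covers 11 months rather than a full year.
gains = [
    round(later - earlier, 1)
    for (_, earlier), (_, later) in zip(utf8_share, utf8_share[1:])
]

# Overall gain from January 2012 to December 2017.
total_gain = round(utf8_share[-1][1] - utf8_share[0][1], 1)

# Check that each of the last three intervals gained more than 2 points.
still_above_two = all(gain > 2 for gain in gains[-3:])
```

The per-interval gains shrink from 6.7 points in 2012 to a little over 2 points in the two most recent intervals, which matches the description of growth that is leveling off but still positive.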

sideshowbarker mentioned this pull request Oct 3, 2017

Make <script charset> non-conforming #3006

Closed

domenic added the document conformance label Oct 3, 2017

zcorpan reviewed Oct 4, 2017

zcorpan mentioned this pull request Oct 4, 2017

Consider restricting <form accept-charset> to utf-8 #3097

Closed

sideshowbarker commented Oct 5, 2017

sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from 13efba8 to dfef71a, October 5, 2017 06:19

sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from d891e56 to 7a64e46, October 6, 2017 01:25

sideshowbarker force-pushed the sideshowbarker/require-utf-8 branch from 7a64e46 to 4089e5c, October 6, 2017 01:26

Further minor cleanup around charset obsolescence

7950855

sideshowbarker mentioned this pull request Oct 6, 2017

Require UTF-8 w3c/html#1039

Closed

annevk reviewed Oct 6, 2017

address various nits and some grammar issues

716ab5f

annevk approved these changes Oct 6, 2017

annevk merged commit fae77e3 into master Oct 6, 2017

annevk deleted the sideshowbarker/require-utf-8 branch October 6, 2017 10:09

annevk restored the sideshowbarker/require-utf-8 branch October 6, 2017 10:12

annevk deleted the sideshowbarker/require-utf-8 branch October 6, 2017 10:12

annevk added the i18n-needs-resolution label (Issue the Internationalization Group has raised and looks for a response on) Dec 14, 2017

himorin mentioned this pull request Sep 12, 2019

Require UTF-8 #1039 w3c/i18n-activity#574

Closed
