Require utf-8 when specifying character encoding #3091

Merged
annevk merged 3 commits into master from sideshowbarker/require-utf-8
Oct 6, 2017

Conversation

sideshowbarker
Contributor

@sideshowbarker sideshowbarker commented Oct 3, 2017

This addresses #3006.

@annevk
Member

annevk commented Oct 3, 2017

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

@domenic
Member

domenic commented Oct 3, 2017

I am in support of doing this everywhere. E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

I haven't reviewed the commits yet, but will do so soon, under the assumption that we're gonna go all the way.

@hsivonen
Member

hsivonen commented Oct 4, 2017

Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for <script charset> since it seems fine to start there.

I think we should nudge authors towards making everything UTF-8. I'm still a bit worried about authors reacting to an error in a silly way: Making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.
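
To make the distinction concrete, here is a minimal Python sketch of a check that complains about the resource first and about the label only second. The function name `encoding_message` and the message wording are hypothetical and are not taken from the validator.

```python
def encoding_message(declared_label, body):
    """Return a hypothetical validator message, or None if nothing is wrong.

    declared_label: the label from a charset attribute or declaration, or None.
    body: the raw bytes of the resource.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Complain about the resource itself, so that the obvious "fix" is to
        # re-encode the file rather than to edit the attribute value.
        return "The resource is not encoded as UTF-8. Re-save it as UTF-8."
    if declared_label is not None and declared_label.lower() != "utf-8":
        return "The resource decodes as UTF-8, but it is labeled as something else."
    return None


# A file saved as windows-1252 but labeled "utf-8": the silly fix described above.
legacy_bytes = "caf\u00e9".encode("windows-1252")
print(encoding_message("utf-8", legacy_bytes))
print(encoding_message("utf-8", "caf\u00e9".encode("utf-8")))
```

Note that a successful strict decode only shows the bytes are valid UTF-8; a pure-ASCII file passes regardless of how it was saved.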

As for script vs. link, I think non-UTF-8 CSS is more harmful than non-UTF-8 JS, because style sheet encoding gets inherited into URL parsing (i.e. URLs become context-dependent and don't work in the URL bar) but JS encoding doesn't get inherited anywhere.
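
As a rough illustration of why an inherited encoding makes URLs context-dependent, the Python sketch below percent-encodes the same character under two encodings. It uses `urllib.parse.quote` from the standard library as a stand-in; it is not the URL Standard's parsing algorithm, and the helper name `percent_encode_query_value` is made up for this example.

```python
from urllib.parse import quote


def percent_encode_query_value(text, encoding):
    """Percent-encode text after converting it to bytes in the given encoding."""
    return quote(text, safe="", encoding=encoding)


# The same character produces different bytes, and therefore a different URL,
# depending on the encoding that the URL parser inherits.
print(percent_encode_query_value("\u00e9", "utf-8"))         # %C3%A9
print(percent_encode_query_value("\u00e9", "windows-1252"))  # %E9
```

A URL copied out of a windows-1252 style sheet and pasted into the URL bar would be serialized the first way, not the second, so the server may see a different resource name.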

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it. I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

Member

@zcorpan zcorpan left a comment

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

source Outdated

<p class="note">A character encoding declaration is required (either in the <span
<div class="note">
<p>A character encoding declaration is required (either in the <span

indent by a space

source Outdated
data-x="Content-Type">Content-Type metadata</span> or explicitly in the file) even when all
characters are in the ASCII range, because a character encoding is needed to process non-ASCII
characters entered by the user in forms, in URLs generated by scripts, and so forth.</p>
<p>Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings,

Insert a blank line between paragraphs.

data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
encoding used must be an <span>ASCII-compatible encoding</span>.</p>

<p>Authors should use <span>UTF-8</span>. Conformance checkers may advise authors against using

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear (since the meta encoding declaration is itself optional and encoding could be specified in HTTP/BOM/XML decl).

Contributor Author

Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear

OK, 94517b3 attempts to do that

@annevk
Member

annevk commented Oct 4, 2017

Yeah, most of #3006 ends up withdrawn. I need to open a separate PR for the minor things I fixed on the side.

@sideshowbarker
Contributor Author

When Encoding initially required this there was a little bit of fear it might be too soon.

I know — but that was nearly 5 years ago (January 2013). So finally requiring UTF-8 in HTML almost 5 years after Encoding initially required it doesn’t seem like we’re exactly rushing things…

So maybe we should split it out for <script charset> since it seems fine to start there.

I’m OK with just merging the <script charset> part for now if that’s all we can get agreement on at the moment, but if we were to do that, I wonder how we then decide what process we follow for deciding when to finally go all the way with the rest?

I assume we’d agree we don’t want to wait, say, another 5 years. But short of that it’s not clear to me how we can measure when it’s no longer too soon and we’re instead finally ready to go forward with it.

So it seems like instead we just need to choose some point at which to do it, and then finally just do it.

@sideshowbarker
Contributor Author

I'm still a bit worried about authors reacting to an error in a silly way: Making the charset attribute UTF-8 without changing the encoding of the resource to UTF-8.

Yeah, agreed that would be a counterproductive outcome

I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this.

OK, that’s doable of course — but to be clear: in the validator architecture, that message would need to come from the parser code, right?

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, I’ll make that change.

I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the meta case.)

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

@sideshowbarker
Contributor Author

#3006 is a pull request -- I suppose this addresses #3004 and replaces #3006?

Yeah (as @annevk noted)

@annevk
Member

annevk commented Oct 4, 2017

@sideshowbarker it seems that everyone who commented here is okay with going ahead with it, so let's (finally) do it.

@sideshowbarker
Contributor Author

Reviewing the patch:

- <p>Authoring tools should default to using <span>UTF-8</span> for newly-created documents. <ref
-  spec=ENCODING></p>

It seems to me that the patch should upgrade this to a MUST instead of removing it.

OK, made it so in 769d6fe

@hsivonen
Member

hsivonen commented Oct 4, 2017

OK, that’s doable of course — but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

@hsivonen
Member

hsivonen commented Oct 4, 2017

Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.

I meant the same thing as @zcorpan meant in the comment right after mine.

@sideshowbarker
Contributor Author

OK, that’s doable of course — but to be clear: in the validator architecture, that message would need to come from the parser code, right?

While the parser could make sense for meta, a datatype in the datatype library would make more sense especially for link and script.

Aha yeah OK I’ll add a datatype checker for it that way to the validator sources
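
For illustration only, a datatype-style check for a `charset` attribute value could look roughly like the Python sketch below. This is not the validator's code (the names `ascii_lowercase` and `check_charset_value` are hypothetical); it only shows the rule under discussion: the value has to be an ASCII case-insensitive match for "utf-8", and the message points at the resource as well as the attribute.

```python
def ascii_lowercase(value):
    """Lowercase only the ASCII letters A-Z, leaving every other character alone."""
    return "".join(chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in value)


def check_charset_value(value):
    """Return None when the value is acceptable, otherwise an error message."""
    if ascii_lowercase(value) == "utf-8":
        return None
    # Hypothetical wording: mention the resource so that authors do not simply
    # swap the label without re-encoding the file.
    return (
        f"Bad value '{value}' for attribute charset: the only allowed label is "
        "'utf-8', and the resource itself must be encoded as UTF-8."
    )


for candidate in ("utf-8", "UTF-8", "utf8", "iso-8859-1"):
    print(candidate, "->", check_charset_value(candidate) or "ok")
```

Other labels that the Encoding Standard maps to UTF-8, such as "utf8", are rejected here on the assumption that only the exact "utf-8" label is conforming.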

@annevk
Member

annevk commented Oct 4, 2017

We should update those too.

@zcorpan
Member

zcorpan commented Oct 4, 2017

I agree about text/html. But I think we should probably separate accept-encoding in order to do proper reasoning and compat analysis for that.

Contributor Author

@sideshowbarker sideshowbarker left a comment

“Update text/html registration” change LGTM

@domenic
Member

domenic commented Oct 5, 2017

Per #3006 (comment), I was thinking we should make charset="utf-8" on script elements obsolete but conforming (i.e. validators display a warning), since in a UTF-8 document it is redundant, and we've recently been making redundant script attributes obsolete but conforming. This would mean the charset attribute on script gets a treatment similar to type on style.

@sideshowbarker
Contributor Author

I was thinking we should make charset="utf-8" on script elements obsolete but conforming… This would mean the charset attribute on script gets a treatment similar to type on style.

Yes, will update the source on this branch to do that

@domenic
Member

domenic commented Oct 5, 2017

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

@sideshowbarker
Contributor Author

I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).

Looks beautiful 🎉

This change adds a “must” requirement for UTF-8 in all but one of the places in
the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”,
which includes the HTTP Content-Type header sent with any document, the
`<meta charset>` element, and the `<meta http-equiv=content-type>` element.

Along with those, this change also makes UTF-8 required for `<script charset>`
but also moves `<script charset>` to being obsolete-but-conforming (because now
that both documents and scripts are required to use UTF-8, it’s redundant to
specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds
a specific citation to the relevant requirement from the Encoding standard, and
updates the in-spec IANA registration for text/html media type to indicate that
UTF-8 is required. Finally, it changes an existing requirement for authoring
tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for
the `form` element’s `accept-charset` attribute. For that, see issue #3097.
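
As a rough sketch of what the requirement covers, the Python below collects the encoding labels declared through the three mechanisms named above (the HTTP `Content-Type` header, `<meta charset>`, and `<meta http-equiv=content-type>`) and reports any that are not "utf-8". It relies on the standard library's MIME parameter parsing and `html.parser`, which only approximate the algorithms in the HTML spec; the names `charset_from_content_type`, `MetaCharsetFinder`, and `non_utf8_declarations` are invented for this example.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Collect labels declared via <meta charset> or <meta http-equiv=content-type>."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            self.labels.append(attributes["charset"].lower())
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            label = charset_from_content_type(attributes.get("content") or "")
            if label:
                self.labels.append(label)


def non_utf8_declarations(content_type_header, markup):
    """Return every declared label that is not "utf-8"."""
    labels = []
    header_label = charset_from_content_type(content_type_header)
    if header_label:
        labels.append(header_label)
    finder = MetaCharsetFinder()
    finder.feed(markup)
    labels.extend(finder.labels)
    return [label for label in labels if label != "utf-8"]


print(non_utf8_declarations("text/html; charset=ISO-8859-1", '<meta charset="UTF-8">'))
```

A document with no declaration at all yields an empty list here, which matches the point that the declaration is optional while the UTF-8 encoding itself is not.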
Member

@annevk annevk left a comment

I found a couple more nits. Happy to fix these later today.

source Outdated
<p>The Encoding standard requires use of the <span>UTF-8</span> <span data-x="encoding">character
encoding</span> and requires use of the "<code data-x="">utf-8</code>" <span>encoding label</span>
to identify it. Those requirements necessitate that the document's <span>character encoding
declaration</span>, if it exists, specify an <span>encoding label</span> using an <span>ASCII

specifies?

source Outdated
case-insensitive</span> match for the string "<code data-x="">utf-8</code>". Regardless of whether
a <span>character encoding declaration</span> is present or not, the actual <span
data-x="document's character encoding">character encoding</span> used to store or transmit the
document must be <span>UTF-8</span>. <ref spec=ENCODING></p>

Maybe say "to encode the document". Storage and transmission have little to do with text encoding.

source Outdated
data-x="attr-script-async">async</code>, and <code data-x="attr-script-defer">defer</code>
attributes. Authors should omit the attribute instead of redundantly setting it.</p></li>
<code data-x="attr-script-async">async</code> and <code data-x="attr-script-defer">defer</code>
attributes (as well as the legacy <code data-x="attr-script-charset">charset</code> attribute).

Other legacy attributes can influence the processing model as well, but we don't mention them here. Is this really needed?

source Outdated
changes to the base URL also have no effect -->
<code data-x="attr-script-integrity">integrity</code> attributes dynamically has no direct effect;
these attributes are only used at specific times described below. (The same is true for the legacy
<code data-x="attr-script-charset">charset</code> attribute.</p> <!-- by implication, changes to

Again, I don't think we mention the other legacy attributes here.

source Outdated
<code>script</code> element must <span>reflect</span> the element's
<code data-x="attr-script-event">event</code> content attribute.</p>
<p>The <dfn><code data-x="dom-script-event">event</code></dfn> and <dfn><code
data-x="dom-script-charset">charset</code></dfn> IDL attributes of the <code>script</code> element

I think we do these in alphabetical order normally.

Member

@annevk annevk left a comment

I think it looks good now, but someone should probably double check my edit.

@annevk annevk merged commit fae77e3 into master Oct 6, 2017
@annevk annevk deleted the sideshowbarker/require-utf-8 branch October 6, 2017 10:09
@annevk annevk restored the sideshowbarker/require-utf-8 branch October 6, 2017 10:12
@annevk annevk deleted the sideshowbarker/require-utf-8 branch October 6, 2017 10:12
@annevk annevk added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Dec 14, 2017
@annevk
Member

annevk commented Dec 14, 2017

Sure thing, done.

@duerst

duerst commented Dec 31, 2017

domenic commented on Oct 4:

E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread.

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

@sideshowbarker
Contributor Author

sideshowbarker commented Dec 31, 2017

It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.

https://w3techs.com/technologies/history_overview/character_encoding/ms/y has up-to-date data:

| Encoding | 2012 Jan | 2013 Jan | 2014 Jan | 2015 Jan | 2016 Jan | 2017 Jan | 2017 Dec |
|---|---|---|---|---|---|---|---|
| UTF-8 | 68.0% | 74.7% | 78.7% | 82.3% | 86.0% | 88.2% | 90.5% |
| ISO-8859-1 | 17.2% | 13.5% | 10.8% | 9.3% | 6.9% | 5.5% | 4.3% |
| Windows‑1251 | 3.3% | 2.8% | 2.7% | 2.2% | 1.9% | 1.7% | 1.5% |
| Shift JIS | 1.7% | 1.4% | 1.4% | 1.3% | 1.1% | 1.0% | 0.8% |

So the 5-6 year trend is, UTF-8 usage has grown from 68% in January 2012 to over 90% now.

And while it does show the rate of increase leveling off a bit, over the last 3 years it’s still been growing at over 2 percentage points per year.
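
The growth rate can be checked with a few lines of Python. This is only a back-of-the-envelope calculation on the W3Techs figures quoted above; the helper name `points_per_year` is made up for the example, and it measures percentage points per year rather than relative growth.

```python
# UTF-8 share (percent of sites) copied from the W3Techs figures quoted above,
# keyed by (year, month).
UTF8_SHARE = {(2012, 1): 68.0, (2015, 1): 82.3, (2017, 12): 90.5}


def points_per_year(start, end):
    """Average change in percentage points per year between two (year, month) keys."""
    months = (end[0] - start[0]) * 12 + (end[1] - start[1])
    return (UTF8_SHARE[end] - UTF8_SHARE[start]) / (months / 12)


print(round(points_per_year((2015, 1), (2017, 12)), 2))  # 2.81 (last ~3 years)
print(round(points_per_year((2012, 1), (2017, 12)), 2))  # 3.8 (whole period)
```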

Labels: document conformance, i18n-needs-resolution