Make <script charset> non-conforming #3006

annevk · 2017-09-05T12:01:10Z

domenic · 2017-09-05T15:42:18Z

I support this generally but would like more input from people like @zcorpan and @sideshowbarker. Especially if @sideshowbarker were able to get us stats on how often this appears, if they were low, that would make me feel better.

It's also worth considering that by disallowing this you are disallowing people using <script charset="utf-8"> on non-utf-8 pages (e.g. pages left with the default windows-1252).

sideshowbarker · 2017-09-06T04:23:57Z

if @sideshowbarker were able to get us stats on how often this appears, if they were low, that would make me feel better

OK, I’ll add add a use counter for it to the HTML checker later today

annevk · 2017-09-06T07:14:57Z

@sideshowbarker you want to measure <link rel=stylesheet charset> too then. To see if it's justified one is conforming and the other is not. I suspect neither is justified.

zcorpan · 2017-09-07T13:36:55Z

link charset was defined as doing something (but not made conforming) for stylesheets in 29e0e33
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23684

script charset was defined as doing something and made conforming in e74a98d
https://lists.w3.org/Archives/Public/public-whatwg-archive/2008May/0137.html

zcorpan · 2017-09-07T13:51:43Z

In the httparchive data

SELECT
  page,
  REGEXP_EXTRACT(match, r'<(link|script)') AS element,
  REGEXP_EXTRACT(match, r'charset\s*=\s*["\']?([a-z0-9_-]+)') AS charset
FROM (
  SELECT page, REGEXP_EXTRACT(LOWER(body), r'(<(?:link|script)(?:\s+[^>]+)?\scharset\s*=\s*["\']?[a-z0-9_-]+)') AS match
  FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
  WHERE page = url
)
WHERE match != "null"

49497 matches, of which 45180 the element is script.

zcorpan · 2017-09-07T14:05:46Z

I notice now from looking at view-source:https://www.z-wave.me/ that the above query matches this

<link rel="shortcut icon" href="https://www.z-wave.me/fileadmin/template/img/favicon.ico" type="text/plain; charset=utf-16le" />

so, need to write a more complicated query that tries to tokenize attributes. (I've done that before so can copy from an earlier query.)

zcorpan · 2017-09-08T08:34:49Z

New query that skips over whole attributes and only counts matches where the domain in src is different from the page's domain (for #3005):

SELECT * FROM (
  SELECT
    REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
    REGEXP_EXTRACT(match, r'<(link|script)') AS element,
    REGEXP_EXTRACT(match, r'<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*(?:src|href)\s*=\s*["\']?(?:https?:)?//([^/:"\']+)') AS srcDomain,
    REGEXP_EXTRACT(match, r'<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*charset\s*=\s*["\']?([a-z0-9_-]+)') AS charset
  FROM (
    SELECT page, REGEXP_EXTRACT(LOWER(body), r'(<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*charset\s*=[^>]+>)') AS match
    FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
    WHERE page = url
  )
  WHERE match != "null"
)
WHERE srcDomain != "null"
AND charset != "null"
AND pageDomain != srcDomain

30578 matches of which 29676 are script. https://gist.github.com/zcorpan/662946db12e58cb517240179e878c535

zcorpan · 2017-09-08T09:04:58Z

In an attempt to find the encoding of the page itself:

SELECT 
  REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
  REGEXP_EXTRACT(payload, r'"name":"Content-Type","value":"text/html; charset=([^"]+)') AS httpCharset
FROM [httparchive:har.2017_08_15_chrome_requests]
WHERE page = url

~~Download results from storage.googleapis.com/zcorpan/page_domain_http_charset.csv.gzip (3.4MB, temporary link)~~

SELECT
  REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
  REGEXP_EXTRACT(LOWER(body), r'<meta(?:\s+[^>]+)?\scharset\s*=\s*["\']?([a-z0-9_-]+)') AS metaCharset
FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
WHERE page = url

~~Download results from storage.googleapis.com/zcorpan/page_domain_meta_charset.csv.gzip (3.28MB, temporary link)~~

~~If somebody else can take over analysis from this data, that would be much appreciated.~~

See whatwg/html#3006

zcorpan · 2017-09-08T10:09:51Z

30578 matches of which 29676 are script.

There are 455,868 pages in this dataset, so:

~6.5% of pages have <script charset> with a cross-origin src
~0.2% of pages have <link charset> with a cross-origin href.

zcorpan · 2017-09-08T10:16:32Z

3108 pages, or ~0.7%, have <script charset> or <link charset> with a value other than utf-8/utf8/unicode-1-1-utf-8.

sideshowbarker · 2017-09-09T02:09:57Z

The HTML-checker use counter for this shows ~5.5% of pages have a script element with the charset attribute. I can add more specific use counters for it later if desired.

annevk · 2017-09-09T09:06:04Z

So now the question is whether it's worth being an error. I think it is as overriding the encoding from the document side means you have to do that everywhere and viewing the resource directly will fail, all of which seems suboptimal.

But I guess the argument against would be that 5.5% is high to add additional noise over to the checker results?

sideshowbarker · 2017-09-09T09:47:08Z

But I guess the argument against would be that 5.5% is high to add additional noise over to the checker results?

That shouldn’t stop us if we otherwise agree the right thing to do is to have the spec make it a non-conforming.

For the sake of comparison, note that the checker stats show that at least 16% of the documents it’s being used to check contain a style element in the document body. And the checker emits an error message for all those, because in spite of that relatively high usage, we resolved that the best thing to do for developers/authors was for the spec to say the style in the body non-conforming.

zcorpan · 2017-09-09T19:09:51Z

So I think there are several aspects against making it non-conforming:

JS doesn't have a way to declare encoding in the file except for BOM, while CSS also has @charset.
It might not be possible for the person including the script element to change the encoding of the enclosing page or the external script (especially cross-origin scripts).
It seems more common for scripts to include non-ASCII things than CSS, since script might contain snippets of markup or text to be inserted.
link is used for more things than just stylesheets but charset only has an effect for CSS.
It is more common than link charset.

What is the benefit for making charset for script non-conforming, other than consistency with link?

annevk · 2017-09-10T10:01:10Z

Well for cross-origin scripts I'm not even sure we want to make this work (see other issue) since it's a minor security risk. By letting the script decode differently you might be able to extract data you wouldn't otherwise have access to.

And for same-origin my rationale is what I gave above.

annevk · 2017-09-10T10:07:25Z

@zcorpan also, if you only use utf-8 throughout as you should, the other reasons you gave go away (since you'd still inherit from the page).

zcorpan · 2017-09-11T09:12:09Z

I've now INNER JOINed the tables, added a column with pageGuessedCharset (ISNULL(pageHttpCharset, pageMetaCharset)) and updated https://gist.github.com/zcorpan/662946db12e58cb517240179e878c535

The first 3, as a preview:

pageDomain	pageHttpCharset	pageMetaCharset	pageGuessedCharset	element	elementSrcDomain	elementCharset
www.gs4u.net	iso-8859-1	utf-8	iso-8859-1	script	vk.com	windows-1251
www.blyzka.by		utf-8	utf-8	script	cdn.sendpulse.com	utf-8
www.itresearches.ir	iso-8859-1	utf-8	iso-8859-1	script	cdn.sendpulse.com	utf-8

What I don't have here is the encoding of the script/stylesheet, if specified with HTTP or BOM or @charset (I didn't even store the full URL to the resource). But I can query this table to find interesting cases. I suppose pages where pageGuessedCharset is not utf-8, and elementCharset != pageGuessedCharset?

zcorpan · 2017-09-11T09:30:42Z

I suppose pages where pageGuessedCharset is not utf-8, and elementCharset != pageGuessedCharset?

SELECT * FROM [pelagic-breaker-785:test.link_script_all]
WHERE LOWER(pageGuessedCharset) != "utf-8"
AND LOWER(pageGuessedCharset) != elementCharset

26950 pages.

Of those, pages where elementCharset != "utf-8", gives 414 matches. https://gist.github.com/zcorpan/e6f2bf260e09ad319c73cb4a9e308f94

So there are 26536 pages where the page's encoding is not utf-8 but they include a cross-origin script with charset="utf-8". The error is really that the page's encoding is not utf-8; removing the charset attribute in response to a conformance checker error seems like it would be a bad effect.

zcorpan · 2017-09-11T09:47:29Z

As far as document conformance goes, how about making the only allowed value be "utf-8"? The effect of that is:

New conformance error:

(3108 - 414) = 2694 pages (~0.6%) use the same elementCharset as pageGuessedCharset, which is redundant; removing the attribute has no effect.
414 pages (~0.1%) use different elementCharset from pageGuessedCharset, which will have to specify the script's encoding with HTTP header or switch to utf-8.

No change:

26536 pages (5.8%) are not inconvenienced for using charset="utf-8" (but the conformance checker can give an error about the page's encoding not being utf-8).

domenic · 2017-09-21T12:07:45Z

Maybe we could do the following, in order:

Introduce a conformance error for charset=anything but UTF-8 on script
Introduce a conformance error for charset=anything but UTF-8 on meta
(Optional) introduce a conformance warning for charset=UTF-8 on script on pages that also specify it on meta, since doing so is redundant. (Similar to our warning for redundant type="" on style and script.)
(Speculative) Make it a conformance error to ever serve your document as non-UTF-8, including both meta and Content-Type, and also erroring on the implicit default encoding (so you have to do one of those two).

annevk · 2017-09-21T12:16:42Z

That seems reasonable, but then we should probably close this PR as that requires something quite different.

sideshowbarker · 2017-10-03T13:17:27Z

Maybe we could do the following, in order:

Introduce a conformance error for charset=anything but UTF-8 on script

Introduce a conformance error for charset=anything but UTF-8 on meta

I’ve raised #3091 for those.

The Encoding spec already has language that clearly states the requirement:

Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.

…so #3091 just brings us into conformance with what the Encoding spec already requires.

annevk · 2017-10-03T13:45:20Z

If we don't go with this PR I should make sure I do move that comment and add the <hr> as I did here separately somehow.

domenic · 2017-10-03T15:04:23Z

Interesting, I didn't realize Encoding disallowed e.g. "utf8" (no hyphen). I wonder how many non-conformant documents I've written...

hsivonen · 2017-10-04T07:07:42Z

Interesting, I didn't realize Encoding disallowed e.g. "utf8" (no hyphen). I wonder how many non-conformant documents I've written...

FWIW, including the hyphen improves compatibility with Microsoft browsers. (Old ones at least, I'm too lazy to check the latest right now.)

annevk · 2017-10-06T10:16:31Z

This has been replaced by #3091 and #3100.

annevk added 2 commits September 5, 2017 14:00

Make <script charset> non-conforming

f643895

Fixes #3004.

forgot the index

14c43e5

domenic added document conformance removal/deprecation Removing or deprecating a feature topic: script and removed removal/deprecation Removing or deprecating a feature labels Sep 5, 2017

doh

a08a2d8

annevk assigned sideshowbarker and zcorpan Sep 5, 2017

zcorpan mentioned this pull request Sep 7, 2017

Consider restricting the influence of <script charset> and <link charset> #3005

Open

sideshowbarker added a commit to validator/validator that referenced this pull request Sep 8, 2017

Add use counter for script[charset]

28454f5

See whatwg/html#3006

sideshowbarker mentioned this pull request Oct 3, 2017

Require utf-8 when specifying character encoding #3091

Merged

annevk closed this Oct 6, 2017

annevk deleted the annevk/script-charset-conformance branch October 6, 2017 10:16

zcorpan mentioned this pull request Dec 19, 2017

Consider restricting <form accept-charset> to utf-8 #3097

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make <script charset> non-conforming #3006

Make <script charset> non-conforming #3006

annevk commented Sep 5, 2017

domenic commented Sep 5, 2017

sideshowbarker commented Sep 6, 2017

annevk commented Sep 6, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 8, 2017

zcorpan commented Sep 8, 2017 •

edited

Loading

zcorpan commented Sep 8, 2017

zcorpan commented Sep 8, 2017

sideshowbarker commented Sep 9, 2017

annevk commented Sep 9, 2017

sideshowbarker commented Sep 9, 2017

zcorpan commented Sep 9, 2017

annevk commented Sep 10, 2017

annevk commented Sep 10, 2017

zcorpan commented Sep 11, 2017

zcorpan commented Sep 11, 2017

zcorpan commented Sep 11, 2017

domenic commented Sep 21, 2017

annevk commented Sep 21, 2017

sideshowbarker commented Oct 3, 2017

annevk commented Oct 3, 2017

domenic commented Oct 3, 2017

hsivonen commented Oct 4, 2017

annevk commented Oct 6, 2017

Make <script charset> non-conforming #3006

Make <script charset> non-conforming #3006

Conversation

annevk commented Sep 5, 2017

domenic commented Sep 5, 2017

sideshowbarker commented Sep 6, 2017

annevk commented Sep 6, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 7, 2017

zcorpan commented Sep 8, 2017

zcorpan commented Sep 8, 2017 • edited Loading

zcorpan commented Sep 8, 2017

zcorpan commented Sep 8, 2017

sideshowbarker commented Sep 9, 2017

annevk commented Sep 9, 2017

sideshowbarker commented Sep 9, 2017

zcorpan commented Sep 9, 2017

annevk commented Sep 10, 2017

annevk commented Sep 10, 2017

zcorpan commented Sep 11, 2017

zcorpan commented Sep 11, 2017

zcorpan commented Sep 11, 2017

domenic commented Sep 21, 2017

annevk commented Sep 21, 2017

sideshowbarker commented Oct 3, 2017

annevk commented Oct 3, 2017

domenic commented Oct 3, 2017

hsivonen commented Oct 4, 2017

annevk commented Oct 6, 2017

zcorpan commented Sep 8, 2017 •

edited

Loading