Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make <script charset> non-conforming #3006

Closed
wants to merge 3 commits into from

Conversation

annevk
Copy link
Member

@annevk annevk commented Sep 5, 2017

Fixes #3004.

@domenic
Copy link
Member

domenic commented Sep 5, 2017

I support this generally but would like more input from people like @zcorpan and @sideshowbarker. Especially if @sideshowbarker were able to get us stats on how often this appears, if they were low, that would make me feel better.

It's also worth considering that by disallowing this you are disallowing people using <script charset="utf-8"> on non-utf-8 pages (e.g. pages left with the default windows-1252).

@domenic domenic added document conformance removal/deprecation Removing or deprecating a feature topic: script and removed removal/deprecation Removing or deprecating a feature labels Sep 5, 2017
@sideshowbarker
Copy link
Contributor

if @sideshowbarker were able to get us stats on how often this appears, if they were low, that would make me feel better

OK, I’ll add add a use counter for it to the HTML checker later today

@annevk
Copy link
Member Author

annevk commented Sep 6, 2017

@sideshowbarker you want to measure <link rel=stylesheet charset> too then. To see if it's justified one is conforming and the other is not. I suspect neither is justified.

@zcorpan
Copy link
Member

zcorpan commented Sep 7, 2017

link charset was defined as doing something (but not made conforming) for stylesheets in 29e0e33
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23684

script charset was defined as doing something and made conforming in e74a98d
https://lists.w3.org/Archives/Public/public-whatwg-archive/2008May/0137.html

@zcorpan
Copy link
Member

zcorpan commented Sep 7, 2017

In the httparchive data

SELECT
  page,
  REGEXP_EXTRACT(match, r'<(link|script)') AS element,
  REGEXP_EXTRACT(match, r'charset\s*=\s*["\']?([a-z0-9_-]+)') AS charset
FROM (
  SELECT page, REGEXP_EXTRACT(LOWER(body), r'(<(?:link|script)(?:\s+[^>]+)?\scharset\s*=\s*["\']?[a-z0-9_-]+)') AS match
  FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
  WHERE page = url
)
WHERE match != "null"

49497 matches, of which 45180 the element is script.

@zcorpan
Copy link
Member

zcorpan commented Sep 7, 2017

I notice now from looking at view-source:https://www.z-wave.me/ that the above query matches this

<link rel="shortcut icon" href="https://www.z-wave.me/fileadmin/template/img/favicon.ico" type="text/plain; charset=utf-16le" />

so, need to write a more complicated query that tries to tokenize attributes. (I've done that before so can copy from an earlier query.)

@zcorpan
Copy link
Member

zcorpan commented Sep 8, 2017

New query that skips over whole attributes and only counts matches where the domain in src is different from the page's domain (for #3005):

SELECT * FROM (
  SELECT
    REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
    REGEXP_EXTRACT(match, r'<(link|script)') AS element,
    REGEXP_EXTRACT(match, r'<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*(?:src|href)\s*=\s*["\']?(?:https?:)?//([^/:"\']+)') AS srcDomain,
    REGEXP_EXTRACT(match, r'<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*charset\s*=\s*["\']?([a-z0-9_-]+)') AS charset
  FROM (
    SELECT page, REGEXP_EXTRACT(LOWER(body), r'(<(?:link|script)\s+(?:[a-z0-9:_-]+(?:\s*=\s*(?:"[^"]*"\s*|\'[^\']*\'\s*|[^>\s/"\']+\s+)|\s+))*charset\s*=[^>]+>)') AS match
    FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
    WHERE page = url
  )
  WHERE match != "null"
)
WHERE srcDomain != "null"
AND charset != "null"
AND pageDomain != srcDomain

30578 matches of which 29676 are script. https://gist.github.com/zcorpan/662946db12e58cb517240179e878c535

@zcorpan
Copy link
Member

zcorpan commented Sep 8, 2017

In an attempt to find the encoding of the page itself:

SELECT 
  REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
  REGEXP_EXTRACT(payload, r'"name":"Content-Type","value":"text/html; charset=([^"]+)') AS httpCharset
FROM [httparchive:har.2017_08_15_chrome_requests]
WHERE page = url

Download results from storage.googleapis.com/zcorpan/page_domain_http_charset.csv.gzip (3.4MB, temporary link)

SELECT
  REGEXP_EXTRACT(page, r'https://([^/]+)') AS pageDomain,
  REGEXP_EXTRACT(LOWER(body), r'<meta(?:\s+[^>]+)?\scharset\s*=\s*["\']?([a-z0-9_-]+)') AS metaCharset
FROM [httparchive:har.2017_08_15_chrome_requests_bodies]
WHERE page = url

Download results from storage.googleapis.com/zcorpan/page_domain_meta_charset.csv.gzip (3.28MB, temporary link)

If somebody else can take over analysis from this data, that would be much appreciated.

sideshowbarker added a commit to validator/validator that referenced this pull request Sep 8, 2017
@zcorpan
Copy link
Member

zcorpan commented Sep 8, 2017

30578 matches of which 29676 are script.

There are 455,868 pages in this dataset, so:

  • ~6.5% of pages have <script charset> with a cross-origin src
  • ~0.2% of pages have <link charset> with a cross-origin href.

@zcorpan
Copy link
Member

zcorpan commented Sep 8, 2017

3108 pages, or ~0.7%, have <script charset> or <link charset> with a value other than utf-8/utf8/unicode-1-1-utf-8.

@sideshowbarker
Copy link
Contributor

The HTML-checker use counter for this shows ~5.5% of pages have a script element with the charset attribute. I can add more specific use counters for it later if desired.

@annevk
Copy link
Member Author

annevk commented Sep 9, 2017

So now the question is whether it's worth being an error. I think it is as overriding the encoding from the document side means you have to do that everywhere and viewing the resource directly will fail, all of which seems suboptimal.

But I guess the argument against would be that 5.5% is high to add additional noise over to the checker results?

@sideshowbarker
Copy link
Contributor

But I guess the argument against would be that 5.5% is high to add additional noise over to the checker results?

That shouldn’t stop us if we otherwise agree the right thing to do is to have the spec make it a non-conforming.

For the sake of comparison, note that the checker stats show that at least 16% of the documents it’s being used to check contain a style element in the document body. And the checker emits an error message for all those, because in spite of that relatively high usage, we resolved that the best thing to do for developers/authors was for the spec to say the style in the body non-conforming.

@zcorpan
Copy link
Member

zcorpan commented Sep 9, 2017

So I think there are several aspects against making it non-conforming:

  • JS doesn't have a way to declare encoding in the file except for BOM, while CSS also has @charset.
  • It might not be possible for the person including the script element to change the encoding of the enclosing page or the external script (especially cross-origin scripts).
  • It seems more common for scripts to include non-ASCII things than CSS, since script might contain snippets of markup or text to be inserted.
  • link is used for more things than just stylesheets but charset only has an effect for CSS.
  • It is more common than link charset.

What is the benefit for making charset for script non-conforming, other than consistency with link?

@annevk
Copy link
Member Author

annevk commented Sep 10, 2017

Well for cross-origin scripts I'm not even sure we want to make this work (see other issue) since it's a minor security risk. By letting the script decode differently you might be able to extract data you wouldn't otherwise have access to.

And for same-origin my rationale is what I gave above.

@annevk
Copy link
Member Author

annevk commented Sep 10, 2017

@zcorpan also, if you only use utf-8 throughout as you should, the other reasons you gave go away (since you'd still inherit from the page).

@zcorpan
Copy link
Member

zcorpan commented Sep 11, 2017

I've now INNER JOINed the tables, added a column with pageGuessedCharset (ISNULL(pageHttpCharset, pageMetaCharset)) and updated https://gist.github.com/zcorpan/662946db12e58cb517240179e878c535

The first 3, as a preview:

pageDomain pageHttpCharset pageMetaCharset pageGuessedCharset element elementSrcDomain elementCharset
www.gs4u.net iso-8859-1 utf-8 iso-8859-1 script vk.com windows-1251
www.blyzka.by utf-8 utf-8 script cdn.sendpulse.com utf-8
www.itresearches.ir iso-8859-1 utf-8 iso-8859-1 script cdn.sendpulse.com utf-8

What I don't have here is the encoding of the script/stylesheet, if specified with HTTP or BOM or @charset (I didn't even store the full URL to the resource). But I can query this table to find interesting cases. I suppose pages where pageGuessedCharset is not utf-8, and elementCharset != pageGuessedCharset?

@zcorpan
Copy link
Member

zcorpan commented Sep 11, 2017

I suppose pages where pageGuessedCharset is not utf-8, and elementCharset != pageGuessedCharset?

SELECT * FROM [pelagic-breaker-785:test.link_script_all]
WHERE LOWER(pageGuessedCharset) != "utf-8"
AND LOWER(pageGuessedCharset) != elementCharset

26950 pages.

Of those, pages where elementCharset != "utf-8", gives 414 matches. https://gist.github.com/zcorpan/e6f2bf260e09ad319c73cb4a9e308f94

So there are 26536 pages where the page's encoding is not utf-8 but they include a cross-origin script with charset="utf-8". The error is really that the page's encoding is not utf-8; removing the charset attribute in response to a conformance checker error seems like it would be a bad effect.

@zcorpan
Copy link
Member

zcorpan commented Sep 11, 2017

As far as document conformance goes, how about making the only allowed value be "utf-8"? The effect of that is:

New conformance error:

  • (3108 - 414) = 2694 pages (~0.6%) use the same elementCharset as pageGuessedCharset, which is redundant; removing the attribute has no effect.
  • 414 pages (~0.1%) use different elementCharset from pageGuessedCharset, which will have to specify the script's encoding with HTTP header or switch to utf-8.

No change:

  • 26536 pages (5.8%) are not inconvenienced for using charset="utf-8" (but the conformance checker can give an error about the page's encoding not being utf-8).

@domenic
Copy link
Member

domenic commented Sep 21, 2017

Maybe we could do the following, in order:

  • Introduce a conformance error for charset=anything but UTF-8 on script
  • Introduce a conformance error for charset=anything but UTF-8 on meta
  • (Optional) introduce a conformance warning for charset=UTF-8 on script on pages that also specify it on meta, since doing so is redundant. (Similar to our warning for redundant type="" on style and script.)
  • (Speculative) Make it a conformance error to ever serve your document as non-UTF-8, including both meta and Content-Type, and also erroring on the implicit default encoding (so you have to do one of those two).

@annevk
Copy link
Member Author

annevk commented Sep 21, 2017

That seems reasonable, but then we should probably close this PR as that requires something quite different.

@sideshowbarker
Copy link
Contributor

Maybe we could do the following, in order:

  • Introduce a conformance error for charset=anything but UTF-8 on script
  • Introduce a conformance error for charset=anything but UTF-8 on meta

I’ve raised #3091 for those.

The Encoding spec already has language that clearly states the requirement:

Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.

…so #3091 just brings us into conformance with what the Encoding spec already requires.

@annevk
Copy link
Member Author

annevk commented Oct 3, 2017

If we don't go with this PR I should make sure I do move that comment and add the <hr> as I did here separately somehow.

@domenic
Copy link
Member

domenic commented Oct 3, 2017

Interesting, I didn't realize Encoding disallowed e.g. "utf8" (no hyphen). I wonder how many non-conformant documents I've written...

@hsivonen
Copy link
Member

hsivonen commented Oct 4, 2017

Interesting, I didn't realize Encoding disallowed e.g. "utf8" (no hyphen). I wonder how many non-conformant documents I've written...

FWIW, including the hyphen improves compatibility with Microsoft browsers. (Old ones at least, I'm too lazy to check the latest right now.)

@annevk
Copy link
Member Author

annevk commented Oct 6, 2017

This has been replaced by #3091 and #3100.

@annevk annevk closed this Oct 6, 2017
@annevk annevk deleted the annevk/script-charset-conformance branch October 6, 2017 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants