diff --git a/docs/2022-10-decision-on-cdata-nodes.md b/docs/2022-10-decision-on-cdata-nodes.md new file mode 100644 index 0000000..bd85795 --- /dev/null +++ b/docs/2022-10-decision-on-cdata-nodes.md @@ -0,0 +1,1223 @@ +# Decision Summary + +When encountering CDATA nodes in an HTML4 document that might be treated as PCDATA by HTML5 parsers, we decided to escape the `<`, `>`, and `&` characters in that CDATA node to prevent XSS attacks, accepting that those characters will escaped even in the valid HTML4 CDATA context of a `style` tag, because that situation seems to be a rare use case that impacts only browser style, and because once Loofah defaults to HTML5 parsing (potentially in the very near future) this drawback goes away. + + + + + +- [Recommendations](#recommendations) + * [Short term](#short-term) + * [Long term](#long-term) +- [What's Happening in these CVEs](#whats-happening-in-these-cves) + * [The underlying behavior](#the-underlying-behavior) + * [Side note: CDATA vs PCDATA](#side-note-cdata-vs-pcdata) + * [2015-08-08: rails-html-sanitizer CVE-2015-7580](#2015-08-08-rails-html-sanitizer-cve-2015-7580) + * [2017-10-10: loofah vulnerability](#2017-10-10-loofah-vulnerability) + * [2022-04-05: `select` and `style` parents, `script` payload:](#2022-04-05-select-and-style-parents-script-payload) + * [Today: the foreign context vulnerability](#today-the-foreign-context-vulnerability) +- [Solutions](#solutions) + * [Potential solution 1](#potential-solution-1) + * [Potential solution 2](#potential-solution-2) + + + +# Recommendations + +## Short term + +I recommend adopting [Solution 2](#potential-solution-2) below: + +- revert the patch in rails-html-sanitizer v1.4.3 +- apply `CGI.escapeHTML` on all CDATA nodes created by Nokogiri's HTML4 parser. + +This addresses all known rails-html-sanitizer vulnerabilities related to HTML4/HTML5 parser behavior mismatches, does not introduce any additional backwards-incompatible changes in behavior, is simple enough to easily reason about, and does not suffer from the performance or stack depth problems present in [Solution 1](#potential-solution-1). + + +(Note that both solutions considered in this doc share one small backwards-incompatibility, which is "special characters in valid `style` tags will be entity-escaped". Unfortunately this is hard to prevent without making this patch significantly more complex, and once Loofah defaults to an HTML5 parser this behavior goes away. I am recommending we move forward with this solution despite this inconvenience.) + +Action items: + +- implement this logic in Loofah and release v2.19.1 +- call into Loofah from rails-html-sanitizer, bump the dependency, and cut a v1.4.x release + + +## Long term + +### Use HTML5 sanitization + +It's important to note that if we use an HTML5 parser for the sanitization pass, this entire class of problem goes away. + +We should move RHS to an HTML5 parser as soon as possible. + +- [Loofah using HTML5 by default](https://github.com/flavorjones/loofah/pull/239) is blocked on a Nokogiri v1.14.0 release +- [RHS behavior changes are documented here](https://github.com/rails/rails-html-sanitizer/pull/133) in the updated tests + + +### Avoid functional drift between Loofah and RHS + +It's important to note that Loofah and RHS initially solved the same problem in two different ways. I consider this to be a huge missed opportunity. + +We should adopt a policy of building features or hooks into Loofah to enable RHS to be as small as possible with only Rails-specific modifications to scrubbers. + +- https://github.com/rails/rails-html-sanitizer/pull/136 examines the behavior differences between the two implementations +- `data-` attributes are the big diff, https://hackerone.com/reports/42728 was the motivating issue for that change + + +# What's Happening in these CVEs + +There are a few related CVE and bugs that are all related to the same underlying behavior: + +- [#81212 Potential XSS on sanitize/Rails::Html::WhiteListSanitizer](https://hackerone.com/reports/81212) +- [Nested Scripts · Issue #127 · flavorjones/loofah](https://github.com/flavorjones/loofah/issues/127) +- [#1530898 Rails::Html::SafeListSanitizer vulnerable to xss attack in an environment that allows the style tag](https://hackerone.com/bugs?report_id=1530898&subject=rails) +- [#1654310 Incomplete fix for CVE-2022-32209 (XSS in Rails::Html::Sanitizer under certain configurations)](https://hackerone.com/reports/1654310) +- [#1656627 Rails::Html::SafeListSanitizer vulnerable to XSS when certain tags are allowed (math+style || svg+style)](https://hackerone.com/reports/1656627) + +Most of these are analyzed in detail below. + +You may wish to refer to the [WHATWG HTML5 Standard](https://html.spec.whatwg.org/multipage/parsing.html) as an aid to follow the detailed explanations below. + +## The underlying behavior + +First, let's build a mental model, using only Nokogiri's HTML4 and HTML5 parsers, for the underlying mechanism that's causing problems. + +Let's start with this input string: + +``` html + +``` + +Nokogiri uses the libxml2 HTML4 parser to generate this HTML4 DOM: + +``` text +#(Element:0x564 { + name = "body", + children = [ + #(Element:0x578 { + name = "select", + children = [ + #(Element:0x58c { + name = "style", + children = [ #(CDATA "")] + })] + }), + #(Text "\n")] +``` + +**⚠ Note that in the above DOM structure, the `style` tag's child is CDATA.** + +This HTML4 DOM serializes as: + +``` html + + + + +``` + +**⚠ Note that the CDATA payload of `style` is presented literally, without any entity escaping.** + +Feeding the serialized HTML4 document into an HTML5 parser builds the following HTML5 DOM: + +``` text +#(Element:0x5f0 { + name = "body", + children = [ + #(Text "\n"), + #(Element:0x604 { + name = "select", + children = [ + #(Element:0x618 { + name = "script", + children = [ #(Text "alert(1);")] + })] + }), + #(Text "\n" + "\n")] + })] +``` + +**⚠ Note that the `style` tag has been removed because the HTML5 parser considers it to be an invalid child of `select`!** The internal HTML5 parser states are described in detail later in this document. + +This DOM is equivalent to: + +``` html + + + + +``` + +**⚠ Browsers will execute this javascript!** + +TL;DR: The underlying issue is the combination of two behaviors: + +1. libxml2's HTML4 parser serializes CDATA nodes without entity-escaping; +2. the user agent's HTML5 parser can be made to parse that HTML4 CDATA payload as HTML5 PCDATA. + +libxml2's HTML4 parser creates CDATA nodes **only** for `script` and `style` tag contents. As a result, this class of sanitization vulnerability appears in situations where input contains nested `script` tags or where `style` tags are permitted but might be invalid in HTML5. + + +## Side note: CDATA vs PCDATA + +CDATA is a string of bytes that HTML4 parsers will parse as **text and only text**; any embedded characters like `<` or `>` that might be invalid in HTML4 text nodes are valid here! Think of it as a **string literal**. + +CDATA is not a thing in HTML5, which instead specifies a series of tokenizer and parser states and transitions -- for example, [tokenizing a `script` tag](https://html.spec.whatwg.org/multipage/parsing.html#script-data-state) or [tokenizing a `style` tag](https://html.spec.whatwg.org/multipage/parsing.html#rawtext-state). + +PCDATA is a string of bytes that HTML4 parsers will interpret as a DOM. This data is _structural_, so characters like `<` and `>` are likely to be interpreted as parts of HTML tags. + +There is no such thing as PCDATA in the HTML5 spec (reiterating, HTML5 is specified using tokenizer and parser states and transitions), but it's useful shorthand to mean _bytes that will determine the structure of the document_ and so I hope you'll forgive my slight misuse of the term in this document. + + +## 2015-08-08: rails-html-sanitizer CVE-2015-7580 + +(Reported on 2015-08-08 in https://hackerone.com/reports/81212 (and separately in https://hackerone.com/reports/89914) and announced in https://nvd.nist.gov/vuln/detail/CVE-2015-7580 and https://github.com/advisories/GHSA-ghqm-pgxj-37gq.) + +This vulnerability describes nested script tags going unsanitized: + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", "=1.0.2" + gem "loofah", "=2.0.2" + gem "nokogiri", "=1.6.6.2" +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input = "
alert(1);/
" +tags = %w(div) +sanitize(input, tags) # => "
\n\n
" +``` + +This was fixed with https://github.com/rails/rails-html-sanitizer/commit/63903b0eaa6d2a4e1c91bc86008256c4c8335e78 + +``` patch +diff --git a/lib/rails/html/scrubbers.rb b/lib/rails/html/scrubbers.rb +index d6f8ce4..1e6f887 100644 +--- a/lib/rails/html/scrubbers.rb ++++ b/lib/rails/html/scrubbers.rb +@@ -60,6 +60,11 @@ def attributes=(attributes) + end + + def scrub(node) ++ if node.cdata? ++ text = node.document.create_text_node node.text ++ node.replace text ++ return CONTINUE ++ end + return CONTINUE if skip_node?(node) + + unless keep_node?(node) +@@ -76,7 +81,7 @@ def allowed_node?(node) + end + + def skip_node?(node) +- node.text? || node.cdata? ++ node.text? + end + + def scrub_attribute?(name) +``` + +which changed the behavior to: + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", "=1.0.3" # contains the fix + gem "loofah", "=2.0.2" + gem "nokogiri", "=1.6.6.2" +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input = "
alert(1);/
" +tags = %w(div) +sanitize(input, tags) # => "
<script>alert(1);</script>
" +``` + +Here's how this patch works: + +- the `script` tag's CDATA contents are converted into a Text node +- the parent `script` start and end tags are removed by the sanitizer +- the Text node left behind is then made a child of `div` +- as a child of `div`, the Text node will be entity-escaped +- which means the HTML5 parser receives `<script>...` instead of `" +sanitize(input) # => "
" +sanitize(input) # => "
" +``` + +One drawback to this recursive approach is that it's possible to trigger a "stack level too deep" exception in Loofah with sufficient nesting of script tags (see [Uncontrolled Recursion in Loofah · Advisory · flavorjones/loofah](https://github.com/flavorjones/loofah/security/advisories/GHSA-3x8r-x6xp-q4vm)). + + +## 2022-04-05: `select` and `style` parents, `script` payload: + +Looking at CVE-2022-32209 ([#1530898 Rails::Html::SafeListSanitizer vulnerable to xss attack in an environment that allows the style tag](https://hackerone.com/bugs?report_id=1530898&subject=rails)) + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", "=1.4.2" +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input, tags = "", %w(select style) +sanitize(input, tags) # => "" +``` + +In this case, the HTML4 doc's CDATA payload: + +``` text +#(Element:0x5f0 { + name = "body", + children = [ + #(Element:0x604 { + name = "select", + children = [ #(Element:0x618 { name = "style", children = [ #(CDATA "")] })] + }), + #(Text "\n")] + })] +``` + +is serialized as: + +``` html + +``` + +The states that the HTML5 parser goes through as it parses this are roughly: + +- ... +- insertion mode: ["in body"](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody) + - see "select" + - insert `select` tag into `body` + - move to ["in select" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inselect) +- insertion mode: "in select" + - see "style" + - **ignore tag and drop it** + - see "script" + - **insert `script` tag into `select`** + - move to ["text" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-incdata) +- insertion mode: "text" + - see text tokens + - insert them into `script` +- ... + + +resulting in this HTML5 DOM: + +``` text +#(Element:0x67c { + name = "body", + children = [ + #(Text "\n"), + #(Element:0x690 { + name = "select", + children = [ #(Element:0x6a4 { name = "script", children = [ #(Text "alert(1)")] })] + }), + #(Text "\n" + "\n")] + })] +``` + +which is equivalent to + +``` html + + + + +``` + +**⚠ Browsers will execute this javascript!** + +With [the fix in v1.4.3](https://github.com/rails/rails-html-sanitizer/commit/45a5c10) the result is sanitized: + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", "=1.4.3" # contains the select/style fix +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input, tags = "", %w(select style) +sanitize(input, tags) # => "" +# >> WARNING: Rails::Html::SafeListSanitizer: removing 'style' from safelist, should not be combined with 'select' +``` + +It's worth noting that this fix relies on the Text node fix for CVE-2015-7580 by reparenting the `style` CDATA payload as a Text child of `select`, so that the entities are properly escaped. + + +## Today: the foreign context vulnerability + +Looking at [#1656627 Rails::Html::SafeListSanitizer vulnerable to XSS when certain tags are allowed (math+style || svg+style)](https://hackerone.com/reports/1656627) which describes a similar attack vector to CVE-2022-32209: + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", "=1.4.3" # contains the select/style fix +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input, tags = "", %w(svg style) +sanitize(input, tags) # => "" + +input, tags = "", %w(math style img) +sanitize(input, tags) # => "" +``` + +In both of these cases, the HTML5 parser is put into a "foreign context" parsing mode that causes it to parse CDATA as PCDATA. + + +### the `svg` case + +In the `svg` case, the HTML4 DOM is: + +``` text +#(Element:0x5f0 { + name = "body", + children = [ + #(Element:0x604 { + name = "svg", + children = [ #(Element:0x618 { name = "style", children = [ #(CDATA "")] })] + }), + #(Text "\n")] + })] +``` + +serialized as: + +``` html + +``` + +The states that the HTML5 parser goes through as it parses this are roughly: + +- ... +- insertion mode: "in body" + - see "svg" + - insert `svg` tag into `body` + - move to ["in foreign content" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign) +- insertion mode: "in foreign content" + - see "style" + - insert `style` tag into `svg` + - see "script" + - insert `script` tag into `style` + - see text tokens + - insert them into `script` +- ... + +resulting in this HTML5 DOM: + +``` text +#(Element:0x67c { + name = "body", + children = [ + #(Text "\n"), + #(Element:0x690 { + name = "svg", + namespace = #(Namespace:0x6a4 { prefix = "svg", href = "http://www.w3.org/2000/svg" }), + children = [ + #(Element:0x6b8 { + name = "style", + namespace = #(Namespace:0x6a4 { prefix = "svg", href = "http://www.w3.org/2000/svg" }), + children = [ + #(Element:0x6cc { + name = "script", + namespace = #(Namespace:0x6a4 { prefix = "svg", href = "http://www.w3.org/2000/svg" }), + children = [ #(Text "alert(1)")] + })] + })] + }), + #(Text "\n" + "\n")] + })] +``` + +which is equivalent to: + +``` html + +``` + +**⚠ Browsers will execute this javascript!** + +Note that this behavior is also demonstrated with a `math` parent tag, but is not a vulnerability as browsers will not execute the contents of `script` tags in a MathML context. (See [SVG11](https://www.w3.org/TR/SVG11/script.html#ScriptElement) for documentation on SVG support for the `script` element.) + + +### the `math` case + +For the `math` case, the HTML4 DOM is: + +``` text +#(Element:0x5f0 { + name = "body", + children = [ + #(Element:0x604 { + name = "math", + children = [ + #(Element:0x618 { name = "style", children = [ #(CDATA "")] })] + }), + #(Text "\n")] + })] +``` + +serialized as: + +``` html + +``` + +The states that the HTML5 parser goes through as it parses this are roughly: + +- ... +- insertion mode: "in body" + - see "math" + - insert `math` tag into `body` + - move to "in foreign content" insertion mode +- insertion mode: "in foreign content" + - see "style" + - insert tag into `math` + - see "img" + - **consider it a parse error** + - pop `style` node (close it) + - pop `math` node (close it) + - **insert `img` tag into `body`** + - move to "in body" insertion mode +- insertion mode: "in body" + - see `` + - consider it a parse error + - ignore it + - see `` + - consider it a parse error + - ignore it +- ... + +resulting in this HTML5 DOM: + +``` text +#(Element:0x67c { + name = "body", + children = [ + #(Text "\n"), + #(Element:0x690 { + name = "math", + namespace = #(Namespace:0x6a4 { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }), + children = [ + #(Element:0x6b8 { + name = "style", + namespace = #(Namespace:0x6a4 { prefix = "math", href = "http://www.w3.org/1998/Math/MathML" }) + })] + }), + #(Element:0x6cc { + name = "img", + attributes = [ + #(Attr:0x6e0 { name = "src", value = "x" }), + #(Attr:0x6f4 { name = "onerror", value = "alert(1)" })] + }), + #(Text "\n" + "\n")] + })] +``` + +which is equivalent to: + +``` html + +``` + +**⚠ Browsers will execute this javascript!** + + +Here we see that the `img` tag data which was originally contained within HTML4 CDATA context (and avoided sanitization of its attributes) has been lifted out of that context by the HTML5 parser, and parsed as a sibling to the `math` element. + +Note that this behavior (and the vulnerability) is also present if we replace `...` with `...`. + + +# Solutions + +## Potential solution 1 + +Rollback [the RHS fix applied in c871aa4 / 45a5c10](https://github.com/rails/rails-html-sanitizer/commit/45a5c10) and adopt Loofah's strategy (noted described above) of recursively sanitizing any HTML4 CDATA nodes. + +### The patch + +The code changes, after the revert, for this behavior would be: + +``` patch +diff --git a/lib/rails/html/scrubbers.rb b/lib/rails/html/scrubbers.rb +index 09cfe95..7f29c12 100644 +--- a/lib/rails/html/scrubbers.rb ++++ b/lib/rails/html/scrubbers.rb +@@ -61,9 +61,10 @@ def attributes=(attributes) + end + + def scrub(node) +- if node.cdata? +- text = node.document.create_text_node node.text +- node.replace text ++ if needs_further_escaping(node) ++ safe_text = Loofah.fragment(node.text).scrub!(self).to_html ++ safe_node = node.document.create_text_node(safe_text) ++ node.replace safe_node + return CONTINUE + end + return CONTINUE if skip_node?(node) +@@ -77,6 +78,11 @@ def scrub(node) + + protected + ++ def needs_further_escaping(node) ++ # Nokogiri's HTML4 parser on JRuby doesn't flag the child of a `style` tag as cdata, but it is. ++ node.cdata? || (Nokogiri.jruby? && node.text? && node.parent.name == "style") ++ end ++ + def allowed_node?(node) + @tags.include?(node.name) + end +diff --git a/test/sanitizer_test.rb b/test/sanitizer_test.rb +index 5bf188e..69eee45 100644 +--- a/test/sanitizer_test.rb ++++ b/test/sanitizer_test.rb +@@ -581,6 +581,18 @@ def test_exclude_node_type_comment + assert_equal("
text
text", safe_list_sanitize("
text
text")) + end + ++ [ ++ ["", ["select", "style"], "script"], ++ ["", ["svg", "style"], "script"], ++ ["", ["math", "style", "img"], "onerror"], ++ ["", ["svg", "style", "img"], "onerror"], ++ ].each do |input, tags, should_not_include| ++ define_method "test_disallow_the_dangerous_safelist_combination_of_#{tags.join("_")}" do ++ sanitized = safe_list_sanitize(input, tags: tags) ++ refute_includes(sanitized, should_not_include) ++ end ++ end ++ +``` + +### The resulting behavior + +The resulting behavior for each of our attack inputs would be: + +``` ruby +#! /usr/bin/env ruby + +require "bundler/inline" + +gemfile do + source "https://rubygems.org" + gem "rails-html-sanitizer", path: "." +end + +require "rails-html-sanitizer" + +def sanitize(input, tags) + Rails::Html::WhiteListSanitizer.new.sanitize(input, tags: tags) +end + +input, tags = "", %w(select style) +sanitize(input, tags) # => "" + +input, tags = "", %w(style) +sanitize(input, tags) # => "" + +input, tags = "", %w(select style) +sanitize(input, tags) # => "" + +input, tags = "", %w(select style) +sanitize(input, tags) # => "" + +input, tags = "", %w(svg style) +sanitize(input, tags) # => "" + +input, tags = "", %w(math style) +sanitize(input, tags) # => "" + +input, tags = "", %w(math style img) +sanitize(input, tags) # => "" +``` + +For the safe usage case, the output would revert the behavior change introduced in v1.4.3: + +``` +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + +For `select` and `style` parents, the `script` tag has been removed by the act of recursively sanitizing. This is different both from the 1.4.2 (unsafe) and 1.4.3 (safe) behavior, but is still safe: + +``` +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + +For the "foreign element" `script` case, the `script` tag has again been removed by recursively sanitizing, and r-h-s is finally safe from this attack. + +``` +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + +For the "foreign element" `img` case, the recursive scrubbing results in: + +- the `img` tag being removed in the case where `img` is **disallowed**, +- the `onerror` attribute being removed in the case where the `img` tag is **allowed**. + +... Disallowing the `img` tag: + +``` +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + +... Allowing the `img` tag: + +``` +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + + +### Backwards incompatibilities + +For `style` tags with special characters **there will be a backwards-incompatible change**: + +``` text +input: "" +v1.4.2: "" +v1.4.3: "" +proposed: "" +``` + +However, once Loofah upgrades to HTML5 (hopefully very soon), this goes away and the behavior from `<= 1.4.3` is restored. I don't think this is a reason not to choose this option. + + +### Other failing tests + +The behavior difference, more generally, is introducing entity escaping on nested script tags, and I think this is acceptable. + +The test changes necessary for this patch: + +``` patch +diff --git a/test/sanitizer_test.rb b/test/sanitizer_test.rb +index d19063c..69eee45 100644 +--- a/test/sanitizer_test.rb ++++ b/test/sanitizer_test.rb +@@ -14,11 +14,11 @@ def test_sanitizer_sanitize_raises_not_implemented_error + end + + def test_sanitize_nested_script +- assert_equal '<script>alert("XSS");</script>', safe_list_sanitize('alert("XSS");/', tags: %w(em)) ++ assert_equal 'alert("XSS");&lt;/script&gt;', safe_list_sanitize('alert("XSS");/', tags: %w(em)) + end + + def test_sanitize_nested_script_in_style +- assert_equal '<script>alert("XSS");</script>', safe_list_sanitize('alert("XSS");/', tags: %w(em)) ++ assert_equal 'alert("XSS");&lt;/script&gt;', safe_list_sanitize('alert("XSS");/', tags: %w(em)) + end + + class XpathRemovalTestSanitizer < Rails::Html::Sanitizer +@@ -366,7 +366,7 @@ def test_should_sanitize_invalid_script_tag + end + + def test_should_sanitize_script_tag_with_multiple_open_brackets +- assert_sanitized %(<), "<alert(\"XSS\");//<" ++ assert_sanitized %(<), "<alert(\"XSS\");//&lt;" + assert_sanitized %(