Mistune.util.escape_url is too aggressive #295
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds
';'
,'!'
, and'$'
to the set of characters which will be passed unmolested bymistune.util.escape_url
. These are all in RFC 3986’s reserved character list — that is to say: escaping these may change the meaning of a URL.Bugs Fixed
I noticed this when a
data:
URL was mangled by the escaping of a semicolon. The two URLs:data:text/plain;base64,d293
anddata:text/plain%3Bbase64,d293
are not equivalent (the first has content-typetext/plain
, the second has content-typetext/plain;base64
.)Further Discussion
There are three other characters in the RFC 3986 reserved character list which may be worth adding to the list of unescaped octets:
'['
,']'
,"'"
.Escaping square brackets breaks URLs with numeric ipv6 addresses in the netloc (e.g.
https://[::1]/
).However, I have not yet included these in this PR because doing so breaks the Backslash-escapes do not work inside autolinks test in
tests/fixtures/commonmark.txt
, and I'm not quite sure how sacrosanct those tests are to you.As for single quotes, I can not come up with an example of a URL scheme that relies on them, but they are in the list of reserved characters, and it would probably be safer not to escape them.