URLs: It's Complicated (netmeister.org)
165 points by salutonmundo on June 23, 2021 | 36 comments



Just to share a little more of the weirdness (discovered while reading a couple of the historical URL & URI RFCs several days ago):

Per the original spec, in FTP URLs,

- ftp://example.net/foo/bar will get you bar inside the foo directory inside the default directory of the FTP server at example.net (i.e. CWD foo, RETR bar);

- ftp://example.net//foo/bar will get you bar inside the foo directory inside the empty string directory inside the default directory of the FTP server at example.net (i.e. CWD, CWD foo, RETR bar; what do FTP servers even do with this?);

- and it’s ftp://example.net/%2Ffoo/bar that you must use if you want bar inside the foo directory inside the root directory of the FTP server at example.net (i.e. CWD /foo, RETR bar; %2F being the result of percent-encoding a slash character).


> what do FTP servers even do with this?

Pretty sure CWD by itself isn't even valid (at least RFC 959 assumes it has an argument), and therefore // isn't valid in FTP URLs.

The %2Ffoo/bar is needed because of the fact that FTP CWD and RETR paths are system dependent (with, theoretically, system dependent path separators), but URLs are not, so the FTP client breaks the URL on / and sequentially executes CWD down the tree so that it doesn't need to know what it's connected to.

In other words: URL paths are not system paths, and it's a mistake to think of them as such.

(Alternate in other words: FTP is awful)
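To make the split-then-decode order concrete, here's a rough Python sketch (just the idea, not what any particular client actually does); url_path here is what follows the slash after the host in the example.net URLs above:

    from urllib.parse import unquote

    def ftp_commands(url_path):
        """Map an ftp:// URL path to the FTP commands a client would send.

        Split on literal "/" first, percent-decode each segment afterwards;
        that order is what makes %2F mean "a slash inside one segment".
        """
        *dirs, filename = url_path.split("/")
        cmds = [f"CWD {unquote(d)}" for d in dirs]   # an empty segment becomes "CWD " with a null argument
        cmds.append(f"RETR {unquote(filename)}")
        return cmds

    print(ftp_commands("foo/bar"))      # ['CWD foo', 'RETR bar']
    print(ftp_commands("/foo/bar"))     # ['CWD ', 'CWD foo', 'RETR bar']
    print(ftp_commands("%2Ffoo/bar"))   # ['CWD /foo', 'RETR bar']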


> Pretty sure CWD by itself isn’t even valid [...] and therefore // isn’t valid in FTP URLs.

So, I looked it up carefully and it appears that (despite the promises in later RFCs such as 2396 and 3986) the current specification of the ftp scheme is still the ancient RFC 1738 which predates not only the URL / URI distinction but even the notion of relative URLs. In §3.2.2 <https://tools.ietf.org/html/rfc1738#section-3.2.2> it specifically says that a null segment in the path should result in a “CWD ” command (i.e. CWD, space, null string argument) being sent to the FTP server, going against both the current RFC 959 and its predecessor 765 (apparently the earliest formal specification of FTP to include CWD) which require the argument to CWD to be non-null.

Thus apparently a conformant implementation of the ftp URL scheme cannot be a conformant implementation of an FTP client. Joy.

It still seems unlikely that Berners-Lee et al. would specifically call this case out if it were useless at the time... What were the servers that made this necessary, I wonder?

> FTP CWD and RETR paths are system dependent (with, theoretically, system dependent path separators), but URLs are not

Thank you, that’s the insight that I was missing. So a %2F inside an ftp URL component is just performing a (sanctioned) injection of the (supposedly UNIXy) server path syntax.

> FTP is awful

I’d go with “unbelievably ancient, with the attendant problems”, but yes. Funny how it still manages to be better than everything else I know of at transferring files, by not multiplexing control and data onto the same TCP connection. (I think HTTP over QUIC can do this as well?)


> i.e. CWD, CWD foo, RETR bar; what do FTP servers even do with this?

If you go by shell semantics, that pokes around the home directory of the user running the FTP daemon; hopefully that doesn't actually work.

> it's ftp://example.net/%2Ffoo/bar that you must use if you want bar inside the foo directory inside the root directory

This smells like a security vulnerability for most setups.


> This smells like a security vulnerability for most setups.

Yes, but if you look around on some old FTP servers (like on the few still-extant mirror networks) you’ll find that some do actually let you CWD to the system /, and sometimes they even drop you there by default (so you have to CWD pub or whatever to get at the things you actually want).


> If you go by shell semantics, that pokes around the home directory of the user running the FTP daemon; hopefully that doesn't actually work.

This is why FTP servers have default directories. They're the equivalent of user home directories. By the way, many FTP servers (especially historically) map FTP logins to real, local users.

> This smells like a security vulnerability for most setups.

How do you figure? Surely your sensitive files aren't world-readable... /s


I wanted to mention that, in practice, most FTP server implementations are not Unicode-compatible and are very likely vulnerable to effective-power-like abuses of RTL/LTR switching characters as well.

On top of that, probably all server implementations on Windows were a fork of BSD's original ftpd at some point, which had an RCE vulnerability when the password exceeded its limited buffer length of 256 bytes, IIRC.

Even software like ProFTPd was vulnerable over 30 years later.

Just writing this to make the point: stay the fuck away from FTP, because the software in that space is heavily outdated and never updated to fix issues. Use ssh/sftp, always.


You know, in a fantasy world where standards of comparable complexity have equally good implementations I would much rather use Telnet and FTP over TLS (1.3) than SSH and SFTP. For all that they show their age they just seem to me to be cleaner designs.

I will have to concede, though, that FTP servers in the real world are surprisingly awful. Even the supposedly easy task of spinning up an anonymous read-only FTP server to serve the current directory for five minutes, all permissions and security be damned, is annoyingly non-trivial.

(Unrelated to that awfulness, does anyone know how to get active FTP to pass through SLIRP networking on Qemu?)


I totally agree with you regarding complexity. A server's level of security probably has more to do with whether it's written in a memory-safe language than we care to admit.

I have the feeling that far too few libraries and implementations written in C use a linter or any other mechanism to catch the obvious type errors.

Everyone loves typed languages, but nobody uses their obvious advantages in regards to security. Kinda ironic when you see a -Wall all over the place.


From the source code of my preferred ftp/http client; maybe this is helpful. I'd also suggest reading the source code of djb's ftp server.

   Parse URL of form (per RFC 3986):
       <type>://[<user>[:<password>]@]<host>[:<port>][/<path>]
 
   XXX: this is not totally RFC 3986 compliant; <path> will have the
   leading `/' unless it's an ftp:// URL, as this makes things easier
   for file:// and http:// URLs.  ftp:// URLs have the `/' between the
   host and the URL-path removed, but any additional leading slashes
   in the URL-path are retained (because they imply that we should
   later do "CWD" with a null argument).
 
   Examples:
        input URL                       output path
        ---------                       -----------
       "http://host"                   "/"
       "http://host/"                  "/"
       "http://host/path"              "/path"
       "file://host/dir/file"          "dir/file"
       "ftp://host"                    ""
       "ftp://host/"                   ""
       "ftp://host//"                  "/"
       "ftp://host/dir/file"           "dir/file"
       "ftp://host//dir/file"          "/dir/file"
 
    If we are dealing with a classic `[user@]host:[path]'
    (urltype is CLASSIC_URL_T) then we have a raw directory
    name (not encoded in any way) and we can change
    directories in one step.
   
    If we are dealing with an `ftp://host/path' URL
    (urltype is FTP_URL_T), then RFC 3986 says we need to
    send a separate CWD command for each unescaped "/"
    in the path, and we have to interpret %hex escaping
    *after* we find the slashes.  It's possible to get
    empty components here, (from multiple adjacent
    slashes in the path) and RFC 3986 says that we should
    still do `CWD ' (with a null argument) in such cases.
   
    Many ftp servers don't support `CWD ', so if there's an
    error performing that command, bail out with a descriptive
    message.
   
    Examples:
                 
    host:                                dir="", urltype=CLASSIC_URL_T
                 logged in (to default directory)
    host:file                            dir=NULL, urltype=CLASSIC_URL_T
                 "RETR file"
    host:dir/                            dir="dir", urltype=CLASSIC_URL_T
                 "CWD dir", logged in
    ftp://host/                          dir="", urltype=FTP_URL_T
                 logged in (to default directory)
    ftp://host/dir/                      dir="dir", urltype=FTP_URL_T
                 "CWD dir", logged in
    ftp://host/file                      dir=NULL, urltype=FTP_URL_T
                 "RETR file"
    ftp://host//file                     dir="", urltype=FTP_URL_T
                 "CWD ", "RETR file"
    host:/file                           dir="/", urltype=CLASSIC_URL_T
                 "CWD /", "RETR file"
    ftp://host///file                    dir="/", urltype=FTP_URL_T
                 "CWD ", "CWD ", "RETR file"
    ftp://host/%2F/file                  dir="%2F", urltype=FTP_URL_T
                 "CWD /", "RETR file"
    ftp://host/foo/file                  dir="foo", urltype=FTP_URL_T
                 "CWD foo", "RETR file"
    ftp://host/foo/bar/file              dir="foo/bar"
                 "CWD foo", "CWD bar", "RETR file"
    ftp://host//foo/bar/file             dir="/foo/bar"
                 "CWD ", "CWD foo", "CWD bar", "RETR file"
    ftp://host/foo//bar/file             dir="foo//bar"
                 "CWD foo", "CWD ", "CWD bar", "RETR file"
    ftp://host/%2F/foo/bar/file          dir="%2F/foo/bar"
                 "CWD /", "CWD foo", "CWD bar", "RETR file"
    ftp://host/%2Ffoo/bar/file           dir="%2Ffoo/bar"
                 "CWD /foo", "CWD bar", "RETR file"
    ftp://host/%2Ffoo%2Fbar/file         dir="%2Ffoo%2Fbar"
                 "CWD /foo/bar", "RETR file"
    ftp://host/%2Ffoo%2Fbar%2Ffile       dir=NULL
                 "RETR /foo/bar/file"
   
    Note that we don't need `dir' after this point.
   
    The `CWD ' command (without a directory), which is required by   
    RFC 3986 to support the empty directory in the URL pathname (`//'),   
    conflicts with the server's conformance to RFC 959.




It seems like the colon is too ambiguous (it's used as a protocol delimiter, a user/pass delimiter, and a port delimiter).

Reminds a little bit of Java labels where you can do this:

    public class Labels {
        public static void main(String[] args) {
            https://hn.ycombinator.com
            for (int i = 0; i < 10; i++) {
                System.out.println("......." + i);
            }
        }
    }
The https: is a label named https, and everything after the colon (//hn.ycombinator.com) is a line comment, so this is valid code.


> It seems like the colon is too ambiguous (it's used as a protocol delimiter, a user/pass delimiter, and a port delimiter).

And because that was still too boring, they came up with IPv6.


IIUC the IPv6 weirdness here is simply due to very unfortunate timing: IPv6 was being finalized at a time (first half of the 90s) when the Web (and with it URLs) was already nearly frozen but still not obviously important.


The colons also make IPv6 addresses clearly distinguishable from IPv4 notation.
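(This is also why RFC 3986 requires IPv6 literals in URLs to be wrapped in brackets: without them, the address's colons would be indistinguishable from the port's colon. A quick Python sketch, using an address from the 2001:db8::/32 documentation prefix:)

    from urllib.parse import urlsplit

    # The brackets keep the address's colons separate from the port's colon.
    u = urlsplit("http://[2001:db8::1]:8080/index.html")
    print(u.hostname)   # 2001:db8::1
    print(u.port)       # 8080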


Not in URLs, but related:

    Larry's 1st Law of Language Redesign: Everyone wants the colon
    Larry's 2nd Law of Language Redesign: Larry gets the colon
https://thelackthereof.org/Perl6_Colons


All extremely useful: the overview, the examples and the comments.

A few months ago while writing a bot/crawler I searched for hours for something like this, but I found only full specs or just bits and pieces scattered around that used different terminology and/or had different opinions.

In the end I didn't even get a clear answer on what the max total URL length should be (e.g. mixed opinions here https://stackoverflow.com/questions/417142/what-is-the-maxim... - come on, a URL several GiB long?). Most of the time 2000 bytes is mentioned, but it's not 100% clear.

Writing a bot made me understand 1) why browsers are so complicated and 2) that the Internet is a mess (e.g. once I even found a page that used multiple character encodings...).

My personal opinion is that everything is too lax. Browsers try to outdo each other by implementing workarounds for stuff that doesn't have a spec (yet) or doesn't comply with one, and that can only end in a mess. A simple example is the HTTP header "Content-Encoding" ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Co... ), which I think should only indicate what kind of compression is being used, but I keep seeing stuff in there like "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc., and all those pages/files work perfectly in the browsers even though with those values they should actually be rejected.


The use of Content-Encoding for compression is actually something of a historical wart: what was intended for that purpose is Transfer-Encoding, but modern browsers don't even send the TE header necessary to permit the HTTP server to use it (except for Transfer-Encoding: chunked, which every HTTP/1.1 client must accept), even though some servers are perfectly capable of it and all but the most broken will at least ignore it. Things like 7bit, 8bit, binary, or quoted-printable are not supposed to be in the HTTP Content-Encoding header either, but their presence is at least somewhat understandable, as they are valid in the MIME Content-Transfer-Encoding header, and HTTP originally shared much of its infrastructure with MIME (think Content-Disposition: attachment).

I guess what I’m getting at here is that the blame for the C-E weirdness lies in large part on the browsers, which could’ve made a clean break and improved the semantics at the same time by using T-E, but instead chose to initiate a chicken-and-egg dilemma out of a desire to support broken HTTP servers from the last century.

(The intended semantics is that C-E, an “end-to-end” header, says “this resource genuinely exists in this encoded form”, while T-E, a “hop-to-hop” header, says “the origin or proxy server you’re using incidentally chose to encode this resource in this form”; this is why sometimes the wrong combination of hacks in the HTTP server and the Web browser will lead you to downloading a tar file when you expected a tar.gz file.)

The use of “gzip” as the compression is also a wart, because it's “deflate” (which is what you want: DEFLATE compression with a checksum) with a useless original-filename (wat?) + original-mtime (double wat?) header stacked on top.


Even though HTTP DEFLATE saves ~20 bytes compared to GZIP, it itself is a wart because of some vendor misunderstandings. HTTP DEFLATE is actually DEFLATE data wrapped in a zlib container, not raw DEFLATE. See https://en.wikipedia.org/wiki/HTTP_compression#Problems_prev... ; https://stackoverflow.com/questions/3932117/handling-http-co...


I just implemented decompression in my HTTP client this week.

I could not test that part because both servers I tried send raw deflate, without the zlib container.
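In case it helps anyone in the same spot: with Python's zlib, the wbits parameter selects the container, so a lenient sketch (just the idea, not what any particular client does) can try the zlib wrapper first and fall back to raw DEFLATE:

    import zlib

    def decode_deflate(body: bytes) -> bytes:
        """Lenient handling of Content-Encoding: deflate.

        Per the spec this should be zlib-wrapped DEFLATE (wbits=15),
        but some servers send raw DEFLATE, which needs negative wbits.
        """
        try:
            return zlib.decompress(body, wbits=zlib.MAX_WBITS)    # zlib container
        except zlib.error:
            return zlib.decompress(body, wbits=-zlib.MAX_WBITS)   # raw DEFLATE

    # gzip bodies would use wbits=zlib.MAX_WBITS | 16 instead,
    # and wbits=zlib.MAX_WBITS | 32 auto-detects zlib vs. gzip.
    raw = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    print(decode_deflate(raw.compress(b"hello") + raw.flush()))   # b'hello'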


The original filename is optional in gzip. It is not included in the response sent by, for example, Apache.

(There is a mandatory MTIME which is included, and an OS byte, but those only waste 5 bytes total. Far less than gzip will typically save.)
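Those fields are easy to see with Python's gzip module; a small sketch, with the byte offsets taken from RFC 1952:

    import gzip, struct

    blob = gzip.compress(b"hello")               # one gzip member; no FNAME, so the header is 10 bytes
    print(blob[:2] == b"\x1f\x8b")               # magic bytes
    print(blob[2])                               # CM = 8 (DEFLATE)
    print(bool(blob[3] & 0x08))                  # FLG.FNAME bit: False, no original filename stored
    print(struct.unpack("<I", blob[4:8])[0])     # MTIME, 4 bytes (0 would mean "not available")
    print(blob[9])                               # OS byte -- MTIME + OS are the 5 bytes mentioned above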


The spec is silent on length. 2000 bytes came from some web servers (old IIS comes to mind) that capped the URL at 2K or something close to that, so extra-long URLs were problematic (and a lot of early web apps went nuts with parameters). The max length, then, is up to the implementer. All I know is that I've had to fix lots of code where someone assumed that 255 characters is all you'll ever need for a URL.


255 characters is the default length for a variable-length string column in databases. So if a developer didn't pay attention, they just used the default, which is in some cases too short for a URL.


> old IIS comes to mind

And MSIE, for which it's a hard limit, not just a default.


There is no single max total URL length. You probably shouldn't enforce one other than to prevent DoS.


I have come across even more issues caused by IRIs used incorrectly in place of URIs by a popular web framework, causing havoc with OAuth redirects.

https://en.wikipedia.org/wiki/Internationalized_Resource_Ide...


> making this is a valid URL: https://!$%:)(*&^@www.netmeister.org/blog/urls.html

Uh, no. "%:)" is not <"%" HEXDIG HEXDIG> nor is % allowed outside of that. (Although your browser will likely accept it)

> This includes spaces, and the following two URLs lead to the same file located in a directory that's named " ":

> https://www.netmeister.org/blog/urls/ /f

> https://www.netmeister.org/blog/urls/%20/f

> Your client may automatically percent-encode the space, but e.g., curl(1) lets you send the raw space:

Uh, no. Just because one of your clients is wrong and some servers allow it doesn't mean it's allowed by the spec.

In fact, the HTTP/1.1 RFC defers to RFC 2396 for the meaning of <abs_path>: a <path_segments> preceded by a /.

What is <path_segments>? A bunch of slash-delimited <segment>s.

What is <segment>? A bunch of <pchar> and maybe a semicolon.

What is <pchar>? <unreserved>, <escaped>, or some special characters (not including space).

What is <unreserved>? Letters, digits, and some special characters (not including space).

What is <escaped>? <"%" hex hex>.

Most HTTP clients and servers are pretty forgiving about what they accept, because other people do broken stuff, like sending them literal spaces. But that doesn't mean it's "allowed", that doesn't mean every server allows it, and that doesn't mean it's a good idea.
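As a rough illustration of that grammar chain, here's RFC 2396's <abs_path> as a quick Python regex sketch (the two test paths are the ones from the article):

    import re

    # pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","   (RFC 2396)
    PCHAR = r"(?:[A-Za-z0-9\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})"
    SEGMENT = rf"{PCHAR}*(?:;{PCHAR}*)*"                  # segment = *pchar *( ";" param )
    ABS_PATH = re.compile(rf"/{SEGMENT}(?:/{SEGMENT})*")  # abs_path = "/" path_segments

    for path in ("/blog/urls/%20/f", "/blog/urls/ /f"):
        print(repr(path), bool(ABS_PATH.fullmatch(path)))
    # '/blog/urls/%20/f' True   -- the percent-encoded space is an <escaped>
    # '/blog/urls/ /f' False    -- a literal space is not a <pchar>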

> That is, if your web server supports (and has enabled) user directories, and you submit a request for "~username": [it does stuff]

Uh, no. If you're using Apache, that might be true. As you mentioned, this is implementation-defined (as are all pathnames).

> Now with all of this long discussion, let's go back to that silly URL from above: ... Now this really looks like the Buffalo buffalo equivalent of a URL.

Not really.

> Now we start to play silly tricks: "⁄ ⁄www.netmeister.org" uses the fraction slash characters

You are aware that URLs predate Unicode, right? Not to mention that Unicode lookalike characters are a Unicode (or UI) problem, not a URL problem?

> The next "https" now is the hostname component of the authority: a partially qualified hostname, that relies on /etc/hosts containing an entry pointing https to the right IP address.

Or on a search domain (which could be configured locally, or through GPO on Windows, or through DHCP!). Or maybe your resolver has a local zone for it. Or maybe ...


Layouts using <table>s are complicated too. For example, this page has a ~7800px-wide <pre> tag in a <table> that's 720px wide.


Specifically using a different font for the code tag than for the rest of the blog, to hide the difference between ⁄⁄ and //, seems weird. I get that it wouldn't be interesting otherwise, but doesn't that just show that it's really not as complicated as you make it out to be?


URLs are not complicated, unless you complicate them.

foo|foo -foo 's^foo^foo^'"">foo 2>>foo

is not a very good example for teaching the structure of the command line.

Pick a better one.

It's simple.


"The average URL" and "what is allowed by the URL specifications" are two very different things. (And the same could be said about your command line example)


It doesn’t seem complicated at all. Complicated to me means difficult to understand. This just involves reading the spec and it all seems pretty simple and consistent.

Complicated doesn’t mean “new to me.” If I haven’t read a man page, that doesn’t mean the command is complicated.


I'm trying to figure out what your point is. Is it a criticism of the article? Are you sharing with us that you are clever? I'm trying to give it a generous interpretation, but I'm having a hard time.

So it's not complicated to you... what made you want to share this?


If we're being sticklers for reading the docs...

> com·pli·cat·ed | ˈkämpləˌkādəd |

> adjective

> 1 consisting of many interconnecting parts or elements


> This just involves reading the spec and it all seems pretty simple and consistent.

Worth noting from the article:

> And this is one of the main take-aways here: while the URL specification prescribes or allows one thing, different clients and servers behave differently.


Even browser developers have made mistakes because of the complexity of the spec, resulting in things like CVE-2018-6128 [0].

[0] https://bugs.chromium.org/p/chromium/issues/detail?id=841105



