Parsing of backslashes in custom string literals seems inconsistent #22926

Closed
Keno opened this issue Jul 23, 2017 · 29 comments

@Keno
Member

Keno commented Jul 23, 2017

julia> parse(readline(STDIN))
raw"\""

:(@raw_str "\"")

julia>

julia> parse(readline(STDIN))
raw"\\""
:($(Expr(:incomplete, "incomplete: invalid string syntax")))

julia> parse(readline(STDIN))
raw"\\\""
:(@raw_str "\\\\\"")

julia> dump(ans)
Expr
  head: Symbol macrocall
  args: Array{Any}((2,))
    1: Symbol @raw_str
    2: String "\\\\\""
  typ: Any

In particular, there is confusion about what a backslash actually does. It certainly escapes ", but as the second example shows, it also escapes itself. In the third example, however, it no longer appears to escape itself. As a result, it does not currently seem possible to write a string literal that parses to @raw_str "\\\"".

The options I see are

  1. Make raw"\\"" parse to @raw_str "\\\"" (rather than an error)
  2. Make raw"\\\"" parse to @raw_str "\\\"".
  3. Decouple parsing from the content and always give what was actually written, i.e. make raw"\"" parse to @raw_str "\\\"" and raw"\\\"" parse to @raw_str "\\\\\\\""
  4. Disallow " in custom string literals entirely.
  5. Do nothing.

FWIW, I think I prefer option 3.

Whatever the decision with respect to parsing, I think we should adjust raw_str to have the invariant that whenever print(raw"<x>") is a valid expression for some sequence of characters <x>, the output of that expression is <x>. That's currently not the case in the above example:

julia> print(raw"\\\"")
\\"
julia> print(raw"\"")
"

This discrepancy was noted when raw_str was originally introduced and is documented, but it bothers me.

@Keno added the kind:breaking, needs decision, and parser labels Jul 23, 2017
@JeffBezanson
Sponsor Member

I think it should work like normal backslash escapes, but only allowing backslash and quote to be escaped.

@Keno
Member Author

Keno commented Jul 23, 2017

Ok, so that would be option 2 in effect?

@stevengj
Member

stevengj commented Jul 24, 2017

I thought that the point of raw strings was that backslashes don't need to be escaped either, e.g. raw"\n" == "\\n"? So the only thing that needs to be escaped is " or maybe a backslash preceding the closing "?

i.e. essentially option 1

@stevengj
Member

It's pretty important not to require escaping of backslashes in custom string literals, as otherwise regex literals become much messier. (There is also the LaTeXStrings package example.)

@Keno
Member Author

Keno commented Jul 24, 2017

Your observations are correct of course. Do you have a preference as to what to do?

@stevengj
Member

I don't understand option 3. If quotes aren't escaped, how can you tell where the end of the string is?

e.g. how can you tell whether raw"\" is the complete string "\\" or a string that hasn't ended yet?

@stevengj
Member

I just ran into this in bramtayl/BibTeX.jl#3: it seems to be extremely hard to type \" (a backslash followed by a quote) in a raw literal:

julia> "\\\"" # the desired string, manually escaped
"\\\""

julia> raw"\"" # wrong string, no backslash
"\""

julia> raw"\\"" # does not parse
       ^C

julia> raw"\\\"" # wrong string, too many backslashes
"\\\\\""

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Sep 29, 2017

The semantics of raw string literals are not general in the sense that there are strings which cannot be input that way. The string with contents \" is one of them. If you want full generality, you need real escaping semantics.

@JeffBezanson
Sponsor Member

JeffBezanson commented Sep 29, 2017

I think we should fix this. If raw strings support any kind of escaping with \, it needs to be possible to escape \ itself.

@JeffBezanson
Sponsor Member

Re-reading the thread, I see allowing lots of backslashes in strings is part of the goal of raw strings, so we shouldn't require escaping them.

Option 1 seems good, but then it's impossible to write a string ending in a backslash.

Option 4 is also pretty good: we could have no escaping at all in raw strings, and you can include quote characters using raw""" ... """. Then it's impossible to write a string containing three quotes in a row, but that seems like a pretty mild restriction.

@StefanKarpinski
Sponsor Member

The real issue is picking a parsing strategy for custom string literals such that you can both implement normal string behavior and raw string behavior.

@JeffBezanson
Sponsor Member

I'm not sure that's possible.

@StefanKarpinski
Sponsor Member

Then I think that supporting normal strings needs to take priority since most string literals are closer to normal than they are like raw strings. Not sure there's anything to be done here.

@JeffBezanson added the status:triage label Sep 29, 2017
@JeffBezanson
Sponsor Member

I propose making only the sequence \" special (for escaping double quotes). The only thing that disallows is strings ending in backslash.

@JeffBezanson
Sponsor Member

Another possibility: convert \\ to a single backslash, \" to a quote, and \x to \x for any other x.

@StefanKarpinski
Sponsor Member

Coming full circle here, the current leading proposal is that \\ --> \, \" --> " and \x --> \x for any other x. This would allow normal string escaping to be implemented by an additional pass.

@JeffBezanson added this to the 1.0 milestone Oct 5, 2017
@JeffBezanson removed the status:triage label Oct 5, 2017
@stevengj
Member

stevengj commented Oct 5, 2017

Wouldn't that mean that you would need \\\\ to type a literal backslash in a regex?

@Keno
Member Author

Keno commented Oct 5, 2017

\\\ if what comes after is not a backslash. I don't like it either, though. There's also @vtjnash's proposal of only allowing escaping in sequences that precede a ".

@StefanKarpinski
Sponsor Member

r"\\" would go through as a single backslash.

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Oct 5, 2017

@vtjnash's proposal should be spelled out explicitly here if it wants to be considered.

@JeffBezanson
Sponsor Member

@vtjnash's full proposal is that any number of backslashes followed by a quote is special: if the number of backslashes is even, you get n/2 backslashes in the string followed by end-of-string. If the number of backslashes is odd, you get a quote character instead of end-of-string.

@StefanKarpinski
Sponsor Member

If the number of backslashes is odd, you get a quote character instead of end-of-string.

Preceded by how many backslashes?

@Keno
Member Author

Keno commented Oct 7, 2017

div(n,2)

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Oct 9, 2017

That makes more sense – previous explanations made it seem like it would be n, which seemed insane to me. So the actual @vtjnash proposal is:

  • Any number of \ preceding a non-quote is passed through verbatim.
  • 2n times \ before a quote becomes n times \ and end of string.
  • 2n+1 times \ before a quote becomes n times \ and a literal quote.

@vtjnash
Sponsor Member

vtjnash commented Oct 9, 2017

That's correct

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Oct 26, 2017

Here's a summary of the options:

Current behavior

\" --> "
\x --> \x for any other x (including \)
\\" --> \ + end string

Drawback: cannot express strings containing \"

Jeff’s proposal

\" --> "
\\ --> \
\x --> \x for any other x

Drawback: cannot express trailing backslash in string.

Jameson’s proposal

\ --> \
\" --> "
\\" --> \ + end string
\\\" --> \"
\\\\" --> \\ + end string
\x --> \x for any other x
\\x --> \\x for any other x

Drawback: kinda weird.

Examples:

raw"C:\dir" --> C:\dir
raw"\\drive\dir" --> \\drive\dir
raw"\\drive\\" --> \\drive\
raw"\"quo\ted\"" --> "quo\ted"
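
The rules of Jameson's proposal can be sketched as a small scanner. This is an illustration only, assuming Julia 1.x string-indexing functions; `scan_raw_sketch` is a hypothetical name, not a Base function and not the parser's actual implementation. It takes the source text that follows the opening quote and returns the resulting contents together with the index of the closing quote.

```julia
# Hypothetical helper (not part of Base): applies the "backslashes are only
# special before a quote" rules to the text following the opening quote.
function scan_raw_sketch(src::AbstractString)
    out = IOBuffer()
    i = firstindex(src)
    while i <= lastindex(src)
        c = src[i]
        if c == '\\'
            # Count the run of consecutive backslashes starting at i.
            n = 0
            j = i
            while j <= lastindex(src) && src[j] == '\\'
                n += 1
                j = nextind(src, j)
            end
            if j <= lastindex(src) && src[j] == '"'
                # The run precedes a quote: emit div(n, 2) backslashes.
                write(out, "\\"^div(n, 2))
                if isodd(n)
                    write(out, '"')              # odd run: literal quote
                    i = nextind(src, j)
                else
                    return String(take!(out)), j # even run: quote closes the string
                end
            else
                # The run precedes a non-quote (or end of input): verbatim.
                write(out, "\\"^n)
                i = j
            end
        elseif c == '"'
            return String(take!(out)), i         # unescaped quote closes the string
        else
            write(out, c)
            i = nextind(src, i)
        end
    end
    error("incomplete: no closing quote found")
end

# Inputs are ordinary escaped Julia strings holding the characters a user
# would type after `raw"`, including the closing quote.
println(first(scan_raw_sketch("C:\\dir\"")))               # C:\dir
println(first(scan_raw_sketch("\\\\drive\\\\\"")))         # \\drive\
println(first(scan_raw_sketch("\\\"quo\\ted\\\"\"")))      # "quo\ted"
```

The inputs are written as normal string literals so that the sketch does not depend on how `raw"..."` itself is parsed.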

Keno’s proposal

Pass string literal content through verbatim.
Provide a standard unescaping helper function.

Drawback: massively breaking; requires calling the helper function in every string macro.

@StefanKarpinski
Sponsor Member

Yeah, I suppose Jameson's proposal is probably the way to go here. I can't think of another approach that has a better combination of passing things through as literally as possible and allowing any possible string to be expressed somehow.

@ylxdzsw
Contributor

ylxdzsw commented Jul 29, 2018

Can I add an argument for option 4, since the current solution unfortunately seems too complex?

We could extend the triple-quote syntax to n quotes where n>=3, as in GFM. That is, a block string literal can start with an arbitrary number of quotes as long as the closing fence is exactly as long as the opening one. That way we need no escaping at all in raw strings, yet can still express strings that have 3 or more quotes in a row.

A good thing is that, unlike Python or JavaScript, we currently ignore the first line break in block string literals. So if we allow an arbitrary number of opening quotes, we can still express strings that start with several quotes without them fusing into the opening quotes: just add a line break in this edge case.

With proper syntax highlighting, it's easy to identify the content, since it can be colored differently from the opening and closing quotes. On the other hand, trying to figure out the actual RegExp from [/////"] hurts my eyes.
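
As a rough illustration of the n-quote fence idea (not existing Julia syntax), the sketch below reads one fenced literal from a source string. `read_fenced_sketch` is a hypothetical name, and the code assumes Julia 1.x, that the first run of n quotes after the opening fence closes the literal, and that only a single line break directly after the opening fence is dropped.

```julia
# Hypothetical sketch of reading an n-quote fenced literal (n >= 3).
function read_fenced_sketch(src::AbstractString)
    # Count the opening run of quotes.
    n = 0
    i = firstindex(src)
    while i <= lastindex(src) && src[i] == '"'
        n += 1
        i = nextind(src, i)
    end
    n >= 3 || error("expected an opening fence of at least three quotes")
    # Ignore a line break directly after the opening fence.
    if i <= lastindex(src) && src[i] == '\n'
        i = nextind(src, i)
    end
    # Assumption: the first run of n quotes after this point is the closing fence.
    fence = "\""^n
    r = findnext(fence, src, i)
    r === nothing && error("incomplete: no closing fence of $n quotes")
    return src[i:prevind(src, first(r))]
end

# A four-quote fence lets the content contain runs of three quotes, with no
# escaping inside the literal itself.
src = "\"\"\"\"\nsay \"\"\"hi\"\"\"\n\"\"\"\""
print(read_fenced_sketch(src))
```

Content that ends in a quote directly before the closing fence would need a rule this sketch does not define.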

@StefanKarpinski
Sponsor Member

Way too late.
