Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use [[OriginalFlags]] from within RegExpBuiltinExec? #489

Closed
littledan opened this issue Mar 24, 2016 · 11 comments · Fixed by #494
Closed

Use [[OriginalFlags]] from within RegExpBuiltinExec? #489

littledan opened this issue Mar 24, 2016 · 11 comments · Fixed by #494

Comments

@littledan
Copy link
Member

RegExpBuiltinExec can only be called on objects with a [[RegExpMatcher]] internal slot. It currently uses a combination of flags from [[OriginalFlags]] (to determine fullUnicode, and also from within the matcher) as well as other flags from the "sticky" and "global" properties. It seems to me like this algorithm would be more consistent if [[OriginalFlags]] were also used to determine whether the "y" and "g" flags are set, rather than actually performing a Get. As a side benefit, in V8's implementation, "y" is compiled into the equivalent of the matcher, so the change to use [[OriginalFlags]] would mean that we would not have to compile the RegExp when the flags change.

What do you think? @allenwb @ljharb @bterlson

@allenwb
Copy link
Member

allenwb commented Mar 24, 2016

there's actually a reason for this.

[[RegExpMatcher]] represents the "compiled" matcher that is created by a built-in RegExp constructor. The semantics of such a matcher are defined by 21.2.2. Both the creation of a [[RegExpMatcher]] and its evaluation semantics depend upon the original state of some (but not all) of the flags that were presented to the constructor. The state of the original flags are immutably captured in [[OriginalFlags]]. So, the value of [[OriginalFlags]] is always consistent with the flag observations made when generating the associated [[RegExpMatcher]]. That is also why flag values used within 21.2.2 are derived from [[OriginalFlags]]. Let's call the flags that 21.2.2 actually depends upon the matcher flags. The matcher flags are "u", "m", and "i".

The RegExpBuiltinExec abstract operation is essentially a loop around multiple invocations of a [[RegExpMatcher]]. There are three ways flags are used within RegExpBuiltinExec

  1. matcher flags are directly used within the [[RegExpMatcher]] as specified in 21.2.2
  2. after the iterations completes the value of the "u" flag is used used. The "u" flag is a matcher flag and it is important that the "u" flag state used after the loop is the same that was used by the actual [[RegExpMatcher]]. So, RegExpBuiltinExec obtains the value of the "u" flag from [[OriginalFlags] which is the source also used by [[RegExpMatcher]].
  3. RegExpBuiltinExec uses the value of the "g" and "y" flags to control the actual loop. Note that these are not matcher flags. There are no [[RegExpMatcher]] semantics that depend upon those flags so RegExpBuiltinExec doesn't have to worry about ensuring that it is using the same flag values as [[RegExpMatcher]]. Because there are no such consistency constraints relating to those flags it uses a property [[Get]] prior to the loop to determine the value to use for "g" and "y".

Using a property access to get "global" and "sticky" enables a RegExp subclass to change the policy for managing those flags while still using the builtin matcher. For example, a subclass might allow either or both of them to be modified between exec calls.

This design provides a little bit of flexibility for subclass authors without significantly constraining implementations. You can still merge the RegExpBuiltinExec loop with a [[RegExpMatcher]] is you want. You just need to guard it with "global" and "sticky" checks.

@littledan
Copy link
Member Author

It's exactly this guarding that is difficult for us. Because we compile "y" into the RegExp, we could "guard" for it, but when the guard fails, we will have to compile a different RegExp. This is not impossible; I am just wondering about the application that motivates this dynamic treatment of these particular flags, while it's considered OK for other flags to be static. I have a hard time understanding the use case; a RegExp subclass which is always global or sticky can modify the constructor, and I don't see why a subclass would want to modify these flags between exec calls. What would they get out of that that they couldn't get by using multiple RegExps, as they would need to do if they want to vary, say, the "u" flag between calls?

@msaboff
Copy link
Contributor

msaboff commented Mar 25, 2016

The WebKit JSC implementation sounds similar to V8's. JSC also compiles in the "y" flag into the RegExp [[RegExpMatcher]] and would have to recompile the RegExp if RegExpBuiltinExec is called and the "y" flag has changed from the prior invocation.

Since getting the "sticky" property is observable from within RegExpBuiltinExec, a guard isn't sufficient. The property still needs to be accessed. A guard could trigger a speculative compilation before an exec call.

A related issue is that requiring property access for global and sticky instead of [[OriginalFlags]] means that repetitive RegExp operations like RegExpPrototype[@@match] invoked with a RegExp that has a true "global" property needs to make those property accesses for every invocation of RegExpBuiltinExec. Complicating this is that there are no semantics for when one of these flags changes while looping. What should happen in step 6. e. of 21.2.5.6 if RegExpBuiltinExec's get of "global" and/or "sticky" have changed values?

@geoffreygaren
Copy link

Having just read it now, I have to agree with littledan that this API is pretty darn confusing.

(1) sticky+global vs unicode+multiline+ignoreCase. It is simply inconsistent that the left-hand-side of this list of flags is modifiable while the right-hand-side is not. The fact that this inconsistency happens to be made possible by an artifact of how [[RegExpMatcher]] is specified does not mean that it is good. I believe that inconsistency is bad in API design because it makes an API harder to follow.

(2) Base class vs subclass. The base class built-in regular expression object provides read-only access to its flags (through accessors that provide getters but not setters). The idea that a subclass should take a base class's read-only flags and make them writeable violates the Liskov Substitution Principle -- and is not something I've ever wanted to do in my own code.

Even if all built-in code has been specified to check for these violations of consistency and object oriented design principle, and tolerate them, we're leaving behind a huge trap for any library authors who want to write helper functions that accept regular expression objects as inputs. All such library code will need to change to accommodate this new, odd feature of a regular expression changing its flags (but only some of them) mid-operation.

littledan added a commit to littledan/ecma262 that referenced this issue Mar 25, 2016
This closes tc39#489. In multiple ECMAScript implementations, the
'sticky' flag is compiled into the matcher. Given that
RegExpBuiltinExec already has a RegExp instance, and it makes other
references to [[OriginalFlags]] in that algorithm, it seems
reasonable to refer to [[OriginalFlags]] for these flags as well.
@allenwb
Copy link
Member

allenwb commented Mar 25, 2016

I think it's probably ok to make this change, assuming there aren't any objections from other implementors.

The split treatment of the flags is mostly a remnant of some historic structuring of the the spec. I agree that consistency would be better and the subclassing argument should apply to either all or none of the them. I think there were some intermediate drafts of the ES6 spec where the separate treatment of the flags were significant, but things evolved so that was no longer the case

@msaboff Complicating this is that there are no semantics for when one of these flags changes while looping. What should happen in step 6. e. of 21.2.5.6 if RegExpBuiltinExec's get of "global" and/or "sticky" have changed values?

Sure there are, the semantics is to follow the algorithm steps exactly as specified. That means that each invocation of RegExpExec will use the value of the flags that are observable during that specific invocation. That may not be what you want, but that is the specified semantics. And note that 6.e invokes RegExpExec and not RegExpBuiltinExec. The object it is invoking exec upon is not necessarily a built-in RegExp object. It does not necessarily have readonly flags or any flags at all. You may implement a specialization of the built-in @@match method for the cases where the regexp obj is a conforming instance of RegExp. But for the general case you have to follow the specification steps exactly.

@geoffreygaren
Subclass extensibility in a dynamically typed language isn't Liskov Substitution. Or to quote the title of a famous paper: inheritance is not subtyping

littledan added a commit to littledan/ecma262 that referenced this issue Mar 26, 2016
This closes tc39#489. RegExpBuiltinExec reads other flags from
[[OriginalFlags]], based on the fact that it is operating on a RegExp
instance. In ES2015, however, it accessed 'global' and 'sticky' via
property accesses. This patch changes RegExpBuiltinExec to access
those flags via [[OriginalFlags]] as well.
@tschneidereit
Copy link
Member

This would also simplify the implementation in SpiderMonkey quite a bit and enable some more optimizations.

I also agree that the flags should be treated in a consistent manner regardless of any implementation considerations.

So, a wholehearted +1.

@bterlson
Copy link
Member

Seems fine, I am also happy to delete some code.

@msaboff
Copy link
Contributor

msaboff commented Mar 26, 2016

During the Nov 2015 TC-39 meeting, @littledan brought up ”Simplification of ES2015 RegExp Semantics which was rejected. IIRC, @allenwb shared some slides with a broader philosophy of the RegExp portion of ES2015 . I cannot find minutes covering that discussion or any slides for @allenwb.

I think it would help focus the discussion on this item next week if the committee had those minutes and slides available before the meeting next week.

@msaboff
Copy link
Contributor

msaboff commented Mar 28, 2016

@msaboff Complicating this is that there are no semantics for when one of these flags changes while looping. What should happen in step 6. e. of 21.2.5.6 if RegExpBuiltinExec's get of "global" and/or "sticky" have changed values?

Sure there are, the semantics is to follow the algorithm steps exactly as specified. That means that each invocation of RegExpExec will use the value of the flags that are observable during that specific invocation. That may not be what you want, but that is the specified semantics. And note that 6.e invokes RegExpExec and not RegExpBuiltinExec. The object it is invoking exec upon is not necessarily a built-in RegExp object. It does not necessarily have readonly flags or any flags at all. You may implement a specialization of the built-in @@match method for the cases where the regexp obj is a conforming instance of RegExp. But for the general case you have to follow the specification steps exactly.

Looking at RegExpBuiltinExec's algorithm in conjunction with the algorithm described for 21.2.5.6 RegExp.prototype[@@match] and it appears that the combination of the two will never terminate in at least one case.

Consider a RegExp derived object that does not override exec, with a pattern of /a/, and the argument string of "aa". It does have a "global" getter that initially returns true, but all subsequent calls to that getter return false. At step 4 of 21.2.5.6, global will be set to true. This will result in evaluating step 6 and the loop described in 6.e. At step 6b, lastIndex will be set to 0. When RegExpExec is called at 6.e.i., it ends up invoking 21.2.5.2.2 RegExpBuiltinExec. Since the "global" getter now returns false and assuming that a get of "sticky" also returns false, lastIndex will be set to 0. At step 12.b., we'll invoke the internal [[RegExpMatcher]] with the starting position of 0. The matcher will return a match starting at position 0 and ending at position 1. Since global and sticky are false, we won't update lastIndex in step 15.a. We'll end up returning back to the RegExp.prototype[@@match] loop an array with the string "a" as element 0. We'll add this string to array A and loop. The only exit to the process is an out of memory exception or hitting some internal timeout.

I think this also supports the proposal that RegExpBuiltinExec should use [[OriginalFlags]] to access "global" and "sticky" instead of calling a getter.

littledan added a commit to littledan/ecma262 that referenced this issue Mar 28, 2016
This closes tc39#489. RegExpBuiltinExec reads other flags from
[[OriginalFlags]], based on the fact that it is operating on a RegExp
instance. In ES2015, however, it accessed 'global' and 'sticky' via
property accesses. This patch changes RegExpBuiltinExec to access
those flags via [[OriginalFlags]] as well.
@littledan
Copy link
Member Author

At the March TC39 meeting, the change to reference [[OriginalFlags]] here achieved consensus.

bterlson pushed a commit that referenced this issue Apr 7, 2016
)

This closes #489. RegExpBuiltinExec reads other flags from
[[OriginalFlags]], based on the fact that it is operating on a RegExp
instance. In ES2015, however, it accessed 'global' and 'sticky' via
property accesses. This patch changes RegExpBuiltinExec to access
those flags via [[OriginalFlags]] as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants