Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the fancy_regex crate instead of regex #191

Closed
ethindp opened this issue Aug 28, 2023 · 18 comments
Closed

Use the fancy_regex crate instead of regex #191

ethindp opened this issue Aug 28, 2023 · 18 comments

Comments

@ethindp
Copy link

ethindp commented Aug 28, 2023

Would it be possible to use the fancy_regex crate over regex? It adds features like lookbehind/lookahead, which a grammar I'm writing needs (I need to ensure an identifier is (not) a keyword). It also supports character classes.

@jsinger67
Copy link
Owner

Hi @ethindp,
Thanks for your interest in parol.
Actually parol uses the crate regex-automata for the realization of it's lexers. Anyway it isn't easy to change the used regex crate for several reasons. One of them is runtime performance.
But maybe I can help you with some useful ideas to solve your tasks.
To ensure that an identifier is not a keyword you have to define all keywords before the terminal of your identifier. This prioritizes the detection of keywords.
Character classes are supported by regex-automata out of the box.
Please let me know if I can help you any further.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

@jsinger67 Thank you for that, I didn't know that regex-automata allowed that already. I completely understand the performance angle. I'm looking forward to giving Parol a try to see if it can handle (or help me help it handle) the particular grammar I have (which is quite complicated).

@jsinger67
Copy link
Owner

Parol is definitely worth a try 😄
And I can give you support.
Also the book should be a helpful reference.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

Thanks! (Parol is attempting to comulcate k for my grammar.... It's taking a while to do that so I honestly am wondering if K might be infinite... I hope not but if so, I might need to left-factor.) The grammar I'm working with is ambiguous (I'm working on writing an Ada compiler) but getting it to parse would be great!

@jsinger67
Copy link
Owner

jsinger67 commented Aug 29, 2023

Try to start with small pieces.
To commission a large grammar can be time consuming.
Try to minimize k as much as possible.
Left factoring is done by parol. For this please have a look at the generated exp.par file.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

@jsinger67 You aren't the only one to tell me to do this, and I'm trying to do that. But the grammar (mainly the name rule) is ambiguous and left-recursive, and almost everything (minus a few constructs) depends on that rule. I wonder if left-factoring can break that cycle.

@jsinger67
Copy link
Owner

One good technique is to comment out some less important alternatives of a certain non-terminal near the start symbol of your grammar. This will temporarily cut off a great portion of your grammar.

I can offer you to have a look at your grammar.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

Any help would be appreciated. The majority of left-recursive rules are indirectly left-recursive; the rule they all have in common is name. (I've tried to refactor this into a name_tail rule already.) I've attached the grammar. The name rule is on line 315.
alang.txt

@jsinger67
Copy link
Owner

I will have a look at this soon.
First impression:
Lots of productions with "name" on their RHS. This is actually redundant and results in larger equation systems parol has to solve.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

@jsinger67 I know, I'm planning on eliminating those later once I get this to work. The name rule is a tangled mess.

@ethindp
Copy link
Author

ethindp commented Aug 29, 2023

At least, the ones I can (e.g. reference_object_name). Might also be able to take care of prefix too.

@jsinger67
Copy link
Owner

@ethindp, there are about twenty recursive non-terminals in this grammar.
I have to convert them manually into a non-recursive form. This will take a while but should be a good practice for me 😉.
Or, even better, I will provide support for this kind of problem somehow in parol itself.
The transformation of a traditional left-recursive grammar into a right-recursive one seems to be an underestimated use case.

@jsinger67
Copy link
Owner

jsinger67 commented Aug 30, 2023

I removed all left-recursions successfully.
To convert indirect left-recursions into direct ones I had to "inline" some productions so that they don't exist as standalone ones anymore. I commented these changes thoroughly.
Anyway, the grammar still has some flaws.
Parol decries some unreachable non-terminals which might be a problem.
Please note that I renamed the start symbol from “compilation” to “ada”. You can, of course, revert that.
Feedback appreciated. Have fun. 😎

ada.txt

@ethindp
Copy link
Author

ethindp commented Aug 30, 2023

Thanks! I'll look at the changes -- I'll definitely look at modifying the grammar further to get it to stop complaining. (It'd be nice if LL(k) supported direct/indirect left recursion, but from what I understand this inherently isn't possible.)

@ethindp ethindp closed this as completed Aug 30, 2023
@jsinger67
Copy link
Owner

You're right. LL parsers can't handle left recursions.
To remove them automatically is possible but can result in major redrafts of the original grammar.
I'll definitely think this over and try to provide more support for users here.

@ethindp
Copy link
Author

ethindp commented Aug 30, 2023

If it wouldn't take a significant amount of time, could you do it during code generation? (Also, another alternative that could be used to disambiguate things, if necessary, would be something like an interception function that you define that's called on rules you know are ambiguous, which would allow you to influence which path is taken based on arbitrary lookahead; Coco/R does this, and it's quite neat.)

@jsinger67
Copy link
Owner

@ethindp
I tried hard to minimize your grammar.
Especially, I did some aggressive inlining of productions if they were in the form of
A: B;
Then I substituted all occurrences of A by B and removed that production.
I did this recursively and finally there showed up some ambiguous productions in the form of
X: Y | Y | Y | Z;
which I minimized to
X: Y | Z;
At some final state I had an expanded grammar with still 1435 productions.
The equation systems to solve for FOLLOW(k) then still contained 2157 equations and the equation system for FIRST(k) had a size of 1436 equations.
This is much but normally not a problem if k is limited. For instance, the grammar of veryl has equation system sizes of 1161 resp. 813, but k is 3 and thus the grammar automaton is completely computable.

The computation of your grammar for a limit of k <= 3 finished after about ten minutes on my machine with no solution found. Maybe some very powerful computer could calculate a solution with k in the range of 5..10.

To make a long story short. I think, parol is not able to handle such grammars effectively at this stage of its development.
The top-down parsing strategy it uses needs a grammar with finite (and not too high) value of lookahead.
I'm sorry I don't have better news.

@ethindp
Copy link
Author

ethindp commented Sep 4, 2023

@jsinger67 It's alright, I completely understand. I've been struggling to find a good way of implementing this grammar for a while now. I've thought about using parser combinators, but the way Nom and such work don't exactly give me confidence in my ability to translate the grammar into a form that Nom/combine/pom would like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants