Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document how to parse indentation based grammars #20

Open
MTRNord opened this issue Apr 25, 2018 · 14 comments
Open

Document how to parse indentation based grammars #20

MTRNord opened this issue Apr 25, 2018 · 14 comments

Comments

@MTRNord
Copy link

MTRNord commented Apr 25, 2018

Hi I want to check if the prevoius char is a :. The exact example code I have is this:

Prefix: !

I want basicly to be able to check based on if the previous char is a : the ! should get a string and later reuse the look back logic to check if I am inside an array which is defined like in yaml meaning to make sure the last 2 lines of the following example don't get into the array:

Commands:
    - Trigger: afk
      Response: "{{user}} just went afk"
    - Trigger: [join_event]
      Response: "Welcome to the Room {{last_joined_user}}"
Homeserver: matrix.org
HomeRoom: {{name}}

(Commands defines a Array with 2 objects in it which both have two key value pairs in them. HomeServer and HomeRoom are outside of that array)

@alecthomas
Copy link
Owner

You cannot perform look-behind with Participle.

If you're trying to parse YAML, you will almost certainly need a custom lexer that includes indentation as a first-class token.

If you could describe and give examples of both situations where the ! is used that require the look-behind, I might be able to help further.

@MTRNord
Copy link
Author

MTRNord commented Apr 27, 2018

The ! is just any char possible. I wanted to stringify anything after the : before a new line. But thats something I have no problem with if that is impossible.

The yaml like array is something I woud like to be able to have. In that case of arrays a look behind would be nice or as you mentioned a way to check the identation to see if I am inside an array.

@alecthomas
Copy link
Owner

I more meant that if you show me an example of what you're trying to achieve I may be able to provide you with a grammar that will parse it.

@MTRNord
Copy link
Author

MTRNord commented Apr 27, 2018

Oh ok. I can do that later this day.

@MTRNord
Copy link
Author

MTRNord commented Apr 27, 2018

So my current code is at https://github.com/Nordgedanken/Matrix-DSL I tried to have a custom lexer based on participle but as you mentioned look-behind failed (nil pointers^^). You can find the gramar I am using at https://github.com/Nordgedanken/Matrix-DSL/blob/master/cmd/lexer/ast.go

The full example I currently have is:

[[BOT]]
Name: Test Bot
Description: Some Demo Bot
Prefix: !
Autojoin: true
Commands:
    - Trigger: afk
      Response: {{user}} just went afk
    - Trigger: [join_event]
      Response: Welcome to the Room {{last_joined_user}}
Homeserver: matrix.org
HomeRoom: {{name}}

So lets break up how I wanted to have this parsed:

  1. [[BOT]] is a header for the type of config. This already works
  2. From "Name" to "Autojoin" there are 4 key value pairs which I want to parse into a string. This currently fails as the default lexer requires me to have " around strings.
  3. "Commands" starts a Array. This is exact the same as yaml defines it. But with the same type of strings as mentioned in 2. The first line of a object has a - as a prefix. The following key value pairs inside the object are on the same "level" (have the same amount of spaces in front)
  4. [join_event] are special keys that are predefined.
  5. Strings with {{}} around them are also special keys but reflect either predefined variables or variables from the Config. (these can be parsed after the work of participle just fine as they get replaced anyway)
  6. "Homeserver" and "HomeRoom" are like the pairs in 1 but are after the array means I need some kind of tracking if we left the array.

@MTRNord
Copy link
Author

MTRNord commented Apr 27, 2018

I hope thats detailed enough or atleast the right thing

@alecthomas
Copy link
Owner

If I understand correctly, this will not be possible to achieve with Participle because it isn't a context-free grammar (CFG). It looks similar to YAML, which requires context across lines in order to track the current indentation level.

I will update the README to clarify that Participle can only be used to parse CFGs.

Sorry about that :(

@alecthomas
Copy link
Owner

Actually I'm going to correct myself here and go back to my original suggestion: this will require a custom lexer.

Basically, each time the text is indented the lexer will need to emit an Indent token and update its state to track that indentation level. When unindentation occurs the lexer will need to emit an Unindent token.

The grammar will then need to match against these tokens.

eg. (this will not work but might give you a starting point)

type Entry struct {
  Key string `@Ident ":"`
  Value *Value `@@`
}

type Value struct {
  Array []*Value `Indent @@ { @@ } Unindent`
  Boolean *bool `| ( @"true" | "false")`
}

You might also need #3 to be solved.

@MTRNord
Copy link
Author

MTRNord commented Apr 30, 2018

@alecthomas I will try that. Thanks for the suggestion and for the help!

@alecthomas alecthomas changed the title How do I check if previous char is a : Document how to parse indentation based grammars May 7, 2018
@alecthomas alecthomas reopened this May 7, 2018
@apparentlymart
Copy link

apparentlymart commented Jul 18, 2018

In case it's helpful, I've written lexers for a few indentation-sensitive languages, and found a pattern that I found helpful:

I usually first write a "normal" lexer which (unusually) considers a newline as an explicit token and considers a sequence of consecutive other whitespace characters as an explicit token. This might give you a token stream like this for part of the example above:

  • Ident Commands
  • Punct :
  • Newline \n
  • Spaces (four)
  • Punct -
  • Ident Trigger
  • Punct :
  • Spaces (one)
  • Ident afk
  • Newline \n
  • Spaces (six)
  • Ident Response
  • (etc)

With that lexer working, you can then wrap it in another lexer that consumes tokens from the first while tracking a stack of currently-open indent levels, passing through verbatim any tokens that aren't Spaces and synthesizing Indent and Dedent tokens as needed between Newline tokens and whatever follows them.

It's important to keep a stack of indent levels that are "open" so you can generate the right number of synthetic Dedent tokens if multiple indented blocks all close at once.

I've not tried to do the above with participle.Lexer yet, but since its interface is to just pull one token at a time implementing the above shouldn't be too tricky with the indent stack and a state flag to deal with the need to "look behind" and see if the previous token was a newline.

I'm sure some or all of the above will be familiar to some readers but I thought I'd leave this here in case it's useful to someone who finds this issue in future. Doing the indent processing separately means that a more "normal" lexer (e.g. EBNF-based) can be used for most of the scanning, as long as it doesn't filter out the whitespace when generating tokens.

@alecthomas
Copy link
Owner

alecthomas commented Jul 18, 2018

@apparentlymart Nice idea about composing the lexers 👍

It seems like it should be possible to provide a generic IndentAwareLexer using an approach like that, assuming the underlying lexer conforms to emitting Newline and Spaces tokens, as you describe.

@apparentlymart
Copy link

That could be useful, indeed! Another detail I meant to mention: indent-aware grammars often have certain syntax constructs where indentation-awareness is temporarily suspended.

For example, Python's lexer suspends generation of Indent and Dedent when there is a nonzero number of bracketing delimiters open.

This can also be handled in a wrapping lexer like I was describing, but it requires a little more state tracking (a stack of open brackets, or just a count if you're not concerned about handling incorrect nesting) and recognizing those open and close bracket tokens. The hypothetical IndentAwareLexer could potentially handle this too by taking in its configuration a set of token pairs that are to be treated as suspending the synthetic Indent and Dedent token emission.

@pyrossh
Copy link

pyrossh commented Mar 7, 2020

@apparentlymart thanks. Took me a few tries to figure it out. If someone wants to know how a basic one is done. Its here https://github.com/pyros2097/pine/blob/master/compiler/lexer.go. Its still work in progress though and is not final.

@alecthomas
Copy link
Owner

I have a need for an indenting lexer in a side project I'm working on. It seems to work quite well so far. Once it's a bit more battle tested I think I'll include it in Participle itself, but for now here it is:

// Package indenter provides a lexer that inserts INDENT and DEDENT
// tokens based on indentation.
//
// It relies on the parent lexer to provide an "NL" token that matches
// newlines followed by whitespace.
package indenter

import (
	"io"
	"maps"
	"strings"

	"github.com/alecthomas/participle/v2/lexer"
)

func New(parent lexer.Definition) lexer.Definition {
	out := maps.Clone(parent.Symbols())
	if _, ok := out["NL"]; !ok {
		panic("parent lexer must have an NL symbol similar to \"[\\n\\r]+[\\s\\t]*\"")
	}
	if _, ok := out["Indent"]; ok {
		panic("parent lexer must not have an Indent symbol")
	}
	if _, ok := out["Dedent"]; ok {
		panic("parent lexer must not have a Dedent symbol")
	}
	next := lexer.TokenType(-1)
	for _, t := range out {
		if t <= next {
			next = t - 1
		}
	}
	out["Indent"] = next
	out["Dedent"] = next - 1
	return &indentLexerDef{parent: parent, symbols: out}
}

type indentLexerDef struct {
	parent  lexer.Definition
	symbols map[string]lexer.TokenType
}

func (i *indentLexerDef) Symbols() map[string]lexer.TokenType {
	return i.symbols
}

func (i *indentLexerDef) Lex(filename string, r io.Reader) (lexer.Lexer, error) {
	lex, err := i.parent.Lex(filename, r)
	if err != nil {
		return nil, err
	}
	return &indentLexer{
		lexer:      lex,
		nlType:     i.symbols["NL"],
		indentType: i.symbols["Indent"],
		dedentType: i.symbols["Dedent"],
	}, nil
}

var _ lexer.Definition = (*indentLexerDef)(nil)

type indentLexer struct {
	nlType     lexer.TokenType
	indentType lexer.TokenType
	dedentType lexer.TokenType
	indents    []string
	buffered   []lexer.Token
	lexer      lexer.Lexer
}

func (i *indentLexer) Next() (lexer.Token, error) {
	if len(i.buffered) > 0 {
		t := i.buffered[0]
		i.buffered = i.buffered[1:]
		return t, nil
	}

	t, err := i.lexer.Next()
	if err != nil || t.Type != i.nlType {
		return t, err
	}

	pos := t.Pos
	i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.nlType, Value: "\n"})
	pos.Line++
	pos.Column = 1

	// At this point we have an NL symbol. Figure out if we need to
	// insert Indent tokens or Dedent tokens.
	indent := strings.TrimLeft(t.Value, "\n\r")
	for j, ind := range i.indents {
		if strings.HasPrefix(indent, ind) {
			indent = indent[len(ind):]
			pos.Column += len(ind)
			pos.Offset += len(ind)
		} else {
			k := len(i.indents) - j
			i.indents = i.indents[:j]
			for ; k > 0; k-- {
				i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.dedentType, Value: "⇤"})
			}
			break
		}
	}
	if indent != "" {
		i.indents = append(i.indents, indent)
		i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.indentType, Value: "⇥"})
	}
	t = i.buffered[0]
	i.buffered = i.buffered[1:]
	return t, nil
}

var _ lexer.Lexer = (*indentLexer)(nil)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants