Skip to content

A NFA regex engine with a web app to visualize the matches

License

Notifications You must be signed in to change notification settings

DanielBV/RegexEngine

Repository files navigation

Regex Engine

A regex engine based on NFA. The project also includes a webpage to visualize the generated NFA and test the results: https://danielbv.github.io/RegexEngine/.

Disclaimer: This project is for educational projects and shouldn't be used in real development.

Usage

First match:

const regex = new NFARegex("[a-zA-Z][a-zA-Z0-9_]*");
const match = regex.findFirstMatch("   myVar  ");
console.log(match.group(0)); //myVar

All matches:

const regex = new NFARegex("[a-zA-Z][a-zA-Z0-9_]*");
const matches = regex.findAllMatches("Remember Alf? He’s back, in pog form.");
console.log(matches.map(x => x.group(0))); //  [ "Remember", "Alf", "He", "s", "back", "in", "pog", "form" ]

If the regex is invalid it'll throw a parsing error.

Tests

The tests are built with jest.

npm run test

About the page

This project includes a page to test and visualize the regex. Regex engine page It also has an animation page (/animation.html) to visualize the NFA in action. Regex engine working animation

Built with

  • React
  • React-bootstrap
  • Monaco - Input editor
  • Viz.js - To create the NFA diagrams

Execute the web

npm install
npm start

Generate a static site

The main reason I did the engine in Javascript was to be able to create a static site and publish it on github pages. The project has github actions configured to generate the page automatically when it detects a push in the main branch.

To build the page manually, run:

npm run build

About the implementation

I intend to do several blog posts about the implementation of the regex engine. I'll update this with the links when it's ready.

Supported syntax

Character classes

Character classes match a set of characters:

  • .: Match any single character (except linebreaks).
  • \d: Match a single digit.
  • \D: Match any single character that isn't a digit.
  • \w: Match any word character ([A-Za-z0-9_])
  • \W: Match any character that isn't a word character ([^A-Za-z0-9_])
  • \s: Matches a single whitespace character
  • \S: Matches any single non-whitespace character
  • [ab]: Matches a or b
  • [a-z]: Matches any single character between a and z.
  • [a-zA-Z_]: Matches the character _ or any single character that is inside the ranges a-z and A-Z
  • [^a-zA-Z_]: Matches any single character that isn't _ and isn't inside the ranges a-z and A-Z

Quantifiers

Quantifiers specify how many times a part of the regular expression should be repeated.

  • * – match zero or more times
  • + – match one or more times
  • ? – match zero or one time

Quantifiers are greedy by default, so they'll match as much input as they can. To make them lazy add a ? at the end: *?, +? and ??

A regex like .*(\w+).* would fail to capture complete words because .* is greedy. For the input aaa the regex (\w+) would only match one a. To fix this make both .* lazy: .*?(\w+).*?

Capturing groups

Capturing groups allow to extract the part of the input that matched the group. To create a capturing group just enclose a subexpression between parenthesis: (subexpression)

Regex (\w+)@(\w+)\.(\w+)
Input [email protected]
Result Group 0: [email protected]
Group 1: myemail
Group 2: gmail
Group 3: com

The group 0 always contains the full text that matched. To access the value of the groups do match.group(name). You can also set a name for the group instead of a increasing number using (?<name>subexpression) :

Regex (?<user>\w+)@(?<domain>\w+)\.(?<extension>\w+)
Input [email protected]
Result Group 0:               [email protected]
Group user:          myemail
Group domain:    gmail
Group extension: com

If you want to group a expression but don't want to capture it, use a non capturing group (?:subexpression)

Regex (?:a+)(b+)
Input aaabbb
Result Group 0: aaabbb
Group 1: bbb

Alternation:

  • a|b: Matches a or b

Character escapes

There are multiple characters that have a special meaning. If instead you want to match that especific character you can escape it:

  • .: Matches any character (except line breaks)
  • \.: Matches the dot character

Anchors

Anchors are a bit weird because they don't match characters. Instead they match positions:

  • ^: Matches the beginning of the string
  • $: Matches the end of the string

Note: Some regex engines have multiline modes, in which the meaning of these anchors get expanded. Sadly this engine hasn't that mode :(.

Without anchors:

Regex (findAllMatches) \d+
Input 10 20 30
Result Match 0: 10
Match 1: 20
Match 1: 30

With ^ anchor:

Regex (findAllMatches) ^\d+
Input 10 20 30
Result Match 0: 10

With $ anchor:

Regex (findAllMatches) \d+$
Input 10 20 30
Result Match 0: 30

Nice things I could add but probably won't

  • Allow character classes inside []. For example: [\w]+
  • {n}, {n,} and {n,m} quantifiers
  • Backreferences (might be relatively easy if I change the matchers to decide how much input it consumes and giving it access to the memory)
  • Multiple modes (case insensitive, multiline)
  • Unicode escaping

License

Distributed under the MIT License. See LICENSE.md for more information.

About

A NFA regex engine with a web app to visualize the matches

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published