A regex engine based on NFA. The project also includes a webpage to visualize the generated NFA and test the results: https://danielbv.github.io/RegexEngine/.
Disclaimer: This project is for educational projects and shouldn't be used in real development.
First match:
const regex = new NFARegex("[a-zA-Z][a-zA-Z0-9_]*");
const match = regex.findFirstMatch(" myVar ");
console.log(match.group(0)); //myVar
All matches:
const regex = new NFARegex("[a-zA-Z][a-zA-Z0-9_]*");
const matches = regex.findAllMatches("Remember Alf? He’s back, in pog form.");
console.log(matches.map(x => x.group(0))); // [ "Remember", "Alf", "He", "s", "back", "in", "pog", "form" ]
If the regex is invalid it'll throw a parsing error.
The tests are built with jest.
npm run test
This project includes a page to test and visualize the regex.
It also has an animation page (/animation.html
) to visualize the NFA in action.
- React
- React-bootstrap
- Monaco - Input editor
- Viz.js - To create the NFA diagrams
npm install
npm start
The main reason I did the engine in Javascript was to be able to create a static site and publish it on github pages. The project has github actions configured to generate the page automatically when it detects a push in the main branch.
To build the page manually, run:
npm run build
I intend to do several blog posts about the implementation of the regex engine. I'll update this with the links when it's ready.
Character classes match a set of characters:
.
: Match any single character (except linebreaks).\d
: Match a single digit.\D
: Match any single character that isn't a digit.\w
: Match any word character ([A-Za-z0-9_]
)\W
: Match any character that isn't a word character ([^A-Za-z0-9_]
)\s
: Matches a single whitespace character\S
: Matches any single non-whitespace character[ab]
: Matchesa
orb
[a-z]
: Matches any single character betweena
andz
.[a-zA-Z_]
: Matches the character_
or any single character that is inside the rangesa-z
andA-Z
[^a-zA-Z_]
: Matches any single character that isn't_
and isn't inside the rangesa-z
andA-Z
Quantifiers specify how many times a part of the regular expression should be repeated.
*
– match zero or more times+
– match one or more times?
– match zero or one time
Quantifiers are greedy by default, so they'll match as much input as they can. To make them lazy add a ?
at the end: *?
, +?
and ??
A regex like
.*(\w+).*
would fail to capture complete words because.*
is greedy. For the inputaaa
the regex(\w+)
would only match onea
. To fix this make both.*
lazy:.*?(\w+).*?
Capturing groups allow to extract the part of the input that matched the group.
To create a capturing group just enclose a subexpression between parenthesis: (subexpression)
Regex | (\w+)@(\w+)\.(\w+) |
Input | [email protected] |
Result |
Group 0: [email protected] Group 1: myemail Group 2: gmail Group 3: com |
The group 0 always contains the full text that matched. To access the value of the groups do match.group(name)
. You can also set a name for
the group instead of a increasing number using (?<name>subexpression)
:
Regex | (?<user>\w+)@(?<domain>\w+)\.(?<extension>\w+) |
Input | [email protected] |
Result |
Group 0: [email protected] Group user: myemail Group domain: gmail Group extension: com |
If you want to group a expression but don't want to capture it, use a non capturing group (?:subexpression)
Regex | (?:a+)(b+) |
Input | aaabbb |
Result |
Group 0: aaabbb Group 1: bbb
|
a|b
: Matchesa
orb
There are multiple characters that have a special meaning. If instead you want to match that especific character you can escape it:
.
: Matches any character (except line breaks)\.
: Matches the dot character
Anchors are a bit weird because they don't match characters. Instead they match positions:
^
: Matches the beginning of the string$
: Matches the end of the string
Note: Some regex engines have multiline modes, in which the meaning of these anchors get expanded. Sadly this engine hasn't that mode :(.
Without anchors:
Regex (findAllMatches) | \d+ |
Input | 10 20 30 |
Result |
Match 0: 10 Match 1: 20 Match 1: 30
|
With ^ anchor:
Regex (findAllMatches) | ^\d+ |
Input | 10 20 30 |
Result |
Match 0: 10
|
With $ anchor:
Regex (findAllMatches) | \d+$ |
Input | 10 20 30 |
Result |
Match 0: 30
|
- Allow character classes inside
[]
. For example:[\w]+
{n}
,{n,}
and{n,m}
quantifiers- Backreferences (might be relatively easy if I change the matchers to decide how much input it consumes and giving it access to the memory)
- Multiple modes (case insensitive, multiline)
- Unicode escaping
Distributed under the MIT License. See LICENSE.md for more information.