Skip to content
/ regexp Public

A command line tool for simplified regular expression matching

License

Notifications You must be signed in to change notification settings

Lai-YT/regexp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

87 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

regexp

CI Status GitHub Issues GitHub Pull Requests License


Regular expression implementation.
Supports . ( ) | * + ?. No escapes.

πŸ“ Table of Contents

🧐 About

regexp is a side project I wrote when learning Compilers. It's a command line tool for simplified regular expression matching which converts the regular expression into an NFA and simulates the NFA using Thompson's algorithm.

🏁 Getting Started

To try out regexp, clone (or download) this repo and compile the source locally.

Prerequisites

regexp uses GNU Make to control the generation of executable and gcc as the C compiler.

Installing

# Get the repo
$ git clone https://github.com/Lai-YT/regexp.git

# Compile to executable with optimizations
$ make release

The executable will be located in the bin/ folder and named regexp.

πŸ”§ Running the tests

regexp uses cmocka for unit-testing and Valgrind for detecting memory management bugs.

  • Run the unit tests
$ make tests
  • Check for memory-related bugs
$ make valgrind
  • Run the command line tests
# Give permission on execution
$ chmod +x cli_test.sh

# Run CLI tests on release (or debug) build
$ ./cli_test.sh --release # --debug

🎈 Usage

Get the help message,

$ bin/regexp -h # or --help
regexp

Usage: regexp [-h] [-V] {-g regexp [-o FILE] | [-c] regexp string}

Description: Regular expression implementation.
Supports . ( ) | * + ?. No escapes.
Compiles to NFA and then simulates NFA using Thompson's algorithm.

One can either match a string (default) or graph the regexp.
See the following options.
Notice that character # can't appear in the regular expression,
it's reserved technically as the special character.

Options:
  -h, --help            Shows this help message and exit
  -V, --version         Shows regexp version and exit

Match mode:
  Matches the string with the regular expression,
  exits with 1 if regexp is ill-formed or it does not match

  -c, --cache           Caches NFA states to build DFA on the fly
  regexp                The regular expression to use on matching
  string                The string to be matched

Graph mode:
  Converts the regular expression into a graph,
  exits with 1 if regexp is ill-formed or the file can't be opened

  -g, --graph           Converts the NFA of the regexp into a Graphviz
                        dot file (default: False)
  -o FILE, --output FILE
                        The name of the dot file.
                        A .dot extension is appended automatically
                        (default: nfa)
  regexp                The regular expression to be converted

Written by: Lai-YT

regexp version: 1.0.2

Example

Match mode

This is the default mode.
regexp takes two arguments: a regular expression and a string to match.

$ bin/regexp '(a|b)*abb' 'bababb'

This exits with 0 if the string is matched by the regular expression or 1 if the regular expression is ill-formed or too long.

You can check the exit code with the following command if you're on an Unix shell.

$ echo $?

Caching the NFA to build a DFA on the fly

"In a sense, Thompson's NFA simulation is executing the equivalent DFA by reconstructing each DFA state as it is needed. Rather than throw away this work after each step, we could cache them, avoiding the cost of repeating the computation in the future and essentially computing the equivalent DFA as it is needed." (Russ Cox, see Acknowledgments)

regexp doesn't cache the states by default.
Set the --cache (or -c) option to enable the caching.

$ bin/regexp -c '(a|b)*abb' 'bababb'

Graph mode

regexp uses Graphviz to graph the NFA of a regular expression. It represents the NFA with the DOT language.

By setting the --graph (or -g) option, you can use the graph mode. This takes a single regular expression as an argument.

  • Convert the NFA of a regular expression into a DOT file,
$ bin/regexp --graph '(a|b)*abb'

By default, the DOT file is written into nfa.dot.

  • You can also specify the output file with the --output (or -o) option, which takes a file as an argument.
$ bin/regexp --graph '(a|b)*abb' --output 'some/path/and/filename'

This will then write the graph into some/path/and/filename.dot.

Note

A .dot extension is always append to the you given.

  • After the DOT file is generated, you can convert it into a PNG image with Graphviz.
$ dot -Tpng nfa.dot -o nfa.png

The NFA of "(a|b)*abb"

Note

The numbering of the states is related to the order of their creations.

See the command line documentation of Graphviz to learn more.

πŸš€ Development

Run regex with debug messages and less optimizations (-O0).

$ make debug

There are several other make targets to use.

$ make help
Target rules:
    debug    - Compiles and generates binary file with
               debug messages and less optimizations
    release  - Compiles and generates optimized binary file
    tests    - Compiles with cmocka and runs test binary file
    valgrind - Runs test binary file using valgrind tool
    fmt      - Formats the source and test files
    tidy     - Checks naming conventions and bug-proneness
    clean    - Cleans the project by removing binaries
    help     - Prints a help message with target rules

Implementation

regex matches strings with regular expressions in 3 steps:

  1. The regular expression is converted into a parenthesis-free postfix notation using the # operator to make concatenations explicit. This is implemented in re2post.c.
  2. The postfixed regular expression is converted into a Nondeterministic Finite Automaton (NFA) using Thompson's algorithm. This step is implemented in post2nfa.c.
  3. Reads in the input string character by character and walks along the NFA. If it stops at the accepting state when the entire string has been read, the string is considered a match. This step is implemented in regexp.c.

By breaking down the process into these 3 steps, regexp is able to efficiently match strings with regular expressions.

Codebase structure

regexp
β”‚   Makefile
β”‚   README.md
β”‚   ...
└───bin
β”‚     regexp
β”‚     regexp_test_runner
└───src
β”‚     main.c
β”‚     regexp.h
β”‚     regexp.c
β”‚     ...
└───test
β”‚     main.c
β”‚     regexp.h
β”‚     regexp.c
β”‚     ...
└───lib
β”‚     main.o
β”‚     regexp.o
β”‚     ...
└───log
      valgrind.log
  • bin/: executables
  • src/: source files
  • test/: test files
  • lib/: object files
  • log/: output message of Valgrind

Note

There aren't any prefix or postfix on the filename of test files.

Code formatting

regexp uses ClangFormat as formatting tool.

Format source and test codes,

$ make fmt

Naming conventions & bug-proneness checking

regexp uses Clang-Tidy as the linter, which checks naming conventions and several bug-proneness.

Check naming conventions and bug-proneness,

$ make tidy

πŸŽ‰ Acknowledgements

I would like to express my gratitude to Russ Cox for his inspiring work, which motivated me to embark on this project. His webpage on regular expression matching has been a valuable reference, demonstrating that simplicity can indeed be fast.
I would also like to thank Alfred Aho at el. for their excellent book on compilers, which has guided me through the implementation of the algorithms used in this project. Without their clear and insightful explanations, this project would not have been possible.

  • Cox, Russ. "Regular Expression Matching Can Be Simple And Fast," https://swtch.com/~rsc/regexp/regexp1.html, 2007
  • Aho, Alfred V., Sethi, Ravi, Ullman, Jeffrey D. "Compilers: Principles, Techniques, and Tools," 2nd ed., Addison-Wesley, 2006, pp. 147-166

✍️ License

This project is licensed under the MIT License. However, if any individual file within the project includes its own copyright claim, that claim should be respected and take precedence over the project-wide license.

Note

They are still guaranteed to be licensed under the MIT License but with different authors.

About

A command line tool for simplified regular expression matching

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published