Licensed under either Apache-2.0 (LICENSE-APACHE) or MIT (LICENSE-MIT).
Matcher

A high-performance matcher designed to solve AND OR NOT and TEXT VARIATIONS problems in word/word_list matching.

It's helpful for:

  • Content Filtering: Detecting and filtering out offensive or sensitive words.
  • Search Engines: Improving search results by identifying relevant keywords.
  • Text Analysis: Extracting specific information from large volumes of text.
  • Spam Detection: Identifying spam content in emails or messages.
  • ···

Features

For detailed implementation, see the Design Document.

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Transformation:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫草
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
    • PinYin: Convert Chinese characters to Pinyin for fuzzy matching, keeping per-character syllable boundaries. Example: 西安 -> xi an, which matches 洗按 (also xi an) but not a single character pronounced xian
    • PinYinChar: Convert Chinese characters to Pinyin without syllable boundaries. Example: 西安 -> xian, which matches both 洗按 and a single character pronounced xian
  • AND OR NOT Word Matching:
    • Takes the number of repetitions of words into account.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello but not helloo or hhello
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.
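
As a rough illustration of the AND and NOT semantics above, here is a std-only Rust sketch (this is not the crate's actual API; the function name and parsing are invented for the example). `&`-separated words must all appear, with repetition counts honored, while `~`-separated words must be absent:

```rust
use std::collections::HashMap;

/// Illustrative only: the first `~`-segment is an `&`-joined AND
/// expression; every later `~`-segment is a NOT word that vetoes a match.
fn word_match(pattern: &str, text: &str) -> bool {
    let mut parts = pattern.split('~');
    let and_expr = parts.next().unwrap_or("");
    // Any NOT word present in the text vetoes the match.
    if parts.any(|not_word| text.contains(not_word)) {
        return false;
    }
    // Count how many times each AND word is required...
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_expr.split('&') {
        *required.entry(word).or_insert(0) += 1;
    }
    // ...and demand at least that many occurrences in the text.
    required
        .iter()
        .all(|(&word, &need)| text.matches(word).count() >= need)
}

fn main() {
    assert!(word_match("hello&world", "world,hello"));
    assert!(word_match("无&法&无&天", "无无法天")); // 无 occurs twice
    assert!(!word_match("无&法&无&天", "无法天")); // 无 occurs only once
    assert!(word_match("hello~helloo~hhello", "hello"));
    assert!(!word_match("hello~helloo~hhello", "helloo"));
    println!("ok");
}
```

The real matcher additionally applies the text transformations above before matching; this sketch only shows the repetition-aware AND logic and the exemption-style NOT logic.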

Rust Users

See the Rust README.

Python Users

See the Python README.

C, Java and Other Users

We provide a dynamic library to link against. See the C README and Java README.

Build from source

git clone https://github.com/Lips7/Matcher.git
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cd Matcher
cargo build --release

You should then find libmatcher_c.so, libmatcher_c.dylib, or matcher_c.dll in the target/release directory.

Pre-built binary

Visit the release page to download the pre-built binary.

Benchmarks

Please refer to benchmarks for details.

Roadmap

Performance

  • Cache intermediate results across reduce_process_text calls with different ProcessType. (failed: too slow)
  • Try other aho-corasick libraries to improve performance and reduce memory usage.
  • Make aho-corasick unsafe.
  • Optimize NOT logic word-wise.
  • Optimize RegexMatcher using RegexSet.
  • Optimize SimpleMatcher when multiple ProcessType are used.
    1. Consider the case where there are multiple ProcessType:
       • None
       • Fanjian
       • FanjianDelete
       • FanjianDeleteNormalize
       • FanjianNormalize
    2. We can construct a chain of transformations:
       • None -> Fanjian -> Delete -> Normalize
       •                 \-> Normalize
    3. Calculate all possible transformations once and cache the results, so that instead of calculating 8 times (Fanjian; Fanjian + Delete; Fanjian + Delete + Normalize; Fanjian + Normalize), we only need to calculate 4 times (Fanjian, Delete, Normalize, Normalize).
  • Optimize the process matcher when performing reduce text processing.
    1. To perform FanjianDeleteNormalize we currently run Fanjian first, then Delete, then Normalize: three separate process matchers perform the replacements or deletions, so the text has to be scanned 3 times.
    2. What if we construct a single process matcher whose patterns contain all three kinds of patterns (Fanjian, Delete and Normalize)? Then one scan of the text yields every position where a replacement or deletion should be performed.
    3. Byte indices shift after each replacement or deletion, so we need to take the offset changes into account.
  • Merge multiple aho-corasick matchers into one when multiple ProcessType are used.
  • When the dfa feature is disabled, use daachorse to perform text processing.
    • Do not use it for the simple process function; it's too slow to build.
  • Use RegexSet more to optimize the regex matcher.
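
The single-scan idea sketched in the process-matcher item above can be illustrated with a small std-only stand-in for a combined multi-pattern pass (the function and rule set here are invented for illustration, not the crate's API): collect all match positions from one merged rule set, then rebuild the string front-to-back so that length changes from earlier replacements never invalidate later offsets.

```rust
/// Illustrative only: apply a merged set of (pattern -> replacement) rules.
/// Deletions are replacements with "". A real implementation would use one
/// aho-corasick automaton instead of per-pattern `match_indices` calls.
fn process_once(text: &str, rules: &[(&str, &str)]) -> String {
    // Collect (start, end, replacement) for every match of every rule.
    let mut hits: Vec<(usize, usize, &str)> = Vec::new();
    for (pat, rep) in rules {
        for (start, m) in text.match_indices(pat) {
            hits.push((start, start + m.len(), rep));
        }
    }
    hits.sort_by_key(|&(start, _, _)| start);

    // Rebuild front-to-back; `cursor` tracks how far the input has been
    // copied, so no offset correction is needed after edits shrink or
    // grow the output.
    let mut out = String::new();
    let mut cursor = 0;
    for (start, end, rep) in hits {
        if start < cursor {
            continue; // skip overlapping matches
        }
        out.push_str(&text[cursor..start]);
        out.push_str(rep);
        cursor = end;
    }
    out.push_str(&text[cursor..]);
    out
}

fn main() {
    // One pass applies a Fanjian-style rule (蟲 -> 虫), a Delete-style
    // rule (* -> ""), and a Normalize-style rule (Ｈ -> h) together.
    let rules = [("蟲", "虫"), ("*", ""), ("Ｈ", "h")];
    assert_eq!(process_once("蟲*Ｈi", &rules), "虫hi");
    println!("ok");
}
```

The remaining work described in the roadmap item is exactly the part this sketch glosses over: doing the match collection in a single automaton scan and tracking the byte-offset deltas for downstream consumers of the match positions.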

Flexibility

  • Cache get_process_matcher results globally, instead of caching result inside SimpleMatcher.
  • Expose reduce_process_text to Python.
  • Add a new function that can handle a single simple match type.
    • text_process is now available.
  • Add a fuzzy matcher (https://github.com/lotabout/fuzzy-matcher).
    • Use rapidfuzz instead.
  • Make SimpleMatcher and Matcher serializable.
  • Implement NOT logic word-wise.
  • Support stable rust.
  • Support iterator.
  • A real Java package.
  • Build wheels for multiple Python versions.
  • Customizable str conversion map.
  • Add the Matcher process function to Python, C and Java.
  • For the simple matcher, is it possible to replace aho-corasick with regex-automata and also support regex?
  • Add simple match type support to RegexMatcher and SimMatcher to pre-process text.
  • Try to replace msgpack.

Readability

  • More precise and convenient MatchTable.
  • More detailed and rigorous benchmarks.
  • More detailed and rigorous tests.
  • More detailed simple match type explanation.
  • More detailed DESIGN.
  • Write a Chinese README.