Licensed under either Apache-2.0 (LICENSE-APACHE) or MIT (LICENSE-MIT).
Matcher

A high-performance matcher designed to solve AND OR NOT and TEXT VARIATIONS problems in word/word_list matching.

It's helpful for:

  • Content Filtering: Detecting and filtering out offensive or sensitive words.
  • Search Engines: Improving search results by identifying relevant keywords.
  • Text Analysis: Extracting specific information from large volumes of text.
  • Spam Detection: Identifying spam content in emails or messages.
  • ···

Features

For detailed implementation, see the Design Document.

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Transformation:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫草
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
    • PinYin: Convert Chinese characters to Pinyin for fuzzy matching, keeping per-character syllable boundaries. Example: 西安 -> xi an, which matches 洗按 (also xi an) but not a single character pronounced xian
    • PinYinChar: Convert Chinese characters to Pinyin without syllable boundaries. Example: 西安 -> xian, which matches both 洗按 and a single character pronounced xian
  • AND OR NOT Word Matching:
    • Takes the number of repetitions of words into account.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello but not helloo or hhello
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.
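
As a rough illustration of the AND and NOT semantics above, here is a std-only Rust sketch (this is not the crate's actual API; the function name and parsing are invented for the example). `&`-separated words must all appear, with repetition counts honored, while `~`-separated words must be absent:

```rust
use std::collections::HashMap;

/// Illustrative only: the first `~`-segment is an `&`-joined AND
/// expression; every later `~`-segment is a NOT word that vetoes a match.
fn word_match(pattern: &str, text: &str) -> bool {
    let mut parts = pattern.split('~');
    let and_expr = parts.next().unwrap_or("");
    // Any NOT word present in the text vetoes the match.
    if parts.any(|not_word| text.contains(not_word)) {
        return false;
    }
    // Count how many times each AND word is required...
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_expr.split('&') {
        *required.entry(word).or_insert(0) += 1;
    }
    // ...and demand at least that many occurrences in the text.
    required
        .iter()
        .all(|(&word, &need)| text.matches(word).count() >= need)
}

fn main() {
    assert!(word_match("hello&world", "world,hello"));
    assert!(word_match("无&法&无&天", "无无法天")); // 无 occurs twice
    assert!(!word_match("无&法&无&天", "无法天")); // 无 occurs only once
    assert!(word_match("hello~helloo~hhello", "hello"));
    assert!(!word_match("hello~helloo~hhello", "helloo"));
    println!("ok");
}
```

The real matcher additionally applies the text transformations above before matching; this sketch only shows the repetition-aware AND logic and the exemption-style NOT logic.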

Rust Users

See the Rust README.

Python Users

See the Python README.

C, Java and Other Users

We provide a dynamic library to link against. See the C README and Java README.

Build from source

git clone https://github.com/Lips7/Matcher.git
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cd Matcher
cargo build --release

You should then find libmatcher_c.so, libmatcher_c.dylib, or matcher_c.dll in the target/release directory.

Pre-built binary

Visit the release page to download the pre-built binary.

Benchmarks

Please refer to benchmarks for details.

Roadmap

Performance

  • Cache intermediate results across reduce_process_text calls with different ProcessType. (failed: too slow)
  • Try other aho-corasick libraries to improve performance and reduce memory usage.
  • Make aho-corasick unsafe.
  • Optimize NOT logic word-wise.
  • Optimize RegexMatcher using RegexSet.
  • Optimize SimpleMatcher when multiple ProcessType are used.
    1. Consider the case where there are multiple ProcessType:
       • None
       • Fanjian
       • FanjianDelete
       • FanjianDeleteNormalize
       • FanjianNormalize
    2. We can construct a chain of transformations:
       • None -> Fanjian -> Delete -> Normalize
       •                 \-> Normalize
    3. Calculate all possible transformations once and cache the results, so that instead of calculating 8 times (Fanjian; Fanjian + Delete; Fanjian + Delete + Normalize; Fanjian + Normalize), we only need to calculate 4 times (Fanjian, Delete, Normalize, Normalize).
  • Optimize the process matcher when performing reduce text processing.
    1. To perform FanjianDeleteNormalize we currently run Fanjian first, then Delete, then Normalize: three separate process matchers perform the replacements or deletions, so the text has to be scanned 3 times.
    2. What if we construct a single process matcher whose patterns contain all three kinds of patterns (Fanjian, Delete and Normalize)? Then one scan of the text yields every position where a replacement or deletion should be performed.
    3. Byte indices shift after each replacement or deletion, so we need to take the offset changes into account.
  • Merge multiple aho-corasick matchers into one when multiple ProcessType are used.
  • When the dfa feature is disabled, use daachorse to perform text processing.
    • Do not use it for the simple process function; it's too slow to build.
  • Use RegexSet more to optimize the regex matcher.
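
The single-scan idea sketched in the process-matcher item above can be illustrated with a small std-only stand-in for a combined multi-pattern pass (the function and rule set here are invented for illustration, not the crate's API): collect all match positions from one merged rule set, then rebuild the string front-to-back so that length changes from earlier replacements never invalidate later offsets.

```rust
/// Illustrative only: apply a merged set of (pattern -> replacement) rules.
/// Deletions are replacements with "". A real implementation would use one
/// aho-corasick automaton instead of per-pattern `match_indices` calls.
fn process_once(text: &str, rules: &[(&str, &str)]) -> String {
    // Collect (start, end, replacement) for every match of every rule.
    let mut hits: Vec<(usize, usize, &str)> = Vec::new();
    for (pat, rep) in rules {
        for (start, m) in text.match_indices(pat) {
            hits.push((start, start + m.len(), rep));
        }
    }
    hits.sort_by_key(|&(start, _, _)| start);

    // Rebuild front-to-back; `cursor` tracks how far the input has been
    // copied, so no offset correction is needed after edits shrink or
    // grow the output.
    let mut out = String::new();
    let mut cursor = 0;
    for (start, end, rep) in hits {
        if start < cursor {
            continue; // skip overlapping matches
        }
        out.push_str(&text[cursor..start]);
        out.push_str(rep);
        cursor = end;
    }
    out.push_str(&text[cursor..]);
    out
}

fn main() {
    // One pass applies a Fanjian-style rule (蟲 -> 虫), a Delete-style
    // rule (* -> ""), and a Normalize-style rule (Ｈ -> h) together.
    let rules = [("蟲", "虫"), ("*", ""), ("Ｈ", "h")];
    assert_eq!(process_once("蟲*Ｈi", &rules), "虫hi");
    println!("ok");
}
```

The remaining work described in the roadmap item is exactly the part this sketch glosses over: doing the match collection in a single automaton scan and tracking the byte-offset deltas for downstream consumers of the match positions.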

Flexibility

  • Cache get_process_matcher results globally, instead of caching result inside SimpleMatcher.
  • Expose reduce_process_text to Python.
  • Add a new function that can handle a single simple match type.
    • text_process is now available.
  • Add a fuzzy matcher (https://github.com/lotabout/fuzzy-matcher).
    • Use rapidfuzz instead.
  • Make SimpleMatcher and Matcher serializable.
  • Implement NOT logic word-wise.
  • Support stable rust.
  • Support iterator.
  • A real Java package.
  • Build wheels for multiple Python versions.
  • Customizable str conversion map.
  • Add the Matcher process function to Python, C and Java.
  • For the simple matcher, is it possible to replace aho-corasick with regex-automata and also support regex?
  • Add simple match type support to RegexMatcher and SimMatcher to pre-process text.
  • Try to replace msgpack.

Readability

  • More precise and convenient MatchTable.
  • More detailed and rigorous benchmarks.
  • More detailed and rigorous tests.
  • More detailed simple match type explanation.
  • More detailed DESIGN.
  • Write a Chinese README.