Feature/bitap pattern matching #458

Merged
merged 4 commits into from
Jul 25, 2024

Conversation

Kalkwst
Contributor

@Kalkwst Kalkwst commented Jul 25, 2024

This PR introduces the Bitap algorithm, a powerful string-matching technique renowned for its efficiency in finding exact patterns within text. Bitap leverages bitwise operations to achieve rapid matching, making it particularly well-suited for scenarios where speed is crucial, such as large-scale text processing, search engines, and bioinformatics applications.

The implemented functions, `FindExactPattern` and `FindFuzzyPattern`, provide exact and fuzzy matching, respectively, using this algorithm.

Motivation

  • Speed Optimization: Bitap offers a substantial performance improvement over naïve string search approaches, especially when dealing with longer patterns or large datasets.
  • Versatility: While primarily designed for exact pattern matching, Bitap can be extended to handle fuzzy matching with minor modifications.
  • Reduced Complexity: Bitap's elegant bit-parallel approach can simplify code compared to other string-matching algorithms.

How Bitap Works (Simplified)

  • Pattern Preprocessing: A bitmask is created for each character of the pattern, with one bit per pattern position marking where that character occurs.
  • Text Scanning: The algorithm iterates over the text, maintaining a "state" register. Bitwise operations are used to update this state based on the current character.
  • Match Detection: When the state register indicates a match, the corresponding index is returned.
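
The three steps above can be sketched in a few lines. This is a minimal illustration in Python of the shift-and variant of Bitap, not the code from this PR: the name `find_exact_pattern`, its signature, and the convention of returning -1 when nothing is found are assumptions made for the example.

```python
def find_exact_pattern(text: str, pattern: str) -> int:
    """Return the index of the first exact match of pattern in text, or -1.

    Hypothetical helper for illustration (shift-and formulation of Bitap).
    """
    m = len(pattern)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0

    # Pattern preprocessing: for each character, set bit i if the
    # character occurs at position i of the pattern.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    match_bit = 1 << (m - 1)  # set when the whole pattern has matched
    state = 0                 # bit i set: pattern[0..i] ends at the current char

    # Text scanning: shift the state, admit a new candidate start (| 1),
    # and keep only the prefixes that the current character extends.
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        # Match detection: the highest pattern bit is set.
        if state & match_bit:
            return j - m + 1
    return -1


print(find_exact_pattern("hello world", "world"))  # 6
```

Python integers have arbitrary precision, so this sketch places no limit on pattern length; an implementation that keeps the state in a fixed-width integer would be limited to patterns no longer than that width.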

Pros

  • Exceptional Speed: Outperforms naïve string search in most cases.
  • Compact Implementation: Relatively straightforward to code compared to some other algorithms.
  • Low Memory Usage: Bitap primarily relies on bitwise operations, using minimal memory.

Cons

  • Primarily an Exact Matcher: Fuzzy matching extensions exist, but they can be less efficient than dedicated fuzzy search algorithms.
  • Preprocessing Overhead: Bitap requires a preprocessing step to create the pattern mask.
  • Character Set Limitations: The classic Bitap algorithm assumes a fixed character set (e.g., ASCII).
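
To make the fuzzy matching extension mentioned above concrete, here is a hedged sketch in Python that tolerates up to `max_errors` substituted characters (Hamming distance) by keeping one state register per error count. It is not the code from this PR: the name `find_fuzzy_pattern`, the `max_errors` parameter, and the restriction to substitutions (no insertions or deletions) are assumptions made for the example. Storing the masks in a dictionary also sidesteps the fixed character set assumption, at the cost of a hash lookup per character.

```python
def find_fuzzy_pattern(text: str, pattern: str, max_errors: int) -> int:
    """Return the start of the first window of len(pattern) characters that
    differs from pattern in at most max_errors positions, or -1.

    Hypothetical helper for illustration; substitutions only.
    """
    if max_errors < 0:
        raise ValueError("max_errors must be non-negative")
    m = len(pattern)
    if m == 0:
        return 0  # assumption: an empty pattern matches at index 0

    # Same preprocessing as exact Bitap: bit i marks pattern position i.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1       # keeps every register within m bits
    match_bit = 1 << (m - 1)
    # states[d]: bit i set means pattern[0..i] ends here with <= d mismatches.
    states = [0] * (max_errors + 1)

    for j, ch in enumerate(text):
        char_mask = masks.get(ch, 0)
        prev_lower = states[0]  # value of states[d - 1] before this step
        states[0] = ((states[0] << 1) | 1) & char_mask
        for d in range(1, max_errors + 1):
            old = states[d]
            # Either the character matches and the error count is unchanged...
            extended = ((old << 1) | 1) & char_mask
            # ...or it is treated as a substitution, spending one error.
            substituted = ((prev_lower << 1) | 1) & full
            states[d] = extended | substituted
            prev_lower = old
        if states[max_errors] & match_bit:
            return j - m + 1
    return -1


print(find_fuzzy_pattern("hello world", "wurld", 1))  # 6
```

Each text character costs one pass over `max_errors + 1` registers, which is why the running time grows with the number of tolerated errors.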

Use Cases

  • Text Editors: Implementing fast search functionality.
  • Search Engines: Efficiently indexing and retrieving text data.
  • Bioinformatics: Quickly identifying patterns in DNA or protein sequences.
  • Network Intrusion Detection: Scanning network traffic for malicious patterns.

See also

Bitap Algorithm - Wikipedia


  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Comments in areas I changed are up to date
  • I have added comments to hard-to-understand areas of my code
  • I have made corresponding changes to the README.md

This commit introduces an implementation of the Bitap algorithm, a powerful string searching method renowned for its ability to perform both exact and fuzzy (approximate) pattern matching.
@Kalkwst Kalkwst requested a review from siriak as a code owner July 25, 2024 13:01

codecov bot commented Jul 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.88%. Comparing base (327af3d) to head (7ab8ba1).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #458      +/-   ##
==========================================
+ Coverage   94.84%   94.88%   +0.03%     
==========================================
  Files         234      235       +1     
  Lines        9897     9967      +70     
  Branches     1393     1408      +15     
==========================================
+ Hits         9387     9457      +70     
  Misses        392      392              
  Partials      118      118              


@Kalkwst Kalkwst force-pushed the feature/Bitap-Pattern-Matching branch from 2801c9f to bc85a0c Compare July 25, 2024 14:11
@Kalkwst Kalkwst force-pushed the feature/Bitap-Pattern-Matching branch from bc85a0c to 7ab8ba1 Compare July 25, 2024 20:05
Member

@siriak siriak left a comment

Looks good, thanks a lot for contributing this algorithm!

@siriak siriak merged commit b0838cb into TheAlgorithms:master Jul 25, 2024
4 checks passed
@TheAlgorithms TheAlgorithms deleted a comment from Esk-C Jul 25, 2024
@Kalkwst Kalkwst deleted the feature/Bitap-Pattern-Matching branch July 26, 2024 07:42
2 participants