spaCyEx is an extension for spaCy designed to make pattern matching as flexible and easy as using regular expressions. It builds on spaCy's Matcher, adding a more accessible syntax for defining complex patterns. spaCyEx lets you write intuitive, detailed pattern specifications, which makes it well suited to extracting fine-grained linguistic features from text.
You can install spaCyEx via pip:
pip install spacyex
- Dynamic Pattern Creation: Create complex token matching patterns using a simple string-based syntax.
- Integration with spaCy: Leverage spaCy's Matcher capabilities to find sequences in text that match defined patterns.
- Customizable Matching Rules: Define token attributes including text characteristics, lexical attributes, and grammatical properties.
Define patterns using a string syntax where each token and its attributes are encapsulated by parentheses. Token attributes are specified by key-value pairs, separated by an equals sign (=), and multiple attributes are divided by a pipe (|).
- Single Attribute: (pos=NOUN)
- Multiple Attributes: (pos=NOUN|lemma=run)
- Using List Values: (lemma=in[run,walk])
- Using Operators: (ent_type=person|op={2,3})
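spaCy's Matcher takes token patterns as lists of dictionaries with uppercase keys such as POS, LEMMA, and OP, and uses IN for set membership. The sketch below shows one way the string syntax above could be translated into that form. It is an illustration only, not spaCyEx's actual implementation: `parse_pattern` and `parse_value` are hypothetical names, and the mapping rules (uppercasing keys, turning `in[...]` into an `IN` set) are assumptions inferred from the examples.

```python
import re

# Hypothetical helpers, not part of spaCyEx: they illustrate how the string
# syntax might map onto the list-of-dicts patterns used by spaCy's Matcher.
TOKEN_RE = re.compile(r"\(([^()]*)\)")  # one parenthesised token spec


def parse_value(raw):
    """Turn 'in[run,walk]' into {'IN': ['run', 'walk']}; leave other values as strings."""
    m = re.fullmatch(r"in\[(.*)\]", raw)
    if m:
        return {"IN": [item.strip() for item in m.group(1).split(",")]}
    return raw


def parse_pattern(pattern):
    """Convert a pattern string into a list of token dictionaries."""
    tokens = []
    for body in TOKEN_RE.findall(pattern):
        token = {}
        if body.strip():
            # Attributes are separated by '|', keys and values by the first '='.
            for pair in body.split("|"):
                key, _, value = pair.partition("=")
                token[key.strip().upper()] = parse_value(value.strip())
        tokens.append(token)  # an empty dict would act as a wildcard token
    return tokens


if __name__ == "__main__":
    print(parse_pattern("(pos=NOUN|lemma=run) (lemma=in[run,walk])"))
```

Under these assumptions, values are passed through unchanged, so `(ent_type=person|op={2,3})` becomes a single token dictionary with `ENT_TYPE` set to `person` and `OP` set to `{2,3}`. Any normalisation the library applies to values is not modelled here.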
Once a pattern is defined, it can be used to search text for matches.
Here is a simple example to get started with spaCyEx:
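import spacy
import spacyex as se

# Load a spaCy pipeline that provides entities, lemmas, and POS tags
nlp = spacy.load("en_core_web_sm")

text = "John Smith runs fast, but Jacob Smith walks slowly."

# Two person-entity tokens, then a form of "run" or "walk", then an adverb
pattern = "(ent_type=person|op={2}) (lemma=in[run,walk]) (pos=ADV)"

results = se.search(pattern, text, nlp)

# Each result is indexed as (matched text object, start, end)
for match in results:
    print(match[0].text, "Start:", match[1], "End:", match[2])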
This code will match sequences in the text based on the defined pattern, using named entities, lemmas, and parts of speech.
- Support for all dictionary properties in patterns.
- Additional utilities and helper functions for more complex pattern scenarios.