Extract variable-length MWE using a user-defined POS regex pattern. #65
Resolves #19 and closes #42
Problem
Currently, only two-word MWEs are supported by the `MWE` class.

Objective
Enable the extraction of higher-order n-grams.

Results
- `HigherOrderMWEExtractor`, with the workhorse method `extract_higher_order_mwes()`, for extracting variable-length MWEs given a user-defined pattern of POS tags.

Considerations for Future Work
- `HigherOrderMWEExtractor` was designed to be decoupled from the rest of the dataset, unlike the class `MWE`, which is coupled with the entire corpus via its `df` attribute. `HigherOrderMWEExtractor` operates atomically on a single input sentence and a specific pattern for that input. This allows for vectorising using different patterns.
- [...] `MWE` class, as the latter is significantly broader than just a `MWE`. `MWE` could be renamed to `MWEExtractor` (although this is still too generic), and `HigherOrderMWEExtractor` could be renamed to `VariableLengthMWEExtractor`, `POSPatternMWEExtractor`, etc.
- [...] `HigherOrderMWEExtractor.extract_higher_order_mwes()`, as it uses `nltk.RegexpParser`, i.e. regex pattern matching, which could have efficiency implications.
- [...] `python3.9` (as I'm only using 3.10), but for some reason mypy would not let me run pre-commit hooks without also committing the `.pre-commit-config.yaml`. It feels like not committing this file should be whitelisted by mypy, but I could be wrong.
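
For readers unfamiliar with POS-pattern extraction, here is a minimal sketch of the idea behind matching a user-defined tag pattern against a single tagged sentence. It is not the code in this PR: it uses only the standard-library `re` module as a stand-in for `nltk.RegexpParser`, and the names `compile_tag_pattern`, `extract_mwes` and `min_words` are hypothetical. It assumes the sentence is already POS-tagged and that the pattern is written as a sequence of `<TAG>` units with regex quantifiers.

```python
import re
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]  # [(word, POS tag), ...]


def compile_tag_pattern(pattern: str) -> re.Pattern:
    """Turn a pattern such as "<JJ>*<NN.*>+" into a regex over a tag string.

    Hypothetical helper: each <TAG> unit becomes a group matching one
    space-terminated tag, and "." is restricted to non-space characters so
    that a wildcard cannot run across tag boundaries.
    """
    body = pattern.replace(".", r"\S").replace("<", "(?:").replace(">", " )")
    # The lookbehind forces every match to start at the beginning of a tag.
    return re.compile(rf"(?<!\S)(?:{body})")


def extract_mwes(tagged: TaggedSentence, pattern: str, min_words: int = 2) -> List[str]:
    """Return the word spans whose tag sequence matches `pattern`."""
    # Encode the tags as one string, e.g. "DT JJ NN ", so a regex can scan it.
    tag_string = "".join(f"{tag} " for _, tag in tagged)

    # Map the character offset where each tag starts to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 1
    offset_to_index[offset] = len(tagged)  # offset just past the last tag

    mwes = []
    for match in compile_tag_pattern(pattern).finditer(tag_string):
        start = offset_to_index.get(match.start())
        end = offset_to_index.get(match.end())
        if start is None or end is None or end - start < min_words:
            continue  # skip empty, misaligned, or too-short matches
        mwes.append(" ".join(word for word, _ in tagged[start:end]))
    return mwes


sentence = [
    ("the", "DT"), ("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
    ("toolkit", "NN"), ("is", "VBZ"), ("open", "JJ"), ("source", "NN"),
]
print(extract_mwes(sentence, "<JJ>*<NN.*>+"))
# → ['natural language processing toolkit', 'open source']
```

Because the function takes one sentence and one pattern, it can be applied row by row with a different pattern per sentence, which is the decoupling described in the first consideration above.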