Improvements to dictionary matching #180

frreiss · 2021-03-10T00:01:19Z

This PR includes some fixes for issues with our dictionary matching that I encountered while working on a market intelligence use case for a blog post.

The specific problems addressed here are:

- There was no API for creating a dictionary from an in-memory list, so I've added a new function create_dict().
- SpaCy's default set of tokenizers tends to overthink how it treats punctuation, so dictionary entries can get tokenized differently from the text, which breaks dictionary matching. I've added a function simple_tokenizer() that returns a tokenizer that splits on every chunk of whitespace and on every punctuation character. Dictionary creation now uses that tokenizer by default.
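
To illustrate why the simpler tokenization rule helps, here is a minimal standalone sketch in plain Python. It does not use the library's create_dict() or simple_tokenizer(); the names `simple_tokenize`, `build_dict`, and `match_dict` are hypothetical, and the regex is only an assumption about what "split on whitespace and on every punctuation character" could look like. The point is that entries and text go through the same rule, so they always tokenize identically.

```python
import re

# Hypothetical stand-in for the tokenization rule described above; NOT the
# library's simple_tokenizer(). A token is either a run of letters/digits
# or a single punctuation character. Whitespace only separates tokens.
_TOKEN_RE = re.compile(r"[^\W_]+|[^\s\w]|_")


def simple_tokenize(text):
    """Return the list of token strings in `text`."""
    return _TOKEN_RE.findall(text)


def build_dict(entries):
    """Build an in-memory dictionary: a set of lowercased token tuples."""
    return {tuple(simple_tokenize(entry.lower())) for entry in entries}


def match_dict(text, dictionary):
    """Return (begin, end) character offsets of every dictionary match."""
    # Tokenize the text with the same rule used for the entries.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in _TOKEN_RE.finditer(text)]
    max_len = max((len(entry) for entry in dictionary), default=0)
    matches = []
    for i in range(len(tokens)):
        # Try every window length up to the longest dictionary entry.
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            window = tuple(tok for tok, _, _ in tokens[i:i + n])
            if window in dictionary:
                matches.append((tokens[i][1], tokens[i + n - 1][2]))
    return matches


companies = build_dict(["Yahoo! Inc.", "AT&T"])
text = "We met with Yahoo! Inc. today"
spans = match_dict(text, companies)
matched_strings = [text[b:e] for b, e in spans]
```

Because punctuation always becomes its own token, an entry such as "Yahoo! Inc." matches regardless of how a linguistically motivated tokenizer would have grouped the "!" or the trailing period.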

I also fixed a minor bug in the handling of the warnings element of responses from Watson Natural Language Understanding.

frreiss · 2021-03-24T22:04:48Z

@BryanCutler would you mind giving these changes a quick review?

Thanks!

BryanCutler · 2021-03-25T23:15:59Z

Sorry, slipped under the radar. Looking at it now.

BryanCutler

Just a couple nit questions, otherwise LGTM

text_extensions_for_pandas/spanner/extract.py

frreiss · 2021-03-26T20:03:40Z

Thanks for the review! Pushed some corrections. Will merge once this branch passes tests.

frreiss added 2 commits March 9, 2021 15:52

Fix tokenization problems in dictionary APIs and add online dict creation

9db7f1c

Merge branch 'master' of https://github.com/CODAIT/text-extensions-for-pandas into branch-dict

98fcb14

frreiss requested a review from BryanCutler March 10, 2021 00:01

BryanCutler approved these changes Mar 25, 2021

text_extensions_for_pandas/spanner/extract.py (resolved)

text_extensions_for_pandas/spanner/extract.py (outdated, resolved)

Fix typo and reformat with black

d6fac50

frreiss merged commit fdf40e0 into CODAIT:master Mar 26, 2021

frreiss deleted the branch-dict branch October 29, 2021 20:17
