bugfix/clean-pictograms-from-transcripts-before-indexing #165

Merged

evamaxfield
Member

Link to Relevant Issue

This pull request resolves #151

Description of Changes

Finally found time to fix this one. Can't believe the bug either.

Looks like a transcript from Seattle has a pictogram / emoticon in it...
link -- start at 30:09 -- or search for

This fixes the pipeline by adding a function to clean all common pictograms / emojis from the sentence before stemming and fuzzy matching for context spans.
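For readers who want a concrete picture of that cleaning step, here is a minimal sketch. It is not the code merged in this PR: the name `strip_pictograms` and the exact Unicode ranges are assumptions chosen for illustration, and the ranges only cover the most common emoji blocks.

```python
import re

# Hypothetical pattern (not taken from cdp_backend): a character class
# covering common emoji / pictograph Unicode blocks.
_PICTOGRAM_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero width joiner (used in emoji sequences)
    "]+"
)


def strip_pictograms(text: str) -> str:
    """Remove common pictograms / emojis and collapse leftover whitespace."""
    # Replace each run of pictograms with a space so neighboring words
    # do not get glued together, then normalize the whitespace.
    cleaned = _PICTOGRAM_PATTERN.sub(" ", text)
    return " ".join(cleaned.split())


print(strip_pictograms("thank you 😂 council members"))
# → thank you council members
```

Running something like this on each sentence before stemming keeps the tokens used for fuzzy matching aligned with the plain-text transcript. One caveat of this sketch: it also strips the zero width joiner, which some non-emoji scripts rely on, so a production version may want to be more selective.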

Tested by running the pipeline and storing the index locally:
run_cdp_event_index -n 1 --store_local --parallel ../configs-and-special-events/seattle.json

@evamaxfield evamaxfield added the bug Something isn't working label Feb 17, 2022
@evamaxfield evamaxfield self-assigned this Feb 17, 2022

codecov bot commented Feb 17, 2022

Codecov Report

Merging #165 (b5babdc) into main (ede007f) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #165      +/-   ##
==========================================
- Coverage   94.56%   94.49%   -0.08%     
==========================================
  Files          50       50              
  Lines        2558     2560       +2     
==========================================
  Hits         2419     2419              
- Misses        139      141       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| cdp_backend/pipeline/event_index_pipeline.py | 85.71% <ø> (ø) |
| cdp_backend/tests/utils/test_string_utils.py | 100.00% <100.00%> (ø) |
| cdp_backend/utils/string_utils.py | 81.39% <100.00%> (-3.98%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@isaacna isaacna left a comment

Looks good to me! Honestly pretty surprised that Google speech-to-text generated an emoji haha

@evamaxfield
Member Author

> Looks good to me! Honestly pretty surprised that Google speech-to-text generated an emoji haha

I think this transcript is from a converted closed caption, which makes a bit more sense 😂

Collaborator

@tohuynh tohuynh left a comment

Looks good

Review thread on cdp_backend/utils/string_utils.py (outdated, resolved)
@evamaxfield evamaxfield merged commit 203fec3 into main Feb 17, 2022
@evamaxfield evamaxfield deleted the bugfix/clean-pictograms-from-transcripts-before-indexing branch February 17, 2022 18:57
Successfully merging this pull request may close these issues.

Context span selection fails during index creation