
We lose frequency information in deduplication #30

Open
jelmervdl opened this issue Jun 20, 2023 · 1 comment

Comments

@jelmervdl
Member

elif prev_hash == line_hash and options.dedup:
    urls1.update(fieldsdict['url1'].split(' '))
    urls2.update(fieldsdict['url2'].split(' '))
    if 'collection' in fieldsdict.keys():
        collections.add(fieldsdict['collection'])

Martin Popel pointed out that if we do it this way, and we have, say, 10,000 pairs of Yes -> Ja in the data and one pair of Yes -> Fuck off, both make it into the TMX as a single entry each. If someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information would be quite helpful.
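A minimal sketch of what keeping that information could look like (the pair data and variable names here are hypothetical, not tmxutil's actual code): count occurrences per (source, target) pair while deduplicating, so a downstream consumer deduplicating on the source side can pick the most frequent translation.

```python
from collections import Counter

# Toy data: 3 copies of the common translation, 1 of the outlier.
pairs = [
    ("Yes", "Ja"),
    ("Yes", "Ja"),
    ("Yes", "Fuck off"),
    ("Yes", "Ja"),
]

# Frequency per unique (source, target) pair; dedup keeps one entry
# per pair but the count survives alongside it.
counts = Counter(pairs)

# Source-side dedup: keep the most frequent target per source sentence.
best = {}
for (src, tgt), n in counts.items():
    if src not in best or n > best[src][1]:
        best[src] = (tgt, n)

print(best["Yes"])  # ('Ja', 3) — the outlier loses to the frequent pair
```

With the counts stored (e.g. as a prop on the tu entry), the decision which pair to keep no longer has to be arbitrary.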

@ZJaume
Member

ZJaume commented Jun 20, 2023

Bifixer still hasn't switched to hashing source and target separately, so at least in the current pipeline, those two sentences would end up in separate tu entries.
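To illustrate why (the hash function here is illustrative, not Bifixer's exact scheme): when the hash covers the whole pair rather than the source side alone, pairs that differ only in the target get distinct hashes and are never collapsed into one entry.

```python
import hashlib

def pair_hash(src: str, tgt: str) -> str:
    # Hash over the concatenated pair, so a different target
    # yields a different hash even for the same source sentence.
    return hashlib.md5(f"{src}\t{tgt}".encode("utf-8")).hexdigest()

h_common = pair_hash("Yes", "Ja")
h_outlier = pair_hash("Yes", "Fuck off")

# Different hashes → the dedup branch never merges them,
# so each pair keeps its own tu entry.
print(h_common != h_outlier)  # True
```

Only once hashing moves to the source side alone does the frequency question from the issue above become pressing.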
