We lose frequency information in deduplication #30
Comments
Bifixer still hasn't switched to hashing source and target separately, so at least in the current pipeline, those two sentences would end up in separate tu entries.
cirrus-scripts/bitextor-buildTMX.py
Lines 180 to 184 in 61765e3
Martin Popel pointed out that if we do it this way, say we have 10,000 pairs of "Yes -> Ja" in the data and one "Yes -> Fuck off", both make it into the TMX with a single entry each. If someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information might be quite helpful.
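To make the request concrete, here is a minimal sketch of deduplication that keeps a count per sentence pair, and of how a downstream user could use that count when deduplicating on the source side. It is not how bitextor-buildTMX.py works; the function names `dedupe_with_counts` and `best_target_per_source` and the in-memory list of pairs are assumptions made for illustration only.

```python
from collections import Counter


def dedupe_with_counts(pairs):
    """Collapse identical (source, target) pairs, remembering how often each occurred.

    Hypothetical helper: `pairs` is any iterable of (source, target) tuples.
    """
    return Counter(pairs)


def best_target_per_source(pair_counts):
    """For each source sentence, keep the target that was seen most often.

    Ties are resolved in favour of the pair encountered first.
    """
    best = {}
    for (source, target), count in pair_counts.items():
        if source not in best or count > best[source][1]:
            best[source] = (target, count)
    return best


# The example from the issue: 10,000 good pairs and one bad one.
pairs = [("Yes", "Ja")] * 10000 + [("Yes", "Fuck off")]

pair_counts = dedupe_with_counts(pairs)       # two distinct pairs survive
best = best_target_per_source(pair_counts)    # frequency decides which to keep
```

Without the count, the two surviving pairs are indistinguishable; with it, the source-side choice becomes straightforward. How the count would be stored in the TMX itself (for example as a property on each tu) is left open here.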