Handling of URIs #320
I work with text that may contain URLs. I pre-process documents before feeding them into lingua-rs, using the linkify crate to find URL indices. Finding URLs is a tricky problem in its own right, and there are many ways to do it: linkify returns any string that is valid according to the relevant specs, so there can be false positives. In addition, I validate domain names using the addr crate.
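As a rough illustration of the pre-processing step, here is a minimal sketch of locating URL spans in a text. This is not linkify's actual API; it is a simplified stand-in that only recognizes explicit http(s) schemes and reads to the next whitespace, so it misses the schemeless and bare-domain cases linkify handles.

```rust
// Hypothetical stand-in for a URL finder such as linkify:
// returns (start, end) byte ranges of http(s) URLs by scanning for a
// scheme prefix and reading until the next whitespace character.
fn find_url_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut pos = 0;
    while let Some(off) = text[pos..].find("http") {
        let start = pos + off;
        let rest = &text[start..];
        if rest.starts_with("http://") || rest.starts_with("https://") {
            // URL runs until whitespace or end of input.
            let end = start
                + rest
                    .find(|c: char| c.is_whitespace())
                    .unwrap_or(rest.len());
            spans.push((start, end));
            pos = end;
        } else {
            // "http" appeared mid-word; skip past it and keep scanning.
            pos = start + 4;
        }
    }
    spans
}

fn main() {
    let text = "See https://somerandomwebsite.com/page.html for details.";
    for (s, e) in find_url_spans(text) {
        println!("{}..{} -> {}", s, e, &text[s..e]);
    }
}
```

In practice linkify's tokenizer is considerably more careful (trailing punctuation, brackets, schemeless links), which is exactly why a real crate is worth using here.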
URLs tend to make the language detector switch to English. For example:
Results:
Notice also that the end of the URL, .html, is separated from the beginning of the URL and classified as a language change again. The same also happens if just the domain name somerandomwebsite.com is referenced in the text. Would it be reasonable for the language detector to treat URIs as "language-neutral" stretches of text that perhaps also assume the language of the surrounding text? Treating URIs as atomic would also solve the issue of URIs being split by the language detector.
Note: It is also possible to do this by post-processing the results of Lingua. After receiving the start/end indices of each language segment from Lingua, I apply my URI regular expression to find the start/end indices of URIs and then modify the Lingua results accordingly.
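The post-processing idea above can be sketched as follows. This is an assumption-laden illustration, not Lingua's API: segments are modeled as plain (start, end, language) tuples as one might derive from Lingua's multi-language detection results, and any segment lying entirely inside a detected URI span inherits the language of the preceding segment before adjacent same-language segments are merged.

```rust
// Hypothetical post-processing sketch: `segments` are language spans as
// (start, end, language) tuples; `uri_spans` are (start, end) ranges of
// URIs found separately. Segments fully contained in a URI span are
// relabeled with the previous segment's language, then adjacent
// segments with the same language are merged.
fn absorb_uri_segments(
    segments: &[(usize, usize, &'static str)],
    uri_spans: &[(usize, usize)],
) -> Vec<(usize, usize, &'static str)> {
    let inside_uri =
        |s: usize, e: usize| uri_spans.iter().any(|&(us, ue)| s >= us && e <= ue);
    let mut out: Vec<(usize, usize, &'static str)> = Vec::new();
    for &(s, e, lang) in segments {
        // URI stretches inherit the surrounding (previous) language.
        let lang = match out.last() {
            Some(&(_, _, prev)) if inside_uri(s, e) => prev,
            _ => lang,
        };
        match out.last_mut() {
            // Merge with the previous segment when languages now match.
            Some(last) if last.2 == lang && last.1 == s => last.1 = e,
            _ => out.push((s, e, lang)),
        }
    }
    out
}

fn main() {
    // A German sentence whose embedded URL was misclassified as English.
    let segments = [(0, 10, "German"), (10, 30, "English"), (30, 50, "German")];
    let uris = [(10, 30)];
    println!("{:?}", absorb_uri_segments(&segments, &uris));
}
```

One design question this sketch leaves open is what to do when a URI only partially overlaps a segment boundary; a real implementation would likely split segments at URI edges first.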