-
-
Notifications
You must be signed in to change notification settings - Fork 760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrect reading time calculation for entries with CJK characters #3820
Fix incorrect reading time calculation for entries with CJK characters #3820
Conversation
2ad598e
to
35983eb
Compare
Thanks for that PR @lizyn! You said your tried to fix the related issue, what left to be done to fix it? |
@j0k3r Thanks for your help! I'm kind of a newbie and currently don't know how to set a test environment, if you're available, could you please help with this: Try to bag the article at https://sspai.com/post/47150 and check the reading time. The characters in this articles, if counted in my modified way, should be around (For my current instance, which runs wallabag 2.3.5 through docker-compose, the estimated reading time is 3min. And that's way too short.) FYI, I counted the words in such ways: pull contents from wallabag instance through API, and count it in PHP, results:
|
However, the counted words seems too many in this modified ways. As I said, there is no easy rules to identify words in CJK sentences, so I was treating a single character as a word in the initial commit, maybe that's too radical. Now I'm thinking of counting 3 CJK characters together as a word if possible, by It would be helpful if someone speaks CJK language can help me decide this number. Otherwise, I think I'll stick with
EDIT: Sorry I changed my mind and altered the number to |
b6989bd
to
0802fa5
Compare
Hello @lizyn, you can update the expected reading time in the samples files according to the result of travis, see https://travis-ci.org/wallabag/wallabag/jobs/475616302#L946-L962 |
0802fa5
to
96f33ef
Compare
96f33ef
to
7f8630b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok!
Thanks @lizyn |
Trying to fix #3821 .
Sentences in CJK don't use explicit delimiters, so previous match scheme will treat sentences like
这本来应该是一句很长很长的话...
as a single word, which leads to reading time always being< 1min
.As there is no easy rules to identify words in CJK sentences, it may be a good idea to treat single CJK character as a word for reading time estimation. (And that's how length calculations are usually done for CJK contents.)
In the modified match pattern in file
/src/Wallabag/CoreBundle/Tools/Utils.php
,\p{xx}
stands for a character in:Note: This modification should not influence reading time calculation for entries without CJK content.