Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect reading time calculation for entries with CJK characters #3820

Conversation

lizyn
Copy link
Contributor

@lizyn lizyn commented Dec 25, 2018

Q A
Bug fix? yes
New feature? no
BC breaks? no
Deprecations? no
Tests pass? yes/no
Documentation no
Translation no
CHANGELOG.md no
License MIT
(Sorry but I don't know how to do a test)

Trying to fix #3821 .

Sentences in CJK don't use explicit delimiters, so previous match scheme will treat sentences like 这本来应该是一句很长很长的话... as a single word, which leads to reading time always being < 1min.

As there is no easy rules to identify words in CJK sentences, it may be a good idea to treat single CJK character as a word for reading time estimation. (And that's how length calculations are usually done for CJK contents.)

In the modified match pattern in file /src/Wallabag/CoreBundle/Tools/Utils.php, \p{xx} stands for a character in:

xx Language
Han Chinese
Hiragana, Katakana Japanese
Hangul Korean

Note: This modification should not influence reading time calculation for entries without CJK content.

@j0k3r j0k3r force-pushed the bugfix/incorrect-calculation-of-CJK-characters-in-reading-time-estimation branch from 2ad598e to 35983eb Compare January 4, 2019 10:24
@j0k3r
Copy link
Member

j0k3r commented Jan 4, 2019

Thanks for that PR @lizyn!
I've improved tests following your changes.

You said your tried to fix the related issue, what left to be done to fix it?

@lizyn
Copy link
Contributor Author

lizyn commented Jan 5, 2019

@j0k3r Thanks for your help!

I'm kind of a newbie and currently don't know how to set a test environment, if you're available, could you please help with this:

Try to bag the article at https://sspai.com/post/47150 and check the reading time.

The characters in this articles, if counted in my modified way, should be around 9061 words, so the reading time would be about 45 min (200 words/min).If that's satisfied, then I think there's nothing left to fix.

(For my current instance, which runs wallabag 2.3.5 through docker-compose, the estimated reading time is 3min. And that's way too short.)

FYI, I counted the words in such ways:

pull contents from wallabag instance through API, and count it in PHP, results:

php > echo count(preg_split('~[^\p{L}\p{N}\']+~u', strip_tags($str)));                                                                                                                                      
780  # the old way
php > echo count(preg_split('~([^\p{L}\p{N}\']+|\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul})~u', strip_tags($str)));
9061 # my modified way

@lizyn
Copy link
Contributor Author

lizyn commented Jan 5, 2019

However, the counted words seems too many in this modified ways.

As I said, there is no easy rules to identify words in CJK sentences, so I was treating a single character as a word in the initial commit, maybe that's too radical.

Now I'm thinking of counting 3 CJK characters together as a word if possible, by count(preg_split('~([^\p{L}\p{N}\']+|(\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}){1,3})~u', strip_tags($str))). Mind the added part of {1,3}. Yet I don't know whether 3 is the proper number, maybe {1,2} works better.

It would be helpful if someone speaks CJK language can help me decide this number. Otherwise, I think I'll stick with 3.

FYI: in this way, the counted words are 3764 for the above article.

EDIT: Sorry I changed my mind and altered the number to 2 because that's usually how chars form words in Chinese. Then the word count should be 5083 and reading time is 25min. I barely changed the code, but the travis build somehow failed again, I'm terribly sorry for the bother.

@lizyn lizyn force-pushed the bugfix/incorrect-calculation-of-CJK-characters-in-reading-time-estimation branch from b6989bd to 0802fa5 Compare January 5, 2019 05:54
@Kdecherf
Copy link
Member

Kdecherf commented Jan 5, 2019

Hello @lizyn, you can update the expected reading time in the samples files according to the result of travis, see https://travis-ci.org/wallabag/wallabag/jobs/475616302#L946-L962

@lizyn lizyn force-pushed the bugfix/incorrect-calculation-of-CJK-characters-in-reading-time-estimation branch from 0802fa5 to 96f33ef Compare January 5, 2019 16:21
@lizyn
Copy link
Contributor Author

lizyn commented Jan 5, 2019

Thanks, I think it's ready for review now, nothing wrong AFAIK, @j0k3r @Kdecherf .

@lizyn lizyn force-pushed the bugfix/incorrect-calculation-of-CJK-characters-in-reading-time-estimation branch from 96f33ef to 7f8630b Compare January 5, 2019 17:22
@Kdecherf Kdecherf requested a review from j0k3r January 5, 2019 21:11
Copy link
Member

@Kdecherf Kdecherf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Copy link
Member

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok!

@j0k3r j0k3r merged commit d2aec70 into wallabag:master Jan 7, 2019
@j0k3r
Copy link
Member

j0k3r commented Jan 7, 2019

Thanks @lizyn

@lizyn lizyn deleted the bugfix/incorrect-calculation-of-CJK-characters-in-reading-time-estimation branch February 24, 2020 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong reading time estimation for articles in CJK
3 participants