Selected patches from Calibre #245

gsnedders · 2016-05-07T22:36:29Z

This cherry-picks a few things from https://github.com/gsnedders/html5lib-python/commits/calibre-patches, which was a complete set of Calibre's patches from November 2013. https://github.com/kovidgoyal/calibre/commits/master/src/html5lib has very little changed in it since then, primarily a move to 0.999999-dev and a separate downstream fix for 0c551c9.

So, of those on that branch…

gsnedders@49f37d2, gsnedders@64e8b0b, and gsnedders@cc9f28a are all things I'm against landing upstream (they should just be special-cased in the downstream tree builder rather than requiring separate handling to be invoked for all tree builders).
gsnedders@7702d80 is something I find very inelegant, and it goes against the recommendations laid out in PEP 8. Also, with CPython, in the True/False cases it's likely slower, and therefore fails at its stated goal: it results in more bytecode, and POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE special-case the condition being True or False. (Oddly, they don't specialise None, though PyObject_IsTrue does; if that makes any notable performance difference then I'd suggest fixing that in CPython.)
gsnedders@a2d2e05 is unlikely to land because the input stream isn't public API (not that we document what is anywhere…).
gsnedders@1576515 in principle is something we want, but I think we want it for all tokens and not just elements. (See also "Add line/column number in tokens from HTMLTokenizer", #87, and "Add element.sourceline (lxml)", #97.) That or something similar will likely land in future.
gsnedders@124e975 is included here, after 9dca7d8 which fixes the DOM tree builder to follow the tree builder API correctly. (It exposed bugs in the DOM tree builder, yay!)
gsnedders@db0b5a0 is what most of the commits on this PR deal with, fixing up small issues with the Calibre code.
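
A rough way to check the bytecode point about gsnedders@7702d80 on a given interpreter. This is only a sketch: the thread does not say exactly what the patch changed, so it assumes the patch replaced plain truthiness tests with explicit comparisons against True/False. The function names are made up for illustration, and opcode names vary between CPython versions.

```python
import dis


def plain_truthiness(x):
    # Style recommended by PEP 8: let the conditional jump test truthiness.
    if x:
        return 1
    return 0


def explicit_identity(x):
    # Assumed style of the patch: compare against the singleton explicitly.
    if x is True:
        return 1
    return 0


def opnames(func):
    """List the opcode names CPython compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Print both listings side by side; the explicit form carries an extra
    # constant load and comparison before the conditional jump.
    print(opnames(plain_truthiness))
    print(opnames(explicit_identity))
```

Note that the two forms are not equivalent: `explicit_identity(1)` returns 0 while `plain_truthiness(1)` returns 1, so any such rewrite also changes behaviour for truthy values other than True.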

gsnedders · 2016-05-07T22:51:03Z

This seems to have broken 2.6 badly. Huh.

codecov-io · 2016-05-07T23:13:32Z

Current coverage is 89.15%

Merging #245 into master will decrease coverage by 1.39%

@@             master       #245   diff @@
==========================================
  Files            51         50     -1   
  Lines          6817       6726    -91   
  Methods           0          0          
  Messages          0          0          
  Branches       1316       1307     -9   
==========================================
- Hits           6172       5996   -176   
- Misses          485        559    +74   
- Partials        160        171    +11

Powered by Codecov. Last updated by 143b0d4...82b971c

gsnedders · 2016-05-07T23:14:39Z

Oh, right, this is viewkeys (and dictionary views in general) not existing on 2.6.
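
For illustration, one way to compare dictionary keys that also works where views are unavailable. This is a sketch, not the fix that landed in this PR; `key_view` and `same_keys` are hypothetical helper names.

```python
import sys


def key_view(mapping):
    """Return a set-like object holding the mapping's keys."""
    if sys.version_info[0] >= 3:
        return mapping.keys()       # Python 3: keys() is already a view
    if hasattr(mapping, "viewkeys"):
        return mapping.viewkeys()   # Python 2.7: explicit view method
    return frozenset(mapping)       # Python 2.6: no views, so copy the keys


def same_keys(a, b):
    """True if both mappings have exactly the same set of keys."""
    return key_view(a) == key_view(b)
```

On 2.6 the fallback copies the keys, so it loses the no-copy benefit of a view but keeps the same comparison semantics.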

kovidgoyal · 2016-05-08T03:48:05Z

How do you suggest I override the application of attributes for html and body tags in my builder? Without those patches, it would require overriding the entire getPhases() method in html5parser.py. Remember that the problem those patches solve is that there can be multiple <html> and <body> tags in malformed documents, which I need to handle by merging the attributes from those tags into the first <html>/<body> tag. Doing that after parsing would require a whole extra tree traversal, which can be slow for large trees.
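
The attribute merging described here can be sketched as follows. This assumes the rule from the HTML parsing algorithm, where attributes from a repeated tag are added only if the first element does not already have them; `merge_missing_attributes` is a hypothetical name, not an html5lib or Calibre function.

```python
from collections import OrderedDict


def merge_missing_attributes(first_attrs, later_attrs):
    """Copy attributes from a repeated tag onto the first one, in place.

    Attributes already present on the first element win; only names that
    are absent get added, in the order they appear on the later tag.
    """
    for name, value in later_attrs.items():
        if name not in first_attrs:
            first_attrs[name] = value
    return first_attrs


if __name__ == "__main__":
    html_attrs = OrderedDict([("lang", "en")])
    duplicate = OrderedDict([("lang", "fr"), ("class", "x")])
    print(merge_missing_attributes(html_attrs, duplicate))
```

Doing this as each duplicate tag is seen costs time proportional to the attributes on that tag, which is what avoids the extra tree traversal after parsing.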

If you don't want to merge gsnedders/html5lib-python@a2d2e05, then how do you suggest I replace the stream input class? The one in html5lib is too slow. The only alternative I can see is monkey patching, which is less than optimal for obvious reasons.

gsnedders · 2016-05-08T12:44:10Z

@kovidgoyal I'll take a look at dealing with html/body attributes later (I'm literally about to board a plane).

When it comes to the input stream, if it yields good perf increases when given a byte/unicode object we should just specialise them in html5lib.

kovidgoyal · 2016-05-08T12:57:13Z

Sure, you are welcome to take the input stream class from calibre for dealing with unicode objects. It is faster because it avoids wrapping the unicode in StringIO. And it actually implements tracking of positions. For my use case, that is important, since I need line and col numbers.
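
A minimal sketch of the idea, reading straight from the string by index and tracking line and column as characters are consumed. This is not the Calibre class or html5lib's input stream; `StringStream` and its method names are hypothetical.

```python
class StringStream(object):
    """Hypothetical character stream over an in-memory unicode string."""

    def __init__(self, text):
        self.text = text
        self.offset = 0   # index of the next character to return
        self.line = 1     # 1-based line of the next character
        self.col = 0      # 0-based column of the next character

    def char(self):
        """Return the next character, or None at end of input."""
        if self.offset >= len(self.text):
            return None
        c = self.text[self.offset]
        self.offset += 1
        if c == "\n":
            # A newline moves the position to the start of the next line.
            self.line += 1
            self.col = 0
        else:
            self.col += 1
        return c

    def position(self):
        """Return (line, col) of the next unread character."""
        return (self.line, self.col)
```

Indexing the string directly avoids the file-like wrapper, and keeping the counters up to date on every read makes `position()` a constant-time lookup.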

landscape-bot · 2016-05-22T01:15:32Z

Code quality remained the same when pulling 1f04a3f on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

landscape-bot · 2016-05-22T01:39:19Z

Code quality remained the same when pulling 96643a2 on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

landscape-bot · 2016-05-22T01:43:42Z

Code quality remained the same when pulling 76bf242 on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

landscape-bot · 2016-05-22T01:57:57Z

Code quality remained the same when pulling 761f3ab on gsnedders:calibre-selected-1 into 143b0d4 on html5lib:master.

gsnedders modified the milestone: 0.99999999 May 8, 2016

gsnedders added the needs-tests label May 10, 2016

gsnedders force-pushed the calibre-selected-1 branch from 82b971c to 1f04a3f Compare May 22, 2016 01:13

gsnedders and others added 5 commits May 22, 2016 02:52

Make DOM treebuilder's AttrList return a MutableMapping

2812e44

Speed up unnecessarily slow and obtuse dict comparison

29f0512

Clean up the constants imports in html5parser

0a885c6

Preserve attribute order when parsing

a137d14

Add a test for HTMLParser(debug=True)

761f3ab

gsnedders force-pushed the calibre-selected-1 branch from 76bf242 to 761f3ab Compare May 22, 2016 01:55

gsnedders merged commit 5288737 into html5lib:master May 22, 2016

gsnedders deleted the calibre-selected-1 branch May 22, 2016 02:05

eli-schwartz mentioned this pull request Nov 4, 2016

calibre: port to html5lib-python 0.999999999 kovidgoyal/calibre#583

Closed
