Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selected patches from Calibre #245

Merged
merged 5 commits into from
May 22, 2016

Conversation

gsnedders
Copy link
Member

See #119. CC @kovidgoyal.

This cherry-picks a few things from https://github.com/gsnedders/html5lib-python/commits/calibre-patches, which was a complete set of Calibre's patches from November 2013. https://github.com/kovidgoyal/calibre/commits/master/src/html5lib has very little changed in it since then, primarily a move to 0.999999-dev and a separate downstream fix for 0c551c9.

So, of those on that branch…

  • gsnedders@49f37d2, gsnedders@64e8b0b, and gsnedders@cc9f28a are all things I'm against landing upstream (they should just be special-cased in the downstream tree builder rather than requiring invoking separate handling for all tree builders).
  • gsnedders@7702d80 is something I find very inelegant, and goes against the recommendations laid out in PEP 8. Also, with CPython, in the True/False cases it's likely slower, therefore failing at its stated goal, as it results in more byte code and POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE special-case the condition being True or False (oddly, they don't specialise None, though it is in PyObject_IsTrue; if that makes any notable performance difference then I'd suggest fixing that in CPython).
  • gsnedders@a2d2e05 is unlikely to land because the input stream isn't public API (not that we document what is anywhere…).
  • gsnedders@1576515 in principle is something we want, but I think we want for all tokens and not just elements. (See also Add line/column number in tokens from HTMLTokenizer #87 and Add element.sourceline (lxml) #97.) That or something similar will likely land in future.
  • gsnedders@124e975 is included here, after 9dca7d8 which fixes the DOM tree builder to follow the tree builder API correctly. (It exposed bugs in the DOM tree builder, yay!)
  • gsnedders@db0b5a0 is what most of the commits on this PR deal with, fixing up small issues with the Calibre code.

@gsnedders
Copy link
Member Author

This seems to have broken 2.6 badly. Huh.

@codecov-io
Copy link

codecov-io commented May 7, 2016

Current coverage is 89.15%

Merging #245 into master will decrease coverage by 1.39%

@@             master       #245   diff @@
==========================================
  Files            51         50     -1   
  Lines          6817       6726    -91   
  Methods           0          0          
  Messages          0          0          
  Branches       1316       1307     -9   
==========================================
- Hits           6172       5996   -176   
- Misses          485        559    +74   
- Partials        160        171    +11   

Powered by Codecov. Last updated by 143b0d4...82b971c

@gsnedders
Copy link
Member Author

Oh, right, this is viewkeys (and dictionary views in general) not existing on 2.6.

@gsnedders gsnedders modified the milestone: 0.99999999 May 8, 2016
@kovidgoyal
Copy link
Contributor

kovidgoyal commented May 8, 2016

How do you suggest I override the application of attributes for html and body tags in my builder? Since without those patches, it would require overriding the entire getPhases() method in html5parser.py Remember that the problem those patches is solving is that there can be multiple <html> and <body> tags in malformed documents that I need to handle by merging the attributes from those tags into the first / tag. Doing that after parsing would require an whole extra tree traversal which can be slow for large trees.

If you dont want to merge gsnedders/html5lib-python@a2d2e05 then how do you suggest I replace the the stream input class? The one in html5lib is too slow. The only alternative I can see is monkey patching -- which is less than optimal for obvious reasons.

@gsnedders
Copy link
Member Author

@kovidgoyal I'll take a look at dealing with html/body attributes later (I'm literally amount to board a plane).

When it comes to the input stream, if it yields good perf increases when given a byte/unicode object we should just specialise them in html5lib.

@kovidgoyal
Copy link
Contributor

Sure you are welcome to take the input stream class from calibre for dealing with unicode objects. It is faster because it avoids wrapping the unicode in StringIO. And it actually implements tracking of positions. For my use case, that is important, since I need line and col numbers.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 1f04a3f on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 96643a2 on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 76bf242 on gsnedders:calibre-selected-1 into cc99095 on html5lib:master.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 761f3ab on gsnedders:calibre-selected-1 into 143b0d4 on html5lib:master.

@gsnedders gsnedders merged commit 5288737 into html5lib:master May 22, 2016
@gsnedders gsnedders deleted the calibre-selected-1 branch May 22, 2016 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants