html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Requirements

Python 2.6 and above as well as Python 3.0 and above are supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others such as IronPython may or may not work; if you wish to try, you are strongly encouraged to run the test suite and report back!

The only required library dependency is six, which is available on PyPI.

Optionally:

  • datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);
  • genshi has a treewalker (but not builder); and
  • charade can be used as a fallback when character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2.
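
The tree format is chosen through the treebuilder argument of html5lib.parse. As a minimal sketch, the example below uses the "dom" builder, which relies only on xml.dom.minidom from the standard library, as a stand-in; passing "lxml" instead is assumed to work the same way once lxml is installed.

```python
import html5lib

# "dom" selects the xml.dom.minidom tree builder (standard library only).
# Swapping in treebuilder="lxml" requires lxml to be installed.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# The result is a minidom Document, so the usual DOM methods apply.
paragraphs = dom_document.getElementsByTagName("p")
paragraph_text = paragraphs[0].firstChild.data
print(paragraph_text)
```

The tree builder changes only the type of the returned tree; the parsing itself is the same.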

Installation

html5lib is packaged with distutils. To install it use:

$ python setup.py install

Usage

Simple usage follows this pattern:

import html5lib
# Open in binary mode so that html5lib can detect the character encoding itself
with open("mydocument.html", "rb") as fp:
    document = html5lib.parse(fp)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")
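
With the default tree builder, the value returned by parse is an xml.etree.ElementTree element for the root html element. The following is a rough sketch of inspecting it; namespaceHTMLElements=False is used here only to keep the tag names short, and the variable names are illustrative.

```python
import html5lib

# Without namespaceHTMLElements=False, tags carry the XHTML namespace,
# e.g. "{http://www.w3.org/1999/xhtml}p" instead of "p".
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

# The parser supplies the missing <html>, <head> and <body> elements,
# as a browser would.
child_tags = [child.tag for child in document]
paragraph = document.find("./body/p")

print(document.tag)
print(child_tags)
print(paragraph.text)
```

Note that even for a short fragment like this one, the result is a complete document tree.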

More documentation is available in the docstrings.

Bugs

Please report any bugs on the issue tracker.

Tests

The tests live in the html5lib-tests repository and are included as a git submodule, so for git checkouts the submodule must be initialized first (this step is unnecessary for release tarballs):

$ git submodule init
$ git submodule update

With nose installed, the tests can then be run using the nosetests command in the root directory. All should pass.

Contributing

Pull requests are more than welcome, both to the library and to the documentation. Some useful information:

  • We aim to follow PEP 8 in the library, except for the 79-character line limit: we use a soft limit of 99 characters instead, and allow longer lines where that is the more readable choice.
  • We keep pyflakes reporting no errors or warnings at all times.
  • We keep the master branch passing all tests at all times on all supported versions.

Travis CI is run against all pull requests and should enforce all of the above.

We also use an external code-review tool, which uses your GitHub login to authenticate. You'll get emails for changes on the review.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may have more success (and get a far quicker response) asking on IRC in #whatwg on irc.freenode.net.