Update and expand "moving parts" doc
twm committed Apr 15, 2017
1 parent c8fca0e commit 637826f
Showing 1 changed file with 31 additions and 34 deletions.
65 changes: 31 additions & 34 deletions doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
html5lib consists of a number of components, which are responsible for
handling its features.

Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
Several tree representations are supported, as are translations to other formats via *tree adapters*.
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
which can be found in the standard library. Whenever possible, the
accelerated ``ElementTree`` implementation (i.e.
``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
API. The performance gains are relatively small compared to using the
accelerated ``ElementTree`` module.

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword attribute:
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  TreeBuilder = html5lib.getTreeBuilder("dom")
  parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
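
To make the idea of a "streaming view" concrete, here is a minimal sketch that walks a standard-library ``ElementTree`` element and yields one token per start tag, text run, and end tag. It is not html5lib's walker: the ``walk`` function and the simplified token dictionaries are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield simplified tokens for an ElementTree element, depth first.

    Hypothetical helper; the token shape is a simplification and does not
    claim to match html5lib's real tokens field for field.
    """
    yield {"type": "StartTag", "name": element.tag, "data": dict(element.attrib)}
    if element.text:
        yield {"type": "Characters", "data": element.text}
    for child in element:
        yield from walk(child)
        # Text that follows a child element is stored on the child as ``tail``.
        if child.tail:
            yield {"type": "Characters", "data": child.tail}
    yield {"type": "EndTag", "name": element.tag}


root = ET.fromstring('<p class="greeting">Hello <b>World</b>!</p>')
tokens = list(walk(root))
```

Because ``walk`` is a generator, a consumer can process tokens one at a time without holding a second copy of the document in memory.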

Walkers make consuming HTML easier. html5lib uses them to provide you
with has a couple of handy tools.
html5lib provides a few tools for consuming token streams:

* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
* filters, to manipulate the token stream.

HTMLSerializer
~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
'>'
'Witam wszystkich'
You can customize the serializer behaviour in a variety of ways, consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation.
You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:
html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
``LintError`` exceptions on invalid tag and attribute names, invalid
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
removes tags from the stream which are not necessary to produce valid
removes tags from the token stream which are not necessary to produce valid
HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
collapses all whitespace characters to single spaces unless they're in
``<pre/>`` or ``textarea`` tags.
``<pre/>`` or ``<textarea/>`` tags.
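
Conceptually, a filter is an iterable that wraps another token stream and yields a modified one. The sketch below illustrates that pattern with a toy whitespace-collapsing filter. It is not html5lib's ``whitespace.Filter``: the ``CollapseWhitespace`` class and the simplified token dictionaries are hypothetical and exist only to show the wrapping idea.

```python
import re

# Elements whose text is passed through untouched (assumption for this sketch).
PRESERVE = frozenset(["pre", "textarea"])


class CollapseWhitespace:
    """Toy filter: collapse whitespace runs outside ``pre`` and ``textarea``."""

    def __init__(self, source):
        self.source = source  # any iterable of token dictionaries

    def __iter__(self):
        depth = 0  # how many preserved elements are currently open
        for token in self.source:
            kind = token["type"]
            if kind == "StartTag" and token["name"] in PRESERVE:
                depth += 1
            elif kind == "EndTag" and token["name"] in PRESERVE and depth:
                depth -= 1
            elif kind == "Characters" and not depth:
                # Copy the token rather than mutating the caller's data.
                token = dict(token, data=re.sub(r"\s+", " ", token["data"]))
            yield token


stream = [
    {"type": "StartTag", "name": "p", "data": {}},
    {"type": "Characters", "data": "Hello   \n World"},
    {"type": "EndTag", "name": "p"},
    {"type": "StartTag", "name": "pre", "data": {}},
    {"type": "Characters", "data": "a   b"},
    {"type": "EndTag", "name": "pre"},
]
filtered = list(CollapseWhitespace(stream))
```

Since a filter both consumes and produces an iterable of tokens, filters written this way can be chained by passing one as the ``source`` of the next.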

To use a filter, simply wrap it around a stream:
To use a filter, simply wrap it around a token stream:

.. code-block:: python
@@ -142,9 +137,11 @@ To use a filter, simply wrap it around a stream:
Tree adapters
-------------

Used to translate one type of tree to another. More documentation
pending, sorry.
Tree adapters can be used to translate between tree formats.
Two adapters are provided by html5lib:

* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.

Encoding discovery
------------------
@@ -156,14 +153,14 @@ the following way:
* The encoding may be explicitly specified by passing the name of the
encoding as the encoding parameter to the
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
``HTMLParser`` objects.
:class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
encoding from a ``<meta>`` element in the first 512 bytes of the
document (this is only a partial implementation of the current HTML
5 specification).
specification).

* If no encoding can be found and the chardet library is available, an
* If no encoding can be found and the :mod:`chardet` library is available, an
attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
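
The fallback order described in the list above can be sketched in plain Python. This is only an illustration of the ordering, not html5lib's implementation: the ``discover_encoding`` function, the simplified ``<meta>`` regular expression, and the ``windows-1252`` default are all assumptions made for this example.

```python
import re

# Simplified pattern for a charset declaration inside a <meta> tag.
# A real HTML prescan is considerably more involved.
META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def discover_encoding(data, explicit=None, default="windows-1252"):
    """Return an encoding name for ``data`` (bytes); hypothetical helper."""
    # 1. An explicitly specified encoding always wins.
    if explicit:
        return explicit
    # 2. Look for a <meta> declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Sniff the byte pattern, but only if chardet happens to be installed.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            return guess.lower()
    # 4. If all else fails, use the default (assumed value in this sketch).
    return default


from_argument = discover_encoding(b"<p>hi", explicit="utf-8")
from_meta = discover_encoding(b'<meta charset="ISO-8859-2"><p>hi')
```

Each step runs only when the previous ones yield nothing, so a cheaper and more reliable source always takes precedence over a guess.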
