From 637826ffa72ca982dff6ae7204e4afcc35f3e29e Mon Sep 17 00:00:00 2001
From: Tom Most
Date: Sat, 15 Apr 2017 11:51:16 -0700
Subject: [PATCH] Update and expand "moving parts" doc

---
 doc/movingparts.rst | 65 +++++++++++++++++++++------------------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/doc/movingparts.rst b/doc/movingparts.rst
index 3eeff4f2..1f3086cb 100644
--- a/doc/movingparts.rst
+++ b/doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
 html5lib consists of a number of components, which are responsible for handling its features.
 
+Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+Several tree representations are supported, as are translations to other formats via *tree adapters*.
+The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
 
 Tree builders
 -------------
 
 The parser reads HTML by tokenizing the content and building a tree that
-the user can later access. There are three main types of trees that
-html5lib can build:
+the user can later access. html5lib can build three types of trees:
 
-* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
   which can be found in the standard library. Whenever possible, the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x) is used.
 
-* ``dom`` - builds a tree based on ``xml.dom.minidom``.
+* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
 
-* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
   API. The performance gains are relatively small compared to using the accelerated ``ElementTree`` module.
 
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
 
-When instantiating a parser object, you have to pass a tree builder
-class in the ``tree`` keyword attribute:
+To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
 
-.. code-block:: python
-
-  import html5lib
-  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-  document = parser.parse("<p>Hello World!")
-
-To get a builder class by name, use the ``getTreeBuilder`` function:
+When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
 
 .. code-block:: python
 
   import html5lib
-  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  TreeBuilder = html5lib.getTreeBuilder("dom")
+  parser = html5lib.HTMLParser(tree=TreeBuilder)
   minidom_document = parser.parse("<p>Hello World!")
 
 The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
 Tree walkers
 ------------
 
-Once a tree is ready, you can work on it either manually, or using
-a tree walker, which provides a streaming view of the tree. html5lib
-provides walkers for all three supported types of trees (``etree``,
-``dom`` and ``lxml``).
+In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams `_.
 
 The implementation of walkers can be found in `html5lib/treewalkers/ `_.
 
-Walkers make consuming HTML easier. html5lib uses them to provide you
-with has a couple of handy tools.
+html5lib provides a few tools for consuming token streams:
+* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
+* filters, to manipulate the token stream.
 
 HTMLSerializer
 ~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
   '>'
   'Witam wszystkich'
 
-You can customize the serializer behaviour in a variety of ways, consult
-the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
-documentation.
+You can customize the serializer behaviour in a variety of ways. Consult
+the :class:`~html5lib.serializer.HTMLSerializer` documentation.
 
 
 Filters
 ~~~~~~~
 
-You can alter the stream content with filters provided by html5lib:
+html5lib provides several filters
 
 * :class:`alphabeticalattributes.Filter ` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
   the document
 
 * :class:`lint.Filter ` raises
-  ``LintError`` exceptions on invalid tag and attribute names, invalid
+  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
   PCDATA, etc.
 
 * :class:`optionaltags.Filter `
-  removes tags from the stream which are not necessary to produce valid
+  removes tags from the token stream which are not necessary to produce valid
   HTML
 
 * :class:`sanitizer.Filter ` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:
 
 * :class:`whitespace.Filter ` collapses all whitespace
   characters to single spaces unless they're in
-  ``

`` or ``textarea`` tags.
+  ``
`` or ``