Commit

Remove docs for HTMLTokenizer and HTMLSanitizer
HTMLTokenizer is now a private API (I cannot find a public export).
HTMLSanitizer no longer exists as a tokenizer, and has been replaced
with a filter.
twm committed Apr 15, 2017
1 parent 964d0e1 commit abf6224
Showing 1 changed file with 0 additions and 38 deletions.
38 changes: 0 additions & 38 deletions doc/movingparts.rst
@@ -169,41 +169,3 @@ the following way:
* If all else fails, the default encoding will be used. This is usually
`Windows-1252 <https://en.wikipedia.org/wiki/Windows-1252>`_, which is
a common fallback used by Web browsers.


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<https://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
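
For context on the commit message's note that HTMLSanitizer was replaced with a filter, the following is a minimal sketch of filter-based sanitization. It assumes the ``html5lib.filters.sanitizer`` module and its ``Filter`` class as shipped in html5lib releases from around the time of this commit. The input string is reused from the removed example, the variable names are illustrative, and newer releases may emit a deprecation warning for this module.

```python
import html5lib
from html5lib.filters import sanitizer
from html5lib.serializer import HTMLSerializer

# Parse with the default tokenizer; no sanitizing happens at this stage.
dom = html5lib.parseFragment(
    "<p>Surprise!<script>alert('Boo!');</script>", treebuilder="etree"
)

# Walk the parsed tree as a token stream and wrap it in the sanitizer filter.
walker = html5lib.getTreeWalker("etree")
stream = sanitizer.Filter(walker(dom))

# Serialize the filtered stream back to markup; disallowed elements such as
# <script> come out as escaped, visible text rather than live markup.
output = HTMLSerializer().render(stream)
print(output)
```

In this arrangement the filter sits between the tree walker and the serializer, so sanitization is applied to the token stream after parsing rather than inside the tokenizer.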
