soup-strainer

A reimplementation of the Readability algorithm using BeautifulSoup and html5lib.

What does this do?

It takes HTML and scores the markup structure in an attempt to divine which bits are a human-readable article instead of junk. It then rips out the junk and returns clean(ish) markup containing the most relevant bits of the page.
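To give a feel for what "scoring the markup structure" means, here is a toy sketch of Readability-style scoring. It is not the code in this repository: the hint patterns, the point values, and the names `ToyScorer` and `best_candidate` are all made up for illustration, and it uses the standard library's `html.parser` only so that it runs without any dependencies.

```python
import re
from html.parser import HTMLParser

# Illustrative hint patterns only; real implementations use longer lists.
NEGATIVE = re.compile(r"comment|footer|sidebar|nav|promo|share", re.I)
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
CONTAINERS = {"div", "article", "section", "td"}


class ToyScorer(HTMLParser):
    """Score container elements by the paragraph text found inside them."""

    def __init__(self):
        super().__init__()
        self.scored = []           # one [label, score] pair per container seen
        self.open_containers = []  # (tag, index into self.scored)
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag in CONTAINERS:
            self.in_p = False  # a new container implicitly ends any open <p>
            attrs = dict(attrs)
            hints = " ".join(v for v in (attrs.get("class"), attrs.get("id")) if v)
            score = 0.0
            if NEGATIVE.search(hints):
                score -= 25
            if POSITIVE.search(hints):
                score += 25
            self.scored.append([hints or tag, score])
            self.open_containers.append((tag, len(self.scored) - 1))

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in CONTAINERS:
            # Close back to the nearest matching open container, if any.
            for i in range(len(self.open_containers) - 1, -1, -1):
                if self.open_containers[i][0] == tag:
                    del self.open_containers[i:]
                    break

    def handle_data(self, data):
        text = data.strip()
        if not (self.in_p and text and self.open_containers):
            return
        # One base point, one per comma, up to three more for sheer length.
        points = 1 + text.count(",") + min(len(text) // 100, 3)
        # Innermost container gets full credit, the next one out gets half.
        self.scored[self.open_containers[-1][1]][1] += points
        if len(self.open_containers) > 1:
            self.scored[self.open_containers[-2][1]][1] += points / 2


def best_candidate(html):
    """Return the label (class/id hints, or tag name) of the top-scoring container."""
    scorer = ToyScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scored:
        return None
    return max(scorer.scored, key=lambda item: item[1])[0]
```

A real implementation would then keep the winning element (and usually some of its siblings) and strip everything else, which this sketch does not attempt.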

Why another implementation?

Well, most of the modern Python ports/conversions use lxml, which is fast and lenient but involves an extra dependency.

Since I needed a pure Python solution, I decided to take the bits I needed from the lxml implementations, refactor the code to make it (a lot) easier to maintain, and back-port the whole thing to BeautifulSoup.

Didn't BeautifulSoup have trouble parsing bad markup?

BeautifulSoup 3.x used SGMLParser, which was noticeably brain-dead on occasion, so yes. And that was one of the reasons it was slow, too.

But BeautifulSoup 4.x can use either lxml or html5lib, and html5lib is pure Python. Or, if you're using Python 2.7.3 or later (3.2.2 or later on Python 3), you can use the standard library's improved html.parser, which is now more lenient.

html5lib also handles all sorts of corner cases automatically, so now I have the best of all worlds: I can choose lxml for speed, stick to the standard library for simple stuff, or use html5lib for quirky parsing, all while keeping the ease of use that characterizes BeautifulSoup.
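As a rough illustration of that choice, the sketch below picks a parser name based on what is installed. The helper `pick_parser` and the `PREFERENCES` table are hypothetical and not part of this project; the strings themselves (`"lxml"`, `"html5lib"`, `"html.parser"`) are the parser names BeautifulSoup 4 accepts as its second constructor argument.

```python
from importlib.util import find_spec

# Hypothetical preference table: which parsers to try, in order, for each goal.
PREFERENCES = {
    "speed": ("lxml", "html5lib", "html.parser"),
    "lenient": ("html5lib", "lxml", "html.parser"),
    "stdlib": ("html.parser",),
}


def pick_parser(goal="lenient"):
    """Return the first parser name for this goal that is importable."""
    for name in PREFERENCES[goal]:
        # html.parser ships with Python, so it is always available.
        if name == "html.parser" or find_spec(name) is not None:
            return name
    return "html.parser"
```

With BeautifulSoup 4 installed, the result can be passed straight to the constructor, e.g. `BeautifulSoup(markup, pick_parser("speed"))`.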

Next Steps

More language hints (Portuguese, German, etc.)
Score tuning
"Learning" (i.e., persistent scoring of successful tag IDs and classes across invocations)
URL and HREF handling (i.e., toss in a URL and it will fetch the page by itself, normalizing all HREFs afterwards) -- not done yet simply out of laziness; the utility functions are there
Multi-page support (i.e., have it bolt on extra markup by divining links) -- a trifle harder
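
For the HREF normalization item above, a minimal sketch of what resolving links against the page URL could look like, using only the standard library. The function name `absolutize` and the choice to drop fragment-only and non-HTTP links are assumptions for illustration, not the utility functions already in this repository.

```python
from urllib.parse import urljoin, urlsplit


def absolutize(base_url, href):
    """Resolve href against base_url; return None for links not worth keeping."""
    href = (href or "").strip()
    # Empty and fragment-only links point back at the same page.
    if not href or href.startswith("#"):
        return None
    absolute = urljoin(base_url, href)
    # Drop mailto:, javascript: and other non-web schemes.
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute
```

Each `href` attribute in the cleaned markup would be passed through this with the fetched page's URL as `base_url`.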

Other Implementations

kwellman's gist
nirmalpatel's
Sharmila Gopirajan's decruft
mitechie's breadability
buriy's python-readability - slightly better than gfxmonk's in my tests
bndr's node-read


rcarmo/soup-strainer
