A reimplementation of the Readability/Decruft algorithm using BeautifulSoup and html5lib

rcarmo/soup-strainer

soup-strainer

A reimplementation of the Readability algorithm using BeautifulSoup and html5lib.

What does this do?

It takes HTML and scores the markup structure in an attempt to divine which bits are a human-readable article instead of junk. It then rips out the junk and returns clean(ish) markup containing the most relevant bits of the page.
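As a rough illustration of the idea (not the scoring this project actually uses), the sketch below rates each candidate container by how much paragraph text it holds and discounts link-heavy blocks. The names `score_node` and `extract_main`, the tag lists, and the weighting are all assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical list of container tags worth scoring (an assumption for this sketch).
CANDIDATE_TAGS = ["div", "article", "section", "td"]


def score_node(node):
    """Score a container by its paragraph text length, discounted by link density."""
    text_len = sum(len(p.get_text(strip=True)) for p in node.find_all("p"))
    link_len = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
    total_len = len(node.get_text(strip=True)) or 1  # avoid division by zero
    link_density = link_len / total_len
    return text_len * (1.0 - link_density)


def extract_main(html, parser="html.parser"):
    """Return the highest-scoring candidate container, or None if there is none."""
    soup = BeautifulSoup(html, parser)
    # Drop elements that never carry article text.
    for junk in soup.find_all(["script", "style", "nav", "footer"]):
        junk.decompose()
    candidates = soup.find_all(CANDIDATE_TAGS)
    if not candidates:
        return None
    return max(candidates, key=score_node)
```

Calling `extract_main(page_html)` returns a BeautifulSoup tag that can be serialized with `str()`. A real implementation would also weigh class and ID names, which this sketch leaves out.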

Why another implementation?

Well, most of the modern Python ports/conversions use lxml, which is fast and lenient but adds a compiled C extension as an extra dependency.

Since I needed a pure Python solution, I decided to take the bits I needed from the lxml implementations, refactor the code to make it (a lot) easier to maintain, and back-port the whole thing to BeautifulSoup.

Didn't BeautifulSoup have trouble parsing bad markup?

BeautifulSoup 3.x used SGMLParser, which was noticeably brain-dead on occasion, so yes. And that was one of the reasons it was slow, too.

But BeautifulSoup 4.x can use either lxml or html5lib, and the latter is pure Python. Or, if you're using Python 2.7.3 (or 3.2.2) or later, you can use the improved standard library html.parser, which is now more lenient.

But html5lib handles all sorts of corner cases automatically, so now I have the best of all worlds -- I can use lxml for speed, the standard library parser for simple stuff, or html5lib for quirky markup -- while keeping the ease of use that characterizes BeautifulSoup.
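To make the parser choice concrete, here is a minimal sketch that picks whichever parser is installed and falls back to the standard library. The helper names `available_parser` and `parse` are hypothetical and not part of this project; the parser names passed to `BeautifulSoup` ("lxml", "html5lib", "html.parser") are the ones BeautifulSoup 4 accepts.

```python
import importlib.util

from bs4 import BeautifulSoup


def available_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser from `preferred`, else the stdlib one."""
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def parse(html, parser=None):
    """Parse `html` with an explicit parser, or with the best one available."""
    return BeautifulSoup(html, parser or available_parser())
```

For example, `parse(html, parser="html5lib")` forces the lenient pure-Python parser, while `parse(html)` prefers lxml when it happens to be installed.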

Next Steps

  • More language hints (Portuguese, German, etc.)
  • Score tuning
  • "Learning" (i.e., persistent scoring of successful tag IDs and classes across invocations)
  • URL and HREF handling (i.e., toss in a URL and it will fetch the page by itself, normalizing all HREFs afterwards) -- not done yet simply out of laziness, the utility functions are there
  • Multi-page support (i.e., have it bolt on extra markup by divining links) -- a trifle harder
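The HREF normalization item above could look roughly like the following. This is only a sketch: `absolutize_links` is a hypothetical name, it is not one of the utility functions mentioned in the list, and it skips the fetching step entirely. It relies on `urljoin` from the standard library to resolve relative references against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutize_links(html, base_url, parser="html.parser"):
    """Rewrite relative href/src attributes as absolute URLs based on `base_url`."""
    soup = BeautifulSoup(html, parser)
    # Only touch anchors that actually have an href attribute.
    for tag in soup.find_all("a", href=True):
        tag["href"] = urljoin(base_url, tag["href"])
    # Images get the same treatment through their src attribute.
    for tag in soup.find_all("img", src=True):
        tag["src"] = urljoin(base_url, tag["src"])
    return soup
```

Passing the URL the page was fetched from as `base_url` keeps links usable after the extracted markup is moved elsewhere.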

Other Implementations
