GitHub - scivey/goosepp: C++ port of goose html content extractor

#goosepp

An MIT-licensed C++ port of python-goose, which is itself a port of the original Scala Goose project.

Case study: Jezebel extraction

goosepp takes raw HTML like this article about Taylor Swift and extracts the most "contenty" text. It tries to remove navigation, headers, footers, ads, etc., and outputs the cleaned text of the main content like this:

Leave it to Taylor to tidy up a squabble quickly and, if possible, in the most maximalist way. Mere days after a social media-based tiff, misunderstanding, whatever you want to call it, Taylor invited Avril to the stage at her San Diego show. They performed “Complicated,” Taylor beamed and squeezed Avril — everything was beautiful again.
... [truncated]
So in case you had any further questions, Taylor Swift remains bosom friends with every living creature on the globe—with the exception of Katy Perry, for those wounds will never heal—and will one day invite every one of us on stage with her.

It also extracts titles:

Avril Lavigne And Taylor Swift Sing 'Complicated,' Are Besties Again

Title extraction uses meta tags if possible, falling back on the <title> element. It then tries to remove common title noise, matching patterns like Article Title | www.something.com and Jezebel: More Taylor Swift News.
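
The two noise patterns mentioned above can be illustrated with a minimal sketch. This is not goosepp's actual implementation: `trim` and `cleanTitle` are hypothetical helpers, and the assumption that the site name is known in advance is made only for illustration.

```cpp
#include <string>

// Hypothetical helper (not part of goosepp): strips leading and trailing ASCII whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto start = s.find_first_not_of(ws);
    if (start == std::string::npos) {
        return "";
    }
    const auto end = s.find_last_not_of(ws);
    return s.substr(start, end - start + 1);
}

// Hypothetical sketch of title-noise removal (not goosepp's real code).
// siteName is an assumed input; pass an empty string to skip the prefix check.
std::string cleanTitle(const std::string& rawTitle, const std::string& siteName) {
    std::string title = trim(rawTitle);

    // Pattern "Article Title | www.something.com":
    // drop everything from the last " | " separator onward.
    const auto pipe = title.rfind(" | ");
    if (pipe != std::string::npos) {
        title = trim(title.substr(0, pipe));
    }

    // Pattern "Jezebel: More Taylor Swift News":
    // drop a leading "<siteName>:" prefix.
    const std::string prefix = siteName + ":";
    if (!siteName.empty() && title.compare(0, prefix.size(), prefix) == 0) {
        title = trim(title.substr(prefix.size()));
    }
    return title;
}
```

Calling `cleanTitle("Jezebel: More Taylor Swift News", "Jezebel")` under these assumptions yields the title without the site prefix. A real extractor would likely need more separators and heuristics than this sketch covers.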

Code example

#include <iostream>
#include <string>
#include <goosepp/goosepp.h>
using std::string;
using std::cout;
using std::endl;
using scivey::goosepp::GooseExtractor;

string getHtmlSomehow(string url);

int main() {
    string url = "http://jezebel.com/articles/taylor-swift-9999";
    auto rawHtml = getHtmlSomehow(url);
    GooseExtractor extractor(url, rawHtml);
    cout << extractor.getBody() << endl;
    cout << extractor.getTitle() << endl;
    cout << extractor.getPublishDate() << endl;
    // that's it
}
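
The example leaves `getHtmlSomehow` undefined, since fetching the page is up to the caller. For local experiments, one possible stand-in is to read a previously saved page from disk. The sketch below assumes exactly that; `readHtmlFile` is a hypothetical helper, not something goosepp provides.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of goosepp): loads a saved HTML page from disk.
// Throws std::runtime_error if the file cannot be opened.
std::string readHtmlFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("could not open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole file into the buffer
    return buffer.str();
}
```

The returned string can then be passed as the `rawHtml` argument to `GooseExtractor`, alongside the URL the page was originally saved from.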

Dependencies

There are two:

MITIE for tokenization of element text
gumbo-parser for HTML parsing, DOM, etc.

Building

Make sure you have the dependencies, then:

make deps
mkdir build
cd build
cmake ../
sudo make install

Tests

There are lower-level unit tests, functional tests of content extraction, and a set of memory leak checks.

make test-unit
make test-functional
make test-mem

Benchmarks

These numbers aren't all that scientific, as the current benchmark is just run repeatedly against a single example.

    ./benchmark_runner
    Run on (4 X 3500 MHz CPU s)
    2015-09-07 19:56:12
    Benchmark            Time(ns)    CPU(ns) Iterations
    ---------------------------------------------------
    extractJezebelBody    6342558    6334100        100

goosepp can extract the content of this article in about 0.0063 seconds. The Python implementation takes around 0.09 seconds for the same article.
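
The arithmetic behind these figures can be checked with a short sketch. The helper names `nsToSeconds` and `speedup` are made up for illustration; the inputs are the 6342558 ns from the benchmark table and the approximate 0.09 seconds quoted for python-goose.

```cpp
#include <cstdint>

// Hypothetical helper: converts a nanosecond timing into seconds.
constexpr double nsToSeconds(std::int64_t ns) {
    return static_cast<double>(ns) / 1e9;
}

// Hypothetical helper: how many times faster the candidate is than the baseline.
constexpr double speedup(double baselineSeconds, double candidateSeconds) {
    return baselineSeconds / candidateSeconds;
}
```

With these inputs, `nsToSeconds(6342558)` is 0.006342558 seconds, and `speedup(0.09, nsToSeconds(6342558))` comes out at roughly 14. Since the benchmark uses a single article, that ratio is only a rough indication.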

To run the C++ benchmark yourself (requires the benchmark library):

make benchmark

To run the same benchmark against python-goose (requires python-goose in your PYTHONPATH):

python scripts/benchmark_python_goose.py

License

MIT
