A fairly intuitive & powerful framework that enables you to collect & save articles and news from all over the web.

UniStudents/Saffron

Saffron | News & announcements aggregation framework.

What is Saffron?

Saffron stands for Simple Abstract Framework For the Retrieval Of News.

As the name says, Saffron is a framework. It is an abstraction engine that helps you collect news and announcements from websites in a uniform way.

It supports different ways of data collection, such as API endpoints and web-scraping. It tries to ease the process of integrating all data sources, by abstracting data collection into a few simple and powerful functions.

Architecture

Saffron's architecture is based on a main node that issues scraping instructions and several worker nodes that do the scraping & upload the data to the database.

The nodes communicate through the Grid, which generates events to communicate with other classes. Saffron supports remote nodes by using a socket.io server and clients as middleware to connect them to the main node.
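The main/worker split can be pictured with a small in-process sketch. It uses Node's built-in `EventEmitter` as a stand-in for the grid. The `ScrapeJob` and `JobResult` shapes, the `startWorker` and `issueJobs` functions, the round-robin assignment and the `job.issued` / `job.finished` event names are all hypothetical and do not reflect Saffron's internal API.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical shapes, for illustration only.
interface ScrapeJob { id: number; sourceName: string; }
interface JobResult { jobId: number; workerId: string; articleCount: number; }

// Stand-in for the grid: a plain event bus shared by all nodes.
const grid = new EventEmitter();

// A worker listens for the jobs addressed to it and reports back when done.
function startWorker(workerId: string): void {
    grid.on(`job.issued:${workerId}`, (job: ScrapeJob) => {
        // A real worker would scrape here and upload the articles.
        const result: JobResult = { jobId: job.id, workerId, articleCount: 0 };
        grid.emit("job.finished", result);
    });
}

// The main node hands out jobs to the workers in round-robin order (an assumption).
function issueJobs(jobs: ScrapeJob[], workerIds: string[]): void {
    jobs.forEach((job, i) => {
        const target = workerIds[i % workerIds.length];
        grid.emit(`job.issued:${target}`, job);
    });
}

// The main node collects the results reported by the workers.
const finished: JobResult[] = [];
grid.on("job.finished", (result: JobResult) => {
    finished.push(result);
});

startWorker("worker-1");
startWorker("worker-2");
issueJobs(
    [
        { id: 1, sourceName: "source-a" },
        { id: 2, sourceName: "source-b" },
        { id: 3, sourceName: "source-c" },
    ],
    ["worker-1", "worker-2"],
);
```

With remote nodes, the same exchange would travel over socket.io instead of an in-process emitter.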

Installation

To install the latest release:

npm install @unistudents/saffron

To install a specific version:

npm install @unistudents/saffron@version

Initialization

Once you have installed the library and created your configuration:

import Saffron from "@unistudents/saffron";

const saffron = new Saffron();

// Initialize saffron
saffron.initialize({/* configuration */});

// Start the scheduler and workers.
saffron.start();

Configuration

Read the configuration file for more information.

Parsers

To retrieve the desired information from websites we use parsers. There are five available parsers: WordPress V2, RSS, JSON / XML, HTML and Dynamic.

WordPress V2

Parser type: wordpress-v2

By default, WordPress-based websites have an open API for news retrieval. We make use of it to access the articles and categories of the website.

To quickly check whether a website supports the WordPress API, open your browser and go to <website-root-link>/wp-json/wp/v2/posts/. If a valid JSON file containing the website's articles is displayed in the browser (or downloaded to your computer), then you can safely use the wordpress-v2 parser.
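The manual browser check can also be scripted. The sketch below is not part of Saffron: `wpPostsUrl`, `looksLikeWpPosts` and `supportsWordPressApi` are hypothetical helper names, and the HTTP call is passed in as a `fetchText` function so that any client can be used. It treats a response as valid when it parses as a JSON array whose items carry an `id` field, which is a heuristic rather than a full validation.

```typescript
// Build the WordPress posts endpoint for a website root.
function wpPostsUrl(websiteRoot: string): string {
    // Drop trailing slashes so the path is not doubled up.
    return `${websiteRoot.replace(/\/+$/, "")}/wp-json/wp/v2/posts/`;
}

// Decide whether a response body looks like a list of WordPress posts.
function looksLikeWpPosts(body: string): boolean {
    try {
        const data: unknown = JSON.parse(body);
        return Array.isArray(data)
            && data.every((post) => typeof post === "object" && post !== null && "id" in post);
    } catch {
        return false; // Not valid JSON (for example an HTML error page).
    }
}

// Run the check with a caller-supplied HTTP function (hypothetical signature).
async function supportsWordPressApi(
    websiteRoot: string,
    fetchText: (url: string) => Promise<string>,
): Promise<boolean> {
    try {
        return looksLikeWpPosts(await fetchText(wpPostsUrl(websiteRoot)));
    } catch {
        return false; // Network error or unreachable host.
    }
}
```

On runtimes with a global `fetch`, `fetchText` could be as small as `(url) => fetch(url).then((r) => r.text())`.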

RSS

Parser type: rss

Many websites provide an RSS feed. RSS allows users and applications to access updates to websites in a standardized, computer-readable format. You can check whether a website supports RSS by looking for the RSS feed icon on its pages.
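Many sites also advertise their feed in the page's HTML with a `<link rel="alternate" type="application/rss+xml">` tag, which makes the check scriptable. The sketch below is a hypothetical helper (`findRssLinks` is not a Saffron function) that scans an HTML string with regular expressions. It is a rough heuristic, not a full HTML parser.

```typescript
// Find RSS feed URLs advertised through <link rel="alternate"> tags in a page's HTML.
function findRssLinks(html: string): string[] {
    const feeds: string[] = [];
    // Match every <link ...> tag, then inspect its attributes individually.
    for (const tag of html.match(/<link\b[^>]*>/gi) ?? []) {
        const rel = /\brel\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1] ?? "";
        const type = /\btype\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1] ?? "";
        const href = /\bhref\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1];
        const isRss = type.trim().toLowerCase() === "application/rss+xml";
        const isAlternate = rel.toLowerCase().split(/\s+/).includes("alternate");
        if (href && isRss && isAlternate) {
            feeds.push(href);
        }
    }
    return feeds;
}
```

The returned values are copied as written in the page, so relative links such as `/feed.xml` still need to be resolved against the page URL.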

JSON / XML

Parser type: json (or xml)

This parser is best suited to pages that load their data through API requests (e.g. lazy loading). The only prerequisite is that the API responses are in a structured JSON or XML format.

HTML

Parser type: html

This parser uses scraping tools like CheerioJS to scrape the website content and retrieve the displayed news. It works best when the website's HTML is well structured. Websites with unstructured HTML and CSS are very difficult to scrape.

Dynamic

Parser type: dynamic

Unlike the other parsers, this parser uses JavaScript/TypeScript code to parse a website. The user defines all of the scraping logic by extending the class DynamicSourceFile.

Which to choose

We recommend a specific order for using the available parsers.

  • If the desired website is based on WordPress and the WordPress articles API is enabled, then choose the wordpress-v2 parser.
  • If the desired website supports an RSS feed, then choose the rss parser.
  • If the desired website loads data using API requests with structured responses (e.g. lazy loading), then choose the json or xml parser.
  • If the desired website has well-structured HTML, then use the html parser.
  • If none of the above is possible (bad HTML or custom API), then the dynamic parser is the last resort.
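The recommended order above can be written down as a small decision function. This is only a sketch: the `SiteTraits` interface and the `chooseParser` function are hypothetical names used for illustration, and how each trait is detected is left to the caller.

```typescript
type ParserType = "wordpress-v2" | "rss" | "json" | "xml" | "html" | "dynamic";

// Hypothetical description of what is known about a website.
interface SiteTraits {
    wordpressApiEnabled: boolean;
    hasRssFeed: boolean;
    structuredApiFormat?: "json" | "xml"; // Set when the site loads data from a structured API.
    hasStructuredHtml: boolean;
}

// Walk the recommendations from top to bottom and return the first match.
function chooseParser(site: SiteTraits): ParserType {
    if (site.wordpressApiEnabled) return "wordpress-v2";
    if (site.hasRssFeed) return "rss";
    if (site.structuredApiFormat) return site.structuredApiFormat;
    if (site.hasStructuredHtml) return "html";
    return "dynamic"; // Last resort: custom scraping logic.
}
```

For example, a site with an RSS feed and well-structured HTML but no WordPress API would get the `rss` parser, because RSS comes earlier in the list.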

Article

We have created a universal format for the parsed news, and we named it Article.

Read the article file for more information.

Source files

What is a source file?

A source file is a JSON or JavaScript file that represents a website. These files are written by the user and tell Saffron how to parse a website.

Creating a source file

Read the source file for the common options or the parsers files WordPress V2, RSS, API, HTML or Dynamic for the scrape options.

Middleware

A middleware is a function that is executed before the articles are passed to the newArticles function. Middleware functions can be useful for logging, article formatting or sorting.

Middleware functions are executed in the order in which they were registered. Each middleware function can be called more than once.

Register a middleware

saffron.use("name", (...args: any) => {
    //...
});
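The ordering rule can be illustrated with a tiny stand-alone registry. This is a sketch of the general technique, not Saffron's implementation: the `MiddlewareRegistry` class and its `run` method are hypothetical, and it assumes each middleware receives the value returned by the previous one.

```typescript
type Middleware = (...args: any[]) => any;

class MiddlewareRegistry {
    private readonly entries: { name: string; fn: Middleware }[] = [];

    // Remember the middleware together with the name it was registered under.
    use(name: string, fn: Middleware): void {
        this.entries.push({ name, fn });
    }

    // Run every middleware registered under `name`, in registration order,
    // feeding each one the value returned by the previous one.
    run<T>(name: string, initial: T): T {
        return this.entries
            .filter((entry) => entry.name === name)
            .reduce((value, entry) => entry.fn(value) as T, initial);
    }
}

const registry = new MiddlewareRegistry();
registry.use("title.format", (title: string) => title.trim());
registry.use("title.format", (title: string) => `[news] ${title}`);

// The trim runs first, then the prefix is added.
const formatted = registry.run("title.format", "  Hello  ");
```

Swapping the two `use` calls would leave the prefix in place but trim the combined string instead, which shows why registration order matters.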

Format article

This middleware changes the contents of individual articles. It receives every article found by the parsers as a parameter and must return the same object after changing it.

saffron.use("article.format", (article: Article) => {
    // If possible set pubDate with milliseconds.
    let ms = new Date(article.pubDate).getTime();
    if (!isNaN(ms)) article.pubDate = ms;

    // Append source name before title for every article
    article.title = `[${article.getSource(saffron).name}] ${article.title}`;

    // Return the changed article.
    return article;
});

You can also access the source class of the article by calling article.getSource(). Note that any changes made to the source class will also affect the saved source.

Articles

This middleware can be used to edit the articles in bulk. You can sort or filter them as you want. The only requirement is to return an array (empty or not) of articles.

saffron.use("articles", (articles: Article[]) => {
    sort(articles); // sort: a sorting helper that you provide
    return articles.filter(
        (article) => article.title != null && article.title !== ""
    );
});
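The `sort` helper in the example is not defined in the snippet. One possible implementation is sketched below, assuming articles carry the `pubDate` field used in the format example, either as a date string or as milliseconds. The `DatedArticle` interface and the `sortByNewest` name are hypothetical, and unlike an in-place sort this version returns a new array.

```typescript
// Hypothetical minimal article shape; only `title` and `pubDate` appear in the examples.
interface DatedArticle {
    title?: string | null;
    pubDate?: string | number | null;
}

// Convert a pubDate to milliseconds; missing or unparsable dates count as 0.
function toMillis(pubDate: string | number | null | undefined): number {
    const ms = typeof pubDate === "number" ? pubDate : new Date(pubDate ?? NaN).getTime();
    return isNaN(ms) ? 0 : ms;
}

// Return a copy of the articles sorted from newest to oldest.
function sortByNewest<A extends DatedArticle>(articles: A[]): A[] {
    return [...articles].sort((a, b) => toMillis(b.pubDate) - toMillis(a.pubDate));
}
```

Inside the middleware this would be used as `return sortByNewest(articles).filter(/* ... */);`. Articles without a usable date end up at the end of the list.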

Listeners

Saffron supports listeners for various events. Listeners can be used for logging or for creating analytics.

Read the listeners file for more information.

Standalone

Saffron supports immediate parsing using the static function parse.

import {Saffron} from "@unistudents/saffron";

try {
    const result = Saffron.parse({
        name: "source-name",
        url: ["Category 1", "https://example.com"],
        type: "html",
        // ...
        scrape: {
            // ...
        },
    }, null); // or pass a config

    console.log("Result:", result);
} catch (e) {
    console.log("Encountered an error during parsing:", e);
}

The result of the parse function is an array containing one object for each URL passed in the source file:

[
    {
        url: "https://example.com",
        aliases: ["Category 1"],
        articles: [/*Article*/, /*Article*/, /*Article*/, /*...*/]
    },
];
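To work with this result in typed code, the shape shown above can be described with an interface. The sketch below is an assumption based only on the example output: `ParseResultEntry`, `ArticleLike` and `articlesByAlias` are hypothetical names, not types exported by Saffron.

```typescript
// Hypothetical minimal article shape; only `title` is taken from the examples.
interface ArticleLike {
    title?: string | null;
}

// Shape of one entry in the result, following the example output.
interface ParseResultEntry<A = ArticleLike> {
    url: string;
    aliases: string[];
    articles: A[];
}

// Group all articles by alias, so that "Category 1" maps to its articles.
function articlesByAlias<A>(result: ParseResultEntry<A>[]): Map<string, A[]> {
    const byAlias = new Map<string, A[]>();
    for (const entry of result) {
        for (const alias of entry.aliases) {
            const bucket = byAlias.get(alias) ?? [];
            bucket.push(...entry.articles);
            byAlias.set(alias, bucket);
        }
    }
    return byAlias;
}
```

Grouping by alias is handy when several URLs share the same category name, since their articles end up in a single list.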