- What is Saffron?
- Architecture
- Installation
- Initialization
- Configuration
- Parsers
- Article
- Source files
- Middleware
- Listeners
- Standalone
Saffron stands for Simple Abstract Framework For the Retrieval Of News.
Saffron is a framework: an abstraction engine that helps you collect news and announcements from websites in a uniform way.
It supports different ways of data collection, such as API endpoints and web scraping. It eases the integration of all data sources by abstracting data collection into a few simple and powerful functions.
Saffron's architecture is based on a main node that issues scraping instructions and several worker nodes that do the scraping and upload the data to the database.
The nodes communicate through the Grid. The grid generates events to communicate with other classes. Saffron supports remote nodes by using a socket.io server and clients as a middleware to connect to the main node.
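The flow described above can be pictured with a small event relay. This is only a toy sketch, not Saffron's actual Grid: `ToyGrid`, the event names `job.issued` and `job.finished`, and the `ScrapeJob` and `ScrapeResult` shapes are all hypothetical names invented for illustration.

```typescript
// Hypothetical job and result shapes, for illustration only.
interface ScrapeJob { id: number; sourceName: string; }
interface ScrapeResult { jobId: number; worker: string; articles: string[]; }

type Listener = (payload: any) => void;

// A toy stand-in for the Grid: it only relays events between nodes.
class ToyGrid {
  private listeners = new Map<string, Listener[]>();

  on(event: string, listener: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
  }

  emit(event: string, payload: unknown): void {
    for (const listener of this.listeners.get(event) ?? []) listener(payload);
  }
}

function runDemo(): { database: ScrapeResult[]; finishedJobs: number[] } {
  const grid = new ToyGrid();
  const database: ScrapeResult[] = []; // in-memory stand-in for the real database
  const finishedJobs: number[] = [];

  // Worker node: receives a job, "scrapes", stores the data, then reports back.
  grid.on("job.issued", (job: ScrapeJob) => {
    const result: ScrapeResult = {
      jobId: job.id,
      worker: "worker-1",
      articles: [`article from ${job.sourceName}`],
    };
    database.push(result);
    grid.emit("job.finished", result);
  });

  // Main node: keeps track of which jobs have been completed.
  grid.on("job.finished", (result: ScrapeResult) => finishedJobs.push(result.jobId));

  // Main node issues a scraping instruction.
  grid.emit("job.issued", { id: 1, sourceName: "example-source" });
  return { database, finishedJobs };
}
```

In this sketch the nodes never call each other directly; they only share the relay, which is the property that lets a worker live in another process behind a socket connection.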
To install the latest release:
npm install @unistudents/saffron
To install a specific version:
npm install @unistudents/saffron@version
Once you have installed the library and created your configuration:
import Saffron from "@unistudents/saffron";
const saffron = new Saffron();
// Initialize saffron
saffron.initialize({/* configuration */});
// Start the scheduler and workers.
saffron.start();
Read the configuration file for more information.
To retrieve the desired information from the websites we use parsers.
There are five available parser types: wordpress, rss, html, api and dynamic.
Parser type: wordpress-v2
By default, WordPress-based websites have an open API for news retrieval.
We use it to access the articles and categories of the website.
To quickly check if a website supports the WordPress API, open your browser and go to <website-root-link>/wp-json/wp/v2/posts/.
If a valid JSON file that contains the website's articles is displayed in the browser (or downloaded to your computer), then you can safely use the wordpress-v2 parser.
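The manual browser check can also be scripted. The sketch below is an assumption-laden helper, not part of Saffron: the function names are hypothetical, and the "looks like a post list" test only checks for the `id` and `title` fields that the WordPress REST API returns for posts. The fetcher is passed in as a parameter so the helper stays independent of any HTTP library.

```typescript
// Builds the posts endpoint for a site root, tolerating trailing slashes.
function wordpressPostsUrl(siteRoot: string): string {
  return `${siteRoot.replace(/\/+$/, "")}/wp-json/wp/v2/posts/`;
}

// Returns true when the parsed response looks like a non-empty list of posts.
function looksLikePostList(body: unknown): boolean {
  return (
    Array.isArray(body) &&
    body.length > 0 &&
    body.every(
      (post) => typeof post === "object" && post !== null && "id" in post && "title" in post
    )
  );
}

// Hypothetical helper: `fetchJson` is supplied by the caller.
async function supportsWordpressApi(
  siteRoot: string,
  fetchJson: (url: string) => Promise<unknown>
): Promise<boolean> {
  try {
    return looksLikePostList(await fetchJson(wordpressPostsUrl(siteRoot)));
  } catch {
    // Network errors and invalid JSON both mean the API is not usable.
    return false;
  }
}
```

On a runtime with a global `fetch`, a caller could pass `(url) => fetch(url).then((r) => r.json())` as the fetcher.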
Parser type: rss
Many websites provide an RSS feed. RSS allows users and applications to access updates to websites in a standardized, computer-readable format. A website usually supports RSS if it displays the RSS icon.
Parser type: json (or xml)
This parser is best suited to pages that load their data through API requests (e.g. lazy loading). The only prerequisite is that the API responses are in a structured JSON or XML format.
Parser type: html
This parser uses scraping tools like CheerioJS to scrape the website content and retrieve the displayed news. It is best used when the HTML of the website is structured. Websites where the HTML and CSS are not structured will be very difficult to scrape.
Parser type: dynamic
Unlike the other parsers, this parser uses JavaScript/TypeScript code to parse a website. All the scraping logic is decided by the user by extending the class DynamicSourceFile.
We recommend a specific order for using the available parsers.
- If the desired website is based on WordPress and the WordPress articles API is enabled, then choose the wordpress-v2 parser.
- If the desired website supports an RSS feed, then choose the rss parser.
- If the desired website is loading data using API requests with structured responses (e.g. lazy loading), then choose the json or xml parser.
- If the desired website has a structured form, then use the html parser.
- If none of the above is possible (bad HTML or custom API), then the dynamic parser is the last choice.
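The recommended order can be written down as a simple decision function. This is a sketch for illustration: `SiteTraits` and `recommendParser` are hypothetical names that do not exist in Saffron, and the traits are assumed to have been determined by hand beforehand.

```typescript
type ParserType = "wordpress-v2" | "rss" | "json" | "xml" | "html" | "dynamic";

// Hypothetical description of what is known about a website.
interface SiteTraits {
  wordpressApi?: boolean;          // WordPress articles API is enabled
  rssFeed?: boolean;               // an RSS feed is available
  structuredApi?: "json" | "xml";  // format of structured API responses, if any
  structuredHtml?: boolean;        // the HTML is structured enough to scrape
}

// Walks the recommendations top to bottom and returns the first match.
function recommendParser(site: SiteTraits): ParserType {
  if (site.wordpressApi) return "wordpress-v2";
  if (site.rssFeed) return "rss";
  if (site.structuredApi) return site.structuredApi;
  if (site.structuredHtml) return "html";
  return "dynamic";
}
```

For example, a site with both an RSS feed and clean HTML resolves to `rss`, because the feed is checked first.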
We have created a universal format for the parsed news, and we named it Article.
Read the article file for more information.
A source file is a json or javascript file that represents a website.
These files are created by the user and guide Saffron on how to parse a website.
Read the source file for the common options, or the parser files WordPress V2, RSS, API, HTML or Dynamic for the scrape options.
A middleware is a function that gets executed before the articles are passed to the newArticles function.
Middleware functions can be useful for logging, article formatting or sorting.
Middleware functions are executed in the order in which they were registered. Each middleware function can be called more than once.
saffron.use("name", (...args: any) => {
    //...
});
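To make the registration-order rule concrete, here is a minimal sketch of a middleware chain. It is not Saffron's internal implementation: `MiddlewareChain` is a hypothetical class, and plain strings stand in for articles.

```typescript
type Middleware<T> = (value: T) => T;

// Hypothetical chain: functions run in the order they were registered.
class MiddlewareChain<T> {
  private readonly registered: Middleware<T>[] = [];

  use(fn: Middleware<T>): this {
    this.registered.push(fn);
    return this;
  }

  // Feeds the output of each middleware into the next one.
  run(initial: T): T {
    return this.registered.reduce((value, fn) => fn(value), initial);
  }
}

const chain = new MiddlewareChain<string[]>()
  .use((titles) => titles.filter((t) => t !== ""))      // registered first, runs first
  .use((titles) => titles.map((t) => t.toUpperCase())); // sees the filtered list

const output = chain.run(["b", "", "a"]);
console.log(output); // [ 'B', 'A' ]
```

Because each function receives the previous function's return value, a middleware that forgets to return its input would break everything registered after it.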
The article.format middleware is used for changing the contents of the articles. It receives as a parameter every article found by the parsers, and it must return the same object after changing it.
saffron.use("article.format", (article: Article) => {
    // If possible, set pubDate in milliseconds.
    const ms = new Date(article.pubDate).getTime();
    if (!isNaN(ms)) article.pubDate = ms;
    // Prepend the source name to the title of every article.
    article.title = `[${article.getSource(saffron).name}] ${article.title}`;
    // Return the changed article.
    return article;
});
You can also access the source class of the article by calling article.getSource().
Note that any changes made to the source class will also affect the saved source.
The articles middleware can be used to edit the articles in bulk. You can sort or filter them as you want. The only requirement is to return an array (empty or not) of articles.
saffron.use("articles", (articles: Article[]) => {
    // `sort` stands for your own sorting helper.
    sort(articles);
    // Drop articles without a title.
    return articles.filter(
        (article) => article.title != null && article.title !== ""
    );
});
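The sorting step is left to the user. One possible helper, sketched here under assumptions, orders articles from newest to oldest by `pubDate`. `DatedArticle` is a hypothetical minimal subset of the article fields, and `newestFirst` is not a Saffron function.

```typescript
// Hypothetical minimal subset of an article, for illustration only.
interface DatedArticle {
  title: string;
  pubDate: string | number;
}

// Converts a pubDate to milliseconds; unparsable dates are treated as 0 (the epoch).
function toMillis(pubDate: string | number): number {
  const ms = new Date(pubDate).getTime();
  return Number.isNaN(ms) ? 0 : ms;
}

// Returns a sorted copy, newest first, leaving the input array untouched.
function newestFirst<T extends DatedArticle>(articles: T[]): T[] {
  return articles.slice().sort((a, b) => toMillis(b.pubDate) - toMillis(a.pubDate));
}
```

Mapping unparsable dates to 0 keeps the comparator free of NaN results and pushes those articles to the end for any date after 1970.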
Saffron supports listeners for various events. Listeners can be used for logging or creating analytics.
Read the listeners file for more information.
Saffron supports immediate parsing using the static function parse.
import {Saffron} from "@unistudents/saffron";
try {
    const result = Saffron.parse({
        name: "source-name",
        url: ["Category 1", "https://example.com"],
        type: "html",
        // ...
        scrape: {
            // ...
        },
    }, null); // or pass a config
    console.log("Result:", result);
} catch (e) {
    console.log("Encountered an error during parsing:", e);
}
The result of the parse function is an array of objects, one for each url passed in the source file:
[
    {
        url: "https://example.com",
        aliases: ["Category 1"],
        articles: [/*Article*/, /*Article*/, /*Article*/, /*...*/]
    },
];
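A result of this shape is easy to post-process. The sketch below assumes only the three fields shown above. The `ParseResultEntry` and `ParsedArticle` type names and the `summarize` helper are hypothetical and not exported by Saffron.

```typescript
// Hypothetical types mirroring the result shape shown above.
interface ParsedArticle {
  title: string;
}

interface ParseResultEntry {
  url: string;
  aliases: string[];
  articles: ParsedArticle[];
}

// Produces one human-readable line per url, preferring aliases as the label.
function summarize(result: ParseResultEntry[]): string[] {
  return result.map(
    (entry) => `${entry.aliases.join(", ") || entry.url}: ${entry.articles.length} article(s)`
  );
}

const lines = summarize([
  { url: "https://example.com", aliases: ["Category 1"], articles: [{ title: "a" }, { title: "b" }] },
]);
console.log(lines); // [ 'Category 1: 2 article(s)' ]
```

When an entry has no aliases, the label falls back to the url itself.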