Crawls tech blogs and notifies you via Slack
Software engineers often subscribe to various tech blogs to keep up with the ever-evolving technology landscape.
However, visiting many blogs one by one is tedious, so it is much easier to follow them through their feeds.
Of course, a dedicated feed reader would work, but I want the feeds delivered to Slack, which I already use for work, so that I can follow tech blogs without consciously opening a reader.
So I decided to create a system that periodically crawls the registered entries and notifies Slack when there is an update.
Similar mechanisms exist, such as Slack's /feed command and RSS registration via IFTTT, but both limit the number of registrations on their free plans.
Therefore, I decided to use GitHub Actions and HarperDB, which are free for OSS.
The name doesn't have any particular meaning; I just pictured a spider running around the web of tech blogs. Though strictly speaking, that makes it a crawler.
- All free
  - No Slack Apps, Slack `/feed`, or IFTTT, so the registration limits of their free plans do not apply.
  - HarperDB has a free plan.
- Posts feeds to Slack only when there is an update.
- You can register an unlimited number of feeds.
- Easy to automate; this repository uses GitHub Actions, which is also free.
- You can use pytermextract to extract technical terms from an article and display them on Slack.
HarperDB is a managed NoSQL DB service, and Tech Blog Spider uses HarperDB to store the last update of RSS entries.
- Get a HarperDB account.
- Create an organization (or use an existing one).
- Create an instance (recommended: HarperDB Cloud Instance).
- Create a schema (by default, `prd`).
- Create two tables:
  - `entry_urls` (hash attribute is `name`)
  - `last_published` (hash attribute is `name`)
Create the database (schema):

```
POST /
{
    "operation": "create_database",
    "database": "prd"
}
```

Create the `entry_urls` table:

```
POST /
{
    "operation": "create_table",
    "database": "prd",
    "table": "entry_urls",
    "primary_key": "name"
}
```

Create the `last_published` table:

```
POST /
{
    "operation": "create_table",
    "database": "prd",
    "table": "last_published",
    "primary_key": "name"
}
```

Create the `time` attribute:

```
POST /
{
    "operation": "create_attribute",
    "database": "prd",
    "table": "last_published",
    "attribute": "time"
}
```
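As a rough sketch, the operations above can also be issued from Python with only the standard library. The instance URL and credentials below are placeholders, and the helper is illustrative rather than part of this repository:

```python
import base64
import json
import urllib.request

def build_operation_request(url, username, password, operation):
    """Build an authenticated POST for the HarperDB operations API."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(operation).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # HarperDB accepts Basic auth
        },
        method="POST",
    )

# Example: the create_table operation for entry_urls shown above.
req = build_operation_request(
    "https://example.harperdbcloud.com",  # placeholder instance URL
    "user", "password",                   # placeholder credentials
    {"operation": "create_table", "database": "prd",
     "table": "entry_urls", "primary_key": "name"},
)
# urllib.request.urlopen(req) would send it; not executed here.
```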
Edit `entry.csv` to register the RSS entries you want to subscribe to.

- The first column is `name`: must be unique.
- The second column is `url`: the RSS feed URL (RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0 are supported).
- The third column is `icon`: optional; set it if you want to use an icon other than the site's favicon.

```
name,url,icon
"aws","https://aws.amazon.com/jp/blogs/aws/feed/","https://i.imgur.com/Z5YLUiS.png"
```
Incoming webhooks created through the new Slack API are Slack Apps and count against the workspace's App limit, so I added a legacy custom-integration incoming webhook instead.
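For illustration, a notification for one new article could be posted to the webhook along these lines. The payload fields (`username`, `icon_url`, the `<url|label>` link markup) are my assumptions about a reasonable message shape, not necessarily what this repository sends:

```python
import json
import urllib.request

def build_slack_request(webhook_url, entry_name, title, link, icon=None):
    """Build a POST to a Slack incoming webhook announcing one article."""
    payload = {
        "username": entry_name,        # per-feed display name (assumption)
        "text": f"<{link}|{title}>",   # Slack's <url|label> link markup
    }
    if icon:
        payload["icon_url"] = icon     # optional custom icon from entry.csv
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder webhook
    "aws", "New blog post", "https://example.com/post",
)
# urllib.request.urlopen(req) would deliver it; not executed here.
```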
Set the following environment variables:

| name | description | default |
|---|---|---|
| HARPERDB_URL | HarperDB instance URL | - |
| HARPERDB_USERNAME | HarperDB username | - |
| HARPERDB_PASSWORD | HarperDB password | - |
| HARPERDB_SCHEMA | HarperDB schema | prd |
| SLACK_WEBHOOK_URL | Slack incoming webhook URL | - |
| LOGGING_LEVEL | Logging level (CRITICAL, ERROR, WARNING, INFO, DEBUG) | INFO |
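The variables above could be read with their defaults along these lines (a sketch; the actual code in `src/main.py` may differ):

```python
import os

def load_config(env=None):
    """Read configuration, applying the defaults from the table above."""
    env = os.environ if env is None else env
    return {
        "harperdb_url": env["HARPERDB_URL"],            # required, no default
        "harperdb_username": env["HARPERDB_USERNAME"],  # required
        "harperdb_password": env["HARPERDB_PASSWORD"],  # required
        "harperdb_schema": env.get("HARPERDB_SCHEMA", "prd"),
        "slack_webhook_url": env["SLACK_WEBHOOK_URL"],  # required
        "logging_level": env.get("LOGGING_LEVEL", "INFO"),
    }
```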
For example, when setting environment variables in GitHub Actions, it is recommended to register each value as a repository secret and reference it from the workflow as shown below.
```yaml
- name: Run RSS
  env:
    HARPERDB_URL: ${{ secrets.HARPERDB_URL }}
    HARPERDB_USERNAME: ${{ secrets.HARPERDB_USERNAME }}
    HARPERDB_PASSWORD: ${{ secrets.HARPERDB_PASSWORD }}
    HARPERDB_SCHEMA: ${{ secrets.HARPERDB_SCHEMA }}
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
    LOGGING_LEVEL: "DEBUG"
  run: python3 src/main.py
```
The required libraries are listed in `requirements.txt` and can be installed with pip.

```
pip install -r requirements.txt
python src/create_config.py
python src/main.py
```