- You need to pass in a
GH_TOKEN
env value (A github token) to be able to scrap websites
To run the WSS script and feed live data to the database:
- You need to create a blockchain account and get an api key
- You need to pass in a
BLOCKCHAIN_API_KEY
env value - You need to then run
./run.sh wss
to start the websocket script
The following tools will be use to scrap the data from the news feed:
These are robust and widely used for web scraping.
Depending on the volume and nature of our data, we should consider using a combination of relational databases (like PostgreSQL) and NoSQL databases (like MongoDB or Elasticsearch)
The following tools will be use to real-time data processing, especially if wz expect high volumes of data.
It can also handle batch processing, so it offers flexibility.
Apache Kafka can be used in conjunction with pytorch to handle real-time data ingestion and processing. Kafka can act as a buffer to store the scraped data. Depending on the kind of analytics we're running, a time-series database like InfluxDB or TimescaleDB might be beneficial.
Since we're setting up a pipeline, we should have monitoring and alerting in place. Tools like Prometheus and Alertmanager can be integrated with Grafana to provide monitoring capabilities.
The following tools will be use to visualize the data:
Grafana is a solid choice for visualizing time-series data. It integrates well with many databases, including InfluxDB and TimescaleDB.
We should ensure we have the right plugins or visualizations to represent the analytics as you envision.
Grafana has a rich library of plugins.
For more interactive and custom analytics visualization, we could consider using Tableau or Power BI.
We'll use a self-managed system and maybe Kubernetes to manage the infrastructure.
We should also consider setting up an automation tool or CI/CD pipeline, like Jenkins or GitHub Actions, to deploy updates and changes to our system seamlessly.
We'll use the following news feeds:
A news should have the following attributes:
id
: unique identifiertitle
: title of the newsdatetime
: date of the newsdescription
: description of the newsurl
: url of the news