Cryptodata

Notes

Scraping

You need to pass in a GH_TOKEN env value (a GitHub token) to be able to scrape websites.
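As a minimal sketch of how a script could fail early when the token is missing, the helper below reads a required environment variable and raises a clear error if it is unset or empty. The function name require_env is hypothetical and not taken from this repository; only the variable name GH_TOKEN comes from the note above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical usage inside a scraper entry point:
#     token = require_env("GH_TOKEN")
```

The same helper would work for any other required variable, so the check does not need to be repeated per script.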

Live data

To run the WSS script and feed live data to the database:

Create a blockchain account and get an API key.
Pass in a BLOCKCHAIN_API_KEY env value.
Run ./run.sh wss to start the websocket script.

System Design

Web scraper

The following tools will be used to scrape the data from the news feed:

These are robust and widely used for web scraping.

Data Storage

Depending on the volume and nature of our data, we should consider using a combination of relational databases (like PostgreSQL) and NoSQL databases (like MongoDB or Elasticsearch).

Data Builder (Processing)

The following tools will be used for real-time data processing, especially if we expect high volumes of data.

Apache Kafka

It can also handle batch processing, so it offers flexibility.

Apache Kafka can be used in conjunction with PyTorch to handle real-time data ingestion and processing. Kafka can act as a buffer that stores the scraped data until it is processed. Depending on the kind of analytics we're running, a time-series database like InfluxDB or TimescaleDB might be beneficial.
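To illustrate the buffering idea only, here is a small sketch that uses Python's standard queue.Queue as an in-memory stand-in for a Kafka topic. It is not Kafka and does not use a Kafka client; the function names (scrape, process, run_pipeline) and the item shape are assumptions made for the example.

```python
import queue
import threading

_DONE = None  # sentinel telling the consumer that no more items will arrive


def scrape(buffer: queue.Queue, items: list) -> None:
    """Producer: push scraped items into the buffer, blocking while it is full."""
    for item in items:
        buffer.put(item)
    buffer.put(_DONE)


def process(buffer: queue.Queue, sink: list) -> None:
    """Consumer: pull items from the buffer in order and store a processed form."""
    while True:
        item = buffer.get()
        if item is _DONE:
            break
        sink.append(item["title"].upper())


def run_pipeline(items: list, maxsize: int = 2) -> list:
    """Run producer and consumer decoupled by a bounded FIFO buffer."""
    buffer = queue.Queue(maxsize=maxsize)
    processed = []
    consumer = threading.Thread(target=process, args=(buffer, processed))
    consumer.start()
    scrape(buffer, items)
    consumer.join()
    return processed
```

The point of the design is that the scraper and the processor run at their own pace: the bounded buffer absorbs bursts and applies back-pressure when the consumer falls behind. A real Kafka topic would add persistence and multiple consumers on top of this idea.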

Monitoring & Error Handling

Since we're setting up a pipeline, we should have monitoring and alerting in place. Tools like Prometheus and Alertmanager can be integrated with Grafana to provide monitoring capabilities.

Dynamic Viewer with Analytics

The following tools will be used to visualize the data:

Grafana

Grafana is a solid choice for visualizing time-series data. It integrates well with many databases, including InfluxDB and TimescaleDB.

We should ensure we have the right plugins or visualizations to represent the analytics as we envision them.

Grafana has a rich library of plugins.

For more interactive and custom analytics visualization, we could consider using Tableau or Power BI.

Infrastructure

We'll use a self-managed system and maybe Kubernetes to manage the infrastructure.

Automation

We should also consider setting up an automation tool or CI/CD pipeline, like Jenkins or GitHub Actions, to deploy updates and changes to our system seamlessly.

Feeds

We'll use the following news feeds:

CryptoPanic

Data schema

News object

A news item should have the following attributes:

id: unique identifier
title: title of the news
datetime: date of the news
description: description of the news
url: url of the news
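The attributes above could be modeled as a small dataclass. This is a sketch under assumptions: the field types, the ISO 8601 date format, and the from_dict helper are illustrative choices and are not taken from the repository's news_model.json.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class News:
    """One news item, with the five attributes listed in the schema."""

    id: str                 # unique identifier (assumed to be a string)
    title: str              # title of the news item
    datetime: dt.datetime   # date of the news item
    description: str        # description of the news item
    url: str                # URL of the news item

    @classmethod
    def from_dict(cls, raw: dict) -> "News":
        """Build a News from a plain dict, assuming an ISO 8601 datetime string."""
        return cls(
            id=str(raw["id"]),
            title=raw["title"],
            datetime=dt.datetime.fromisoformat(raw["datetime"]),
            description=raw.get("description", ""),
            url=raw["url"],
        )
```

For example, News.from_dict({"id": 42, "title": "Example", "datetime": "2024-01-15T09:30:00+00:00", "url": "https://example.com/a"}) yields an item whose id is the string "42" and whose description defaults to an empty string.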

