# eu-data-workflow

This is my data analysis project. It consists of this repo, which scrapes data from public sources and models it into a database, and a visualization tool on the frontend.

This follows the idea of Simon Willison, who published a great blog post about his experience of using GitHub Actions as a scraping pipeline ("git scraping").

The data from the CORDIS database (named "H2020" in this repo) is subject to copyright: © European Union, 2022.

The data from the BMBF is subject to copyright: © Bundesministerium für Bildung und Forschung.

## Approach

This repository does the following:

- scrape data regularly from known sources via cron-scheduled GitHub Actions (see the workflow sketch below)
- push the data through a minimal ETL pipeline
- upload the latest records to Neo4j AuraDB
- keep track of data changes via the Git history
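
As a rough sketch of what such a workflow can look like (the file name, schedule, data path, and `etl.py` entry point are illustrative assumptions, not the repo's actual files):

```yaml
# .github/workflows/scrape.yml -- illustrative file name
name: scrape-and-load

on:
  schedule:
    - cron: "0 6 * * *"  # run daily at 06:00 UTC
  workflow_dispatch:     # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # hypothetical entry point that scrapes, transforms, and uploads
      - run: python etl.py
        env:
          NEO4J_USER: ${{ secrets.NEO4J_USER }}
          NEO4j_PASSWORD: ${{ secrets.NEO4j_PASSWORD }}
          NEO4j_URL: ${{ secrets.NEO4j_URL }}
      # commit changed data files so the Git history tracks every change
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A data/
          git diff --cached --quiet || (git commit -m "data update" && git push)
```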

## Get it running yourself

You need to set up a (free) Neo4j AuraDB instance and put the credentials into your repository secrets (NEO4J_USER, NEO4j_PASSWORD, NEO4j_URL). Additionally, adapt the workflows under .github/workflows/... to your needs.
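
A minimal sketch of how a script could consume those secrets and talk to AuraDB, using the official `neo4j` Python driver (the `Project` label and MERGE query are illustrative assumptions, not this repo's actual schema):

```python
import os
from neo4j import GraphDatabase  # pip install neo4j

# The workflow exposes the repository secrets as environment variables;
# the names (including the mixed casing) mirror the secrets listed above.
URI = os.environ["NEO4j_URL"]  # e.g. neo4j+s://<id>.databases.neo4j.io
AUTH = (os.environ["NEO4J_USER"], os.environ["NEO4j_PASSWORD"])

def upload_records(records: list[dict]) -> None:
    """Merge scraped records into AuraDB (illustrative query, not the repo's schema)."""
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for rec in records:
                session.run(
                    "MERGE (p:Project {id: $id}) SET p += $props",
                    id=rec["id"], props=rec,
                )

if __name__ == "__main__":
    upload_records([{"id": "demo-1", "title": "Example project"}])
```

Using MERGE keyed on an id makes repeated runs idempotent, so re-running the scheduled action only updates existing nodes instead of duplicating them.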
