Note: the README for exercise 2 can be found at the end of this document.
This repository provides a code base for the information integration course in the summer semester of 2022. Below you can find the documentation for setting up the project.
- Install Poetry
- Install Docker and docker-compose
- Install the Protobuf compiler (protoc). If you are using Windows, you can use this guide
- Install jq
The Registerbekanntmachung website contains announcements concerning entries made into the companies, cooperatives, and partnerships registers within the electronic information and communication system. You can search for the announcements on the website.
Each announcement can be requested through the link below. You only need to pass the query parameters `rb_id` and `land_abk`. For instance, we chose the state Rheinland-Pfalz (`rp`) with the announcement id `56267`, the new entry of the company BioNTech.
```bash
export STATE="rp"
export RB_ID="56267"

curl -X GET "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=$RB_ID&land_abk=$STATE"
```
The Registerbekanntmachung crawler (`rb_crawler`) sends a GET request to the link above with the parameters (`rb_id` and `land_abk`) passed to it and extracts the information from the response.
We use Protocol Buffers to define our schema. The crawler uses the model class (i.e., the `Corporate` class) generated from the protobuf schema. We will explain further below how you can generate this class using the protobuf compiler. The compiler creates a `Corporate` class with the fields defined in the schema. The crawler fills the object's fields with the data extracted from the website. It then serializes the `Corporate` object to bytes so that Kafka can read it and produces it to the `corporate-events` topic. After that, it increments the `rb_id` value and sends another GET request. This process continues until the end of the announcements is reached, at which point the crawler stops automatically.
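To make the crawler's loop concrete, here is a minimal sketch of the idea described above. The Kafka client (`confluent_kafka`), the broker address, the `Corporate` field names, and the stop condition are assumptions for illustration; the actual implementation in `rb_crawler/main.py` may differ.

```python
import requests
from confluent_kafka import Producer  # assumed Kafka client

from corporate_pb2 import Corporate  # generated from the protobuf schema (see below)

BASE_URL = "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php"
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address


def crawl(rb_id: int, state: str) -> None:
    while True:
        response = requests.get(BASE_URL, params={"rb_id": rb_id, "land_abk": state})
        if not response.text.strip():
            break  # assumed stop condition: no more announcements

        corporate = Corporate()   # generated model class
        corporate.rb_id = rb_id   # field names are assumptions
        corporate.state = state
        corporate.information = response.text

        producer.produce(
            "corporate-events",
            key=f"{state}_{rb_id}",               # e.g. "rp_56267"
            value=corporate.SerializeToString(),  # protobuf bytes
        )
        rb_id += 1  # move on to the next announcement

    producer.flush()
```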
The `corporate-events` topic holds all the events (announcements) produced by the `rb_crawler`. Each message in a Kafka topic consists of a key and a value.

The key type of this topic is `String`. The key is generated by the `rb_crawler` and is a combination of the `land_abk` and the `rb_id`. If we consider the `rb_id` and `land_abk` from the example above, the key will look like this: `rp_56267`.

The value of the message contains more information, such as `event_name`, `event_date`, and more. Therefore, the value type is complex and needs a schema definition.
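For illustration, a consumer could read a message back and decode both parts as sketched below, assuming the value is plain protobuf bytes without Schema Registry framing; the `confluent_kafka` client and the consumer group name are assumptions as well.

```python
from confluent_kafka import Consumer

from corporate_pb2 import Corporate  # generated protobuf class (see below)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "corporate-events-reader",  # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["corporate-events"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    key = msg.key().decode("utf-8")  # e.g. "rp_56267" (land_abk + "_" + rb_id)
    corporate = Corporate()
    corporate.ParseFromString(msg.value())  # deserialize the protobuf value
    print(key, corporate)

consumer.close()
```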
Kafka Connect is a tool for moving large data sets into (source) and out of (sink) Kafka. Here we only use a sink connector, which consumes data from a Kafka topic and writes it into a secondary index such as Elasticsearch. We use the Elasticsearch Sink Connector to move the data from the `corporate-events` topic into Elasticsearch.
This project uses Poetry as a build tool. To install all the dependencies, just run `poetry install`.
This project uses Protobuf for serializing and deserializing objects. We provided a simple protobuf schema. Furthermore, you need to generate the Python code for the model class from the proto file. To do so, run the `generate-proto.sh` script. This script uses the Protobuf compiler (protoc) to generate the model class under the `build/gen/bakdata/corporate/v1` folder with the name `corporate_pb2.py`.
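As a quick sanity check after generating the code, you can import the module and inspect the `Corporate` class. The `sys.path` tweak below assumes the default output folder mentioned above; the exact import path may differ depending on the package declared in the proto file.

```python
import sys

# The generated module lives under build/gen/... (folder taken from the step above).
sys.path.append("build/gen/bakdata/corporate/v1")

from corporate_pb2 import Corporate  # module generated by protoc

corporate = Corporate()
# Print the field names defined in the protobuf schema.
print(list(Corporate.DESCRIPTOR.fields_by_name))
```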
Use `docker-compose up -d` to start all the services: Zookeeper, Kafka, Schema Registry, Kafka REST Proxy, Kowl, Kafka Connect, and Elasticsearch. Depending on your system, it takes a couple of minutes before the services are up and running. You can use a tool like lazydocker to check the status of the services.
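If you prefer a scripted check over lazydocker, a small probe like the sketch below can poll the HTTP endpoints until they respond. The Kowl (8080) and Elasticsearch (9200) ports are the ones used elsewhere in this README; the Kafka Connect port (8083) is an assumed default.

```python
import time

import requests

# Endpoints to poll; the Kafka Connect port is an assumed default.
SERVICES = {
    "Kowl": "http://localhost:8080",
    "Elasticsearch": "http://localhost:9200",
    "Kafka Connect": "http://localhost:8083",
}

for name, url in SERVICES.items():
    for _ in range(30):  # retry each service for up to ~5 minutes
        try:
            requests.get(url, timeout=5)
            print(f"{name} is reachable at {url}")
            break
        except requests.ConnectionError:
            time.sleep(10)
    else:
        print(f"{name} did not come up at {url}")
```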
After all the services are up and running, you need to configure Kafka Connect to use the Elasticsearch sink connector. The config file is a JSON-formatted file. We provided a basic configuration file. You can find more information about the configuration properties on the official documentation page.

To start the connector, you need to push the JSON config file to Kafka Connect. You can either use the UI dashboard in Kowl or use the bash script provided. A connector can be removed again either through Kowl's UI dashboard or by calling the deletion API via the provided bash script.
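For illustration, the sketch below registers a connector through Kafka Connect's REST API. The connector name, the port 8083, and the config values are assumptions and may differ from the provided config file and bash script; in particular, the real config likely also sets key/value converters.

```python
import requests

# Hypothetical connector registration; adjust the values to match the provided config file.
connector = {
    "name": "elasticsearch-sink",  # assumed connector name
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "corporate-events",
        "connection.url": "http://elasticsearch:9200",  # assumed Elasticsearch address inside docker-compose
    },
}

# Kafka Connect's REST API usually listens on port 8083 (assumed here).
response = requests.post("http://localhost:8083/connectors", json=connector)
print(response.status_code, response.json())

# A connector can later be removed via the deletion endpoint, e.g.:
# requests.delete("http://localhost:8083/connectors/elasticsearch-sink")
```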
You can start the crawler with the command below:
```bash
poetry run python rb_crawler/main.py --id $RB_ID --state $STATE
```
The `--id` option is an integer, which determines the initial event in the Handelsregisterbekanntmachungen to be crawled.

The `--state` option takes a string (only the ones listed above). This string defines the state from which the crawler should start.

You can use the `--help` option to see the usage:
```
Usage: main.py [OPTIONS]

Options:
  -i, --id INTEGER                The rb_id to initialize the crawl from
  -s, --state [bw|by|be|br|hb|hh|he|mv|ni|nw|rp|sl|sn|st|sh|th]
                                  The state ISO code
  --help                          Show this message and exit.
```
Kowl is a web application that helps you manage and debug your Kafka workloads effortlessly. You can create, update, and delete Kafka resources like Topics and Kafka Connect configs. You can see Kowl's dashboard in your browser under https://localhost:8080.
To query the data from Elasticsearch, you can use Elasticsearch's query DSL. For example:
```bash
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      <field>
    }
  }
}
'
```
Here, `<field>` is the field (and value) you wish to search for, for example: `"reference_id": "HRB 41865"`.
You can stop and remove all the resources by running:
```bash
docker-compose down
```
Our crawler for the second dataset uses the same logic, principles, and file structure as the one for the Handelsregister data. Therefore, the same command structure can be used:
```bash
poetry run python spiegel_crawler/main.py --date 2022-05-24
```
As the Spiegel website has a different structure, the parameter needs to be a `date`. The crawler then gathers all articles from spiegel.de, starting from the given date and moving backwards in time, until you cancel it.
In contrast to the Handelsregister crawler, we use Scrapy to crawl the Spiegel website. Spiegel has an overview page listing all articles of a given date (https://www.spiegel.de/nachrichtenarchiv/artikel-24.05.2022.html), from which we can extract the article links. Afterwards, all of these articles are crawled, then the overview page for the day before, and so on.
On the overview page and the article pages, we can extract the information based on the HTML structure, using the `<article>` tag and the given CSS classes; a sketch of such a spider follows below.
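The sketch below illustrates this crawling pattern with Scrapy. The spider name, CSS selectors, and extracted fields are assumptions for illustration and will differ from the actual `spiegel_crawler` implementation.

```python
from datetime import date, timedelta

import scrapy


class SpiegelArchiveSpider(scrapy.Spider):
    """Illustrative spider that walks the Spiegel archive backwards in time."""

    name = "spiegel_archive"  # hypothetical spider name

    def __init__(self, start_date: str = "2022-05-24", **kwargs):
        super().__init__(**kwargs)
        self.start_date = date.fromisoformat(start_date)

    def start_requests(self):
        yield scrapy.Request(
            self.archive_url(self.start_date),
            callback=self.parse_overview,
            cb_kwargs={"day": self.start_date},
        )

    @staticmethod
    def archive_url(day: date) -> str:
        # Overview page listing all articles published on the given day.
        return f"https://www.spiegel.de/nachrichtenarchiv/artikel-{day.strftime('%d.%m.%Y')}.html"

    def parse_overview(self, response, day: date):
        # The <article> tag is from the description above; the CSS classes/selectors are assumptions.
        for link in response.css("article a::attr(href)").getall():
            yield response.follow(link, callback=self.parse_article)

        # Continue with the overview page of the previous day.
        previous_day = day - timedelta(days=1)
        yield scrapy.Request(
            self.archive_url(previous_day),
            callback=self.parse_overview,
            cb_kwargs={"day": previous_day},
        )

    def parse_article(self, response):
        yield {
            "url": response.url,
            "headline": response.css("article h2 ::text").get(),  # selector is an assumption
        }
```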
We also built a protobuf schema for inserting the data into the Kafka topic, which you can find in the `proto/bakdata/articles/article.proto` file. It is independent of the Spiegel website and can also be used for other news sites in the future.