The project aims to develop an automated system capable of grading GitHub repositories and transforming various data types into actionable insights. The system leverages modern stream processing frameworks, microservices, and local large language models (LLMs) to ensure scalability, efficiency, and cost-effectiveness.
- Fetch commits from GitHub repositories
- Process commit messages to generate summaries
- Store results in a vector database
- Kafka integration for message streaming
- Configurable to use different chat models (OpenAI, or a fake model for testing); see the sketch following these lists
- Kafka KRaft instance for broker management
- Kafka UI for cluster management
- REST Proxy for interacting with Kafka topics via REST API
- Schema Registry for managing Kafka message schemas
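The chat-model switch mentioned above might look something like the sketch below. This is a hedged illustration assuming LangChain's ChatOpenAI and FakeListLLM; the actual utils/model_utils.py may differ.

```python
# Hypothetical provider switch (not necessarily the project's utils/model_utils.py),
# assuming LangChain's OpenAI chat model and its fake LLM for offline testing.
from langchain_openai import ChatOpenAI
from langchain_community.llms.fake import FakeListLLM


def get_chat_model(provider: str):
    if provider == "openai":
        # Reads OPENAI_API_KEY from the environment.
        return ChatOpenAI(model="gpt-3.5-turbo")
    # Canned responses keep tests deterministic and free.
    return FakeListLLM(responses=["This commit refactors the Kafka consumer."])
```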
my_project/
├── config/
│ ├── __init__.py
│ ├── config_setting.py
├── models/
│ ├── __init__.py
│ ├── commit.py
├── services/
│ ├── __init__.py
│ ├── github_service.py
│ ├── message_processing_service.py
│ ├── vectordb_service.py
├── utils/
│ ├── __init__.py
│ ├── kafka_utils.py
│ ├── model_utils.py
│ ├── get_qdrant.py
│ ├── setup_logging.py
├── dataflows/
│ ├── __init__.py
│ ├── github_commit_processing.py
│ ├── commit_summary_service.py
│ ├── add_qdrant_service.py
├── tests/
│ ├── __init__.py
│ ├── test_github_service.py
│ ├── test_message_processing_service.py
│ ├── test_dataflows.py
└── requirements.txt
- Clone the repository:
  git clone https://github.com/jgwentworth92/GutenbergV2.git
  cd GutenbergV2
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the dependencies:
  pip install -r requirements.txt
- Set up your environment variables: create a .env file in the root directory and add the required variables:
  GITHUB_TOKEN=your_github_token
  BROKERS="localhost:9092"
  INPUT_TOPIC=your_input_topic
  OUTPUT_TOPIC=your_output_topic
  PROCESSED_TOPIC=your_processed_topic
  VECTORDB_TOPIC_NAME=your_vectordb_topic_name
  CONSUMER_CONFIG={"bootstrap.servers": "kafka_b:9094", "auto.offset.reset": "earliest", "group.id": "consumer_group", "enable.auto.commit": "True"}
  PRODUCER_CONFIG={"bootstrap.servers": "kafka_b:9094"}
  OPENAI_API_KEY=your_openai_api_key
  MODEL_PROVIDER=openai
  TEMPLATE=your_template_string
The application configuration is managed using Pydantic settings. Modify the config/config_setting.py file to update the configuration.
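As a rough illustration, config/config_setting.py might resemble the sketch below; the field names simply mirror the environment variables listed above, and the pydantic-settings v2 API is an assumption.

```python
# Hypothetical sketch of config/config_setting.py, assuming pydantic-settings v2;
# fields mirror the environment variables listed above.
from pydantic_settings import BaseSettings, SettingsConfigDict


class ConfigSetting(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    GITHUB_TOKEN: str
    BROKERS: str = "localhost:9092"
    INPUT_TOPIC: str
    OUTPUT_TOPIC: str
    PROCESSED_TOPIC: str
    VECTORDB_TOPIC_NAME: str
    OPENAI_API_KEY: str = ""
    MODEL_PROVIDER: str = "openai"
    TEMPLATE: str = ""


config = ConfigSetting()
```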
To start all services, navigate to the root directory of your project, where the docker-compose.yml file is located, and run:
docker-compose up --build
This builds and starts all the services defined in docker-compose.yml.
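Once the stack is up, you can sanity-check the REST Proxy with a plain HTTP call. The port here is an assumption about this docker-compose.yml; Confluent's REST Proxy defaults to 8082:
curl http://localhost:8082/topics
This returns a JSON list of the topics the proxy can see.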
To run the dataflows, use the following command format, replacing (filename) with the filename of the dataflow script without the .py extension:
python -m bytewax.run -w3 dataflows.(filename)
For example, to run the GitHub commit processing dataflow:
python -m bytewax.run -w3 dataflows.github_commit_processing
To run the commit summary service dataflow:
python -m bytewax.run -w3 dataflows.commit_summary_service
And to run the add-to-Qdrant service dataflow:
python -m bytewax.run -w3 dataflows.add_qdrant_service
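For orientation, a stripped-down dataflow in this style might look like the following. This is a hedged sketch assuming the Bytewax 0.17-era method API and its Kafka connectors, with placeholder topic names; it is not the project's actual code.

```python
# Hypothetical minimal dataflow (not the project's actual code), assuming the
# Bytewax 0.17-era API: KafkaSource emits (key, value) byte pairs.
import json

from bytewax.connectors.kafka import KafkaSink, KafkaSource
from bytewax.dataflow import Dataflow

BROKERS = ["kafka_b:9094"]

flow = Dataflow()
flow.input("commits_in", KafkaSource(BROKERS, ["your_input_topic"]))


def summarize(key_value):
    key, value = key_value
    commit = json.loads(value)
    # Placeholder for the LLM call that produces the real summary.
    summary = {"summary": commit.get("message", "")[:100]}
    return key, json.dumps(summary).encode("utf-8")


flow.map(summarize)
flow.output("summaries_out", KafkaSink(BROKERS, "your_processed_topic"))
```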
Before creating recovery partitions, ensure that the necessary directories exist. If not, create them:
mkdir -p recovery/github_listener
mkdir -p recovery/commit_summary_service
mkdir -p recovery/add_qdrant_service
To set up recovery partitions for each microservice, run the following commands. This ensures that Bytewax can recover from failures and continue processing.
- GitHub Commit Processing recovery partition:
  python -m bytewax.recovery recovery/github_listener 4
- Commit Summary Service recovery partition:
  python -m bytewax.recovery recovery/commit_summary_service 4
- Add to Qdrant Service recovery partition:
  python -m bytewax.recovery recovery/add_qdrant_service 4
These commands should be run from the root directory of your project.
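Once the partitions exist, point the worker at them when launching a dataflow. The exact flag depends on your Bytewax version; on 0.17-era releases it is -r/--recovery-directory, so the command would look something like:
python -m bytewax.run -w3 -r recovery/github_listener dataflows.github_commit_processing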
To run the tests, use the pytest framework:
pytest .
The tests are located in the tests/ directory and cover the GitHub service, message processing service, and dataflows.
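For a flavor of what such a test might look like, here is a hedged sketch; the function name fetch_commits, the PyGithub-based import path, and the return shape are assumptions, not the project's actual interface.

```python
# Hypothetical test sketch; fetch_commits and the PyGithub import path are
# assumed, not taken from the project's actual services/github_service.py.
from unittest.mock import MagicMock, patch


@patch("services.github_service.Github")
def test_fetch_commits_returns_messages(mock_github):
    mock_commit = MagicMock()
    mock_commit.commit.message = "fix: handle empty payloads"
    mock_repo = MagicMock()
    mock_repo.get_commits.return_value = [mock_commit]
    mock_github.return_value.get_repo.return_value = mock_repo

    from services.github_service import fetch_commits  # assumed helper

    messages = [c.commit.message for c in fetch_commits("owner/repo")]
    assert messages == ["fix: handle empty payloads"]
```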
Kafka UI is included for cluster management, providing a web interface to manage and monitor Kafka clusters. It can be accessed at http://localhost:8080/. When adding a cluster, use kafka_b as the host and 9094 as the port.
Qdrant Web UI is included to manage the vector database. It can be accessed at http://localhost:6333/dashboard#/collections.
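To sanity-check the vector database from code rather than the dashboard, the official qdrant-client package can list collections. The URL matches the dashboard above; having qdrant-client installed is an assumption about your environment.

```python
# Quick sanity check with the qdrant-client package (pip install qdrant-client).
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
# Prints the collections the dataflows have written to.
print(client.get_collections())
```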