K2BQ - Dataflow Kafka to BigQuery

Overview

K2BQ is a Dataflow pipeline designed to stream data from Kafka to Google BigQuery. It provides a scalable, efficient way to move large volumes of data from Kafka topics into BigQuery for real-time analytics, leveraging Google Cloud's managed Kafka service and Dataflow for seamless data integration, processing, and storage.

Features

  • Kafka Integration: Collect data from Kafka topics for real-time processing.
  • BigQuery Integration: Directly stream the processed data into BigQuery for fast analytics.
  • Scalable Architecture: Leverages Google Cloud's managed services for scaling based on demand.
  • Automated Data Pipeline: Automates the data transfer process from Kafka to BigQuery with minimal intervention.

Components

The pipeline consists of the following components:

  1. Google Cloud Managed Kafka Cluster: Used to store and stream data through Kafka topics.
  2. Google Dataflow: Used to process data from Kafka and load it into BigQuery.
  3. BigQuery: The final destination where the processed data is stored for analysis.

Installation

Prerequisites

  • Google Cloud Platform (GCP) account.
  • Terraform installed on your local machine.
  • Google Cloud SDK configured with the necessary permissions.
  • Access to Google BigQuery and Kafka services.

Setting Up the Project

  1. Clone the repository:

    git clone https://github.com/Pirate-Emperor/K2BQ.git
    cd K2BQ
  2. Configure Terraform:

    Ensure that your main.tf, variable.tf, and any other required Terraform files are set up as described in the following sections.

  3. Install required Terraform providers:

    In the root directory, run the following command to initialize Terraform providers:

    terraform init
  4. Apply the Terraform configuration:

    Run the following command to create the resources in your Google Cloud project:

    terraform apply

    This command creates the Kafka cluster, Kafka topics, and the Dataflow pipeline. Review the planned changes and confirm when prompted.

Terraform Configuration Files

main.tf

This Terraform configuration sets up the Google Cloud resources required for the Kafka-to-BigQuery data pipeline.

provider "google-beta" {
  project = data.google_project.project.project_id
  region  = "us-central1"
}

data "google_project" "project" {}

module "kafka_cluster" {
  source = "./module/df_resource"  # Update with the actual path to your module
  cluster_id         = "dataops-kafka"
  region             = "us-central1"
  vcpu_count         = 4
  memory_bytes       = 4294967296
  subnet             = "projects/valid-verbena-437709-h5/regions/us-central1/subnetworks/default"
  topic_id           = "dataops-kafka-topic"
  partition_count    = 3
  replication_factor = 3
  cleanup_policy     = "compact"
}
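
# Note: the resource blocks below consume the variables declared in
# variable.tf; in the repository they may equally live inside the
# ./module/df_resource module referenced above.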

resource "google_managed_kafka_cluster" "cluster" {
  cluster_id = var.cluster_id
  location   = var.region

  capacity_config {
    vcpu_count    = var.vcpu_count
    memory_bytes  = var.memory_bytes
  }

  gcp_config {
    access_config {
      network_configs {
        subnet = var.subnet
      }
    }
  }
}

resource "google_managed_kafka_topic" "example" {
  topic_id          = var.topic_id
  cluster           = google_managed_kafka_cluster.cluster.cluster_id
  location          = var.region
  partition_count   = var.partition_count
  replication_factor = var.replication_factor
  configs = {
    "cleanup.policy" = var.cleanup_policy
  }
}
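
Note that a provider requirements block is not shown above. If terraform init cannot resolve the google-beta provider, a minimal versions.tf along the following lines can be added (the file name and version constraint are assumptions; pick a release recent enough to include the managed Kafka resources):

terraform {
  required_providers {
    # google-beta is required for the managed Kafka resources used above.
    google-beta = {
      source  = "hashicorp/google-beta"
      version = ">= 6.0" # assumption: any release that includes google_managed_kafka_*
    }
  }
}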

variable.tf

This file defines the input variables used by the Kafka cluster and topic configuration.

variable "cluster_id" {
  description = "The ID of the Kafka cluster."
  type        = string
  default     = "my-cluster"
}

variable "region" {
  description = "The region to deploy the Kafka cluster."
  type        = string
  default     = "us-central1"
}

variable "vcpu_count" {
  description = "Number of vCPUs for Kafka cluster capacity."
  type        = number
  default     = 3
}

variable "memory_bytes" {
  description = "Memory in bytes for Kafka cluster capacity."
  type        = number
  default     = 3221225472
}

variable "subnet" {
  description = "Subnetwork to attach the Kafka cluster to."
  type        = string
  default     = "projects/valid-verbena-437709-h5/regions/us-central1/subnetworks/default"
}

variable "topic_id" {
  description = "The ID of the Kafka topic."
  type        = string
  default     = "example-topic"
}

variable "partition_count" {
  description = "Number of partitions for the Kafka topic."
  type        = number
  default     = 2
}

variable "replication_factor" {
  description = "Replication factor for the Kafka topic."
  type        = number
  default     = 3
}

variable "cleanup_policy" {
  description = "Cleanup policy for the Kafka topic."
  type        = string
  default     = "compact"
}
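
Note that these defaults differ from the values passed to the kafka_cluster module in main.tf (for example, 3 vCPUs / 3 GiB versus 4 vCPUs / 4 GiB). To override the defaults without editing the .tf files, a terraform.tfvars file can be placed in the root directory; the values below are illustrative only:

# terraform.tfvars -- illustrative overrides only
cluster_id      = "dataops-kafka"
topic_id        = "dataops-kafka-topic"
vcpu_count      = 4
memory_bytes    = 4294967296 # 4 GiB
partition_count = 3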

Dataflow Pipeline

The Dataflow pipeline reads data from the Kafka topic and writes it to BigQuery, enabling continuous streaming of data for analysis and reporting. A Terraform sketch of such a job follows the steps below.

Key steps:

  1. Reading Data: Data is consumed from the Kafka topic.
  2. Processing Data: Data is processed and transformed as required (e.g., filtering, aggregation).
  3. Writing to BigQuery: After processing, the data is written to BigQuery tables for querying and analysis.
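
The Dataflow job itself is also created by Terraform (see the terraform apply step above), though its definition is not reproduced in this README. As a rough sketch of what such a job can look like, the google-beta provider's google_dataflow_flex_template_job resource can launch Google's public Kafka-to-BigQuery flex template; the template path, parameter names, and addresses below are assumptions and placeholders, so check them against the template documentation for your region and version:

# Sketch only -- not the repository's actual job definition.
# Launches Google's Kafka-to-BigQuery Dataflow flex template; template path
# and parameter names vary by version and should be verified.
resource "google_dataflow_flex_template_job" "kafka_to_bq" {
  provider                = google-beta
  name                    = "k2bq-streaming-job"
  region                  = "us-central1"
  container_spec_gcs_path = "gs://dataflow-templates-us-central1/latest/flex/Kafka_to_BigQuery"

  parameters = {
    # Placeholder bootstrap address and the topic created above.
    readBootstrapServers = "BOOTSTRAP_SERVER:9092"
    kafkaReadTopics      = "dataops-kafka-topic"
    # Placeholder BigQuery destination (project:dataset.table).
    outputTableSpec      = "PROJECT_ID:DATASET.TABLE"
  }
}

With a resource like this alongside the Kafka configuration, a single terraform apply provisions both the Kafka resources and the streaming job.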

Conclusion

The K2BQ project simplifies the integration between Kafka and BigQuery, offering a robust solution for real-time data processing. With Terraform managing the infrastructure setup and Dataflow handling the pipeline, this solution provides scalability, performance, and ease of maintenance for your data streaming needs.

License

This project is licensed under the Pirate-Emperor License. See the LICENSE file for details.

Author

Pirate-Emperor

Twitter · Discord · LinkedIn · Reddit · Medium

