About this integration
Using this integration, you can manage and interact with Google Cloud Platform's Dataproc service directly from Dagster. Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters. This integration allows you to create, manage, and delete Dataproc clusters, as well as submit and monitor jobs on these clusters.
Installation
```bash
pip install dagster-gcp
```
Examples
```python
from dagster import asset, Definitions
from dagster_gcp import DataprocResource

# Configure the resource with your project, region, and cluster details.
dataproc_resource = DataprocResource(
    project_id="your-gcp-project-id",
    region="your-gcp-region",
    cluster_name="your-cluster-name",
    cluster_config_yaml_path="path/to/your/cluster/config.yaml",
)


@asset
def my_dataproc_asset(dataproc: DataprocResource):
    # Build a minimal job spec targeting the configured cluster and submit it.
    client = dataproc.get_client()
    job_details = {
        "job": {
            "placement": {"clusterName": dataproc.cluster_name},
        }
    }
    client.submit_job(job_details)


defs = Definitions(
    assets=[my_dataproc_asset], resources={"dataproc": dataproc_resource}
)
```
About Google Cloud Platform Dataproc
Google Cloud Platform's Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Hadoop, and other open-source data processing frameworks. Dataproc simplifies the process of setting up and managing clusters, allowing you to focus on your data processing tasks without worrying about the underlying infrastructure. With Dataproc, you can quickly create clusters, submit jobs, and monitor their progress, all while benefiting from the scalability and reliability of Google Cloud Platform.