Run-level Graph #2772

wslulciuc · 2024-03-15T19:17:13Z

Run-level Graph

A run-level graph represents the relationships between dataset and run metadata. It is a directed graph with three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and versioned outputs as edges. Each run node also has an edge to a job version node, which represents the version of the job (that is, a link to its source code) at the time of execution.

Figure 1: Run-level graph relationships between dataset versions, job versions, and runs.

Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, the run must transition from a RUNNING state to a COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run-to-run.
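As a minimal sketch of the success rule above, the check below treats a run as successful only if its state history contains a RUNNING state immediately followed by COMPLETED. The `RunState` enum, the `FAILED` member, and the `is_successful` helper are assumptions made for illustration; only RUNNING and COMPLETED are named in this proposal.

```python
from enum import Enum


class RunState(Enum):
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"  # assumption: not named in the proposal, used only for contrast


def is_successful(history: list[RunState]) -> bool:
    """Return True if the history contains RUNNING directly followed by COMPLETED."""
    return any(
        prev is RunState.RUNNING and curr is RunState.COMPLETED
        for prev, curr in zip(history, history[1:])
    )


ok_run = [RunState.RUNNING, RunState.COMPLETED]
bad_run = [RunState.RUNNING, RunState.FAILED]
```

Under this sketch, only `ok_run` would be expected to produce new output dataset versions in the graph.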

Introduction

A run-level graph is fundamental in troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.

It is often both challenging and time-consuming to determine why a given job is failing. With the run-level graph, you can inspect the upstream lineage of the failing job, which simplifies troubleshooting by highlighting, for example, that the data type of a column is now a STRING upstream while the failing job still processes the column as an INT downstream.
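To make "observing the upstream lineage" concrete, here is a rough sketch of an upstream walk. It assumes the graph is held as a list of `(source, label, target)` triples and that lineage follows `TO` edges backwards; the `upstream` function and every node ID in the sample data are hypothetical, not part of any existing API.

```python
from collections import deque


def upstream(edges, start):
    """Return every node reachable from `start` by walking TO edges backwards."""
    # Index the direct parents of each node (only lineage edges, labeled TO).
    parents = {}
    for source, label, target in edges:
        if label == "TO":
            parents.setdefault(target, set()).add(source)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical lineage: run r1 writes orders#v2, which the failing run r2 reads.
edges = [
    ("run:r1", "TO", "dataset:ns:orders#v2"),
    ("dataset:ns:orders#v2", "TO", "run:r2"),
    ("run:r2", "TO", "dataset:ns:report#v5"),
    ("run:r2", "IS_VERSION_OF", "job:ns:report_job#j1"),
]
lineage = upstream(edges, "run:r2")
```

Here `lineage` contains the dataset version that `run:r2` consumed and the run that produced it, which is where a schema change such as INT to STRING would be looked for.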

Graph Data Model

A run-level graph consists of the following nodes:

Dataset Version: A read-only immutable version of a dataset.
Job Version: A read-only immutable version of a job, with a unique referenceable link to code preserving the reproducibility of builds from source.
Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.

Nodes

Dataset Version
ID	`dataset:{namespace}:{dataset}#{version}`
Example	`dataset:food_delivery:public.top_delivery_times#947c0388..`

Job Version
ID	`job:{namespace}:{job}#{version}`
Example	`job:food_delivery:orders_popular_day_of_week#947c0388..`

Run
ID	`run:{id}`
Example	`run:a03422cf..`
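The ID formats above could be produced and inspected with a few small helpers. This is only a sketch: the function names are made up for illustration, and the version and run values are the truncated prefixes shown in the examples, not full identifiers.

```python
def dataset_version_id(namespace: str, dataset: str, version: str) -> str:
    """Build a dataset version node ID: dataset:{namespace}:{dataset}#{version}."""
    return f"dataset:{namespace}:{dataset}#{version}"


def job_version_id(namespace: str, job: str, version: str) -> str:
    """Build a job version node ID: job:{namespace}:{job}#{version}."""
    return f"job:{namespace}:{job}#{version}"


def run_node_id(run_id: str) -> str:
    """Build a run node ID: run:{id}."""
    return f"run:{run_id}"


def node_type(node_id: str) -> str:
    """Return the node type prefix ('dataset', 'job', or 'run') of a node ID."""
    return node_id.split(":", 1)[0]


ds = dataset_version_id("food_delivery", "public.top_delivery_times", "947c0388")
job = job_version_id("food_delivery", "orders_popular_day_of_week", "947c0388")
run = run_node_id("a03422cf")
```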

Edges

{ dataset:*, TO, run:* }
{ run:*, TO, dataset:* }
{ run:*, IS_VERSION_OF, job:* }
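One way to read the three edge patterns is as a whitelist of `(source type, label, target type)` combinations. The sketch below validates edges against that whitelist; the `Edge` tuple, `ALLOWED_EDGES`, and `make_edge` are illustrative names, not an existing API, and the node IDs are hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    label: str
    target: str


# The three edge patterns listed above, keyed by node type prefix.
ALLOWED_EDGES = {
    ("dataset", "TO", "run"),          # a dataset version is an input of a run
    ("run", "TO", "dataset"),          # a run produces a dataset version
    ("run", "IS_VERSION_OF", "job"),   # a run executed a given job version
}


def make_edge(source: str, label: str, target: str) -> Edge:
    """Create an edge, rejecting combinations the model does not define."""
    kind = (source.split(":", 1)[0], label, target.split(":", 1)[0])
    if kind not in ALLOWED_EDGES:
        raise ValueError(f"unsupported edge: {kind}")
    return Edge(source, label, target)


# Hypothetical node IDs following the formats from the Nodes section.
edge = make_edge("run:r1", "IS_VERSION_OF", "job:ns:some_job#v1")
```

Rejecting anything outside the whitelist keeps, for example, a direct dataset-to-dataset edge out of the graph, so every dataset change stays attributable to a run.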

Example

Run `a03422cf`

First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:

Figure 2: Run-level graph after run a03422cf.

Run `ec6abf8b`

Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:

Figure 3: Run-level graph after run ec6abf8b.

Run diff from `a03422cf` to `ec6abf8b`

A diff graph represents the changes between two run nodes of a run-level graph. The graph compares changes starting at a given run node A, up to a given run node B (inclusive). Below we show a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:

Figure 4: Diff from a03422cf to ec6abf8b.
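A simplified way to compute such a diff is to compare the input and output dataset versions of the two runs. The sketch below only compares the two endpoint runs, not any runs in between, and it assumes edges are stored as `(source, label, target)` triples. The dataset names `input` and `output` are placeholders (the example above gives only version hashes), and the first run's output version is taken as stated in the text.

```python
def run_io(edges, run):
    """Return (inputs, outputs): dataset versions linked to `run` by TO edges."""
    inputs = {s for s, label, t in edges if label == "TO" and t == run}
    outputs = {t for s, label, t in edges if label == "TO" and s == run}
    return inputs, outputs


def diff_runs(edges, run_a, run_b):
    """Compare the dataset versions read and written by two runs."""
    in_a, out_a = run_io(edges, run_a)
    in_b, out_b = run_io(edges, run_b)
    return {
        "inputs_unchanged": in_a & in_b,
        "inputs_removed": in_a - in_b,
        "inputs_added": in_b - in_a,
        "outputs_removed": out_a - out_b,
        "outputs_added": out_b - out_a,
    }


# Placeholder dataset names; version hashes come from the example above.
INPUT = "dataset:food_delivery:input#695888e2"
edges = [
    (INPUT, "TO", "run:a03422cf"),
    ("run:a03422cf", "TO", "dataset:food_delivery:output#a03422cf"),
    (INPUT, "TO", "run:ec6abf8b"),
    ("run:ec6abf8b", "TO", "dataset:food_delivery:output#ec44fed4"),
]
diff = diff_runs(edges, "run:a03422cf", "run:ec6abf8b")
```

For the two runs above, the input version is reported as unchanged, and the output version `ec44fed4` replaces the earlier one.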

zqqqqz2000 · 2024-03-25T15:31:23Z

Very nice feature, looking forward to its completion. Is there currently a schedule for completion?

wslulciuc added the `api` (API layer changes) label on Mar 15, 2024

wslulciuc added this to the Roadmap milestone Mar 15, 2024
