Run-level Graph #2772

Open
wslulciuc opened this issue Mar 15, 2024 · 1 comment
Labels
api API layer changes
Milestone

Comments

@wslulciuc
Member

Run-level Graph

A run-level graph represents the relationships between dataset and run metadata. The graph is directed and consists of three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and versioned outputs as edges. An edge from a run node to a job version node is also maintained; it represents the version of the job (that is, a link to the source code) at the time of execution.

run-node

Figure 1: Run-level graph relationships between dataset versions, job versions, and runs.

Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, it must transition from a RUNNING state to a COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run to run.

Introduction

A run-level graph is fundamental to troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.

Determining why a given job is failing is often both challenging and time-consuming. Using the run-level graph, you can observe the upstream lineage of the failing job, which simplifies troubleshooting. For example, the graph can highlight that the data type of a column is now a STRING upstream, while the failing job still processes the column as an INT downstream.
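The upstream walk described above can be sketched as a breadth-first traversal over reversed edges. This is a minimal sketch, not the project's implementation: the `upstream` function, the tuple-based edge representation, and every node ID in the example are hypothetical. Only the `TO` edge label and the general shape of the IDs come from this proposal.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical edge representation: (source node ID, label, target node ID).
Edge = Tuple[str, str, str]


def upstream(edges: List[Edge], start: str) -> List[str]:
    """Return every node reachable by walking TO edges backwards from start."""
    # Reverse adjacency: for each node, the nodes that point at it via TO.
    parents: Dict[str, List[str]] = {}
    for source, label, target in edges:
        if label == "TO":
            parents.setdefault(target, []).append(source)

    seen: Set[str] = set()
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage: an upstream run rewrites a dataset that a failing run reads.
example_edges: List[Edge] = [
    ("dataset:ns:raw#v1", "TO", "run:upstream"),
    ("run:upstream", "TO", "dataset:ns:orders#v2"),
    ("dataset:ns:orders#v2", "TO", "run:failing"),
    ("run:failing", "IS_VERSION_OF", "job:ns:report#v7"),
]

lineage = upstream(example_edges, "run:failing")
```

Each dataset version in the returned list is a candidate for inspection, for instance to compare its column types against what the failing job expects.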

Graph Data Model

A run-level graph consists of the following nodes:

  • Dataset Version: A read-only immutable version of a dataset.
  • Job Version: A read-only immutable version of a job, with a unique referenceable link to code preserving the reproducibility of builds from source.
  • Run: A discrete instantiation of a job version, with a unique run ID used to update each stage of execution.

Nodes

Dataset Version
ID: dataset:{namespace}:{dataset}#{version}
Example: dataset:food_delivery:public.top_delivery_times#947c0388..

Job Version
ID: job:{namespace}:{job}#{version}
Example: job:food_delivery:orders_popular_day_of_week#947c0388..

Run
ID: run:{id}
Example: run:a03422cf..
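To illustrate the ID formats above, here is a minimal sketch of parsing and formatting node IDs. The `NodeId` type and both helper functions are hypothetical names, not part of any existing API. The sketch also assumes that namespaces contain no `:` and that versions contain no `#`, which the proposal does not state. The example strings are copied from the proposal as-is, including the trailing `..` that marks a truncated version.

```python
from typing import NamedTuple, Optional


class NodeId(NamedTuple):
    kind: str                 # "dataset", "job", or "run"
    namespace: Optional[str]  # None for run nodes
    name: str                 # dataset name, job name, or run ID
    version: Optional[str]    # None for run nodes


def parse_node_id(raw: str) -> NodeId:
    """Split a node ID string into its parts, raising ValueError if malformed."""
    kind, _, rest = raw.partition(":")
    if kind == "run":
        if not rest:
            raise ValueError(f"missing run ID: {raw!r}")
        return NodeId("run", None, rest, None)
    if kind in ("dataset", "job"):
        # The version follows the last '#'; the namespace precedes the first ':'.
        qualified, hash_sep, version = rest.rpartition("#")
        namespace, colon_sep, name = qualified.partition(":")
        if not (hash_sep and colon_sep and namespace and name and version):
            raise ValueError(f"malformed {kind} ID: {raw!r}")
        return NodeId(kind, namespace, name, version)
    raise ValueError(f"unknown node kind: {raw!r}")


def format_node_id(node: NodeId) -> str:
    """Inverse of parse_node_id."""
    if node.kind == "run":
        return f"run:{node.name}"
    return f"{node.kind}:{node.namespace}:{node.name}#{node.version}"


dataset_id = parse_node_id("dataset:food_delivery:public.top_delivery_times#947c0388..")
run_id = parse_node_id("run:a03422cf..")
```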

Edges

  • { dataset:*, TO, run:* }
  • { run:*, TO, dataset:* }
  • { run:*, IS_VERSION_OF, job:* }
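The three edge shapes above can be checked mechanically. The following is a sketch under assumptions: the `Edge` tuple, `ALLOWED_EDGES`, and `is_valid_edge` are hypothetical names introduced for illustration, and the check looks only at the prefix of each node ID.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # node ID of the edge's origin
    label: str   # "TO" or "IS_VERSION_OF"
    target: str  # node ID of the edge's destination


# (source kind, label, target kind) triples listed in the proposal.
ALLOWED_EDGES = {
    ("dataset", "TO", "run"),
    ("run", "TO", "dataset"),
    ("run", "IS_VERSION_OF", "job"),
}


def kind_of(node_id: str) -> str:
    """Return the prefix before the first ':' (dataset, job, or run)."""
    return node_id.partition(":")[0]


def is_valid_edge(edge: Edge) -> bool:
    """True if the edge matches one of the three allowed shapes."""
    return (kind_of(edge.source), edge.label, kind_of(edge.target)) in ALLOWED_EDGES


input_edge = Edge(
    "dataset:food_delivery:public.top_delivery_times#947c0388..",
    "TO",
    "run:a03422cf..",
)
bad_edge = Edge(
    "job:food_delivery:orders_popular_day_of_week#947c0388..",
    "TO",
    "run:a03422cf..",
)
```

Because dataset versions and job versions never connect directly, every path between them passes through a run node.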

Example

Run a03422cf

First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:

run-1

Figure 2:

Run ec6abf8b

Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:

run-2

Figure 3:

Run diff from a03422cf to ec6abf8b

A diff graph represents the changes between two run nodes of a run-level graph. The graph compares changes starting at a given run node A, up to a given run node B (inclusive). Below we show a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:

diff

Figure 4: Diff from a03422cf to ec6abf8b
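One possible reading of the run diff is a comparison of the inputs, outputs, and job version attached to each run. The sketch below makes several assumptions: `RunNode`, `diff_versions`, and `diff_runs` are hypothetical names; the dataset names `input_dataset` and `output_dataset` and the job version `job-v1` are placeholders, since the proposal gives only version hashes for this example; and the sketch compares only the two endpoint runs, not any runs in between.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (version before, version after); None means absent on that side.
Change = Tuple[Optional[str], Optional[str]]


@dataclass(frozen=True)
class RunNode:
    run_id: str
    job_version: str
    inputs: Dict[str, str]   # dataset name -> dataset version consumed
    outputs: Dict[str, str]  # dataset name -> dataset version produced


def diff_versions(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Change]:
    """Return the datasets whose version differs between the two mappings."""
    changes: Dict[str, Change] = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


def diff_runs(a: RunNode, b: RunNode) -> dict:
    """Compare run a against run b; empty or None entries mean no change."""
    job_change = None if a.job_version == b.job_version else (a.job_version, b.job_version)
    return {
        "inputs": diff_versions(a.inputs, b.inputs),
        "outputs": diff_versions(a.outputs, b.outputs),
        "job_version": job_change,
    }


# Version hashes are taken from the example above; names are placeholders.
run_a = RunNode("a03422cf", "job-v1", {"input_dataset": "695888e2"}, {"output_dataset": "a03422cf"})
run_b = RunNode("ec6abf8b", "job-v1", {"input_dataset": "695888e2"}, {"output_dataset": "ec44fed4"})

run_diff = diff_runs(run_a, run_b)
```

Under these assumptions, the diff reports no input change and a single output change, matching the description of the two example runs.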

@wslulciuc wslulciuc added the api API layer changes label Mar 15, 2024
@wslulciuc wslulciuc added this to the Roadmap milestone Mar 15, 2024
@zqqqqz2000

Very nice feature, looking forward to its completion. Is there currently a schedule for completion?
