Run-level Graph #2772
Comments
Very nice feature, looking forward to its completion. Is there currently a schedule for completion?
Run-level Graph
A run-level graph represents the relationships between dataset and run metadata. A run-level graph is directed and consists of three node types: dataset version, job version, and run (see Figure 1). A run node may have one or more versioned inputs and versioned outputs as edges. An edge from a run node to a job version node is also maintained and represents the version of the job (that is, a link to the source code) at the time of execution.
Note that a dataset is assumed to be modified as the result of a successful run. For a run to be marked successful, the run must transition from a RUNNING state to a COMPLETED state. A run-level graph dynamically captures all modifications made to a given dataset from run to run.
Introduction
A run-level graph is fundamental in troubleshooting data issues. For example, the data type of a column within a table may change, resulting in unanticipated downstream job failures.
Often, it's both challenging and time-consuming to determine why a given job is failing. Using the run-level graph, you can observe the upstream lineage of the failing job, which simplifies troubleshooting by highlighting that, for example, the data type of a column is now STRING upstream, though the failing job was processing the column as an INT downstream.
Graph Data Model
A run-level graph consists of the following nodes and edges:
Nodes
dataset:{namespace}:{dataset}#{version}
dataset:food_delivery:public.top_delivery_times#947c0388..
job:{namespace}:{job}#{version}
job:food_delivery:orders_popular_day_of_week#947c0388..
run:{id}
run:a03422cf..
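The node ID patterns above can be sketched as small formatting helpers. This is only an illustration of the naming scheme, not the project's implementation: the function names are hypothetical, and the version and run ID strings are the shortened values shown in the examples, used as-is.

```python
def dataset_node_id(namespace: str, dataset: str, version: str) -> str:
    """Format a dataset version node ID: dataset:{namespace}:{dataset}#{version}."""
    return f"dataset:{namespace}:{dataset}#{version}"


def job_node_id(namespace: str, job: str, version: str) -> str:
    """Format a job version node ID: job:{namespace}:{job}#{version}."""
    return f"job:{namespace}:{job}#{version}"


def run_node_id(run_id: str) -> str:
    """Format a run node ID: run:{id}."""
    return f"run:{run_id}"


def node_type(node_id: str) -> str:
    """Return the node type prefix ("dataset", "job", or "run") of a node ID."""
    return node_id.split(":", 1)[0]


# Shortened identifiers taken from the examples above.
print(dataset_node_id("food_delivery", "public.top_delivery_times", "947c0388"))
print(job_node_id("food_delivery", "orders_popular_day_of_week", "947c0388"))
print(run_node_id("a03422cf"))
```

Keeping the node type as the leading prefix means an edge endpoint can be classified with a single split, without a separate lookup.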
Edges
{dataset:*, TO, run:*}
{run:*, TO, dataset:*}
{run:*, IS_VERSION_OF, job:*}
Example
Run a03422cf
First, we create the run a03422cf for orders_popular_day_of_week that consumes the input version 695888e2 and produces the output version a03422cf:
Run ec6abf8b
Then, we create another run ec6abf8b that consumes the same input version 695888e2, but produces a new output version ec44fed4:
Run diff from a03422cf to ec6abf8b
A diff graph represents the changes between two run nodes of a run-level graph. The graph compares changes starting at a given run node A, up to a given run node B (inclusive). Below we show a run-based comparison for the job orders_popular_day_of_week between runs a03422cf and ec6abf8b:
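As a rough sketch of how such a comparison could be computed, the example runs can be stored as (source, edge type, target) triples and diffed by their input and output dataset versions. Several things here are assumptions made for illustration: the dataset names example_input and example_output are hypothetical (the example gives only version identifiers), the function names are invented, and the diff compares only the two endpoint runs rather than every run between A and B.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # node ID the edge starts from
    kind: str    # "TO" or "IS_VERSION_OF"
    target: str  # node ID the edge points to


def run_inputs(edges: set, run: str) -> set:
    """Dataset versions consumed by a run: {dataset:*, TO, run:*} edges."""
    return {e.source for e in edges
            if e.kind == "TO" and e.target == run and e.source.startswith("dataset:")}


def run_outputs(edges: set, run: str) -> set:
    """Dataset versions produced by a run: {run:*, TO, dataset:*} edges."""
    return {e.target for e in edges
            if e.kind == "TO" and e.source == run and e.target.startswith("dataset:")}


def diff_runs(edges: set, run_a: str, run_b: str) -> dict:
    """Compare the input and output dataset versions of two runs."""
    in_a, in_b = run_inputs(edges, run_a), run_inputs(edges, run_b)
    out_a, out_b = run_outputs(edges, run_a), run_outputs(edges, run_b)
    return {
        "inputs_unchanged": sorted(in_a & in_b),
        "inputs_changed": sorted(in_a ^ in_b),
        "outputs_removed": sorted(out_a - out_b),
        "outputs_added": sorted(out_b - out_a),
    }


# The two runs from the example; dataset names are hypothetical placeholders.
INPUT_V1 = "dataset:food_delivery:example_input#695888e2"
EDGES = {
    Edge(INPUT_V1, "TO", "run:a03422cf"),
    Edge("run:a03422cf", "TO", "dataset:food_delivery:example_output#a03422cf"),
    Edge(INPUT_V1, "TO", "run:ec6abf8b"),
    Edge("run:ec6abf8b", "TO", "dataset:food_delivery:example_output#ec44fed4"),
}

diff = diff_runs(EDGES, "run:a03422cf", "run:ec6abf8b")
for key, nodes in diff.items():
    print(key, nodes)
```

For the two example runs, this reports the shared input version 695888e2 as unchanged and the output version as replaced, with a03422cf removed and ec44fed4 added.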