-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[io manager] access output metadata in handle_output #20091
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @jamiedemaria and the rest of your teammates on Graphite |
bc14058
to
420ce11
Compare
def output_metadata(self) -> ArbitraryMetadataMapping: | ||
"""A dict of the metadata that is assigned to the output at execution time.""" | ||
if self._warn_on_step_context_use: | ||
warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to finally fix this but I'm finding the current naming very confusing. We are on an OutputContext
, which has a field metadata
, but then also another field output_metadata
. This already makes my head spin. It's made even worse (not your fault) by the fact that even the plain old metadata
is already sourced from something qualified with "out" (AssetOut
or OutputDefinition
IIRC).
I think if we are doing two fields (which I agree is the right approach, thx for thorough explainer) we need to deprecate the old .metadata
accessor and replace it with definition_metadata
, and use something other than output_metadata
-- maybe runtime_metadata
?
Agree that naming is confusing - thanks for bringing it up. I'll look through our docs and see if we already use any particular language to differentiate between definition and runtime metadata. that could inform naming conventions. Since it seems unlikely we'd be able to address the The other option is to like take the hit on the |
terminology from docs:
I'm feeling pretty ok about |
|
I prefer |
Thinking about how we might describe the difference between definition and value metadata in docs: Definition metadata is added to the |
Good description. I would be even more explicit with the description of value metadata, e.g. "Value metadata is generated during execution of the asset and is either attached to a return
Just flagging that prob not the example we want since we now have dedicated field for asset owners |
oh yeah for sure - this was mostly spitballing docs to see how value felt in prose. idk if i'd make docs changes as part of this pr. but def noted on the author metadata, i didn't know that had become a dedicated thing |
Seems like @smackesey can you elaborate more on why you like |
Primarily for consistency with Yeah it's not the clearest but neither is "runtime" unfortunately. |
this make sense to me. For what it's worth i originally interpreted "value" as "the metadata is a value" not "the metadata is associated with a return value". I think that's where some of my confusion stemmed from. maybe another avenue to asses naming is if we'd be able to replace |
This is a good idea, though IIRC we want to deprecate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my immediate reaction is i'd expect the api name in io manager to be consistent with the name in asset body.
- in an asset, i log info as:
@asset
def metadata_added_to_context(context: AssetExecutionContext):
context.add_output_metadata(blah)
- then, in io manager
class MyRuntimeMetadataIOManager(IOManager):
def handle_output(self, context: OutputContext, obj):
assert context.output_metdata == blah
So before reading the comment, output_metadata
felt pretty natural to me. However, if output_metadata will not be here forever, we may want to pick something that's more future proof:
- value_metadata: my main reservation is "value" is another new noun to this world. i agree that if we could rename to
context.add_value_metadata
, this would make total sense. but introducing "value" right now could cause further confusion as we can't remove output_metadata right away (we could deprecate it but there will be a relatively long period where all these different x_metadata co-exist in the system) - runtime_metadata: this seems descriptive and accurate to me. but having
context.add_runtime_metadata
in asset body sounds a little too implementation details to me?
that being said, whats the long term vision for add_output_metadata
? or, what's our thinking on long term pattern to log metadata at runtime?
spitballing. it feels like we are trapped in this .x_metadata
. have we explored a getter pattern? something like:
class MyRuntimeMetadataIOManager(IOManager):
def handle_output(self, context: OutputContext, obj):
assert context.get_metadata() == OutputMetadata(definition_metadata=..., runtime_metdata=...) # some object that consolidates both metadata and makes that very explicit
and deprecate .metadata
in OutputContext
|
||
class MyDefinitionMetadataIOManager(IOManager): | ||
def handle_output(self, context: OutputContext, obj): | ||
assert context.metadata == expected_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this change, i could see this .metadata
get super confusing 😢
I dug back through git history to see what spurred the creation of
Based on that it seems unlikely we'd deprecate without having a strong reason (there totally could be one that i've missed or forgotten). However, I definitely would like to deprecate it on the
I like the idea of the getter that returns the object. It feels like a nice balance between keeping the different types of metadata split out while having just a single method on the context. |
discussed in standup. Naming decisions:
|
420ce11
to
c33e250
Compare
c33e250
to
62c633a
Compare
62c633a
to
54a2358
Compare
92ce690
to
143d07d
Compare
54a2358
to
ce63d73
Compare
b605672
to
01cb6ec
Compare
89fd03f
to
fee2e81
Compare
@@ -766,7 +766,7 @@ def _store_output( | |||
) -> Iterator[DagsterEvent]: | |||
output_def = step_context.op_def.output_def_named(step_output_handle.output_name) | |||
output_manager = step_context.get_io_manager(step_output_handle) | |||
output_context = step_context.get_output_context(step_output_handle) | |||
output_context = step_context.get_output_context(step_output_handle, output.metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
early in the execution path, we combine the metadata added via context.add_output_metadata
with the metadata on the Output
(link)
Another implementation option for this PR would be to update the StepExecutionContext with the final dictionary of metadata and then we wouldn't need to thread output metadata through here. This would also resolve a similar issue in a community pr that is trying to access output metadata in the hook context (link).
Are there any concerns with modifying the step context with the finalized dictionary of metadata? we'd either do like a full overwrite
step_context._metadata = output.metadata
and ignore the linting error, or add a method to overwrite the metadata
step_context.finalize_metadata(output.metadata) # with a better method name
My worry with that one is that users could start overwriting the metadata whenever
or maybe
output = Output(
...
metadata=step_context.normalize_metadata_with_output(output.metadata)
)
where normalize_metadata_with_output
does updates the step context metadata with the output metadata and then return the dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i kinda am leaning towards normalize_metadata_with_output
to make things very very explicit. but i dont have strong opinions. worth checking with other foundations folks maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah @alangenfeld and @smackesey interested in your thoughts on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any concerns with modifying the step context with the finalized dictionary of metadata?
This feels wrong to me. Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO, and this would be made worse by mutating it after execution is complete with the contents of a return Output
.
But I agree it's a little weird to patch just output.metadata
into get_output_context
. However it feels very natural to pass the output object itself in (you can just pull off the metadata inside get_output_context
), so that's what I favor.
a community pr that is trying to access output metadata in the hook context
Not very familiar with how hook context is constructed but I'd also recommended a similar approach to ^^ rather than mutating StepExecutionContext
with content of returned Output
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO
heard
a little context on the hook thing: the issue for hook context is that it doesn't have access to the output when it's created. maybe that's the real thing to fix. the community PR is trying to mimic what we do for passing output values to hooks where we store the value on the step context when we know there is a hook that could be invoked (
if step_context.step_output_capture is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smackesey do you have any other thoughts on this? I think i'm inclined to keep this PR as is and eat the less-than-ideal pattern for the hook context. My justification is that we're already using the step context to store outputs when we have to make a hook context, so also storing metadata isn't introducing a brand new pattern, and hooks aren't a widely used feature so this code path won't be used in most execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response. Can we pass just the Output
(as output
) instead of output_metadata
into get_output_context
? It makes sense that the full Output
should be available when constructing the output context. Otherwise this PR looks good to me.
I am going to continue the hook discussion in that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have consensus and can proceed?
@@ -766,7 +766,7 @@ def _store_output( | |||
) -> Iterator[DagsterEvent]: | |||
output_def = step_context.op_def.output_def_named(step_output_handle.output_name) | |||
output_manager = step_context.get_io_manager(step_output_handle) | |||
output_context = step_context.get_output_context(step_output_handle) | |||
output_context = step_context.get_output_context(step_output_handle, output.metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response. Can we pass just the Output
(as output
) instead of output_metadata
into get_output_context
? It makes sense that the full Output
should be available when constructing the output context. Otherwise this PR looks good to me.
I am going to continue the hook discussion in that PR.
603e4cc
to
c29693a
Compare
42ddea4
to
e2b5ea8
Compare
be26d1a
to
b3aa7d0
Compare
e2b5ea8
to
a2bade8
Compare
a2bade8
to
c086b89
Compare
ea888b2
to
baa4a6b
Compare
Summary & Motivation
Runtime/output metadata is not available in in handle_output of the I/O manager. This issue has gotten a few upvotes and came up in support recently. This PR introduces a new property
output_metadata
where users can access the metadata assigned to an output (either inOutput
or withcontext.add_output_metadata
). If accessed from theInputContext
output_metadata
will log a warning and return{}
since accessing the output metadata would require that we query the db, and we don't have an efficient way to do this for ops (see #20094)Reasoning for not combining output metadata with definition metadata.
Having a single property for metadata causes issues for the
InputContext
. It isn't reasonable to access output metadata from theInputContext
since this would require querying the event log for the latest run and fetching the metadata from the output event (further info here if interested). Since we cannot fetch the output metadata in the InputContext there would be a consistency error in the metadata provided to the user in theOutputContext
and in theInputContext
. For example:If we merged metadata together, the user would get
{"foo":"bar", "baz":"qux"}
in theOutputContext
but only{"foo":"bar"}
in theInputContext
. If instead the user did this:it could be even more problematic since the user would get
{"foo": "qux"} in the
OutputContextand
{"foo": "bar"}` in the input context.How I Tested These Changes
New unit tests. Before implementing the change, the assets that add runtime metadata (via
Output
andcontext.add_output_metadata
) failed because the I/O manager could not access the metadata