[io manager] access output metadata in handle_output #20091

jamiedemaria · 2024-02-27T16:11:47Z

Summary & Motivation

Runtime/output metadata is not available in in handle_output of the I/O manager. This issue has gotten a few upvotes and came up in support recently. This PR introduces a new property output_metadata where users can access the metadata assigned to an output (either in Output or with context.add_output_metadata). If accessed from the InputContext output_metadata will log a warning and return {} since accessing the output metadata would require that we query the db, and we don't have an efficient way to do this for ops (see #20094)

Reasoning for not combining output metadata with definition metadata.

Having a single property for metadata causes issues for the InputContext. It isn't reasonable to access output metadata from the InputContext since this would require querying the event log for the latest run and fetching the metadata from the output event (further info here if interested). Since we cannot fetch the output metadata in the InputContext there would be a consistency error in the metadata provided to the user in the OutputContext and in the InputContext. For example:

@asset(
    metadata={"foo": "bar"}
)
def asset_adds_metadata_two_ways():
    return Output(1, metadata={"baz": "qux"}

If we merged metadata together, the user would get {"foo":"bar", "baz":"qux"} in the OutputContext but only {"foo":"bar"} in the InputContext. If instead the user did this:

@asset(
    metadata={"foo": "bar"}
)
def asset_adds_metadata_two_ways():
    return Output(1, metadata={"foo": "qux"} # overriding the same metadata key

it could be even more problematic since the user would get {"foo": "qux"} in the OutputContextand{"foo": "bar"}` in the input context.

How I Tested These Changes

New unit tests. Before implementing the change, the assets that add runtime metadata (via Output and context.add_output_metadata) failed because the I/O manager could not access the metadata

jamiedemaria · 2024-02-27T16:12:01Z

[io manager] access output metadata in handle_output #20091 : 2 dependent PRs (#20196 , #20206 ) 👈
deprecate output context metadata for definition_metadata #20198
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @jamiedemaria and the rest of your teammates on Graphite

jamiedemaria · 2024-02-27T16:37:14Z

python_modules/dagster/dagster/_core/execution/context/output.py

+ def output_metadata(self) -> ArbitraryMetadataMapping:
+ """A dict of the metadata that is assigned to the output at execution time."""
+ if self._warn_on_step_context_use:
+ warnings.warn(


maybe this should be an exception?

smackesey

It's great to finally fix this but I'm finding the current naming very confusing. We are on an OutputContext, which has a field metadata, but then also another field output_metadata. This already makes my head spin. It's made even worse (not your fault) by the fact that even the plain old metadata is already sourced from something qualified with "out" (AssetOut or OutputDefinition IIRC).

I think if we are doing two fields (which I agree is the right approach, thx for thorough explainer) we need to deprecate the old .metadata accessor and replace it with definition_metadata, and use something other than output_metadata-- maybe runtime_metadata?

jamiedemaria · 2024-02-27T17:56:38Z

Agree that naming is confusing - thanks for bringing it up. I'll look through our docs and see if we already use any particular language to differentiate between definition and runtime metadata. that could inform naming conventions.

Since it seems unlikely we'd be able to address the InputContext runtime metadata issue any time soon, I think your idea of deprecating metadata for definition_metadata and output/run_time_metadata makes sense. That also opens up the possibility to re-introduce metadata later if we merge the two together, without breaking the previous definition-only metadata if we override duplicate keys

The other option is to like take the hit on the OutputContext/InputContext inconsistency and just do metadata. I think it has the potential to be confusing to users or give them the false sense that the metadata is available everywhere, but i'm open to being convinced that this is an ok risk to accept

jamiedemaria · 2024-02-27T18:56:25Z

terminology from docs:

definition metadata
materialization metadata - can't use this since we need to be inclusive of ops
event metadata - too vague to me
output metadata - already discussed that this is confusing

I'm feeling pretty ok about run_time_metadata (runtime_metadata?)

alangenfeld · 2024-02-27T19:18:54Z

value_metadata is the only other idea that comes to mind, not sure its better than run time

smackesey · 2024-02-27T19:38:49Z

value_metadata is the only other idea that comes to mind, not sure its better than run time

I prefer value_metadata to runtime_metadata

jamiedemaria · 2024-02-27T20:02:54Z

Thinking about how we might describe the difference between definition and value metadata in docs:

Definition metadata is added to the @asset decorator and remains static across multiple materializations. For example, tracking the owner of an asset may be recorded in definition-level metadata. Value metadata refers to metadata that is recorded during materialization of the asset. This metadata is usually dynamically determined during execution, for example, the number of rows added to a table in a database.

smackesey · 2024-02-27T20:20:19Z

Good description. I would be even more explicit with the description of value metadata, e.g. "Value metadata is generated during execution of the asset and is either attached to a return Output or yielded with context.add_output_metadata" (or whatever that method is).

tracking the owner of an asset may be recorded in definition-level metadata

Just flagging that prob not the example we want since we now have dedicated field for asset owners

jamiedemaria · 2024-02-27T20:23:19Z

oh yeah for sure - this was mostly spitballing docs to see how value felt in prose. idk if i'd make docs changes as part of this pr. but def noted on the author metadata, i didn't know that had become a dedicated thing

jamiedemaria · 2024-02-28T18:21:38Z

Seems like definition_metadata is a clear choice for the definition metadata.

@smackesey can you elaborate more on why you like value_metadata over runtime_metadata? The main risk I see with value_metadata is that it introduces a new term and isn't immediately clear that value means it's attached to a particular output/materialization

smackesey · 2024-02-28T18:30:53Z

can you elaborate more on why you like value_metadata over runtime_metadata?

Primarily for consistency with definition_metadata-- which is qualified with a WHAT rather than WHEN. If we did "runtime" it would seem the proper counterpart would be "static" or something rather than "definition". Also I think WHAT is better because there can actually be multiple return values, "runtime_metadata" is ambiguous wrt the metadata is associated with all or just one returned value.

Yeah it's not the clearest but neither is "runtime" unfortunately.

jamiedemaria · 2024-02-28T18:58:17Z

Also I think WHAT is better because there can actually be multiple return values

this make sense to me. For what it's worth i originally interpreted "value" as "the metadata is a value" not "the metadata is associated with a return value". I think that's where some of my confusion stemmed from.

maybe another avenue to asses naming is if we'd be able to replace context.add_output_metadata with the new name. So context.add_value_metadata. It doesn't help much for de-emphasizing I/O managers since "value" would still be an output value. my thoughts on this aren't fully formed yet though

smackesey · 2024-02-28T20:10:28Z

maybe another avenue to asses naming is if we'd be able to replace context.add_output_metadata with the new name. So context.add_value_metadata

This is a good idea, though IIRC we want to deprecate context.add_output_metadata entirely

yuhan

my immediate reaction is i'd expect the api name in io manager to be consistent with the name in asset body.

in an asset, i log info as:

    @asset
    def metadata_added_to_context(context: AssetExecutionContext):
        context.add_output_metadata(blah)

then, in io manager

    class MyRuntimeMetadataIOManager(IOManager):
        def handle_output(self, context: OutputContext, obj):
            assert context.output_metdata == blah

So before reading the comment, output_metadata felt pretty natural to me. However, if output_metadata will not be here forever, we may want to pick something that's more future proof:

value_metadata: my main reservation is "value" is another new noun to this world. i agree that if we could rename to context.add_value_metadata, this would make total sense. but introducing "value" right now could cause further confusion as we can't remove output_metadata right away (we could deprecate it but there will be a relatively long period where all these different x_metadata co-exist in the system)
runtime_metadata: this seems descriptive and accurate to me. but having context.add_runtime_metadata in asset body sounds a little too implementation details to me?

that being said, whats the long term vision for add_output_metadata? or, what's our thinking on long term pattern to log metadata at runtime?

spitballing. it feels like we are trapped in this .x_metadata. have we explored a getter pattern? something like:

    class MyRuntimeMetadataIOManager(IOManager):
        def handle_output(self, context: OutputContext, obj):
            assert context.get_metadata() == OutputMetadata(definition_metadata=..., runtime_metdata=...) # some object that consolidates both metadata and makes that very explicit

and deprecate .metadata in OutputContext

yuhan · 2024-02-29T18:06:48Z

python_modules/dagster/dagster_tests/storage_tests/test_io_manager.py

+
+ class MyDefinitionMetadataIOManager(IOManager):
+ def handle_output(self, context: OutputContext, obj):
+ assert context.metadata == expected_metadata


with this change, i could see this .metadata get super confusing 😢

jamiedemaria · 2024-03-01T16:08:31Z

long term plan of add_output_metadata

I dug back through git history to see what spurred the creation of add_output_metadata and it looks like it was just because it was nice to not need to yield an Output (source). I also searched through slack a bit and these are my takeaways:

I don't see much chatter in slack about it's future
it's used in a lot of example code sent by our users, which indicates it's widely used

Based on that it seems unlikely we'd deprecate without having a strong reason (there totally could be one that i've missed or forgotten).

However, I definitely would like to deprecate it on the AssetExecutionContext and replace with something that doesn't use output terminology (or not replace at all, but i'm unsure about that option)

having a metadata object

I like the idea of the getter that returns the object. It feels like a nice balance between keeping the different types of metadata split out while having just a single method on the context.

jamiedemaria · 2024-03-01T18:30:51Z

discussed in standup. Naming decisions:

deprecate metadata in favor of definition_metadata
keep the output metadata named output_metadata

jamiedemaria · 2024-03-06T22:49:30Z

python_modules/dagster/dagster/_core/execution/plan/execute_step.py

@@ -766,7 +766,7 @@ def _store_output(
 ) -> Iterator[DagsterEvent]:
 output_def = step_context.op_def.output_def_named(step_output_handle.output_name)
 output_manager = step_context.get_io_manager(step_output_handle)
- output_context = step_context.get_output_context(step_output_handle)
+ output_context = step_context.get_output_context(step_output_handle, output.metadata)


early in the execution path, we combine the metadata added via context.add_output_metadata with the metadata on the Output (link)

Another implementation option for this PR would be to update the StepExecutionContext with the final dictionary of metadata and then we wouldn't need to thread output metadata through here. This would also resolve a similar issue in a community pr that is trying to access output metadata in the hook context (link).

Are there any concerns with modifying the step context with the finalized dictionary of metadata? we'd either do like a full overwrite

step_context._metadata = output.metadata

and ignore the linting error, or add a method to overwrite the metadata

step_context.finalize_metadata(output.metadata) # with a better method name

My worry with that one is that users could start overwriting the metadata whenever

or maybe

output = Output( ... metadata=step_context.normalize_metadata_with_output(output.metadata) )

where normalize_metadata_with_output does updates the step context metadata with the output metadata and then return the dict

i kinda am leaning towards normalize_metadata_with_output to make things very very explicit. but i dont have strong opinions. worth checking with other foundations folks maybe?

yeah @alangenfeld and @smackesey interested in your thoughts on this

Are there any concerns with modifying the step context with the finalized dictionary of metadata?

This feels wrong to me. Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO, and this would be made worse by mutating it after execution is complete with the contents of a return Output.

But I agree it's a little weird to patch just output.metadata into get_output_context. However it feels very natural to pass the output object itself in (you can just pull off the metadata inside get_output_context), so that's what I favor.

a community pr that is trying to access output metadata in the hook context

Not very familiar with how hook context is constructed but I'd also recommended a similar approach to ^^ rather than mutating StepExecutionContext with content of returned Output.

Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO

heard

a little context on the hook thing: the issue for hook context is that it doesn't have access to the output when it's created. maybe that's the real thing to fix. the community PR is trying to mimic what we do for passing output values to hooks where we store the value on the step context when we know there is a hook that could be invoked (

dagster/python_modules/dagster/dagster/_core/execution/plan/execute_step.py

Line 573 in cd80ceb

if step_context.step_output_capture is not None:

). this is the exact same mistake that you point out. maybe it's a bit better than always updating the metadata since it would only happen in the case when there is a hook

@smackesey do you have any other thoughts on this? I think i'm inclined to keep this PR as is and eat the less-than-ideal pattern for the hook context. My justification is that we're already using the step context to store outputs when we have to make a hook context, so also storing metadata isn't introducing a brand new pattern, and hooks aren't a widely used feature so this code path won't be used in most execution

Sorry for the late response. Can we pass just the Output (as output) instead of output_metadata into get_output_context? It makes sense that the full Output should be available when constructing the output context. Otherwise this PR looks good to me.

I am going to continue the hook discussion in that PR.

smackesey

I think we have consensus and can proceed?

smackesey · 2024-03-19T13:58:06Z

python_modules/dagster/dagster/_core/execution/plan/execute_step.py

@@ -766,7 +766,7 @@ def _store_output(
 ) -> Iterator[DagsterEvent]:
 output_def = step_context.op_def.output_def_named(step_output_handle.output_name)
 output_manager = step_context.get_io_manager(step_output_handle)
- output_context = step_context.get_output_context(step_output_handle)
+ output_context = step_context.get_output_context(step_output_handle, output.metadata)


Sorry for the late response. Can we pass just the Output (as output) instead of output_metadata into get_output_context? It makes sense that the full Output should be available when constructing the output context. Otherwise this PR looks good to me.

I am going to continue the hook discussion in that PR.

…ecate

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from bc14058 to 420ce11 Compare February 27, 2024 16:36

jamiedemaria commented Feb 27, 2024

View reviewed changes

jamiedemaria marked this pull request as ready for review February 27, 2024 17:30

jamiedemaria requested review from alangenfeld and smackesey February 27, 2024 17:30

smackesey reviewed Feb 27, 2024

View reviewed changes

yuhan reviewed Feb 29, 2024

View reviewed changes

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from 420ce11 to c33e250 Compare March 1, 2024 19:50

jamiedemaria mentioned this pull request Mar 1, 2024

deprecate OutputContext.metadata for OutputContext.definition_metadata #20196

Closed

jamiedemaria changed the base branch from master to jamie/2-deprecate-metadata March 1, 2024 20:24

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from c33e250 to 62c633a Compare March 1, 2024 20:24

jamiedemaria mentioned this pull request Mar 1, 2024

deprecate output context metadata for definition_metadata #20198

Merged

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from 62c633a to 54a2358 Compare March 1, 2024 20:45

jamiedemaria mentioned this pull request Mar 1, 2024

deprecate metadata on InputContext in favor of definition_metadata #20206

Merged

jamiedemaria force-pushed the jamie/2-deprecate-metadata branch from 92ce690 to 143d07d Compare March 4, 2024 15:10

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from 54a2358 to ce63d73 Compare March 4, 2024 15:10

jamiedemaria force-pushed the jamie/2-deprecate-metadata branch from b605672 to 01cb6ec Compare March 4, 2024 18:27

jamiedemaria requested a review from erinkcochran87 as a code owner March 4, 2024 18:27

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch 2 times, most recently from 89fd03f to fee2e81 Compare March 4, 2024 20:03

jamiedemaria requested a review from smackesey March 4, 2024 20:04

erinkcochran87 removed their request for review March 6, 2024 19:09

jamiedemaria commented Mar 6, 2024

View reviewed changes

alangenfeld removed their request for review March 6, 2024 22:52

jamiedemaria requested a review from yuhan March 8, 2024 17:59

smackesey approved these changes Mar 19, 2024

View reviewed changes

smackesey mentioned this pull request Mar 19, 2024

Add op_output_metadata retrieval method to HookContext #20027

Merged

jamiedemaria force-pushed the jamie/2-deprecate-metadata branch from 603e4cc to c29693a Compare March 20, 2024 14:34

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch 2 times, most recently from 42ddea4 to e2b5ea8 Compare March 20, 2024 15:52

jamiedemaria force-pushed the jamie/2-deprecate-metadata branch from be26d1a to b3aa7d0 Compare March 20, 2024 18:57

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from e2b5ea8 to a2bade8 Compare March 20, 2024 18:57

Base automatically changed from jamie/2-deprecate-metadata to master March 20, 2024 19:30

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from a2bade8 to c086b89 Compare March 20, 2024 19:37

jamiedemaria added 6 commits March 25, 2024 11:06

deprecate output context metadata for definition_metadata

5f734f6

fix tests with temporary set of metadata until input context can depr…

af0bb8f

…ecate

add output metadata to io manager

137f436

warn and link to issue

7817caf

update for deprecated param

e242571

fix merge error

baa4a6b

jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch from ea888b2 to baa4a6b Compare March 25, 2024 15:07

jamiedemaria merged commit dfb8027 into master Mar 25, 2024
1 check passed

jamiedemaria deleted the jamie/2-metadata-io-mgr branch March 25, 2024 16:00

yuhan mentioned this pull request Mar 28, 2024

support value in MaterializeResult #16887

Open

jamiedemaria mentioned this pull request Mar 28, 2024

Access output metadata in IO managers via the OutputContext and InputContext #17923

Closed

PedramNavid pushed a commit that referenced this pull request Mar 28, 2024

[io manager] access output metadata in handle_output (#20091)

6bae8bd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[io manager] access output metadata in handle_output #20091

[io manager] access output metadata in handle_output #20091

jamiedemaria commented Feb 27, 2024 •

edited

Loading

jamiedemaria commented Feb 27, 2024 •

edited

Loading

jamiedemaria Feb 27, 2024

smackesey left a comment •

edited

Loading

jamiedemaria commented Feb 27, 2024

jamiedemaria commented Feb 27, 2024

alangenfeld commented Feb 27, 2024

smackesey commented Feb 27, 2024

jamiedemaria commented Feb 27, 2024

smackesey commented Feb 27, 2024 •

edited

Loading

jamiedemaria commented Feb 27, 2024 •

edited

Loading

jamiedemaria commented Feb 28, 2024

smackesey commented Feb 28, 2024 •

edited

Loading

jamiedemaria commented Feb 28, 2024

smackesey commented Feb 28, 2024

yuhan left a comment •

edited

Loading

yuhan Feb 29, 2024

jamiedemaria commented Mar 1, 2024

jamiedemaria commented Mar 1, 2024

jamiedemaria Mar 6, 2024 •

edited

Loading

yuhan Mar 7, 2024

jamiedemaria Mar 7, 2024

smackesey Mar 7, 2024

jamiedemaria Mar 7, 2024

jamiedemaria Mar 12, 2024

smackesey Mar 19, 2024

smackesey left a comment

smackesey Mar 19, 2024

[io manager] access output metadata in handle_output #20091

[io manager] access output metadata in handle_output #20091

Conversation

jamiedemaria commented Feb 27, 2024 • edited Loading

Summary & Motivation

How I Tested These Changes

jamiedemaria commented Feb 27, 2024 • edited Loading

Choose a reason for hiding this comment

smackesey left a comment • edited Loading

Choose a reason for hiding this comment

jamiedemaria commented Feb 27, 2024

jamiedemaria commented Feb 27, 2024

alangenfeld commented Feb 27, 2024

smackesey commented Feb 27, 2024

jamiedemaria commented Feb 27, 2024

smackesey commented Feb 27, 2024 • edited Loading

jamiedemaria commented Feb 27, 2024 • edited Loading

jamiedemaria commented Feb 28, 2024

smackesey commented Feb 28, 2024 • edited Loading

jamiedemaria commented Feb 28, 2024

smackesey commented Feb 28, 2024

yuhan left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamiedemaria commented Mar 1, 2024

jamiedemaria commented Mar 1, 2024

jamiedemaria Mar 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smackesey left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamiedemaria commented Feb 27, 2024 •

edited

Loading

jamiedemaria commented Feb 27, 2024 •

edited

Loading

smackesey left a comment •

edited

Loading

smackesey commented Feb 27, 2024 •

edited

Loading

jamiedemaria commented Feb 27, 2024 •

edited

Loading

smackesey commented Feb 28, 2024 •

edited

Loading

yuhan left a comment •

edited

Loading

jamiedemaria Mar 6, 2024 •

edited

Loading