Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[io manager] access output metadata in handle_output #20091

Merged
merged 6 commits into from
Mar 25, 2024

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Feb 27, 2024

Summary & Motivation

Runtime/output metadata is not available in in handle_output of the I/O manager. This issue has gotten a few upvotes and came up in support recently. This PR introduces a new property output_metadata where users can access the metadata assigned to an output (either in Output or with context.add_output_metadata). If accessed from the InputContext output_metadata will log a warning and return {} since accessing the output metadata would require that we query the db, and we don't have an efficient way to do this for ops (see #20094)


Reasoning for not combining output metadata with definition metadata.

Having a single property for metadata causes issues for the InputContext. It isn't reasonable to access output metadata from the InputContext since this would require querying the event log for the latest run and fetching the metadata from the output event (further info here if interested). Since we cannot fetch the output metadata in the InputContext there would be a consistency error in the metadata provided to the user in the OutputContext and in the InputContext. For example:

@asset(
    metadata={"foo": "bar"}
)
def asset_adds_metadata_two_ways():
    return Output(1, metadata={"baz": "qux"}

If we merged metadata together, the user would get {"foo":"bar", "baz":"qux"} in the OutputContext but only {"foo":"bar"} in the InputContext. If instead the user did this:

@asset(
    metadata={"foo": "bar"}
)
def asset_adds_metadata_two_ways():
    return Output(1, metadata={"foo": "qux"} # overriding the same metadata key

it could be even more problematic since the user would get {"foo": "qux"} in the OutputContextand{"foo": "bar"}` in the input context.

How I Tested These Changes

New unit tests. Before implementing the change, the assets that add runtime metadata (via Output and context.add_output_metadata) failed because the I/O manager could not access the metadata

Copy link
Contributor Author

jamiedemaria commented Feb 27, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @jamiedemaria and the rest of your teammates on Graphite Graphite

def output_metadata(self) -> ArbitraryMetadataMapping:
"""A dict of the metadata that is assigned to the output at execution time."""
if self._warn_on_step_context_use:
warnings.warn(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be an exception?

@jamiedemaria jamiedemaria marked this pull request as ready for review February 27, 2024 17:30
Copy link
Collaborator

@smackesey smackesey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to finally fix this but I'm finding the current naming very confusing. We are on an OutputContext, which has a field metadata, but then also another field output_metadata. This already makes my head spin. It's made even worse (not your fault) by the fact that even the plain old metadata is already sourced from something qualified with "out" (AssetOut or OutputDefinition IIRC).

I think if we are doing two fields (which I agree is the right approach, thx for thorough explainer) we need to deprecate the old .metadata accessor and replace it with definition_metadata, and use something other than output_metadata-- maybe runtime_metadata?

@jamiedemaria
Copy link
Contributor Author

Agree that naming is confusing - thanks for bringing it up. I'll look through our docs and see if we already use any particular language to differentiate between definition and runtime metadata. that could inform naming conventions.

Since it seems unlikely we'd be able to address the InputContext runtime metadata issue any time soon, I think your idea of deprecating metadata for definition_metadata and output/run_time_metadata makes sense. That also opens up the possibility to re-introduce metadata later if we merge the two together, without breaking the previous definition-only metadata if we override duplicate keys

The other option is to like take the hit on the OutputContext/InputContext inconsistency and just do metadata. I think it has the potential to be confusing to users or give them the false sense that the metadata is available everywhere, but i'm open to being convinced that this is an ok risk to accept

@jamiedemaria
Copy link
Contributor Author

terminology from docs:

  • definition metadata
  • materialization metadata - can't use this since we need to be inclusive of ops
  • event metadata - too vague to me
  • output metadata - already discussed that this is confusing

I'm feeling pretty ok about run_time_metadata (runtime_metadata?)

@alangenfeld
Copy link
Member

value_metadata is the only other idea that comes to mind, not sure its better than run time

@smackesey
Copy link
Collaborator

value_metadata is the only other idea that comes to mind, not sure its better than run time

I prefer value_metadata to runtime_metadata

@jamiedemaria
Copy link
Contributor Author

Thinking about how we might describe the difference between definition and value metadata in docs:

Definition metadata is added to the @asset decorator and remains static across multiple materializations. For example, tracking the owner of an asset may be recorded in definition-level metadata. Value metadata refers to metadata that is recorded during materialization of the asset. This metadata is usually dynamically determined during execution, for example, the number of rows added to a table in a database.

@smackesey
Copy link
Collaborator

smackesey commented Feb 27, 2024

Good description. I would be even more explicit with the description of value metadata, e.g. "Value metadata is generated during execution of the asset and is either attached to a return Output or yielded with context.add_output_metadata" (or whatever that method is).

tracking the owner of an asset may be recorded in definition-level metadata

Just flagging that prob not the example we want since we now have dedicated field for asset owners

@jamiedemaria
Copy link
Contributor Author

jamiedemaria commented Feb 27, 2024

oh yeah for sure - this was mostly spitballing docs to see how value felt in prose. idk if i'd make docs changes as part of this pr. but def noted on the author metadata, i didn't know that had become a dedicated thing

@jamiedemaria
Copy link
Contributor Author

Seems like definition_metadata is a clear choice for the definition metadata.

@smackesey can you elaborate more on why you like value_metadata over runtime_metadata? The main risk I see with value_metadata is that it introduces a new term and isn't immediately clear that value means it's attached to a particular output/materialization

@smackesey
Copy link
Collaborator

smackesey commented Feb 28, 2024

can you elaborate more on why you like value_metadata over runtime_metadata?

Primarily for consistency with definition_metadata-- which is qualified with a WHAT rather than WHEN. If we did "runtime" it would seem the proper counterpart would be "static" or something rather than "definition". Also I think WHAT is better because there can actually be multiple return values, "runtime_metadata" is ambiguous wrt the metadata is associated with all or just one returned value.

Yeah it's not the clearest but neither is "runtime" unfortunately.

@jamiedemaria
Copy link
Contributor Author

Also I think WHAT is better because there can actually be multiple return values

this make sense to me. For what it's worth i originally interpreted "value" as "the metadata is a value" not "the metadata is associated with a return value". I think that's where some of my confusion stemmed from.

maybe another avenue to asses naming is if we'd be able to replace context.add_output_metadata with the new name. So context.add_value_metadata. It doesn't help much for de-emphasizing I/O managers since "value" would still be an output value. my thoughts on this aren't fully formed yet though

@smackesey
Copy link
Collaborator

maybe another avenue to asses naming is if we'd be able to replace context.add_output_metadata with the new name. So context.add_value_metadata

This is a good idea, though IIRC we want to deprecate context.add_output_metadata entirely

Copy link
Contributor

@yuhan yuhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my immediate reaction is i'd expect the api name in io manager to be consistent with the name in asset body.

  • in an asset, i log info as:
    @asset
    def metadata_added_to_context(context: AssetExecutionContext):
        context.add_output_metadata(blah)
  • then, in io manager
    class MyRuntimeMetadataIOManager(IOManager):
        def handle_output(self, context: OutputContext, obj):
            assert context.output_metdata == blah

So before reading the comment, output_metadata felt pretty natural to me. However, if output_metadata will not be here forever, we may want to pick something that's more future proof:

  • value_metadata: my main reservation is "value" is another new noun to this world. i agree that if we could rename to context.add_value_metadata, this would make total sense. but introducing "value" right now could cause further confusion as we can't remove output_metadata right away (we could deprecate it but there will be a relatively long period where all these different x_metadata co-exist in the system)
  • runtime_metadata: this seems descriptive and accurate to me. but having context.add_runtime_metadata in asset body sounds a little too implementation details to me?

that being said, whats the long term vision for add_output_metadata? or, what's our thinking on long term pattern to log metadata at runtime?


spitballing. it feels like we are trapped in this .x_metadata. have we explored a getter pattern? something like:

    class MyRuntimeMetadataIOManager(IOManager):
        def handle_output(self, context: OutputContext, obj):
            assert context.get_metadata() == OutputMetadata(definition_metadata=..., runtime_metdata=...) # some object that consolidates both metadata and makes that very explicit

and deprecate .metadata in OutputContext


class MyDefinitionMetadataIOManager(IOManager):
def handle_output(self, context: OutputContext, obj):
assert context.metadata == expected_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, i could see this .metadata get super confusing 😢

@jamiedemaria
Copy link
Contributor Author

long term plan of add_output_metadata

I dug back through git history to see what spurred the creation of add_output_metadata and it looks like it was just because it was nice to not need to yield an Output (source). I also searched through slack a bit and these are my takeaways:

  • I don't see much chatter in slack about it's future
  • it's used in a lot of example code sent by our users, which indicates it's widely used

Based on that it seems unlikely we'd deprecate without having a strong reason (there totally could be one that i've missed or forgotten).

However, I definitely would like to deprecate it on the AssetExecutionContext and replace with something that doesn't use output terminology (or not replace at all, but i'm unsure about that option)

having a metadata object

I like the idea of the getter that returns the object. It feels like a nice balance between keeping the different types of metadata split out while having just a single method on the context.

@jamiedemaria
Copy link
Contributor Author

discussed in standup. Naming decisions:

  • deprecate metadata in favor of definition_metadata
  • keep the output metadata named output_metadata

@jamiedemaria jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch 2 times, most recently from 89fd03f to fee2e81 Compare March 4, 2024 20:03
@erinkcochran87 erinkcochran87 removed their request for review March 6, 2024 19:09
@@ -766,7 +766,7 @@ def _store_output(
) -> Iterator[DagsterEvent]:
output_def = step_context.op_def.output_def_named(step_output_handle.output_name)
output_manager = step_context.get_io_manager(step_output_handle)
output_context = step_context.get_output_context(step_output_handle)
output_context = step_context.get_output_context(step_output_handle, output.metadata)
Copy link
Contributor Author

@jamiedemaria jamiedemaria Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

early in the execution path, we combine the metadata added via context.add_output_metadata with the metadata on the Output (link)

Another implementation option for this PR would be to update the StepExecutionContext with the final dictionary of metadata and then we wouldn't need to thread output metadata through here. This would also resolve a similar issue in a community pr that is trying to access output metadata in the hook context (link).

Are there any concerns with modifying the step context with the finalized dictionary of metadata? we'd either do like a full overwrite

step_context._metadata = output.metadata 

and ignore the linting error, or add a method to overwrite the metadata

step_context.finalize_metadata(output.metadata) # with a better method name 

My worry with that one is that users could start overwriting the metadata whenever

or maybe

output = Output(
...
metadata=step_context.normalize_metadata_with_output(output.metadata)
)

where normalize_metadata_with_output does updates the step context metadata with the output metadata and then return the dict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kinda am leaning towards normalize_metadata_with_output to make things very very explicit. but i dont have strong opinions. worth checking with other foundations folks maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah @alangenfeld and @smackesey interested in your thoughts on this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any concerns with modifying the step context with the finalized dictionary of metadata?

This feels wrong to me. Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO, and this would be made worse by mutating it after execution is complete with the contents of a return Output.

But I agree it's a little weird to patch just output.metadata into get_output_context. However it feels very natural to pass the output object itself in (you can just pull off the metadata inside get_output_context), so that's what I favor.

a community pr that is trying to access output metadata in the hook context

Not very familiar with how hook context is constructed but I'd also recommended a similar approach to ^^ rather than mutating StepExecutionContext with content of returned Output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having context be this stateful object that holds metadata is a bit of a mistake to begin with IMO

heard

a little context on the hook thing: the issue for hook context is that it doesn't have access to the output when it's created. maybe that's the real thing to fix. the community PR is trying to mimic what we do for passing output values to hooks where we store the value on the step context when we know there is a hook that could be invoked (

if step_context.step_output_capture is not None:
). this is the exact same mistake that you point out. maybe it's a bit better than always updating the metadata since it would only happen in the case when there is a hook

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smackesey do you have any other thoughts on this? I think i'm inclined to keep this PR as is and eat the less-than-ideal pattern for the hook context. My justification is that we're already using the step context to store outputs when we have to make a hook context, so also storing metadata isn't introducing a brand new pattern, and hooks aren't a widely used feature so this code path won't be used in most execution

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. Can we pass just the Output (as output) instead of output_metadata into get_output_context? It makes sense that the full Output should be available when constructing the output context. Otherwise this PR looks good to me.

I am going to continue the hook discussion in that PR.

@alangenfeld alangenfeld removed their request for review March 6, 2024 22:52
@jamiedemaria jamiedemaria requested a review from yuhan March 8, 2024 17:59
Copy link
Collaborator

@smackesey smackesey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have consensus and can proceed?

@@ -766,7 +766,7 @@ def _store_output(
) -> Iterator[DagsterEvent]:
output_def = step_context.op_def.output_def_named(step_output_handle.output_name)
output_manager = step_context.get_io_manager(step_output_handle)
output_context = step_context.get_output_context(step_output_handle)
output_context = step_context.get_output_context(step_output_handle, output.metadata)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. Can we pass just the Output (as output) instead of output_metadata into get_output_context? It makes sense that the full Output should be available when constructing the output context. Otherwise this PR looks good to me.

I am going to continue the hook discussion in that PR.

@jamiedemaria jamiedemaria force-pushed the jamie/2-metadata-io-mgr branch 2 times, most recently from 42ddea4 to e2b5ea8 Compare March 20, 2024 15:52
Base automatically changed from jamie/2-deprecate-metadata to master March 20, 2024 19:30
@jamiedemaria jamiedemaria merged commit dfb8027 into master Mar 25, 2024
1 check passed
@jamiedemaria jamiedemaria deleted the jamie/2-metadata-io-mgr branch March 25, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants