Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composition Function Events and Status Conditions #5450

Merged
merged 1 commit into from
Jun 28, 2024

Conversation

dalton-hill-0
Copy link
Contributor

@dalton-hill-0 dalton-hill-0 commented Mar 4, 2024

Description of your changes

Requires runtime PR Composite Receiver Functions for Conditions.

Fixes #5402
One-pager PR to document this effort: #5426
Should be merged with function-sdk-go 129

I have:

  • Read and followed Crossplane's [contribution process].
  • Run make reviewable to ensure this PR is ready for review.
  • Added or updated unit tests.
  • Added or updated e2e tests.
  • Linked a PR or a [docs tracking issue] to [document this change].
  • Added backport release-x.y labels to auto-backport this PR.

@dalton-hill-0 dalton-hill-0 requested review from negz and a team as code owners March 4, 2024 17:33
@negz negz marked this pull request as draft March 27, 2024 02:15
@negz negz self-assigned this Mar 27, 2024
Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully reviewed everything, but here's what I have so far.

Let me have a think about some of the bigger changes (e.g. the new condition package).

apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/v1/xrd_types.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/claim/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/claim/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/composite/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/composite/reconciler.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/conditions/conditions.go Outdated Show resolved Hide resolved
internal/controller/apiextensions/conditions/conditions.go Outdated Show resolved Hide resolved
@negz
Copy link
Member

negz commented Mar 28, 2024

@dalton-hill-0 I spent a little time on this today and put together a sketch at #5523. It doesn't actually work end-to-end, but hopefully it's enough to communicate what I'm thinking direction wise.

What do you think?

@dalton-hill-0
Copy link
Contributor Author

@dalton-hill-0 I spent a little time on this today and put together a sketch at #5523. It doesn't actually work end-to-end, but hopefully it's enough to communicate what I'm thinking direction wise.

What do you think?

@negz

I'm encountering one issue with this approach; labeling events.
It looks like we cannot add labels to events, only annotations. There was an issue raised and a subsequent PR, but it was ultimately closed. The maintainers had concerns about performance.

From what I can tell on the docs, labels should be used instead of annotations for our approach as labels are intended for purposes of identification (and thus are likely more performant). However, this path is not supported.

@dalton-hill-0
Copy link
Contributor Author

dalton-hill-0 commented Apr 2, 2024

@negz - Got a design question for you.

Now that users will need to check for the existence of a condition before setting it,
they will likely want to have instances of xpv1.Condition. This will allow them to
use the xpv1.Condition{}.Equal() function to check equality.

Given this, I think we need to revisit the sdk API as the existing implementation is
now unwieldy.

I'm including this here as our decision will also affect this PR.

Here are the current options I'm thinking about:

// context: user defines the condition they wish to set
dbWaiting := xpv1.Condition{
	Type:               "DatabaseReady",
	Status:             corev1.ConditionFalse,
	LastTransitionTime: metav1.Now(),
	Reason:             "Waiting",
	Message:            "Database is being created.",
}

// option A (existing approach)
if !observedXR.Resource.GetCondition(dbWaiting.Type).Equal(dbWaiting) {
	response.Normal(rsp, dbWaiting.Message).
		ConditionFalse(string(dbWaiting.Type), string(dbWaiting.Reason)).
		TargetCompositeAndClaim()
}

// option B (user directly updates XR)
// this would need development from crossplane side to still copy XR.statusConditions
// in the event of a fatal result (fatal result returns before copying XR data)
if !observedXR.Resource.GetCondition(dbWaiting.Type).Equal(dbWaiting) {
	response.Normal(rsp, dbWaiting.Message).TargetCompositeAndClaim()
	desiredXR.Resource.SetConditions(dbWaiting)
	desiredXR.Resource.SetClaimConditions(dbWaiting.Type)
}

// option C (split event and condition)
if !observedXR.Resource.GetCondition(dbWaiting.Type).Equal(dbWaiting) {
	e := event.Normal(event.Reason(dbWaiting.Reason), dbWaiting.Message)
	response.Event(rsp, e).TargetCompositeAndClaim()
	response.Condition(rsp, dbWaiting).TargetCompositeAndClaim()
}

// option C continued..
// response.Event() would do the same as response.(Normal|Warning)
// this is for sake of consistency with response.Condition()
// we would also continue to expose response.(Normal|Warning|Fatal) and allow user to set Reason:
response.Normal(rsp, "message").Reason("reason").TargetCompositeAndClaim()
// this is equivalent to:
e := event.Normal(someReason, "message")
response.Event(rsp, e).TargetCompositeAndClaim()

I'm leaning toward option C because now that the user is expected to check existence of the Condition (and consequently manage the full object), it seems backwards to then make them call an API that re-builds that Condition for them.

I do not like option B due to needing to copy the XR object (at least the status conditions) even in the event of a failed case. Would like to know your thoughts though, maybe this is not as big of a deal as I'm thinking.

@negz
Copy link
Member

negz commented Apr 5, 2024

Got a design question for you.

@dalton-hill-0 I'm struggling with this one a bit. Here's what I'm considering.

We don't want to emit a Normal event every time a function is called. It's okay to emit a Warning event every time if the warning is still valid. In the Normal happy path emitting an event every time would be noisy, and create unnecessary load on the API server (each event is a Kubernetes object after all).

Probably most functions will never actually need to emit a Normal event. I can't think of many reasons, and I don't think I've emitted one from any function I've written yet.

Normal events are usually used to inform you that a controller did something interesting that it does only some of the time. For example a Crossplane MR controller emits a Normal event whenever it creates, updates, or deletes an external resource. Functions on the other hand mostly do the same thing (produce a set of desired composed resources) every time they're called.

The only reasons I can think of for a function to emit a Normal event are:

  • Some interesting subset of the composed resources became ready for the first time (e.g. DatabaseReady)
  • The function had an interesting interaction with an external system. Perhaps a hypothetical function that imports cloud resources like Support for Querying/Filtering for Import and Observe #4141 (comment) imports a new cloud resource for the first time.

Emitting events only sometimes means conditionally producing results. For example you could imagine the following pseudocode:

current = req.observed.composite.resource["status"]["readyDbResources"]
new = 0

for r in req.observed.resources:
  if is_db(r) && is_ready(r):
    new += 1

if new >= current:
  response.normal(rsp, "All database resources are now ready")

req.desired.composite.resource.["status"]["readyDbResources"] = new
dbs = list_aws_rds_instances()
for db in dbs:
  add_db(req.desired.resources, db)
  # We already imported this one
  if db.name in req.observed.resources:
    continue
  # We found a new resource
  response.normal(rsp, "Imported new RDS instance from AWS")

I think we do want to return status conditions every time. I'm not sure what the best way to do that is though.

Functions are supposed to return server-side-apply fully specified intent. Per this link:

A fully specified intent is a partial object that only includes the fields and values for which the user has an opinion.

So every time the function pipeline is run it's sent an empty desired XR, and is supposed to populate it with its fully specified intent. If the function returns some fully specified intent but doesn't return it on a subsequent invocation, that tells server-side apply to delete the intent.

More concretely, if you called dxr.Resource.SetCondition() to set a desired status condition, you'd need to make sure to do that every time or the condition would be deleted.

What complicates that is that calling dxr.Resource.SetCondition() on the empty desired XR Crossplane sends to the function is going to write a new lastTransitionTime every time the function is called. That in turn will trigger a reconcile of the XR. The XR will get stuck in an endless tight reconcile loop, constantly updating the condition's lastTransitionTime.

@negz
Copy link
Member

negz commented Apr 5, 2024

@dalton-hill-0 Here's an idea that came to mind while writing the novel above.

func (f *Function) RunFunction(ctx context.Context, req *fnv1beta1.RunFunctionRequest) (*fnv1beta1.RunFunctionResponse, error) {
	rsp := response.To(req, response.DefaultTTL)

	oxr, _ := request.GetObservedCompositeResource(req)
	dxr, _ := request.GetDesiredCompositeResource(req)

	// Copy the conditions we're interested in from the observed XR.
	// This way when we call SetCondition below the last transition
	// time will only change if the condition actually changes.
	// This would copy both status.conditions and status.claimConditions.
	composed.CopyConditions(dxr, oxr, "DatabaseReady", "CacheReady")

	// Do some stuff

	// It's now safe to just call this every time. SetConditions will
	// make sure this is a no-op if the conditions were already true.
	dxr.SetConditions(
		resource.NewCondition("DatabaseReady", "True", "Available", TargetCompositionAndClaim()),
		resource.NewCondition("CacheReady", "True", "Available", TargetCompositionAndClaim()),
	)

	response.SetDesiredCompositeResource(rsp, dxr)
	return rsp, nil
}

The thinking is:

  • A Result doesn't have anything to do with status conditions 1. It continues to be equivalent to an event.
  • To update status conditions, you mutate the desired XR status directly.
  • We add some helpers to the SDKs to make it easier to update desired XR status conditions.

I vaguely recall this might be coming full circle, closer to your original proposal. 🙃

I like that it feels like a pretty small set of changes overall though.

To handle status conditions:

  • Add status.claimConditions to XRs
  • Teach claim controller to propagate anything in status.claimConditions from XR -> claim
  • Add SDK helpers to make writing to desired XR status.conditions and status.claimConditions easier

To handle results/events:

  • Add optional target to Result message
  • Add optional reason to Result message
  • Label events that target a claim (and XR)
  • Add controller to propagate labelled events to the claim

Footnotes

  1. Except fatal results, no change there.

@dalton-hill-0
Copy link
Contributor Author

@negz
I like the approach you outlined above, but we may need to make a couple of modifications.

1: Copying Events to Claim

  • Label events that target a claim (and XR)
  • Add controller to propagate labelled events to the claim

See this comment.
Summary: Standard API does not expose a way to set labels on events, it is not a recommended approach.

2: Setting Status Conditions on Desired XR

In the event of a fatal result the conditions will not be set. FunctionComposer.Compose() returns before copying the desired XR status into the XR itself.
This can be changed, but seems dangerous.

@negz
Copy link
Member

negz commented Apr 5, 2024

I like the approach you outlined above, but we may need to make a couple of modifications.

@dalton-hill-0 Ack.

Summary: Standard API does not expose a way to set labels on events, it is not a recommended approach.

Got it. It still feels like a violation of separation of concerns for the XR controller to record claim-aligned events, but I can't think of any good options.

Some things that come to mind:

  • Should we just punt on events completely, rely on status conditions to communicate things to claim users?
  • Would it be acceptable to list and filter/process events client-side (by annotation 😬)? The issue seems concerned with the performance implications of asking the API server to do filtering on events.

In the event of a fatal result the conditions will not be set.

I think this is okay.

XR status is updated in two places:

  1. The FunctionComposer uses SSA on the XR status to apply the function pipeline's fully-specified intent
  2. The Reconciler uses a traditional get-mutate-update on the XR status to set the Ready and Synced conditions

If a function returns a fatal result, I believe:

  1. We'll return early from the FunctionComposer, so it won't SSA anything
  2. The Reconciler will update the XR status to set the Synced condition to false, communicating the fatal result

There's no SSA, so no status conditions set on previous reconciles will be lost. Any status conditions that would have been updated before the fatal result was hit would be lost. I think that's fine, though. Fatal results stop the world. For example today we also discard any pending desired composed resource state changes without SSA-ing them if a function returns a fatal result.

@dalton-hill-0
Copy link
Contributor Author

@negz - continuing discussion

  • Should we just punt on events completely, rely on status conditions to communicate things to claim users?

IMO this would be fine for the initial implementation (assuming we can set status conditions in the fatal path).
If desired in the future, we can add logic that says "emit event on status condition change".

In the event of a fatal result the conditions will not be set.

I think we need a way for the function author to set status conditions during a fatal case. This way the author can clearly communicate to the user what went wrong and any actions that need to be taken to resolve the issue.
This could be accomplished with minimal changes if conditions can be returned as function results. The function author would return a fatal result (with condition fields), and that condition could be set on both the composite and claim.

As a function author, I can already update the Claim's status for non-fatal results (via custom status fields e.g., status.customConditions).

@negz
Copy link
Member

negz commented May 1, 2024

IMO this would be fine for the initial implementation (assuming we can set status conditions in the fatal path).
If desired in the future, we can add logic that says "emit event on status condition change".

I've come around to what you first had - having the XR controller emit events that target the claim. 😄 I'd prefer a better separation of concerns, but everything else just feels like I'm overthinking it.

I think we need a way for the function author to set status conditions during a fatal case.

Hrm, yeah I think I agree.

So far I've thought of fatal results as internal server errors. Things that aren't actionable to claim authors. For example I return a fatal result if something has gone horribly wrong, like my function can't parse its RunFunctionRequest. I'll also return a fatal result if the function's input (from the Composition) isn't valid.

I could imagine a scenario where a function pipeline returned a fatal result that was relevant to a claim author though. Maybe the function loads data from some external system based on input from the claim. If the claim author told it to load data from the wrong place, the function would want to return a fatal result to stop the reconcile process completely.

Even if a fatal result isn't actionable by the claim author it would still be helpful to let them know it's happening. That way the claim author could distinguish between "my XR isn't ready yet, but is healthy and progressing toward readiness" from "my XR isn't ready because it's hitting an error".

What do you think about adding a new CompositeSynced status conditions to all claims? It would mirror the XR's Synced status condition, but with a generic message. So instead of copying the specific error message from the XR to the claim, it just has a message like "The composite resource couldn't sync its state due to an error".

I'm not really sure how to handle fatal errors that are actionable by claim authors, though. Maybe some way to bubble up from function -> XR status -> claim status that the current XR Synced status condition was actually relevant to the claim?

(Apologies for the super delayed response - life got in the way for a few weeks.)

@dalton-hill-0
Copy link
Contributor Author

What do you think about adding a new CompositeSynced status conditions to all claims?

I agree this would be useful to allow claim author's to distinguish between not_ready && progressing vs not_ready && not_progressing. Though function authors should also have the ability to add their own custom status conditions IMO.

I'm not really sure how to handle fatal errors that are actionable by claim authors, though. Maybe some way to bubble up from function -> XR status -> claim status that the current XR Synced status condition was actually relevant to the claim?

I think what we had discussed previously will solve this:
Each function result may contain a condition. If the result contains a condition and targets both the composite and claim, then the condition will be added to both. This will allow function authors to provide visibility to claim authors by using status conditions that are relevant to each implementation, such as DatabaseReady, ApplicationReady, etc.

@negz
Copy link
Member

negz commented May 1, 2024

I think what we had discussed previously will solve this

@dalton-hill-0 I agree, but that approach would mean we'd have to address what I described in #5450 (comment), right?

If a fatal result targets the claim and composite, we could use a special entry in the XR's claimConditions array to tell the claim controller that it should set the claim's CompositeSynced condition message to the XR's Synced condition message. So instead of the claim getting a generic message like "The Composite resource couldn't sync due to an error" the CompositeSynced condition message would contain the same information as the XR's Synced condition message.

The downsides of this approach are:

  • You have to use the same error message on the XR and the claim.
  • You don't get any control over the claim condition type. It's always CompositeSynced.

Maybe that's not so bad though?

Perhaps we can come up with a better name than CompositeSynced. I think we want something that:

  • Is meaningful/intuitive to claim authors
  • Is fairly generic - when True means roughly "this claim is working without error"

Maybe just Healthy? There's precedent for that condition in our package controllers.

So in this world:

  • You can add specific claim conditions like DatabaseReady if your function logic is working.
  • If your function logic hits an error, that always shows up in the claim's Healthy status condition.
  • By default the error is "internal" to the XR so we add a generic error message to the claim's Healthy condition
  • If the fatal result targets the claim and the composition we propagate the detailed error to the Healthy condition

I think this could work if we believe that fatal composition function errors can be cleanly divided into two buckets:

  • Errors that are relevant and actionable to the claim author (specific Healthy condition message)
  • Errors that aren't relevant and actionable to the claim author (generic/obfuscated Healthy condition message)

Where it wouldn't work is if we think there will be cases where we want both. Where we want a detailed, specific error message on the XR and a different, but also custom error message on the claim.

@dalton-hill-0
Copy link
Contributor Author

@negz

I think not allowing the function author to set custom conditions in a fatal case could lead to a confusing claim state for the user.

For example, let's say we have an observe only DB as part of a claim, which is represented by the DatabaseAvailable type.

  1. User correctly creates the claim with a valid DB reference.
- type: Healthy
  status: True
- type: DatabaseAvailable
  status: True
  1. User updates the DB reference, the reference is now incorrect. The function returns a fatal result, but DatabaseAvailable still appears true, as it cannot be updated in the fatal case.
- type: Healthy
  status: False
  message: "The database provided does not exist or you do not have access."
- type: DatabaseAvailable
  status: True

Possible solution:

You had mentioned before we may want to try splitting Result into Event and Condition. I think this would solve the case above, while also providing users the ability to not emit events on every reconciliation.

Event would control:

  • emitting events to the composite and claim (depending on target)
  • marking a result as fatal to stop execution
  • the Healthy condition

Condition would control:

  • custom conditions on the composite and claim (depending on target)

I think this gives us everything we need:

  • function authors can send events to the claim
  • function authors can set custom status conditions on the claim (even in fatal)
  • function authors get the benefit of the new Healthy condition by using existing APIs
  • since events and conditions would be separate, authors can always set the conditions, without worrying about event spam.

I could see an argument that it feels more intuitive to directly set conditions on the desired XR. But that approach would prevent us from setting custom conditions in the fatal case.

@negz
Copy link
Member

negz commented May 1, 2024

I think not allowing the function author to set custom conditions in a fatal case could lead to a confusing claim state for the user.

@dalton-hill-0 we already have a problem very similar to the one you describe today. Whenever a controller hits a reconcile error it will update the Synced status condition but not the Ready status condition. So it's possible that:

  • The resource became Ready: True
  • Later, the resource started hitting an error causing it to be stuck in Synced: False
  • In the meantime, the resource isn't actually ready anymore

Our controllers never get to the code that assesses whether the resource is Ready. They hit an error, update the Synced condition and return. So when I see a resource that is Ready: True, Synced: False I read it as "this resource was ready the last time it synced successfully", which doesn't necessarily mean "this resource is ready now".

This case seems similar, except a claim having Healthy: False might indicate other conditions (like DatabaseReady) were stale. This does seem less intuitive to me, given that to me "not synced" intuitively means "could be stale" while "not healthy" doesn't.

One option would be to automatically set all "custom" conditions to status: Unknown whenever a claim is unhealthy, to make it explicit that the controller doesn't currently know the state of those conditions due to a fatal error. I suspect in a lot of cases Unknown will be the most appropriate status if we hit a fatal error, since the fatal error is likely to block the function from determining whether the condition is true or false.

Possible solution

Splitting results into events and conditions is pretty compelling. The only parts I don't love are:

  • I assume we'd rename Result to Event to avoid a breaking change. Fatal "events" are then potentially surprising, because they're more than an event. They produce an event, update the Synced status condition, and stop execution.
  • You'd have two different ways to set conditions. You could set conditions in the RunFunctionResponse or actually mutate the XR's conditions.

If we felt okay about making a backward incompatible change, I would:

  • Remove results.
  • Add events, and have them only produce an event.
  • Add conditions, and have them only set a condition.
  • Communicate fatal errors by just returning an error from the RunFunction method.

We try not to make breaking changes to beta features, but our contract does allow us to do it.

@negz
Copy link
Member

negz commented May 1, 2024

Communicate fatal errors by just returning an error from the RunFunction method.

Actually, maybe this doesn't make sense. The function would need to return a valid RunFunctionResponse in order to produce whatever conditions and events you want, and also a non-nil error to indicate that there was a fatal error.

If the function returns an error, can Crossplane trust that the RunFunctionResponse is valid? That it's not corrupt or half-formed? I think it probably can't.

@dalton-hill-0
Copy link
Contributor Author

dalton-hill-0 commented May 2, 2024

@negz

This will be a long one.. I split it into two separate threads. I think this may work nicely with your proposed Healthy condition.

Splitting Result

Could we avoid a breaking change by making the following changes?

message RunFunctionResponse {
  ...

  // We continue to accept Results as they were previously defined (event + fatal) for backwards compatibility
  // The SDK will be updated to no longer write to this field
  repeated Result results = 3;

  // Add HealthStatus that corresponds with the new Healthy condition and acts as the "fatal" trigger
  HealthStatus health_status
  // Add Events for just handling events
  repeated Event events
  // Add Conditions for just handling conditions
  repeated Condition conditions
  ...
}

// will (partially) determine the value of the new Healthy status condition on both the composite and claim
// if false, will trigger a warning event on both the composite and claim
// if message set, the resulting condition and event on the composite will always contain the message
// if targets xr & claim, the resulting condition and event on the claim will contain the message
message HealthStatus {
	// Did the execution of the function result in a healthy or unhealthy state
	bool Healthy
	// if set, message to appear in events and conditions
	string message
	// target xr or xr & claim
	Target target 
}

Handling Conditions

From what I understand there are two main approaches to conditions and we must choose between a better UX for claim authors or a better UX for function authors.

Approach-A: Set them on the desired XR. (better UX for function authors)
Approach-B: Return them as a new field to RunFunctionResponse (better UX for claim authors)

Personally, I think we should choose the UX of the claim author. In addition to being more intuitive for the average claim author, it would also improve the ability to integrate claims into external systems, such as UIs.

For example, let us say we offer a claim that involves the following steps:

  1. Provision cloud networking
  2. Provision a cluster using the above networking
  3. Install Istio on the cluster

As some of these steps can take time, it would be nice to display the status of each to the user.

- type: Healthy
  status: True
- type: NetworkingReady
  status: True
- type: ClusterReady
  status: True
- type: IstioReady
  status: False
  message: "Installing Istio"

In the event that one of these steps encounters an error (e.g., IstioReady), it would be nice if the error can be assigned to the correct item.
Not only would this be more intuitive to a regular claim author, but if the claim is surfaced by a UI, this would be much easier to display.

Example of Approach-A.

- type: Healthy
  status: False
  message: "Failed to install Istio: reason xyz"
- type: NetworkingReady
  status: Unknown
- type: ClusterReady
  status: Unknown
- type: IstioReady
  status: Unknown

Example of Approach-B.

- type: Healthy
  status: False
  message: "Failed to install Istio: reason xyz"
- type: NetworkingReady
  status: True
- type: ClusterReady
  status: True
- type: IstioReady
  status: False
  message: "Failed to install Istio: reason xyz"

@negz
Copy link
Member

negz commented Jun 1, 2024

@dalton-hill-0 Okay, I spent today thinking about this design. Here's where I'm at.

I agree we should add conditions, distinct from results, to RunFunctionResponse.

I think we should keep results rather than renaming it to events, or supporting both events and (deprecated) results. I don't think the churn is worth the benefit. I can live with a result being mostly an event, with fatal results being special.

I suggest we make fatal results special, in that they can't target a claim - only the composite. They're already special in that they set the synced status condition. The expectation for functions would be that if something fatal happens and you want to surface it on the claim, use separate results and/or conditions. For example:

if err != nil {
    // These actually target the claim and XR. Just playing around with a possible SDK API.
    // You don't need to emit a warning result and a condition, you could pick one or the other.
    response.ClaimWarning(rsp, errors.New("a user-friendly fatal thing happened"))
    response.ClaimCondition(rsp, "DatabaseReady", condition.False, "FatalError", "a user-friendly fatal thing happened"))

    // This only targets (and only can target) the XR. It's what actually stops the reconcile.
    response.Fatal(rsp, errors.Wrap(err, "a scary detailed fatal thing happened"))
    return rsp, nil
}

When the pipeline returns a fatal result, we'd update any "user-defined" conditions (e.g. DatabaseReady) that were included in the result. Any user-defined conditions that weren't included in the result would be set to status: Unknown. This covers the case where function 1 returns a fatal result, preventing function 2 from ever running. Any status conditions owned by function 2 would be unknown.

Finally, I think we should implement Healthy per #5643. I don't think functions should be able to directly influence this condition - e.g. set its message, etc. They'd be able to indirectly influence it by returning a fatal result, which would set Synced to false and thus Healthy to false.

This does mean there's an edge case where:

  • The claim and XR are synced (i.e. there was no fatal result)
  • All composed resources are synced and healthy
  • A custom status condition set by a function is in an error state

Healthy would be True, contradicting the custom status condition. We could address this by adding a field like bool affects_health = 6; to the Condition message to force the Healthy condition to become False. I'd punt on doing that until we're sure the edge case really happens in the wild though.

The example Go code above would result in claim conditions like this:

status:
  conditions:
  - type: "Synced"
    status: "True"
    reason: "ReconcileSuccess"
  - type: "Healthy"
    status: "False"
    reason: "UnhealthyCompositeResource"
    message: "The composite resource is not healthy"
  - type: "Ready"
    status: "False"
    reason: "BindCompositeResource"
    message: "The composite resource is not yet ready"
  - type: "DatabaseReady"
    status: "False"
    reason: "FatalError"
    message: "a user-friendly fatal thing happened"

Mechanics wise, I think we stick with the claimConditions array on the XR to let the claim controller propagate conditions from XR -> claim, and just have the XR controller write events to the claim. The protobufs would look something like:

message RunFunctionResponse {
  repeated Result results = 3;
  repeated Condition conditions = 6;
}

message Result {
  Severity severity = 1;
  string message = 2;
  optional string reason = 3;
  optional Target target = 4;
}

message Condition {
  string type = 1;
  Status status = 2;
  string reason = 3;
  optional string message = 4;
  optional Target target = 5;
}

With this all in place, I believe:

  • Crossplane always lets claim consumers know when a claim is unhealthy, .
  • Existing functions keep working as they do today. No need to update anything if they don't want to.
  • Function authors can communicate status to claim consumers via events and custom status conditions.
  • Function authors can communicate status to claim consumers even if a fatal result occurs.
  • Composition authors can communicate status to claim consumers using a generic function, like this example.

@dalton-hill-0
Copy link
Contributor Author

@negz - I like what you have proposed. I'll begin implementing it as I get time.

@negz
Copy link
Member

negz commented Jun 3, 2024

@dalton-hill-0 Thanks! I can update the design doc to match our latest thinking. Maybe I'll wait until you've taken another pass on implementation though, just to make sure we're confident this will work out in practice.

@negz
Copy link
Member

negz commented Jun 3, 2024

@crossplane/crossplane-maintainers @vfarcic - FYI, see #5450 (comment) for some planned changes to https://github.com/crossplane/crossplane/blob/v1.16.0/design/one-pager-fn-claim-conditions.md based on what we've found at implementation time.

Let us know ASAP if you have any questions or concerns.

@dalton-hill-0 dalton-hill-0 changed the title WIP - Composition Function Events and Status Conditions Composition Function Events and Status Conditions Jun 17, 2024
@dalton-hill-0 dalton-hill-0 marked this pull request as ready for review June 18, 2024 16:01
@dalton-hill-0
Copy link
Contributor Author

@negz
The implementation is ready to review.
Related PRs:

Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dalton-hill-0! I like this direction.

apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
apis/apiextensions/fn/proto/v1beta1/run_function.proto Outdated Show resolved Hide resolved
Comment on lines 446 to 437
// TODO(dalton, reviewers): Safe to add events here? This is a behavior
// change, but allows function authors to emit events prior to failure.
return CompositionResult{Events: events, Conditions: conditions}, errors.Errorf(errFmtFatalResult, fn.Step, rs.GetMessage())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, I don't think it's unsafe but it does seem a little unintuitive.

In my experience it's pretty rare to return a partial result and an error in Go. Would the caller need to parse the error to know whether it could trust the CompositionResult? 🤔

I'd prefer to avoid this, but what's the alternative, given we want to write conditions even in the face of a fatal result? I'd guess the alternatives would be either:

  • We update conditions here in the FunctionComposer.
  • We find some way other than returning an error (e.g. a new field in CompositionResult{}to represent a fatal result.

I think I like what you have the most of the three options, but I don't love any of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, I agree that returning a partial result feels wrong.

We find some way other than returning an error (e.g. a new field in CompositionResult{} to represent a fatal result.

This seems like a fair change IMO. For example the composite reconciler could check for err != nil || result.FatalResult != nil. It would then have logic to only handle the results and conditions in the case of err ==nil && FatalResult != nil. This is effectively what is happening currently because all other errors do not return events or conditions.

I don't feel strongly either way, LMK if you end up wanting to change this.

internal/controller/apiextensions/composite/reconciler.go Outdated Show resolved Hide resolved
conditionTypesSeen map[xpv1.ConditionType]struct{}
}

func handleCommonCompositionResult(ctx context.Context, r *Reconciler, res CompositionResult, xr *composite.Unstructured, log logging.Logger) compositionResultMeta {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make this a method on Reconciler? It looks like it uses the reconciler's client, event handler, and logger.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering if it might be better to just "inline" this into the two places it's called. A little repetitive, but maybe easier to follow. I often prefer a little repetition over a little abstraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I turned this into a method as suggested by the first comment. It feels pretty long to inline IMO, and abstracting it prevents a good amount of nesting.

If you feel strongly about inlining it, LMK and I'll change it.

Comment on lines 648 to 658
for _, c := range xr.GetConditions() {
if xpv1.IsSystemConditionType(c.Type) {
continue
}
if _, ok := meta.conditionTypesSeen[c.Type]; !ok {
c.Status = corev1.ConditionUnknown
c.Reason = reasonPriorFailure
c.Message = ""
xr.SetConditions(c)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a comment explaining what's happening here. It took me a moment to connect the dots.

Also, I found PriorFailure as a reason a little confusing. I think you mean "a failure happened prior to determining this condition's status", but it read to me at first as "this condition had a prior failure".

That said, I'm struggling to come up with a better name. 🤔 Something that says "Crossplane couldn't reach the code that updates this condition because it hit a fatal error." Maybe just FatalError? I guess we could include what I just wrote in the message to make it extra clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking of commenting this bit, I should have done it 😄.
I changed this to FatalError and added the message as well. LMK what you think.

conditionTypesSeen map[xpv1.ConditionType]struct{}
}

func (r *Reconciler) handleCommonCompositionResult(ctx context.Context, res CompositionResult, xr *composite.Unstructured, log logging.Logger) compositionResultMeta {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need log as an argument here? I think there's an r.log field of the Reconciler struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included it because of the WithValues setup at the top of Reconcile.

	log := r.log.WithValues("request", req)
	log.Debug("Reconciling")
...
	log = log.WithValues(
		"uid", xr.GetUID(),
		"version", xr.GetResourceVersion(),
		"name", xr.GetName(),
	)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right.

Given we pass xr to this method my preference would be to use r.log and repeat this WithValues at the top of the method, rather than passing the logger as an argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would lose out on the request part of the logger, unless if we pass req to the function. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love it but I'd probably err toward losing that rather than passing the logger as an argument.

@dalton-hill-0
Copy link
Contributor Author

dalton-hill-0 commented Jun 21, 2024

@negz

During testing, I noticed that events quickly get rate limited.

It looks like by default each object gets an allowance of 25 events, which has a refill rate of 1 event every 5 minutes.

This could be problematic as we usually reconcile once a minute. So if both the system and the user emit one event on the XR per reconcile, and once the event limit is reached, only the first emitter be able to create an event, and only once every few reconciles. This means that either the user or the system will be prevented from emitting any more events on the XR. This only gets worse when you consider chaining multiple functions together.

I don't think overriding the default settings is a good idea as it appears the control for this exists in a single place and would affect all of crossplane.

# cmd/crossplane/core/core.go
eb := record.NewBroadcaster(record.WithCorrelatorOptions(record.CorrelatorOptions{
	BurstSize: 25,
	# This is very high, I just used this value for testing the root cause.
	QPS:       10,
}))

This likely affects the current code in the master branch with regards to XRs.

Additionally, if we add the ability for the XR controller to emit events for the Claim, we would create the possibility that user events to the claim will prevent us from seeing system events from the claim controller.
I tested this and it appears the claim and XR controllers share different event allowances, so emitting claim events from the XR did not affect the claim controller's ability to emit its own events.

@negz
Copy link
Member

negz commented Jun 21, 2024

During testing, I noticed that events quickly get rate limited.

@dalton-hill-0 Interesting. We definitely don't want to increase the event rate limit. This would only be an issue in error paths right, since we said (as a best practice) functions shouldn't be emitting results/events when everything is normal.

What events were you emitting in your tests? I'm trying to get a sense of how much the event frequency increases relative to master right now.

@dalton-hill-0
Copy link
Contributor Author

What events were you emitting in your tests? I'm trying to get a sense of how much the event frequency increases relative to master right now.

I was emitting 8+ events to manually test that each type worked as expected. I agree this would not be supported.

This would only be an issue in error paths right, since we said (as a best practice) functions shouldn't be emitting results/events when everything is normal.

I have not confirmed this, but in theory it could occur if the system emits a single event each time. Such as the event that gets created for having no warning events. I suspect this will slowly fill up the 25 event slots, and then if the function tries to emit an event after this, either the function or the system will be allowed to emit the single event, but the other will be ignored. I could be misunderstanding this, so I'll test to confirm.

@dalton-hill-0
Copy link
Contributor Author

I observed the following behavior with this setup:

cluster: kind
image: kindest/node:v1.29.2
crossplane branch: copy-events-to-claim-v2

I tested some different event scenarios and this is what I observed.

When running with no events coming from the function, we just see the "Successfully composed resources" event from the crossplane system on each reconcile. For awhile each event will be created as it comes in. For instance an event will be seen at 1m, 2m, 3m, etc. After some time (I checked after an hour, but it could be sooner) the events begin to be seen every 5 minutes, however, each time the event is seen it increases the count by 5. So it maintains a ratio of one event per minute.

It gets interesting when I begin emitting two separate events. So we have the system event and a custom function event, but I assume this would apply to two system events as well. What I observed is that after the event allowance fills up, you will only see the first event. So in this case I checked back after ~2.5 hours and noticed only the function event appearing. I let this run over the weekend and from what I can tell, it continues indefinitely. So I observed ~3k+ function events without seeing any system events. I then made a change to stop emitting the function event, after which I saw the full count of system events appear (one event per minute) retroactively. So the count of events is maintained, even though the event itself is never created on the API server until the function event was stopped.

If this behavior is not unique to my local environment, it means that if a function returns a fatal result and a warning result, only the fatal result will be seen after the event allowance is reached. The warning events will be effectively ignored.

This doesn't seem to be unique to this PR, but I thought it was worth mentioning. We may need to document this somewhere.

@@ -608,6 +659,22 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
if kerrors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}

meta := r.handleCommonCompositionResult(ctx, res, xr, log)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handleCommonCompositionResult method is the only part of the implementation that I'm not sure about. It's more intuition (or taste?) than anything concrete. I think part of what's bugging me is there isn't a very clean separation of responsibilities. For example handleCommonCompositionResult sets some conditions, but we also set some conditions here in Reconcile.

I don't think this needs to be a blocker though. I'd like to take a pass and see if I can refactor it a bit, but I'm happy to do that after this is merged if you're okay with it. My focus is temporarily on another project, so I don't want to delay this further waiting to get time to play with it. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you refactor this please LMK, I would like to see how you approach this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do! I probably won't get time for a few weeks.

Copy link
Member

@negz negz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @dalton-hill-0, thanks for taking the time to work through a nuanced first contribution!

@negz
Copy link
Member

negz commented Jun 25, 2024

This doesn't seem to be unique to this PR, but I thought it was worth mentioning. We may need to document this somewhere.

@dalton-hill-0 Could you raise an issue against this repo to track this? I agree it sounds above the scope of this PR.

Could you please:

Once that's done I'll merge crossplane/crossplane-runtime#738 and then this one.

@dalton-hill-0
Copy link
Contributor Author

@negz

Could you raise an issue against this repo to track this? I agree it sounds above the scope of this PR.

#5802

Cleanup the commit history here and on crossplane/crossplane-runtime#738

Done

Make sure there's a docs tracking issue for this feature

Added a comment to an existing issue that is closely related.

@negz
Copy link
Member

negz commented Jun 28, 2024

@dalton-hill-0 I just merged crossplane/crossplane-runtime#738, I'm guessing this needs to be updated to include the latest runtime to get the build to pass.

@dalton-hill-0
Copy link
Contributor Author

dalton-hill-0 commented Jun 28, 2024

@negz

I'm guessing this needs to be updated to include the latest runtime to get the build to pass.

Yes, I have updated the mod file to reference the latest runtime. CI should work now 🤞

Seems running reviewable changed some of the proto field references, which is causing failures. Looking into it.

Edit: Fixed the references, reviewable working on local.

Edit: If curious what changes were made, see internal/controller/apiextensions/composite/composition_functions.go. Changes were just removing the usage of ptr package to dereference condition.Message and result.Reason. The linter/formatter wanted us to use the getters instead.

@negz negz merged commit 2a427fc into crossplane:master Jun 28, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide a Channel of Communication to the Claim
2 participants