feat(metrics): adding cloudwatch metric filter #2176

Draft

jonahkaye wants to merge 1 commit into base: develop

Conversation

@jonahkaye (Member) commented May 29, 2024

Refs: #1764

Description

  • adding CloudWatch metrics and filters and removing the Sentry captures (see the sketch below)
  • note: will not affect the 500-resource limit since all resources are in a nested stack
  • note: started squashing my commits :)
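As a rough outline, here is a minimal sketch of the pattern this PR adds (adapted from the diff discussed below; the helper name and arguments are illustrative, the actual method is addMetricFiltersAndAlarms in the nested stack):

```ts
import { Construct } from "constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import { FilterPattern, LogGroup } from "aws-cdk-lib/aws-logs";

// Sketch: one metric filter on the lambda's log group plus an alarm on the
// resulting metric; the Sentry capture is replaced by a log line the filter matches.
function addMetricFilterAndAlarm(
  scope: Construct,
  logGroup: LogGroup,
  filterPattern: string,
  threshold: number
): void {
  // Strip spaces so the pattern can be reused in construct ids and metric names.
  const sanitized = filterPattern.replace(/\s+/g, "");
  const metricFilter = logGroup.addMetricFilter(`${sanitized}-MetricFilter`, {
    metricNamespace: "IHEGatewayV2",
    metricName: sanitized,
    filterPattern: FilterPattern.anyTerm(filterPattern),
    metricValue: "1",
  });
  metricFilter.metric().createAlarm(scope, `${sanitized}-Alarm`, {
    threshold,
    evaluationPeriods: 1,
    alarmDescription: `Alarm if the lambda logs "${filterPattern}" more than ${threshold} times`,
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });
}
```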

Testing

  • Staging
    • branched to staging

Release Plan

  • Merge this

result,
xcpdRequest,
jonahkaye (Member Author):

including the request and condensing the error for easier debugging

jonahkaye (Member Author):

We still want to capture these since they are an immediate issue that needs to be remediated if they occur. They are an error in our system, not in an external gateway's system.

`${msg}, registryError: ${JSON.stringify(
registryErrorList
)}, outboundRequest: ${JSON.stringify(outboundRequest)}`
);
jonahkaye (Member Author):

removing capture, adding log to be caught by metric filter

jonahkaye (Member Author):

no multi-line logs

log(
-  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${error}`
+  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${errorString}${errorDetails}`
jonahkaye (Member Author):

cleaning log for metric filter. removing capture

log(
-  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${error}`
+  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${errorString}${errorDetails}`
jonahkaye (Member Author):

ditto

`${msg}, jsonObj: ${JSON.stringify(jsonObj)}, outboundRequest: ${JSON.stringify(
outboundRequest
)}`
);
jonahkaye (Member Author):

adding log for metric filter


this.addMetricFiltersAndAlarms(patientDiscoveryLambda, "PatientDiscoveryLambda", props, [
{ filterPattern: "Aborted Error is present in response", threshold: 100 },
{ filterPattern: "Failure Sending SAML Request", threshold: 100 },
jonahkaye (Member Author):

we have 115/1500 endpoints erroring in prod, split between the two error types:

  • aborted errors (inline SOAP errors)
  • HTTP errors (400s, 500s, etc.)

100 allotted to each should be a good buffer for now.

this.addMetricFiltersAndAlarms(documentQueryLambda, "DocumentQueryLambda", props, [
{ filterPattern: "RegistryErrorList is present in response", threshold: 3 },
{ filterPattern: "Failure Sending SAML Request", threshold: 5 },
]);
jonahkaye (Member Author):

From PostHog, we send 3 DQs on average per patient right now. I don't know how many DQs we send per minute, so this threshold is somewhat of a guess right now.

…nd drs

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): refactoring and cleaning

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): upping xcpd error threshold

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): cleaning up logical ids for metric filters

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): upping thresholds for dq and drs

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>
@leite08 (Member) left a comment:

Thinking about how this will play out on a day-to-day basis... How could we get a simplified list of data to act upon, something like a report about the external GWs and a count of each type of error? Does it make sense to move that to a DB table instead of relying on logs?

@@ -47,27 +45,15 @@ export async function sendSignedDQRequests({
};
//eslint-disable-next-line @typescript-eslint/no-explicit-any
} catch (error: any) {
const msg = "HTTP/SSL Failure Sending Signed DQ SAML Request";
const msg = "Failure Sending SAML Request";
leite08 (Member):

What do you think about always using Error when referring to these situations? If we use Failure as well, it makes the log filters harder to write, and also harder to use when we're debugging manually in the logs.

const msg = "Failure Sending SAML Request";
const errorString: string = errorToString(error);
const errorDetails = error?.response?.data
? `, error details: ${JSON.stringify(error?.response?.data)}`
leite08 (Member):

nit: no need for the question marks, since we're already checking the value is present at the start of the ternary
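A quick sketch of the suggested version, assuming the falsy branch is an empty string (that part isn't shown in the diff):

```ts
const errorDetails = error?.response?.data
  ? `, error details: ${JSON.stringify(error.response.data)}`
  : "";
```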


private addMetricFiltersAndAlarms(
lambdaFunction: Lambda,
functionName: string,
leite08 (Member):

Could get this from lambdaFunction
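A sketch of that suggestion, assuming the name is only used to build construct ids and metric names; lambdaFunction.functionName is a deploy-time token, so the construct id is the safer source for those:

```ts
// Derive the name from the Lambda construct instead of passing it in separately.
// functionName resolves only at deploy time, while node.id is a plain string.
const functionName = lambdaFunction.node.id;
```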

const metricFilter = logGroup.addMetricFilter(
`${functionName}-${sanitizedFilterPattern}-MetricFilter`,
{
metricNamespace: "IHEGatewayV2",
leite08 (Member):

There's a Metriport namespace; can we place this as a "Service" under that one, please?

E.g., there's an OSS API service there: https://us-west-1.console.aws.amazon.com/cloudwatch/home?region=us-west-1#metricsV2?graph=~()&query=~'*7bMetriport*2cService*7d
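A sketch of that change, assuming the logs stay plain text. A true Service dimension on a metric filter needs JSON pattern fields (the CDK dimensions option maps a dimension to a log field), so here the service is kept in the metric name instead:

```ts
const metricFilter = logGroup.addMetricFilter(
  `${functionName}-${sanitizedFilterPattern}-MetricFilter`,
  {
    // Shared top-level namespace, with the service identified in the metric name.
    metricNamespace: "Metriport",
    metricName: `IHEGatewayV2-${functionName}-${sanitizedFilterPattern}`,
    filterPattern: FilterPattern.anyTerm(filterPattern),
    metricValue: "1",
  }
);
```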

{
metricNamespace: "IHEGatewayV2",
metricName: `${functionName}-${sanitizedFilterPattern}`,
filterPattern: FilterPattern.anyTerm(filterPattern),
leite08 (Member):

Won't this result in us getting hits for log entries that contain any of the components of the filter?
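If anyTerm does end up matching individual words, an exact-phrase alternative is a literal pattern: a double-quoted string in a CloudWatch filter pattern matches the whole phrase. Sketch:

```ts
// Match the full message rather than any single term within it.
filterPattern: FilterPattern.literal(`"${filterPattern}"`),
```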

Comment on lines +84 to +91
const alarm = metricFilter
.metric()
.createAlarm(this, `${functionName}-${sanitizedFilterPattern}-Alarm`, {
threshold,
evaluationPeriods: 1,
alarmDescription: `Alarm if ${functionName} encounters ${filterPattern}`,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
leite08 (Member):

I suggest we don't alert on these. CW-initiated alarms can't be "archived"/snoozed like we do with Sentry, so we would need to make a release to silence an alarm or update its configs.

We could instead add a widget to the prod dashboard and monitor it once in a while as part of on-call (if not an auto report, as indicated in the main review comment)?
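A sketch of the dashboard alternative, assuming an existing cloudwatch.Dashboard for prod (prodDashboard is a hypothetical name):

```ts
import { Duration } from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";

// Graph the filtered error count on the prod dashboard instead of alarming on it.
prodDashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: `${functionName}: ${filterPattern}`,
    left: [metricFilter.metric({ statistic: "sum", period: Duration.minutes(5) })],
  })
);
```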

Comment on lines +54 to +55
{ filterPattern: "RegistryErrorList In Soap Response", threshold: 5 },
{ filterPattern: "Failure Sending SAML Request", threshold: 5 },
leite08 (Member):

I'd reconsider the filter pattern being so reliant on the full message. It's very prone to break as we maintain the code and potentially update the error message, since there's no direct link back to this code.

Maybe we could have some sort of fixed prefix that we add to all log entries we want to have a metric for, and that could be shared across the code and the infra using Core or Shared?
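A sketch of the shared-prefix idea; the constant name, value, and package path are illustrative, not the actual Core/Shared layout:

```ts
// packages/shared/src/log-metric.ts (hypothetical location)
export const metricLogPrefix = "IHE_METRIC";

// Lambda code: tag every log entry we want a metric for with the stable prefix.
log(`${metricLogPrefix} ${msg}, cxId: ${cxId}, patientId: ${patientId}`);

// CDK stack: filter on the prefix instead of the full human-readable message.
filterPattern: FilterPattern.literal(`"${metricLogPrefix}"`),
```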

@jonahkaye jonahkaye marked this pull request as draft June 3, 2024 16:56