feat(metrics): adding cloudwatch metric filter #2176

Draft

jonahkaye wants to merge 1 commit into base: develop

Conversation

@jonahkaye (Member) commented May 29, 2024

Refs: #1764

Description

  • adding CloudWatch metrics and filters and removing the Sentry captures (see the sketch below)
  • note: will not affect the 500-resource limit since all resources are in a nested stack
  • note: started squashing my commits :)
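As a rough outline, here is a minimal sketch of the pattern this PR adds (adapted from the diff discussed below; the helper name and arguments are illustrative, the actual method is addMetricFiltersAndAlarms in the nested stack):

```ts
import { Construct } from "constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import { FilterPattern, LogGroup } from "aws-cdk-lib/aws-logs";

// Sketch: one metric filter on the lambda's log group plus an alarm on the
// resulting metric; the Sentry capture is replaced by a log line the filter matches.
function addMetricFilterAndAlarm(
  scope: Construct,
  logGroup: LogGroup,
  filterPattern: string,
  threshold: number
): void {
  // Strip spaces so the pattern can be reused in construct ids and metric names.
  const sanitized = filterPattern.replace(/\s+/g, "");
  const metricFilter = logGroup.addMetricFilter(`${sanitized}-MetricFilter`, {
    metricNamespace: "IHEGatewayV2",
    metricName: sanitized,
    filterPattern: FilterPattern.anyTerm(filterPattern),
    metricValue: "1",
  });
  metricFilter.metric().createAlarm(scope, `${sanitized}-Alarm`, {
    threshold,
    evaluationPeriods: 1,
    alarmDescription: `Alarm if the lambda logs "${filterPattern}" more than ${threshold} times`,
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });
}
```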

Testing

  • Staging
    • branched to staging

Release Plan

  • Merge this

result,
xcpdRequest,
jonahkaye (Member Author):

including the request and condensing the error for easier debugging

jonahkaye (Member Author):

We still want to capture these since they are an immediate issue that needs to be remediated if they occur. They are an error in our system, not in an external gateway's system.

`${msg}, registryError: ${JSON.stringify(
registryErrorList
)}, outboundRequest: ${JSON.stringify(outboundRequest)}`
);
jonahkaye (Member Author):

removing capture, adding log to be caught by metric filter

jonahkaye (Member Author):

no multi-line logs

log(
-  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${error}`
+  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${errorString}${errorDetails}`
jonahkaye (Member Author):

cleaning log for metric filter. removing capture

log(
-  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${error}`
+  `${msg}, cxId: ${cxId}, patientId: ${patientId}, gateway: ${request.gateway.homeCommunityId}, error: ${errorString}${errorDetails}`
jonahkaye (Member Author):

ditto

`${msg}, jsonObj: ${JSON.stringify(jsonObj)}, outboundRequest: ${JSON.stringify(
outboundRequest
)}`
);
jonahkaye (Member Author):

adding log for metric filter


this.addMetricFiltersAndAlarms(patientDiscoveryLambda, "PatientDiscoveryLambda", props, [
{ filterPattern: "Aborted Error is present in response", threshold: 100 },
{ filterPattern: "Failure Sending SAML Request", threshold: 100 },
jonahkaye (Member Author):

we have 115/1500 endpoints erroring in prod, split between the two error types:

  • aborted errors (inline SOAP errors)
  • HTTP errors (400s, 500s, etc.)

100 allotted to each should be a good buffer for now.

this.addMetricFiltersAndAlarms(documentQueryLambda, "DocumentQueryLambda", props, [
{ filterPattern: "RegistryErrorList is present in response", threshold: 3 },
{ filterPattern: "Failure Sending SAML Request", threshold: 5 },
]);
jonahkaye (Member Author):

From PostHog, we send 3 DQs on average per patient right now. I don't know how many DQs we send per minute, so this threshold is somewhat of a guess right now.

…nd drs

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): refactoring and cleaning

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): upping xcpd error threshold

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): cleaning up logical ids for metric filters

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>

feat(metrics): upping thresholds for dq and drs

Refs: #1667
Signed-off-by: Jonah Kaye <[email protected]>
@leite08 (Member) left a comment:

Thinking about how this will play out on a day-to-day basis... How could we get a simplified list of data to act upon, something like a report about the external GWs and a count of each type of error? Does it make sense to move that to a DB table instead of relying on logs?

@@ -47,27 +45,15 @@ export async function sendSignedDQRequests({
};
//eslint-disable-next-line @typescript-eslint/no-explicit-any
} catch (error: any) {
const msg = "HTTP/SSL Failure Sending Signed DQ SAML Request";
const msg = "Failure Sending SAML Request";
leite08 (Member):

What do you think about always using Error when referring to these situations? If we use Failure as well, it makes the log filters harder to write, and also harder to use when we're debugging manually in the logs.

const msg = "Failure Sending SAML Request";
const errorString: string = errorToString(error);
const errorDetails = error?.response?.data
? `, error details: ${JSON.stringify(error?.response?.data)}`
leite08 (Member):

nit: no need for the question marks, since we're already checking the value is present at the start of the ternary
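A quick sketch of the suggested version, assuming the falsy branch is an empty string (that part isn't shown in the diff):

```ts
const errorDetails = error?.response?.data
  ? `, error details: ${JSON.stringify(error.response.data)}`
  : "";
```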


private addMetricFiltersAndAlarms(
lambdaFunction: Lambda,
functionName: string,
leite08 (Member):

Could get this from lambdaFunction
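A sketch of that suggestion, assuming the name is only used to build construct ids and metric names; lambdaFunction.functionName is a deploy-time token, so the construct id is the safer source for those:

```ts
// Derive the name from the Lambda construct instead of passing it in separately.
// functionName resolves only at deploy time, while node.id is a plain string.
const functionName = lambdaFunction.node.id;
```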

const metricFilter = logGroup.addMetricFilter(
`${functionName}-${sanitizedFilterPattern}-MetricFilter`,
{
metricNamespace: "IHEGatewayV2",
leite08 (Member):

There's a Metriport namespace; can we place this as a "Service" under that one, please?

E.g., there's an OSS API service there: https://us-west-1.console.aws.amazon.com/cloudwatch/home?region=us-west-1#metricsV2?graph=~()&query=~'*7bMetriport*2cService*7d
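A sketch of that change, assuming the logs stay plain text. A true Service dimension on a metric filter needs JSON pattern fields (the CDK dimensions option maps a dimension to a log field), so here the service is kept in the metric name instead:

```ts
const metricFilter = logGroup.addMetricFilter(
  `${functionName}-${sanitizedFilterPattern}-MetricFilter`,
  {
    // Shared top-level namespace, with the service identified in the metric name.
    metricNamespace: "Metriport",
    metricName: `IHEGatewayV2-${functionName}-${sanitizedFilterPattern}`,
    filterPattern: FilterPattern.anyTerm(filterPattern),
    metricValue: "1",
  }
);
```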

{
metricNamespace: "IHEGatewayV2",
metricName: `${functionName}-${sanitizedFilterPattern}`,
filterPattern: FilterPattern.anyTerm(filterPattern),
leite08 (Member):

Won't this result in us getting hits for log entries that contain any of the components of the filter?
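If anyTerm does end up matching individual words, an exact-phrase alternative is a literal pattern: a double-quoted string in a CloudWatch filter pattern matches the whole phrase. Sketch:

```ts
// Match the full message rather than any single term within it.
filterPattern: FilterPattern.literal(`"${filterPattern}"`),
```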

Comment on lines +84 to +91
const alarm = metricFilter
.metric()
.createAlarm(this, `${functionName}-${sanitizedFilterPattern}-Alarm`, {
threshold,
evaluationPeriods: 1,
alarmDescription: `Alarm if ${functionName} encounters ${filterPattern}`,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
leite08 (Member):

I suggest we don't alert on these. CW-initiated alarms can't be "archived"/snoozed like we do with Sentry, so we would need to make a release to silence an alarm or update its configs.

We could instead add a widget to the prod dashboard and monitor it once in a while as part of on-call (if not an auto report, as indicated in the main review comment)?
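A sketch of the dashboard alternative, assuming an existing cloudwatch.Dashboard for prod (prodDashboard is a hypothetical name):

```ts
import { Duration } from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";

// Graph the filtered error count on the prod dashboard instead of alarming on it.
prodDashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: `${functionName}: ${filterPattern}`,
    left: [metricFilter.metric({ statistic: "sum", period: Duration.minutes(5) })],
  })
);
```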

Comment on lines +54 to +55
{ filterPattern: "RegistryErrorList In Soap Response", threshold: 5 },
{ filterPattern: "Failure Sending SAML Request", threshold: 5 },
leite08 (Member):

I'd reconsider the filter pattern being so reliant on the full message. It's very prone to break as we maintain the code and potentially update the error message, since there's no direct link back to this code.

Maybe we could have some sort of fixed prefix that we add to all log entries we want to have a metric for, and that could be shared across the code and the infra using Core or Shared?
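A sketch of the shared-prefix idea; the constant name, value, and package path are illustrative, not the actual Core/Shared layout:

```ts
// packages/shared/src/log-metric.ts (hypothetical location)
export const metricLogPrefix = "IHE_METRIC";

// Lambda code: tag every log entry we want a metric for with the stable prefix.
log(`${metricLogPrefix} ${msg}, cxId: ${cxId}, patientId: ${patientId}`);

// CDK stack: filter on the prefix instead of the full human-readable message.
filterPattern: FilterPattern.literal(`"${metricLogPrefix}"`),
```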

@jonahkaye jonahkaye marked this pull request as draft June 3, 2024 16:56