Transactional Guarantees with Background Work Queueing #3565

Open
dillonstreator opened this issue Dec 16, 2023 · 1 comment

Comments

@dillonstreator
Contributor

dillonstreator commented Dec 16, 2023

The FHIR server currently has a crucial gap in transactional guarantees between resource updates and the background work queueing mechanism. This exposes the system and its clients to potential data inconsistencies.

The problem can be seen in the API server's FHIR repo: if the writeToDatabase call succeeds but addBackgroundJobs fails for some reason (a transient BullMQ issue or a server crash, for example), clients will miss subscriptions, resource downloads, and/or cron registration.

    if (!this.isCacheOnly(result)) {
      await this.writeToDatabase(result);
    }
    await setCacheEntry(result);
    await addBackgroundJobs(result, { interaction: create ? 'create' : 'update' });
    this.removeHiddenFields(result);

Proposed Solution

To mitigate this issue, I recommend implementing the transactional outbox pattern. This pattern not only provides transactional guarantees but also makes it possible to retry background job queueing, even though failures with Redis/BullMQ are unlikely.

High-level Implementation Details

Event Table Introduction

The proposed solution involves the creation of a new events table. This table would encompass the following fields:

  • id: Uniquely identifies each event.
  • timestamp: Records the time when the event occurred.
  • type: Facilitates routing the event to the appropriate handler in the processor.
  • data: Contains event data, ideally limited to resource references.
  • correlation_id: Links disparate but related async actions. Can be injected into logs to tie together related async operations.
  • handler_results: Lists each event handler's result, including any errors encountered and a processed_at timestamp that prevents re-execution.
  • errors: Count of errors encountered while attempting to process the event. Used to skip events after too many errors. Can be reset to 0 to re-attempt any unprocessed event handlers.
  • backoff_until: Specifies when to retry processing the event in case of an error.
  • processed_at: Timestamp indicating when the event was successfully processed.
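
As a rough illustration of the fields above, here is a minimal TypeScript sketch of one event row plus a check for whether the processor should pick it up. The concrete types, the `OutboxEvent` and `isEligible` names, and the `MAX_ERRORS` threshold are all assumptions for illustration; the proposal does not specify them.

```typescript
// Hypothetical shape of one handler's entry in handler_results.
interface HandlerResult {
  processed_at?: Date; // set once this handler has succeeded
  errors: string[];    // messages from failed attempts
}

// Hypothetical row shape for the proposed events table.
interface OutboxEvent {
  id: string;
  timestamp: Date;
  type: string;
  data: Record<string, unknown>; // ideally only resource references
  correlation_id: string;
  handler_results: Record<string, HandlerResult>;
  errors: number;
  backoff_until: Date | null;
  processed_at: Date | null;
}

// Assumed threshold; the proposal only says "too many errors".
const MAX_ERRORS = 5;

// True when the event is unprocessed, under the error limit,
// and not currently waiting out a backoff window.
function isEligible(event: OutboxEvent, now: Date, maxErrors: number = MAX_ERRORS): boolean {
  if (event.processed_at !== null) return false;
  if (event.errors >= maxErrors) return false;
  if (event.backoff_until !== null && event.backoff_until.getTime() > now.getTime()) return false;
  return true;
}
```

Resetting `errors` to 0 on a skipped row makes `isEligible` return true again, which matches the re-attempt behaviour described for the errors field.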

ResourceSaved

A ResourceSaved event will be transactionally persisted alongside a resource creation or update.
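
A minimal sketch of what "transactionally persisted" could look like, assuming a SQL client that exposes a `query(text, params)` method. The `SqlClient` interface, the `resources` table, its columns, and the `saveResourceWithEvent` name are hypothetical and not taken from the existing codebase.

```typescript
// Minimal client shape (assumption): anything with a promise-returning query().
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

interface SavedResource {
  resourceType: string;
  id: string;
}

// Writes the resource and its ResourceSaved event in one transaction,
// so either both rows are committed or neither is.
async function saveResourceWithEvent(
  client: SqlClient,
  resource: SavedResource,
  eventId: string,
  correlationId: string
): Promise<void> {
  await client.query('BEGIN');
  try {
    // Hypothetical resource table; stands in for the real write.
    await client.query(
      'INSERT INTO resources (id, content) VALUES ($1, $2) ' +
        'ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content',
      [resource.id, JSON.stringify(resource)]
    );
    // The event stores only a reference to the resource, not a copy.
    await client.query(
      'INSERT INTO events (id, timestamp, type, data, correlation_id) ' +
        'VALUES ($1, now(), $2, $3, $4)',
      [
        eventId,
        'ResourceSaved',
        JSON.stringify({ reference: `${resource.resourceType}/${resource.id}` }),
        correlationId,
      ]
    );
    await client.query('COMMIT');
  } catch (err) {
    // Any failure discards both writes.
    await client.query('ROLLBACK');
    throw err;
  }
}
```

With this shape, a crash between the two inserts leaves neither row behind, so there is no window where a resource exists without its event.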

Event Processor and Handler(s)

The solution introduces an event processor capable of running either in-process or as a separate process. On a set interval, the processor checks the events table for unprocessed events and runs each one through its respective handlers (a predefined map keyed by event type/name). The handlers are responsible for background job queueing through BullMQ. The processor updates the events based on handler completion results. The processor should be able to scale horizontally without risk of duplicate event processing.
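
The per-event step of such a processor might look like the sketch below. All names (`PendingEvent`, `HandlerMap`, `processEvent`, `backoffMs`) and the exponential backoff policy are assumptions. The claim query in the comment is one possible way to avoid duplicate processing across instances on Postgres, not something the proposal prescribes.

```typescript
interface HandlerOutcome {
  processed_at?: Date;
  errors: string[];
}

interface PendingEvent {
  id: string;
  type: string;
  data: Record<string, unknown>;
  handler_results: Record<string, HandlerOutcome>;
  errors: number;
  backoff_until: Date | null;
  processed_at: Date | null;
}

type Handler = (event: PendingEvent) => Promise<void>;

// Event type -> handler name -> handler.
type HandlerMap = Record<string, Record<string, Handler>>;

// One possible claim query (assumption): rows locked by another
// processor instance are skipped instead of waited on.
//   SELECT * FROM events
//   WHERE processed_at IS NULL AND errors < $1
//     AND (backoff_until IS NULL OR backoff_until < now())
//   LIMIT $2 FOR UPDATE SKIP LOCKED

// Assumed policy: exponential backoff capped at one minute.
function backoffMs(errors: number): number {
  return Math.min(1000 * 2 ** errors, 60000);
}

// Runs every handler that has not yet succeeded and returns the
// updated event; the caller is expected to persist the result.
async function processEvent(
  event: PendingEvent,
  handlers: HandlerMap,
  now: Date
): Promise<PendingEvent> {
  const updated: PendingEvent = { ...event, handler_results: { ...event.handler_results } };
  let failed = false;
  for (const [name, handler] of Object.entries(handlers[event.type] ?? {})) {
    const previous = updated.handler_results[name] ?? { errors: [] };
    if (previous.processed_at) continue; // already succeeded, do not re-run
    try {
      await handler(updated);
      updated.handler_results[name] = { ...previous, processed_at: now };
    } catch (err) {
      failed = true;
      const message = err instanceof Error ? err.message : String(err);
      updated.handler_results[name] = { ...previous, errors: [...previous.errors, message] };
    }
  }
  if (failed) {
    updated.errors += 1;
    updated.backoff_until = new Date(now.getTime() + backoffMs(updated.errors));
  } else {
    updated.processed_at = now;
    updated.backoff_until = null;
  }
  return updated;
}
```

Because each handler's success is recorded separately, a retry after a partial failure only re-runs the handlers that did not complete.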

For each background job that requires queueing, the ResourceSaved event will have a corresponding handler. These handlers must use the addBulk method to guarantee atomic queueing in BullMQ. Alternatively, a single handler could call addBulk once to queue all background jobs together (subscriptions, downloads, and crons). The important part here is that the jobs are atomically queued.
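
A sketch of the single-handler variant, assuming all three job kinds go through one queue. `BulkQueue` is a hypothetical stand-in interface whose entries mirror the `{ name, data, opts }` shape that BullMQ's `addBulk` accepts; the job names and the event shape are illustrative only. Note that `addBulk` is called on one queue, so jobs living in separate queues would need one call per queue.

```typescript
// Stand-in for the part of a queue this handler needs (assumption).
interface BulkJob {
  name: string;
  data: Record<string, unknown>;
  opts?: { jobId?: string };
}

interface BulkQueue {
  addBulk(jobs: BulkJob[]): Promise<unknown>;
}

// Hypothetical event shape carrying only a resource reference.
interface ResourceSavedEvent {
  id: string;
  correlation_id: string;
  data: { reference: string };
}

// Hypothetical job names for the three kinds of background work.
const JOB_NAMES = ['subscription', 'download', 'cron'] as const;

// Builds one job per kind of background work for a saved resource.
function buildResourceSavedJobs(event: ResourceSavedEvent): BulkJob[] {
  return JOB_NAMES.map((name) => ({
    name,
    data: { reference: event.data.reference, correlationId: event.correlation_id },
  }));
}

// Single handler: every job is submitted in one addBulk call.
async function handleResourceSaved(queue: BulkQueue, event: ResourceSavedEvent): Promise<void> {
  await queue.addBulk(buildResourceSavedJobs(event));
}
```

Passing the correlation id into each job's data is one way to tie the resulting background work back to the originating request in logs.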

Considerations and Drawbacks

Despite the benefits offered by transactional outbox, it is important to acknowledge the drawbacks, including:

  • Heavy reliance on the event processor correctly updating events on success to prevent duplicate handler invocations. Specifying custom job ids would entirely mitigate this concern, assuming removeOnComplete is not enabled, because queueing would then be idempotent: https://docs.bullmq.io/guide/jobs/job-ids
  • Increased latency between resource persistence and background execution. The impact depends on the event processor's tick interval.
  • Increased storage requirements due to the introduction of the new events table which would store a record for each persisted change to a resource.
  • Increased bandwidth usage and database load from the event processor.
  • Arguably reduced code flow clarity.
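
To make the custom job id idea from the first drawback concrete, here is a small sketch. `jobIdFor` is a hypothetical helper, and `InMemoryQueue` is only a stand-in that imitates the "a job whose id already exists is not added again" behaviour described in the BullMQ job id guide; it is not BullMQ's API.

```typescript
// Derives a stable id from the event and handler, so a retried
// handler produces the same id as its first attempt.
function jobIdFor(eventId: string, handlerName: string): string {
  return `${eventId}_${handlerName}`;
}

// In-memory stand-in for a queue that ignores duplicate job ids.
class InMemoryQueue {
  private readonly jobs = new Map<string, { name: string; data: unknown }>();

  // Returns true if the job was added, false if the id already existed.
  add(name: string, data: unknown, opts: { jobId: string }): boolean {
    if (this.jobs.has(opts.jobId)) return false;
    this.jobs.set(opts.jobId, { name, data });
    return true;
  }

  get size(): number {
    return this.jobs.size;
  }
}
```

If the processor crashes after queueing but before marking the event processed, the retry reuses the same id and the duplicate is dropped, provided the earlier job has not already been removed from the queue.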

Nevertheless, the advantages of improved transactional guarantees and enhanced data consistency significantly outweigh these drawbacks IMO.

@dillonstreator
Contributor Author

dillonstreator commented Dec 17, 2023

I looked into this a bit more and ended up throwing together a library that implements a generic transactional outbox event processor (https://github.com/dillonstreator/txob), as well as a client adapter for pg (https://github.com/dillonstreator/txob/blob/main/src/pg/client.ts).
I also put together a simple example HTTP server application that uses this library to showcase transactionally persisting events along with another database entity and asynchronously processing arbitrary lists of independent event side effects: https://github.com/dillonstreator/txob/blob/main/examples/pg/index.ts

Curious to see what thoughts are on this.

@reshmakh reshmakh added this to the Milestone Quality milestone Jan 3, 2024