From 1e34e28dba8f471870c31762296701c8fe9e9b18 Mon Sep 17 00:00:00 2001
From: Armin Ronacher
Date: Thu, 13 Jul 2023 12:39:40 +0200
Subject: [PATCH 1/4] WIP

---
 text/XXXX-filestore-new.md | 50 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)
 create mode 100644 text/XXXX-filestore-new.md

diff --git a/text/XXXX-filestore-new.md b/text/XXXX-filestore-new.md
new file mode 100644
index 00000000..b0883c46
--- /dev/null
+++ b/text/XXXX-filestore-new.md
@@ -0,0 +1,50 @@
+- Start Date: YYYY-MM-DD
+- RFC Type: feature / decision / informational
+- RFC PR:
+- RFC Status: draft
+
+# Summary
+
+One of the systems that Sentry internally operates today is an abstract concept referred
+to as "file store". It consists of a Postgres-level infrastructure to refer to blobs and
+a Go service also called "file store" which acts as a stateful proxy in front of GCS to
+deal with latency spikes, write throughput and caching.
+
+This RFC summarizes issues with the current approach, the changed requirements that go
+into this system and proposes a path forward.
+
+# Motivation
+
+Various issues have ocurred over the years with this system so that some decisions were
+made that over time have resulted in new requirements for filestore and alternative
+implementations. Replay for instance operates a seperate infrastructure that goes
+straight to GCS but is running into write throughput issues that the file store Go service
+solves. On the other hand race conditions and complex blob book-keeping in Sentry itself
+prevent expiring of debug files and source maps after a period of time.
+
+The motivation of this RFC is to summarize the current state of affairs and the work streams that
+are currently planned or are in motion, to come to a better conclusion about what should be
+done with the internal abstractions and how they should be used.
+
+# Background
+
+blah
+
+# Supporting Data
+
+[Metrics to help support your decision (if applicable).]
+
+# Options Considered
+
+If an RFC does not know yet what the options are, it can propose multiple options. The
+preferred model is to propose one option and to provide alternatives.
+
+# Drawbacks
+
+Why should we not do this? What are the drawbacks of this RFC or a particular option if
+multiple options are presented.
+
+# Unresolved questions
+
+- What parts of the design do you expect to resolve through this RFC?
+- What issues are out of scope for this RFC but are known?

From b0a1340c0efed26b7117ef6f088647baa29c3830 Mon Sep 17 00:00:00 2001
From: Armin Ronacher
Date: Thu, 13 Jul 2023 15:46:23 +0200
Subject: [PATCH 2/4] More text changes

---
 text/XXXX-filestore-new.md | 65 ++++++++++++++++++++++++++++++++------
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/text/XXXX-filestore-new.md b/text/XXXX-filestore-new.md
index b0883c46..9b132cac 100644
--- a/text/XXXX-filestore-new.md
+++ b/text/XXXX-filestore-new.md
@@ -28,23 +28,68 @@ done with the internal abstractions and how they should be used.
 # Background
-blah
+The primary internal abstraction in Sentry today is the `filestore` service, which itself
+is built on top of Django's `files` system. At this level "files" have names and they
+are stored in a specific GCS bucket (or an alternative backend). On top of that the `files`
+models are built. There, each file is created out of blobs, and each blob is stored
+(deduplicated) just once in the backend of `filestore`.
+
+For this purpose each blob is given a unique filename (a UUID). Blobs are deduplicated
+by content hash and only stored once. This causes some challenges for the system, as it
+means that the deletion of blobs has to be driven by the application; backend-side
+auto-expiration is thus no longer possible.
 # Supporting Data
-[Metrics to help support your decision (if applicable).]
+We currently store petabytes of file assets we would like to delete.
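To make the deduplication constraint described in Background concrete, the following is a toy sketch (illustrative only; `DedupBlobStore` and its helpers are invented names, not Sentry's actual implementation) of content-hash deduplicated blob storage, showing why blob deletion must be driven by reference bookkeeping rather than by a storage-side TTL:

```python
import hashlib


class DedupBlobStore:
    """Toy model of deduplicated blob storage (illustrative sketch only)."""

    def __init__(self):
        self.blobs = {}     # checksum -> bytes; each unique blob stored once
        self.files = {}     # file name -> list of blob checksums
        self.refcount = {}  # checksum -> number of files referencing the blob

    def save_file(self, name, chunks):
        checksums = []
        for chunk in chunks:
            # 40-char hex digest, matching the checksum CharField(max_length=40)
            checksum = hashlib.sha1(chunk).hexdigest()
            if checksum not in self.blobs:
                self.blobs[checksum] = chunk
            self.refcount[checksum] = self.refcount.get(checksum, 0) + 1
            checksums.append(checksum)
        self.files[name] = checksums

    def delete_file(self, name):
        # Deletion must be driven by the application: a blob may only be
        # removed once no other file references it, which is why a plain
        # storage-side expiration policy cannot be applied safely.
        for checksum in self.files.pop(name):
            self.refcount[checksum] -= 1
            if self.refcount[checksum] == 0:
                del self.blobs[checksum]
                del self.refcount[checksum]


store = DedupBlobStore()
store.save_file("debug-file-a", [b"shared chunk", b"unique to a"])
store.save_file("source-map-b", [b"shared chunk"])
# Three logical chunks, but only two blobs are stored:
assert len(store.blobs) == 2
store.delete_file("debug-file-a")
# The shared blob survives because source-map-b still references it.
assert hashlib.sha1(b"shared chunk").hexdigest() in store.blobs
```

Because the shared blob backs two files, a naive time-based expiry of blobs would silently corrupt the file that still references them; this reference bookkeeping is what the RFC proposes to eliminate by removing deduplication.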
+
+# Possible Changes
+
+These are some plans about what can be done to improve the system:
+
+## Removal of Blob Deduplication
+
+Today it's not possible for us to use GCS side expiration. That's because without the
+knowledge of the usage of blobs from the database it's not save to delete blobs. This
+can be resolved by removing deduplication. Blobs thus would be written more than once.
+This works on the `filestore` level, but it does not work on the `FileBlob` level.
+However `FileBlob` itself is rather well abstracted away from most users. A new model
+could be added to replace the one one. One area where `FileBlob` leaks out is the
+data export system which would need to be considered.
+
+`FileBlobOwner` itself could be fully removed, same with `FileBlobIndex` as once
+deduplication is removed the need for the owner info no longer exists, and the index
+info can be stored on the blob itself.
+
+```python
+class FileBlob2(Model):
+    organization_id = BoundedBigIntegerField(db_index=True)
+    path = TextField(null=True)
+    offset = BoundedPositiveIntegerField()
+    size = BoundedPositiveIntegerField()
+    checksum = CharField(max_length=40, unique=True)
+    timestamp = DateTimeField(default=timezone.now, db_index=True)
+```
+
+## TTL Awareness
-# Options Considered
+The abstractions in place today do not have any support for storage classes. Once, however,
+blobs are no longer deduplicated, it would be possible to fully rely on GCS to clean up on
+its own. Because certain operations are going via our filestore proxy service, it would be
+preferable if the policies were encoded into the URL in one form or another.
-If an RFC does not know yet what the options are, it can propose multiple options. The
-preferred model is to propose one option and to provide alternatives.
+## Assemble Staging Area
-# Drawbacks
+The chunk upload today depends on the ability to place blobs one by one somewhere.
+Once blobs are stored as regular objects in GCS there is no significant reason to slice them
+up into small pieces as range requests are possible. This means that the assembly of the file
+needs to be reconsidered.
-Why should we not do this? What are the drawbacks of this RFC or a particular option if
-multiple options are presented.
+The easiest solution here would be to allow chunks to be uploaded to a per-org staging area where
+they linger for up to two hours per blob. That gives plenty of time to use these blobs for
+assembly. A cleanup job (or TTL policy if placed in GCS) would then collect the leftovers
+automatically. This also decouples external blob sizes from internal blob
+storage, which gives us the ability to change blob sizes as we see fit.
 # Unresolved questions
-- What parts of the design do you expect to resolve through this RFC?
-- What issues are out of scope for this RFC but are known?
+TBD

From 432ff7fddb5337b564c5faf9911da0779fac0b63 Mon Sep 17 00:00:00 2001
From: Armin Ronacher
Date: Thu, 13 Jul 2023 15:47:51 +0200
Subject: [PATCH 3/4] Update file

---
 text/{XXXX-filestore-new.md => 0108-filestore-new.md} | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
 rename text/{XXXX-filestore-new.md => 0108-filestore-new.md} (97%)

diff --git a/text/XXXX-filestore-new.md b/text/0108-filestore-new.md
similarity index 97%
rename from text/XXXX-filestore-new.md
rename to text/0108-filestore-new.md
index 9b132cac..5cc20b9b 100644
--- a/text/XXXX-filestore-new.md
+++ b/text/0108-filestore-new.md
@@ -1,6 +1,6 @@
-- Start Date: YYYY-MM-DD
-- RFC Type: feature / decision / informational
-- RFC PR:
+- Start Date: 2023-07-13
+- RFC Type: informational
+- RFC PR: https://github.com/getsentry/rfcs/pull/108
 - RFC Status: draft
 # Summary

From 6edfdcf76fe8c8005103f9c039b3dfab80dab7a6 Mon Sep 17 00:00:00 2001
From: Armin Ronacher
Date: Thu, 20 Jul 2023 15:22:06 +0200
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Mark Story

---
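The staging-area flow proposed in patch 2 could be sketched roughly as follows. This is an illustrative sketch under assumptions: `StagingArea`, the in-memory dict, and the constant name stand in for a real per-org GCS staging bucket and its lifecycle policy; none of these names come from the Sentry codebase.

```python
import hashlib
import time

STAGING_TTL_SECONDS = 2 * 60 * 60  # chunks linger for up to two hours


class StagingArea:
    """Toy per-org staging area for chunk uploads (illustrative sketch only)."""

    def __init__(self):
        self.chunks = {}  # (org_id, checksum) -> (uploaded_at, bytes)

    def upload_chunk(self, org_id, data):
        # Chunks are addressed by content hash so the client can refer to
        # them during assembly.
        checksum = hashlib.sha1(data).hexdigest()
        self.chunks[(org_id, checksum)] = (time.time(), data)
        return checksum

    def assemble(self, org_id, checksums):
        # Concatenate staged chunks into one object. Once stored whole,
        # consumers can read sub-ranges via GCS range requests instead of
        # fetching many small blobs.
        return b"".join(self.chunks[(org_id, c)][1] for c in checksums)

    def sweep(self, now=None):
        # Stand-in for a GCS lifecycle/TTL policy: drop expired leftovers.
        now = time.time() if now is None else now
        self.chunks = {
            key: (ts, data)
            for key, (ts, data) in self.chunks.items()
            if now - ts < STAGING_TTL_SECONDS
        }


staging = StagingArea()
c1 = staging.upload_chunk(org_id=1, data=b"hello ")
c2 = staging.upload_chunk(org_id=1, data=b"world")
assert staging.assemble(1, [c1, c2]) == b"hello world"
# Simulate the TTL expiring: leftovers are collected automatically.
staging.sweep(now=time.time() + STAGING_TTL_SECONDS + 1)
assert staging.chunks == {}
```

In a real deployment the `sweep` step would be a lifecycle rule on the staging bucket rather than application code; the point of the sketch is that external chunk sizes only matter inside the staging area, which is what decouples them from internal blob storage.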
 text/0108-filestore-new.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/text/0108-filestore-new.md b/text/0108-filestore-new.md
index 5cc20b9b..f7105a95 100644
--- a/text/0108-filestore-new.md
+++ b/text/0108-filestore-new.md
@@ -15,9 +15,9 @@ into this system and proposes a path forward.
 # Motivation
-Various issues have ocurred over the years with this system so that some decisions were
+Various issues have occurred over the years with this system so that some decisions were
 made that over time have resulted in new requirements for filestore and alternative
-implementations. Replay for instance operates a seperate infrastructure that goes
+implementations. Replay for instance operates separate infrastructure that goes
 straight to GCS but is running into write throughput issues that the file store Go service
 solves. On the other hand race conditions and complex blob book-keeping in Sentry itself
 prevent expiring of debug files and source maps after a period of time.
@@ -50,11 +50,11 @@ These are some plans about what can be done to improve the system:
 ## Removal of Blob Deduplication
 Today it's not possible for us to use GCS side expiration. That's because without the
-knowledge of the usage of blobs from the database it's not save to delete blobs. This
+knowledge of the usage of blobs from the database it's not safe to delete blobs. This
 can be resolved by removing deduplication. Blobs thus would be written more than once.
 This works on the `filestore` level, but it does not work on the `FileBlob` level.
 However `FileBlob` itself is rather well abstracted away from most users. A new model
-could be added to replace the one one. One area where `FileBlob` leaks out is the
+could be added to replace the current one. One area where `FileBlob` leaks out is the
 data export system which would need to be considered.
 `FileBlobOwner` itself could be fully removed, same with `FileBlobIndex` as once