New Sentry Architecture #2

* Start Date: 2022-07-21
* RFC Type: informational
* RFC PR: https://github.com/getsentry/rfcs/pull/2

# Summary

This document is a living piece that is updated with collected thoughts on which projects are likely, over a multi-year horizon, to improve Sentry's underpinnings.

# Motivation

We are running into scaling limitations on our current infrastructure, and as such some larger decisions have to be made about the future of our codebase. As we are talking about creating more Sentry deployments in different regions and for smaller clusters, some of these problems are becoming more pressing, as they create complexity for each added installation.

# Expected Workstreams

This captures all expected streams of work that are in the scope of overhauling Sentry's architecture.

> **Review comment:** I think we may have another work stream to deal with.

> **Review comment:** Another aspect we may want to capture regarding the new architecture: the concept of ownership and self service. We have seen that we cannot live anymore in a world where ops and SnS own everything; product teams need to take ownership of their systems, which can include application code in the monolith but also systems (like Kafka consumers or storage instances). This also includes being accountable for the resources they use so that they can take responsibility for decisions between a short-term non-scalable solution to deliver faster and a long-term scalable solution. I think there are two fundamental ways to approach the problem that generate different architectures. For Snuba we are going towards the second model; I am outlining it here: https://www.notion.so/sentry/Improve-product-development-speed-92c5b9f06a6744288454437f5a8c3db6. Though this problem will apply to Sentry as well.
>
> **Reply:** @fpacifici would be nice to move this into a public space so I can reference it: https://www.notion.so/sentry/Improve-product-development-speed-on-Snuba-92c5b9f06a6744288454437f5a8c3db6

## Processing Pipeline

These are long-running concerns with the processing pipeline that should be addressed.

### Isolate Processing Pipeline out of Monolith

The Celery tasks, event manager and related functionality used by the processing pipeline live alongside the rest of the Sentry code in the `sentry` monolith. This means that it's quite chaotic and many unintended cross dependencies are regularly introduced into the code base. As some of the processing pipeline wants to move out of the monolith for better scalability (in some extreme cases even out of Python as the host language), it would be helpful to move all of the processing pipeline into a `sentry_pipeline` package. It would be permissible for `sentry_pipeline` to import `sentry`, but not the other way round.

This will cause issues in the test suite, as the Sentry test suite currently tries to create events regularly, and pragmatic solutions for this need to be found.
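
One way the proposed one-way dependency could be enforced mechanically is a test that scans the `sentry` sources for forbidden imports. The sketch below assumes a `src/sentry` layout and the `sentry_pipeline` package name; an existing tool such as import-linter could serve the same purpose.

```python
# Minimal sketch: fail the test suite if anything under `sentry` imports
# `sentry_pipeline`.  The source layout and package names are assumptions.
import ast
import pathlib

SENTRY_SRC = pathlib.Path("src/sentry")  # assumed source layout
FORBIDDEN = "sentry_pipeline"


def imported_modules(path: pathlib.Path):
    """Yield every module name imported by the given Python file."""
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            yield node.module


def test_sentry_does_not_import_pipeline():
    offenders = [
        (str(path), module)
        for path in SENTRY_SRC.rglob("*.py")
        for module in imported_modules(path)
        if module == FORBIDDEN or module.startswith(FORBIDDEN + ".")
    ]
    assert not offenders, f"sentry must not import sentry_pipeline: {offenders}"
```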

### Remove Pickle in Rabbit

We still use pickle as the serialization format for Celery, which means that it is not possible for code outside of the Sentry monolith to dispatch tasks. This is an issue both for the separation of `sentry` and `sentry_pipeline` (as import paths are changing) and because it causes problems for dispatching and listening to tasks from Rust and other languages.

> **Review comment:** Yes please. Are we also considering serialization formats other than JSON? I think we never tried anything seriously, but it may be interesting to explore alternatives to improve parsing/serialization (but really parsing).

> **Review comment:** While I absolutely support moving away from pickle, I would challenge that this is a blocker for renaming. The only place where I think actual import paths are serialized for Celery today is when a complex object is pickled as a task argument. There is prior art to restricting what objects may be pickled. This step 0 will require a lot of coordination and followup tasks for many product areas, as I am fairly sure we pickle complex objects in both task arguments and postgres.
>
> **Reply:** The issue is not the naming of the task but the arguments. If I today want to dispatch a task containing a model from outside the sentry codebase, I need to create a message that pickles into a Django model from within the sentry codebase. The only realistic and portable way to do this is to import the monolith, create the instance and dispatch the task that way.
>
> **Reply:** The non-JSON-safe detection code BTW is already in Sentry. The issue is that the tasks are still passing these complex objects in practice. That's a thing we need to start addressing before we can even entertain the idea of moving off it.
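
Once task arguments are JSON-safe, switching Celery away from pickle is mostly configuration. A minimal sketch using standard Celery options (the broker URL and the example task are hypothetical):

```python
# Minimal sketch: configure Celery to use JSON end to end and reject pickle.
# The broker URL and the example task are hypothetical.
from celery import Celery

app = Celery("sentry_pipeline", broker="amqp://localhost//")
app.conf.update(
    task_serializer="json",    # serialize task arguments as JSON
    result_serializer="json",  # serialize results as JSON
    accept_content=["json"],   # refuse anything that is not JSON (e.g. pickle)
)


@app.task
def process_event(project_id: int, event_id: str) -> None:
    # Tasks take plain JSON-safe arguments (ids, strings, dicts) instead of
    # pickled Django model instances, so non-Python code can dispatch them too.
    ...
```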

### Remove HTTP Polling to Symbolicator

Event processing communication from the Python pipeline to Symbolicator (which is a Rust service) involves polling Symbolicator from Celery Python workers. This has shown many issues in the past, where the entire system can tilt if the load on the symbolicators moves too far from the load assumptions of the Python polling workers. The correct solution would be for tasks to be picked up directly by Symbolicator.

### Remove Celery and RabbitMQ for Kafka

Our use of RabbitMQ is reaching the limits of what can be done with this system safely. If we hit disk, our entire pipeline slows to a crawl and we're no longer able to keep up with the traffic. As such we are already throttling how events make it from Kafka to RabbitMQ in extreme cases. We are however relying on RabbitMQ to deal with the large variance of task execution times in the symbolication part of the pipeline. For instance, a JavaScript event can make it through the entire pipeline in low milliseconds, whereas a native event can spend up to 30 minutes or more in the pipeline in very extreme cases (completely cold caches against a slow symbol server).

> **Review comment:** Do you have some references about this? When you say "hit the disk", are you talking about Rabbit? Or the tasks themselves?
>
> **Reply:** Today if we ever hit disk on Rabbit the thing slows to a crawl. I am unfortunately not able to find the last incident involving this as it predates the current incident process, but it's something we are generally aware of. We could probably operate Rabbit so that going to disk is an option, but today that does not appear to be something we can do.
>
> **Reply:** I wish we fully understood what is going on with RabbitMQ in that scenario and did tests that re-validate that understanding.

It should however still be possible to move this to a Kafka model where batches of tasks are redistributed to slower and faster topics. For instance, the processing of a task could be started and, if it is not finishing within a deadline (or the code already knows that this execution can't happen), it would be dispatched to increasingly slower topics. With sufficient concurrency on the consumers this is probably a good enough system for running at scale, and with some modifications this might also work well enough for single-organization installations.

> **Review comment:** I feel the split between classes of traffic and the need to move away from Rabbit are orthogonal. Or am I missing something?
>
> **Reply:** Generally speaking I think we just need to make a decision how we feel about Rabbit vs more Kafka. Everything else there is downstream from it. If we no longer want Rabbit we need to build some stuff to replicate the benefits we get from it with Kafka.
>
> **Reply:** Do we need to, though? If we want to pick one over the other, it has to be Kafka because of the functional requirements around delivery guarantees that Rabbit does not provide. That means we cannot do without Kafka. I guess the question is whether we have real use cases where we use RabbitMQ that would be hard to replace with Kafka.
>
> **Reply:** From where I'm standing we have RabbitMQ for two reasons at the moment: one is that it does not have head-of-line blocking, which makes it a pretty trivial choice for parts of the processing pipeline where we are not quite sure how long something takes. The second reason is that we built a lot of stuff on top of Celery, particularly the cron-based stuff. The first reason is in fact a benefit, but it can also be done with Kafka. The second reason is just a historic thing I believe. I'm not sure if we have actual cost or scaling benefits with RabbitMQ still, but that's something I would like to explore.
>
> **Reply:** Particularly on the RabbitMQ side we currently operate with non-durable, federated queues.
>
> **Reply:** One concern I have is that Celery is a great abstraction for quick prototyping while Kafka is everything but. Moving every trivially low-scale task that doesn't have consistency requirements to Kafka today is much harder than creating a task that runs on worker-glob. One thing we could do is to implement a Celery broker on top of Kafka. This would make it easier to gather operational experience with running certain product features on Kafka vs RabbitMQ, and could significantly sweeten the deal if we decided we're gonna ditch RabbitMQ entirely for whatever reason.
>
> **Reply:** Have we considered a managed cloud service for task queues instead of RabbitMQ? GCP Tasks or Pub/Sub? Here's a comparison. I can confirm as well that scaling of workers on RabbitMQ is simpler than on Kafka due to the one consumer per partition limit and scaling of consumer groups causing rebalancing.
>
> **Reply:** If we did want to do everything with Kafka, I agree with @fpacifici that the parallel consumer should be investigated for non-ordered workloads.
>
> **Reply:** I will say that I have no preference on whether we want to use RabbitMQ (or a hosted queue) or Kafka. I can in general live with both, but if we stay with RabbitMQ I would like to move us into a world where different codebases can use the queue and not just the monolith. Particularly also other languages.

> **Review comment:** This is how we would do parallelism right in a scenario where we have different classes of traffic that do not have to be processed in strict order.
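
To make the deadline idea more concrete, a consumer could attempt the work within a time budget and, on timeout, republish the message to the next slower topic instead of blocking its partition. The sketch below uses confluent-kafka; the topic names, deadlines and `process_event` function are assumptions, and a real implementation would also have to abandon or cancel the in-flight attempt.

```python
# Rough sketch: deadline-based re-dispatching between fast and slow topics.
# Topic names, deadlines and process_event() are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

from confluent_kafka import Consumer, Producer

SLOWER_TOPIC = {"symbolicate-fast": "symbolicate-slow",
                "symbolicate-slow": "symbolicate-slowest"}
DEADLINES = {"symbolicate-fast": 1.0, "symbolicate-slow": 30.0}  # seconds

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "symbolication",
                     "enable.auto.commit": False})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(list(SLOWER_TOPIC))
pool = ThreadPoolExecutor(max_workers=16)


def process_event(payload: bytes) -> None:
    ...  # the actual symbolication work


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    future = pool.submit(process_event, msg.value())
    try:
        future.result(timeout=DEADLINES.get(msg.topic(), 1.0))
    except FutureTimeout:
        # Not done within the deadline: hand the task to the next slower topic
        # so this partition is not blocked (the running attempt is simply
        # abandoned here for simplicity).
        producer.produce(SLOWER_TOPIC.get(msg.topic(), "symbolicate-slowest"),
                         msg.value())
        producer.flush()
    consumer.commit(message=msg)
```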

### Move Sourcemap Processing to Symbolicator

Our sourcemap processing system runs in a mix of Rust and Python code: it fetches a bunch of data via HTTP and from the database models, and then sends the fetched data into a Rust module for resolving. This is generally quite inefficient, but we are also running into limitations in the workers. For instance, we rely exclusively on memcache for caches, which means that we are limited by our cache size limits. Anything above the cache size is not processable, which can cause a bad user experience for users depending on very large source maps.

> **Review comment:** Having one way to apply symbols sounds like a good simplification.

Symbolicator has a file-system-based caching system with GCS backing for rapid syncing of new Symbolicator instances, which we could leverage. Additionally, Symbolicator has a better HTTP fetching system than Sentry, as it's able to fetch concurrently whereas the Sentry codebase is not.

Future feature improvements on the source map side might also demand more complexity on the processing side, which we are not currently considering given already existing scalability concerns on the existing workers.

## Python Monolith

These are changes to break up the monolith. The goals here are largely that different teams can work uninterrupted by each other, even on the monolith. For instance, in an ideal scenario, changes to integration platform code do not necessarily have to run the entire Sentry test suite on every commit. Likewise, UI code ideally does not build the entirety of the Sentry UI codebase.

### Import Compartmentalization

As a first step, imports in the Sentry codebase should be compartmentalized as much as possible. Today we have some catch-all modules like `sentry.models` which import everything. This means that it's hard for us to track the actual dependencies between different pieces of code, and it also hides circular imports. One of the results of this is that if you run a test against a single pytest module, you still pull in the entire codebase, which already takes a couple of seconds. It also means that it's harder for test infrastructure code to analyze the minimal set of dependencies that might need testing.
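
To illustrate the difference (the direct module path in the second import is an assumed layout, shown only for illustration):

```python
# Catch-all pattern (today): the package __init__ re-exports everything, so this
# single import pulls in the whole model graph and hides circular imports.
from sentry.models import Project

# Compartmentalized pattern (goal): import the defining module directly so the
# real dependency stays explicit and tooling can compute a minimal test set.
# (Assumed module path, for illustration only.)
# from sentry.models.project import Project
```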
Likewise, code in `sentry.utils` should most likely no longer import models etc. Some of that code might even move out of `sentry` entirely.

### Move Code from Getsentry to Sentry

Currently quite a bit of code lives in `getsentry` for reasons of assumed convenience. However, this has created a situation where it is quite easy to ship regressions because the changes in `getsentry` were not considered. There is likely a whole range of code that does not need to live in `getsentry`, and moving it to `sentry` would make the test suite more reliable, faster to run, and catch more regressions early.

> **Review comment:** I could see usage periods and usage collection being something that could move to sentry. We would keep invoicing and subscriptions in getsentry, but the usage subsystem could be moved out. This could also help align product features like orgstats and dynamic sampling suggestions with the usage periods that we use for billing in SaaS more easily.

### Remove Pickle in Database

We still use pickled models in database code quite extensively. This reliance on pickle makes it harder than necessary to read and write this data from outside the main Python codebase. Moving this to JSON will make it simpler to move such code into environments where other code wants to access it. We already require depickling in the data pipeline for such cases.
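
As an illustration of the direction, a column that currently stores a pickled blob could become a plain JSON column; a minimal Django sketch (the model and field names are hypothetical, `JSONField` is standard Django):

```python
# Minimal sketch: store payloads as JSON instead of pickle so any language that
# can talk to Postgres can read and write them.  Model/field names are made up.
from django.db import models


class NodePayload(models.Model):
    # Previously something like a pickled object field; as JSON, the value is
    # readable and writable outside the Python monolith.
    data = models.JSONField(default=dict)

    class Meta:
        app_label = "sentry"
```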

### Phase out Buffers where Possible

Buffers as they exist today could in many cases be replaced by better uses of Kafka and ClickHouse. Some buffers are not even as important anymore for the product. For instance, the total counts for issues are already hidden from the more prominent places in the UI. The issue with buffers is that they are rather hard to scale, require the use of pickle for the model updates, and are hard to work with when it comes to filtering. For instance, we are able to give you the total event count on an issue, but not when broken down by some of the filters that the UI wants to provide (environment and release are possible, but any tag search will not be able to give the right counts).

> **Review comment:** Honestly I think buffers is conceptually a good idea to keep. It provides batching for a subset of operations in your task/function without forcing you to give up isolation of tasks/functions. I instead wish we made buffers more scalable under the hood, encouraged/allowed isolation per product domain, and extended the abstraction to work for more things than models. Moving to bespoke Kafka consumers that do batching is a pretty big investment to make for each product, and I would prefer if we continued to have higher-level building blocks (higher-level than Arroyo). To me that would be a rewrite of buffers.
>
> **Reply:** I think buffers is a good concept and it should stay. The way it's built is maybe not ideal (in particular the model and pickle dependency), and particularly given extended outages as a possibility it would be preferable for this to be backed by Kafka rather than Redis, for instance. I don't think everything should build a Kafka consumer, but a buffer service could be a possibility.

## Relay

Within Relay, some changes are likely to be necessary for continued support of a stable service. In particular, as more and more work moves into Relay, certain risks that the current architecture is carrying need to be addressed. The goals here are horizontal scalability, the ability to route data more efficiently, and the ability to recover from catastrophic failures.

> **Review comment:** Is it a good idea that more and more work keeps moving into Relay? Is there something that prevents us from making Relay simpler and moving all product logic into the specific data category ingestion pipeline to further isolate them?
>
> **Reply:** Unclear, but I have been talking with @jan-auer already that if we keep doing this we might want to split Relay into two pieces which are connected through a persistent queue. And once we do something like that we can also modularize it. Metrics aggregation I'm conflicted about, but in particular longer term there is a lot of reason to try to attempt symbolication before something hits the innermost parts of the pipeline, particularly now that we have a lot more mobile/native traffic. Anything we need to know ahead of dynamic sampling needs to be expanded/processed by Relay.
>
> **Reply:** This needs a more concrete plan. We've been talking for a long time about doing symbolication in Relay but so far it's been a pie-in-the-sky idea. I think the problems this pipeline-reordering brings up are mostly impossible to solve. I have a rough idea on how to do grouping-aware filtering in some cases, but I don't see how we can stackwalk before dynamic sampling without any access to our massive internal system symbols, for example.

### Traffic Steering

As we are now performing aggregations in Relay, it's beneficial to be able to route traffic intelligently through the layers. For instance, we can achieve much better aggregations by ensuring that data from related metrics keys is forwarded to the same processing relays.

> **Review comment:** I think there are three types of traffic routing problems we are facing today, though they are really instances of the same problem and we could have one system to solve all three. I think there is some value in building this as a Relay-agnostic system that works based on an abstract concept of partition and physical resource. I wrote down some of that in the comments here.
>
> **Reply:** I agree with Filippo and I wish we would start building this service sooner rather than later. Our current initiatives don't have a common interface to code against, and since multiple teams are working on sharding/slicing/hybrid-cloud solutions I fear that there will be some avoidable fragmentation in approaches.

### Partial Project Options

Relays currently get one project config per project, both via the HTTP protocol and by picking them up internally from a Redis key. The size of this config package is exploding as more and more functionality is added. There are many benefits from splitting this up: it should improve our ability to partially degrade service rather than the entire project, reduce the pressure on Redis, and make updates faster. For instance, an experimental feature should ideally get its own config key, so that the inability to generate that config still retains functionality for the rest of the project.
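
A rough sketch of what split configs could look like on the consumer side; the key naming scheme and the feature list are assumptions for illustration:

```python
# Rough sketch: assemble a project config from independent per-feature keys so
# that a missing key degrades only that feature, not the whole project.
# Key names and the feature list are illustrative assumptions.
import json

import redis

client = redis.Redis()
FEATURES = ["quotas", "dynamic_sampling", "metrics_extraction"]


def load_project_config(project_id: int) -> dict:
    config: dict = {}
    for feature in FEATURES:
        raw = client.get(f"projectconfig:{project_id}:{feature}")
        if raw is None:
            continue  # degrade just this feature instead of the whole project
        config[feature] = json.loads(raw)
    return config
```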

### Separation of Traffic Classes

Relay currently forwards all envelopes to the next relay in the chain in the same way. There are however benefits to being able to route sessions to different relays than, for instance, transactions or errors. Particularly for future projects such as session replays and profiling, it would be nice to be able to route experimental features to a specific relay rather than having to roll out the experimental support across the entire cluster of relays.
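
Conceptually this amounts to a routing table keyed by envelope item type; the sketch below is purely illustrative (upstream names and the routing table are assumptions):

```python
# Illustrative sketch: pick an upstream Relay pool based on the item types an
# envelope carries.  Upstream names and the routing table are assumptions.
UPSTREAM_BY_ITEM_TYPE = {
    "session": "https://relay-sessions.internal",
    "replay_event": "https://relay-replays.internal",  # experimental pool
    "profile": "https://relay-profiling.internal",     # experimental pool
}
DEFAULT_UPSTREAM = "https://relay-default.internal"


def pick_upstream(item_types: list[str]) -> str:
    """Return the upstream for an envelope based on the items it contains."""
    for item_type in item_types:
        if item_type in UPSTREAM_BY_ITEM_TYPE:
            return UPSTREAM_BY_ITEM_TYPE[item_type]
    return DEFAULT_UPSTREAM
```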

### Modularization in Relay

Clean up the envelope processor to make it easier to add new item types and new functionality. Today the massive envelope processor needs to be updated for every single item type that is added. This opens many questions about how different items should work on different layers of Relay, how they are routed, etc. As we are adding more item types, this can turn into more of a serverless-function-like internal API for adding item types that makes it easier to land new items.

### Relay Disk Buffer

Relays currently buffer all data in memory, which means that if they go down, we lose data. Instead, Relay should ideally additionally buffer all its data in a Kafka topic or on-disk storage that can make recovery from extended downtime possible.

### Partial Grouping Awareness in Relay

Today our grouping system can really only run at the tail end of the processing pipeline. This makes it impossible for us to perform certain types of filtering or grouping decisions in Relay. There is a range of issues that could already be grouped quite accurately in Relay, which would then permit us to do metrics extraction in Relay or to perform efficient discarding.

### Tail-based Sampling / Distributed Sampling Context Buffer

Relay is currently unable to perform tail-based sampling or consistent sampling if the dynamic sampling context changes. This could be remedied by extensive buffering or alternative approaches. While a full tail-based sampling approach is likely to be quite costly, there might be hybrid approaches possible.

## Service Infrastructure

These are changes to the Sentry architecture to better support the creation of independent services.

### In-Monolith Service Compartmentalization

The Sentry monolith currently houses a range of small services that are however clustered together. For instance, the processing pipeline really could be separated entirely from the rest of the system. The same is true for quite a lot of different parts of the product. For instance, certain parts of the user experience are already sufficiently independent in the UI code today that they could turn into a separate service that just supplies UI views together with some APIs. While it's unclear if the existing experiences are worth splitting up into new services, it's quite likely that we can at least structure their code so that we can reduce the total amount of code that needs to be checked and tested against changes.

In the ideal situation, a change to the settings page of a project, for instance, does not need to run the UI tests for performance views etc. unless the settings are in fact related to that component.

> **Review comment:** Is it worthwhile being explicit that service compartmentalization doesn't imply splitting the monolith into multiple repos and running them as independent processes? Similarly, compartmentalizing the UI doesn't mean 'micro-frontends' that are stitched together. My understanding is that we want to compartmentalize build and test dependencies so that CI and code review can be done more efficiently. Is that what you had in mind?
>
> **Reply:** Pretty much, yes. I think the discussion about single vs multi repo is largely orthogonal to how things are organized. We have pretty bad tooling for both mono repo and multi repo at the moment, and I think in the absence of a clear direction I would personally probably keep the status quo for now. We can still have a separate conversation later if we want to move things more into one repo or into multiple repos.
>
> **Reply:** I feel we should clarify what we really want to achieve with a service infrastructure. I think different people have different ideas on the goal, and depending on the goal the solution is different. I would add another one, which is architecture sanity (when everything depends on everything there is no architecture).
>
> **Reply:** I think that covers it. (For me one hidden reason about the service discussion is that if we acknowledge that we already operate services, we would do a better job articulating the operational requirements. Today all services we create are one-off creations, which makes operating the entirety outside of the primary sentry.io installation challenging. Though that applies less to the monolith than other things we operate but which we might separate from the monolith in the future.)

### Service Declarations

The biggest limiting factor today in creating new services is the configuration of these services. When an engineer adds a new service, there is typically a followup PR against the internal ops repo to configure the service for the main Sentry installation. Then there is some sort of ad-hoc workaround to make the whole thing work in local development, on a single-tenant, or on a self-hosted installation. Each new service undergoes a new phase of rediscovering how suboptimal the rules and policies around scaling these services are.

> **Review comment:** Are we overloading 'services'? Do we need a different term for the processes that a product service needs in order to operate?
>
> **Reply:** Agreed, we should probably be more explicit about what those things are.
>
> **Reply:** Semantic nit maybe: should we start talking about domain boundaries and bounded contexts (from domain-driven design) instead of services?
>
> **Reply:** Totally agree that the first step to separating 'services' within the monolith or outside of it should start with bounded contexts and what consistency guarantees we need within each context.
>
> **Reply:** For the service separation within the monolith I'm trying to work something sensible out as a near-term roadmap.

Ideally we would be able to describe the services in a config file (YAML?) which can then be consumed by deploy tools to correctly provision and auto-scale services. This could either be a Sentry-specific format, or we could adopt something that has already been tested.
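
To make this slightly more concrete, a declaration and the code that consumes it could look roughly like the following; the schema, field names and the example service are entirely illustrative assumptions, not an existing format:

```python
# Illustrative sketch of a service declaration that deploy tooling could consume
# to provision and auto-scale a service.  The schema is an assumption.
from dataclasses import dataclass

import yaml

EXAMPLE_DECLARATION = """
name: ingest-consumer
kind: kafka-consumer
image: getsentry/sentry
command: ["sentry", "run", "ingest-consumer"]
autoscaling:
  min_replicas: 2
  max_replicas: 20
  target_cpu_percent: 70
"""


@dataclass
class ServiceDeclaration:
    name: str
    kind: str
    image: str
    command: list[str]
    autoscaling: dict


def load_declaration(text: str) -> ServiceDeclaration:
    """Parse a declaration that deploy tooling could provision from."""
    raw = yaml.safe_load(text)
    return ServiceDeclaration(**raw)


declaration = load_declaration(EXAMPLE_DECLARATION)
```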

### Serverless Function Declarations for Consumers

The smallest unit of execution at Sentry is not a Docker image but in fact a function. We currently have no real serverless setup, but we have a few functions that are run as a Kafka consumer, queue worker or similar. Like services, making these scale is quite tricky, and in some situations we needlessly have to spawn more processes and containers even though they could be colocated.

> **Review comment:** Related to the RabbitMQ vs Kafka discussion, but we already have this abstraction in Sentry: a Celery task. We can consider two almost orthogonal things here.

## Data Store

These are abstract changes to the data store. Largely this is not explored yet, but some things are known.

> **Review comment:** A few other ideas about data store improvements.

### High Cardinality Metrics

As we are extracting metrics data from the existing transaction system, we have effectively unlimited cardinality in the data stream coming in. The system currently applies various approaches to dealing with this problem, where a lot of this is done on the Relay side. This comes from the combination of the hard cardinality limit in Relay but also from attempting to not generate high cardinality data in the clients.

However, it's likely that we will be unable to reduce cardinality in the long run, and the data model should ideally be able to represent this high cardinality data.

> **Review comment:** Is supporting high cardinality metrics the first problem to solve? So, knowing that real-world data is high cardinality, shouldn't we first decide whether we change the product assumption (and then build a high cardinality storage) or whether we keep the product assumption and fix data cardinality instead?
>
> **Reply:** @fpacifici I think you're mostly thinking about transaction name, but people are also wishing for significantly more tags on performance metrics AND release health that the product would be able to deal with very well... but not the infrastructure. I think it would help to note down some examples here of concrete product features or queries that we would want to enable with this.
>
> **Reply:** Some examples of high cardinality data outside of URLs:

## Client Pipeline

These are changes we expect in the protocol and client behavior.

### Push Config into Clients

Certain features such as dynamic sampling benefit from being able to push settings down to clients. For instance, to turn profiling on and off at runtime, it does not help us to discard unwanted profiles on the server; we want to selectively turn on profiling. Other uses for this are optionally turning on minidump reporting for a small subset of users. This requires the ability to push config changes down to clients periodically via Relay.

> **Review comment:** I think there is an implication to our architecture coming from the idea of multiple smaller clusters that we are not covering here: operation and monitoring. Essentially the focus moves from the ability to run large scalable clusters to operating many clusters efficiently, where each has smaller scalability issues. There are two ways to address this problem (probably the two have to be addressed together). I think the above should be a requirement for the new Sentry architecture. Maybe this should be a work stream in its own right.