  • Start Date: 2022-11-21
  • RFC Type: feature
  • RFC PR: #38
  • RFC Status: approved

Summary

Make scrubbing of sensitive data in Sentry smarter, covering both security-related data (passwords, keys, etc.) and personally identifiable information (PII). This includes the SDK side (before we send data) and the Relay side (before we ingest data).

Motivation

Currently, the scrubbing of sensitive data works but is not very smart. Sensitive data can slip through and end up in our data storage. Scrubbing can also be too aggressive and remove too much data, so the feature destroys value for the customer.

We want a smarter way of scrubbing sensitive data, with more fine-grained control over what data is preserved and what data is scrubbed.

We do this to make sure we handle our users' sensitive data with the respect it deserves.

Background

A user complained that our data scrubbing did not remove sensitive data from a URL that the user's project made an HTTP request to. The sensitive data showed up in the span description in the Sentry UI.

Supporting Data

There was an incident where private data was leaked in an HTTP span: a URL including an access token in its query string was saved in span.description. The default data scrubbing mechanism was not running on the span description at that time. When default data scrubbing was activated on that field, it removed too much data, which had a negative impact on the value our product delivers. The fix was reverted and this RFC was started.

Conclusion

No single option will be implemented, but a combination of options:

  1. The SDKs will try to save structured data where possible (Option B).
  2. The SDKs will try to specify what kind of data is in span.description (and other places) as precisely as possible. (This can be done with more fine-grained span.ops or by applying the OTel trace semantic conventions to span.data. Needs to be specced out.)
  3. Relay uses the data from 2.) to parse content fields and scrub sensitive data (Option C).
  4. If there is no structured data (so 3.) is not possible), Relay applies an improved version of the current data scrubbing mechanism to prevent leaking sensitive data (combination of Options D+E).

Options Considered

Status Quo:

  • Right now most data scrubbing is done in Relay.
  • There is an option sendDefaultPII in SDKs that may or may not remove some sensitive data before sending.
  • In Relay, fields have a "pii" attribute that can be yes, no, or maybe:
    • yes means all the default data scrubbing regexes are applied to the field.
    • maybe means the data scrubbing regexes are not run by default. If the user has Advanced Data Scrubbing configured (in Sentry.io under Project Settings > Security & Privacy > Advanced Data Scrubbing), those custom rules are applied to the field, and if a rule matches, the whole content of the field is removed.
    • no means there is no possibility of data scrubbing for this field.
  • The regexes for data scrubbing are defined in Relay.
  • Some regexes can remove just the sensitive part of the content (like the IP and SSH key regexes).
  • Some of the regexes (like the password regex) remove the complete content of a field. This is because the data is unstructured: Relay does not know whether the content is a SQL query, a JSON object, a URL, an Elasticsearch/MongoDB/whatever query in JSON format, or something else (see the sketch after this list).
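To illustrate the difference between the two behaviors, here is a minimal Python sketch; the regexes and replacement markers are simplified stand-ins, not Relay's actual rules:

```python
import re

# Simplified stand-ins for two kinds of default rules (illustration only,
# not Relay's actual regexes).
IPV4_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
PASSWORD_KEY_RE = re.compile(r"password|passwd|secret", re.IGNORECASE)

def scrub_field(value: str) -> str:
    # Rules like the IP regex can surgically replace just the match ...
    value = IPV4_RE.sub("[ip]", value)
    # ... but the password-style rule has to assume the worst for
    # unstructured text and remove the entire field content.
    if PASSWORD_KEY_RE.search(value):
        return "[Filtered]"
    return value

print(scrub_field("GET /health from 10.0.0.1"))
# -> "GET /health from [ip]"
print(scrub_field("POST /reset?new_password=123456"))
# -> "[Filtered]"  (the whole description is lost)
```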

Option A): Remove Sensitive Data in SDKs

When an SDK sets data on an event, it knows what the data represents and can remove sensitive information at that point. This way, Relay has to trust that the SDK does the right thing and does not need to scrub data.

Pros:

Cons:

Option B): Store data in a structured way in SDKs

The content should not be a simple string but structured data: a template string with all sensitive values removed, plus a dictionary of values to insert into the string. For example, span.description should be a string with named parameters in the format select * from user where email=%(email)s; or POST /api/v1/update_password?new_password=%(new_password)s, and span.data should contain {"email": "[email protected]"} or { "new_password": "123456" } respectively. The same goes for breadcrumb.message/breadcrumb.data, logentry.message/logentry.params, and message.message/message.params.
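A minimal sketch of what this could look like, using the RFC's SQL example (the variable names are illustrative; this is not an existing SDK API):

```python
# Hypothetical SDK-side sketch of the proposed format: a template string
# with named parameters, plus the parameter values stored separately.
span_description = "select * from user where email=%(email)s"
span_data = {"email": "[email protected]"}

# Relay could then scrub only the sensitive values ...
scrubbed_data = {key: "[Filtered]" for key in span_data}

# ... while the template itself stays intact, and the full string can
# still be rendered where that is allowed:
print(span_description % span_data)      # ... where [email protected]
print(span_description % scrubbed_data)  # ... where email=[Filtered]
```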

We need to identify all the fields for which we need to do this.

We should model this option on how Logentry handles this today.

Note: If we change span.description, the hash for existing performance issues will change and existing performance issues will be recreated, so users would see duplicate performance issues in their list of issues. (We can simply document this in the CHANGELOG and everything should be fine.)
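To illustrate the grouping concern, using a simplified stand-in (a plain hash of the description) for Sentry's actual fingerprinting logic:

```python
import hashlib

def group_hash(description: str) -> str:
    # Simplified stand-in for Sentry's performance-issue fingerprinting.
    return hashlib.md5(description.encode()).hexdigest()[:8]

print(group_hash("select * from user where email='[email protected]'"))
print(group_hash("select * from user where email=%(email)s"))
# Different hashes, so the existing issue would be recreated as a duplicate.
```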

Pros:

  • Relay would not have to reverse-engineer the semantics of the information supplied by the SDK.

Cons:

  • Could be complex for nested JSON objects or for "Array of objects" kind of data.

Option C): Relay identifies what kind of data is present and parses it.

For each field, Relay can try to "guess" what kind of data it contains. Guessing can be done by looking at which field we are inspecting (field X always contains a SQL query), at span.op or other fields, or at the content itself. Once Relay is certain that the content is of a specific kind (a SQL query, a JSON object, an Elasticsearch query, a GraphQL query, a URL, ...), it can run a parser on it to scrub the values of sensitive fields.

An existing example of this is parsing URL query parameters into a separate field, which Relay does when normalizing the Request Interface.

Sentry's performance issue detection does something similar.

OpenTelemetry has semantic conventions for tracing: a defined set of attributes set on the span describes the span data in more detail. We could borrow those semantic conventions and add them to span.data so that Relay can better parse span.description: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/
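A rough Python sketch of the guess-then-parse idea described above; the span.op heuristics, the key list, and the function names are assumptions for illustration, not Relay's actual implementation:

```python
import json
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical key list; real rules would be far more extensive.
SENSITIVE_KEYS = {"password", "new_password", "access_token", "secret"}

def scrub_url(value: str) -> str:
    # Parse the URL properly and filter only the sensitive query values.
    parts = urlsplit(value)
    query = [(k, "[Filtered]" if k.lower() in SENSITIVE_KEYS else v)
             for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))

def scrub_json(value: str) -> str:
    obj = json.loads(value)
    return json.dumps({k: "[Filtered]" if k.lower() in SENSITIVE_KEYS else v
                       for k, v in obj.items()})

def guess_and_scrub(span_op: str, description: str) -> str:
    # Cheap guesses first: span.op, then surface features of the content.
    if span_op.startswith("http"):
        return scrub_url(description)
    if description.lstrip().startswith("{"):
        try:
            return scrub_json(description)
        except ValueError:
            pass  # not JSON after all, fall through
    return description  # unknown kind: leave it to the fallback scrubber

print(guess_and_scrub("http.client",
                      "https://api.example.com/login?user=jane&access_token=abc123"))
# -> https://api.example.com/login?user=jane&access_token=%5BFiltered%5D
```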

Pros:

  • The SDKs don't need to replicate the same logic.
  • The customers don't need to update their SDKs to benefit.

Cons:

  • Could be expensive to try multiple guesses before the right kind of data is identified. (Maybe it's SQL? No. Maybe JSON? No. So it is a URL? Yes.)

Option D): Generic tokenization in Relay.

Have a generic tokenizer in Relay that cannot parse full-fledged SQL but can extract key/value pairs out of almost everything. With this, the values of keys with potentially sensitive names can be removed.
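A minimal sketch of such a tokenizer in Python; the pattern and the sensitive-key list are illustrative assumptions, not Relay's implementation:

```python
import re

# Match loose key/value pairs in many syntaxes: key=value, key: value,
# "key": "value", key => value, ...
KV_RE = re.compile(
    r"""(?P<key>[\w.-]+)["']?\s*(?:=>?|:)\s*["']?(?P<value>[^\s,;&"']+)"""
)
SENSITIVE_KEY = re.compile(r"password|secret|token|auth", re.IGNORECASE)

def scrub(text: str) -> str:
    def _scrub_match(m: re.Match) -> str:
        if not SENSITIVE_KEY.search(m.group("key")):
            return m.group(0)
        # Replace only the value part of the matched pair.
        start = m.start("value") - m.start()
        end = m.end("value") - m.start()
        return m.group(0)[:start] + "[Filtered]" + m.group(0)[end:]
    return KV_RE.sub(_scrub_match, text)

# Works on URLs, JSON-ish snippets, and other loosely structured text:
print(scrub("POST /reset?new_password=123456&user=jane"))
# -> POST /reset?new_password=[Filtered]&user=jane
print(scrub('{"password": "hunter2", "email": "[email protected]"}'))
# -> {"password": "[Filtered]", "email": "[email protected]"}
```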

Pros:

  • The SDKs don't need to replicate the same logic.
  • The customers don't need to update their SDKs to benefit.

Cons:

NEW! Option E): Improved regexes

Keep the current data scrubbing logic, but improve the regexes to be more specific. In particular, the "password regex" could be changed so that the auth rule ONLY matches auth but NOT author or authorize.

With this we could add data scrubbing back to span.description (and potentially other fields that are marked with pii=maybe right now).
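For example (a minimal sketch; the actual Relay rules are more involved):

```python
import re

# Current behaviour (simplified): a bare substring match on "auth" also
# hits harmless words like "author" or "authorize".
loose = re.compile(r"auth", re.IGNORECASE)

# Improved: match "auth" only as a whole token (illustrative sketch, not
# Relay's actual rule).
strict = re.compile(r"\bauth\b", re.IGNORECASE)

for text in ["auth=abc123", "author=Jane", "authorized request"]:
    print(text, bool(loose.search(text)), bool(strict.search(text)))
# auth=abc123         True True
# author=Jane         True False
# authorized request  True False
```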

Pros:

  • The SDKs don't need to replicate the same logic.
  • The customers don't need to update their SDKs to benefit.
  • Least effort for a quick (but not very substantial) improvement.

Cons:

Drawbacks

There is always a tradeoff between:

  • If we scrub too much data (sensitive or not), we diminish the data we can give users to fix their problems, thus degrading the value of the product.
  • If we scrub too little data, we leak our users' sensitive data.

Unresolved questions

  • We need to check with legal and/or security to make sure that what we are planning is actually compatible with existing laws and regulations.
  • We need to find all the places in SDKs where sensitive data could appear. Places we are targeting right now: span.description/span.data, breadcrumb.message/breadcrumb.data, logentry.message/logentry.params, message.message/message.params, local variables, request bodies, response bodies, HTTP headers, cookies, ...