
Scrubbing sensitive data #38

Merged: 36 commits into main on Nov 24, 2022
Conversation

@antonpirker (Member) commented Nov 21, 2022

Make scrubbing of sensitive data in Sentry smarter. This covers security-related data (passwords, keys, et al.) and PII (personally identifiable information).

Rendered RFC

@jjbayer (Member) left a comment

I think we should do both B and C, potentially also D. SDKs should provide as much structure as possible, but if they can't (or for old SDKs), Relay should have some logic to parse key-value pairs out of strings.

@antonpirker (Member, Author) commented Nov 21, 2022

@jjbayer's suggestion is a multi-layer data scrubbing approach (rough sketch after the list):

  • SDKs store the data in a structured way (option B)
  • Relay then parses this structured data and removes sensitive information (option B and, in part, option C)
  • If unstructured data flows in from old SDKs, Relay tries to identify and scrub it (option C)
  • If Relay cannot identify the content, it falls back to plain-text matching as it does now
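
A rough, hedged sketch of that cascade (illustrative only; Relay itself is written in Rust, and the key list, helpers, and patterns below are made up for the example):

```python
import json
import re

SENSITIVE_KEYS = {"password", "token", "secret", "api_key"}

def scrub_structured(data: dict) -> dict:
    # Option B: the SDK sent key/value pairs, so we can scrub per key.
    return {k: "[Filtered]" if k.lower() in SENSITIVE_KEYS else v for k, v in data.items()}

def scrub_plain_text(value: str) -> str:
    # Status quo fallback: blunt regex matching on the raw string.
    return re.sub(r"(password|token|secret|api_key)=[^&\s]+", r"\1=[Filtered]", value)

def scrub(value, structured=None):
    if structured is not None:
        return scrub_structured(structured)   # option B: structured data from the SDK
    try:
        parsed = json.loads(value)            # option C (simplified): recognize JSON
    except (ValueError, TypeError):
        parsed = None
    if isinstance(parsed, dict):
        return scrub_structured(parsed)
    return scrub_plain_text(value)            # last resort: plain-text matching, like now
```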



### Option B): Store data in a structured way in SDKs
Contributor

The downside here is that not all the integrations have access to the template needed to parse it correctly.
Same issue as Relay:

Relay does not know if the content is a SQL query, a JSON object, a URL, an Elasticsearch/MongoDB/whatever query in JSON format, or something else.

SDKs would know this is an HTTP request or a SQL query, but sometimes that's all we know.
Relay would know whether it's HTTP or SQL because the op is standardized; can we rely on that?

Contributor

The other point is that it'd only work for auto-instrumentation (because we apply the structured data) or when people follow our guidelines. But when they don't, there might still be PII; what do we do in such cases?

Member Author

We could improve the ops for Relay so it can identify what kind of data is included in the span more easily.

Member Author

@AbhiPrasad also had the idea to add the OTel "semantic conventions" for describing spans to our span.data: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/
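
For illustration, a hedged sketch of what that could look like in the Python SDK, assuming the existing start_span/set_data API; the attribute names come from the OTel spec, not from anything Sentry has standardized yet:

```python
import sentry_sdk

with sentry_sdk.start_span(op="db", description="SELECT * FROM users WHERE email = %s") as span:
    # OTel semantic-convention attributes describing what kind of data the span carries,
    # so Relay knows it is looking at a SQL statement.
    span.set_data("db.system", "postgresql")
    span.set_data("db.operation", "SELECT")
    span.set_data("db.sql.table", "users")
```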

Member Author

In general, Relay should never assume that fields set by the user contain absolutely no sensitive data.


- Could be complex for nested JSON objects or for "Array of objects" kind of data.

### Option C): Relay identifies what kind of data is present and parses it.
Contributor

I like this option because we don't run into this issue https://github.com/getsentry/rfcs/pull/38/files#r1029153081


_Cons:_

- Could be expensive to try multiple guesses before the right kind of data is identified. (Maybe it's SQL? No. Maybe JSON? No. So it's a URL? Yes.)


### Option D): Generic tokenization in Relay.

Have a generic tokenizer in Relay that cannot parse full-fledged SQL, but can extract key/value pairs out of almost everything. With this, the values of keys with potentially sensitive information can be removed.
@marandaneto (Contributor) commented Nov 22, 2022
Likely the most expensive one, but it is the one that would be implemented in Relay (a single source of truth for data scrubbing), and likely the one that degrades the product the least (scrubbing some key/value pairs rather than the full URL).
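
A minimal sketch of the Option D idea, assuming nothing about Relay's actual implementation (which is in Rust); the key list and pattern are purely illustrative:

```python
import re

SENSITIVE_KEYS = {"password", "passwd", "secret", "token", "api_key", "authorization"}

# Key/value pairs joined by '=' or ':', e.g. "token=123" or "password: hunter2".
KV_PATTERN = re.compile(r"(?P<key>[A-Za-z_][\w.-]*)\s*[=:]\s*(?P<value>[^\s&,;]+)")

def scrub(text: str) -> str:
    def replace(match: re.Match) -> str:
        # Replace only the value of keys that look sensitive; keep everything else.
        if match.group("key").lower() in SENSITIVE_KEYS:
            return f'{match.group("key")}=[Filtered]'
        return match.group(0)
    return KV_PATTERN.sub(replace, text)

print(scrub("GET /api/login?token=123&user=jane"))
# -> GET /api/login?token=[Filtered]&user=jane
```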



### NEW! Option E): Improved regexes
@marandaneto (Contributor) commented Nov 22, 2022

I like this idea in general; I just think that we should not scrub the whole content (e.g. a URL) but rather only the matched items, whenever tokenization is possible.
For example, /api/login?token=123 becomes /api/login?token=$token; there are some indicators such as /, =, ,, etc.

Contributor

Very similar to Option D IMO.

Member

I like this option as it's the most generic one.

Member

I think the difference between D and E is that E will still replace the entire string, but it will fire less often because the matching condition (the regex) is more specific.
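
To make that distinction concrete, a hedged sketch of the two behaviours on the same input (the `token=` rule is hypothetical, not an actual Relay rule):

```python
import re

TOKEN_RULE = re.compile(r"\btoken=[^&\s]+")

def scrub_option_e(text: str) -> str:
    # Option E: a more specific regex decides *whether* to scrub,
    # but the whole string is still replaced when it fires.
    return "[Filtered]" if TOKEN_RULE.search(text) else text

def scrub_option_d(text: str) -> str:
    # Option D: match the key/value pair and replace only the sensitive value.
    return TOKEN_RULE.sub("token=[Filtered]", text)

url = "/api/login?token=123&user=jane"
print(scrub_option_e(url))  # [Filtered]
print(scrub_option_d(url))  # /api/login?token=[Filtered]&user=jane
```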


(tbd)

- What parts of the design do you expect to resolve through this RFC?
@marandaneto (Contributor) commented Nov 22, 2022

For me, it's unclear in this RFC whether we should focus on scrubbing in Relay or whether we don't even want the data coming into Sentry (in which case the focus should be on the SDKs).

Member Author

In this RFC we just want to find out which way we want to move in the future. As it seems now, most of the engineers are leaning towards Relay doing the scrubbing, but they are also on board with the SDKs providing better-formatted data to make it easier for Relay to scrub.

@marandaneto (Contributor) commented Nov 22, 2022

In general, I believe that Relay should do its job.
SDKs can be good citizens and do tokenization when possible.
SDKs often run into issues similar to Relay's: we have a String and we know it is the URL, but it's not a template. The same goes for DB queries: we have a String and we know it is the statement, but it's not a template.
By focusing on the Relay implementation, we guarantee that data scrubbing is also applied to custom instrumentation, when people don't follow the guidelines for what goes into the desc, op, data, etc. Or, if they set that data manually, does it not matter? Not sure.

@smeubank (Member) commented Nov 22, 2022

@AbhiPrasad suggests using OTel semantic conventions for the additional span field, so SDKs and Relay can rely on the same convention for determining what the source is, e.g. db.query.sql vs db.query.mongoos.

@cleptric: ensure we clarify that users can opt out of all of this.

Not one option will be implemented but a combination of options:

1. The SDKs will try to save structured data where possible (Option B)
2. The SDKs will try to specify what kind of data is in `span.description` (and other places) as well as possible. (Can be done via more fine-grained `span.op`s or by applying the OTel trace semantic conventions to `span.data`. Needs to be spec'd out.)
Contributor

Breadcrumbs already use structured data for HTTP crumbs: https://develop.sentry.dev/sdk/event-payloads/breadcrumbs/
Relay can already do 3. and 4.
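
For reference, a hedged sketch of such a structured HTTP breadcrumb as sent from the Python SDK (field names per the develop docs linked above; the URL and category are made up):

```python
import sentry_sdk

sentry_sdk.add_breadcrumb(
    type="http",
    category="httplib",
    data={
        # Because URL, method and status code are separate fields,
        # Relay can scrub individual values instead of parsing a free-form message.
        "url": "https://example.com/api/login",
        "method": "GET",
        "status_code": 200,
    },
)
```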

Member Author

Yes, @untitaker told me that the plan was (in the long run) to have structured data everywhere, but somehow this plan was never fully executed across all SDKs and data fields. So this is our chance to make the plan become reality.

Member Author

But the things that Relay does now for 3. and 4. need some improvements (like you commented here: #38 (comment)).


# Drawbacks

(none)
Member

We should at least mention that this could have an impact on the product.

Member

Yes, adding/removing scrubbing on spans can have an impact on performance-issue grouping, at the very least.

Member Author

I already checked with the performance issues team and they think it should not have an impact at all:

[Screenshot of the performance issues team's reply, 2022-11-23]

Member Author

@cleptric what impacts do you have in mind?

Member

This might degrade the usefulness of the product or could have unknown impact if we accidentally scrub too much.

@jjbayer (Member) left a comment

I feel like Options D and E are immediately actionable. Option B would be a nice-to-have.



@antonpirker (Member, Author):

What do you think, @jjbayer (and everyone else)? Should we:

  1. make the regexes more specific and remove the ENTIRE content
  2. make the regexes more specific and remove just the sensitive part (matching some delimiter chars)
  3. start with 1) and then improve to 2)

@philipphofmann (Member) left a comment

Most of my comments are nitpicks. I used the LOGAF scale to show you how important a comment is.

  • l: Low - Nitpick
  • m: Medium - worth having a look
  • h: High - important comment

Not one option will be implemented but a combination of options:

1. The SDKs will try to save structured data where possible (Option B)
2. The SDKs will try to specify what kind of data is in `span.description` (and other places) as well as possible. (Can be done via more fine-grained `span.op`s or by applying the OTel trace semantic conventions to `span.data`. Needs to be spec'd out.)
Member

h: If the SDK already uses structured data, should it also specify what kind of data is in span.description? If yes, how do we prevent Relay from changing the already structured data?

Member

As discussed in TSC, I think we can clarify the details in a develop docs PR, @antonpirker. Thanks for taking care of this RFC 👏😀.

@jjbayer (Member) commented Nov 23, 2022

  1. make the regexes more specific and remove the ENTIRE content
  2. make the regexes more specific and remove just the sensitive part (matching some delimiter chars)
  3. start with 1) and then improve to 2)

@antonpirker I think that "matching some delimiter chars" essentially comes down to Option D. It would be very hard to generalize "apply regex only to a section of the string". For example, if we declare whitespace to be a delimiter, we would not catch the token in Authorization: Bearer <token>.

That said, I actually think Option D will bring us more value than Option E, so I would start with D.
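
A small illustration of the whitespace-delimiter problem @jjbayer describes (the regex and header value are hypothetical): splitting on whitespace detaches the token from the `Authorization` key, so a per-segment key/value regex never flags it.

```python
import re

# Only matches when key and value sit in the same whitespace-delimited segment.
KV = re.compile(r"^(?P<key>[\w-]+)[=:](?P<value>.+)$")

header = "Authorization: Bearer abc123"
for segment in header.split():
    print(segment, "->", "matched" if KV.match(segment) else "no key/value pair found")
# Authorization: -> no key/value pair found
# Bearer -> no key/value pair found
# abc123 -> no key/value pair found
```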

@antonpirker (Member, Author):

I have now updated the conclusion to point out that we are implementing option D) rather than option E): https://github.com/getsentry/rfcs/blob/antonpirker/scrubbing-sensitive-data/text/0038-scrubbing-sensitive-data.md#conclusion

@jjbayer are you confident that we can implement a generic tokenizer that can extract key/value pairs out of any unstructured data without leaking sensitive data?

@jjbayer (Member) commented Nov 23, 2022

@jjbayer are you confident that we can implement a generic tokenizer that can extract key/value pairs out of any unstructured data without leaking sensitive data?

@antonpirker Actually, I am confident that we cannot. Data scrubbing remains a best-effort deal. But I think we can improve the status quo of not scrubbing certain fields at all.

antonpirker marked this pull request as ready for review on November 24, 2022, 08:36
@antonpirker (Member, Author):

Some details were discussed yesterday in the Client Infra TSC meeting, and the RFC was updated accordingly.
This RFC has been approved by the TSC and can be merged.

antonpirker merged commit d47ddcc into main on Nov 24, 2022