
Scrubbing sensitive data #38

Merged: 36 commits into main on Nov 24, 2022
Conversation

@antonpirker (Member) commented Nov 21, 2022

Make scrubbing of sensitive data in Sentry smarter. This covers security-related data (passwords, keys, et al.) and PII (personally identifiable information).

Rendered RFC

@jjbayer (Member) left a comment

I think we should do both B and C, potentially also D. SDKs should provide as much structure as possible, but if they can't (or for old SDKs), Relay should have some logic to parse key-value pairs out of strings.

@antonpirker (Member, Author) commented Nov 21, 2022

@jjbayer's suggestion is a multi-layer data scrubbing approach (rough sketch after the list):

  • SDKs store the data in a structured way (option B)
  • Relay then parses this structured data and removes sensitive information (option B and, in part, option C)
  • If unstructured data flows in from old SDKs, Relay tries to identify and scrub it (option C)
  • If Relay cannot identify the content, it falls back to plain-text matching as it does now
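
A rough, hedged sketch of that cascade (illustrative only; Relay itself is written in Rust, and the key list, helpers, and patterns below are made up for the example):

```python
import json
import re

SENSITIVE_KEYS = {"password", "token", "secret", "api_key"}

def scrub_structured(data: dict) -> dict:
    # Option B: the SDK sent key/value pairs, so we can scrub per key.
    return {k: "[Filtered]" if k.lower() in SENSITIVE_KEYS else v for k, v in data.items()}

def scrub_plain_text(value: str) -> str:
    # Status quo fallback: blunt regex matching on the raw string.
    return re.sub(r"(password|token|secret|api_key)=[^&\s]+", r"\1=[Filtered]", value)

def scrub(value, structured=None):
    if structured is not None:
        return scrub_structured(structured)   # option B: structured data from the SDK
    try:
        parsed = json.loads(value)            # option C (simplified): recognize JSON
    except (ValueError, TypeError):
        parsed = None
    if isinstance(parsed, dict):
        return scrub_structured(parsed)
    return scrub_plain_text(value)            # last resort: plain-text matching, like now
```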



### Option B): Store data in a structured way in SDKs
Contributor

The downside here is that not all the integrations have access to the template needed to parse it correctly.
Same issue as Relay:

Relay does not know if the content is a SQL query, a JSON object, a URL, an Elasticsearch/MongoDB/whatever query in JSON format, or something else.

SDKs would know this is an HTTP request or a SQL query, but sometimes that's all we know.
Relay would know whether it's HTTP or SQL because the op is standardized; can we rely on that?

Contributor

The other point is that it'd only work for auto-instrumentation (because we apply the structured data) or when people follow our guidelines. But when they don't, there might still be PII; what do we do in such cases?

Member Author

We could improve the ops for Relay so it can identify what kind of data is included in the span more easily.

Member Author

@AbhiPrasad also had the idea to add the OTel "semantic conventions" for describing spans to our span.data: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/
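
For illustration, a hedged sketch of what that could look like in the Python SDK, assuming the existing start_span/set_data API; the attribute names come from the OTel spec, not from anything Sentry has standardized yet:

```python
import sentry_sdk

with sentry_sdk.start_span(op="db", description="SELECT * FROM users WHERE email = %s") as span:
    # OTel semantic-convention attributes describing what kind of data the span carries,
    # so Relay knows it is looking at a SQL statement.
    span.set_data("db.system", "postgresql")
    span.set_data("db.operation", "SELECT")
    span.set_data("db.sql.table", "users")
```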

Member Author

In general, Relay should never assume that fields set by the user contain absolutely no sensitive data.


- Could be complex for nested JSON objects or for "Array of objects" kind of data.

### Option C): Relay identifies what kind of data is present and parses it.
Contributor

I like this option because we don't run into this issue https://github.com/getsentry/rfcs/pull/38/files#r1029153081


_Cons:_

- Could be expensive to try multiple guesses before the right kind of data is identified. (Maybe it's SQL? No. Maybe JSON? No. So it's a URL? Yes.)


### Option D): Generic tokenization in Relay.

Have a generic tokenizer in Relay that cannot parse full-fledged SQL, but can extract key/value pairs out of almost everything. With this, the values of keys with potentially sensitive information can be removed.
@marandaneto (Contributor) commented Nov 22, 2022
Likely the most expensive one, but it is the one that would be implemented in Relay (a single source of truth for data scrubbing), and likely the one that degrades the product the least (scrubbing some key/value pairs rather than the full URL).
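
A minimal sketch of the Option D idea, assuming nothing about Relay's actual implementation (which is in Rust); the key list and pattern are purely illustrative:

```python
import re

SENSITIVE_KEYS = {"password", "passwd", "secret", "token", "api_key", "authorization"}

# Key/value pairs joined by '=' or ':', e.g. "token=123" or "password: hunter2".
KV_PATTERN = re.compile(r"(?P<key>[A-Za-z_][\w.-]*)\s*[=:]\s*(?P<value>[^\s&,;]+)")

def scrub(text: str) -> str:
    def replace(match: re.Match) -> str:
        # Replace only the value of keys that look sensitive; keep everything else.
        if match.group("key").lower() in SENSITIVE_KEYS:
            return f'{match.group("key")}=[Filtered]'
        return match.group(0)
    return KV_PATTERN.sub(replace, text)

print(scrub("GET /api/login?token=123&user=jane"))
# -> GET /api/login?token=[Filtered]&user=jane
```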



### NEW! Option E): Improved regexes
@marandaneto (Contributor) commented Nov 22, 2022

I like this idea in general; I just think that we should not scrub the whole content (e.g. a URL) but rather only the matched items, whenever tokenization is possible.
For example, /api/login?token=123 becomes /api/login?token=$token; there are some indicators such as /, =, ,, etc.

Contributor

Very similar to Option D IMO.

Member

I like this option as it's the most generic one.

Member

I think the difference between D and E is that E will still replace the entire string, but it will fire less often because the matching condition (the regex) is more specific.
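
To make that distinction concrete, a hedged sketch of the two behaviours on the same input (the `token=` rule is hypothetical, not an actual Relay rule):

```python
import re

TOKEN_RULE = re.compile(r"\btoken=[^&\s]+")

def scrub_option_e(text: str) -> str:
    # Option E: a more specific regex decides *whether* to scrub,
    # but the whole string is still replaced when it fires.
    return "[Filtered]" if TOKEN_RULE.search(text) else text

def scrub_option_d(text: str) -> str:
    # Option D: match the key/value pair and replace only the sensitive value.
    return TOKEN_RULE.sub("token=[Filtered]", text)

url = "/api/login?token=123&user=jane"
print(scrub_option_e(url))  # [Filtered]
print(scrub_option_d(url))  # /api/login?token=[Filtered]&user=jane
```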


(tbd)

- What parts of the design do you expect to resolve through this RFC?
@marandaneto (Contributor) commented Nov 22, 2022

For me, it's unclear in this RFC whether we should focus on scrubbing in Relay or whether we don't even want the data coming into Sentry (in which case the focus should be on the SDKs).

Member Author

In this RFC we just want to find out which way we want to move in the future. As it seems now, most of the engineers are leaning towards Relay doing the scrubbing, but they are also on board with the SDKs providing better-formatted data to make it easier for Relay to scrub.

@marandaneto (Contributor) commented Nov 22, 2022

In general, I believe that Relay should do its job.
SDKs can be good citizens and do tokenization when possible.
SDKs often run into issues similar to Relay's: we have a String and we know it is the URL, but it's not a template. The same goes for DB queries: we have a String and we know it is the statement, but it's not a template.
By focusing on the Relay implementation, we guarantee that data scrubbing is also applied to custom instrumentation, when people don't follow the guidelines for what goes into the desc, op, data, etc. Or, if they set that data manually, does it not matter? Not sure.

@smeubank (Member) commented Nov 22, 2022

@AbhiPrasad suggests using OTel semantic conventions for the additional span field, so SDKs and Relay can rely on the same convention for determining what the source is, e.g. db.query.sql vs db.query.mongoos.

@cleptric: ensure we clarify that users can opt out of all of this.

Not one option will be implemented but a combination of options:

1. The SDKs will try to save structured data where possible (Option B)
2. The SDKs will try to specify what kind of data is in `span.description` (and other places) as well as possible. (Can be done via more fine-grained `span.op`s or by applying the OTel trace semantic conventions to `span.data`. Needs to be spec'd out.)
Contributor

Breadcrumbs already use structured data for HTTP crumbs: https://develop.sentry.dev/sdk/event-payloads/breadcrumbs/
Relay can already do 3. and 4.
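
For reference, a hedged sketch of such a structured HTTP breadcrumb as sent from the Python SDK (field names per the develop docs linked above; the URL and category are made up):

```python
import sentry_sdk

sentry_sdk.add_breadcrumb(
    type="http",
    category="httplib",
    data={
        # Because URL, method and status code are separate fields,
        # Relay can scrub individual values instead of parsing a free-form message.
        "url": "https://example.com/api/login",
        "method": "GET",
        "status_code": 200,
    },
)
```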

Member Author

Yes, @untitaker told me that the plan was (in the long run) to have structured data everywhere, but somehow this plan was never fully executed across all SDKs and data fields. So this is our chance to make the plan become reality.

Member Author

But the things that Relay does now for 3. and 4. need some improvements (like you commented here: #38 (comment)).


# Drawbacks

(none)
Member

We should at least mention that this could have an impact on the product.

Member

Yes, adding/removing scrubbing on spans can have an impact on performance-issue grouping, at the very least.

Member Author

I already checked with the performance issues team and they think it should not have an impact at all:

[Screenshot of the performance issues team's reply, 2022-11-23]

Member Author

@cleptric what impacts do you have in mind?

Member

This might degrade the usefulness of the product or could have unknown impact if we accidentally scrub too much.

@jjbayer (Member) left a comment

I feel like Options D and E are immediately actionable. Option B would be a nice-to-have.



@antonpirker (Member, Author):

What do you think, @jjbayer (and everyone else)? Should we:

  1. make the regexes more specific and remove the ENTIRE content
  2. make the regexes more specific and remove just the sensitive part (matching some delimiter chars)
  3. start with 1) and then improve to 2)

@philipphofmann (Member) left a comment

Most of my comments are nitpicks. I used the LOGAF scale to show you how important a comment is.

  • l: Low - Nitpick
  • m: Medium - worth having a look
  • h: High - important comment

Not one option will be implemented but a combination of options:

1. The SDKs will try to save structured data where possible (Option B)
2. The SDKs will try to specify what kind of data is in `span.description` (and other places) as well as possible. (Can be done via more fine-grained `span.op`s or by applying the OTel trace semantic conventions to `span.data`. Needs to be spec'd out.)
Member

h: If the SDK already uses structured data, should it also specify what kind of data is in span.description? If yes, how do we prevent Relay from changing the already structured data?

Member

As discussed in TSC, I think we can clarify the details in a develop docs PR, @antonpirker. Thanks for taking care of this RFC 👏😀.

@jjbayer (Member) commented Nov 23, 2022

  1. make the regexes more specific and remove the ENTIRE content
  2. make the regexes more specific and remove just the sensitive part (matching some delimiter chars)
  3. start with 1) and then improve to 2)

@antonpirker I think that "matching some delimiter chars" essentially comes down to Option D. It would be very hard to generalize "apply regex only to a section of the string". For example, if we declare whitespace to be a delimiter, we would not catch the token in Authorization: Bearer <token>.

That said, I actually think Option D will bring us more value than Option E, so I would start with D.
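
A small illustration of the whitespace-delimiter problem @jjbayer describes (the regex and header value are hypothetical): splitting on whitespace detaches the token from the `Authorization` key, so a per-segment key/value regex never flags it.

```python
import re

# Only matches when key and value sit in the same whitespace-delimited segment.
KV = re.compile(r"^(?P<key>[\w-]+)[=:](?P<value>.+)$")

header = "Authorization: Bearer abc123"
for segment in header.split():
    print(segment, "->", "matched" if KV.match(segment) else "no key/value pair found")
# Authorization: -> no key/value pair found
# Bearer -> no key/value pair found
# abc123 -> no key/value pair found
```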

@antonpirker (Member, Author):

I have now updated the conclusion to point out that we are implementing option D) rather than option E): https://github.com/getsentry/rfcs/blob/antonpirker/scrubbing-sensitive-data/text/0038-scrubbing-sensitive-data.md#conclusion

@jjbayer are you confident that we can implement a generic tokenizer that can extract key/value pairs out of any unstructured data without leaking sensitive data?

@jjbayer (Member) commented Nov 23, 2022

@jjbayer are you confident that we can implement a generic tokenizer that can extract key/value pairs out of any unstructured data without leaking sensitive data?

@antonpirker Actually, I am confident that we cannot. Data scrubbing remains a best-effort deal. But I think we can improve the status quo of not scrubbing certain fields at all.

antonpirker marked this pull request as ready for review on November 24, 2022, 08:36
@antonpirker (Member, Author):

Some details were discussed yesterday in the Client Infra TSC meeting, and the RFC was updated accordingly.
This RFC has been approved by the TSC and can be merged.

antonpirker merged commit d47ddcc into main on Nov 24, 2022