Background

We want to build a privacy-first data clean room solution in which data can move out of sources in a way that is trustworthy enough for customers to rely on.

Glossary

  1. Collaborator:
  1. Source:
  1. Transformation: A Transformation is a function with an owner $c \isin C$. A transformation $t_c$ can be defined as:

    1. $t_c: P(\tau) \setminus \phi \rightarrow \tau \times ( T \cup \{\phi\})$
    2. $\tau \times \{\phi\} = \tau$

    where $\phi$ is a null transformation, $C$ is the set of all Collaborators, $S$ is the set of all Sources, $T$ is the set of all Transformations, and $\tau$ is the set of all tables.
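The signature above can be sketched in code. This is a minimal illustration, not a definitive implementation: it assumes the codomain $\tau \times (T \cup \{\phi\})$ reads as "an output table plus an optional follow-up transformation", with `None` standing in for $\phi$. The `Table` alias, the `Transformation` class, and `union_rows` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

# Hypothetical stand-in for an element of tau: a table as a tuple of rows.
Table = Tuple[Tuple[str, ...], ...]


@dataclass(frozen=True)
class Transformation:
    """Sketch of t_c: a function with an owner c."""
    owner: str
    fn: Callable[[FrozenSet[Table]], Tuple[Table, Optional["Transformation"]]]

    def __call__(self, tables):
        tables = frozenset(tables)
        # The domain is P(tau) without the empty set.
        if not tables:
            raise ValueError("a transformation needs at least one input table")
        return self.fn(tables)


def union_rows(tables):
    """Example transformation body: union of all rows, no follow-up."""
    rows = sorted({row for table in tables for row in table})
    # Returning None models phi, so the result is effectively just a table.
    return tuple(rows), None


a = (("x", "1"),)
b = (("y", "2"),)
t = Transformation(owner="alice", fn=union_rows)
table, follow_up = t({a, b})
```

Returning `None` as the second component mirrors rule 2: when the follow-up is the null transformation, the result is effectively just the table.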

  2. Destination: A Destination with an owner $c \isin C$ can be defined as $d_c \subseteq D \times T$, where $C$ is the set of all Collaborators, $T$ is the set of all Transformations, and $D \subseteq \tau$, where $\tau$ is the set of all tables.

  3. Data Access Grants: Data Access Grants (access controls) are sets of permissions of the following kinds:

    1. Source Data Access Grants:
      1. transformations_allowed : the source permits a transformation owned by another collaborator to use the source.
      2. destinations_allowed : the source grants a tuple of (destination, transformation) to a destination owner. The transformation in the tuple refers to the output of that transformation.
    2. Transformation Data Access Grants:
      1. destinations_allowed : allows a destination owner to request the transformation's output for a destination.
      2. A transformation does not need to specify sources_allowed as it already explicitly asks for sources.
      3. A Transformation should have a result property, defined as either a table or an error.
    3. A Destination requires Data Access Grants from both the Transformation and the Sources before it can run.
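The grant rules above can be sketched as a single check. This is a hedged illustration under stated assumptions: the classes `SourceGrants` and `TransformationGrants` and the function `destination_may_run` are hypothetical names, and the rule that a source owner's own transformations need no `transformations_allowed` entry is an assumption inferred from the phrase "from another collaborator".

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class SourceGrants:
    name: str
    owner: str
    transformations_allowed: Set[str] = field(default_factory=set)
    # Tuples of (destination, transformation), as in the glossary.
    destinations_allowed: Set[Tuple[str, str]] = field(default_factory=set)


@dataclass
class TransformationGrants:
    name: str
    owner: str
    sources: Tuple[str, ...]  # sources the transformation explicitly asks for
    destinations_allowed: Set[str] = field(default_factory=set)


def destination_may_run(destination: str,
                        transformation: TransformationGrants,
                        sources: Iterable[SourceGrants]) -> bool:
    """True only if the transformation and every source it uses grant access."""
    by_name = {s.name: s for s in sources}
    if destination not in transformation.destinations_allowed:
        return False
    for source_name in transformation.sources:
        source = by_name.get(source_name)
        if source is None:
            return False
        # Assumption: only transformations of *other* collaborators need a grant.
        if (source.owner != transformation.owner
                and transformation.name not in source.transformations_allowed):
            return False
        if (destination, transformation.name) not in source.destinations_allowed:
            return False
    return True
```

A destination that is missing either the transformation-side grant or any one source-side grant is rejected.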

    A Source also gives permissions at column-level granularity. The following fields (non-exhaustive) need to be specified for DP SQL Queries:

    1. selectable: true | false
      1. Whether the column can be used in a select statement. For example, emails cannot be selected directly.
    2. aggregates_allowed: A list of aggregate function types.
      1. The kinds of aggregate functions allowed to run on a particular column. These can be private or non-private aggregates.
    3. join_key: true | false
      1. Whether the column can be used as a join key.
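The column-level fields above could be modelled as follows. This is a minimal sketch, assuming a per-column policy object. The field names come from the list above, while `ColumnPolicy`, `column_use_allowed`, and the `"select"`/`"aggregate"`/`"join"` usage labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ColumnPolicy:
    selectable: bool = False
    aggregates_allowed: FrozenSet[str] = frozenset()
    join_key: bool = False


def column_use_allowed(policy: ColumnPolicy, use: str,
                       aggregate: Optional[str] = None) -> bool:
    """Check one intended use of a column against its policy."""
    if use == "select":
        return policy.selectable
    if use == "join":
        return policy.join_key
    if use == "aggregate":
        return aggregate in policy.aggregates_allowed
    raise ValueError(f"unknown column use: {use!r}")


# An email column: usable as a join key and countable, but never selectable.
email = ColumnPolicy(selectable=False,
                     aggregates_allowed=frozenset({"count"}),
                     join_key=True)
```

Defaulting every field to the most restrictive value means a column with no explicit policy grants nothing, which seems the safer choice for a privacy-first design.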
  4. Trust Group: If the transformation is a Private SQL Query, the Collaborators who give data access grants to the same type of transformation form a trust group. Only one member of the trust group is allowed to define transformation_allowed/noise parameters and to define destinations for that transformation.

  5. Collaboration: A Collaboration happens when all collaborators send their packages to the clean room service and drive their analysis/insights through it. To start a collaboration, a folder containing the collaboration packages of all the collaborators involved is provided as input to the DCR App.

  6. Collaboration Package: A collaboration package consists of

    1. sources
    2. transformations
    3. destinations

    Together, these contain enough context metadata to create a collaboration event successfully. Each collaborator has its own collaboration package.
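One way to picture a collaboration package and the input folder is sketched below. The on-disk format is not specified in this document, so the layout used here (one JSON file per collaborator, named after the collaborator, with three top-level keys) is purely an assumption. `CollaborationPackage`, `parse_package`, and `load_collaboration` are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List

REQUIRED_SECTIONS = ("sources", "transformations", "destinations")


@dataclass
class CollaborationPackage:
    collaborator: str
    sources: List[Dict[str, Any]]
    transformations: List[Dict[str, Any]]
    destinations: List[Dict[str, Any]]


def parse_package(collaborator: str, raw: Dict[str, Any]) -> CollaborationPackage:
    """Build a package, rejecting input that lacks any of the three sections."""
    missing = [key for key in REQUIRED_SECTIONS if key not in raw]
    if missing:
        raise ValueError(f"package for {collaborator!r} is missing: {missing}")
    return CollaborationPackage(collaborator, raw["sources"],
                                raw["transformations"], raw["destinations"])


def load_collaboration(folder: str) -> List[CollaborationPackage]:
    """Assumed layout: <folder>/<collaborator>.json, one file per collaborator."""
    packages = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as fh:
            packages.append(parse_package(path.stem, json.load(fh)))
    return packages
```
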

  7. Collaboration Graph: A Collaboration Graph is a Directed Acyclic Graph (DAG) that defines the relationships between Source Tables, Transformations, and Destinations. The graph has directed outgoing edges only from a transformation to its sources and from a destination to its transformation.
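The edge rules above can be sketched with Python's standard `graphlib` module (Python 3.9+), whose `TopologicalSorter` takes a mapping from each node to the nodes it depends on. That matches the edge direction described here. The function `execution_order` and its argument shapes are hypothetical, and node names are assumed to be unique across sources, transformations, and destinations.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def execution_order(sources: Iterable[str],
                    transformations: Dict[str, Iterable[str]],
                    destinations: Dict[str, str]) -> List[str]:
    """Return nodes so that every node appears after the nodes it points to.

    transformations maps a transformation to the sources it reads;
    destinations maps a destination to the transformation it consumes.
    """
    source_set = set(sources)
    graph = {s: set() for s in source_set}  # sources have no outgoing edges
    for name, deps in transformations.items():
        unknown = set(deps) - source_set
        if unknown:
            raise ValueError(f"{name!r} reads unknown sources: {sorted(unknown)}")
        graph[name] = set(deps)  # edges: transformation -> sources
    for name, transformation in destinations.items():
        if transformation not in transformations:
            raise ValueError(f"{name!r} uses unknown transformation {transformation!r}")
        graph[name] = {transformation}  # edge: destination -> transformation
    return list(TopologicalSorter(graph).static_order())


order = execution_order(
    sources=["orders", "users"],
    transformations={"joined": ["orders", "users"]},
    destinations={"report": "joined"},
)
```

Because only these two kinds of edges are admitted, the graph is layered and cannot contain a cycle. The validation steps simply reject edges that fall outside those two kinds.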

  8. Clean Room Service: A service that operates on data from the Source Tables provided by the collaborators in the collaboration package. It extracts session data from where the source tables are located, applies transformations to the data as defined in the package, and sends back the results. The kind of operation/query is defined in the Collaboration Package.

  9. Collaboration Event: A collaboration event is created for every transformation that a collaborator asks to run. It includes the batch of data needed to run the transformation, generated from the parametric transformation used, and returns computation results to the collaborators who have permission to see them.

  10. Orchestrator: The Orchestrator creates a new clean room service session once the collaboration package has been validated successfully. It also generates/modifies the query that needs to run inside the trusted environment, and performs security checks to prevent any malicious query from executing inside that environment.