Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add namespace filter option in config file #2688

Merged
merged 11 commits into from
Dec 18, 2023

Conversation

yanlibert
Copy link
Contributor

Problem

We are implementing Marquez at scale, and we often end up with the creation of a lot of namespaces containing neither datasets nor jobs.
This can be caused by the handling of symlinks (see #2645) or an issue with the open lineage integrations.
We would like to add the possibilty to filter the list of namespaces returned by the API by providing a regex in the marquez.yml config file

Closes: #2687

Solution

This solution provides an optional namespaceFilter parameter to be used in the marquez.yml config file
Example:

server:
  applicationConnectors:
  - type: http
    port: ${MARQUEZ_PORT:-5000}
    httpCompliance: RFC7230_LEGACY
  adminConnectors:
  - type: http
    port: ${MARQUEZ_ADMIN_PORT:-5001}

db:
  driverClass: org.postgresql.Driver
  url: jdbc:postgresql:https://localhost:5432/marquez
  user: marquez
  password: marquez

migrateOnStartup: true

graphql:
  enabled: true

logging:
  level: INFO
  appenders:
    - type: console

tags:
  - name: PII
    description: Personally identifiable information
  - name: SENSITIVE
    description: Contains sensitive information

namespaceFilter: "foo*"

One-line summary:

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Nov 24, 2023
Copy link

boring-cyborg bot commented Nov 24, 2023

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

Copy link

netlify bot commented Nov 24, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit ea55703
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/657d843500a04900082fa532

Copy link

codecov bot commented Nov 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (429e352) 84.27% compared to head (ea55703) 84.31%.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2688      +/-   ##
============================================
+ Coverage     84.27%   84.31%   +0.03%     
- Complexity     1407     1413       +6     
============================================
  Files           249      251       +2     
  Lines          6386     6402      +16     
  Branches        291      291              
============================================
+ Hits           5382     5398      +16     
  Misses          851      851              
  Partials        153      153              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wslulciuc
Copy link
Member

@yanlibert, thanks for extending our marquez.yml to support filtering metadata based off a pattern! We'll also want to have a fix for #2645 (the root cause), but think supporting a config-based why to exclude metadata make sense.

I'd suggest we define a exclude option (in place of namespaceFilter):

exclude:
  namespaces: [<pattern0>, <pattern1>,...]

which can then define the environment variable EXCLUDE_NAMESPACES to override patterns for namespaces:

exclude:
  namespaces: ${EXCLUDE_NAMESPACES}

Then, as a follow up PR, we can expand filtering to exclude datasets and jobs:

exclude:
  namespaces: [<pattern0>, <pattern1>,...]
  datasets: [<pattern0>, <pattern1>,...]
  jobs: [<pattern0>, <pattern1>,...]

@wslulciuc wslulciuc added this to the 0.44.0 milestone Nov 29, 2023
@wslulciuc
Copy link
Member

wslulciuc commented Nov 29, 2023

@yanlibert as a follow up to my comment above, we may also want to consider how we might avoid storing metadata as well (therefore, avoid polluting metadata all together). That is, if we provide an exclude option on read, it would also make sense to provide an exclude on write. So something like:

exclude:
  namespaces:
    onRead: boolean
    onWrite: boolean
    patterns: [<pattern0>, <pattern1>,...]

You would be able to control filtering of read/write using exclude.namespaces.onRead and exclude.namespaces.onWrite. I know this is now going beyond your initial work / proposal, but I think I'd be great to establish a simple configuration around exclusion that can be expanded. Thoughts?

Note, to keep the scope limited, we'll only want to focus on exclude.namespaces.onRead.

@yanlibert
Copy link
Contributor Author

I think this is a great addition to the feature. In fact, we are already filtering some events before sending them to Marquez and it would be great if we could do that directly with Marquez.
I'm going to have a try at your suggestion and propose a first iteration of a solution

Signed-off-by: Yan <[email protected]>
Signed-off-by: yanlibert <[email protected]>
@yanlibert
Copy link
Contributor Author

yanlibert commented Dec 7, 2023

@wslulciuc I followed your recommendations and refactored the way of getting the filter from the config file. I created a separate filter Class, so that it opens the way of a finer control and more options for filtering. I also covered this new behavior with a test case.
Let me know if it's ok, I'd be happy to improve on this if needed.

@YLibert
Copy link

YLibert commented Dec 11, 2023

@wslulciuc I increased the test coverage, please let me know if this suits your requirements.

Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to have more clarification on the usage. For example: how to pass multiple filter patterns

public static class NamespaceExclusion {
@Getter @JsonProperty public boolean onRead;
@Getter @JsonProperty public boolean onWrite;
@Getter @JsonProperty public String patterns;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a single pattern or a list of patterns separated in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pawel-big-lebowski yes, sorry about that, I forgot to include the proper documentation. I added a section to the FAQ to explain the usage, I'd be happy to move it to another place that is more suitable if needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 2 options:

  1. Define patterns as an array of patterns (i.e. Set<String> patterns)
  2. or, simply define patterns as a string with as REGEXP (which in that case, we can use String pattern)

I think option 1 might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml:

exclude:
  namespaces:
     onRead: boolean
     onWrite: boolean
     patterns: [<pattern0>, <pattern1>,...]

while option 2 would be:

exclude:
  namespaces:
     onRead: boolean
     onWrite: boolean
     pattern: "<pattern0>|<pattern1>|..."

I think either option works and perhaps we just rename patterns to pattern for clarity and go with option 2 as it's the simpler option (both for configuration and implementation).

public static class NamespaceExclusion {
@Getter @JsonProperty public boolean onRead;
@Getter @JsonProperty public boolean onWrite;
@Getter @JsonProperty public String patterns;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 2 options:

  1. Define patterns as an array of patterns (i.e. Set<String> patterns)
  2. or, simply define patterns as a string with as REGEXP (which in that case, we can use String pattern)

I think option 1 might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml:

exclude:
  namespaces:
     onRead: boolean
     onWrite: boolean
     patterns: [<pattern0>, <pattern1>,...]

while option 2 would be:

exclude:
  namespaces:
     onRead: boolean
     onWrite: boolean
     pattern: "<pattern0>|<pattern1>|..."

I think either option works and perhaps we just rename patterns to pattern for clarity and go with option 2 as it's the simpler option (both for configuration and implementation).

api/src/main/java/marquez/db/NamespaceDao.java Outdated Show resolved Hide resolved
@@ -74,8 +75,16 @@ public Response get(@PathParam("namespace") NamespaceName name) {
public Response list(
@QueryParam("limit") @DefaultValue("100") @Min(value = 0) int limit,
@QueryParam("offset") @DefaultValue("0") @Min(value = 0) int offset) {
final List<Namespace> namespaces = namespaceService.findAll(limit, offset);
return Response.ok(new Namespaces(namespaces)).build();
final String namespaceFilter = ExclusionsFilter.getNamespacesReadFilter();
Copy link
Member

@wslulciuc wslulciuc Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! To clean up the code, I suggest (FYI, these are only suggestions and feel free to modify as needed):

  1. Define class Exclusions with a cleaner interface defined in pkg marquez.exclusions, or marquez.common (see below) in place of ExclusionsFilter
  2. Use lambdas to call NamespaceService.findAll() or NamespaceService.findAllWith()

Exclusions.java

package marquez.common;

import com.google.common.collect.ClassToInstanceMap;
import com.google.common.collect.MutableClassToInstanceMap;
import javax.annotation.Nullable;
import lombok.NonNull;

public final class Exclusions {
  private Exclusions() {}

  private static final ClassToInstanceMap<Object> EXCLUSIONS = MutableClassToInstanceMap.create();
 

  public static void use(@NonNull ExclusionsConfig config) {
    // Build exclusions map using config.
    EXCLUSIONS.put(NamespaceExclusion.class, new NamespaceExclusion(...));
  }

  public static NamespaceExclusion namespaces() {
    return EXCLUSIONS.getInstance(NamespaceExclusion.class);
  }
 
  // To avoid duplicating the pattern (onRead, onWrite) we can also have a 
  // ON_READ_EXCLUSIONS and ON_WRITE_EXCLUSIONS maps;
  // therefore the contract would just be NamespaceExclusion(@NonNull String pattern);
  // but, 'Exclusions.namespaces().onRead()' is more concise.
  public record NamespaceExclusion(@Nullable String onRead, @Nullable String onWrite) {}
}

Then, you can chain your calls:

NamespaceResource.list()

final List<Namespace> namespaces =
  Optional.ofNullable(Exclusions.namespaces().onRead())
    .map(exclusion -> namespaceService.findAllWithExclusion(exclusion, limit, offset))
    .orElseGet(() -> namespaceService.findAll(limit, offset));

return Response.ok(new Namespaces(namespaces)).build();

Copy link
Member

@wslulciuc wslulciuc Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanlibert I'm wondering, as we work through the Exclusions code, would it make sense to just have:

exclude:
  namespaces:
    onRead:
      enabled: boolean                      # (default: true)
      pattern: "<pattern0>|<pattern1>|..."
    onWrite: "<pattern0>|<pattern1>|..."
      enabled: boolean                      # (default: true)
      pattern: "<pattern0>|<pattern1>|..."

where onRead and onWrite can be different (or equal). This would give greater control and flexibility over what to exclude 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a great suggestion, I refactored the code accordingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wslulciuc I also think option 2 where pattern is a string of regex is simpler, and more straightforward to implement as I'm not a java dev 😄
I increased the coverage and refactored the code as suggested by implementing a cleaner Exclusions interface.

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This looks great to me and really like how simple it is to define exclusions 💯 💯 💯

@wslulciuc wslulciuc enabled auto-merge (squash) December 18, 2023 19:15
@wslulciuc wslulciuc merged commit 83608bb into MarquezProject:main Dec 18, 2023
15 checks passed
Copy link

boring-cyborg bot commented Dec 18, 2023

Great job! Congrats on your first merged pull request in the Marquez project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

possibility to filter Namespaces
4 participants