Add namespace filter option in config file #2688

yanlibert · 2023-11-24T12:13:39Z

Problem

We are implementing Marquez at scale, and we often end up with the creation of a lot of namespaces containing neither datasets nor jobs.
This can be caused by the handling of symlinks (see #2645) or an issue with the open lineage integrations.
We would like to add the possibilty to filter the list of namespaces returned by the API by providing a regex in the marquez.yml config file

Closes: #2687

Solution

This solution provides an optional namespaceFilter parameter to be used in the marquez.yml config file
Example:

server:
  applicationConnectors:
  - type: http
    port: ${MARQUEZ_PORT:-5000}
    httpCompliance: RFC7230_LEGACY
  adminConnectors:
  - type: http
    port: ${MARQUEZ_ADMIN_PORT:-5001}

db:
  driverClass: org.postgresql.Driver
  url: jdbc:postgresql:https://localhost:5432/marquez
  user: marquez
  password: marquez

migrateOnStartup: true

graphql:
  enabled: true

logging:
  level: INFO
  appenders:
    - type: console

tags:
  - name: PII
    description: Personally identifiable information
  - name: SENSITIVE
    description: Contains sensitive information

namespaceFilter: "foo*"

One-line summary:

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

boring-cyborg · 2023-11-24T12:13:43Z

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

netlify · 2023-11-24T12:13:45Z

✅ Deploy Preview for peppy-sprite-186812 canceled.

Name	Link
🔨 Latest commit	`ea55703`
🔍 Latest deploy log	https://app.netlify.com/sites/peppy-sprite-186812/deploys/657d843500a04900082fa532

codecov · 2023-11-24T12:23:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (429e352) 84.27% compared to head (ea55703) 84.31%.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2688      +/-   ##
============================================
+ Coverage     84.27%   84.31%   +0.03%     
- Complexity     1407     1413       +6     
============================================
  Files           249      251       +2     
  Lines          6386     6402      +16     
  Branches        291      291              
============================================
+ Hits           5382     5398      +16     
  Misses          851      851              
  Partials        153      153

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wslulciuc · 2023-11-29T00:58:07Z

@yanlibert, thanks for extending our marquez.yml to support filtering metadata based off a pattern! We'll also want to have a fix for #2645 (the root cause), but think supporting a config-based why to exclude metadata make sense.

I'd suggest we define a exclude option (in place of namespaceFilter):

exclude:
  namespaces: [<pattern0>, <pattern1>,...]

which can then define the environment variable EXCLUDE_NAMESPACES to override patterns for namespaces:

exclude:
  namespaces: ${EXCLUDE_NAMESPACES}

Then, as a follow up PR, we can expand filtering to exclude datasets and jobs:

exclude:
  namespaces: [<pattern0>, <pattern1>,...]
  datasets: [<pattern0>, <pattern1>,...]
  jobs: [<pattern0>, <pattern1>,...]

wslulciuc · 2023-11-29T19:30:46Z

@yanlibert as a follow up to my comment above, we may also want to consider how we might avoid storing metadata as well (therefore, avoid polluting metadata all together). That is, if we provide an exclude option on read, it would also make sense to provide an exclude on write. So something like:

exclude:
  namespaces:
    onRead: boolean
    onWrite: boolean
    patterns: [<pattern0>, <pattern1>,...]

You would be able to control filtering of read/write using exclude.namespaces.onRead and exclude.namespaces.onWrite. I know this is now going beyond your initial work / proposal, but I think I'd be great to establish a simple configuration around exclusion that can be expanded. Thoughts?

Note, to keep the scope limited, we'll only want to focus on exclude.namespaces.onRead.

yanlibert · 2023-11-30T09:10:17Z

I think this is a great addition to the feature. In fact, we are already filtering some events before sending them to Marquez and it would be great if we could do that directly with Marquez.
I'm going to have a try at your suggestion and propose a first iteration of a solution

Signed-off-by: Yan <[email protected]> Signed-off-by: yanlibert <[email protected]>

yanlibert · 2023-12-07T22:23:29Z

@wslulciuc I followed your recommendations and refactored the way of getting the filter from the config file. I created a separate filter Class, so that it opens the way of a finer control and more options for filtering. I also covered this new behavior with a test case.
Let me know if it's ok, I'd be happy to improve on this if needed.

Signed-off-by: yanlibert <[email protected]>

YLibert · 2023-12-11T09:36:20Z

@wslulciuc I increased the test coverage, please let me know if this suits your requirements.

pawel-big-lebowski

Would love to have more clarification on the usage. For example: how to pass multiple filter patterns

pawel-big-lebowski · 2023-12-11T11:52:08Z

api/src/main/java/marquez/api/filter/exclusions/ExclusionsConfig.java

+ public static class NamespaceExclusion {
+ @Getter @JsonProperty public boolean onRead;
+ @Getter @JsonProperty public boolean onWrite;
+ @Getter @JsonProperty public String patterns;


Is this a single pattern or a list of patterns separated in some way?

@pawel-big-lebowski yes, sorry about that, I forgot to include the proper documentation. I added a section to the FAQ to explain the usage, I'd be happy to move it to another place that is more suitable if needed

We have 2 options:

Define patterns as an array of patterns (i.e. Set<String> patterns)

or, simply define patterns as a string with as REGEXP (which in that case, we can use String pattern)

I think option 1 might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml:

exclude: namespaces: onRead: boolean onWrite: boolean patterns: [<pattern0>, <pattern1>,...]

while option 2 would be:

exclude: namespaces: onRead: boolean onWrite: boolean pattern: "<pattern0>|<pattern1>|..."

I think either option works and perhaps we just rename patterns to pattern for clarity and go with option 2 as it's the simpler option (both for configuration and implementation).

Signed-off-by: yanlibert <[email protected]>

wslulciuc · 2023-12-12T10:32:17Z

api/src/main/java/marquez/api/filter/exclusions/ExclusionsConfig.java

+ public static class NamespaceExclusion {
+ @Getter @JsonProperty public boolean onRead;
+ @Getter @JsonProperty public boolean onWrite;
+ @Getter @JsonProperty public String patterns;


We have 2 options:

Define patterns as an array of patterns (i.e. Set<String> patterns)

or, simply define patterns as a string with as REGEXP (which in that case, we can use String pattern)

I think option 1 might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml:

exclude: namespaces: onRead: boolean onWrite: boolean patterns: [<pattern0>, <pattern1>,...]

while option 2 would be:

exclude: namespaces: onRead: boolean onWrite: boolean pattern: "<pattern0>|<pattern1>|..."

I think either option works and perhaps we just rename patterns to pattern for clarity and go with option 2 as it's the simpler option (both for configuration and implementation).

api/src/main/java/marquez/db/NamespaceDao.java

wslulciuc · 2023-12-12T11:30:12Z

api/src/main/java/marquez/api/NamespaceResource.java

@@ -74,8 +75,16 @@ public Response get(@PathParam("namespace") NamespaceName name) {
 public Response list(
 @QueryParam("limit") @DefaultValue("100") @Min(value = 0) int limit,
 @QueryParam("offset") @DefaultValue("0") @Min(value = 0) int offset) {
- final List<Namespace> namespaces = namespaceService.findAll(limit, offset);
- return Response.ok(new Namespaces(namespaces)).build();
+ final String namespaceFilter = ExclusionsFilter.getNamespacesReadFilter();


Great work! To clean up the code, I suggest (FYI, these are only suggestions and feel free to modify as needed):

Define class Exclusions with a cleaner interface defined in pkg marquez.exclusions, or marquez.common (see below) in place of ExclusionsFilter

Use lambdas to call NamespaceService.findAll() or NamespaceService.findAllWith()

Exclusions.java

package marquez.common; import com.google.common.collect.ClassToInstanceMap; import com.google.common.collect.MutableClassToInstanceMap; import javax.annotation.Nullable; import lombok.NonNull; public final class Exclusions { private Exclusions() {} private static final ClassToInstanceMap<Object> EXCLUSIONS = MutableClassToInstanceMap.create(); public static void use(@NonNull ExclusionsConfig config) { // Build exclusions map using config. EXCLUSIONS.put(NamespaceExclusion.class, new NamespaceExclusion(...)); } public static NamespaceExclusion namespaces() { return EXCLUSIONS.getInstance(NamespaceExclusion.class); } // To avoid duplicating the pattern (onRead, onWrite) we can also have a // ON_READ_EXCLUSIONS and ON_WRITE_EXCLUSIONS maps; // therefore the contract would just be NamespaceExclusion(@NonNull String pattern); // but, 'Exclusions.namespaces().onRead()' is more concise. public record NamespaceExclusion(@Nullable String onRead, @Nullable String onWrite) {} }

Then, you can chain your calls:

NamespaceResource.list()

final List<Namespace> namespaces = Optional.ofNullable(Exclusions.namespaces().onRead()) .map(exclusion -> namespaceService.findAllWithExclusion(exclusion, limit, offset)) .orElseGet(() -> namespaceService.findAll(limit, offset)); return Response.ok(new Namespaces(namespaces)).build();

@yanlibert I'm wondering, as we work through the Exclusions code, would it make sense to just have:

exclude: namespaces: onRead: enabled: boolean # (default: true) pattern: "<pattern0>|<pattern1>|..." onWrite: "<pattern0>|<pattern1>|..." enabled: boolean # (default: true) pattern: "<pattern0>|<pattern1>|..."

where onRead and onWrite can be different (or equal). This would give greater control and flexibility over what to exclude 😅

I think it's a great suggestion, I refactored the code accordingly

@wslulciuc I also think option 2 where pattern is a string of regex is simpler, and more straightforward to implement as I'm not a java dev 😄
I increased the coverage and refactored the code as suggested by implementing a cleaner Exclusions interface.

Signed-off-by: yanlibert <[email protected]>

wslulciuc

Nice! This looks great to me and really like how simple it is to define exclusions 💯 💯 💯

boring-cyborg · 2023-12-18T19:15:33Z

Great job! Congrats on your first merged pull request in the Marquez project!

boring-cyborg bot added the api API layer changes label Nov 24, 2023

yanlibert force-pushed the main branch from 8a462b6 to 9aec0ef Compare November 24, 2023 15:40

wslulciuc added this to the 0.44.0 milestone Nov 29, 2023

yanlibert added 2 commits December 7, 2023 23:09

feat: add namespace filter option in config

2d7e24e

Signed-off-by: Yan <[email protected]> Signed-off-by: yanlibert <[email protected]>

doc: better comment

c7c3607

Signed-off-by: Yan <[email protected]> Signed-off-by: yanlibert <[email protected]>

yanlibert force-pushed the main branch from 2479bf2 to 72af69a Compare December 7, 2023 22:11

boring-cyborg bot added client/java docs spec web labels Dec 7, 2023

yanlibert force-pushed the main branch from 70c2b0f to 81e64ba Compare December 8, 2023 06:55

feat: implement excludes for namespaces

4d8b91d

Signed-off-by: yanlibert <[email protected]>

yanlibert force-pushed the main branch from 81e64ba to 4d8b91d Compare December 8, 2023 07:42

yanlibert and others added 3 commits December 8, 2023 08:48

Merge branch 'MarquezProject:main' into main

5da8b07

feat: relocate test to proper location

09f0418

Signed-off-by: yanlibert <[email protected]>

feat: increased test coverage

6294c8e

Signed-off-by: yanlibert <[email protected]>

pawel-big-lebowski reviewed Dec 11, 2023

View reviewed changes

doc: add FAQ entry for filtering namespaces

4a6c80c

Signed-off-by: yanlibert <[email protected]>

wslulciuc reviewed Dec 12, 2023

View reviewed changes

yanlibert and others added 3 commits December 12, 2023 20:29

Merge branch 'MarquezProject:main' into main

beab297

refacto: cleaner Exclusion interface and better yml usage

c5e445a

Signed-off-by: yanlibert <[email protected]>

feat: add Exclusion interface

bec782e

Signed-off-by: yanlibert <[email protected]>

Merge branch 'MarquezProject:main' into main

ea55703

wslulciuc approved these changes Dec 18, 2023

View reviewed changes

wslulciuc enabled auto-merge (squash) December 18, 2023 19:15

wslulciuc disabled auto-merge December 18, 2023 19:15

wslulciuc merged commit 83608bb into MarquezProject:main Dec 18, 2023
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add namespace filter option in config file #2688

Add namespace filter option in config file #2688

yanlibert commented Nov 24, 2023

boring-cyborg bot commented Nov 24, 2023

netlify bot commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading

wslulciuc commented Nov 29, 2023

wslulciuc commented Nov 29, 2023 •

edited

Loading

yanlibert commented Nov 30, 2023

yanlibert commented Dec 7, 2023 •

edited

Loading

YLibert commented Dec 11, 2023

pawel-big-lebowski left a comment •

edited

Loading

pawel-big-lebowski Dec 11, 2023

yanlibert Dec 11, 2023

wslulciuc Dec 12, 2023

wslulciuc Dec 12, 2023

wslulciuc Dec 12, 2023 •

edited

Loading

wslulciuc Dec 12, 2023 •

edited

Loading

yanlibert Dec 16, 2023

yanlibert Dec 16, 2023

wslulciuc left a comment

boring-cyborg bot commented Dec 18, 2023

Add namespace filter option in config file #2688

Add namespace filter option in config file #2688

Conversation

yanlibert commented Nov 24, 2023

Problem

Solution

Checklist

boring-cyborg bot commented Nov 24, 2023

netlify bot commented Nov 24, 2023 • edited Loading

✅ Deploy Preview for peppy-sprite-186812 canceled.

codecov bot commented Nov 24, 2023 • edited Loading

Codecov Report

wslulciuc commented Nov 29, 2023

wslulciuc commented Nov 29, 2023 • edited Loading

yanlibert commented Nov 30, 2023

yanlibert commented Dec 7, 2023 • edited Loading

YLibert commented Dec 11, 2023

pawel-big-lebowski left a comment • edited Loading

Choose a reason for hiding this comment

pawel-big-lebowski Dec 11, 2023

Choose a reason for hiding this comment

yanlibert Dec 11, 2023

Choose a reason for hiding this comment

wslulciuc Dec 12, 2023

Choose a reason for hiding this comment

wslulciuc Dec 12, 2023

Choose a reason for hiding this comment

wslulciuc Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

wslulciuc Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

yanlibert Dec 16, 2023

Choose a reason for hiding this comment

yanlibert Dec 16, 2023

Choose a reason for hiding this comment

wslulciuc left a comment

Choose a reason for hiding this comment

boring-cyborg bot commented Dec 18, 2023

netlify bot commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading

wslulciuc commented Nov 29, 2023 •

edited

Loading

yanlibert commented Dec 7, 2023 •

edited

Loading

pawel-big-lebowski left a comment •

edited

Loading

wslulciuc Dec 12, 2023 •

edited

Loading

wslulciuc Dec 12, 2023 •

edited

Loading