Add metadata cmd #2091

Merged
wslulciuc merged 11 commits into main from feature/metadata-cmd
Aug 26, 2022


wslulciuc
Member

@wslulciuc wslulciuc commented Aug 26, 2022

Problem

There's currently no good way to performance-test the Marquez data model with large OpenLineage (OL) events (see #2076).

Solution

Add a metadata cmd to generate OpenLineage events. Generated events are saved to a file (metadata.json by default) that can be used to seed Marquez via the seed cmd (sweet, right!?):

$ java -jar marquez-api.jar metadata --help
usage: java -jar marquez-api.jar
       metadata [--runs RUNS] [--bytes-per-event BYTES-PER-EVENT] [-o OUTPUT] [-h]

generate random metadata using the OpenLineage standard

named arguments:
  --runs RUNS            limits OL runs up to N (default: 25)
  --bytes-per-event BYTES-PER-EVENT
                         size (in bytes) per OL event (default: 33404)
  -o OUTPUT, --output OUTPUT
                         the output metadata file (default: metadata.json)
  -h, --help             show this help message and exit
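
As a rough sizing aid, the defaults above can be combined to estimate how large the generated file may get. The sketch below is not part of this PR: the function name is hypothetical, and it assumes each run emits two events (for example a START and a COMPLETE), which this description does not state. Since --runs is an upper limit, treat the result as an upper bound.

```python
def estimate_output_bytes(runs=25, bytes_per_event=33404, events_per_run=2):
    """Rough upper bound on the size of the generated metadata file.

    Defaults for runs and bytes_per_event mirror the help text above.
    events_per_run=2 is an ASSUMPTION, not something the PR states.
    JSON array brackets and separators are ignored.
    """
    if runs < 0 or bytes_per_event < 0 or events_per_run < 0:
        raise ValueError("arguments must be non-negative")
    return runs * events_per_run * bytes_per_event


if __name__ == "__main__":
    total = estimate_output_bytes()
    # With the defaults: 25 * 2 * 33404 = 1670200
    print(f"{total} bytes (~{total / 1_000_000:.2f} MB)")
```

Under that assumption the default invocation stays below 2 MB, so a meaningful load test would likely need a much larger --runs or --bytes-per-event value.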

When seeding Marquez with generated events, we can now observe query performance via pghero. Running:

$ ./docker/up.sh --build

starts the marquez-api, marquez-web, marquez-db, and now pghero containers. Query stats aren't enabled by default; you'll need to enable query profiling manually via the UI by browsing to http://localhost:8080:

Screen Shot 2022-08-26 at 12 08 59 AM

Limitations of metadata cmd

As follow-up work, we'll want to:

  • Expose an option to set the number of inputs and outputs per event with --inputs-per-event / --outputs-per-event
  • Expose an option for input / output schemas to have very large field names and descriptions (or just randomize the field name length and description length given some range 5...N)
  • Link upstream and downstream jobs (randomly). Currently all jobs have unique I/O datasets; therefore, a lineage graph consists only of a single job node and its I/O datasets (not all that interesting!)
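
One possible approach to the last item, sketched under stated assumptions: let each generated job read from datasets written by earlier jobs, so the jobs form a connected, acyclic lineage graph instead of isolated nodes. Everything here (the function name, the dataset and job naming, the max_inputs parameter) is hypothetical and is not taken from MetadataCommand.

```python
import random


def link_jobs(num_jobs, max_inputs=2, seed=None):
    """Return {job_name: {"inputs": [...], "outputs": [...]}}.

    Each job writes one new dataset and reads between 1 and max_inputs
    datasets produced by EARLIER jobs, so no cycles can occur.
    The first job has no upstream and reads a placeholder source dataset.
    """
    rng = random.Random(seed)  # seedable for reproducible test data
    jobs = {}
    produced = []  # datasets written by jobs generated so far
    for i in range(num_jobs):
        output = f"dataset_{i}"
        if produced:
            count = rng.randint(1, min(max_inputs, len(produced)))
            inputs = rng.sample(produced, count)
        else:
            inputs = [f"source_{i}"]
        jobs[f"job_{i}"] = {"inputs": inputs, "outputs": [output]}
        produced.append(output)
    return jobs


if __name__ == "__main__":
    for name, io in link_jobs(5, seed=1).items():
        print(name, io["inputs"], "->", io["outputs"])
```

Sampling only from earlier outputs keeps the graph acyclic by construction; a real implementation would also need to decide how deep and how wide the graph should get.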

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: wslulciuc <[email protected]>
@wslulciuc wslulciuc added the review Ready for review label Aug 26, 2022
@codecov

codecov bot commented Aug 26, 2022

Codecov Report

Merging #2091 (ddf8630) into main (07ba426) will decrease coverage by 2.09%.
The diff coverage is 8.21%.

❗ Current head ddf8630 differs from pull request most recent head f6d0536. Consider uploading reports for the commit f6d0536 to get more accurate results

@@             Coverage Diff              @@
##               main    #2091      +/-   ##
============================================
- Coverage     77.04%   74.94%   -2.10%     
- Complexity     1013     1017       +4     
============================================
  Files           201      202       +1     
  Lines          4643     4789     +146     
  Branches        389      393       +4     
============================================
+ Hits           3577     3589      +12     
- Misses          628      762     +134     
  Partials        438      438              
Impacted Files                                        Coverage Δ
api/src/main/java/marquez/cli/MetadataCommand.java    7.58% <7.58%> (ø)
api/src/main/java/marquez/MarquezApp.java             65.33% <100.00%> (+0.46%) ⬆️


@wslulciuc wslulciuc enabled auto-merge (squash) August 26, 2022 08:07
@wslulciuc wslulciuc removed the review Ready for review label Aug 26, 2022
@wslulciuc wslulciuc merged commit cf44452 into main Aug 26, 2022
@wslulciuc wslulciuc deleted the feature/metadata-cmd branch August 26, 2022 22:15