[BEAM-9964] Move workerCacheMb to a user-visible place #11849

Merged 1 commit on May 29, 2020


steveniemitz
Contributor

#11710 added the plumbing to use the workerCacheMb parameter to size the streaming Dataflow worker state cache. However, the parameter itself is inaccessible from user jobs because it's defined in DataflowWorkerHarnessOptions, which is only exposed in the worker itself.

Trying to set it produces:

$ java -jar myjob.jar  ... --runner=DataflowRunner --workerCacheMb=400
Exception in thread "main" java.lang.IllegalArgumentException: Class interface ...MyOptions missing a property named 'workerCacheMb'.
	at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1625)

This simply moves it to a user-accessible location.
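
To make the failure mode concrete, here is a minimal stdlib-only sketch of a name-based option lookup. It is not Beam's implementation of PipelineOptionsFactory; it only assumes that flags are matched against the getters visible on the options interface. `OptionLookupSketch`, `propertyNames`, `parseFlag`, and this `MyOptions` are hypothetical names used for illustration.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

public class OptionLookupSketch {
  /** Hypothetical user options interface; note that it declares no workerCacheMb. */
  public interface MyOptions {
    String getRunner();
    void setRunner(String value);
  }

  /** Collects bean-style property names from the getters visible on an interface. */
  public static Set<String> propertyNames(Class<?> iface) {
    Set<String> names = new TreeSet<>();
    for (Method m : iface.getMethods()) {
      String n = m.getName();
      if (n.startsWith("get") && n.length() > 3 && m.getParameterCount() == 0) {
        // getWorkerCacheMb -> workerCacheMb
        names.add(Character.toLowerCase(n.charAt(3)) + n.substring(4));
      }
    }
    return names;
  }

  /** Returns the flag's property name, or throws if the interface has no such property. */
  public static String parseFlag(Class<?> iface, String flag) {
    String name = flag.substring(2, flag.indexOf('='));
    if (!propertyNames(iface).contains(name)) {
      throw new IllegalArgumentException(
          "Class " + iface + " missing a property named '" + name + "'.");
    }
    return name;
  }

  public static void main(String[] args) {
    // Accepted: MyOptions exposes getRunner().
    System.out.println(parseFlag(MyOptions.class, "--runner=DataflowRunner"));
    try {
      // Rejected: nothing visible from MyOptions declares getWorkerCacheMb().
      parseFlag(MyOptions.class, "--workerCacheMb=400");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Under that assumption, the flag is rejected simply because no interface known at submission time declares the getter, regardless of whether the worker would understand the value.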

R: @omarismail94 @pabloem


@pabloem
Member

pabloem commented May 28, 2020

gahh so sorry that I missed this. I guess you did have to end up contributing this : )

@steveniemitz
Contributor Author

> gahh so sorry that I missed this. I guess you did have to end up contributing this : )

heh no problem, teamwork! :highfive:

@steveniemitz
Contributor Author

=/ looks like the dataflow precommit succeeded but the API call to update it here failed.

@pabloem
Member

pabloem commented May 29, 2020

Just a question - I am a little confused. How come the DataflowPipelineDebugOptions class is visible, but DataflowWorkerHarnessOptions isn't? If you inherit from it, shouldn't you be able to use it?
It may be that we generally encourage users to rely on DataflowPipelineDebugOptions for their Dataflow pipeline needs? If so, it makes sense to move the option... I'm just confused about what is affecting the visibility of the classes

@steveniemitz
Contributor Author

steveniemitz commented May 29, 2020

If you look at DataflowPipelineOptions, it doesn't include DataflowWorkerHarnessOptions. In fact, it's the other way around: DataflowWorkerHarnessOptions extends DataflowPipelineOptions. The harness options are used in the harness itself, while DataflowPipelineOptions is what the Dataflow runner validates against.

edit: Also, to clarify, users don't (in general) directly implement DataflowPipelineOptions; it's included implicitly when the Dataflow runner is used. One could specifically implement DataflowWorkerHarnessOptions (or even just define the property in any options they have; we actually used to do just that) if they wanted to.
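
The direction of inheritance is easier to see in code. The sketch below uses empty stand-in interfaces rather than the real Beam classes, and it assumes the option ends up on DataflowPipelineDebugOptions, which DataflowPipelineOptions extends. `OptionsHierarchySketch`, `Before`, `After`, `UserOptions`, and `hasGetter` are hypothetical names for illustration only.

```java
import java.util.Arrays;

public class OptionsHierarchySketch {
  /** Stand-ins for the layout before the move: the getter lives on the harness options. */
  public static final class Before {
    public interface PipelineOptions {}
    public interface HarnessOptions extends PipelineOptions {
      Integer getWorkerCacheMb();
    }
    /** The workaround mentioned above: declare the property in your own options. */
    public interface UserOptions extends PipelineOptions {
      Integer getWorkerCacheMb();
    }
  }

  /** Stand-ins for the layout after the move (assumed): the getter lives on the debug options. */
  public static final class After {
    public interface DebugOptions {
      Integer getWorkerCacheMb();
    }
    public interface PipelineOptions extends DebugOptions {}
    public interface HarnessOptions extends PipelineOptions {}
  }

  /** True if the getter is declared on the interface or inherited from a super-interface. */
  public static boolean hasGetter(Class<?> iface, String getterName) {
    return Arrays.stream(iface.getMethods()).anyMatch(m -> m.getName().equals(getterName));
  }

  public static void main(String[] args) {
    String getter = "getWorkerCacheMb";
    // A sub-interface sees its parent's getters, never the reverse.
    System.out.println("before, pipeline: " + hasGetter(Before.PipelineOptions.class, getter));
    System.out.println("before, harness:  " + hasGetter(Before.HarnessOptions.class, getter));
    System.out.println("before, user:     " + hasGetter(Before.UserOptions.class, getter));
    System.out.println("after, pipeline:  " + hasGetter(After.PipelineOptions.class, getter));
    System.out.println("after, harness:   " + hasGetter(After.HarnessOptions.class, getter));
  }
}
```

Declaring the getter on a parent of the pipeline options makes it visible to both the runner and the harness, whereas declaring it on the harness options hides it from everything the runner validates.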

@pabloem
Member

pabloem commented May 29, 2020

gotcha. Makes sense. LGTM.

@pabloem pabloem merged commit 8420eee into apache:master May 29, 2020
@steveniemitz steveniemitz deleted the move-df-cache-option branch May 29, 2020 17:15