
[FLINK-12541][container][python] Add support for Python jobs in build script #8609

Closed
wants to merge 3 commits

Conversation

@dianfu (Contributor) commented Jun 4, 2019

What is the purpose of the change

This pull request adds support for building a job-specific Docker image for Python Table API jobs.

Brief change log

  • Improves the build script to support building a Python job-specific Docker image

Verifying this change

This change has been verified manually, as sketched below.
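
For illustration, a minimal sketch of such a manual check, assuming the build script lives under flink-container/docker as in the existing setup; the --from-local-dist and --image-name flags and the job-cluster container argument are assumptions, and my_job.py is a made-up artifact:

# Build a Python-job-specific image from a locally built Flink distribution
# (only --job-artifacts is confirmed by this PR; the other flags are assumed).
cd flink-container/docker
./build.sh \
  --from-local-dist \
  --job-artifacts /path/to/my_job.py \
  --image-name flink-python-job
# Start a job cluster container from the freshly built image.
docker run flink-python-job job-cluster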

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot (Collaborator) commented Jun 4, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❗ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@dianfu dianfu force-pushed the FLINK-12541-docker branch 3 times, most recently from bfefeb7 to 900f728, on June 4, 2019 at 08:56
@sunjincheng121 (Member) commented:

@flinkbot attention @tillrohrmann

@tillrohrmann (Contributor) left a comment

Thanks for opening this PR @dianfu. I think this PR does not properly separate concerns. It mixes the concern of Python programs with the StandaloneJobClusterEntrypoint, StandaloneJobClusterConfiguration and the ClassPathJobGraphRetriever. I suspect that there is a better separation of concerns to be had by delegating the responsibility of parsing Python-specific job options to the PythonDriver. Otherwise I fear that we are going to add special-case logic for every language binding which we might support in the future.

I would also suggest adding appropriate test cases for your changes. Moreover, your first commit contains unrelated changes to the WordCount example which are attributed to FLINK-12541. Please revert them.

this(jobId, savepointRestoreSettings, programArguments, jobClassName, JarsOnClassPath.INSTANCE);
@Nullable String jobEntryPointName,
@Nullable String jobPythonArtifacts) {
this(jobId, savepointRestoreSettings, programArguments, jobEntryPointName, jobPythonArtifacts, JarsOnClassPath.INSTANCE);
@tillrohrmann commented on this change:

Why do we need to touch the ClassPathJobGraphRetriever at all? This class should not need to know about Python. Otherwise we need to add special casing for all supported languages in the future. This does not seem right.

throw new FlinkException("Could not load the provided entrypoint class.", e);
} catch (ClassNotFoundException e) {
if (!isPythonProgram && jobPythonArtifacts != null) {
return createPackagedProgram("org.apache.flink.python.client.PythonDriver", true);
@tillrohrmann commented on this change:

This is overly complicated and error-prone. Why not start the cluster entrypoint with o.a.f.python.client.PythonDriver as the jobClassName?
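
To make the suggestion concrete, a hypothetical launch of a standalone job cluster with PythonDriver passed as an ordinary job class; standalone-job.sh and --job-classname exist in Flink's distribution, while the trailing PythonDriver arguments (pym, pyfs) are assumptions about its command line:

# Hypothetical: no Python special-casing in the entrypoint; PythonDriver
# parses the Python-specific options from the program arguments itself.
standalone-job.sh start-foreground \
  --job-classname org.apache.flink.python.client.PythonDriver \
  pym my_job pyfs /opt/artifacts/my_job.py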

args[2] = "pyfs";
args[3] = jobPythonArtifacts;
System.arraycopy(programArguments, 0, args, 4, programArguments.length);
return args;
@tillrohrmann commented on this change:

Why do we need to do this command line argument magic? Just because PythonDriver expects the arguments to be passed in a special order? Why not make the PythonDriver more flexible?
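
For context, a rough sketch of what the argument rewriting above produces; args[0] and args[1] are not shown in the excerpt, and the concrete values are hypothetical:

# Given programArguments = [--input, in.txt] and jobPythonArtifacts = /opt/job.py,
# the rewritten argument array looks roughly like:
#   [<entrypoint option>, <entrypoint name>, pyfs, /opt/job.py, --input, in.txt]
# i.e. the Python-specific options are prepended in the fixed order that
# PythonDriver expects.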

}

@Nullable
String getJobPythonArtifacts() {
@tillrohrmann commented on this change:

Why is this needed?

.desc("Entrypoint name of the job to run.")
.build();

private static final Option JOB_PYTHON_ARTIFACTS_OPTION = Option.builder("ja")
@tillrohrmann commented on this change:

Why does this need to be handled by the StandaloneJobClusterConfiguration instead of being parsed by the PythonDriver program?

private final String jobEntryPointName;

@Nullable
private final String jobPythonArtifacts;
@tillrohrmann commented on this change:

I think it is not a good idea to add special case logic into the StandaloneJobClusterEntrypoint. Why do you think this is needed?

@tillrohrmann (Contributor) commented:

Why does this PR have the same Flink issue assigned as #8532? Every PR should have its own JIRA issue, as stated in the contribution guidelines.

@sunjincheng121 (Member) left a comment

Thanks for the PR @dianfu!
The suggestion from @tillrohrmann makes sense to me: we should give this PR its own JIRA.
Furthermore, the improvement of pyfs can also go into a new PR (with a new JIRA).
BTW: please rebase the code, and I will have another review :)
What do you think?

@dianfu (Contributor, Author) commented Jun 10, 2019

@tillrohrmann Thanks a lot for your review.
Your suggestion makes much sense to me. I have created a dedicated JIRA FLINK-12788 for this PR. Regarding the changes to the StandaloneJobClusterEntrypoint, I agree that there should not be special logic for Python. I will revert that part of the changes.

@sunjincheng121 Thanks a lot for your review. I have created a separate JIRA FLINK-12787 for the PythonDriver improvements, so that we can focus this PR on the build script changes for Python job support.

Will update the PR later today.

@dianfu (Contributor, Author) commented Jun 11, 2019

@tillrohrmann @sunjincheng121 I have updated the PR; the changes now only concern the build script support for Python jobs. Looking forward to your feedback.

@sunjincheng121 (Member) commented Jun 11, 2019

Thanks for the update @dianfu!

Currently, the PR looks pretty clean from my side. The only point I am not entirely sure about is the build command option --opt-jars. With this option the opt jars do not need to be uploaded separately, but without it the user can simply build a fat jar that includes them. So this change is an improvement (not a necessary one), and we may need an opinion from @tillrohrmann!
Otherwise the PR LGTM. I'll merge it when @tillrohrmann says ok!

Best,
Jincheng
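
For context, a sketch of the two alternatives being weighed; --opt-jars was the option proposed in this PR at the time, and the jar paths are hypothetical:

# Alternative 1: let the build script copy selected opt/ jars into the image.
./build.sh --job-artifacts /path/to/my_job.py --opt-jars opt/flink-table_2.11.jar
# Alternative 2: drop --opt-jars and bundle those dependencies into a fat
# job jar before building the image.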

@sunjincheng121 (Member) commented:

I think we can remove --opt-jars, because @StephanEwen has already brought up a discussion about putting the Table API jars into lib by default. Details can be found here: https://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Putting-Table-API-jars-in-lib-by-default-td29492.html
What do you think?

@dianfu (Contributor, Author) commented Jun 12, 2019

@sunjincheng121 Makes sense to me, as --opt-jars is not a must-have option. Considering that the table jar will be put into lib by default in the future, removing this option is fine with me, since the table jar was the motivation for it in the first place. I have updated the PR. Looking forward to your response.

@tillrohrmann (Contributor) left a comment

The changes look good now @dianfu. I had two final comments which we should address before merging this PR.

@@ -29,7 +29,7 @@ In non HA mode, you should first start the job cluster service:

In order to deploy the job cluster entrypoint run:

`FLINK_IMAGE_NAME=<IMAGE_NAME> FLINK_JOB=<JOB_NAME> FLINK_JOB_PARALLELISM=<PARALLELISM> envsubst < job-cluster-job.yaml.template | kubectl create -f -`
@tillrohrmann commented on this change:

Why did we remove FLINK_JOB here?

@dianfu (Author) replied:

I found that FLINK_JOB had already been removed in commit 753e0c6, so I just corrected the documentation here. What do you think?

@tillrohrmann replied:

Good catch. Now this makes sense.

@@ -13,26 +13,28 @@ Install the most recent stable version of [Docker](https://docs.docker.com/insta
Images are based on the official Java Alpine (OpenJDK 8) image.

Before building the image, one needs to build the user code jars for the job.
-Assume that the job jar is stored under `<PATH_TO_JOB_JAR>`
+Assume that the job jar is stored under `<COMMA_SEPARATED_PATH_TO_JOB_ARTIFACTS>`
@tillrohrmann commented on this change:

Seems not to fit. I guess the sentence needs to be changed to something like: "A Flink job can consist of multiple jars. In order to specify the required jars, they need to be passed to `--job-artifacts` of the build script. The individual paths are comma separated."
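
For illustration, a hypothetical invocation matching the suggested wording, with made-up comma-separated artifact paths (--image-name is an assumption):

./build.sh \
  --job-artifacts /path/to/my_job.jar,/path/to/udf-lib.jar \
  --image-name flink-job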

@dianfu (Author) replied:

+1. Only one concern: what about changing "jars" to "artifacts"?

@dianfu (Contributor, Author) commented Jun 12, 2019

@tillrohrmann Thanks a lot for your comments. I have updated the PR and addressed one of your concerns. Regarding FLINK_JOB: that option had already been removed, but the documentation had not been updated, so I just corrected the documentation here. Feel free to let me know if you think that change doesn't make sense or you would prefer to address it in another PR.

@tillrohrmann (Contributor) left a comment

Thanks for addressing my comments @dianfu. LGTM. The failing test seems to be unrelated. Merging now.


@knaufk (Contributor) commented Jun 19, 2019

FYI: This change broke `test_docker_embedded_job.sh` and `test_kubernetes_embedded_job.sh`, as they were still using the `--job-jar` argument. I have included a [hotfix] in #8741.
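
For context, the shape of that hotfix in the affected test scripts; the variable name is hypothetical, while the rename from --job-jar to --job-artifacts is the one introduced by this PR:

# Before (broken after this PR):
#   ./build.sh --job-jar "${FLINK_JOB_JAR}" ...
# After (hotfix):
./build.sh --job-artifacts "${FLINK_JOB_JAR}" ...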

@dianfu (Contributor, Author) commented Jun 19, 2019

@knaufk Thanks a lot for the hotfix. The fix makes sense to me.

sjwiesman pushed a commit to sjwiesman/flink that referenced this pull request Jun 26, 2019