[Beam 6677] Flink portability cluster #7848

lgajowy · 2019-02-15T12:03:34Z

Provides a Dataproc setup of Flink cluster with support for Portability Framework

Please see comments in the scripts for running instructions.

@angoenka @iemejia @pabloem Could you take a look?

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang	Apex	Dataflow	Flink	Gearpump	Samza	Spark
Go	---	---	---	---	---	---
Java
Python	---			---	---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

lgajowy

Self-reviewing because I am concerned about the licenses of some files (as in review comments).

The official docker and flink init-actions perfectly fit our needs. The sole purpose why I "forked" those files to beam repo is that it's best for us to have them in the repository and upload every time to GCS buckets (this is what create_flink_cluster.sh does). This way we have full control over what we are actually running.

An alternative would be to fetch those init-actions from a public gcs bucket provided by Dataproc team (gs:https://dataproc-init-actions). The docker and flink init actions are there. I didn't do this because:

we won't be able to modify the scripts
if a script changes on the public bucket, we will be affected by this change. this potentially breakes something in the future

If for some reason what I did is forbidden due to license terms we should go with the alternative approach (official bucket).

See: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
The dataproc repository: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Init actions bucket: https://console.cloud.google.com/storage/browser/dataproc-initialization-actions

@iemejia @angoenka wdyt?

lgajowy · 2019-02-15T12:04:14Z

.test-infra/dataproc/init-actions/docker.sh

@@ -0,0 +1,79 @@
+#!/usr/bin/env bash


This file was copied from: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh and was unchanged.

Is it ok to copy this like that? What license should I put there?

It looks like this project is licensed under the Apache license, so it should be fine to just add it. I'll add it for you now to get the precommits running (even though I guess they won't exercise this code..)

Ok. Thanks!

lgajowy · 2019-02-15T12:07:31Z

.test-infra/dataproc/init-actions/flink.sh

@@ -0,0 +1,202 @@
+#!/bin/bash
+#
+# Licensed under the Apache License, Version 2.0 (the "License");


This file was copied from https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/flink/flink.sh

Is it ok to copy the file this way? What license should I put there? Should I change it?

Apache license looks like it's okay.

.test-infra/dataproc/create_flink_cluster.sh

lgajowy · 2019-02-15T12:23:00Z

.test-infra/dataproc/init-actions/beam.sh

@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more


I know that Beam init action exists in Dataproc repo too, but IMO it didn't fit our needs. I provided a version that is very different. It's got a Beam license header. Is this ok?

That seems fine to me.

pabloem

Thanks. I've added an Apache license for the file missing it. I've added a couple questions regarding documentation.

pabloem · 2019-02-15T19:51:25Z

.test-infra/dataproc/create_flink_cluster.sh

+# $4: SDK Harness' images to pull on dataproc workers (python,java,go)
+# $5: Url to Flink .tar archive to be installed on the cluster
+# $6: Number of Flink workers
+# $7: Number of Flink slots


Flink slots per worker? (sorry, I don't know about these things..)

ok - it's a good idea to clarify this better. Will do. 👍

According to my knowledge, task slots are a way of telling how many tasks a worker accepts. Thanks to this a Task Manager (located in a worker) can evenly divide resources per slot providing some sort of isolation between them.

pabloem · 2019-02-15T19:52:00Z

.test-infra/dataproc/init-actions/flink.sh

@@ -0,0 +1,202 @@
+#!/bin/bash
+#
+# Licensed under the Apache License, Version 2.0 (the "License");


Apache license looks like it's okay.

pabloem · 2019-02-15T19:52:36Z

.test-infra/dataproc/init-actions/docker.sh

+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#


Could you specify in the header what is the origin of this file, and a small summary of what it does please?

pabloem · 2019-02-15T19:52:56Z

.test-infra/dataproc/init-actions/beam.sh

@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more


That seems fine to me.

.test-infra/dataproc/create_flink_cluster.sh

pabloem · 2019-02-15T19:55:39Z

Besides the comments, if you've tested this, it LGTM.

cluster

lgajowy · 2019-02-19T12:14:27Z

Thanks! I applied your suggestions and fixed the branch history a bit. Could you take a look again?

angoenka · 2019-02-22T00:05:57Z

.test-infra/dataproc/create_flink_cluster.sh

+DOCKER_INIT="$GCS_BUCKET/$INIT_ACTIONS_FOLDER_NAME/docker.sh"
+
+# Portability options
+HARNESS_IMAGES_REPOSITORY_URL=$3


I would suggest taking the arguments as environment variable instead of numbered arguments to avoid ordering related issue and ease of adding parameters in the future.

Good point - changed.

angoenka · 2019-02-22T01:01:33Z

.test-infra/dataproc/init-actions/beam.sh

+ # Pull beam images with `sudo -i` since if pulling from GCR, yarn will be
+ # configured with GCR authorization
+ if ${pull_python_image} ; then
+ sudo -u yarn -i docker pull "${image_repo}/python:${beam_image_version}"


Lets also convey the information that the name of the images in the repo should have specific names.
However, I would suggest having the complete name of the image instead of repo and type separately and then constructing the image url.
Type in it self is not very used in this case.

I changed it to accept a list of urls to be then downloaded by beam init action - is this ok?

angoenka · 2019-02-22T01:03:51Z

.test-infra/dataproc/init-actions/docker.sh

+# https://cloud.google.com/dataproc/concepts/dataproc-versions
+#
+# This file originated from:
+# https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh


Lets mention the git version from which we copied this file.
Otherwise we can just use the direct link to the dataproc init action in the create cluster script and download it automatically and use it to avoid code copy.

ditto for flink init action.

ok, mentioned in both files. The reason I copied the files is that I think it's better to be immune to any hypothetical changes in case scripts from the official buckets change and have the possibility to change the script easily (eg. configure flink/docker/beam as we need).

angoenka · 2019-02-22T01:05:10Z

.test-infra/dataproc/init-actions/beam.sh

+ pull_images
+}
+
+main "$@"


nit: Empty line at the end of file.

angoenka · 2019-02-22T01:08:27Z

.test-infra/dataproc/init-actions/flink.sh

+function main() {
+ local role="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
+ local snapshot_url="$(/usr/share/google/get_metadata_value \
+ "attributes/${FLINK_INSTALL_SNAPSHOT_METADATA_KEY}" \


Is this populated by dataproc.

Shall we also control the version of flink so that we do not get hit by a flink version change on the dataproc side?

Yes - it's populated in the create_flink_cluster.sh in line 128: metadata+="flink-snapshot-url=${FLINK_DOWNLOAD_URL}

We have control over flink version. The way it works is that it downloads and installs the snapshot from the url with a version provided in the url.

example url: https://archive.apache.org/dist/flink/flink-1.7.0/flink-1.7.0-bin-hadoop28-scala_2.12.tgz

…harness' urls to beam init-action

lgajowy · 2019-03-01T13:07:39Z

@angoenka I applied your suggestions. Could you take a look again?

lgajowy · 2019-03-06T14:58:44Z

@angoenka pinging this :)

angoenka

Thanks @lgajowy

lgajowy · 2019-03-12T10:11:54Z

Thanks! Merging

lgajowy requested review from angoenka and iemejia February 15, 2019 12:03

lgajowy commented Feb 15, 2019

View reviewed changes

pabloem reviewed Feb 15, 2019

View reviewed changes

lgajowy added 3 commits February 19, 2019 13:05

[BEAM-6677] Provide necessary Docker and Flink Dataproc init actions

ef48cd1

[BEAM-6677] Create script that will pull SDK Harness images to Dataproc

46e4522

cluster

[BEAM-6677] Add script to easily setup Flink Dataproc cluster

8d066cb

lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from afc33c3 to b6fe449 Compare February 19, 2019 12:12

[BEAM-6677] Adding apache license to file.

6b8a5f9

lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from b6fe449 to 278d9ec Compare February 19, 2019 12:13

angoenka reviewed Feb 22, 2019

View reviewed changes

lgajowy added 2 commits March 1, 2019 14:03

[BEAM-6677] Fix documentation

a4777a9

[BEAM-6677] Use env variables instead of numbered args. Pass list of …

6f9a635

…harness' urls to beam init-action

lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from 790d3b9 to 6f9a635 Compare March 1, 2019 13:06

angoenka approved these changes Mar 11, 2019

View reviewed changes

lgajowy merged commit b1ed061 into apache:master Mar 12, 2019

lgajowy deleted the BEAM-6677_flink_portability_cluster branch March 12, 2019 10:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Beam 6677] Flink portability cluster #7848

[Beam 6677] Flink portability cluster #7848

lgajowy commented Feb 15, 2019 •

edited

Loading

lgajowy left a comment

lgajowy Feb 15, 2019

pabloem Feb 15, 2019

lgajowy Feb 19, 2019

lgajowy Feb 15, 2019

pabloem Feb 15, 2019

lgajowy Feb 15, 2019 •

edited

Loading

pabloem Feb 15, 2019

pabloem left a comment

pabloem Feb 15, 2019

lgajowy Feb 19, 2019

pabloem Feb 15, 2019

pabloem Feb 15, 2019

lgajowy Feb 19, 2019

pabloem Feb 15, 2019

pabloem commented Feb 15, 2019

lgajowy commented Feb 19, 2019

angoenka Feb 22, 2019

lgajowy Mar 1, 2019

angoenka Feb 22, 2019

lgajowy Mar 1, 2019

angoenka Feb 22, 2019

angoenka Feb 22, 2019

lgajowy Mar 1, 2019

angoenka Feb 22, 2019

lgajowy Mar 1, 2019

angoenka Feb 22, 2019

lgajowy Mar 1, 2019

lgajowy commented Mar 1, 2019

lgajowy commented Mar 6, 2019

angoenka left a comment

lgajowy commented Mar 12, 2019

		@@ -0,0 +1,77 @@
		#!/usr/bin/env bash

		# Licensed to the Apache Software Foundation (ASF) under one or more

[Beam 6677] Flink portability cluster #7848

[Beam 6677] Flink portability cluster #7848

Conversation

lgajowy commented Feb 15, 2019 • edited Loading

Post-Commit Tests Status (on master branch)

lgajowy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgajowy Feb 15, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pabloem left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pabloem commented Feb 15, 2019

lgajowy commented Feb 19, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgajowy commented Mar 1, 2019

lgajowy commented Mar 6, 2019

angoenka left a comment

Choose a reason for hiding this comment

lgajowy commented Mar 12, 2019

lgajowy commented Feb 15, 2019 •

edited

Loading

lgajowy Feb 15, 2019 •

edited

Loading