Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Beam 6677] Flink portability cluster #7848

Merged
merged 6 commits into from
Mar 12, 2019

Conversation

lgajowy
Copy link
Contributor

@lgajowy lgajowy commented Feb 15, 2019

Provides a Dataproc setup of Flink cluster with support for Portability Framework

Please see comments in the scripts for running instructions.

@angoenka @iemejia @pabloem Could you take a look?


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

Copy link
Contributor Author

@lgajowy lgajowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reviewing because I am concerned about the licenses of some files (as in review comments).

The official docker and flink init-actions perfectly fit our needs. The sole purpose why I "forked" those files to beam repo is that it's best for us to have them in the repository and upload every time to GCS buckets (this is what create_flink_cluster.sh does). This way we have full control over what we are actually running.

An alternative would be to fetch those init-actions from a public gcs bucket provided by Dataproc team (gs:https://dataproc-init-actions). The docker and flink init actions are there. I didn't do this because:

  • we won't be able to modify the scripts
  • if a script changes on the public bucket, we will be affected by this change. this potentially breakes something in the future

If for some reason what I did is forbidden due to license terms we should go with the alternative approach (official bucket).

See: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
The dataproc repository: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Init actions bucket: https://console.cloud.google.com/storage/browser/dataproc-initialization-actions

@iemejia @angoenka wdyt?

@@ -0,0 +1,79 @@
#!/usr/bin/env bash
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was copied from: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh and was unchanged.

Is it ok to copy this like that? What license should I put there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this project is licensed under the Apache license, so it should be fine to just add it. I'll add it for you now to get the precommits running (even though I guess they won't exercise this code..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks!

@@ -0,0 +1,202 @@
#!/bin/bash
#
# Licensed under the Apache License, Version 2.0 (the "License");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was copied from https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/flink/flink.sh

Is it ok to copy the file this way? What license should I put there? Should I change it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache license looks like it's okay.

.test-infra/dataproc/create_flink_cluster.sh Show resolved Hide resolved
@@ -0,0 +1,77 @@
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
Copy link
Contributor Author

@lgajowy lgajowy Feb 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that Beam init action exists in Dataproc repo too, but IMO it didn't fit our needs. I provided a version that is very different. It's got a Beam license header. Is this ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine to me.

Copy link
Member

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've added an Apache license for the file missing it. I've added a couple questions regarding documentation.

# $4: SDK Harness' images to pull on dataproc workers (python,java,go)
# $5: Url to Flink .tar archive to be installed on the cluster
# $6: Number of Flink workers
# $7: Number of Flink slots
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flink slots per worker? (sorry, I don't know about these things..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - it's a good idea to clarify this better. Will do. 👍

According to my knowledge, task slots are a way of telling how many tasks a worker accepts. Thanks to this a Task Manager (located in a worker) can evenly divide resources per slot providing some sort of isolation between them.

@@ -0,0 +1,202 @@
#!/bin/bash
#
# Licensed under the Apache License, Version 2.0 (the "License");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache license looks like it's okay.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify in the header what is the origin of this file, and a small summary of what it does please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -0,0 +1,77 @@
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine to me.

.test-infra/dataproc/create_flink_cluster.sh Show resolved Hide resolved
@pabloem
Copy link
Member

pabloem commented Feb 15, 2019

Besides the comments, if you've tested this, it LGTM.

@lgajowy lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from afc33c3 to b6fe449 Compare February 19, 2019 12:12
@lgajowy lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from b6fe449 to 278d9ec Compare February 19, 2019 12:13
@lgajowy
Copy link
Contributor Author

lgajowy commented Feb 19, 2019

Thanks! I applied your suggestions and fixed the branch history a bit. Could you take a look again?

DOCKER_INIT="$GCS_BUCKET/$INIT_ACTIONS_FOLDER_NAME/docker.sh"

# Portability options
HARNESS_IMAGES_REPOSITORY_URL=$3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest taking the arguments as environment variable instead of numbered arguments to avoid ordering related issue and ease of adding parameters in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - changed.

# Pull beam images with `sudo -i` since if pulling from GCR, yarn will be
# configured with GCR authorization
if ${pull_python_image} ; then
sudo -u yarn -i docker pull "${image_repo}/python:${beam_image_version}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also convey the information that the name of the images in the repo should have specific names.
However, I would suggest having the complete name of the image instead of repo and type separately and then constructing the image url.
Type in it self is not very used in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to accept a list of urls to be then downloaded by beam init action - is this ok?

# https://cloud.google.com/dataproc/concepts/dataproc-versions
#
# This file originated from:
# https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets mention the git version from which we copied this file.
Otherwise we can just use the direct link to the dataproc init action in the create cluster script and download it automatically and use it to avoid code copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for flink init action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, mentioned in both files. The reason I copied the files is that I think it's better to be immune to any hypothetical changes in case scripts from the official buckets change and have the possibility to change the script easily (eg. configure flink/docker/beam as we need).

pull_images
}

main "$@"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Empty line at the end of file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

function main() {
local role="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
local snapshot_url="$(/usr/share/google/get_metadata_value \
"attributes/${FLINK_INSTALL_SNAPSHOT_METADATA_KEY}" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this populated by dataproc.

Shall we also control the version of flink so that we do not get hit by a flink version change on the dataproc side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - it's populated in the create_flink_cluster.sh in line 128: metadata+="flink-snapshot-url=${FLINK_DOWNLOAD_URL}

We have control over flink version. The way it works is that it downloads and installs the snapshot from the url with a version provided in the url.

example url: https://archive.apache.org/dist/flink/flink-1.7.0/flink-1.7.0-bin-hadoop28-scala_2.12.tgz

@lgajowy lgajowy force-pushed the BEAM-6677_flink_portability_cluster branch from 790d3b9 to 6f9a635 Compare March 1, 2019 13:06
@lgajowy
Copy link
Contributor Author

lgajowy commented Mar 1, 2019

@angoenka I applied your suggestions. Could you take a look again?

@lgajowy
Copy link
Contributor Author

lgajowy commented Mar 6, 2019

@angoenka pinging this :)

Copy link
Contributor

@angoenka angoenka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lgajowy

@lgajowy
Copy link
Contributor Author

lgajowy commented Mar 12, 2019

Thanks! Merging

@lgajowy lgajowy merged commit b1ed061 into apache:master Mar 12, 2019
@lgajowy lgajowy deleted the BEAM-6677_flink_portability_cluster branch March 12, 2019 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants