[Beam 6677] Flink portability cluster #7848
Conversation
Self-reviewing because I am concerned about the licenses of some files (as in review comments).
The official docker and flink init-actions fit our needs perfectly. The sole reason I "forked" those files into the Beam repo is that it's best for us to keep them in the repository and upload them to GCS buckets on every run (this is what create_flink_cluster.sh does). This way we have full control over what we are actually running.
An alternative would be to fetch those init-actions from the public GCS bucket provided by the Dataproc team (gs://dataproc-init-actions). The docker and flink init actions are there. I didn't do this because:
- we won't be able to modify the scripts
- if a script changes on the public bucket, we will be affected by that change, which could potentially break something in the future
If for some reason what I did is forbidden by the license terms, we should go with the alternative approach (the official bucket).
See: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
The dataproc repository: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Init actions bucket: https://console.cloud.google.com/storage/browser/dataproc-initialization-actions
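For concreteness, the upload-then-create flow described above can be sketched roughly as follows. The bucket name and folder are placeholders, and the gsutil/gcloud invocations are abbreviated comments, not the exact commands of create_flink_cluster.sh:

```shell
#!/usr/bin/env bash
# Placeholder bucket; create_flink_cluster.sh uses its own configuration.
GCS_BUCKET="gs://my-beam-testing-bucket"
INIT_ACTIONS_FOLDER="${GCS_BUCKET}/init-actions"

# Upload our forked copies so we control exactly what runs:
#   gsutil cp docker.sh flink.sh beam.sh "${INIT_ACTIONS_FOLDER}/"
# Then create the cluster referencing those copies instead of the public bucket:
#   gcloud dataproc clusters create flink-cluster \
#     --initialization-actions "${INIT_ACTIONS_FOLDER}/docker.sh,${INIT_ACTIONS_FOLDER}/flink.sh"

echo "init actions will be fetched from: ${INIT_ACTIONS_FOLDER}"
```

This keeps the scripts that actually run under the project's control rather than depending on the contents of the public bucket.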
@@ -0,0 +1,79 @@
#!/usr/bin/env bash
This file was copied unchanged from: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh
Is it OK to copy it like this? What license should I put on it?
It looks like that project is licensed under the Apache license, so it should be fine to just add it. I'll add it for you now to get the precommits running (even though I guess they won't exercise this code…).
Ok. Thanks!
@@ -0,0 +1,202 @@
#!/bin/bash
#
# Licensed under the Apache License, Version 2.0 (the "License");
This file was copied from: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/flink/flink.sh
Is it OK to copy the file this way? What license should I put on it? Should I change it?
Apache license looks like it's okay.
@@ -0,0 +1,77 @@
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
I know that a Beam init action exists in the Dataproc repo too, but IMO it didn't fit our needs. I provided a version that is very different; it has a Beam license header. Is this OK?
That seems fine to me.
Thanks. I've added an Apache license to the file that was missing it. I've also added a couple of questions regarding documentation.
# $4: SDK Harness' images to pull on dataproc workers (python,java,go)
# $5: Url to Flink .tar archive to be installed on the cluster
# $6: Number of Flink workers
# $7: Number of Flink slots
Flink slots per worker? (Sorry, I don't know about these things…)
OK, it's a good idea to clarify this. Will do. 👍
As far as I know, task slots determine how many tasks a worker accepts. This lets a TaskManager (running on a worker) divide its resources evenly across slots, providing a degree of isolation between them.
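A minimal sketch of the relationship (variable names are illustrative, not the actual script's): the cluster-wide parallelism is the number of workers times the slots per worker, where slots per worker corresponds to Flink's taskmanager.numberOfTaskSlots setting.

```shell
#!/usr/bin/env bash
# Illustrative variable names, not the real script's.
FLINK_NUM_WORKERS=2          # number of TaskManagers (workers)
FLINK_SLOTS_PER_WORKER=4     # taskmanager.numberOfTaskSlots per TaskManager

# Each slot can run one parallel task, so the maximum cluster-wide
# parallelism the cluster can schedule at once is:
MAX_PARALLELISM=$((FLINK_NUM_WORKERS * FLINK_SLOTS_PER_WORKER))
echo "max parallelism: ${MAX_PARALLELISM}"   # prints 8
```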
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
Could you specify in the header the origin of this file, and add a short summary of what it does, please?
ok
Besides the comments, if you've tested this, it LGTM.
(force-pushed from afc33c3 to b6fe449, then from b6fe449 to 278d9ec)
Thanks! I applied your suggestions and fixed the branch history a bit. Could you take a look again?
DOCKER_INIT="$GCS_BUCKET/$INIT_ACTIONS_FOLDER_NAME/docker.sh"

# Portability options
HARNESS_IMAGES_REPOSITORY_URL=$3
I would suggest taking the arguments as environment variables instead of numbered arguments, to avoid ordering-related issues and to make adding parameters easier in the future.
Good point - changed.
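A small sketch of the env-var style this suggestion refers to (variable names are hypothetical): required settings fail fast when missing, optional ones fall back to defaults, and callers no longer depend on argument order.

```shell
#!/usr/bin/env bash
# Hypothetical variable names for illustration only.
# A default is applied here just so the demo runs standalone; a truly
# required variable would normally have no fallback.
GCS_BUCKET="${GCS_BUCKET:-gs://example-bucket}"
: "${GCS_BUCKET:?GCS_BUCKET must be set, e.g. gs://my-bucket}"

# Optional parameter with a sensible default:
FLINK_NUM_WORKERS="${FLINK_NUM_WORKERS:-2}"

echo "bucket=${GCS_BUCKET} workers=${FLINK_NUM_WORKERS}"
```

Callers would then invoke the script as, e.g., `GCS_BUCKET=gs://my-bucket ./create_flink_cluster.sh`, and new parameters can be added later without breaking existing invocations.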
# Pull beam images with `sudo -i` since if pulling from GCR, yarn will be
# configured with GCR authorization
if ${pull_python_image} ; then
  sudo -u yarn -i docker pull "${image_repo}/python:${beam_image_version}"
Let's also convey that the images in the repo should have specific names. However, I would suggest taking the complete image name rather than passing the repo and type separately and then constructing the image URL; the type by itself is not very useful here.
I changed it to accept a list of URLs that are then pulled by the Beam init action. Is this OK?
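The agreed-on approach might look roughly like this (the image URLs and variable names are hypothetical): the caller passes full image URLs as a comma-separated list, and the init action pulls each one without constructing names itself.

```shell
#!/usr/bin/env bash
# Hypothetical comma-separated list of full image URLs.
BEAM_IMAGE_URLS="gcr.io/example/python:latest,gcr.io/example/java:latest"

# Split the list on commas and pull each image.
IFS=',' read -r -a images <<< "${BEAM_IMAGE_URLS}"
for image in "${images[@]}"; do
  # The real init action would run: sudo -u yarn -i docker pull "${image}"
  echo "would pull: ${image}"
done
```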
# https://cloud.google.com/dataproc/concepts/dataproc-versions
#
# This file originated from:
# https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/docker/docker.sh
Let's mention the git revision from which we copied this file. Alternatively, we could just use a direct link to the Dataproc init action in the create-cluster script and download it automatically, to avoid copying code.
Ditto for the Flink init action.
OK, mentioned in both files. The reason I copied the files is that I think it's better to be immune to hypothetical changes to the scripts in the official bucket, and to be able to modify the scripts easily (e.g. to configure Flink/Docker/Beam as we need).
  pull_images
}

main "$@"
nit: Empty line at the end of file.
ok
function main() {
  local role="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
  local snapshot_url="$(/usr/share/google/get_metadata_value \
    "attributes/${FLINK_INSTALL_SNAPSHOT_METADATA_KEY}" \
Is this populated by Dataproc?
Shall we also pin the Flink version so that we don't get hit by a Flink version change on the Dataproc side?
Yes - it's populated in create_flink_cluster.sh at line 128: metadata+="flink-snapshot-url=${FLINK_DOWNLOAD_URL}"
We do have control over the Flink version: the init action downloads and installs the snapshot from the given URL, and the version is encoded in that URL.
Example URL: https://archive.apache.org/dist/flink/flink-1.7.0/flink-1.7.0-bin-hadoop28-scala_2.12.tgz
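To make the version pinning concrete, the flow described above can be sketched as follows. The gcloud invocation is abbreviated and shown only as a comment; the metadata key and URL come from the discussion above.

```shell
#!/usr/bin/env bash
# The exact Flink archive URL pins the version the cluster installs.
FLINK_DOWNLOAD_URL="https://archive.apache.org/dist/flink/flink-1.7.0/flink-1.7.0-bin-hadoop28-scala_2.12.tgz"

# create_flink_cluster.sh builds the metadata entry like this:
metadata="flink-snapshot-url=${FLINK_DOWNLOAD_URL}"
echo "metadata: ${metadata}"

# The cluster is then created with (abbreviated):
#   gcloud dataproc clusters create ... --metadata "${metadata}"
# and on each node the init action reads the value back with:
#   /usr/share/google/get_metadata_value attributes/flink-snapshot-url
```

Because the URL is supplied by our script rather than chosen by Dataproc, a Flink version change on the Dataproc side cannot silently change what gets installed.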
(force-pushed from 790d3b9 to 6f9a635: "…harness' urls to beam init-action")
@angoenka I applied your suggestions. Could you take a look again?
@angoenka pinging this :)
Thanks @lgajowy
Thanks! Merging
Provides a Dataproc setup of a Flink cluster with support for the Portability Framework.
Please see the comments in the scripts for running instructions.
@angoenka @iemejia @pabloem Could you take a look?
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
Post-Commit Tests Status (on master branch): see .test-infra/jenkins/README for the trigger phrase, status and link of all Jenkins jobs.