Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-14881] [s3|kinesis] Add support for IAM Roles for Service Accounts on AWS EKS #12008

Closed
wants to merge 4 commits into from

Conversation

aroch
Copy link
Contributor

@aroch aroch commented May 6, 2020

What is the purpose of the change

IAM Roles for Service Accounts have many advantages when deploying Flink on AWS EKS.
From AWS documentation:

With IAM roles for service accounts on Amazon EKS clusters, you can associate an IAM role with a Kubernetes service account. This service account can then provide AWS permissions to the containers in any pod that uses that service account. With this feature, you no longer need to provide extended permissions to the worker node IAM role so that pods on that node can call AWS APIs.

In order to support for IAM Roles for Service Accounts we need to add support for a relatively new authentication method in AWS SDK called WebIdentityToken.

This pull request adds support for WebIdentityToken authentication method when using AWS services. This includes S3 and Kinesis.

Brief change log

S3

  • Bumping AWS SDK dependency to 1.11.754
  • Bumping httpcomponents:httpclient dependency to 4.5.9
  • Added aws-java-sdk-sts dependency

Kinesis

  • Bumping AWS SDK dependency to 1.11.754
  • Added CredentialProvider type for WebIdentityToken

Verifying this change

It is tricky to verify this change by end-to-end test as Kinesalite doesn't support authentication methods. Minio does support it but requires extra containers to be launched to do the actual token verification that will complicate and lengthen the s3 tests.

Added configuration unit tests in AWSUtilTest

As this change is mainly a dependency change, I verified it manually on an actual EKS cluster.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no)
  • The runtime per-record code paths (performance sensitive): (yes / no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Currently I could not find the deployment on EKS documented anywhere. If needed i'm willing to add docs if that would be helpful.

@flinkbot
Copy link
Collaborator

flinkbot commented May 6, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 444df93 (Wed May 06 13:48:27 UTC 2020)

Warnings:

  • 4 pom.xml files were touched: Check for build and licensing issues.
  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented May 6, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@aroch
Copy link
Contributor Author

aroch commented May 7, 2020

@flinkbot run azure

@rmetzger
Copy link
Contributor

rmetzger commented May 8, 2020

Thanks a lot for your contribution. I manually re-triggered the failed job on Azure: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=721&view=results

@rmetzger
Copy link
Contributor

rmetzger commented May 9, 2020

CI has passed ✅

@aroch
Copy link
Contributor Author

aroch commented May 9, 2020

Thanks @rmetzger!

@rmetzger
Copy link
Contributor

@tweise (kinesis) @pnowojski (s3): What's your take on this change?
I'm tempted to merge it without additional testing. @aroch has tested the change, and it is unlikely to break anything.

@aroch
Copy link
Contributor Author

aroch commented May 11, 2020

It would be great if someone could take a second look.
I verified on EKS with the Lyft k8s operator.

@rmetzger
Copy link
Contributor

Okay, looks like nobody has time for this. I will test it myself, because it would be a bummer if this change didn't make it into 1.11.0

public static final String AWS_ROLE_ARN = roleArn(AWS_CREDENTIALS_PROVIDER);

/** The role session name to use when credential provider type is set to ASSUME_ROLE. */
/** The role session name to use when credential provider type is set to ASSUME_ROLE or WEB_IDENTITY_TOKEN. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document this also in docs/dev/connectors/kinesis.md?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean in general that this authentication method exists, not this particular line of code :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw there was no dedicated section for setting up access to Kinesis. I thought it makes sense to add a new section with some details.
I'm not a technical writer, so please let me know what you think :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. It reads pretty well!

@tweise
Copy link
Contributor

tweise commented May 13, 2020

@rmetzger I don't have time to test this change, but I also think it should be merged before 1.11.0 so that any hidden issues can be caught as part of the release candidate testing. As we have just discovered with the kinesis producer in 1.10, benign dependency changes can have unforeseen consequences, so putting this through a thorough benchmark would be in order before release.

@rmetzger
Copy link
Contributor

@aroch Thanks a lot for updating the docs! I'm still working on this.
I'm struggling my way through learning EKS, building custom Flink docker images etc. :)
Right now, I'm facing org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.. From my initial research, I assume that my alpine-based docker image (build from flink-container/docker won't work with the Kinesis binaries).
How are you building your 1.11-SNAPSHOT docker containers?

@rmetzger
Copy link
Contributor

Actually, I managed to get a debian based docker image. I still have the mentioned exception though.

Copy link
Contributor

@rmetzger rmetzger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a "doesn't break things" point of view, I tested Flink on EKS with a Kinesis job, checkpointing to S3, and it properly got the right access to these services in the EKS environment. No objections to merge from that point of view.

I have found two minor license documentation issues. Once these are resolved, I will merge the PR.
Let me know if you don't have time to resolve these, then I can fix & merge.

- com.amazonaws:aws-java-sdk-dynamodb:1.11.754
- com.amazonaws:aws-java-sdk-kms:1.11.754
- com.amazonaws:aws-java-sdk-s3:1.11.754
- com.amazonaws:jmespath-java:1.11.754
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The com.amazonaws:aws-java-sdk-sts:1.11.754 dependency is missing here?

- com.amazonaws:aws-java-sdk-core:1.11.754
- com.amazonaws:aws-java-sdk-dynamodb:1.11.754
- com.amazonaws:aws-java-sdk-kms:1.11.754
- com.amazonaws:aws-java-sdk-s3:1.11.754
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The com.amazonaws:aws-java-sdk-sts:1.11.754 dependency is missing here?

Copy link
Contributor

@rmetzger rmetzger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. Merging

rmetzger pushed a commit to rmetzger/flink that referenced this pull request May 14, 2020
@rmetzger rmetzger closed this in c1ea6fc May 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants