
[SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line. #34120

Conversation

@xkrogen (Contributor) commented Sep 27, 2021

What changes were proposed in this pull request?

Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior, where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. Within `yarn.Client`, environment variables in the configured paths are resolved before constructing the classpath.
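To make the shape of this change concrete, here is a minimal, self-contained Scala sketch of the override pattern described above. The class names and config keys are simplified stand-ins for illustration only; they are not the actual Spark classes or the exact keys backing `APP_JAR` and `SECONDARY_JARS`.

```scala
import java.io.File
import java.net.URL

// Base backend: by default there is no user classpath, matching the behavior of
// non-YARN resource managers described above.
class ExecutorBackendSketch(conf: Map[String, String]) {
  def getUserClassPath: Seq[URL] = Nil
}

// YARN backend: rebuild the user classpath from configuration values instead of
// receiving one --user-class-path argument per JAR on the command line.
class YarnExecutorBackendSketch(conf: Map[String, String])
    extends ExecutorBackendSketch(conf) {
  override def getUserClassPath: Seq[URL] = {
    val appJar = conf.get("sketch.yarn.app.jar").toSeq
    val secondaryJars = conf.get("sketch.yarn.secondary.jars")
      .toSeq.flatMap(_.split(",")).filter(_.nonEmpty)
    (appJar ++ secondaryJars).map(p => new File(p).toURI.toURL)
  }
}
```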

Please note that this is a re-submission of #32810, which was reverted in #34082 due to the issues described in this comment. This PR additionally includes the changes described in #34084 to resolve that issue, but unlike #34084 it properly handles escape strings.

Why are the changes needed?

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor, which then creates a classloader out of the URLs. Currently, in the case of YARN, this list of JARs is crafted by the driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can produce extremely long argument lists when there are many JARs, which can exceed the OS argument-length limit, typically manifesting as the error message:

/bin/bash: Argument list too long

A Google search indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list via configuration instead resolves the issue.
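As a rough illustration of the failure mode (hypothetical names, not the actual `ExecutorRunnable` code), the sketch below shows how a per-JAR `--user-class-path` flag makes the launched command grow linearly with the number of JARs, whereas a single comma-separated config value never lands on the bash command line at all.

```scala
object ArgLengthSketch {
  def main(args: Array[String]): Unit = {
    val jars = (1 to 2000).map(i => s"/path/to/dependency-$i.jar")

    // Old approach: one "--user-class-path <jar>" pair per JAR on the launch command.
    val commandLine = jars.flatMap(j => Seq("--user-class-path", j)).mkString(" ")
    println(s"command-line length: ${commandLine.length} characters")

    // New approach: the same list travels as one configuration value, so the
    // launched command stays short regardless of how many JARs there are.
    val configValue = jars.mkString(",")
    println(s"config value length: ${configValue.length} characters (not passed via bash)")
  }
}
```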

Does this PR introduce any user-facing change?

There is one small behavioral change, which is a bug fix. Previously the `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath` options were only applied to executors, meaning they would not work for the driver when running in cluster mode. This appears to be a bug; the documentation for this functionality does not mention that it applies only to executors. This PR fixes that issue.
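For context, the sketch below illustrates the documented gateway-path substitution these options provide: a path prefix that is valid on the gateway host is rewritten so cluster nodes can expand their own value. This is an illustration of the documented behavior under assumed example paths, not the Spark implementation.

```scala
object GatewayPathSketch {
  // Rewrite the gateway-side prefix with the cluster-side replacement, mirroring
  // what spark.yarn.config.gatewayPath and spark.yarn.config.replacementPath describe.
  def substitute(path: String, gatewayPath: String, replacementPath: String): String =
    path.replace(gatewayPath, replacementPath)

  def main(args: Array[String]): Unit = {
    // e.g. spark.yarn.config.gatewayPath=/usr/local/hadoop
    //      spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}
    val onGateway = "/usr/local/hadoop/lib/native"
    println(substitute(onGateway, "/usr/local/hadoop", "{{HADOOP_COMMON_HOME}}"))
    // prints {{HADOOP_COMMON_HOME}}/lib/native, which YARN expands on each node
  }
}
```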

Additionally, this fixes the main bash argument-length issue, allowing larger JAR lists to be passed successfully. Configuration of JARs is identical to before, and substitution of environment variables in `spark.jars` or `spark.yarn.config.replacementPath` works as expected.

How was this patch tested?

New unit tests were added in YarnClusterSuite. Also, we have been running a similar fix internally for 4 months with great success.

@xkrogen (Contributor, Author) commented Sep 27, 2021

@SparkQA commented Sep 27, 2021

Test build #143658 has finished for PR 34120 at commit 77b4fc1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48171/

@SparkQA commented Sep 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48173/

@SparkQA commented Sep 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48173/

@SparkQA commented Sep 28, 2021

Test build #143660 has finished for PR 34120 at commit 708a377.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48237/

@SparkQA commented Sep 29, 2021

Test build #143728 has finished for PR 34120 at commit dce813c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48239/

@SparkQA commented Sep 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48237/

@SparkQA commented Sep 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48239/

@SparkQA commented Sep 29, 2021

Test build #143726 has finished for PR 34120 at commit ddf4549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

Overall the changes look fine. At this point I think we should hold off on merging until the 3.2 release goes out, assuming you want to try to get this into 3.2.1.

@xkrogen (Contributor, Author) commented Sep 30, 2021

SGTM, let's wait for the release to wrap up. I will also fix the style issue.

@SparkQA commented Sep 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48281/

@SparkQA commented Sep 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48281/

@SparkQA commented Sep 30, 2021

Test build #143770 has finished for PR 34120 at commit 92e296b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


One minor thing (moving `replaceEnvVars` into an object).

Otherwise it looks fine, but let me have one more review round once this small refactor is done.

@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from 92e296b to ec146d1 on October 11, 2021 17:59
@xkrogen (Contributor, Author) commented Oct 11, 2021

Just pushed up a new commit moving `replaceEnvVars` from `Client` to `YarnSparkHadoopUtil`. Also did a minor refactor to share a common variable for the name regex pattern.
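For readers following along, here is a hedged sketch of what an environment-variable substitution helper with a shared name-pattern variable can look like. It is illustrative only and not the code actually added to `YarnSparkHadoopUtil` in this PR; in particular, escape handling is simplified here.

```scala
import scala.util.matching.Regex

object EnvVarSubstitutionSketch {
  // Shared pattern for a legal environment-variable name, reused by both the
  // $VAR and ${VAR} forms (the "common variable" idea mentioned above).
  private val envVarName = "[A-Za-z_][A-Za-z0-9_]*"

  // Matches $VAR or ${VAR}; an immediately preceding backslash escapes it.
  private val pattern: Regex =
    ("""(\\?)\$(\{(""" + envVarName + """)\}|(""" + envVarName + """))""").r

  def replaceEnvVars(text: String, env: Map[String, String]): String =
    pattern.replaceAllIn(text, m => {
      val escaped = m.group(1).nonEmpty
      val name = Option(m.group(3)).getOrElse(m.group(4))
      val result =
        if (escaped) m.matched.drop(1)   // "\$FOO" stays as the literal "$FOO"
        else env.getOrElse(name, "")     // unset variables expand to "", like bash
      Regex.quoteReplacement(result)
    })

  def main(args: Array[String]): Unit = {
    val env = Map("SPARK_HOME" -> "/opt/spark")
    println(replaceEnvVars("$SPARK_HOME/jars/*", env))      // /opt/spark/jars/*
    println(replaceEnvVars("${SPARK_HOME}/jars/*", env))    // /opt/spark/jars/*
    println(replaceEnvVars("""\$SPARK_HOME/jars/*""", env)) // $SPARK_HOME/jars/*
  }
}
```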

@SparkQA commented Oct 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48574/

@SparkQA commented Oct 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48574/

@SparkQA commented Oct 11, 2021

Test build #144096 has finished for PR 34120 at commit ec146d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from ec146d1 to 023963c on November 1, 2021 20:58
@xkrogen (Contributor, Author) commented Nov 1, 2021

@tgravescs @attilapiros -- now that the Spark 3.2 release is all wrapped up, can you take another look? I just rebased on latest master.

@attilapiros (Contributor) commented:
@xkrogen I can only do the review this Friday or Saturday.

@xkrogen (Contributor, Author) commented Nov 3, 2021

Thanks @tgravescs, re-triggered the tests.

@attilapiros no problem -- we can target to merge next week if there are no issues from your side.

@SparkQA commented Nov 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49349/

@SparkQA commented Nov 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49349/

@SparkQA commented Nov 3, 2021

Test build #144879 has finished for PR 34120 at commit e1b0668.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


Just a few questions, but in general it looks good.

Commits in this pull request:

  • [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line (closes apache#32810; authored by Erik Krogen <[email protected]>, signed off by Thomas Graves <[email protected]>; its commit message matches the PR description above)
  • …, add more test cases, follow Unix variable name conventions even on Windows based on the example of Hadoop's Shell class
  • …r to share a common variable for the name regex pattern
  • …ateway replacement doesn't take place, and also make use of a non-empty environment variable
@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from e1b0668 to 97540f8 on November 10, 2021 21:25
@xkrogen (Contributor, Author) commented Nov 10, 2021

Thanks a lot for the review @attilapiros; it led to some great enhancements to the test cases. I believe all of your comments should be addressed now.

@SparkQA commented Nov 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49540/

@SparkQA commented Nov 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49540/

@SparkQA commented Nov 11, 2021

Test build #145071 has finished for PR 34120 at commit 97540f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


LGTM

Thanks @xkrogen!

@xkrogen (Contributor, Author) commented Nov 15, 2021

Thanks @attilapiros ! Would you or @tgravescs be willing to help merge this?

@attilapiros (Contributor) commented:
merged to master

@xkrogen deleted the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch on November 16, 2021 22:29
@xkrogen (Contributor, Author) commented Nov 16, 2021

Many thanks @attilapiros and @tgravescs ! Also thanks to @peter-toth for initially reporting the issue with the original PR.

dongjoon-hyun pushed a commit that referenced this pull request Nov 19, 2021
…scala for Scala 2.13

### What changes were proposed in this pull request?

This PR mitigates an issue where MiMa fails for Scala 2.13 after SPARK-35672 (#34120).
```
$ dev/change-scala-version.sh 2.13
$ dev/mima
...
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.2.0! Found 8 potential problems (filtered 905)
[error]  * method userClassPath()scala.collection.mutable.ListBuffer in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.userClassPath")
[error]  * method copy(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy")
[error]  * synthetic method copy$default$10()Int in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$10")
[error]  * synthetic method copy$default$8()scala.collection.mutable.ListBuffer in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments has a different result type in current version, where it is scala.Option rather than scala.collection.mutable.ListBuffer
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$8")
[error]  * synthetic method copy$default$9()scala.Option in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments has a different result type in current version, where it is Int rather than scala.Option
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$9")
[error]  * method this(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)Unit in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.this")
[error]  * the type hierarchy of object org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments is different in current version. Missing types {scala.runtime.AbstractFunction10}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend$Arguments$")
[error]  * method apply(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments in object org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.apply")
...
```

It's odd that the class `Arguments` is `public` even though it is a member class of `CoarseGrainedExecutorBackend`, which is package-private, and that MiMa doesn't raise an error for Scala 2.12; in any case, adding an exclusion rule is one workaround.
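For reference, the filter rules suggested in the output above can be registered with sbt-mima-plugin roughly as follows. This is a hedged sketch of that registration, not necessarily the exact form or location used in Spark's MiMa configuration.

```scala
import com.typesafe.tools.mima.core._

// The strings below are copied from the "filter with:" suggestions in the MiMa
// output above; they exclude the Arguments case-class changes from the check.
object SuggestedMimaExcludes {
  val filters = Seq(
    ProblemFilters.exclude[MissingTypesProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend$Arguments$"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.userClassPath"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$10"),
    ProblemFilters.exclude[IncompatibleResultTypeProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$8"),
    ProblemFilters.exclude[IncompatibleResultTypeProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$9"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.this"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.apply")
  )
}
```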

### Why are the changes needed?

To keep the build stable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed MiMa passed.
```
$ dev/change-scala-version.sh 2.13
$ dev/mima
```

Closes #34649 from sarutak/followup-SPARK-35672-mima.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>