
[SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line. #34120

Conversation

@xkrogen (Contributor) commented Sep 27, 2021

What changes were proposed in this pull request?

Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior, where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. Within `yarn.Client`, environment variables in the configured paths are resolved before constructing the classpath.
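To make the shape of this change concrete, here is a minimal, self-contained Scala sketch of the override pattern described above. The class names and config keys are simplified stand-ins for illustration only; they are not the actual Spark classes or the exact keys backing `APP_JAR` and `SECONDARY_JARS`.

```scala
import java.io.File
import java.net.URL

// Base backend: by default there is no user classpath, matching the behavior of
// non-YARN resource managers described above.
class ExecutorBackendSketch(conf: Map[String, String]) {
  def getUserClassPath: Seq[URL] = Nil
}

// YARN backend: rebuild the user classpath from configuration values instead of
// receiving one --user-class-path argument per JAR on the command line.
class YarnExecutorBackendSketch(conf: Map[String, String])
    extends ExecutorBackendSketch(conf) {
  override def getUserClassPath: Seq[URL] = {
    val appJar = conf.get("sketch.yarn.app.jar").toSeq
    val secondaryJars = conf.get("sketch.yarn.secondary.jars")
      .toSeq.flatMap(_.split(",")).filter(_.nonEmpty)
    (appJar ++ secondaryJars).map(p => new File(p).toURI.toURL)
  }
}
```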

Please note that this is a re-submission of #32810, which was reverted in #34082 due to the issues described in this comment. This PR additionally includes the changes described in #34084 to resolve that issue, but unlike #34084 it properly handles escape strings.

Why are the changes needed?

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor, which then creates a classloader out of the URLs. Currently, in the case of YARN, this list of JARs is crafted by the driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can produce extremely long argument lists when there are many JARs, which can exceed the OS argument-length limit, typically manifesting as the error message:

/bin/bash: Argument list too long

A Google search indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list via configuration instead resolves the issue.
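As a rough illustration of the failure mode (hypothetical names, not the actual `ExecutorRunnable` code), the sketch below shows how a per-JAR `--user-class-path` flag makes the launched command grow linearly with the number of JARs, whereas a single comma-separated config value never lands on the bash command line at all.

```scala
object ArgLengthSketch {
  def main(args: Array[String]): Unit = {
    val jars = (1 to 2000).map(i => s"/path/to/dependency-$i.jar")

    // Old approach: one "--user-class-path <jar>" pair per JAR on the launch command.
    val commandLine = jars.flatMap(j => Seq("--user-class-path", j)).mkString(" ")
    println(s"command-line length: ${commandLine.length} characters")

    // New approach: the same list travels as one configuration value, so the
    // launched command stays short regardless of how many JARs there are.
    val configValue = jars.mkString(",")
    println(s"config value length: ${configValue.length} characters (not passed via bash)")
  }
}
```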

Does this PR introduce any user-facing change?

There is one small behavioral change, which is a bug fix. Previously the `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath` options were only applied to executors, meaning they would not work for the driver when running in cluster mode. This appears to be a bug; the documentation for this functionality does not mention that it applies only to executors. This PR fixes that issue.
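For context, the sketch below illustrates the documented gateway-path substitution these options provide: a path prefix that is valid on the gateway host is rewritten so cluster nodes can expand their own value. This is an illustration of the documented behavior under assumed example paths, not the Spark implementation.

```scala
object GatewayPathSketch {
  // Rewrite the gateway-side prefix with the cluster-side replacement, mirroring
  // what spark.yarn.config.gatewayPath and spark.yarn.config.replacementPath describe.
  def substitute(path: String, gatewayPath: String, replacementPath: String): String =
    path.replace(gatewayPath, replacementPath)

  def main(args: Array[String]): Unit = {
    // e.g. spark.yarn.config.gatewayPath=/usr/local/hadoop
    //      spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}
    val onGateway = "/usr/local/hadoop/lib/native"
    println(substitute(onGateway, "/usr/local/hadoop", "{{HADOOP_COMMON_HOME}}"))
    // prints {{HADOOP_COMMON_HOME}}/lib/native, which YARN expands on each node
  }
}
```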

Additionally, this fixes the main bash argument-length issue, allowing larger JAR lists to be passed successfully. Configuration of JARs is identical to before, and substitution of environment variables in `spark.jars` or `spark.yarn.config.replacementPath` works as expected.

How was this patch tested?

New unit tests were added in YarnClusterSuite. Also, we have been running a similar fix internally for 4 months with great success.

@xkrogen (Contributor, Author) commented Sep 27, 2021

@SparkQA commented Sep 27, 2021

Test build #143658 has finished for PR 34120 at commit 77b4fc1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48171/

@SparkQA commented Sep 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48173/

@SparkQA commented Sep 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48173/

@SparkQA commented Sep 28, 2021

Test build #143660 has finished for PR 34120 at commit 708a377.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48237/

@SparkQA commented Sep 29, 2021

Test build #143728 has finished for PR 34120 at commit dce813c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48239/

@SparkQA commented Sep 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48237/

@SparkQA commented Sep 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48239/

@SparkQA commented Sep 29, 2021

Test build #143726 has finished for PR 34120 at commit ddf4549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

Overall the changes look fine. At this point I think we should hold off on merging until the 3.2 release goes out, assuming you want to try to get this into 3.2.1.

@xkrogen (Contributor, Author) commented Sep 30, 2021

SGTM, let's wait for the release to wrap up. I will also fix the style issue.

@SparkQA commented Sep 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48281/

@SparkQA commented Sep 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48281/

@SparkQA commented Sep 30, 2021

Test build #143770 has finished for PR 34120 at commit 92e296b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


One minor thing (moving `replaceEnvVars` into an object).

Otherwise it looks fine, but let me have one more review round once this small refactor is done.

@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from 92e296b to ec146d1 on October 11, 2021 17:59
@xkrogen (Contributor, Author) commented Oct 11, 2021

Just pushed up a new commit moving `replaceEnvVars` from `Client` to `YarnSparkHadoopUtil`. Also did a minor refactor to share a common variable for the name regex pattern.
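For readers following along, here is a hedged sketch of what an environment-variable substitution helper with a shared name-pattern variable can look like. It is illustrative only and not the code actually added to `YarnSparkHadoopUtil` in this PR; in particular, escape handling is simplified here.

```scala
import scala.util.matching.Regex

object EnvVarSubstitutionSketch {
  // Shared pattern for a legal environment-variable name, reused by both the
  // $VAR and ${VAR} forms (the "common variable" idea mentioned above).
  private val envVarName = "[A-Za-z_][A-Za-z0-9_]*"

  // Matches $VAR or ${VAR}; an immediately preceding backslash escapes it.
  private val pattern: Regex =
    ("""(\\?)\$(\{(""" + envVarName + """)\}|(""" + envVarName + """))""").r

  def replaceEnvVars(text: String, env: Map[String, String]): String =
    pattern.replaceAllIn(text, m => {
      val escaped = m.group(1).nonEmpty
      val name = Option(m.group(3)).getOrElse(m.group(4))
      val result =
        if (escaped) m.matched.drop(1)   // "\$FOO" stays as the literal "$FOO"
        else env.getOrElse(name, "")     // unset variables expand to "", like bash
      Regex.quoteReplacement(result)
    })

  def main(args: Array[String]): Unit = {
    val env = Map("SPARK_HOME" -> "/opt/spark")
    println(replaceEnvVars("$SPARK_HOME/jars/*", env))      // /opt/spark/jars/*
    println(replaceEnvVars("${SPARK_HOME}/jars/*", env))    // /opt/spark/jars/*
    println(replaceEnvVars("""\$SPARK_HOME/jars/*""", env)) // $SPARK_HOME/jars/*
  }
}
```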

@SparkQA commented Oct 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48574/

@SparkQA commented Oct 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48574/

@SparkQA commented Oct 11, 2021

Test build #144096 has finished for PR 34120 at commit ec146d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from ec146d1 to 023963c on November 1, 2021 20:58
@xkrogen (Contributor, Author) commented Nov 1, 2021

@tgravescs @attilapiros -- now that the Spark 3.2 release is all wrapped up, can you take another look? I just rebased on latest master.

@attilapiros (Contributor) commented:
@xkrogen I can only do the review this Friday or Saturday.

@xkrogen (Contributor, Author) commented Nov 3, 2021

Thanks @tgravescs, re-triggered the tests.

@attilapiros no problem -- we can target to merge next week if there are no issues from your side.

@SparkQA commented Nov 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49349/

@SparkQA commented Nov 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49349/

@SparkQA commented Nov 3, 2021

Test build #144879 has finished for PR 34120 at commit e1b0668.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


Just a few questions, but in general it looks good.

Commits in this pull request:

  • [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line (closes apache#32810; authored by Erik Krogen <[email protected]>, signed off by Thomas Graves <[email protected]>; its commit message matches the PR description above)
  • …, add more test cases, follow Unix variable name conventions even on Windows based on the example of Hadoop's Shell class
  • …r to share a common variable for the name regex pattern
  • …ateway replacement doesn't take place, and also make use of a non-empty environment variable
@xkrogen force-pushed the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch from e1b0668 to 97540f8 on November 10, 2021 21:25
@xkrogen (Contributor, Author) commented Nov 10, 2021

Thanks a lot for the review @attilapiros; it led to some great enhancements to the test cases. I believe all of your comments should be addressed now.

@SparkQA commented Nov 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49540/

@SparkQA commented Nov 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49540/

@SparkQA commented Nov 11, 2021

Test build #145071 has finished for PR 34120 at commit 97540f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a comment:


LGTM

Thanks @xkrogen!

@xkrogen (Contributor, Author) commented Nov 15, 2021

Thanks @attilapiros ! Would you or @tgravescs be willing to help merge this?

@attilapiros (Contributor) commented:
merged to master

@xkrogen deleted the xkrogen-SPARK-35672-yarn-classpath-list-take2 branch on November 16, 2021 22:29
@xkrogen (Contributor, Author) commented Nov 16, 2021

Many thanks @attilapiros and @tgravescs ! Also thanks to @peter-toth for initially reporting the issue with the original PR.

dongjoon-hyun pushed a commit that referenced this pull request Nov 19, 2021
…scala for Scala 2.13

### What changes were proposed in this pull request?

This PR mitigates an issue where MiMa fails for Scala 2.13 after SPARK-35672 (#34120).
```
$ dev/change-scala-version.sh 2.13
$ dev/mima
...
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.2.0! Found 8 potential problems (filtered 905)
[error]  * method userClassPath()scala.collection.mutable.ListBuffer in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.userClassPath")
[error]  * method copy(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy")
[error]  * synthetic method copy$default$10()Int in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$10")
[error]  * synthetic method copy$default$8()scala.collection.mutable.ListBuffer in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments has a different result type in current version, where it is scala.Option rather than scala.collection.mutable.ListBuffer
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$8")
[error]  * synthetic method copy$default$9()scala.Option in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments has a different result type in current version, where it is Int rather than scala.Option
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$9")
[error]  * method this(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)Unit in class org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.this")
[error]  * the type hierarchy of object org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments is different in current version. Missing types {scala.runtime.AbstractFunction10}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend$Arguments$")
[error]  * method apply(java.lang.String,java.lang.String,java.lang.String,java.lang.String,Int,java.lang.String,scala.Option,scala.collection.mutable.ListBuffer,scala.Option,Int)org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments in object org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.apply")
...
```

It's odd that the class `Arguments` is `public` even though it is a member class of `CoarseGrainedExecutorBackend`, which is package-private, and that MiMa doesn't raise an error for Scala 2.12; in any case, adding an exclusion rule is one workaround.
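For reference, the filter rules suggested in the output above can be registered with sbt-mima-plugin roughly as follows. This is a hedged sketch of that registration, not necessarily the exact form or location used in Spark's MiMa configuration.

```scala
import com.typesafe.tools.mima.core._

// The strings below are copied from the "filter with:" suggestions in the MiMa
// output above; they exclude the Arguments case-class changes from the check.
object SuggestedMimaExcludes {
  val filters = Seq(
    ProblemFilters.exclude[MissingTypesProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend$Arguments$"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.userClassPath"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$10"),
    ProblemFilters.exclude[IncompatibleResultTypeProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$8"),
    ProblemFilters.exclude[IncompatibleResultTypeProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.copy$default$9"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.this"),
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.executor.CoarseGrainedExecutorBackend#Arguments.apply")
  )
}
```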

### Why are the changes needed?

To keep the build stable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed MiMa passed.
```
$ dev/change-scala-version.sh 2.13
$ dev/mima
```

Closes #34649 from sarutak/followup-SPARK-35672-mima.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>