Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking][jvm-packages] add barrier execution mode #7836

Merged
merged 1 commit into from
Apr 25, 2022

Conversation

wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Apr 24, 2022

To fix #7835. Credit to @WeichenXu123

By introducing the barrier execution mode, we don't need to kill SparkContext when some xgboost tasks failed, instead, Spark will abort the whole barrier stage which does not depend on SparkListener, we will never encounter the xgboost task hang issue. So in this PR, the killSparkContextOnWorkerFailure parameter is deleted.

Since this PR has deleted Spark SparkParallelismTracker, the timeoutRequestWorkers parameter is not needed anymore. But one test cross-version model loading (0.82) will fail if we just delete timeoutRequestWorkers, we just keep this parameter in this PR. Will file a following up PR to delete timeoutRequestWorkers parameters.

With introducing barrier execution mode. we don't need to kill SparkContext
when some xgboost tasks failed, instead, Spark will do it for us. So in this
PR, killSparkContextOnWorkerFailure parameter is deleted.
@wbo4958
Copy link
Contributor Author

wbo4958 commented Apr 25, 2022

@trivialfis could you help to review it?

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, being able to remove code and improve robustness is exciting!

  • Could you please share more details on what's the breaking change and how users should adapt to it?
  • Could you please keep some RABIT mock tests here to show that exceptions are indeed being well handled?

Comment on lines 375 to 386
test("throw exception for empty partition in trainingset") {
val paramMap = Map("eta" -> "0.1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "multi:softmax", "num_class" -> "2", "num_round" -> 5,
"num_workers" -> numWorkers, "tree_method" -> "auto")
"objective" -> "binary:logistic", "num_class" -> "2", "num_round" -> 5,
"num_workers" -> numWorkers, "tree_method" -> "auto", "allow_non_zero_for_missing" -> true)
// The Dmatrix will be empty
val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 1, Array(), Array())))
val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 4,
Array(0, 1, 2, 3), Array(0, 1, 2, 3))))
val xgb = new XGBoostClassifier(paramMap)
intercept[XGBoostError] {
val model = xgb.fit(trainingDF)
intercept[SparkException] {
xgb.fit(trainingDF)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes strictly related to the support of barrier mode?

Copy link
Contributor Author

@wbo4958 wbo4958 Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely not. I just make this test more like the test description.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Apr 25, 2022

Thank you for working on this, being able to remove code and improve robustness is exciting!

  • Could you please share more details on what's the breaking change and how users should adapt to it?

killSparkContextOnWorkerFailure parameter is deleted. Users are supposed not to use this parameter.

  • Could you please keep some RABIT mock tests here to show that exceptions are indeed being well handled?

Not exactly. the deleted case test("test SparkContext should not be killed ") is only for kill_spark_context_on_worker_failure parameter, and there already has the rabbit mock tests https://github.com/dmlc/xgboost/pull/7836/files#diff-83dc1ab309d0aaa9e7ec44586d99b82cee3c34daa70592fccaca834c6acc27afR85

@trivialfis trivialfis merged commit dc2e699 into dmlc:master Apr 25, 2022
@trivialfis trivialfis mentioned this pull request Apr 25, 2022
7 tasks
@wbo4958 wbo4958 deleted the barrier branch April 26, 2022 02:47
trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Apr 29, 2022
With the introduction of the barrier execution mode. we don't need to kill SparkContext when some xgboost tasks failed. Instead, Spark will handle the errors for us. So in this PR, `killSparkContextOnWorkerFailure` parameter is deleted.
trivialfis added a commit that referenced this pull request Apr 29, 2022
* [jvm-packages] move the dmatrix building into rabit context (#7823)

This fixes the QuantileDeviceDMatrix in distributed environment.

* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)

* [Breaking][jvm-packages] Use barrier execution mode (#7836)

With the introduction of the barrier execution mode. we don't need to kill SparkContext when some xgboost tasks failed. Instead, Spark will handle the errors for us. So in this PR, `killSparkContextOnWorkerFailure` parameter is deleted.

* [doc] remove the doc about killing SparkContext [skip ci] (#7840)

* [jvm-package] remove the coalesce in barrier mode (#7846)

* [jvm-packages] Fix model compatibility (#7845)

* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang <[email protected]>
Co-authored-by: Michael Allman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] [JVM-Packages] Add barrier execution mode for support
2 participants