[Breaking][jvm-packages] add barrier execution mode #7836

wbo4958 · 2022-04-24T10:37:14Z

With barrier execution mode, we no longer need to kill the SparkContext when some XGBoost tasks fail. Instead, Spark aborts the whole barrier stage, which does not depend on SparkListener, so we will no longer encounter the XGBoost task hang issue. This PR therefore deletes the killSparkContextOnWorkerFailure parameter.

Since this PR deletes SparkParallelismTracker, the timeoutRequestWorkers parameter is no longer needed. However, one test, cross-version model loading (0.82), fails if we simply delete timeoutRequestWorkers, so this PR keeps the parameter. I will file a follow-up PR to delete timeoutRequestWorkers.

With barrier execution mode, we don't need to kill the SparkContext when some XGBoost tasks fail; Spark handles the failure for us. This PR therefore deletes the killSparkContextOnWorkerFailure parameter.
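For readers unfamiliar with barrier execution mode, here is a minimal sketch of the Spark API the description refers to. It is not the code from this PR: the object name `BarrierSketch`, the method `partitionSums`, and the toy data are assumptions made for illustration, and it assumes Spark 2.4 or later on the classpath with at least as many task slots as partitions.

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

// Illustrative only: BarrierSketch and partitionSums are hypothetical names.
object BarrierSketch {
  def partitionSums(): Array[Int] = {
    // local[4] provides 4 task slots, enough to launch all 4 barrier tasks at once.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("barrier-sketch")
      .getOrCreate()
    try {
      val rdd = spark.sparkContext.parallelize(1 to 8, numSlices = 4)
      // rdd.barrier() marks the stage as a barrier stage: Spark schedules
      // all of its tasks together rather than one by one.
      rdd.barrier().mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        // Every task blocks here until all tasks in the stage arrive.
        ctx.barrier()
        Iterator(iter.sum)
      }.collect()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(partitionSums().mkString(","))
}
```

Because the tasks of a barrier stage are treated as a unit, a failure in one of them is handled by Spark at the stage level, which is the behavior this PR relies on instead of killing the SparkContext.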

wbo4958 · 2022-04-25T03:54:16Z

@trivialfis could you help review it?

trivialfis

Thank you for working on this. Being able to remove code and improve robustness is exciting!

Could you please share more details on what the breaking change is and how users should adapt to it?
Could you please keep some RABIT mock tests here to show that exceptions are indeed handled well?

trivialfis · 2022-04-25T08:02:18Z

...kages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostGeneralSuite.scala

  test("throw exception for empty partition in trainingset") {
    val paramMap = Map("eta" -> "0.1", "max_depth" -> "6", "silent" -> "1",
-      "objective" -> "multi:softmax", "num_class" -> "2", "num_round" -> 5,
-      "num_workers" -> numWorkers, "tree_method" -> "auto")
+      "objective" -> "binary:logistic", "num_class" -> "2", "num_round" -> 5,
+      "num_workers" -> numWorkers, "tree_method" -> "auto", "allow_non_zero_for_missing" -> true)
    // The Dmatrix will be empty
-    val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 1, Array(), Array())))
+    val trainingDF = buildDataFrame(Seq(XGBLabeledPoint(1.0f, 4,
+      Array(0, 1, 2, 3), Array(0, 1, 2, 3))))
    val xgb = new XGBoostClassifier(paramMap)
-    intercept[XGBoostError] {
-      val model = xgb.fit(trainingDF)
+    intercept[SparkException] {
+      xgb.fit(trainingDF)
    }
  }


Are these changes strictly related to the support of barrier mode?

Not at all. I just made this test match its description more closely.

wbo4958 · 2022-04-25T08:40:47Z

> Thank you for working on this. Being able to remove code and improve robustness is exciting!
>
> Could you please share more details on what the breaking change is and how users should adapt to it?

The killSparkContextOnWorkerFailure parameter is deleted. Users should no longer set this parameter.
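A minimal sketch of how calling code might adapt, assuming it passes the parameter through a parameter map under the key `kill_spark_context_on_worker_failure`. The object `ParamMigration` and the helper `dropRemovedParams` are hypothetical names used only for illustration; they are not part of XGBoost4J-Spark.

```scala
// Illustrative only: ParamMigration and dropRemovedParams are hypothetical names.
object ParamMigration {
  // Parameter key removed by this PR (spelling taken from this discussion).
  val RemovedKey = "kill_spark_context_on_worker_failure"

  // Returns a copy of the parameter map without the removed key.
  def dropRemovedParams(params: Map[String, Any]): Map[String, Any] =
    params - RemovedKey

  def main(args: Array[String]): Unit = {
    val oldParams: Map[String, Any] = Map(
      "eta" -> "0.1",
      "max_depth" -> "6",
      "num_round" -> 5,
      RemovedKey -> false)
    // The cleaned map can then be passed to the estimator as before.
    println(dropRemovedParams(oldParams).keys.toSeq.sorted.mkString(", "))
  }
}
```

In most code bases the simpler fix is to delete the entry from the map literal; a helper like this only makes sense where parameter maps are loaded from saved configuration.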

> Could you please keep some RABIT mock tests here to show that exceptions are indeed handled well?

Not exactly. The deleted test case test("test SparkContext should not be killed ") only covered the kill_spark_context_on_worker_failure parameter, and the Rabit mock tests already exist: https://github.com/dmlc/xgboost/pull/7836/files#diff-83dc1ab309d0aaa9e7ec44586d99b82cee3c34daa70592fccaca834c6acc27afR85

With the introduction of the barrier execution mode, we don't need to kill the SparkContext when some XGBoost tasks fail. Instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.

* [jvm-packages] move the dmatrix building into rabit context (#7823)
  This fixes the QuantileDeviceDMatrix in distributed environment.
* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)
* [Breaking][jvm-packages] Use barrier execution mode (#7836)
  With the introduction of the barrier execution mode, we don't need to kill the SparkContext when some XGBoost tasks fail. Instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is deleted.
* [doc] remove the doc about killing SparkContext [skip ci] (#7840)
* [jvm-package] remove the coalesce in barrier mode (#7846)
* [jvm-packages] Fix model compatibility (#7845)
* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang
Co-authored-by: Michael Allman

[Breaking][jvm-packages] add barrier execution mode

0a958f4

This was referenced Apr 24, 2022

[Followup][JVM-Packages] delete timeoutRequestWorkers parameter #7837

Closed

[Followup][JVM-Packages] re-architecture the xgboost testing framework #7838

Open

trivialfis reviewed Apr 25, 2022

trivialfis approved these changes Apr 25, 2022

trivialfis merged commit dc2e699 into dmlc:master Apr 25, 2022

This was referenced Apr 25, 2022

[Breaking][jvm-packages] remove timeoutRequestWorkers parameter #7839

Merged

[doc] remove the doc about killing SparkContext [skip ci] #7840

Merged

[jvm-packages] bridge the gaps between jvm package and native xgboost #7802

Closed

trivialfis mentioned this pull request Apr 25, 2022

1.6.1 Patch Release #7841

Closed

7 tasks

wbo4958 deleted the barrier branch April 26, 2022 02:47
