Details
Bug
Status: Resolved
Major
Resolution: Fixed
4.0.0
This was tested from the spark-shell, in local mode. All Spark versions were run with default settings.
Spark 4.0 SNAPSHOT: Exception.
Spark 4.0 Preview: Exception.
Spark 3.5.1: Success.
Description
I run into a NumberFormatException, possibly caused by an error in whole-stage code generation (WholeStageCodegen), when I exercise SUBSTRING_INDEX with a column that contains a null row:
// Create integer table with one null.
sql("SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num)").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")

// Exercise substring-index.
sql("SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable`").show()
On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following exception:
java.lang.NumberFormatException: For input string: "columnartorow_value_0"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
    at java.base/java.lang.Integer.parseInt(Integer.java:668)
    at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
    at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
    at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
    at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
    at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
    at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
    at scala.Option.getOrElse(Option.scala:201)
    at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
    at org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
    at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
    at scala.Option.getOrElse(Option.scala:201)
    at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
    at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
    at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
    at scala.collection.immutable.List.map(List.scala:247)
    at scala.collection.immutable.List.map(List.scala:79)
    at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
    at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
    at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
    at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
    at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
    at org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
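The trace shows Integer.parseInt being called from SubstringIndex.doGenCode with the input "columnartorow_value_0", which looks like a generated variable name rather than a number. The following is a minimal sketch of that failure mode in plain Scala, with no Spark involved. It assumes (this is not confirmed by the trace alone) that a literal count reaches the parse call as its digits while a column-backed count reaches it as a variable name. The names literalTerm and variableTerm are hypothetical.

```scala
import scala.util.Try

// Hypothetical stand-ins for what the code generator might hand to parseInt.
// Assumption: a literal count is rendered as its digits.
val literalTerm = "2"
// Taken verbatim from the exception message in the stack trace.
val variableTerm = "columnartorow_value_0"

// Parsing succeeds only when the term is an actual integer literal.
val parsedLiteral = Try(Integer.parseInt(literalTerm))
val parsedVariable = Try(Integer.parseInt(variableTerm))

println(s"literal parsed: ${parsedLiteral.isSuccess}")
println(s"variable parsed: ${parsedVariable.isSuccess}")
```

If that assumption holds, it would also explain why the failure only shows up when the count argument comes from a column, as it does in the query above.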
The same query seems to run alright on Spark 3.5.x:
+----+-----+
| num| subs|
+----+-----+
|   1|    a|
|   2|  a_a|
|   3|a_a_a|
|NULL| NULL|
+----+-----+
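For reference, the expected result can be modelled in plain Scala. This is only a sketch of the documented SUBSTRING_INDEX semantics, not Spark's implementation; the function name substringIndex and the use of Option to stand in for SQL NULL are assumptions made for illustration.

```scala
import java.util.regex.Pattern

// Reference model only: returns the part of `str` before `count` occurrences
// of `delim` (or, for a negative count, the part after the matching delimiter
// counted from the right). A missing count models SQL NULL and yields None.
def substringIndex(str: String, delim: String, count: Option[Int]): Option[String] =
  count.map { n =>
    if (n == 0 || delim.isEmpty) ""
    else {
      // Quote the delimiter so it is treated literally; -1 keeps trailing empties.
      val parts = str.split(Pattern.quote(delim), -1)
      if (n > 0) parts.take(n).mkString(delim)
      else parts.takeRight(-n).mkString(delim)
    }
  }

// Same inputs as the query above: counts 1, 2, 3 and NULL.
val counts = Seq(Some(1), Some(2), Some(3), None)
counts.foreach(c => println(s"$c -> ${substringIndex("a_a_a", "_", c)}"))
```

Under this model a null count yields a null result without raising an error, which matches the Spark 3.5.x output shown above.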