Spark / SPARK-48989

WholeStageCodeGen error resulting in NumberFormatException when calling SUBSTRING_INDEX


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL
    • Environment: This was tested from the spark-shell, in local mode. All Spark versions were run with default settings.

      Spark 4.0 SNAPSHOT:  Exception.
      Spark 4.0 Preview:  Exception.
      Spark 3.5.1:  Success.

    Description

      I run into a NumberFormatException, apparently caused by an error in WholeStageCodeGen, when exercising SUBSTRING_INDEX against a table that contains a NULL row:

      
      // Create integer table with one null.
      sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable")
      
      // Exercise substring-index.
      sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show()
      
      
      

      On Spark 4.0 (current HEAD, as of today, and also on 4.0.0-preview1), I see the following exception:

      java.lang.NumberFormatException: For input string: "columnartorow_value_0"
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
        at java.base/java.lang.Integer.parseInt(Integer.java:668)
        at org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660)
        at org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869)
        at org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888)
        at org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868)
        at org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
        at org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
        at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162)
        at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74)
        at scala.collection.immutable.List.map(List.scala:247)
        at scala.collection.immutable.List.map(List.scala:79)
        at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74)
        at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085)
        at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74)
        at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200)
        at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)
        at org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68)
        at org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193)
        at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99)
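
      Reading the trace (not confirmed against the 4.0 sources), SubstringIndex.doGenCode appears to apply Integer.parseInt to the generated Java code for the count argument, which only works when that code is an integer literal; here it is the ColumnarToRow output variable "columnartorow_value_0". A sketch that narrows along that axis (untested; behaviour not claimed):

      // With a literal count the query appears unaffected (the whole expression
      // is foldable and may be constant-folded before codegen is reached):
      sql(" SELECT SUBSTRING_INDEX('a_a_a', '_', 2) AS subs ").show()

      // With a column-backed count, the generated code for the third argument is
      // a variable name rather than an integer literal, and parsing it fails:
      sql(" SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show()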
      

      The same query runs correctly on Spark 3.5.x:

      +----+-----+
      | num| subs|
      +----+-----+
      |   1|    a|
      |   2|  a_a|
      |   3|a_a_a|
      |NULL| NULL|
      +----+-----+
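
      One way to check whether the regression is confined to the generated-code path (a diagnostic sketch only; I have not verified what the interpreted fallback does here):

      // Disable whole-stage codegen, then re-run the failing query; if it then
      // succeeds, the problem is specific to the code-generation path.
      spark.conf.set("spark.sql.codegen.wholeStage", "false")
      sql(" SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM PARQUET.`/tmp/mytable` ").show()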
      
      

          People

            Assignee: Mithun Radhakrishnan (mithun)
            Reporter: Mithun Radhakrishnan (mithun)
