XGBoostSageMakerEstimator.fit() returns libsvm exception when reading csv file. #47
Hi, thanks for using SageMaker Spark! XGBoostSageMakerEstimator uses Spark's LibSVMOutputWriter, which is rather restrictive in its schema validation: https://github.com/apache/spark/blob/930b90a84871e2504b57ed50efa7b8bb52d3ba44/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala#L79 I think the issue stems from the number of columns in your training data. There was some discussion of extra columns in #12 - not sure if anything in that issue might be relevant here.
Thanks for @laurenyu's reply. I wonder whether XGBoostSageMakerEstimator uses the verifySchema() provided here https://github.com/apache/spark/blob/930b90a84871e2504b57ed50efa7b8bb52d3ba44/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala#L79 even if the input is in csv format? The official guide https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html says that the input can be libsvm or csv.
So I think data read in as CSV does not need to fit the two-column schema. In this example https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb, the author makes a similar point.
I followed the example's steps to create the CSV file it uses; the first 3 lines are:
All SageMakerEstimators rely on Spark's DataFrame writers. The XGBoostSageMakerEstimator defaults to writing data in "libsvm" format. Can you try passing "csv" for "trainingSparkDataFormat" (or "com.databricks.spark.csv" if you're using spark-csv)? Line 491 in dabd136
@andremoeller
Sure! I just commented on that issue. You will also have to pass in Some("csv") for the parameter shown at Line 488 in dabd136. The expected content types are covered in the XGBoost docs:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
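Putting both suggestions together, here's a rough pyspark sketch; the keyword names (trainingSparkDataFormat, trainingContentType) are assumed to mirror the Scala constructor parameters referenced above:

```python
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Sketch under the assumption that the Python constructor mirrors the
# Scala one: write the DataFrame out as CSV instead of the default
# "libsvm", and report "csv" as the training content type to SageMaker.
xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m3.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m3.xlarge",
    endpointInitialInstanceCount=1,
    trainingSparkDataFormat="csv",
    trainingContentType="csv")
```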
@andremoeller
So I think the only difference between the training and inference CSV files is the first column. The question is that when I call transform(), it seems the function wants the two-column "label" and "features" format. I tried to find a parameter to pass to the initializer, like before, but couldn't find one on the page you provided earlier. Thank you!
Lines 479 to 480 in 81ac056
If you want to send CSV, you should use this UnlabeledCSVRequestRowSerializer instead (Lines 27 to 28 in 81ac056), which serializes a column of feature vectors into CSV rows.
Right now, your DataFrame doesn't have such a column for a features vector, but you can make one with a VectorAssembler. After making an XGBoost estimator with this serializer, you can call transform() on the assembled DataFrame; a rough sketch of the assembler step follows below.
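A hedged pyspark sketch of the VectorAssembler step (the _c1.._c27 column names are assumed from the CSV schema discussed in this thread, and test_data is a placeholder for the unlabeled DataFrame):

```python
from pyspark.ml.feature import VectorAssembler

# Assemble the 27 raw feature columns into a single "features" vector
# column; _c0 is the label and is deliberately left out for inference.
assembler = VectorAssembler(
    inputCols=["_c{}".format(i) for i in range(1, 28)],
    outputCol="features")

# test_data is the unlabeled DataFrame read from CSV (placeholder name).
test_data_features = assembler.transform(test_data).select("features")
```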
Feel free to reach out if you run into trouble or if this was unclear.
@andremoeller Which Python library should I import to get UnlabeledCSVRequestRowSerializer()?
@haowang-ms89 Lines 31 to 40 in 81ac056
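Assuming the Python package layout matches that snippet, the import and wiring might look like the following sketch (the import path, the requestRowSerializer keyword, and the featuresColumnName argument are assumptions based on the constructor lines referenced earlier):

```python
# Assumed import path within the sagemaker_pyspark package.
from sagemaker_pyspark.transformation.serializers import UnlabeledCSVRequestRowSerializer
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Serialize each row's "features" vector as one unlabeled CSV line.
serializer = UnlabeledCSVRequestRowSerializer(featuresColumnName="features")

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m3.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m3.xlarge",
    endpointInitialInstanceCount=1,
    requestRowSerializer=serializer)
```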
@andremoeller
The "features" column in line 1 contains 27, the number of features, but the feature vector is wrong (the feature vector of all vectors with the number 27 are [0,1,3,4,5,6,...]). I use the Python version of assembler I found here https://spark.apache.org/docs/2.3.0/ml-features.html#vectorassembler: And test_data_features.show() gives: |
That's normal. The VectorAssembler sparsely encodes vectors when there are lots of zeros in the data, to save memory. The rows starting with 27 are SparseVectors: the 27 is the size of the vector, followed by an array of indices, followed by an array of values. The densely encoded rows just have more nonzero values. I believe the UnlabeledCSVRequestRowSerializer handles sparse vectors correctly (that is, fills in the zeros when serializing to CSV).
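To see that the two encodings are equivalent, here's a standalone pyspark snippet (the values are illustrative):

```python
from pyspark.ml.linalg import Vectors

# (27, [0, 1, 3], [9.6071, 2.0, 1.0]) means: a vector of size 27 whose
# entries at indices 0, 1, and 3 are nonzero; every other entry is 0.0.
sparse = Vectors.sparse(27, [0, 1, 3], [9.6071, 2.0, 1.0])
print(sparse.toArray())  # densified: [9.6071, 2.0, 0.0, 1.0, 0.0, ..., 0.0]
```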
@andremoeller The following exception occurs: Does this mean that the transform() function requires a "label" column? But a label should not be required when doing prediction?
That looks like it's still using the LibSVM serializer, not the UnlabeledCSVRequestRowSerializer. The LibSVM serializer validates the schema like this: Lines 28 to 30 in 81ac056
Did you set the UnlabeledCSVRequestRowSerializer on your estimator?
@andremoeller
It looks like there is a parsing error? The bracket '[' should not be attached to the number?
Hi @haowang-ms89,
This line indicates that the XGBoost CSV deserializer is failing to deserialize the response from the XGBoost model. That number (with the bracket attached to it) comes from the response body itself. Would it be possible to send the request body you're sending, to help us reproduce? I believe you can find it in the endpoint logs for your endpoint, in CloudWatch. In the meantime, I'll reach out to the developers of the XGBoost SageMaker algorithm. Thanks!
Hi @andremoeller
Hi @haowang-ms89, Huh, it's possible that they don't log failed requests. Thank you for that warning, though. I'll update this issue when I hear back from them.
Hi @andremoeller
Yes, it sure would. If you can post it, I'll try to reproduce the issue.
@andremoeller Here is the code I wrote:
productionData0207_noheader_allnumber.txt |
Hi @haowang-ms89, Thanks! I could reproduce this. I've contacted the XGBoost developers and asked them to take a look at what's going wrong.
Hi @haowang-ms89, Thanks for sharing the details. Is the issue here that you cannot get the prediction results on hosting?
Hi @EvanZzZz
There's a bug in the XGBoost container with the multi:softprob objective. Other objectives (those that return a scalar per record rather than a vector) still work as expected. You could also call InvokeEndpoint directly using the AWS Java SDK or boto3 client (or another AWS client) for SageMaker Runtime. Please let us know if you have any other questions. Thanks!
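For example, a minimal boto3 sketch of calling InvokeEndpoint directly (the endpoint name and payload are placeholders):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# One unlabeled record: comma-separated features, no header, no label.
payload = "9.6071,2,1,1,2,1,1,1,1,3,1,0,0,0,0,3,0,0,3,0,0,3,0,2,1,1,1"

response = runtime.invoke_endpoint(
    EndpointName="my-xgboost-endpoint",  # placeholder endpoint name
    ContentType="text/csv",
    Body=payload)
print(response["Body"].read())
```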
Hi @andremoeller |
Yeah, I think that's right. Hyperparameters are passed in to XGBoost just as documented on the XGBoost GitHub page: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
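For instance, following the setObjective/setNumRound pattern used earlier in this thread (the specific setters below are assumptions, mapped to XGBoost's documented parameter names):

```python
# Assumed setters, mirroring XGBoost's documented parameter names.
xgboost_estimator.setEta(0.2)        # "eta": step size shrinkage
xgboost_estimator.setMaxDepth(6)     # "max_depth": maximum tree depth
xgboost_estimator.setSubsample(0.8)  # "subsample": row sampling ratio
```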
@andremoeller
@andremoeller |
Do they show up in your CloudWatch logs for your XGBoost training job? If not, we won't be able to get them, but if so: streaming logs from CloudWatch to Spark is possible, just not implemented yet.
Labeling this as a bug and keeping this open to track the new output format for XGBoost for multi-dimensional arrays.
Hi there, I am looking to use the
Hello, I'm in the same situation as @DanyalAndriano and can't make
I write my Python code with Zeppelin 0.7.3 and Spark 2.3.0 on an EMR (emr-5.13.0) cluster to use SageMaker's XGBoost algorithm. The input data is a CSV file. The first 3 lines of the file are below (the first column is the target class, 0 or 1, and there is no header line):
```
0,9.6071,2,1,1,2,1,1,1,1,3,1,0,0,0,0,3,0,0,3,0,0,3,0,2,1,1,1
0,2.7296,3,1,1,1,1,1,0,0,8,1,0,0,0,0,3,0,0,3,0,0,3,0,1,1,1,1
0,10.3326,1,0,1,2,1,1,0,0,4,1,1,0,1,0,3,0,0,3,0,0,3,0,0,3,0,0
```
I used the same imports as the example:
```python
%pyspark
from pyspark import SparkContext, SparkConf
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator
```
I initialize the estimator:
```python
%pyspark
xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m3.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m3.xlarge",
    endpointInitialInstanceCount=1)
xgboost_estimator.setObjective('multi:softprob')
xgboost_estimator.setNumRound(25)
xgboost_estimator.setNumClasses(2)
```
I read the csv file with:
training_data = spark.read.csv("s3://poc.sagemaker.myfile/myfile.csv", sep=",", header="false", inferSchema="true")
training_data.show() gives:
```
+---+-------+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c0|    _c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|_c25|_c26|_c27|
+---+-------+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|  0| 7.1732|  1|  0|  1|  2|  2|  2|  0|  0|   5|   1|   1|   0|   1|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|
|  0| 1.3087|  1|  0|  1|  2|  1|  1|  0|  0|   2|   1|   1|   0|   2|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|
|  0| 3.3539|  1|  0|  1|  2|  2|  1|  0|  0|   6|   1|   1|   0|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|
|  0| 1.9767|  1|  0|  1|  1|  1|  1|  1|  1|  73|   1|   0|   0|   1|   0|   3|   0|   0|   3|   0|   1|   0|   1|   1|   0|   1|   1|
|  0| 5.7194|  1|  0|  1|  2|  1|  1|  0|  0|   3|   1|   0|   0|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|
|  0| 9.8398|  3|  1|  1|  2|  1|  1|  0|  0|   2|   1|   1|   0|   1|   0|   3|   0|   0|   3|   0|   2|   1|   1|   2|   1|   1|   1|
|  0| 2.4942|  1|  0|  1|  2|  1|  1|  0|  0| 377|   1|   1|   0|   2|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|   3|   0|   0|
|  0| 7.9179|  4|  1|  1|  2|  1|  1|  0|  0|   4|   1|   1|   0|   2|   0|   3|   0|   0|   3|   0|   2|   0|   1|   2|   1|   1|   1|
```
When I try to fit the xgboost model with:
xgboost_model = xgboost_estimator.fit(training_data)
The following exception is returned:
```
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8068283221541252178.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8068283221541252178.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sagemaker_pyspark/SageMakerEstimator.py", line 253, in fit
    return self._call_java("fit", dataset)
  File "/usr/local/lib/python2.7/site-packages/sagemaker_pyspark/wrapper.py", line 76, in _call_java
    java_value = super(SageMakerJavaWrapper, self)._call_java(name, *java_args)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 51, in _call_java
    return _java2py(sc, m(*java_args))
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o130.fit.
: java.io.IOException: Illegal schema for libsvm data, schema=StructType(StructField(_c0,IntegerType,true), StructField(_c1,DoubleType,true), StructField(_c2,IntegerType,true), StructField(_c3,IntegerType,true), StructField(_c4,IntegerType,true), StructField(_c5,IntegerType,true), StructField(_c6,IntegerType,true), StructField(_c7,IntegerType,true), StructField(_c8,IntegerType,true), StructField(_c9,IntegerType,true), StructField(_c10,IntegerType,true), StructField(_c11,IntegerType,true), StructField(_c12,IntegerType,true), StructField(_c13,IntegerType,true), StructField(_c14,IntegerType,true), StructField(_c15,IntegerType,true), StructField(_c16,IntegerType,true), StructField(_c17,IntegerType,true), StructField(_c18,IntegerType,true), StructField(_c19,IntegerType,true), StructField(_c20,IntegerType,true), StructField(_c21,IntegerType,true), StructField(_c22,IntegerType,true), StructField(_c23,IntegerType,true), StructField(_c24,IntegerType,true), StructField(_c25,IntegerType,true), StructField(_c26,IntegerType,true), StructField(_c27,IntegerType,true))
	at org.apache.spark.ml.source.libsvm.LibSVMFileFormat.verifySchema(LibSVMRelation.scala:86)
	at org.apache.spark.ml.source.libsvm.LibSVMFileFormat.prepareWrite(LibSVMRelation.scala:122)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
	at com.amazonaws.services.sagemaker.sparksdk.internal.DataUploader.writeData(DataUploader.scala:111)
	at com.amazonaws.services.sagemaker.sparksdk.internal.DataUploader.uploadData(DataUploader.scala:90)
	at com.amazonaws.services.sagemaker.sparksdk.SageMakerEstimator.fit(SageMakerEstimator.scala:299)
	at com.amazonaws.services.sagemaker.sparksdk.SageMakerEstimator.fit(SageMakerEstimator.scala:175)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
```
Did I miss some step, such that the estimator uses the libsvm libraries to process the CSV input?