Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support calling Java parser in SQLFlow #1171

Conversation

tonyyang-svail
Copy link
Collaborator

@tonyyang-svail tonyyang-svail commented Nov 12, 2019

This PR depends on #1170.

This PR is a part of #1126.

@tonyyang-svail tonyyang-svail changed the title Integrate java parser Support calling Java parser in SQLFlow Nov 12, 2019
@tonyyang-svail
Copy link
Collaborator Author

tonyyang-svail commented Nov 12, 2019

Calling Java parser takes about 1100ms for HiveQL parser and 800ms for Calcite parser. I am not sure if it's normal for Java command-line programs. If it is not normal, we'd better do some profiling on the Java program. @Yancey1989 @weiguoz Do you have experience in profiling?

You can reproduce the result the following command.

$ echo "select 1" > /tmp/input.sql
$ time java -cp /opt/sqlflow/parser/parser-1.0-SNAPSHOT-jar-with-dependencies.jar org.sqlflow.parser.ParserAdaptorCmd -p calcite -i /tmp/input.sql -o /tmp/output.json

real	0m0.922s
user	0m0.690s
sys	0m0.530s

@typhoonzero
Copy link
Collaborator

@tonyyang-svail can you write a test case that starts jvm once and run the parsing for 1000 times to see if the major time consumed is caused by starting the jvm?

@tonyyang-svail
Copy link
Collaborator Author

tonyyang-svail commented Nov 12, 2019

@typhoonzero I have tested it by changing the command-line argument calcite to i_dont_exists, which would let the program exit before doing anything useful. And it takes about 137ms, which the original time is 900+ms. So should I assume the parsing algorithm is the bottleneck?

$ time java -cp /opt/sqlflow/parser/parser-1.0-SNAPSHOT-jar-with-dependencies.jar org.sqlflow.parser.ParserAdaptorCmd -p i_dont_exist -i /tmp/input.sql -o /tmp/output.json

invalid parser type i_dont_exist
real	0m0.137s
user	0m0.060s
sys	0m0.080s

@tonyyang-svail tonyyang-svail mentioned this pull request Nov 12, 2019
@typhoonzero
Copy link
Collaborator

@tonyyang-svail seems so. Maybe we can merge this first and do more profiling or performance fixes in another PR.

"-cp", "/opt/sqlflow/parser/parser-1.0-SNAPSHOT-jar-with-dependencies.jar",
"org.sqlflow.parser.ParserAdaptorCmd",
"-p", typ,
"-i", inputFile,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can refine this later using the Linux pipe to avoid writing temporary files on the disk.

@weiguoz
Copy link
Collaborator

weiguoz commented Nov 12, 2019

Calling Java parser takes about 1100ms for HiveQL parser and 800ms for Calcite parser. I am not sure if it's normal for Java command-line programs. If it is not normal, we'd better do some profiling on the Java program. @Yancey1989 @weiguoz Do you have experience in profiling?

You can reproduce the result the following command.

$ echo "select 1" > /tmp/input.sql
$ time java -cp /opt/sqlflow/parser/parser-1.0-SNAPSHOT-jar-with-dependencies.jar org.sqlflow.parser.ParserAdaptorCmd -p calcite -i /tmp/input.sql -o /tmp/output.json

real	0m0.922s
user	0m0.690s
sys	0m0.530s

I think the time is mainly costed in the startup phase after running some tests on my mac.
CalciteParserAdaptor.ParseAndSplit(string sql)

# (test cases) costs: millisecond
30,000 16,253
3,000 4,576

https://github.com/weiguoz/sqlflow/blob/testParsePerf/java/parser/src/test/java/org/sqlflow/parser/CalciteParserAdaptorTest.java#L118-L120

To avoid startup the slow JVM:

  1. Calling by gRPC
  2. Java Native Interface Golang Interface -- I didn't take a try.

@weiguoz weiguoz self-requested a review November 12, 2019 11:32
Copy link
Collaborator

@weiguoz weiguoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Let's improve the calcite calling performance in the next PR

@tonyyang-svail
Copy link
Collaborator Author

I think the time is mainly costed in the startup phase after running some tests on my mac.
CalciteParserAdaptor.ParseAndSplit(string sql)

@weiguoz Wooow, you are awesome!

@tonyyang-svail tonyyang-svail merged commit f3e2456 into sql-machine-learning:develop Nov 12, 2019
@tonyyang-svail tonyyang-svail deleted the integrate_java_parser branch November 12, 2019 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants