Apache Spark as a Service with Apache Livy Client

livyc

Apache Livy Client

Install library

pip install livyc

Import library

from livyc import livyc

Setting livy configuration

data_livy = {
    "livy_server_url": "localhost",
    "port": "8998",
    "jars": ["org.postgresql:postgresql:42.3.1"]
}
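Under the hood, these settings describe Livy's REST endpoint and the session payload. A minimal sketch, assuming the Apache Livy REST API (`POST /sessions` with a `"kind"` field) and mapping the Maven coordinates in `jars` to the `spark.jars.packages` Spark conf; the helper names `livy_base_url` and `session_payload` are hypothetical, not part of livyc:

```python
data_livy = {
    "livy_server_url": "localhost",
    "port": "8998",
    "jars": ["org.postgresql:postgresql:42.3.1"]
}

def livy_base_url(cfg):
    # Base URL of the Livy REST API built from the settings above.
    return "http://{}:{}".format(cfg["livy_server_url"], cfg["port"])

def session_payload(cfg):
    # One plausible POST /sessions payload: a PySpark session that pulls
    # the listed Maven packages onto the Spark classpath.
    return {
        "kind": "pyspark",
        "conf": {"spark.jars.packages": ",".join(cfg["jars"])},
    }

print(livy_base_url(data_livy))
print(session_payload(data_livy)["kind"])
```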

Let's try launching a PySpark script on the Apache Livy server

params = {"host": "localhost", "port":"5432", "database": "db", "table":"staging", "user": "postgres", "password": "pg12345"}
pyspark_script = """

    from pyspark.sql.functions import udf, col, explode
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
    from pyspark.sql import Row
    from pyspark.sql import SparkSession


    df = spark.read.format("jdbc") \
        .option("url", "jdbc:postgresql://{host}:{port}/{database}") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "{table}") \
        .option("user", "{user}") \
        .option("password", "{password}") \
        .load()
        
    n_rows = df.count()

    spark.stop()
"""
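The `{host}`, `{port}`, `{database}`, `{table}`, `{user}` and `{password}` placeholders in the script are ordinary `str.format` fields, filled in from `params` before submission. A quick local check of the substitution (the `template` variable here is just an excerpt for illustration):

```python
params = {"host": "localhost", "port": "5432", "database": "db",
          "table": "staging", "user": "postgres", "password": "pg12345"}

# Excerpt of the script's JDBC line, to verify the placeholder substitution.
template = '.option("url", "jdbc:postgresql://{host}:{port}/{database}")'

print(template.format(**params))
```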

Creating a livyc object

lvy = livyc.LivyC(data_livy)

Creating a new session on the Apache Livy server

session = lvy.create_session()

Send and execute the script on the Apache Livy server

lvy.run_script(session, pyspark_script.format(**params))
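In terms of the Livy REST API, running a script amounts to posting the code as a statement and polling until its state becomes `"available"`. A hedged sketch of that flow using only the standard library; livyc's actual implementation may differ, and the `statements_url`, `submit_statement` and `wait_for_statement` helpers are hypothetical:

```python
import json
import time
import urllib.request

def statements_url(base, session_id):
    # Livy's statements endpoint for a given session.
    return "{}/sessions/{}/statements".format(base, session_id)

def submit_statement(base, session_id, code):
    # POST the code as a statement; the response carries the statement id.
    req = urllib.request.Request(
        statements_url(base, session_id),
        data=json.dumps({"code": code}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_statement(base, session_id, stmt_id, interval=1.0):
    # Poll the statement until Livy reports its output is available.
    url = "{}/{}".format(statements_url(base, session_id), stmt_id)
    while True:
        with urllib.request.urlopen(url) as resp:
            stmt = json.load(resp)
        if stmt["state"] == "available":
            return stmt["output"]
        time.sleep(interval)

print(statements_url("http://localhost:8998", 0))
```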

Accessing the variable "n_rows" available in the session

lvy.read_variable(session, "n_rows")
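Reading a session variable can be understood as running a statement that evaluates the name and parsing the statement output; the `output["data"]["text/plain"]` path is the shape the Livy REST API uses for plain-text results. A small sketch with a hypothetical `parse_statement_output` helper and a hand-written sample response:

```python
def parse_statement_output(output):
    # Extract the plain-text result from a Livy statement output object.
    return output["data"]["text/plain"]

# Sample of what Livy returns after evaluating "n_rows" (illustrative values).
sample = {"status": "ok", "data": {"text/plain": "150"}}

print(parse_statement_output(sample))
```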

Contributing and Feedback

Any ideas or feedback about this repository? Help me to improve it.

Authors

License

This project is licensed under the terms of the MIT License.