aroch/protobuf-dataframe

A package that lets you run PySpark SQL on your Protobuf data

Installation

  • (Optional) Create and activate a virtual environment (via PyCharm or the command line)
  • Install the package with pip:

pip install protodf

Features

  • Convert a Protobuf descriptor into a Spark schema:
from protodf import schema_for

# message_type is your generated Protobuf message class
schema = schema_for(message_type().DESCRIPTOR)
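For instance, assuming a hypothetical protoc-generated module my_events_pb2 with a message MyEvent (names are illustrative, not part of the package), you could inspect the resulting schema like this:

from protodf import schema_for
from my_events_pb2 import MyEvent  # hypothetical protoc-generated module

schema = schema_for(MyEvent().DESCRIPTOR)
print(schema)  # the Spark StructType mirroring MyEvent's fields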
  • Use the created schema to transform a Protobuf message from bytes into a Spark Row:

First, create a function that deserializes your message type into a Row:

from protodf import message_to_row

def specific_message_bytes_to_row(pb_bytes):
    # import your generated Protobuf message class (message_type) here
    msg = message_type.FromString(pb_bytes)
    row = message_to_row(message_type().DESCRIPTOR, msg)
    return row

Turn it into a UDF:

from pyspark.sql.functions import col, udf

specific_message_bytes_to_row_udf = udf(specific_message_bytes_to_row, schema)

Use the UDF:

df = df.withColumn("event", specific_message_bytes_to_row_udf(col("value")))

Now you can query your Protobuf data with regular Spark SQL. Nested messages, repeated fields, and so on are all supported!

df.select("event.field_name", "event.nested_message.field")

Explore the example

The main.py file contains an example usage of the package.
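For reference, here is a minimal end-to-end sketch of the flow described above. It assumes a hypothetical protoc-generated module my_events_pb2 with a message MyEvent that has a string field field_name; only schema_for and message_to_row come from the package itself:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from protodf import message_to_row, schema_for
from my_events_pb2 import MyEvent  # hypothetical protoc-generated module

spark = SparkSession.builder.appName("protodf-demo").getOrCreate()
schema = schema_for(MyEvent().DESCRIPTOR)

def my_event_bytes_to_row(pb_bytes):
    # deserialize the raw bytes and convert the message into a Spark Row
    return message_to_row(MyEvent().DESCRIPTOR, MyEvent.FromString(bytes(pb_bytes)))

my_event_bytes_to_row_udf = udf(my_event_bytes_to_row, schema)

# A one-row DataFrame with a binary column, similar to what Kafka would provide
event = MyEvent(field_name="hello")
df = spark.createDataFrame([(bytearray(event.SerializeToString()),)], ["value"])

df = df.withColumn("event", my_event_bytes_to_row_udf(col("value")))
df.select("event.field_name").show()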