Skip to content
forked from lelit/pglast

PostgreSQL Languages AST and statements prettifier: master branch covers PG10, v2 branch covers PG12, v3 covers PG13

Notifications You must be signed in to change notification settings

jaylmiller/pglast

 
 

Repository files navigation

pglast

PostgreSQL Languages AST and statements prettifier

Author: Lele Gaifax
Contact: [email protected]
License:GNU General Public License version 3 or later
Status: Build status Documentation status
Version: 3

This is a Python 3 module that exposes the parse tree of a PostgreSQL statement (extracted by the almost standard PG parser repackaged as a standalone static library by libpg_query) as set of interconnected nodes, usually called an abstract syntax tree.

Introduction

At the lower level the module exposes several libpg_query functions, parse_sql_json(), parse_sql_protobuf(), parse_plpgsql_json(), fingerprint(), scan, split() and deparse_protobuf(); the first two take an SQL statement and return the correspondent parse tree respectively as a JSON encoded value and a Protobuf encoded value; the third function takes a PLpgSQL statement and returns the parse tree as JSON, the fourth returns a sequence of tokens that compose a SQL statement, the fifth returns a signature of the given statement, the sixth returns a sequence of the single statements and the last one accepts a Protobuf-serialized statement and reproduce the original SQL statement.

One more function, parse_sql(), returns the syntax tree represented by a hierarchy of instances of the classes implemented in the pglast.ast module.

At a higher level that tree is represented by three Python classes, a Node that represents a single node, a List that wraps a sequence of nodes and a Scalar for plain values such a strings, integers, booleans or none.

Every node is identified by a tag, a string label that characterizes its content that is exposed as a set of attributes as well as with a dictionary-like interface (technically they implements both a __getattr__ method and a __getitem__ method). When asked for an attribute, the node returns an instance of the base classes, i.e. another Node, or a List or a Scalar, depending on the data type of that item. When the node does not contain the requested attribute it returns a singleton Missing marker instance.

A List wraps a plain Python list and may contains a sequence of Node instances, or in some cases other sub-lists, that can be accessed with the usual syntax, or iterated.

Finally, a Scalar carries a single value of some type, accessible through its value attribute.

On top of that, the module implements two serializations, one that transforms a Node into a raw textual representation and another that returns a prettified representation. The latter is exposed by the pgpp CLI tool, see the Usage section in the documentation for an example.

Installation

As usual, the easiest way is with pip:

$ pip install pglast

Alternatively you can clone the repository:

$ git clone https://github.com/lelit/pglast.git --recursive

and install from there:

$ pip install ./pglast

Development

There is a set of makefiles implementing the most common operations, a make help will show a brief table of contents. A comprehensive test suite, based on pytest, covers 98% of the source lines.

Documentation

Latest documentation is hosted by Read the Docs at https://pglast.readthedocs.io/en/latest/

History

Version 1

I needed a better SQL reformatter than the one implemented by sqlparse, and was annoyed by a few glitches (subselects in particular) that ruins the otherwise excellent job it does, considering that it is a generic library that tries to swallow many different SQL dialects.

When I found psqlparse I decided to try implementing a PostgreSQL focused tool: at the beginning it's been easier than I feared, but I quickly hit some shortcomings in that implementation, so I opted for writing my own solution restarting from scratch, with the following goals:

  • target only Python 3.4+
  • target PostgreSQL 10+
  • use a more dynamic approach to represent the parse tree, with a twofold advantage:
    1. it is much less boring to code, because there's no need to write one Python class for each PostgreSQL node tag
    2. the representation is version agnostic, it can be adapted to newer/older Elephants in a snap
  • allow exploration of parse tree in both directions, because I realized that some kinds of nodes require that knowledge to determine their textual representation
  • avoid introducing arbitrary renames of tags and attributes, so what you read in PostgreSQL documentation/sources is available without the hassle of guessing how a symbol has been mapped
  • use a zero copy approach, keeping the original parse tree returned from the underlying libpg_query functions and have each node just borrow a reference to its own subtree

Version 2

In late 2019, Ronan Dunklau opened PR #62 against libpg_query, that reimplemented the build machinery of the library to make it easier (read, semi-automatic) to support PostgreSQL 12, and PR #36 to bring pglast in line.

Since that version of PostgreSQL inevitably introduced some backward incompatibilities, I bumped the major version of pglast to better reflect the fact.

As I'm writing this, the fate of PR #62 is still unclear, so for the time being I switched the libpg_query submodule to Ronan's fork.

I'm going to keep version 1 aligned to the original Lukas' PG 10 branch.

Important

This version requires Python 3.6 or greater, due to usage of f-strings.

Version 3

In early 2021, Lukas put a considerable effort into evolving his library to target PostgreSQL 13. He introduced a richer protobuf-based AST serialization protocol, rewriting the underlying machinery so that the same code is used to generate either a JSON or a protobuf stream.

The approach has obvious advantages, but unfortunately both formats come with different shortcomings, and I was not able to adapt pglast. The JSON serialization has changed in a way that it not anymore sufficient to rebuild the original AST because some attributes now carry an implicit structure, that requires additional information to understand the content (see issue #82). OTOH, the Protobuf format is clumsy, at least on the Python side: the Google's compiler creates a huge and unreadable module, while other implementations (see pyrobuf, cprotobuf and betterproto) suffer of different issues (see issue #210).

After several attempts, I decided to follow a more rewarding way and implement a native Python wrapper layer on top of PG parser's nodes.

About

PostgreSQL Languages AST and statements prettifier: master branch covers PG10, v2 branch covers PG12, v3 covers PG13

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 99.1%
  • Other 0.9%