# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['gribtoarrow']

package_data = \
{'': ['*']}

install_requires = \
['cmake>=3.28.1,<4.0.0',
 'pyarrow>=14.0.2,<15.0.0',
 'pybind11-stubgen>=2.4.2,<3.0.0',
 'pybind11>=2.11.1,<3.0.0']

setup_kwargs = {
    'name': 'gribtoarrow',
    'version': '0.1.9',
    'description': 'gribtoarrow is a python module to export data in the GRIB format to Apache Arrow',
    'long_description': '## Goal\n\nGribToArrow is a C / C++ project which uses C++20 and creates a python module to simplify working with files in the GRIB format \n(GRIdded Binary or General Regularly-distributed Information in Binary form).\n\nUnder the hood it uses the ECMWF eccodes library. This is wrapped in C++ and exposed to python via the pybind11 library and Apache Arrow. If\nyou want a simple pythonic way to interact with GRIB data then give this module a try.\n\nHaving worked with meteorological data using the ECMWF tooling for a while I became familiar with the structure of GRIB files and many of the \ncommand line tools provided by ECMWF. However extracting the data was a pain due to having to rely on legacy C code, written by someone who\nhad left the company a while ago, which was poorly structured and lacking in tests. In addition the existing codebase was inflexible and \nrequired preprocessing and lots of glue at the shell level, meaning the main logic couldn\'t be tested particularly well due to\nmissing tools on the CI/CD servers and a black box executable program.\n\nGribToArrow was created to overcome these problems. It can be installed using CMAKE or in a more pythonic way using poetry.\n\nGribToArrow aims to abstract away the low level detail and create a python binding which exposes the data in Arrow format. The Apache Arrow \nformat is rapidly becoming a key component of most modern data ecosystems. Exposing the GRIB data in a modern column-based format allows for \nrapid high level development. Operations such as filtering, calculations, joining, aggregating, renaming, projecting, transposing and \nsaving to files / databases become a breeze via the ease of integrating with high level libraries such as polars, pandas and duckdb. \nAdditionally, due to the way Apache Arrow is designed, the integration is typically zero copy, meaning data can be passed to \nany tool which can read Apache Arrow at no cost.\n\nWhat does this mean in reality? It means you can mix and match tools. If you start using this library with polars but find some functionality\nis missing, such as geospatial functions, you can keep the existing logic in polars and pass the dataframe to a tool such as duckdb to perform\nthe geospatial elements of your processing and then pass this back to polars if required (at the time of writing geoparquet is a work in \nprogress and once this is completed work on geopolars will commence; however duckdb has already integrated many of the geospatial functions from postGIS).\n\n\nThe python module comprises the following:\n\ngribtoarrow (Class)\n This class is equivalent to a reader object; it implements an iterator to simplify data access.\n A fluent API is provided to allow for functionality such as adding locations of interest based on latitude / longitude. \n\ngribmessage (Class)\n This is an object which will be returned by each iteration of the iterator.\n This exposes methods such as getData(), getDataWithLocations() and many methods to get attribute values such as paramId, shortName etc.\n\n\nA sample usage of the library in python is given below. 
In this code a config CSV is read with polars.\nThe CSV contains a list of latitudes / longitudes where we want to know the nearest equivalent values for those locations in the GRIB file.\ne.g. This might be a list of all the major world cities.\nThe polars table is converted to arrow and passed to GribToArrow, which returns our reader / iterator object.\nNext a simple list comprehension is used to extract all the details from every message and the results are saved to a parquet file.\nAs can be seen a lot of work was accomplished in just 14 lines of python. In addition to the low amount of code required\nwe also benefit from quick performance. \n\n import polars as pl\n from gribtoarrow import GribToArrow\n\n arrow_locations = (\n     pl.read_csv("/Users/hugo/Development/cpp/grib_to_arrow/locastions.csv", has_header=False)\n     .with_columns([pl.col(\'column_7\').alias(\'lat\'), pl.col(\'column_8\').alias(\'lon\')])\n ).to_arrow()\n\n reader = (\n     GribToArrow("/Users/hugo/Development/cpp/grib_to_arrow/big.grib")\n     .withLocations(arrow_locations)\n )\n data = [pl.from_arrow(message.getDataWithLocations()) for message in reader]\n df = pl.concat(data)\n print(f"done all data extracted {len(df)} rows from grib")\n df.write_parquet("/Users/hugo/Development/cpp/grib_to_arrow/output.parquet")\n\n## Performance\nThe module is fast since it operates entirely in memory. In addition it releases the GIL to allow python threading. Currently it doesn\'t\nuse threading in the C++ layer; this is because the author created the project on OSX and the default compiler on OSX doesn\'t include OMP.\nThis could be added if required, although for the reasons detailed below it may not be needed.\nSince the main usage is from python it is anticipated that just releasing the GIL will be sufficient.\nIn addition, since everything is extracted in memory and made available to Arrow, and hence to the vast ecosystem of tools such as polars,\npandas and duckdb, multiprocessing and partitioning of parquet files can be utilised to achieve a high degree of parallelism.\nA test on a 2023 MacBook Pro extracted 230 million rows from a concatenated grib and wrote this to a parquet file in 6 seconds.\n\n## Core functionality\n\nThe main entry point is the GribReader class, which takes a string path to a grib file in the constructor.\n\nIn addition GribReader has a fluent API which includes the following methods:\n\n- withLocations -> Pass an arrow table which includes the columns "lat" and "lon" to this function and the results will be filtered to the \nnearest locations based on the provided co-ordinates. e.g. You might have a grib file at 0.5 resolution for every location on earth. Logically a lot\nof those locations will be at sea, so you could use this facility and specify a list of latitudes and longitudes to restrict the number of results\nreturned.\n\n- withConversions -> Pass an arrow table with columns "parameterId", "addition_value", "subtraction_value", "multiplication_value", "division_value".\nThe values will be used to perform computations on the data. e.g. The underlying grib might contain a parameter where the data is in Kelvin but \nyou want the values to be in Celsius. Passing a config table with these values will enable the conversions to be performed early in the data \npipeline using the Apache Arrow Compute Kernel / module. 
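A minimal sketch of such a conversions table is given below. This is an illustrative assumption based on the column names above rather than an example taken from the project: the paramId, the grib path and the exact compute semantics (e.g. whether subtraction_value is subtracted from the raw values) are assumptions.\n\n import polars as pl\n from gribtoarrow import GribToArrow\n\n # Assumed example: convert paramId 167 (2 metre temperature) from Kelvin to Celsius.\n conversions = pl.DataFrame(\n     {\n         "parameterId": [167],\n         "addition_value": [None],\n         "subtraction_value": [273.15],  # assumed to be subtracted from the raw values\n         "multiplication_value": [None],\n         "division_value": [None],\n     }\n ).to_arrow()\n\n reader = (\n     GribToArrow("/path/to/some.grib")  # hypothetical path\n     .withConversions(conversions)\n )\n\n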
See the tests folder for examples.\n\nGribReader is iterable so it can be used in any for loop / generator / list comprehension etc.\nEach iteration of the reader will return a GribMessage. \n\nGribMessage also provides methods to get attribute-based fields and the data.\n\n## Creating the Python module\n\nAt the time of writing the project can be built in two ways.\n\n- CMAKE. The paths are currently hardcoded into the CMAKE file so you will need to amend the paths to match the location of the libraries on \nyour system. It should be noted that after you have installed arrow and pyarrow you can find the include and library paths using these commands:\npyarrow.get_library_dirs() and pyarrow.get_include()\n\n- Poetry. A pyproject.toml file is present in the repo which simplifies the build.\n Simply execute the commands below\n\n\n poetry build\n poetry install\n\n### Install Dependencies - These will depend on whether you are building the project with CMAKE or poetry.\n\n- Install ECCODES -> build from source\n- Install arrow -> use Homebrew on OSX (we need both arrow and pyarrow)\n\nIf using cmake you will also need to install the following\n\n- pip install pyarrow (or use a venv but remember to activate it when testing)\n- pip install polars (if you want to run the samples / tests). At the lowest level you can interact with the results using pyarrow or any \ntools which can work with the Apache Arrow ecosystem e.g. Pandas, Polars, Duckdb, Vaex etc.\n\nIf you are using poetry this will be taken care of for you.\n\n### Clone this project\n\nClone this project using git\n\nThen cd into the folder and clone pybind11 at the root level of the folder.\n\n### Clone pybind11\n\nThe module is created using pybind11. Rather than adding this as source to this repository you should instead clone the latest version into this \nrepo. This can be done with a command similar to the one below; note it might not be exactly this command, git might suggest using a submodule \nor similar, so follow the git recommendation.\n\nIn the project folder, clone pybind11\n\ngit clone https://github.com/pybind/pybind11.git\n\nLook at the .gitignore file; the pybind11 repo is ignored as we just read it.\n\n### Compile\n\nThe recommended way to create the project is to use poetry.\n\nUse an IDE / plugin such as visual studio code \n\nOr...\n\nOpen a terminal and cd into the project folder then run\n\nmkdir build\ncd build\ncmake ..\nmake\n\nIn the build directory, you should have a compiled module with a name similar to:\n\ngribtoarrow.cpython-312-x86_64-linux-gnu.so\n\nor on OSX\n\ngribtoarrow.cpython-312-darwin.so\n\nwhere 312 is your python version (in the case above the author is running python 3.12).\n\n### Run\n\nIf you have used poetry then a wheel will be built which contains gribtoarrow and eccodes.\nThe resulting ELF files on linux should be linked correctly and have the rpath set so there is no need to set LD_LIBRARY_PATH.\n\nOSX should also work the same way.\n\nIf you have any issues shout out.\n\nNote it might be necessary to import pyarrow prior to gribtoarrow\n\ne.g. 
You might need to\n\nimport pyarrow\nimport gribtoarrow\n\nThis will be necessary if Poetry built the project with a different version of arrow / pyarrow than is normally installed.\n\nIf this is the case you will see something like the below\n\n>>> import gribtoarrow\nTraceback (most recent call last):\n File "<stdin>", line 1, in <module>\nImportError: dlopen(/Users/hugo/Development/cpp/grib_to_arrow/dist/temp/gribtoarrow.cpython-312-darwin.so, 0x0002): Library not loaded: @rpath/libarrow_python.dylib\n\nThis is because arrow and pyarrow are needed in the build and are linked against. However poetry uses a venv, so gribtoarrow is built\nagainst a version in a poetry venv which can\'t later be found.\n\nI might be able to play with @rpaths / @rpath-link but for now just import the system installed pyarrow first.\n\n## Poetry Building\n\nThe poetry build performs the following steps:\n\n- Defines cmake as a dependency (this is required to build eccodes and may not be installed on the system in question)\n- Downloads eccodes from ECMWF\n- Compiles ECCODES (into a folder called temp_eccodes)\n- Compiles this project using PyBind11\n\n Note the following (the build of this project is done in build.py):\n In build.py we set rpath. This is the runtime path which is baked into the shared object and which will be used to look \n for any dependencies. The rpath is set to a folder called "eccodes" which should be in the same location as the gribtoarrow shared object.\n The eccodes libraries are copied into a folder in "dist".\n A folder called "lib" is also created in dist into which eccodes_memfs is copied.\n This is done so that the wheel is bundled with all of its dependencies and everything can be found without the need \n to specify environment variables such as LD_LIBRARY_PATH.\n\nA visual representation of the files inside the wheel is given below\n\n .\n ├── eccodes\n │   ├── libeccodes.dylib\n │   └── libeccodes_memfs.dylib\n ├── gribtoarrow.cpython-312-darwin.so\n └── lib\n     └── libeccodes_memfs.dylib\n\nAs can be seen the main dynamic library gribtoarrow.cpython-312-darwin.so (it will have a different name on linux) is at the root level.\nIn order for gribtoarrow to be able to use libeccodes, -Wl,-rpath,$ORIGIN/eccodes is set in the linker.\n$ORIGIN is a Linux-specific option and basically means the current location (which when installed will be part of site-packages).\nSo in effect gribtoarrow looks in a child folder called eccodes for libeccodes. \nNote how libeccodes_memfs is present twice and also in a location called "lib".\nThis isn\'t documented on ECMWF\'s page and it appears that libeccodes either has its own rpath or looks in some locations, one of which\nis ../lib (this was found via the use of strace).\n\nIn the case of OSX the logic and the wheel are basically the same, except rather than using $ORIGIN in the rpath, @loader_path is\nused instead.\n\n\n## Documentation\n\nDoc strings have been added to grib_to_arrow and it should be possible to generate documentation using Sphinx.\n',
    'author': 'Hugo Pendlebury',
    'author_email': 'None',
    'maintainer': 'None',
    'maintainer_email': 'None',
    'url': 'None',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'python_requires': '>=3.9,<4.0',
}
from build import *
build(setup_kwargs)

setup(**setup_kwargs)