In a nutshell, orchex is a library for orchestrating data workflows, including hierarchical extraction, transformation with pseudonymisation, automated documentation, and secure sharing mechanisms.
For a closer look, explore the core module's primary code in orchex/dataextract.py, where you'll find the implementation of the main data classes: DataSource and DataExtract.
-
DataSource
This class contains several methods to facilitate data extraction from a data source and create a dataframe object.
Supported data sources:
- SQL code
- SQL file
- Table Storage database
- csv file
-
DataExtract
This class allows the user to combine multiple DataSource objects into a single entity, so that the same operation, such as pseudonymisation, can be applied to several different DataSources at once.
The data from the DataExtract will be stored in the following file structure:
{name}-{YYYYmmDDHHMM}-{id}
├── {name}-{YYYYmmDDHHMM}-{id}-PRIVATE.pkl
├── {name}-{YYYYmmDDHHMM}-{id}-PUBLIC/
│   ├── data
│   │   └── {data_source_name}.csv
│   ├── img
│   ├── docs
│   ├── README.md
│   └── img
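As a rough illustration of the naming scheme above, the following sketch builds the paths for one extract using only the standard library. The function name `extract_paths`, the `root` argument, and the example id `abc123` are assumptions made for this example; they are not part of orchex, and the real id format is not documented here.

```python
from datetime import datetime
from pathlib import Path


def extract_paths(name: str, extract_id: str, created: datetime,
                  root: Path = Path("data")) -> dict[str, Path]:
    """Build the paths of the documented layout for one extract (hypothetical helper)."""
    # {name}-{YYYYmmDDHHMM}-{id}: strftime renders the timestamp part.
    stem = f"{name}-{created.strftime('%Y%m%d%H%M')}-{extract_id}"
    base = root / stem
    public = base / f"{stem}-PUBLIC"
    return {
        "base": base,
        "private_pickle": base / f"{stem}-PRIVATE.pkl",
        "public": public,
        "data": public / "data",
        "img": public / "img",
        "docs": public / "docs",
        "readme": public / "README.md",
    }


# "abc123" is a made-up id used only for illustration.
paths = extract_paths("model-agnostic-data-extract", "abc123", datetime(2024, 1, 31, 9, 5))
print(paths["private_pickle"].as_posix())
```

Keeping the timestamp in the folder name means two extracts with the same name sort chronologically on disk.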
This class allows for 3 different ways of saving the data:
- save(): saves a .pkl file of the class. Recommended for personal use, NOT for sharing data.
- export(): creates pseudonymised .csv files. The best way to share data.
- archive(): creates a .zip file with all the created folders and uploads them to Azure Blob Storage.
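To make the difference between the three options concrete, here is a minimal standard-library sketch of the same ideas: a private pickle, a pseudonymised CSV, and a zip archive. It is not how orchex implements save(), export(), or archive(); the salted SHA-256 pseudonymisation, the function signatures, and the sample rows are all assumptions for illustration, and the Azure upload step is left out.

```python
import csv
import hashlib
import pickle
import shutil
import tempfile
from pathlib import Path


def pseudonymise(value: str, salt: str) -> str:
    """Replace an identifier with a truncated salted SHA-256 digest (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def save(obj: object, path: Path) -> Path:
    """Private copy: pickle the whole object, identifiers included."""
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path


def export(rows: list[dict], id_column: str, salt: str, path: Path) -> Path:
    """Shareable copy: write a CSV with the id column pseudonymised."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        for row in rows:
            writer.writerow({**row, id_column: pseudonymise(str(row[id_column]), salt)})
    return path


def archive(folder: Path, target_stem: Path) -> Path:
    """Zip a folder; uploading the result is out of scope for this sketch."""
    return Path(shutil.make_archive(str(target_stem), "zip", root_dir=folder))


# Made-up sample data, written to a temporary directory.
rows = [{"user_id": "42", "score": "0.9"}, {"user_id": "43", "score": "0.4"}]
with tempfile.TemporaryDirectory() as tmp:
    public = Path(tmp) / "extract-PUBLIC"
    public.mkdir()
    save(rows, Path(tmp) / "extract-PRIVATE.pkl")
    export(rows, "user_id", "not-a-real-salt", public / "answers.csv")
    print(archive(public, Path(tmp) / "extract-PUBLIC").name)
```

The pickle keeps the raw identifiers, which is why it suits personal use only, while the CSV is the form intended for sharing.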
-
📝 NOTE: In both classes there is the functionality to create a markdown report with all the class info.
- Python version 3.12^ is required.
- If you wish to create DataSources and/or DataExtracts using SQL code, ODBC drivers must be installed. Please follow the instructions on the following page based on your OS (v17+ is recommended):
- To extract data from the database, some Azure-specific variables must be stored in a .env file. If you don't have this information, please contact Simon.
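Before running an extraction, it can help to check that the .env file is present and complete. The sketch below parses simple KEY=VALUE lines with the standard library and reports what is missing. The variable names in `REQUIRED` are placeholders, since the real names are not listed in this README, and both helper functions are hypothetical rather than part of orchex.

```python
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_variables(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Placeholder names only: ask for the real list of required variables.
REQUIRED = ["EXAMPLE_AZURE_VAR_A", "EXAMPLE_AZURE_VAR_B"]

env_path = Path(".env")
if env_path.exists():
    print("Missing:", missing_variables(read_env_file(env_path), REQUIRED))
else:
    print("No .env file found in", Path.cwd())
```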
orchex uses poetry (do not use pip or conda).
To create the environment:
-
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry install
# to activate the env
poetry shell
-
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry config --local installer.no-binary pyodbc
poetry install
# to activate the env
poetry shell
-
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring
poetry env use 3.12
poetry config virtualenvs.in-project true
poetry install
# to activate the env
poetry shell
❗ NOTE: if you get the following error
This error originates from the build backend, and is likely not a problem with poetry but with multidict (6.0.4) not supporting PEP 517 builds. You can verify this by running 'pip wheel --use-pep517 "multidict (==6.0.4)"'.
Run:
poetry shell
pip install --upgrade pip
MULTIDICT_NO_EXTENSIONS=1 pip install multidict
poetry add inflect
poetry add pyodbc
# if packages are not reinstalled then run:
poetry update
Example run, where foo is a function:
from orchex.dataextract import DataExtract

# Create the extract that will hold the data sources.
data_extract = DataExtract(
    name="model-agnostic-data-extract",
    description="""A model-agnostic extract of Eedi data.""",
    container_path="data",
)

topic_pathway_collection_ids = (4, 5, 6, 7, 9, 10, 11)

# foo is a function defined elsewhere.
answers_ds = data_extract.get_or_set_data_source(
    "answers",
    foo,
    topic_pathway_collection_ids=topic_pathway_collection_ids,
)
print(answers_ds.head())
Previously we would have installed the package globally using pip install -e . but with poetry you simply add a dependency on the local package.
-
Clone the repository:
git clone [email protected]:Eedi/orchex.git
-
In your other repository, add the following to the pyproject.toml:
orchex = {path = <path-to-orchex>, develop = true}
Example: orchex was cloned in the parent directory of the current project.
orchex = {path = "../orchex", develop = true}
The develop flag should mean that your installation will be automatically updated when orchex is edited.
-
Some environment variables (.env and .sheets) are required for some components. Contact Simon for details.
-
You can now import this package:
from orchex.dataextract import DataExtract, DataSource
-
If you then update this package, it should update automatically (if develop = true). If this does not happen, you should be able to just run poetry update orchex, but you may need to reinstall your poetry environment. To do so:
- Close any IDEs (e.g. VS Code) that might be using the environment. (Otherwise the following will fail.)
- Run poetry env list to get the name of the environment.
- Remove the environment: poetry env remove orchex-fYa19ibp-py3.12
- Go and delete the environment folder (e.g. E:\packages\poetry\virtualenvs). This is necessary; otherwise the next step will just reinstall some cached versions.
- Reinstall: poetry install
- In VS Code you may need to manually select the new environment: Ctrl-Shift-P, then click Enter interpreter path...
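One way to check whether the environment is picking up your local clone is to ask Python where it would import orchex from. This is a small sketch using the standard library; the helper `package_location` is hypothetical. The expectation that a develop install resolves to the clone's directory, not a copy inside the virtualenv, is an assumption about how poetry installs path dependencies.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_location(name: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


location = package_location("orchex")
if location is None:
    print("orchex is not importable in this environment")
else:
    print(f"orchex imports from {location}")
```

If the printed path is inside the virtualenv and not in your clone, edits to the clone will not be visible until the environment is reinstalled.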