Eedi/orchex


Table of Contents

  1. Overview 📖
  2. Setup 🧑‍🔬
  3. Run 🏃
  4. Using orchex in other repositories

1. Overview 📖

In short, orchex is a library for orchestrating data workflows, including hierarchical extraction, transformation with pseudonymisation, automated documentation, and secure sharing mechanisms.

For a closer look, explore the core module at orchex/dataextract.py, which implements the two main data classes: DataSource and DataExtract.

  • DataSource

    This class contains several methods to facilitate extracting data from a data source into a dataframe object.

    Supported data sources:

    • SQL code
    • SQL file
    • Table Storage database
    • csv file
  • DataExtract

    This class lets the user combine multiple DataSource objects into a single entity, so that the same operation (such as pseudonymisation) can be applied to every DataSource at once.

    The data from the DataExtract will be stored in the following file structure:

    {name}-{YYYYmmDDHHMM}-{id}
    ├── {name}-{YYYYmmDDHHMM}-{id}-PRIVATE.pkl
    └── {name}-{YYYYmmDDHHMM}-{id}-PUBLIC/
        ├── data
        │   └── {data_source_name}.csv
        ├── img
        └── docs
            ├── README.md
            └── img
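The naming scheme above can be sketched with the standard library. This is only an illustration, not orchex's implementation: the helper names `extract_root_name` and `create_layout`, and the example id `abc123`, are hypothetical, and the timestamp format is assumed to be `%Y%m%d%H%M`.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def extract_root_name(name: str, extract_id: str, created: datetime) -> str:
    """Build a '{name}-{YYYYmmDDHHMM}-{id}' folder name."""
    return f"{name}-{created.strftime('%Y%m%d%H%M')}-{extract_id}"


def create_layout(container: Path, name: str, extract_id: str, created: datetime) -> Path:
    """Create the empty PUBLIC folder tree and return the extract root."""
    root_name = extract_root_name(name, extract_id, created)
    root = container / root_name
    public = root / f"{root_name}-PUBLIC"
    # data/, img/ and docs/img/ mirror the tree shown above.
    for sub in ("data", "img", "docs/img"):
        (public / sub).mkdir(parents=True, exist_ok=True)
    return root


with tempfile.TemporaryDirectory() as tmp:
    root = create_layout(Path(tmp), "answers", "abc123", datetime(2024, 1, 31, 9, 5))
    print(root.name)  # answers-202401310905-abc123
```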
    

    This class allows for 3 different ways of saving the data:

    • save(): saves a .pkl file of the class. Recommended for personal use, NOT for sharing data.

    • export(): creates pseudonymised .csv files. Best way to share data

    • archive(): creates a .zip file with all the created folders and uploads them to Azure Blob Storage.
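This README does not describe how orchex pseudonymises values, so the following is only a sketch of one common approach: keyed hashing with HMAC-SHA256. The function names, the `user_id` column, and the secret key are all hypothetical and are not part of the orchex API.

```python
import csv
import hashlib
import hmac
import io


def pseudonymise(value: str, secret_key: bytes, length: int = 16) -> str:
    """Map an identifier to a stable token (truncated HMAC-SHA256 hex digest)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymise_csv(text: str, column: str, secret_key: bytes) -> str:
    """Return CSV text with one column replaced by its pseudonyms."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = pseudonymise(row[column], secret_key)
        writer.writerow(row)
    return out.getvalue()


raw = "user_id,score\n101,0.5\n102,0.9\n101,0.7\n"
print(pseudonymise_csv(raw, "user_id", secret_key=b"hypothetical-secret"))
```

Because the hash is keyed, the same identifier always maps to the same token within one extract, while the original value cannot be recomputed without the key.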

📝 NOTE: Both classes can generate a markdown report containing all the class information.

2. Setup 🧑‍🔬

2.1 Prerequisites 📋

  • Python 🐍

    Python 3.12 or later is required.

  • ODBC Driver (if running SQL code) 💻

    If you wish to create DataSources and/or DataExtracts using SQL code, an ODBC driver must be installed. Please follow the instructions on the following page for your OS (v17+ is recommended):

  • .env file 📃

    To extract data from the database, some Azure-specific variables must be stored in a .env file. If you don't have this information, please contact Simon.
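Before running an extract, it can help to check that the .env file defines everything you expect. The sketch below uses only the standard library. The variable names `EXAMPLE_DB_SERVER` and `EXAMPLE_DB_NAME` are placeholders rather than the names orchex reads, and the parser handles only simple `KEY=VALUE` lines.

```python
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not env.get(name)]


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text('# placeholder values\nEXAMPLE_DB_SERVER="example-server"\n', encoding="utf-8")
    env = load_env_file(env_path)
    print(missing_variables(env, ["EXAMPLE_DB_SERVER", "EXAMPLE_DB_NAME"]))  # ['EXAMPLE_DB_NAME']
```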

2.2 Installation

Poetry

orchex uses Poetry (do not use pip or conda). To create the environment:

  • Windows

    poetry env use 3.12
    poetry config virtualenvs.in-project true
    poetry install
    
    # to activate the env
    poetry shell
  • MacOS

    poetry env use 3.12
    poetry config virtualenvs.in-project true
    
    poetry config --local installer.no-binary pyodbc
    
    poetry install
    
    # to activate the env
    poetry shell
  • Linux / Eedi VM

    export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring
    
    poetry env use 3.12
    poetry config virtualenvs.in-project true
    
    poetry install
    
    # to activate the env
    poetry shell

    NOTE: if you get the following error:

    This error originates from the build backend, and is likely not a problem with poetry but with multidict (6.0.4) not supporting PEP 517 builds. You can verify this by running 'pip wheel --use-pep517 "multidict (==6.0.4)"'.

    Run:

    poetry shell
    pip install --upgrade pip
    MULTIDICT_NO_EXTENSIONS=1 pip install multidict
    poetry add inflect
    poetry add pyodbc
    
    # if the packages are not reinstalled, run:
    poetry update

3. Run 🏃

Example run, where foo is a function:

from orchex.dataextract import DataExtract

data_extract = DataExtract(
    name="model-agnostic-data-extract",
    description="""A model-agnostic extract of Eedi data.""",
    container_path="data",
)

# foo is a user-supplied function (not defined in this snippet).
topic_pathway_collection_ids = (4, 5, 6, 7, 9, 10, 11)
answers_ds = data_extract.get_or_set_data_source(
    "answers",
    foo,
    topic_pathway_collection_ids=topic_pathway_collection_ids,
)
print(answers_ds.head())
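Judging by its name alone, get_or_set_data_source appears to return the data source stored under that name if one exists, and otherwise to build it by calling the supplied function with the extra keyword arguments. That reading is an assumption, not something this README confirms. The toy class below illustrates the get-or-set pattern only. `ExtractSketch` and this version of `foo` are hypothetical stand-ins, not orchex code.

```python
from typing import Any, Callable


class ExtractSketch:
    """Toy stand-in for DataExtract; not the orchex implementation."""

    def __init__(self) -> None:
        self._sources: dict[str, Any] = {}

    def get_or_set(self, name: str, loader: Callable[..., Any], **kwargs: Any) -> Any:
        # Reuse the stored result if this name was loaded before;
        # otherwise call the loader with the keyword arguments and store it.
        if name not in self._sources:
            self._sources[name] = loader(**kwargs)
        return self._sources[name]


calls: list[tuple[int, ...]] = []


def foo(topic_pathway_collection_ids: tuple[int, ...]) -> list[dict[str, int]]:
    """Hypothetical loader: one fake row per collection id."""
    calls.append(topic_pathway_collection_ids)
    return [{"collection_id": i} for i in topic_pathway_collection_ids]


extract = ExtractSketch()
first = extract.get_or_set("answers", foo, topic_pathway_collection_ids=(4, 5))
second = extract.get_or_set("answers", foo, topic_pathway_collection_ids=(4, 5))
print(first is second, len(calls))  # True 1
```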

4. Using orchex in other repositories

Previously we would have installed the package globally using pip install -e . With Poetry, you simply add a dependency on the local package.

  1. Clone the repository:

    git clone git@github.com:Eedi/orchex.git
  2. In your other repository, add the following to the pyproject.toml:

    orchex = {path = <path-to-orchex>, develop=true}

    Example: orchex was cloned in the parent directory of the current project.

    orchex = {path = "../orchex", develop = true}

    The develop flag should mean that your installation is updated automatically when orchex is edited.

  3. Some environment variables (.env and .sheets) are required for some components. Contact Simon for details.

  4. You can now import this package:

    from orchex.dataextract import DataExtract, DataSource
  5. If you then update this package, your installation should update automatically (if develop = true). If it does not, you should be able to just run poetry update orchex, but you may need to reinstall your Poetry environment. To do so:

    • Close any IDEs (e.g. VS Code) that might be using the environment. (Otherwise the following steps will fail.)
    • Run poetry env list to get the name of the environment.
    • Remove the environment, substituting the name from the previous step: poetry env remove orchex-fYa19ibp-py3.12
    • Delete the environment folder itself (e.g. under E:\packages\poetry\virtualenvs). This is necessary; otherwise the next step will just reinstall cached versions.
    • Reinstall: poetry install
    • In VS Code you may need to manually select the new environment. Ctrl-Shift-P, then click Enter interpreter path....
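After reinstalling, one way to confirm which copy of orchex the environment resolves to is to ask the import system where the package lives. This is a small sketch using only the standard library, and `package_location` is a hypothetical helper name. With develop = true, the reported folder should point into your local clone rather than into site-packages.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_location(name: str) -> Optional[Path]:
    """Return the folder a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin in (None, "built-in", "frozen"):
        return None
    return Path(spec.origin).resolve().parent


# Prints None if orchex is not installed in the active environment.
print(package_location("orchex"))
```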

About

Python package to load eedi data
