Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
added readme and example datasets (#32)
1 parent 1407219, commit c9393fa
Showing 3 changed files with 200 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
# Introduction | ||
**py4phi** is a simple solution for the complex problem of dealing with sensitive data. | ||
|
||
In the modern IT world, sharing a dataset that contains sensitive data is common, especially when a large team works on it. Such a dataset can serve various purposes, including building an ML/DL model or simple business analysis. Most companies apply restrictions to the data, such as row-level security, column hashing, or encryption, but implementing these requires at least some knowledge of data engineering libraries and can be challenging and time-consuming. At the same time, employees with access to the sensitive parts of the data may lack such expertise, which is where **py4phi** can help. | ||
# Functionality | ||
**py4phi** offers the following functionality to solve the problem mentioned above and more: | ||
**Encrypt a dataset column-wise.** | ||
**Decrypt a dataset column-wise.** | ||
**Encrypt any folder or machine learning model** | ||
**Decrypt any folder or machine learning model** | ||
**Perform principal component analysis on a dataset** | ||
**Perform correlation analysis for feature selection on a dataset** | ||
You can use **py4phi** both in Python code and from your terminal via its CLI. | ||
# Setup and prerequisites | ||
To install the library from PyPI, run: | ||
```shell | ||
pip install py4phi | ||
``` | ||
|
||
### **py4phi** is compatible with the following engines for data processing and encryption: | ||
* [Pandas](https://github.com/pandas-dev/pandas) | ||
* [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) | ||
* [Polars](https://pola.rs/) | ||
|
||
NOTE: The default engine for the CLI is Pandas, whereas the library defaults to PySpark. | ||
|
||
To work with the PySpark engine, you'll need a JDK installed and the JAVA_HOME environment variable set. | ||
You can still work with Pandas or Polars without a JDK. | ||
# Usage | ||
You can integrate **py4phi** in your existing pandas/pyspark/polars data pipeline by | ||
initializing from a DataFrame or loading from a file. Currently, CSV and Parquet | ||
file types are supported. | ||
|
||
Encryption and decryption of the datasets are facilitated by the use of configs. | ||
Each column gets its own encryption key and a nonce, which are saved in the configs. These resulting files can be further encrypted for even more safety. | ||
|
||
Therefore, you can encrypt only sensitive columns, send the outputs, for example, to the data analysis team, | ||
and keep the data safe. Later, data can be decrypted using configs on-demand. | ||
Moreover, you do not need deep knowledge of the underlying engines (pandas, etc.) | ||
and don't need to write long scripts to encrypt data and save the keys. | ||
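As a rough illustration of the "one key and one nonce per column" idea described above, the sketch below builds such a config with the standard library only. The function name `build_column_configs`, the JSON layout, and the 32-byte key / 12-byte nonce sizes are assumptions made for this example; they are not taken from py4phi, whose actual config format and cipher may differ.

```python
import json
import secrets


def build_column_configs(columns, key_bytes=32, nonce_bytes=12):
    """Return a hypothetical per-column config: one random key and nonce each.

    The sizes are assumptions (typical for AES-256-GCM), not py4phi defaults.
    """
    return {
        column: {
            # Hex-encode the random bytes so the config is JSON-serializable.
            "key": secrets.token_bytes(key_bytes).hex(),
            "nonce": secrets.token_bytes(nonce_bytes).hex(),
        }
        for column in columns
    }


# Only the sensitive columns get an entry; other columns stay in plain text.
configs = build_column_configs(["Staff involved", "ACF"])
serialized = json.dumps(configs, indent=2)
```

Because anyone holding such a file can decrypt the matching columns, it has to be stored or shared separately from the encrypted dataset.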
|
||
The following example showcases the encryption process of a dataset.csv file \ | ||
(you can find it in the [/examples](https://github.com/volodymyrkir/py4phi/tree/main/examples) folder). \ | ||
The output dataset, along with the decryption configs, is then saved to the "test_folder" directory under the CWD. | ||
|
||
```python | ||
from py4phi.core import from_path | ||
|
||
controller = from_path( | ||
'./dataset.csv', | ||
'csv', | ||
engine='pyspark', | ||
log_level='DEBUG', | ||
header=True # pyspark read option | ||
) | ||
controller.print_current_df() | ||
controller.encrypt(columns_to_encrypt=['Staff involved', 'ACF']) | ||
controller.print_current_df() | ||
controller.save_encrypted( | ||
output_name='my_encrypted_file', | ||
save_location='./test_folder/', # results will be saved under CWD/test_folder/py4phi_encrypted_outputs | ||
save_format='PARQUET', | ||
) | ||
|
||
``` | ||
|
||
To decrypt these outputs, you can use: | ||
```python | ||
import pandas as pd | ||
from py4phi.core import from_dataframe | ||
|
||
df = pd.read_parquet('./test_folder/py4phi_encrypted_outputs/my_encrypted_file.parquet') | ||
controller = from_dataframe( | ||
df, | ||
log_level='DEBUG' | ||
) | ||
controller.print_current_df() | ||
controller.decrypt( | ||
columns_to_decrypt=['Staff involved', 'ACF'], | ||
configs_path='./test_folder/py4phi_encrypted_outputs', | ||
) | ||
controller.print_current_df() | ||
controller.save_decrypted( | ||
output_name='my_decrypted_file', | ||
save_location='./test_folder', | ||
save_format='csv' | ||
) | ||
``` | ||
|
||
This example also shows how to initialize **py4phi** from a (pandas, in this case) DataFrame. | ||
|
||
A similar workflow can be run from a terminal with the following CLI commands: | ||
```shell | ||
py4phi encrypt-and-save -i ./dataset.csv -c ACF -c 'Staff involved' -e pyspark -p -o ./ -r header True | ||
py4phi decrypt-and-save -i ./py4phi_encrypted_outputs/output_dataset.csv -e pyspark -c ACF -c 'Staff involved' -p -o ./ -r header True | ||
``` | ||
|
||
To encrypt and decrypt a folder or an ML/DL model, you can use: | ||
```python | ||
from py4phi.core import encrypt_model, decrypt_model | ||
encrypt_model( | ||
'./test_folder', | ||
encrypt_config=False # or True | ||
) | ||
|
||
decrypt_model( | ||
'./test_folder', | ||
config_encrypted=False # or True | ||
) | ||
``` | ||
After encryption, all files within the specified folder become unreadable. | ||
This can be used for easy one-line model encryption. | ||
|
||
The same actions can be taken in a terminal: | ||
```shell | ||
# encrypt model/folder, do not encrypt config. Note that encryption is done inplace. Please save original before encryption. | ||
py4phi encrypt-model -p ./py4phi_encrypted_outputs/ -d | ||
|
||
# decrypt model/folder when config is not encrypted | ||
py4phi decrypt-model -p ./py4phi_encrypted_outputs/ -c | ||
``` | ||
# Analytics usage | ||
Apart from the main encrypt/decrypt functionality, you may want to reduce | ||
the dimensionality of a dataset or perform correlation analysis of its features (feature selection). | ||
In a typical scenario, this requires a lot of effort from the data analyst. | ||
Instead, a person with access to the sensitive data | ||
can perform a lightweight **PCA/feature selection** in a couple of lines of code or terminal commands. | ||
|
||
**NOTE**: This functionality is a quick top-level analysis; a deeper feature analysis of the dataset will always yield more insight. | ||
|
||
To perform principal component analysis with Python, use: | ||
|
||
```python | ||
from py4phi.core import from_path | ||
controller = from_path('Titanic.parquet', file_type='parquet', engine='pyspark') | ||
controller.perform_pca( | ||
target_feature='Survived', | ||
ignore_columns=['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], | ||
save_reduced=False | ||
) | ||
``` | ||
|
||
Via terminal: | ||
|
||
```shell | ||
py4phi perform-pca -i ./dataset.csv --target 'Staff involved' -c ACF | ||
``` | ||
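For readers unfamiliar with what a PCA step computes, here is a minimal pure-Python sketch that finds the first principal component of a small numeric table by power iteration on the covariance matrix. It is an illustration of the general technique only and makes no claim about how `perform_pca` is implemented; the function name `first_principal_component` and the toy data are invented for this example.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained variance ratio) of the top component."""
    n, d = len(rows), len(rows[0])
    # Center every feature around its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; a start vector exactly orthogonal to the top
    # component would not converge, which is acceptable for a sketch.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    ratio = eigenvalue / total_variance if total_variance else 0.0
    return v, ratio


# Toy data lying exactly on the line y = 2x: one component explains everything.
direction, explained = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
```

When a single component captures nearly all of the variance, as in this toy data, the remaining dimensions can be dropped with little information loss.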
|
||
To perform feature selection with Python, use: | ||
|
||
```python | ||
from py4phi.core import from_path | ||
controller = from_path('Titanic.parquet', file_type='parquet', engine='polars') | ||
controller.perform_feature_selection( | ||
target_feature='Survived', | ||
target_correlation_threshold=0.2, | ||
features_correlation_threshold=0.2, | ||
drop_recommended=False | ||
) | ||
``` | ||
|
||
Via terminal: | ||
|
||
```shell | ||
py4phi feature-selection -i ./Titanic.parquet --target Survived --target_corr_threshold 0.3 --feature_corr_threshold 0.55 | ||
``` | ||
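The two thresholds can be read as follows, and this reading is an assumption rather than documented py4phi behavior: features weakly correlated with the target are candidates for removal, and among feature pairs that are strongly correlated with each other, the one less related to the target is redundant. The sketch below implements that reading with Pearson correlation in pure Python; `pearson`, `recommend_drops`, and the toy columns are hypothetical names and data introduced for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def recommend_drops(columns, target, target_threshold, features_threshold):
    """Return a sorted list of feature names recommended for dropping."""
    features = [name for name in columns if name != target]
    target_corr = {
        name: abs(pearson(columns[name], columns[target])) for name in features
    }
    # Step 1: features too weakly correlated with the target.
    drops = {name for name in features if target_corr[name] < target_threshold}
    kept = [name for name in features if name not in drops]
    # Step 2: of two strongly inter-correlated features, drop the one
    # that is less correlated with the target.
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a in drops or b in drops:
                continue
            if abs(pearson(columns[a], columns[b])) >= features_threshold:
                drops.add(a if target_corr[a] < target_corr[b] else b)
    return sorted(drops)


columns = {
    "target": [0, 1, 0, 1, 1, 0],
    "strong": [0.1, 0.9, 0.2, 0.8, 1.0, 0.0],
    "weak_copy": [0.2, 0.8, 0.1, 0.9, 0.6, 0.4],
    "noise": [1, 2, 2, 1, 3, 3],
}
recommended = recommend_drops(columns, "target", 0.2, 0.7)
```

Here `noise` is uncorrelated with the target, and `weak_copy` largely duplicates `strong` while tracking the target less closely, so both are recommended for removal.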
|
||
See the [/examples](https://github.com/volodymyrkir/py4phi/tree/main/examples) folder for more examples. | ||
It also contains training datasets. |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
Staff involved,ACF,ACL,AHRQ,ARPA-H,ASPR,CDC,CMS,FDA,HRSA,IHS,NIH,OS,SAMHSA,TOTAL | ||
Staff normally paid from or shifted to administrative funds appropriated in authorizing legislation.,40,0,0,0,0,89,1924,0,62,0,0,1565,0,3680 | ||
Staff normally paid from or shifted to carryover funding or advanced appropriations,599,9,26,88,531,2379,531,12504,1141,8233,5,1059,48,27153 | ||
Staff normally paid from or shifted to reimbursable funding for which the reimbursement is not paid from funds provided by the lapsed FY24 appropriation.,13,3,0,0,0,85,0,59,0,6853,0,24,0,7037 | ||
Staff normally paid from or shifted to user fees appropriated in authorizing legislation.,0,0,0,0,0,0,556,31,38,0,0,0,0,625 | ||
Commissioned Corps (excepted),6,0,5,0,112,739,84,359,73,0,173,137,41,1729 | ||
HHS officers appointed by the President (exempt),2,0,0,0,1,0,1,1,0,0,1,8,1,15 | ||
"Activities required to ensure that fully funded programs continue operation, and that funded entitlement benefits are paid",27,0,0,0,0,313,121,183,69,0,71,310,0,1094 | ||
Law enforcement activities.,0,0,0,0,0,0,0,0,0,0,67,0,0,67 | ||
Orderly phase-down and suspension of operations,43,12,6,0,3,460,0,59,66,0,252,202,38,1141 | ||
Other,0,0,0,0,0,61,0,0,0,0,69,79,0,209 | ||
Staff to be furloughed.,856,172,250,0,377,10120,3329,5059,1257,0,15139,2360,677,39596 | ||
"Subtotal, authorized by law",1586,196,287,88,1024,14246,6546,18255,2706,15086,15777,5744,805,82346 | ||
Direct medical services provided through clinics and hospitals that the OPDIV operates,0,0,0,0,0,1,0,0,7,0,2314,0,0,2322 | ||
Other,0,0,0,0,54,530,0,1236,0,0,0,1,3,1824 | ||
Maintain computer data,2,0,0,0,0,161,0,14,5,0,197,49,0,428 | ||
Maintenance of animals & protection of inanimate government property,0,0,0,0,0,88,0,12,0,0,718,80,2,900 | ||
Other,0,0,0,0,0,66,0,0,3,0,0,0,0,69 | ||
Protect ongoing medical experiments.,0,0,0,0,0,59,0,127,7,0,649,0,0,842 | ||
"Subtotal, safety of human life and protection of property",2,0,0,0,54,905,0,1389,22,0,3878,130,5,6385 | ||
Total on board staffing,1588,196,287,88,1078,15151,6546,19644,2728,15086,19655,5874,810,88731 | ||
Total number of staff to be retained,732,24,37,88,701,5031,3217,14585,1471,15086,4516,3514,133,49135 | ||
Exempt,654,12,26,88,532,2553,3012,12595,1241,15086,6,2656,49,38510 | ||
Excepted,78,12,11,0,169,2478,205,1990,230,0,4510,858,84,10625 | ||
Number of staff to be furloughed,856,172,250,0,377,10120,3329,5059,1257,0,15139,2360,677,39596 | ||
Percent Retained,0.46,0.12,0.13,1,0.65,0.33,0.49,0.74,0.54,1,0.23,0.6,0.16,0.55 | ||
Percent Exempt,0.41,0.06,0.09,1,0.49,0.17,0.46,0.64,0.45,1,0,0.45,0.06,0.43 | ||
Percent Excepted,0.05,0.06,0.04,0,0.16,0.16,0.03,0.1,0.08,0,0.23,0.15,0.1,0.12 | ||
Percent Furloughed,0.54,0.88,0.87,0,0.35,0.67,0.51,0.26,0.46,0,0.77,0.4,0.84,0.45 | ||
Life & Property as % Excepted,0,0,0,0,0.05,0.06,0,0.07,0.01,0,0.2,0.02,0.01,0.07 |