This repository has been archived by the owner on Jul 12, 2024. It is now read-only.

Feat/prism #248

Merged: 33 commits merged into develop from feat/prism on Dec 20, 2022

Conversation

fabrizio-credo
Contributor

Prototype for Prism object

Overall folder/file structure

  • comparators: contains the basic logic to compare different types of results. Comparators can be a function of the result container type and, potentially, of the evaluators. A rough sketch of this layout follows this list.

    • comparator.py: abstract class.
    • metric_comparator.py: takes a list of Lens objects, extracts their metric containers, and compares metrics across all the Lens objects. Comparisons are done with respect to a reference Lens object. Currently the reference is defined by model name; this can be changed in the future if necessary. @esherman-credo I simplified the logic: the all-vs-all comparison produced a lot of results that were probably not going to get used; we can reinstate it if need be. I have also added the possibility to select different operations.
  • compare.py: a type of task that Prism can orchestrate. At the moment it only calls the metric comparator; in the future compare.Compare will instantiate the necessary comparators (table, metric, etc.) depending on the existing result containers.

  • prism.py: the orchestrator, the equivalent of Lens but at a higher level. The purpose of Prism is to coordinate the various tasks. Currently, when the user instantiates it, they can pass a dictionary of tasks to perform on the Lens objects. The dictionary has the format {'task_name': {'parameter': value}}, where the parameters are specific to the type of task being performed.
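
For orientation, here is a minimal, self-contained sketch of that layout. The class and method names mirror the description above but are illustrative rather than the actual implementation; the real comparators operate on Lens result containers, not the plain dicts used here.

# Minimal, self-contained sketch of the comparator layout described above. The
# names and signatures are illustrative; the real classes in
# credoai/prism/comparators work on Lens result containers, not plain dicts.
import operator
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Comparator(ABC):
    """Abstract base: holds results from several Lens runs and compares them."""

    def __init__(self, results: Dict[str, Dict[str, float]], ref: str):
        self.results = results  # {run_identifier: {metric_name: value}}
        self.ref = ref          # identifier of the reference run

    @abstractmethod
    def compare(self) -> Dict[str, Dict[str, float]]:
        ...


class MetricComparator(Comparator):
    """Compares each run's metrics against the reference run."""

    OPERATIONS: Dict[str, Callable[[float, float], float]] = {
        "diff": operator.sub,         # value - reference
        "ratio": lambda a, b: a / b,  # value / reference
    }

    def __init__(self, results, ref, operation: str = "diff"):
        super().__init__(results, ref)
        self.operation = self.OPERATIONS[operation]

    def compare(self):
        reference = self.results[self.ref]
        return {
            run_id: {
                name: self.operation(value, reference[name])
                for name, value in metrics.items()
                if name in reference
            }
            for run_id, metrics in self.results.items()
        }


# Usage with made-up numbers
runs = {
    "credit_RF1_classifier": {"accuracy_score": 0.81, "roc_auc_score": 0.77},
    "credit_XGB_classifier": {"accuracy_score": 0.79, "roc_auc_score": 0.75},
}
print(MetricComparator(runs, ref="credit_RF1_classifier", operation="diff").compare())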

Missing

  • Docstrings in general
  • Abstract classes for task objects (like compare.Compare). I haven't created an abstract class for this yet; I am wondering whether there is a need for one, or whether the variety of these objects will be too great for an abstract class to be useful.
  • Tests

To test the run you can use the following code:


from credoai.lens import Lens
from credoai.artifacts import TabularData, ClassificationModel
from credoai.evaluators import *
from credoai.governance import Governance
import numpy as np


# imports for example data and model training
from credoai.datasets import fetch_creditdefault
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

data = fetch_creditdefault()
df = data["data"]
df["target"] = data["target"].astype(int)

# fit model
model1 = RandomForestClassifier(random_state=42)
X = df.drop(columns=["SEX", "target"])
y = df["target"]
sensitive_features = df["SEX"]
(
    X_train,
    X_test,
    y_train,
    y_test,
    sensitive_features_train,
    sensitive_features_test,
) = train_test_split(X, y, sensitive_features, random_state=42)
model1.fit(X_train, y_train)

model2 = XGBClassifier(tree_method='hist', random_state=42, n_estimators=2, enable_categorical=True)
model2.fit(X_train, y_train)



# ## Set up artifacts

credo_RF1 = ClassificationModel(
    'credit_RF1_classifier',
    model1
)

credo_XGB = ClassificationModel(
    'credit_XGB_classifier',
    model2
)

train_data = TabularData(
    name = "UCI-credit-default-train",
    X=X_train,
    y=y_train,
    sensitive_features=sensitive_features_train
)
test_data = TabularData(
    name='UCI-credit-default-test',
    X=X_test,
    y=y_test,
    sensitive_features=sensitive_features_test
)


# pipelines can be specified using an sklearn-like style
metrics = ['accuracy_score', 'roc_auc_score']
pipeline = [
    (Performance(metrics)),
    (ModelFairness(metrics)),
]

pipeline2 = [
    (Performance(metrics)),
    (ModelFairness(metrics)),
]


# and added to lens in one fell swoop
lens_rf1 = Lens(
    model=credo_RF1,
    assessment_data=test_data,
    # training_data=train_data,
    pipeline=pipeline
)



lens_xgb = Lens(
    model=credo_XGB,
    assessment_data=test_data,
    # training_data=train_data,
    pipeline=pipeline2
)


from credoai.prism import Prism

prism_test = Prism(
    [lens_rf1, lens_xgb], 
    tasks={'compare':{'ref':'credit_RF1_classifier'}}
)

prism_test.execute()

prism_test.get_results()


@fabrizio-credo fabrizio-credo added the enhancement New feature or request label Nov 15, 2022
@fabrizio-credo fabrizio-credo self-assigned this Nov 15, 2022
Collaborator

@IanAtCredo IanAtCredo left a comment


Left some comments. Bullet point high level critiques:

  • I feel like the compare.py class is unneeded and can be rolled into Prism. Happy to discuss and be wrong here though.
  • The way a reference is defined and IDed is not flexible enough. Only works for cross-model comparisons, when really we want cross "context" comparisons. If the same model is run on different samples of data, we may want to compare those.
  • The user interface of Prism, which is supposed to simplify everything, instead obfuscates.
  • Documentation needed for most functions (though easy to follow anyway)

Review threads (now resolved) were opened on: credoai/prism/comparators/comparator.py, credoai/prism/comparators/metric_comparator.py, credoai/prism/prism.py, credoai/prism/compare.py
@fabrizio-credo
Contributor Author

@IanAtCredo Overall responses to the high level critique:

I feel like the compare.py class is unneeded and can be rolled into Prism. Happy to discuss and be wrong here though.

Initially I was of the same opinion; however, I feel we need this kind of object due to the variety of tasks that Prism will have to handle in the future (not just comparisons). If everything gets put into Prism, I feel it will balloon pretty quickly. I intended for Prism to handle only the orchestration logic, including the translation from user input into the tasks to perform.
This is based mostly on a feeling at the moment; I would like to create a sampling task following this schema and see how we feel about it at the end.

The way a reference is defined and IDed is not flexible enough. Only works for cross-model comparisons, when really we want cross "context" comparisons. If the same model is run on different samples of data, we may want to compare those.

Agreed, this was meant as a temporary solution. I think Data/Model/SensitiveFeature define the context at the moment; I wonder what the best way of conveying this information is. The simplest is a combination of Model_Name + Data_Name + Feature_Name/Value, but this will get ugly pretty fast.

Potential solution: leverage PipelineStep -> create a context property which is a function of the metadata and which associates a hash with the step. In my mind each step coincides with a context. All Prism tasks would use the hashes for internal referencing (this makes creating DataFrames easier than having to move dictionaries around).
We can then create a couple of Prism methods that translate to/from human-readable input/output.
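
A rough, self-contained sketch of that idea, using a simplified stand-in for PipelineStep (the real class wraps an evaluator and artifacts); the hashing scheme is an assumption, shown only to illustrate the proposal.

# Rough sketch of the hashed-context idea, using a simplified stand-in for
# PipelineStep. The hashing scheme is an assumption, not the actual class.
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PipelineStepSketch:
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def context(self) -> str:
        # Stable hash built from the step metadata: any step with the same
        # model/data/sensitive-feature combination gets the same context id.
        canonical = "~".join(f"{k}:{self.metadata[k]}" for k in sorted(self.metadata))
        return hashlib.sha1(canonical.encode()).hexdigest()[:12]


step = PipelineStepSketch(
    metadata={
        "evaluator": "Performance",
        "model_name": "credit_RF1_classifier",
        "dataset_name": "UCI-credit-default-test",
    }
)
print(step.context)  # short, opaque id used for internal referencing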

The user interface of Prism, which is supposed to simplify everything, instead obfuscates.
Documentation needed for most functions (though easy to follow anyway)

Agreed, I wasn't extremely happy about it either. We can ask the user to pass the actual task class, like Compare. The tasks parameter in Prism would then look something like this:

[
 {
   task: Compare,
   params: {ref: ...}
 },
]

Another argument for keeping the task objects 😛. In any case, whether we map user input strings to behavior or pass tasks directly, a certain amount of knowledge is necessary on the user's side.
Stylistically this is more similar to what we currently do in Lens when we build the pipeline.

@IanAtCredo
Collaborator

Few thoughts:

Why does the user need to specify tasks at all, or at the level of granularity you are describing? Why not just "here are a few models, either aggregate or compare them"? I can buy that if you say "aggregate them" a list of tasks is implicitly created, but should that be the main way the user interacts? Perhaps some super-user will want specific access to the tasks (which we can allow), but that's secondary to the primary user interaction.

Data/Model/SensitiveFeature as context - yeah, I think we may want to define this somewhere. I was originally thinking that a run of Lens is the context. Why do we need to infer the context? Can't we just say "we took a bucket of what you said was one thing and compared it to another"? Generalizing this to evidence separated from Lens is trickier within your current implementation, but it could be done with a list of evidence lists, rather than inferring separation by IDs.

@fabrizio-credo
Contributor Author

fabrizio-credo commented Nov 28, 2022

@IanAtCredo Update:

  • Formalized the concept of Task
  • Updated the method of passing tasks to Prism; it now follows the same logic as passing evaluators to Lens
  • Updated PipelineStep to create a unique id for each step, dependent on the evaluator name, model name, dataset names, and sensitive feature
  • Updated the compare task and metric comparators to leverage the new id to match evidence for comparisons
  • Updated documentation all around
  • Updated how the reference is defined by the user (an illustrative call is sketched below):
    • User specifies which type of object they want to reference: model/assessment_data/training_data
    • User specifies the name of said object
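
Something along these lines, purely illustrative: the Compare import path and the task/ref_type/ref parameter names are assumptions pieced together from this thread, not the final API.

# Illustrative only: tasks are now passed as instantiated objects, the same way
# evaluators are passed to Lens. The import path and the task/ref_type/ref
# parameter names below are assumptions based on this discussion.
from credoai.prism import Prism
from credoai.prism.compare import Compare

prism_test = Prism(
    [lens_rf1, lens_xgb],
    task=Compare(ref_type="model", ref="credit_RF1_classifier"),
)
prism_test.execute()
results = prism_test.get_results()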

Collaborator

@IanAtCredo IanAtCredo left a comment


I'm not sure if changes need to be made, but I feel like the current "identifier" implementation is quite shaky and will lead to problems down the road. Let's discuss whether changes are necessary.

We can also call this an experimental feature, accept some brittleness and see if it is required by others.

@@ -35,6 +36,27 @@ def __post_init__(self):
        self.evaluator.metadata = self.metadata
        self.metadata["evaluator"] = self.evaluator.name

    @property
    def identifier(self):
Collaborator


I spent a while trying to make this more robust. As it is, this seems quite brittle! We're passing an ID that is made up on the fly here, to be referenced later via a hardcoded set of ID_Cols. That doesn't seem good.

Tried to replace with this:

    @property
    def id(self):
        eval_info = self.evaluator.get_info()["metadata"]
        return "~".join(f"{k}:{v}" for k, v in eval_info.items())

which makes use of another function we already have to create metadata (I also changed "identifier" => "id")

This results in outputs like this:
'evaluator:Performance~model_name:credit_RF1_classifier~assessment_dataset_name:UCI-credit-default-test'

This can't be used later, however, because the metadata isn't consistent across evaluators. Still, I'm confused why that should be an issue: shouldn't we be able to do comparisons evaluator-by-evaluator when they match up? The issue only comes up because everything is being coerced into one dataframe.

I tried to remove the explicit separation of the ID into ID_Cols, but then saw that the ID_Cols were used so that the comparison could be directed over the reference class, like "model". I was confused about that implementation as well. Once we know that we have the same ID between two Lens instances, shouldn't we be able to compare? Why do we need a ref class at all?

Contributor Author


The reference was split into ref_type and ref to allow the flexibility of comparing either by model or by whichever dataset (training/assessment) is available. This is to accommodate two types of use cases:

  1. Many-model comparisons: multiple iterations of the same model vs. the initial one, or different models against a reference one.
  2. Multiple-dataset comparisons: comparing results when the dataset changes, over time or due to some form of sampling, for example.

If the metadata structure were always constant, I would agree that we could simply pass self.evaluator.get_container_info(); but as it stands I would still have to take extra steps to standardize the output, similar to what I am already doing.

Standardization is also good for displaying results: breaking the id down into labelled columns helps users understand the results, rather than having to read a long and variable id (see the sketch below).
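
To make the point concrete, a small illustrative sketch; the id format and ID_COLS values here are invented for illustration, not taken from the code.

# Sketch of the "labelled columns" point: split a composite step id into named
# columns, then pick the reference rows via ref_type/ref. The id format and
# ID_COLS values are invented for illustration.
from pandas import DataFrame

ids = [
    "Performance~credit_RF1_classifier~UCI-credit-default-test~SEX",
    "Performance~credit_XGB_classifier~UCI-credit-default-test~SEX",
]
ID_COLS = ["evaluator", "model_name", "dataset_name", "sensitive_feature"]

table = DataFrame([dict(zip(ID_COLS, i.split("~"))) for i in ids])

ref_type, ref = "model_name", "credit_RF1_classifier"
reference_rows = table[table[ref_type] == ref]
comparison_rows = table[table[ref_type] != ref]
print(reference_rows)
print(comparison_rows)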

be instantiated with their necessary artifacts. One or multiple Lens objects
can be provided, each Task will validate that the amount of objects provided
is suitable for its requirement.
task : Task
Collaborator


You said tasks work as evaluators, but here I see I can only pass one task.

Contributor Author


Yeah, I consciously constrained the power of Prism. Tasks as we imagined them would be of two types:

  1. Calculations on Lens results that produce numerical output (compare, for example)
  2. Operations on the instances, for example creating new ones based on sampling of the datasets

I imagine that at the end of a Prism run some form of numerical results needs to be produced.
That being the case, a type 2 task would require a type 1 task to be run afterwards so that numerical output can be produced. This would require enforcing an order of execution (or adding dependencies between tasks).

I decided to avoid that complication by allowing one task to call another. So, for example, a hypothetical cross-validation task would:

  1. Create the Lens instances by splitting the dataset appropriately
  2. Run compare to produce numerical results

At the moment I don't anticipate the need to run super-complex sequences of tasks; this might be a reflection of the fact that we don't have an actual use case for Prism yet. (A rough sketch of this pattern follows below.)
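
A hypothetical sketch of that pattern, a type 2 task building per-fold Lens runs and then delegating to a type 1 compare task; every name here is invented for illustration and none of it is credoai code.

# Hypothetical sketch of a type 2 task delegating to a type 1 task. Every name
# here is invented for illustration. X, y and sensitive are assumed to be
# pandas objects, and build_lens is assumed to return Lens-like objects
# exposing a run() method.
from sklearn.model_selection import KFold


def cross_validation_task(model_artifact, X, y, sensitive, build_lens, compare_task, n_splits=3):
    """Split the data, build one Lens run per fold, then hand the runs to a compare task."""
    lens_runs = []
    for _, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(X):
        # Type 2 work: create one Lens instance per data split
        lens_runs.append(
            build_lens(model_artifact, X.iloc[test_idx], y.iloc[test_idx], sensitive.iloc[test_idx])
        )
    for lens in lens_runs:
        lens.run()
    # Type 1 work: delegate to the compare task so numerical output is produced
    return compare_task(lens_runs)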

from typing import Callable, List, Literal, Optional

import numpy as np
from pandas import DataFrame, concat
Collaborator


My preference is for import package as ... because I find it more explicit. We import the entire package regardless, and you can easily see that this is "pd.DataFrame" rather than some other DataFrame. It's also the convention I see everywhere for pandas and numpy.
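
For illustration, the two styles side by side:

# Namespace import: the conventional style for pandas/numpy, and it is obvious
# where DataFrame comes from at the call site.
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Direct-name import: shorter, but hides the origin of DataFrame.
from pandas import DataFrame

df = DataFrame({"a": [1, 2]})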

@IanAtCredo IanAtCredo self-requested a review December 16, 2022 17:59
Collaborator

@IanAtCredo IanAtCredo left a comment


As mentioned, I really don't think this ID implementation is great. Or, at least, the ID specification conflicts with other "sources of truth" about artifacts reflected in get_container_info. I also don't think parsing an ID string is reasonable - why not query the evaluator directly during the compare step?

All that said, I don't think we should invest more time in this. It works, let's move on.

@fabrizio-credo fabrizio-credo changed the title from [WIP] Feat/prism to Feat/prism on Dec 19, 2022
@github-actions

Coverage Report

File | Stmts | Miss | Cover | Missing
credoai/__init__.py | 3 | 0 | 100% |
credoai/artifacts/__init__.py | 7 | 0 | 100% |
credoai/artifacts/data/__init__.py | 0 | 0 | 100% |
credoai/artifacts/data/base_data.py | 107 | 12 | 89% | 55, 155, 158, 173, 180, 187, 191, 195, 199, 211, 214, 221
credoai/artifacts/data/comparison_data.py | 63 | 13 | 79% | 53, 60, 71, 76, 81, 90, 96, 100, 105, 114, 147, 153, 156
credoai/artifacts/data/tabular_data.py | 40 | 6 | 85% | 52, 73, 77, 96, 98, 105
credoai/artifacts/model/__init__.py | 0 | 0 | 100% |
credoai/artifacts/model/base_model.py | 36 | 2 | 94% | 56, 88
credoai/artifacts/model/classification_model.py | 23 | 0 | 100% |
credoai/artifacts/model/comparison_model.py | 11 | 0 | 100% |
credoai/artifacts/model/constants_model.py | 2 | 0 | 100% |
credoai/artifacts/model/regression_model.py | 11 | 4 | 64% | 43–45, 48
credoai/evaluators/__init__.py | 15 | 0 | 100% |
credoai/evaluators/data_fairness.py | 147 | 12 | 92% | 83–90, 205, 260–261, 287, 311, 334–340, 356
credoai/evaluators/data_profiler.py | 34 | 2 | 94% | 57, 60
credoai/evaluators/deepchecks.py | 40 | 3 | 92% | 113–122
credoai/evaluators/equity.py | 153 | 31 | 80% | 78, 181–184, 204, 230–257, 281–296, 307–309, 358–359
credoai/evaluators/evaluator.py | 70 | 7 | 90% | 58, 61, 80, 106, 126, 174, 181
credoai/evaluators/fairness.py | 149 | 12 | 92% | 117, 238, 246–251, 318–327, 329, 341–344
credoai/evaluators/feature_drift.py | 59 | 1 | 98% | 66
credoai/evaluators/identity_verification.py | 112 | 2 | 98% | 144–145
credoai/evaluators/model_profiler.py | 74 | 11 | 85% | 128–131, 145–148, 182–183, 192–193, 231
credoai/evaluators/performance.py | 119 | 14 | 88% | 110, 137–143, 232–241, 243, 260–263
credoai/evaluators/privacy.py | 118 | 4 | 97% | 410, 447–449
credoai/evaluators/ranking_fairness.py | 134 | 14 | 90% | 136–137, 157, 178, 184–185, 382–404, 409–439
credoai/evaluators/security.py | 96 | 1 | 99% | 297
credoai/evaluators/shap.py | 87 | 14 | 84% | 119, 127–128, 138–144, 170–171, 253–254, 284–292
credoai/evaluators/survival_fairness.py | 67 | 50 | 25% | 29–33, 36–48, 53–64, 67–78, 81–99, 102, 105, 108
credoai/evaluators/utils/__init__.py | 3 | 0 | 100% |
credoai/evaluators/utils/fairlearn.py | 18 | 2 | 89% | 46, 59
credoai/evaluators/utils/utils.py | 8 | 1 | 88% | 9
credoai/evaluators/utils/validation.py | 81 | 28 | 65% | 14, 34–35, 37–39, 46, 67–74, 80–86, 89, 95–98, 105, 108, 111, 114–115, 119–121
credoai/governance/__init__.py | 1 | 0 | 100% |
credoai/lens/__init__.py | 2 | 0 | 100% |
credoai/lens/lens.py | 201 | 13 | 94% | 53, 195–196, 232–237, 294, 336, 360, 442, 457, 461, 473
credoai/lens/pipeline_creator.py | 60 | 12 | 80% | 20–21, 37, 79–91
credoai/lens/utils.py | 39 | 28 | 28% | 20–27, 49–52, 71–82, 99, 106–109, 128–135
credoai/modules/__init__.py | 3 | 0 | 100% |
credoai/modules/constants_deepchecks.py | 2 | 0 | 100% |
credoai/modules/constants_metrics.py | 19 | 0 | 100% |
credoai/modules/constants_threshold_metrics.py | 3 | 0 | 100% |
credoai/modules/metric_utils.py | 24 | 18 | 25% | 15–30, 34–55
credoai/modules/metrics.py | 61 | 7 | 89% | 62, 66, 69–70, 73, 83, 120
credoai/modules/metrics_credoai.py | 135 | 47 | 65% | 68–69, 73, 93–102, 107–109, 132–160, 176–179, 206, 230–231, 294–296, 372–378, 414–415, 485–486
credoai/modules/stats.py | 39 | 28 | 28% | 11–14, 17–22, 25–27, 30–35, 38–52, 55–60
credoai/modules/stats_utils.py | 5 | 3 | 40% | 5–8
credoai/prism/__init__.py | 3 | 0 | 100% |
credoai/prism/compare.py | 35 | 2 | 94% | 70, 86
credoai/prism/prism.py | 36 | 4 | 89% | 46, 48, 59, 86
credoai/prism/task.py | 17 | 2 | 88% | 30, 37
credoai/prism/comparators/__init_.py | 0 | 0 | 100% |
credoai/prism/comparators/comparator.py | 17 | 3 | 82% | 34, 42, 47
credoai/prism/comparators/metric_comparator.py | 44 | 2 | 95% | 125, 131
credoai/utils/__init__.py | 5 | 0 | 100% |
credoai/utils/common.py | 102 | 40 | 61% | 55, 68–69, 75, 84–91, 96–104, 120–126, 131, 136–141, 152–159, 186
credoai/utils/constants.py | 2 | 0 | 100% |
credoai/utils/dataset_utils.py | 61 | 35 | 43% | 23, 26–31, 50, 54–55, 88–119
credoai/utils/logging.py | 55 | 13 | 76% | 10–11, 14, 19–20, 23, 27, 44, 58–62
credoai/utils/model_utils.py | 30 | 11 | 63% | 14–19, 29–30, 35–40
credoai/utils/version_check.py | 11 | 1 | 91% | 16
TOTAL | 2899 | 515 | 82% |

@fabrizio-credo fabrizio-credo merged commit 7f49358 into develop Dec 20, 2022
@fabrizio-credo fabrizio-credo deleted the feat/prism branch December 20, 2022 00:11