Feature/mlflow (Eclectic-Sheep#159)
* feat: added mlflow logger

* feat: unified get_logger methods

* feat: generalized model register

* feat: removed signature

* feat: added mlflow register model to sac, sac_decoupled and droq

* feat: added model manager to dreamers and sac_ae

* feat: added model manager to p2e algorithms

* fix: removed order dependencies between configs and code when registering models

* fix: avoid p2e exploration models registered during finetuning

* Feature/add build agents (Eclectic-Sheep#153)

* [skip ci] Update README.md

* [skip ci] Update README.md

* feat: renamed build_models function into build_agent

* feat: added build_agent() function to all the algorithms

* feat: added build_agent() to evaluate() functions

---------

Co-authored-by: Federico Belotti <[email protected]>

* feat: split model manager configs

* feat: added script to register models from checkpoints

* fix: bugs

* fix: configs

* fix: configs + registration model script

* feat: added ensembles creation to build agent function (Eclectic-Sheep#154)

* feat: added possibility to select experiment and run where to upload the models

* fix: bugs

* feat: added configs to artifact when model is registered from checkpoint

* docs: update logs_and_checkpoints how to

* feat: added model_manager howto

* docs: update

* docs: update

* fix: added 'from __future__ import annotations'

* feat: added mlflow model manager tutorial in examples

* fix: bugs

* fix: access to cnn and mlp keys

* fix: experiment and run names

* fix: bugs

* feat: MlflowModelManager.register_best_models() function

* fix: p2e build_agent

* docs: update

* fix: mlflow model manager

* fix: mlflow model manager register best models

---------

Co-authored-by: Federico Belotti <[email protected]>
michele-milesi and belerico committed Nov 28, 2023
1 parent ad59960 commit 18064e3
Showing 95 changed files with 3,689 additions and 761 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -167,4 +167,7 @@ pytest_*
!sheeprl/configs/env
.diambra*
.hydra
.pypirc
mlruns
mlartifacts
examples/models
978 changes: 978 additions & 0 deletions examples/model_manager.ipynb

Large diffs are not rendered by default.

63 changes: 63 additions & 0 deletions howto/logs_and_checkpoints.md
@@ -7,6 +7,10 @@ By default the logging of metrics is enabled with the following settings:
```yaml
# ./sheeprl/configs/metric/default.yaml

defaults:
- _self_
- /logger@logger: tensorboard

log_every: 5000
disable_timer: False

@@ -33,6 +37,7 @@ aggregator:
```
where

* `logger` is the configuration of the logger you want to use for logging. Two values are available out of the box: `tensorboard` (default) and `mlflow`, but one can also define and use a custom logger.
* `log_every` is the number of policy steps (number of steps played in the environment, e.g. if one has 2 processes with 4 environments per process then the policy steps are 2*4=8) between two consecutive logging operations. For more info about the policy steps, check the [Work with Steps Tutorial](./work_with_steps.md).
* `disable_timer` is a boolean flag that enables/disables the timer to measure both the time spent in the environment and the time spent during the agent training. The timer class used can be found [here](../sheeprl/utils/timer.py).
* `log_level` is the level of logging: $0$ means no logging (it also disables the timer), whereas $1$ means logging everything.
@@ -41,6 +46,64 @@ where

So, if one wants to disable everything related to logging, one can set `log_level` to $0$; if one wants to disable only the timer, one can set `disable_timer` to `True`.
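
For example, assuming these options live under the `metric` config group (as the file path above suggests), the following CLI overrides are a minimal sketch of the two cases:

```bash
# Disable all logging (this also disables the timer); the experiment name is just an example.
python sheeprl.py exp=ppo metric.log_level=0

# Keep logging, but turn off the timer.
python sheeprl.py exp=ppo metric.disable_timer=True
```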

### Loggers
Two loggers are made available: the Tensorboard logger and the MLFlow one; in any case, it is possible to define and use another logger.
The logger configurations are under the `./sheeprl/configs/logger/` folder.

#### Tensorboard
Let us start with the Tensorboard logger, which is the default logger used in SheepRL.

```yaml
# ./sheeprl/configs/logger/tensorboard.yaml

# For more information, check https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html
_target_: lightning.fabric.loggers.TensorBoardLogger
name: ${run_name}
root_dir: logs/runs/${root_dir}
version: null
default_hp_metric: True
prefix: ""
sub_dir: null
```
As shown in the configuration, it is necessary to specify the `_target_` class to instantiate. For the Tensorboard logger, the `name` and `root_dir` arguments must be set to the `run_name` and `logs/runs/<root_dir>` parameters, respectively, because we want all the logs and files (configs, checkpoints, videos, ...) of a specific experiment to be under the same folder.

> **Note**
>
> In general, we want the log files to be placed in the folder created by Hydra when the experiment is launched, so make sure to define the logger's `root_dir` and `name` parameters so that the logs end up within the folder created by Hydra (defined by the `hydra.run.dir` parameter). The Tensorboard logger will save the logs in the `<root_dir>/<name>/<version>/<sub_dir>/` folder (if `sub_dir` is defined, otherwise in the `<root_dir>/<name>/<version>/` folder).
The documentation of the `TensorBoardLogger` class can be found [here](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html).

#### MLFlow
Another possibility provided by SheepRL is [MLFlow](https://mlflow.org/docs/2.8.0/index.html).

```yaml
# ./sheeprl/configs/logger/mlflow.yaml

# For more information, check https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger
_target_: lightning.pytorch.loggers.MLFlowLogger
experiment_name: ${exp_name}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
run_name: ${algo.name}_${env.id}_${now:%Y-%m-%d_%H-%M-%S}
tags: null
save_dir: null
prefix: ""
artifact_location: null
run_id: null
log_model: false
```

The parameters that can be specified for creating the MLFlow logger are explained [here](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger).

You can select the MLFlow logger instead of the Tensorboard one from the CLI by adding the `[email protected]=mlflow` argument. In this way, Hydra will take the configurations defined in the `./sheeprl/configs/logger/mlflow.yaml` file.

```bash
python sheeprl.py exp=ppo exp_name=ppo-cartpole [email protected]=mlflow
```

> **Note**
>
> If you are using an MLFlow server, you can specify the `tracking_uri` in the config file or with the `MLFLOW_TRACKING_URI` environment variable (that is the default value in the configs).
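
A minimal sketch of such a setup, assuming a local MLFlow server (the host and port below are placeholders):

```bash
# Start a local MLFlow tracking server and expose its URI to SheepRL.
mlflow server --host 127.0.0.1 --port 5000 &
export MLFLOW_TRACKING_URI="http://127.0.0.1:5000"
```
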
### Logged metrics

Every algorithm should specify a set of default metrics to log, called `AGGREGATOR_KEYS`, under its own `utils.py` file. For instance, the default metrics logged by DreamerV2 are the following:
103 changes: 103 additions & 0 deletions howto/model_manager.md
@@ -0,0 +1,103 @@
# Model Manager

SheepRL makes it possible to register trained models on MLFlow, so as to keep track of model versions and stages.

## Register models with training
The configurations of the model manager are placed in the `./sheeprl/configs/model_manager/` folder, and the default configuration is defined as follows:
```yaml
# ./sheeprl/configs/model_manager/default.yaml

disabled: True
models: {}
```
Since the algorithms have different models, the `models` parameter is set to an empty Python dictionary by default, and each agent defines its own configuration. The `disabled` parameter indicates whether or not the agent should be registered when the training is finished (`False` means that the agent will be registered, `True` that it will not).

> **Note**
>
> The model manager can be used even if the chosen logger is Tensorboard: the only requirement is that an instance of the MLFlow server is running and accessible, and that its URI is specified in the `MLFLOW_TRACKING_URI` environment variable.
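
For instance, a minimal sketch of enabling model registration for a DreamerV3 training run (it assumes the DreamerV3 experiment already pulls in its own `model_manager` configuration, as described above; the tracking URI is a placeholder):

```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
python sheeprl.py exp=dreamer_v3 model_manager.disabled=False
```
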
To better understand how to define the configurations of the models you want to register, take a look at the DreamerV3 model manager configuration:
```yaml
# ./sheeprl/configs/model_manager/dreamer_v3.yaml

defaults:
- default
- _self_

models:
  world_model:
    model_name: "${exp_name}_world_model"
    description: "DreamerV3 World Model used in ${env.id} Environment"
    tags: {}
  actor:
    model_name: "${exp_name}_actor"
    description: "DreamerV3 Actor used in ${env.id} Environment"
    tags: {}
  critic:
    model_name: "${exp_name}_critic"
    description: "DreamerV3 Critic used in ${env.id} Environment"
    tags: {}
  target_critic:
    model_name: "${exp_name}_target_critic"
    description: "DreamerV3 Target Critic used in ${env.id} Environment"
    tags: {}
  moments:
    model_name: "${exp_name}_moments"
    description: "DreamerV3 Moments used in ${env.id} Environment"
    tags: {}
```
For each model, it is necessary to define the `model_name`, the `description`, and the `tags` (i.e., a Python dictionary with strings as keys and values). The keys that can be specified are defined by the `MODELS_TO_REGISTER` variable in the `./sheeprl/algos/<algo_name>/utils.py` file. For DreamerV3, it is defined as follows: `MODELS_TO_REGISTER = {"world_model", "actor", "critic", "target_critic", "moments"}`.
If you do not want to register some of these models, you just need to remove them from the configuration file.
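
If you prefer not to edit the YAML file, Hydra's `~` deletion syntax offers an alternative; the following is a hypothetical override (assuming the DreamerV3 setup above) that skips the target critic:

```bash
# Quoting prevents the shell from interpreting the leading "~".
python sheeprl.py exp=dreamer_v3 model_manager.disabled=False '~model_manager.models.target_critic'
```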

> **Note**
>
> The names of the models in the `MODELS_TO_REGISTER` variable are equal to the names of the variables that hold the models in the `./sheeprl/algos/<algo_name>/<algo_name>.py` file.
>
> Make sure that the models specified in the configuration file are a subset of the models defined by the `MODELS_TO_REGISTER` variable.
## Register models from checkpoints
Another possibility is to register the models after training, by manually selecting the checkpoint from which to retrieve the agent. To do this, run the `sheeprl_model_manager.py` script, properly specifying the `checkpoint_path`, the `model_manager`, and the MLFlow-related configurations.
The default configurations are defined in the `./sheeprl/configs/model_manager_config.yaml` file, which is reported below:
```yaml
# ./sheeprl/configs/model_manager_config.yaml

# @package _global_
defaults:
- _self_
- model_manager: ???
- override hydra/hydra_logging: disabled
- override hydra/job_logging: disabled

hydra:
  output_subdir: null
  run:
    dir: .

checkpoint_path: ???
run:
  id: null
  name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}
experiment:
  id: null
  name: ${exp_name}_${now:%Y-%m-%d_%H-%M-%S}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
```

As before, it is necessary to specify the `model_manager` configurations (the models we want to register, with their names, descriptions, and tags). Moreover, it is mandatory to set the `checkpoint_path`, which must be the path to the `ckpt` file created during training. Finally, the `run` and `experiment` parameters contain the MLFlow configurations:
* If you set `run.id` to a value different from `null`, then all the other parameters are ignored and the models will be logged and registered under the run with the specified ID.
* If you want to create a new run (with a name equal to `run.name`) and put it into an existing experiment, then you have to set `run.id=null` and `experiment.id=<experiment_id>`.
* If you set `experiment.id=null` and `run.id=null`, then a new experiment and a new run are created with the specified names.

> **Note**
>
> Also in this case, the models specified in the `model_manager` configuration must be a subset of the `MODELS_TO_REGISTER` variable.
For instance, you can register the DreamerV3 models from a checkpoint with the following command:

```bash
python sheeprl_model_manager.py model_manager=dreamer_v3 checkpoint_path=/path/to/checkpoint.ckpt
```
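
If you want to control where the models end up, the same script accepts the `run` and `experiment` overrides described above; a sketch with placeholder values:

```bash
python sheeprl_model_manager.py model_manager=dreamer_v3 \
  checkpoint_path=/path/to/checkpoint.ckpt \
  experiment.name=dreamer_v3_cartpole run.name=register_from_checkpoint
```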

## Delete, Transition and Download Models
The MLFlow model manager lets you delete registered models, transition them from one stage to another, and download them.
[This notebook](../examples/model_manager.ipynb) contains a tutorial on how to use the MLFlow model manager. We recommend taking a look at it to see what APIs the model manager makes available.
