Feature/mlflow (Eclectic-Sheep#159)
* feat: added mlflow logger

* feat: unified get_logger methods

* feat: generalized model register

* feat: removed signature

* feat: added mlflow register model to sac, sac_decoupled and droq

* feat: added model manager to dreamers and sac_ae

* feat: added model manager to p2e algorithms

* fix: removed order dependencies between configs and code when registering models

* fix: avoid p2e exploration models registered during finetuning

* Feature/add build agents (Eclectic-Sheep#153)

* [skip ci] Update README.md

* [skip ci] Update README.md

* feat: renamed build_models function into build_agent

* feat: added build_agent() function to all the algorithms

* feat: added build_agent() to evaluate() functions

---------

Co-authored-by: Federico Belotti <[email protected]>

* feat: split model manager configs

* feat: added script to register models from checkpoints

* fix: bugs

* fix: configs

* fix: configs + registration model script

* feat: added ensembles creation to build agent function (Eclectic-Sheep#154)

* feat: added possibility to select experiment and run where to upload the models

* fix: bugs

* feat: added configs to artifact when model is registered from checkpoint

* docs: update logs_and_checkpoints how to

* feat: added model_manager howto

* docs: update

* docs: update

* fix: added 'from __future__ import annotations'

* feat: added mlflow model manager tutorial in examples

* fix: bugs

* fix: access to cnn and mlp keys

* fix: experiment and run names

* fix: bugs

* feat: MlflowModelManager.register_best_models() function

* fix: p2e build_agent

* docs: update

* fix: mlflow model manager

* fix: mlflow model manager register best models

---------

Co-authored-by: Federico Belotti <[email protected]>
michele-milesi and belerico committed Nov 28, 2023
1 parent ad59960 commit 18064e3
Showing 95 changed files with 3,689 additions and 761 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -167,4 +167,7 @@ pytest_*
!sheeprl/configs/env
.diambra*
.hydra
.pypirc
mlruns
mlartifacts
examples/models
978 changes: 978 additions & 0 deletions examples/model_manager.ipynb

Large diffs are not rendered by default.

63 changes: 63 additions & 0 deletions howto/logs_and_checkpoints.md
@@ -7,6 +7,10 @@ By default the logging of metrics is enabled with the following settings:
```yaml
# ./sheeprl/configs/metric/default.yaml

defaults:
- _self_
- /logger@logger: tensorboard

log_every: 5000
disable_timer: False

@@ -33,6 +37,7 @@ aggregator:
```
where

* `logger` is the configuration of the logger you want to use for logging. Two values are available out of the box: `tensorboard` (default) and `mlflow`, but one can also define and use a custom logger.
* `log_every` is the number of policy steps (number of steps played in the environment, e.g. if one has 2 processes with 4 environments per process then the policy steps are 2*4=8) between two consecutive logging operations. For more info about the policy steps, check the [Work with Steps Tutorial](./work_with_steps.md).
* `disable_timer` is a boolean flag that enables/disables the timer to measure both the time spent in the environment and the time spent during the agent training. The timer class used can be found [here](../sheeprl/utils/timer.py).
* `log_level` is the level of logging: $0$ means no logging (it also disables the timer), whereas $1$ means logging everything.
@@ -41,6 +46,64 @@ where

So, if one wants to disable everything related to logging, one can set `log_level` to $0$; if one wants to disable only the timer, one can set `disable_timer` to `True`.
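
For example, assuming these options live under the `metric` config group (as the file path above suggests), the following CLI overrides are a minimal sketch of the two cases:

```bash
# Disable all logging (this also disables the timer); the experiment name is just an example.
python sheeprl.py exp=ppo metric.log_level=0

# Keep logging, but turn off the timer.
python sheeprl.py exp=ppo metric.disable_timer=True
```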

### Loggers
Two loggers are made available: the Tensorboard logger and the MLFlow one; in any case, it is possible to define and use another logger.
The logger configurations are under the `./sheeprl/configs/logger/` folder.

#### Tensorboard
Let us start with the Tensorboard logger, which is the default logger used in SheepRL.

```yaml
# ./sheeprl/configs/logger/tensorboard.yaml

# For more information, check https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html
_target_: lightning.fabric.loggers.TensorBoardLogger
name: ${run_name}
root_dir: logs/runs/${root_dir}
version: null
default_hp_metric: True
prefix: ""
sub_dir: null
```
As shown in the configuration, it is necessary to specify the `_target_` class to instantiate. For the Tensorboard logger, the `name` and `root_dir` arguments must be set to the `run_name` and `logs/runs/<root_dir>` parameters, respectively, because we want all the logs and files (configs, checkpoints, videos, ...) of a specific experiment to be under the same folder.

> **Note**
>
> In general, we want the log files to be placed in the folder created by Hydra when the experiment is launched, so make sure to define the logger's `root_dir` and `name` parameters so that the logs end up within the folder created by Hydra (defined by the `hydra.run.dir` parameter). The Tensorboard logger will save the logs in the `<root_dir>/<name>/<version>/<sub_dir>/` folder (if `sub_dir` is defined, otherwise in the `<root_dir>/<name>/<version>/` folder).
The documentation of the `TensorBoardLogger` class can be found [here](https://lightning.ai/docs/fabric/stable/api/generated/lightning.fabric.loggers.TensorBoardLogger.html).

#### MLFlow
Another possibility provided by SheepRL is [MLFlow](https://mlflow.org/docs/2.8.0/index.html).

```yaml
# ./sheeprl/configs/logger/mlflow.yaml

# For more information, check https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger
_target_: lightning.pytorch.loggers.MLFlowLogger
experiment_name: ${exp_name}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
run_name: ${algo.name}_${env.id}_${now:%Y-%m-%d_%H-%M-%S}
tags: null
save_dir: null
prefix: ""
artifact_location: null
run_id: null
log_model: false
```

The parameters that can be specified for creating the MLFlow logger are explained [here](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.mlflow.html#lightning.pytorch.loggers.mlflow.MLFlowLogger).

You can select the MLFlow logger instead of the Tensorboard one from the CLI by adding the `[email protected]=mlflow` argument. In this way, Hydra will take the configurations defined in the `./sheeprl/configs/logger/mlflow.yaml` file.

```bash
python sheeprl.py exp=ppo exp_name=ppo-cartpole [email protected]=mlflow
```

> **Note**
>
> If you are using an MLFlow server, you can specify the `tracking_uri` in the config file or with the `MLFLOW_TRACKING_URI` environment variable (that is the default value in the configs).
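
A minimal sketch of such a setup, assuming a local MLFlow server (the host and port below are placeholders):

```bash
# Start a local MLFlow tracking server and expose its URI to SheepRL.
mlflow server --host 127.0.0.1 --port 5000 &
export MLFLOW_TRACKING_URI="http://127.0.0.1:5000"
```
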
### Logged metrics

Every algorithm should specify a set of default metrics to log, called `AGGREGATOR_KEYS`, under its own `utils.py` file. For instance, the default metrics logged by DreamerV2 are the following:
103 changes: 103 additions & 0 deletions howto/model_manager.md
@@ -0,0 +1,103 @@
# Model Manager

SheepRL makes it possible to register trained models on MLFlow, so as to keep track of model versions and stages.

## Register models with training
The configurations of the model manager are placed in the `./sheeprl/configs/model_manager/` folder, and the default configuration is defined as follows:
```yaml
# ./sheeprl/configs/model_manager/default.yaml

disabled: True
models: {}
```
Since the algorithms have different models, the `models` parameter is set to an empty Python dictionary by default, and each agent defines its own configuration. The `disabled` parameter indicates whether or not the agent should be registered when the training is finished (`False` means that the agent will be registered, `True` that it will not).

> **Note**
>
> The model manager can be used even if the chosen logger is Tensorboard: the only requirement is that an instance of the MLFlow server is running and accessible, and that its URI is specified in the `MLFLOW_TRACKING_URI` environment variable.
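
For instance, a minimal sketch of enabling model registration for a DreamerV3 training run (it assumes the DreamerV3 experiment already pulls in its own `model_manager` configuration, as described above; the tracking URI is a placeholder):

```bash
export MLFLOW_TRACKING_URI="http://localhost:5000"
python sheeprl.py exp=dreamer_v3 model_manager.disabled=False
```
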
To better understand how to define the configurations of the models you want to register, take a look at the DreamerV3 model manager configuration:
```yaml
# ./sheeprl/configs/model_manager/dreamer_v3.yaml

defaults:
- default
- _self_

models:
  world_model:
    model_name: "${exp_name}_world_model"
    description: "DreamerV3 World Model used in ${env.id} Environment"
    tags: {}
  actor:
    model_name: "${exp_name}_actor"
    description: "DreamerV3 Actor used in ${env.id} Environment"
    tags: {}
  critic:
    model_name: "${exp_name}_critic"
    description: "DreamerV3 Critic used in ${env.id} Environment"
    tags: {}
  target_critic:
    model_name: "${exp_name}_target_critic"
    description: "DreamerV3 Target Critic used in ${env.id} Environment"
    tags: {}
  moments:
    model_name: "${exp_name}_moments"
    description: "DreamerV3 Moments used in ${env.id} Environment"
    tags: {}
```
For each model, it is necessary to define the `model_name`, the `description`, and the `tags` (i.e., a Python dictionary with strings as keys and values). The keys that can be specified are defined by the `MODELS_TO_REGISTER` variable in the `./sheeprl/algos/<algo_name>/utils.py` file. For DreamerV3, it is defined as follows: `MODELS_TO_REGISTER = {"world_model", "actor", "critic", "target_critic", "moments"}`.
If you do not want to register some of these models, you just need to remove them from the configuration file.
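
If you prefer not to edit the YAML file, Hydra's `~` deletion syntax offers an alternative; the following is a hypothetical override (assuming the DreamerV3 setup above) that skips the target critic:

```bash
# Quoting prevents the shell from interpreting the leading "~".
python sheeprl.py exp=dreamer_v3 model_manager.disabled=False '~model_manager.models.target_critic'
```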

> **Note**
>
> The names of the models in the `MODELS_TO_REGISTER` variable are equal to the names of the variables that hold the models in the `./sheeprl/algos/<algo_name>/<algo_name>.py` file.
>
> Make sure that the models specified in the configuration file are a subset of the models defined by the `MODELS_TO_REGISTER` variable.
## Register models from checkpoints
Another possibility is to register the models after training, by manually selecting the checkpoint from which to retrieve the agent. To do this, run the `sheeprl_model_manager.py` script, properly specifying the `checkpoint_path`, the `model_manager`, and the MLFlow-related configurations.
The default configurations are defined in the `./sheeprl/configs/model_manager_config.yaml` file, which is reported below:
```yaml
# ./sheeprl/configs/model_manager_config.yaml

# @package _global_
defaults:
- _self_
- model_manager: ???
- override hydra/hydra_logging: disabled
- override hydra/job_logging: disabled

hydra:
  output_subdir: null
  run:
    dir: .

checkpoint_path: ???
run:
  id: null
  name: ${now:%Y-%m-%d_%H-%M-%S}_${exp_name}
experiment:
  id: null
  name: ${exp_name}_${now:%Y-%m-%d_%H-%M-%S}
tracking_uri: ${oc.env:MLFLOW_TRACKING_URI}
```

As before, it is necessary to specify the `model_manager` configurations (the models we want to register, with their names, descriptions, and tags). Moreover, it is mandatory to set the `checkpoint_path`, which must be the path to the `ckpt` file created during training. Finally, the `run` and `experiment` parameters contain the MLFlow configurations:
* If you set `run.id` to a value different from `null`, then all the other parameters are ignored and the models will be logged and registered under the run with the specified ID.
* If you want to create a new run (with a name equal to `run.name`) and put it into an existing experiment, then you have to set `run.id=null` and `experiment.id=<experiment_id>`.
* If you set `experiment.id=null` and `run.id=null`, then a new experiment and a new run are created with the specified names.

> **Note**
>
> Also in this case, the models specified in the `model_manager` configuration must be a subset of the `MODELS_TO_REGISTER` variable.
For instance, you can register the DreamerV3 models from a checkpoint with the following command:

```bash
python sheeprl_model_manager.py model_manager=dreamer_v3 checkpoint_path=/path/to/checkpoint.ckpt
```
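
If you want to control where the models end up, the same script accepts the `run` and `experiment` overrides described above; a sketch with placeholder values:

```bash
python sheeprl_model_manager.py model_manager=dreamer_v3 \
  checkpoint_path=/path/to/checkpoint.ckpt \
  experiment.name=dreamer_v3_cartpole run.name=register_from_checkpoint
```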

## Delete, Transition and Download Models
The MLFlow model manager lets you delete registered models, transition them from one stage to another, and download them.
[This notebook](../examples/model_manager.ipynb) contains a tutorial on how to use the MLFlow model manager. We recommend taking a look at it to see what APIs the model manager makes available.
