Clearer separation between the trainer and the algorithm and refactoring of policy classes #1034
Status: Open · 3 of 9 tasks · Labels: blocked, breaking changes, major, refactoring, tentative
While Tianshou's idea of encapsulating everything learning-related into a single method of a base policy class keeps algorithms more or less self-contained, it has led to a cluttered trainer and to the need for kwargs during the update call to remain functional (see #949). Additionally, while Tianshou uses RL jargon such as "policies", its policy classes do not do what policies do in RL terms, i.e., map states to actions; instead, Tianshou interweaves the RL policy with parts of the RL algorithm. The trainer, in turn, implements some parts that may be algorithm-specific, such as sampling transitions and passing them to Tianshou policies for updates.
In order to create a cleaner framework, I think the following changes would improve things:
Separate trainer from algorithm logic
As already mentioned, the trainer currently holds the logic for training loops and logging, but also some algorithm logic. Half of the if statements and asserts in the `__next__` function are necessary only because the trainer doesn't really know what type of RL it is doing. Instead, I think it makes much more sense to move algorithm logic into new algorithm classes and let the trainer be responsible solely for running the algorithm, logging, and checkpointing. There would also be only one trainer, instead of the OnPolicyTrainer etc. Eventually, the trainer may even serve as the factory for the high-level API.

Separate algorithms from policies
While this adds an additional layer of abstraction/complexity, things will eventually become cleaner. Right now, loads of keyword arguments can be passed to the trainer, and they may not even be compatible with each other. A BaseAlgorithm class would allow for more explicit arguments for a given algorithm and would additionally accommodate unconventional algorithms such as ARS. It would also make the algorithm independent of code concerning the type of policy (continuous/discrete). The policy class is then merely the function that maps observations to actions.
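To make the proposed separation concrete, here is a rough sketch of how the three responsibilities could be split. All class and method names here are hypothetical illustrations of the idea, not existing or planned Tianshou API:

```python
from abc import ABC, abstractmethod
from typing import Any

# Hypothetical sketch of the proposed trainer/algorithm/policy split.
# Names (Policy, Algorithm, Trainer, update, run) are illustrative only.

class Policy(ABC):
    """Merely the function mapping observations to actions."""

    @abstractmethod
    def forward(self, obs: Any) -> Any: ...

class Algorithm(ABC):
    """Owns the update logic (e.g. a PPO- or DQN-style rule) and its policy,
    so algorithm-specific arguments live here, not in the trainer."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy

    @abstractmethod
    def update(self, batch: Any) -> dict: ...

class Trainer:
    """The single trainer: runs the algorithm, logs, checkpoints.
    It is agnostic to what type of RL the algorithm performs."""

    def __init__(self, algorithm: Algorithm) -> None:
        self.algorithm = algorithm

    def run(self, batches: list) -> list[dict]:
        # Training loop only; no on-/off-policy branching needed here.
        return [self.algorithm.update(batch) for batch in batches]
```

With this split, the if/assert branching in the current `__next__` would move behind the `Algorithm.update` interface, and the trainer no longer needs incompatible kwargs.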