Major modeling refactoring #165
Conversation
Talking to some people, it seems that "Transductive" is a better name than "Search", since "search" is too broad in scope and the line is a bit blurred regarding what each algorithm specifically does. "Transductive" means "directly optimize the parameters specifically for an instance", which conveys the meaning more easily!
Yep! I remember you mentioned this before, and that was what I used :-)
Great job! I included a few comments and suggestions, but nothing major or important :)
Co-authored-by: ahottung <[email protected]>
I noticed while doing the metaclasses that
NAR refactoring needed @Furffico
LGTM! Leaving some minor comments there.
`# TODO: modularize inside the envs`
This means to add a `num_starts` param in the init td from the environments, right?
Yep, theoretically, it can be obtained through the environments
```{eval-rst}
.. tip::
   Note that in RL4CO we distinguish the RL algorithms and the actors via the following naming:

   * **Model:** Refers to the reinforcement learning algorithm encapsulated within a `LightningModule`. This module is responsible for training the policy.
   * **Policy:** Implemented as a `nn.Module`, this neural network (often referred to as the *actor*) takes an instance and outputs a sequence of actions, :math:`\pi = \pi_0, \pi_1, \dots, \pi_N`, which constitutes the solution.

   Here, :math:`\pi_i` represents the action taken at step :math:`i`, forming a sequence that leads to the optimal or near-optimal solution for the given instance.
```
We could mention here or somewhere else that the abstract classes under rl4co/models/common are not expected to be instantiated directly. For example, if you want to use an autoregressive policy, you may want to init an AM model instead of AutoregressivePolicy(); the same goes for the NAR, improvement, and transductive classes.
LGTM! Nice documentation.
Important: Thanks for your revisions! We are planning to merge the PR into
Great job on the refactoring! I only have one minor comment regarding the configuration of different model-policy combinations. Maybe we can add the example to the Hydra tutorial.
configs/model/am-ppo.yaml
Regarding different algorithm-architecture combinations, it might be better to configure those combinations using Hydra. In fact, using Hydra's nested instantiation we can already do something like this:
```yaml
# @package _global_

model:
  _target_: rl4co.models.PPO
  policy:
    _target_: rl4co.models.AttentionModelPolicy
    env_name: ${env.name}
  ppo_epochs: 4
  metrics:
    train: ["loss", "reward", "surrogate_loss", "value_loss", "entropy_bonus"]
```
Might be beneficial to note this in the docs / examples
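For reference, here is a rough plain-Python equivalent of what that nested instantiation would build; the `TSPEnv` environment and the exact constructor arguments are illustrative assumptions, not a prescription of the current API:

```python
# Rough plain-Python equivalent of the Hydra config above (a sketch, not the exact API).
from rl4co.envs import TSPEnv  # any environment would do; TSPEnv is just an example
from rl4co.models import PPO, AttentionModelPolicy

env = TSPEnv(num_loc=20)                          # stands in for the ${env.name} resolution
policy = AttentionModelPolicy(env_name=env.name)  # the "architecture" part of the combo
model = PPO(env, policy=policy, ppo_epochs=4)     # the "algorithm" part of the combo
```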
Description
This PR is for a major, long-overdue refactoring of the RL4CO codebase 😄
Motivation and Context
So far, we had mostly overfitted RL4CO to the autoregressive Attention Model structure (encoder-decoder). However, there are several models that do not necessarily follow this, such as DeepACO. Implementing such a model requires changes to the structure, which then starts to become non-standard, and it could be hard for newcomers to implement a different model type. For this reason, some rethinking of the library on the modeling side is necessary!
Tip
Note that in RL4CO we refer to *model* as the RL algorithm and *policy* as the neural network that, given an instance, gives back a sequence of actions $\pi_0, \pi_1, \dots, \pi_N$, i.e. the solution. In other words: the model is a `LightningModule` that trains the policy, which is a `nn.Module`.
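As a minimal sketch of this split (class names follow the `rl4co.models` namespace used elsewhere in this PR; constructor arguments may differ from the current API):

```python
# Sketch of the model/policy split: the policy is the network, the model trains it.
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel, AttentionModelPolicy

env = TSPEnv(num_loc=20)
policy = AttentionModelPolicy(env_name=env.name)  # nn.Module: instance -> sequence of actions
model = AttentionModel(env, policy=policy)        # LightningModule: RL algorithm that trains the policy
```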
New structure
With the new structure, the aim is to categorize NCO approaches (which are not necessarily trained with RL!) into the following: 1) constructive, 2) improvement, 3) transductive.
1) Constructive (policy)
Constructive NCO pre-trains a policy to amortize the inference. "Constructive" means that a solution is created from scratch by the model. We can also categorize constructive NCO into two sub-categories depending on the roles of the encoder and decoder:
1a) Autoregressive (AR)
Autoregressive approaches use a decoder that outputs log probabilities for the current solution. These approaches generate a solution step by step, similarly to e.g. LLMs. They have an encoder-decoder structure (e.g. AM). Some models may not have an encoder at all and just re-encode at each step (e.g. BQ-NCO).
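As a rough illustration (not the exact RL4CO interface), an AR rollout could look like the sketch below; the `policy.encoder`/`policy.decoder` split and the TensorDict/TorchRL-style `env.step` usage are assumptions:

```python
# Hedged sketch of an autoregressive rollout: encode once, then decode step by step.
import torch

def greedy_ar_rollout(policy, env, td):
    hidden = policy.encoder(td)                    # one-shot encoding of the instance
    actions, log_probs = [], []
    while not td["done"].all():
        log_p = policy.decoder(td, hidden)         # per-step log-probabilities over actions
        action = log_p.argmax(dim=-1)              # greedy decoding, for illustration only
        log_probs.append(log_p.gather(-1, action.unsqueeze(-1)).squeeze(-1))
        actions.append(action)
        td.set("action", action)
        td = env.step(td)["next"]                  # advance the environment state
    return torch.stack(actions, -1), torch.stack(log_probs, -1)
```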
1b) NonAutoregressive (NAR)
The difference between AR and NAR approaches is that NAR approaches only use an encoder (they encode in one shot) and generate, for example, a heatmap, which can then be decoded simply by using it as a probability distribution or by applying some search method on top (e.g. DeepACO).
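A rough sketch of the NAR idea (illustrative names, not the DeepACO implementation):

```python
# Hedged sketch of non-autoregressive decoding: one-shot heatmap, then a simple decode.
import torch

def nar_decode(policy, td):
    heatmap = policy.encoder(td)            # e.g. scores over edges, produced in one shot
    probs = torch.softmax(heatmap, dim=-1)  # interpret the heatmap as a distribution
    return probs.argmax(dim=-1)             # or sample / run a search method (e.g. ACO) on top
```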
2) Improvement (policy)
These methods differ from constructive NCO in that they can obtain better solutions similarly to how local search algorithms work: they can improve the solutions over time. This is different from decoding strategies or similar techniques in constructive methods, since these policies are trained to perform improvement operations.
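For intuition, a single improvement "action" could be something like a 2-opt move on a TSP tour; this is generic local-search code, not the RL4CO improvement-policy API, and an improvement policy would learn to pick the indices instead of enumerating them:

```python
# Generic 2-opt move as an example of an improvement operation (not the RL4CO API).
import torch

def two_opt(tour: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """Reverse the segment tour[i..j], a classic improvement move."""
    new_tour = tour.clone()
    new_tour[i : j + 1] = torch.flip(tour[i : j + 1], dims=[0])
    return new_tour

tour = torch.arange(10)          # a (trivial) starting solution
improved = two_opt(tour, 2, 6)   # an improvement policy would output the pair (2, 6)
```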
Note: You may have a look here for the basic constructive NCO policy structure! ;)
3) Transductive (model)
Tip
Read the definition of inductive vs transductive RL. In inductive RL, we train to generalize to new instances. In transductive RL we train (or finetune) to solve only specific ones.
Transductive models are learning algorithms that optimize on a specific instance: they improve solutions by updating the policy parameters $\theta$, which means that we are running optimization (backprop) during online testing. Transductive learning can be performed with different policies: for example, EAS updates (a part of) the parameters of AR policies to obtain better solutions, but I guess there are other ways (or papers out there I don't know of) to optimize at test time.
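Conceptually (and very loosely in the spirit of EAS), test-time optimization of part of the policy could look like the sketch below; the policy call signature and output keys are assumptions, not the exact RL4CO API:

```python
# Hedged sketch of transductive (test-time) optimization on a single instance.
import torch

def transductive_optimize(policy, env, td, steps=100, lr=1e-3):
    params = list(policy.decoder.parameters())   # e.g. adapt only a part of the policy
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        out = policy(td.clone(), env, decode_type="sampling")
        loss = -(out["reward"] * out["log_likelihood"]).mean()  # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```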
In practice, here is what the structure looks like right now:
Changelog
- Renamed `search` to `transductive`
- `embedding_dim` -> `embed_dim` (see PyTorch)
- `env_name` as a mandatory parameter
- Added `evaluate`, which simply takes in an action if provided and gets its log probs
- Removed `evaluate_action`, since it can be simply done via the above!

Types of changes
TODO
Extra
- `policy.encoder` + `value_head` (this way, any model should be able to have a critic)

Special thanks to @LTluttmann for your help and feedback~
Do you have some ideas / feedback on the above PR?
CC: @Furffico @henry-yeh @ahottung @bokveizen
Also tagging @yining043 for the coming improvement methods