
Implementation of the SQIL algorithm #744

Merged: 60 commits from redtachyon/740-sqil into master on Aug 10, 2023
Conversation

@RedTachyon (Contributor) commented Jul 4, 2023

Description

Fixes #740

Right now it's a basic implementation based on SB3.

It seems to work at a basic level (i.e. I trained it on CartPole and it converged), but it still needs a bunch of cleanup and testing.

One note: there are quite a few `# type: ignore[...]` annotations. I tried to minimize them, but a good chunk of the code either modifies SB3 code or closely interfaces with it, and SB3 seems to have laxer type checking.

Testing

WIP

@codecov bot commented Jul 4, 2023

Codecov Report

Merging #744 (d2124a2) into master (2743c28) will increase coverage by 0.04%.
Report is 1 commit behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #744      +/-   ##
==========================================
+ Coverage   96.33%   96.38%   +0.04%     
==========================================
  Files          93       95       +2     
  Lines        8789     8901     +112     
==========================================
+ Hits         8467     8579     +112     
  Misses        322      322              
Files Changed Coverage Δ
src/imitation/data/rollout.py 100.00% <ø> (ø)
src/imitation/algorithms/adversarial/common.py 96.83% <100.00%> (ø)
src/imitation/algorithms/base.py 98.73% <100.00%> (-0.05%) ⬇️
src/imitation/algorithms/bc.py 98.33% <100.00%> (ø)
src/imitation/algorithms/density.py 94.48% <100.00%> (ø)
src/imitation/algorithms/sqil.py 100.00% <100.00%> (ø)
src/imitation/data/types.py 98.21% <100.00%> (+0.01%) ⬆️
src/imitation/testing/expert_trajectories.py 100.00% <100.00%> (ø)
src/imitation/util/util.py 99.19% <100.00%> (+0.01%) ⬆️
tests/algorithms/test_sqil.py 100.00% <100.00%> (ø)


@AdamGleave (Member) left a comment:

High-level review, can take a closer look once PR is further along!

RedTachyon and others added 17 commits July 5, 2023 16:51
Remove redundant parameter
* Pin SB3 version to 1.7.0 (#738)

* Update conftest.py (#742)

* Custom environment tutorial (#746)

* Custom environment tutorial draft

* Update the docs website

* Clean notebook

* Text clarification and new environment

* Decrease training duration to hopefully make CI happy

* Clarify that BC itself does not learn rewards

---------

Co-authored-by: Ariel Kwiatkowski <[email protected]>

* Tutorial on comparing algorithm performance (#747)

* Add a new tutorial

* Update index.rst

* Improvements to the tutorial

* Some more caution words

* Fix typos

---------

Co-authored-by: Ariel Kwiatkowski <[email protected]>

---------

Co-authored-by: Adam Gleave <[email protected]>
@AdamGleave (Member) left a comment:

Thanks for the implementation! Algorithm looks correct. SQIL is a strange beast but at least it's quite simple conceptually.

I think the implementation could be simplified and code duplication reduced by moving most of the logic into a ReplayBuffer wrapper, then setting DQN to use that replay buffer (probably via replay_buffer_class). I think you can still do this with composition (which I agree seems the right approach: SQIL is not an RL algorithm, so it probably shouldn't inherit from DQN, and multiple inheritance gets messy). I may be missing some subtlety, but if I'm right this would let us eliminate many lines of code, making it much easier to read. It could also pave the way to supporting any OffPolicyAlgorithm rather than just DQN, which would be neat.
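The wrapper idea above can be sketched in a few lines. This is a hypothetical illustration only: the class and method names are made up, and the real SB3 ReplayBuffer API takes more arguments (next observations, dones, device handling) than shown here.

```python
import numpy as np

class SQILBufferSketch:
    """Hypothetical sketch of the reviewer's suggestion: a replay buffer
    that stores agent transitions with reward 0 and mixes in expert
    transitions with reward 1 at sampling time. In SB3, this role would be
    played by a ReplayBuffer subclass passed via replay_buffer_class."""

    def __init__(self, expert_obs, expert_acts):
        # Expert demonstrations; their reward is fixed to 1 by relabeling.
        self.expert_obs = np.asarray(expert_obs)
        self.expert_acts = np.asarray(expert_acts)
        self.agent_obs = []
        self.agent_acts = []

    def add(self, obs, act, env_reward):
        # SQIL discards the environment reward: agent data is labeled 0.
        self.agent_obs.append(obs)
        self.agent_acts.append(act)

    def sample(self, batch_size, rng):
        # Half the batch from agent experience (reward 0),
        # half from the expert demonstrations (reward 1).
        n_agent = batch_size // 2
        n_expert = batch_size - n_agent
        ai = rng.integers(len(self.agent_obs), size=n_agent)
        ei = rng.integers(len(self.expert_obs), size=n_expert)
        obs = np.concatenate([np.asarray(self.agent_obs)[ai], self.expert_obs[ei]])
        acts = np.concatenate([np.asarray(self.agent_acts)[ai], self.expert_acts[ei]])
        rewards = np.concatenate([np.zeros(n_agent), np.ones(n_expert)])
        return obs, acts, rewards
```

A DQN trained on batches drawn from such a buffer is, in effect, running SQIL; next-observation and done-flag handling are omitted here for brevity.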

The other main area is that it'd be nice to have slightly more comprehensive tests. To be fair, the algorithm is so simple that there's not that much to actually test (beyond the correct functioning of DQN). Checking that it makes some progress on a simple environment might add something. If checking that returns improve is too flaky (at least without an expensive number of timesteps), we could also check that the Q-network is moving in the right direction (e.g. assigns higher Q-values to expert demos than to randomly chosen observation/action pairs)?
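The suggested Q-value sanity check could look roughly like this. It is illustrative only: `q_values` and `expert_q_advantage` are made-up names standing in for a trained Q-network and a test helper.

```python
import numpy as np

def expert_q_advantage(q_values, expert_pairs, random_pairs):
    """Return mean Q on expert (obs, action) pairs minus mean Q on random
    pairs. A Q-network trained with SQIL should make this positive."""
    expert_mean = np.mean([q_values(o, a) for o, a in expert_pairs])
    random_mean = np.mean([q_values(o, a) for o, a in random_pairs])
    return expert_mean - random_mean
```

A test would then assert `expert_q_advantage(...) > 0` instead of (or in addition to) checking that episode returns improve.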

"\n",
"Soft Q Imitation Learning ([SQIL](https://arxiv.org/abs/1905.11108)) is a simple algorithm that can be used to clone expert behavior.\n",
"It's fundamentally a modification of the DQN algorithm. At each training step, whenever we sample a batch of data from the replay buffer,\n",
"we also sample a batch of expert data. Expert demonstrations are assigned a reward of 1, while the agent's own transitions are assigned a reward of 0.\n",
A reviewer (Member) commented:
This is an accurate description, but it does highlight that the algorithm is a bit bizarre. If the agent perfectly mimicked the expert demos, its transitions would still get assigned a reward of 0. Whereas with AIRL/GAIL, at least, they'd get the same reward (since the discriminator could no longer distinguish them).

@RedTachyon (Author) replied:
It is bizarre, but that's probably unavoidable, at least in continuous observation spaces (i.e. everything that's not tabular): if we tried to relabel generated data when it matches the demonstrations, that would basically never happen, because we'd be comparing floats for equality.

This makes me wonder whether this method would be particularly vulnerable to adversarial attacks, since there's a big difference between being in state [1.1234] vs. state [1.1235]. A larger network could probably overfit to that, which would be less likely with a denser reward.
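The float-equality point is easy to demonstrate (illustrative numbers, not from the PR):

```python
import numpy as np

# In a continuous observation space, an exact-match relabeling rule would
# essentially never fire: independently sampled float observations do not
# compare equal, even across thousands of samples.
rng = np.random.default_rng(0)
demo_obs = rng.normal(size=(1000, 4))    # stand-in for expert observations
agent_obs = rng.normal(size=(1000, 4))   # stand-in for agent observations
matches = (demo_obs[:, None, :] == agent_obs[None, :, :]).all(axis=-1).sum()
# matches comes out 0: states like [1.1234] and [1.1235] are distinct keys.
```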

@AdamGleave (Member) commented:

Asking @jas-ho to look at fixing the failing/flaky tests.

Code under review (excerpt):

    demonstrations=rollouts,
    policy="MlpPolicy",
)
# Hint: set to 1_000_000 to match the expert performance.
A contributor commented:
100_000 was already sufficient to reach expert performance (tried only a couple of times though)

Code under review (excerpt):

dqn_kwargs=dict(
    learning_starts=500,
    learning_rate=0.002,
    batch_size=220,
A contributor commented:

I searched for good hyperparameters using Optuna. With these hyperparameters I ended up with 1 failure out of 64 runs on my machine, whereas before it was as many as 1 out of 5 (and on the CI pipeline it seems to have failed almost every time for the last ten or so runs).

A contributor commented:

I've also tried pushing the number of episodes in evaluate_policy up to 100, but it did not improve things further, so I reverted it.

A contributor commented:

@AdamGleave, based on the hyperparameter search and the manual testing I did, I do not think the flakiness here points to a bug. Therefore, I think our best option is to fix the seeds for this specific test. Given how slow the CI pipeline is, I think it's not good to have even a 2% residual rate of flakiness.

A contributor commented:

See 2bf467d, which passes on CI (except for codecov). I also ran `pytest --flake-finder tests/algorithms/test_sqil.py --flake-runs=16 -n 8` locally and found no failures for test_sqil_performance and test_sqil_demonstration_buffer, which were the ones failing in CI previously.

A reviewer (Member) commented:

Thanks for doing the hyperparameter sweep! Yeah, let's fix the seed. Given that it tests for a significant improvement in reward, it should be a non-trivial test even with only a single seed (if we were just checking for any improvement, it'd be 50/50 whether it passed even if the algorithm were no better than random). We could also @pytest.mark.parametrize the seed for extra robustness, if the test runs quickly enough.

A contributor commented:

test_sqil_performance already takes ~20 seconds on my machine. Given that I did not cherry-pick the seed, I don't think additional parametrization improves robustness enough to trade off favorably against decreased dev velocity.

@AdamGleave (Member) left a comment:

Thanks for tuning the hyperparameters; I agree it seems unlikely to be a bug given the high success rate after tuning.

Happy for you to make the other changes you suggested; please request a re-review from me & ernestum once done.

@jas-ho (Contributor) commented Aug 9, 2023:

> Thanks for tuning hyperparams, agree seems unlikely to be a bug given high success rate after tuning.
> Happy for you to make the other changes you suggested; please request a re-review from me & ernestum once done.

I addressed your comments, @AdamGleave. From my side it's ready for final review.

@ernestum (Collaborator) commented Aug 9, 2023:

@AdamGleave, will you review this or should I have a look?

@AdamGleave (Member) left a comment:

LGTM

@AdamGleave (Member) left a comment:

Please decide if you want to make this change or not prior to merging, @jas-ho, but I don't need to re-review just for that.

@jas-ho (Contributor) commented Aug 10, 2023:

> Please decide if you want to make this change or not prior to merging @jas-ho , but I don't need to re-review just for that.

That change was already implemented, so it might have just been a GitHub display issue. -> LGTM :)

@ernestum (Collaborator) left a comment:

LGTM

@ernestum ernestum merged commit fd4d8f0 into master Aug 10, 2023
15 checks passed
@ernestum ernestum deleted the redtachyon/740-sqil branch August 10, 2023 08:19
Successfully merging this pull request may close these issues.

Implement Soft Q imitation learning (SQIL)
4 participants