Offline Policy Comparison with Confidence (opcc)

A benchmark comprising "policy comparison queries" (pcq) for evaluating uncertainty estimation in offline reinforcement learning.

Research Paper: https://arxiv.org/abs/2205.10739

Website/Docs: https://koulanurag.dev/opcc


Installation

1. Setup Mujoco

  • Download mujoco210 and unzip it in ~/.mujoco
  • Add the following to your .bashrc/.zshrc and source it:
    export MUJOCO_PY_MUJOCO_PATH=$HOME/.mujoco/mujoco210/
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin

    (You can also refer here for step-by-step instructions on mujoco installation)

2. Setup Python 3.7+ and, optionally (recommended), create a virtualenv (refer here)

3. The Python package and its dependencies can be installed using:

pip3 install --upgrade 'pip<=23.0.1'
pip3 install --upgrade 'setuptools<=66'
pip3 install --upgrade 'wheel<=0.38.4'
pip3 install git+https://github.com/koulanurag/opcc@main#egg=opcc

4. Install PyTorch (>= 1.8.0)
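
A quick way to verify the installation is to import the package and list the datasets registered for an environment (a minimal sketch; it assumes the steps above completed without errors):

import opcc

# prints the dataset names available for HalfCheetah-v2,
# e.g. random, expert, medium, medium-replay, medium-expert
print(opcc.get_dataset_names('HalfCheetah-v2'))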

Usage

Queries:

import opcc
import numpy as np
from sklearn import metrics

env_name = 'HalfCheetah-v2'
dataset_name = 'random'

# ########################################################
# Policy Comparison Queries (PCQ) (Section : 3.1 in paper)
# ########################################################
# Queries are dictionaries keyed by (policy_a_id, policy_b_id)
# pairs, with the corresponding query batches as values.
queries = opcc.get_queries(env_name)


def random_predictor(obs_a, obs_b, action_a, action_b,
                     policy_a, policy_b, horizon):
    answer = np.random.randint(low=0, high=2, size=len(obs_a)).tolist()
    confidence = np.random.rand(len(obs_a)).tolist()
    return answer, confidence


targets = []
predictions = []
confidences = []
# Batch iteration through queries:
for (policy_a_id, policy_b_id), query_batch in queries.items():
    # retrieve policies
    policy_a, _ = opcc.get_policy(*policy_a_id)
    policy_b, _ = opcc.get_policy(*policy_b_id)

    # query-a
    obs_a = query_batch['obs_a']
    action_a = query_batch['action_a']

    # query-b
    obs_b = query_batch['obs_b']
    action_b = query_batch['action_b']

    # horizon for policy evaluation
    horizon = query_batch['horizon']

    # ground truth binary vector:
    # (Q(obs_a, action_a, policy_a, horizon)
    # <  Q(obs_b, action_b, policy_b, horizon))
    target = query_batch['target'].tolist()
    targets += target

    # Let's make predictions for the given queries.
    # One can use any mechanism to predict the corresponding
    # answer to queries, and we simply use a random predictor
    # over here for demonstration purposes
    p, c = random_predictor(obs_a, obs_b, action_a, action_b,
                            policy_a, policy_b, horizon)
    predictions += p
    confidences += c
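
For reference, the contents of one query batch can be inspected as follows (a sketch that only reuses the key names already shown in the loop above; the policy ids are the argument tuples passed to opcc.get_policy):

(policy_a_id, policy_b_id), query_batch = next(iter(queries.items()))
print(policy_a_id, policy_b_id)
for key in ['obs_a', 'action_a', 'obs_b', 'action_b', 'horizon', 'target']:
    print(key, np.array(query_batch[key]).shape)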

Evaluation Metrics:

# #########################################
# (Section 3.3 in paper)
# #########################################
loss = np.logical_xor(predictions, targets)  # we use 0-1 loss for demo

# List of tuples (coverage, selective_risks, tau)
coverage_sr_tau = []
tau_interval = 0.01
for tau in np.arange(0, 1 + 2 * tau_interval, tau_interval):
  non_abstain_filter = np.array(confidences) >= tau
  if any(non_abstain_filter):
    selective_risk = np.sum(loss[non_abstain_filter])
    selective_risk /= np.sum(non_abstain_filter)
    coverage = np.mean(non_abstain_filter)
    coverage_sr_tau.append((coverage, selective_risk, tau))
  else:
    # 0 risk for 0 coverage
    coverage_sr_tau.append((0, 0, tau))

coverages, selective_risks, taus = list(zip(*sorted(coverage_sr_tau)))
assert selective_risks[0] == 0 and coverages[0] == 0, "zero-coverage point not found"
assert coverages[-1] == 1, 'complete coverage not found'

# AURCC (Area Under Risk-Coverage Curve): ideally, we would like it to be 0
aurcc = metrics.auc(x=coverages, y=selective_risks)

# Reverse-pair-proportion
rpp = np.logical_and(np.expand_dims(loss, 1)
                     < np.expand_dims(loss, 1).transpose(),
                     np.expand_dims(confidences, 1)
                     < np.expand_dims(confidences, 1).transpose()).mean()

# Coverage Resolution (cr_k): ideally, we would like it to be 1
k = 10
bins = np.arange(0, 1, 1 / k)
cr_k = np.unique(np.digitize(coverages, bins)).size / len(bins)

print("aurcc: {}, rpp: {}, cr_{}:{}".format(aurcc, rpp, k, cr_k))

Dataset:

# ###########################################
# Datasets: (Section 4 in paper - step (1) )
# ###########################################

import opcc

env_name = 'HalfCheetah-v2'

# list all dataset names corresponding to an env
dataset_names = opcc.get_dataset_names(env_name)

dataset_name = 'random'
# This is a very thin wrapper over D4RL datasets.
dataset = opcc.get_qlearning_dataset(env_name, dataset_name)
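
Since it mirrors D4RL, the returned object can be inspected like a standard D4RL q-learning dataset (a sketch assuming, as in D4RL, a dictionary of numpy arrays keyed by 'observations', 'actions', 'next_observations', 'rewards' and 'terminals'):

print(dataset_names)
for key, value in dataset.items():
    print(key, value.shape)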

Policy Usage:

import opcc, gym, torch

env_name = "HalfCheetah-v2"
model, model_info = opcc.get_policy(env_name, pre_trained=1)

done = False
env = gym.make(env_name)

obs = env.reset()
while not done:
    action = model(torch.tensor(obs).unsqueeze(0))
    action = action.data.cpu().numpy()[0].astype('float32')
    obs, reward, done, step_info = env.step(action)
    env.render()
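
The same rollout loop can be used for a rough estimate of a policy's return; a minimal sketch follows (the statistics reported in the tables below were generated with scripts/generate_policy_stats.py, not with this snippet):

import numpy as np

returns = []
for _ in range(5):  # a handful of evaluation episodes
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        with torch.no_grad():
            action = model(torch.tensor(obs).unsqueeze(0))
        action = action.data.cpu().numpy()[0].astype('float32')
        obs, reward, done, _ = env.step(action)
        episode_return += reward
    returns.append(episode_return)

print('{:.2f} ± {:.2f}'.format(np.mean(returns), np.std(returns)))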

Benchmark Information

  • We borrow datasets from D4RL
  • Queries can be visualized HERE
  • Baselines can be found HERE

🔅 d4rl:maze2d

maze2d-environments

Datasets:

| Environment Name      | Datasets          | Query-Count |
|-----------------------|-------------------|-------------|
| d4rl:maze2d-open-v0   | 1k, 10k, 100k, 1m | 1500        |
| d4rl:maze2d-medium-v1 | 1k, 10k, 100k, 1m | 1500        |
| d4rl:maze2d-umaze-v1  | 1k, 10k, 100k, 1m | 1500        |
| d4rl:maze2d-large-v1  | 1k, 10k, 100k, 1m | 121*        |

Pre-trained policy performance:

| Environment Name      | pre_trained=1 (best) | pre_trained=2 | pre_trained=3 | pre_trained=4 (worst) |
|-----------------------|----------------------|---------------|---------------|-----------------------|
| d4rl:maze2d-open-v0   | 122.2±10.61          | 104.9±22.19   | 18.05±14.85   | 4.85±8.62             |
| d4rl:maze2d-medium-v1 | 245.55±272.75        | 203.75±252.61 | 256.65±260.16 | 258.55±262.81         |
| d4rl:maze2d-umaze-v1  | 235.5±35.45          | 197.75±58.21  | 23.4±73.24    | 3.2±9.65              |
| d4rl:maze2d-large-v1  | 231.35±268.37        | 160.8±201.97  | 50.65±76.94   | 9.95±9.95             |

🔅 mujoco(gym)

mujoco-halfcheetah mujoco-hopper mujoco-walker2d

Datasets:

| Environment Name | Datasets                                             | Query-Count |
|------------------|------------------------------------------------------|-------------|
| HalfCheetah-v2   | random, expert, medium, medium-replay, medium-expert | 1500        |
| Hopper-v2        | random, expert, medium, medium-replay, medium-expert | 1500        |
| Walker2d-v2      | random, expert, medium, medium-replay, medium-expert | 1500        |

Pre-trained Policy performance:

| Environment Name | pre_trained=1 (best) | pre_trained=2  | pre_trained=3  | pre_trained=4 (worst) |
|------------------|----------------------|----------------|----------------|-----------------------|
| HalfCheetah-v2   | 1169.13±80.45        | 1044.39±112.61 | 785.88±303.59  | 94.79±40.88           |
| Hopper-v2        | 1995.84±794.71       | 1466.71±497.1  | 1832.43±560.86 | 236.51±1.09           |
| Walker2d-v2      | 2506.9±689.45        | 811.28±321.66  | 387.01±42.82   | 162.7±102.14          |

Testing Package:

  • Install : pip install -e ".[test]"
  • Run: pytest -v
  • Testing is computationally expensive, as we validate ground-truth value estimates and the corresponding labels. These checks can be disabled by setting the following flags:
    export SKIP_QUERY_TARGET_TESTS=1 # disable target estimation and label validation  
    export SKIP_Q_LEARNING_DATASET_TEST=1  # disable test for checking dataset existence
    export SKIP_SEQUENCE_DATASET_TEST=1 # disables test for checking sequence dataset
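
The same flags can also be set from Python when running the test suite programmatically (a sketch using pytest's standard pytest.main entry point; it assumes the test suite reads these variables from os.environ):

import os
import pytest

os.environ['SKIP_QUERY_TARGET_TESTS'] = '1'
os.environ['SKIP_Q_LEARNING_DATASET_TEST'] = '1'
os.environ['SKIP_SEQUENCE_DATASET_TEST'] = '1'
pytest.main(['-v'])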

Development:

  • Install : pip install -e ".[all]"
  • Generate-Queries:
    # Mujoco (Gym) Environment
    python scripts/generate_queries.py --env-name HalfCheetah-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/generate_queries.py --env-name Hopper-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/generate_queries.py --env-name Walker2d-v2 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.1 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    
    # Maze Environment
    python scripts/generate_queries.py --env-name d4rl:maze2d-large-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/generate_queries.py --env-name d4rl:maze2d-umaze-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/generate_queries.py --env-name d4rl:maze2d-medium-v1 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.2 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
    python scripts/generate_queries.py --env-name d4rl:maze2d-open-v0 --horizons 10 20 30 40 50 --policy-ids 1 2 3 4 --noise 0.5 --eval-runs 10 --ignore-delta-per-horizons 10 10 10 10 10 --max-trans-count 2000 --ignore-stuck-count 1000 --save-prob 0.6 --per-policy-comb-query 250 --use-wandb
  • Generate policy performance stats for readme:
    python scripts/generate_policy_stats.py --all-envs

Contact

If you have any questions or suggestions, please open an issue on this GitHub repository.