Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Wang, Yu-Xiang; Agarwal, Alekh; Dudik, Miroslav

arXiv:1612.01205 (stat)

[Submitted on 4 Dec 2016 (v1), last revised 11 Nov 2017 (this version, v2)]

Abstract: We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
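
The abstract names three estimators without giving formulas. The sketch below shows their commonly stated forms in plain Python, as an illustration rather than the authors' implementation. All names (`pi`, `mu`, `r_hat`, `tau`, and the function names) are hypothetical. It assumes a finite action set, logging probabilities that are strictly positive, and a SWITCH rule that uses the importance-weighted reward when the weight is at most a threshold `tau` and falls back to the reward model for actions whose weight exceeds `tau`; the paper should be consulted for the exact definition.

```python
def model_value(pi_x, r_hat_x):
    """Expected model reward under the target policy for one context."""
    return sum(p * q for p, q in zip(pi_x, r_hat_x))


def ips(actions, rewards, pi, mu):
    """Inverse propensity scoring: mean of importance-weighted rewards."""
    n = len(rewards)
    return sum(pi[i][a] / mu[i][a] * r
               for i, (a, r) in enumerate(zip(actions, rewards))) / n


def direct_method(pi, r_hat):
    """Reward-model-only estimate (biased if the model is wrong)."""
    return sum(model_value(p, q) for p, q in zip(pi, r_hat)) / len(pi)


def doubly_robust(actions, rewards, pi, mu, r_hat):
    """Model baseline plus an importance-weighted correction of its residual."""
    n = len(rewards)
    total = 0.0
    for i, (a, r) in enumerate(zip(actions, rewards)):
        w = pi[i][a] / mu[i][a]
        total += model_value(pi[i], r_hat[i]) + w * (r - r_hat[i][a])
    return total / n


def switch(actions, rewards, pi, mu, r_hat, tau):
    """Assumed SWITCH rule: IPS where the weight is <= tau, model elsewhere."""
    n = len(rewards)
    total = 0.0
    for i, (a, r) in enumerate(zip(actions, rewards)):
        w = pi[i][a] / mu[i][a]
        if w <= tau:
            total += w * r  # small weight: trust the logged reward
        # large weights: use the reward model for those actions instead
        total += sum(p * q for p, m, q in zip(pi[i], mu[i], r_hat[i])
                     if p / m > tau)
    return total / n


# Toy data (made up for illustration): two contexts, two actions.
pi = [[0.9, 0.1], [0.2, 0.8]]      # target policy probabilities per context
mu = [[0.5, 0.5], [0.5, 0.5]]      # logging policy probabilities per context
r_hat = [[0.8, 0.3], [0.4, 0.1]]   # reward model predictions per action
actions, rewards = [0, 1], [1.0, 0.0]

for tau in (0.0, 1.7, float("inf")):
    print(tau, switch(actions, rewards, pi, mu, r_hat, tau))
```

Under these assumptions the threshold interpolates between the two extremes: `tau = float("inf")` reduces to IPS, and `tau = 0` reduces to the reward-model-only estimate, which is one way to read the bias-variance tradeoff mentioned in the abstract.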

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:1612.01205 [stat.ML]
	(or arXiv:1612.01205v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1612.01205
Journal reference:	International Conference on Machine Learning (pp. 3589-3597) (2017)

Submission history

From: Yu-Xiang Wang
[v1] Sun, 4 Dec 2016 23:24:17 UTC (171 KB)
[v2] Sat, 11 Nov 2017 05:57:11 UTC (362 KB)
