
Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

Thanh Nguyen-Tang
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
[email protected]
Raman Arora
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
[email protected]
Abstract

We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner’s strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on policy regret, a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of consistent adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.

1 Introduction

Recent years have witnessed tremendous advances in reinforcement learning for various challenging domains in AI, from the game of Go (Silver et al., 2016, 2017, 2018), real-time strategy games such as StarCraft II (Vinyals et al., 2019) and Dota (Berner et al., 2019), and autonomous driving (Shalev-Shwartz et al., 2016), to socially complex games such as hide-and-seek (Baker et al., 2019) and capture-the-flag (Jaderberg et al., 2019), and highly tactical games such as the poker game Texas hold’em (Moravčík et al., 2017; Brown and Sandholm, 2018). Notably, most challenging RL applications can be systematically framed as multi-agent reinforcement learning (MARL), wherein multiple strategic agents learn to act in a shared environment (Yang and Wang, 2020; Zhang et al., 2021).

Despite the empirical successes, the theoretical foundations of MARL are underdeveloped, especially in settings where the learner faces adaptive opponents who can strategically adapt and react to the learner’s policies. Consider for example the optimal taxation problem in the AI economist (Zheng et al., 2020), a game that simulates dynamic economies that involve multiple actors (e.g., the government and its citizens) who strategically contribute to the game dynamics. The government agent learns to set a tax rate that optimizes for the economic equality and productivity of its citizens, whereas the citizens, who may have their own interests, respond adaptively to the tax policies of the government agent (e.g., by relocating to states that offer generous tax rates). Such adaptive behavior of participating agents is a crucial component in other applications as well, such as mechanism design (Conitzer and Sandholm, 2002; Balcan et al., 2005) and optimal auctions (Cole and Roughgarden, 2014; Dütting et al., 2019).

The question of learning against adaptive opponents has been mostly studied under the framework of external regret, wherein the agent is required to compete with the best fixed policy in hindsight (Liu et al., 2022). However, external regret is not adequate to study adaptive opponents as it does not take into account the counterfactual response of the opponents. This motivates us to study MARL using the framework of policy regret (Arora et al., 2012), a counterfactual notion that aims to compete with the return that would have been attained if the agent had followed the best fixed sequence of policies in hindsight. Even though policy regret is now a standard notion to study adaptive adversaries and has been extensively studied in online (bandit) learning (Merhav et al., 2002; Arora et al., 2012; Malik et al., 2022) and repeated games (Arora et al., 2018), it has not received much attention in the multi-agent reinforcement learning setting. In this paper, we aim to fill this gap. We consider two-player Markov games (MGs) (Shapley, 1953; Littman, 1994) as a model for MARL, wherein one agent (the learner) learns to act against an adaptive opponent. We provide a series of negative and positive results for policy regret minimization in Markov games, highlighting the fundamental limits of learning and showcasing key principles underpinning the design of efficient learning algorithms against adaptive adversaries.

Fundamental barriers.

We first show that any learner must incur a linear policy regret against an adaptive opponent who can adapt and remember the learner’s past policies (Theorem 1). When the opponent has a bounded memory span, any learner must require an exponential number of samples $\Omega((SA)^{H}/\epsilon^{2})$ to obtain an $\epsilon$-suboptimal policy regret, even with the weakest form of memory wherein the opponent is oblivious (Theorem 2). When the memory-bounded opponent’s response is stationary, i.e., the response function does not vary with episodes, learning is still statistically hard when the learner’s policy set is exponentially large, as in this case the policy regret necessarily scales polynomially with the cardinality of the learner’s policy set (Theorem 3).

Efficient algorithms.

Motivated by these statistical hardness results, we consider a structural condition on the response of the opponents, which we refer to as consistent behavior, wherein the opponent responds similarly to similar sequences of policies (5). We propose two algorithms, OPO-OMLE (Algorithm 1) and APE-OVE (Algorithm 3), that obtain $\sqrt{T}$ policy regret against $m$-memory bounded, stationary, and consistent adversaries, for $m=1$ and $m\geq 1$, respectively.

  • For memory length $m=1$: We show that OPO-OMLE obtains a policy regret upper bound of $\tilde{\mathcal{O}}(H^{3}S^{2}AB+\sqrt{H^{5}SA^{2}BT})$ when the learner’s policy set is the set of all deterministic Markov policies, where $H$ is the episode length, $S$ is the number of states, $A$ and $B$ are the numbers of actions for the learner and the opponent, respectively, and $T$ is the number of episodes.

  • For general memory length $m\geq 1$: We show that APE-OVE obtains a policy regret upper bound of $\tilde{\mathcal{O}}\left((m-1)H^{2}SAB+\sqrt{H^{3}SAB}\,(SAB(H+\sqrt{S})+H^{2})\sqrt{\frac{T}{d^{*}}}\right)$, where $d^{*}$ is an instance-dependent quantity that features the minimum positive visitation probability.

We provide a summary of our main results in Table 1.

\begin{tabular}{ll}
\hline
Opponent’s Adaptive Behavior & Policy Regret \\
\hline
Unbounded memory & $\Omega(T)$ \\
$m$-memory bounded ($m\geq 0$) & $\Omega(\sqrt{T(SA)^{H}})$ \\
$m$-memory bounded + stationary ($m\geq 1$) & $\Omega(\min\{T,A^{HS}\})$ \\
$1$-memory bounded + stationary + consistent & $\tilde{\mathcal{O}}(H^{3}S^{2}AB+\sqrt{H^{5}SA^{2}BT})$ \\
$m$-memory bounded + stationary + consistent & $\tilde{\mathcal{O}}\left((m-1)H^{2}SAB+\sqrt{H^{3}SAB}\,(SAB(H+\sqrt{S})+H^{2})\sqrt{\frac{T}{d^{*}}}\right)$ \\
\hline
\end{tabular}

Table 1: Summary of main results for learning against adaptive adversaries. The learner’s policy set is the set of all deterministic Markov policies. The case $m=0$ + stationary corresponds to standard single-agent MDPs.

2 Related work

Learning in Markov games.

Learning problems in Markov games have been studied extensively in the MARL literature. Most existing works focus on learning Nash equilibria either with known dynamics or infinite data (Littman, 1994; Hu and Wellman, 2003; Hansen et al., 2013; Wei et al., 2020), or otherwise in a self-play setting wherein we control all the players (Wei et al., 2017; Bai et al., 2020; Bai and Jin, 2020; Xie et al., 2020; Liu et al., 2021), or in an online setting wherein we control one player to learn against other potentially adversarial players (Brafman and Tennenholtz, 2002; Wei et al., 2020; Tian et al., 2021; Jin et al., 2022). Other related work focuses on exploiting sub-optimal opponents via no-external regret learning (Liu et al., 2022) and studying Stackelberg equilibria in two-player general-sum turn-based MGs, wherein only one player is allowed to take actions in each state (Ramponi and Restelli, 2022).

Policy regret in online learning settings.

Policy regret minimization has been studied mostly in online (bandit) learning problems. It was first studied in a full information setting (Merhav et al., 2002) and extended to the bandit setting and more powerful competitor classes using swap regret and $\Phi$-regret (Arora et al., 2012). A lower bound of $T^{2/3}$ on policy regret in a bandit setting was provided by Dekel et al. (2014) and was later extended to action spaces with a metric (Koren et al., 2017a, b). A long line of work studies (complete) policy regret in “tallying” bandits, wherein an action’s loss is a function of the number of the action’s pulls in the previous $m$ rounds (Heidari et al., 2016; Levine et al., 2017; Seznec et al., 2019; Lindner et al., 2021; Awasthi et al., 2022; Malik et al., 2022, 2023).

Beyond online (bandit) learning, policy regret has been studied in several more challenging settings. Arora et al. (2018) study the notion of policy equilibrium in repeated games (Markov games with $H=S=1$) when agents follow no-policy regret algorithms. A more complete characterization of learnability in online learning with dynamics, where the loss function additionally depends on time-evolving states, was given by Bhatia and Sridharan (2020). Finally, Dinh et al. (2023) study policy regret in online MDPs, where an adversary who follows a no-external regret algorithm generates the loss functions, which effectively reduces policy regret minimization to standard external regret minimization in online MDPs.

3 Problem setup

Markov games.

In this paper, we use the framework of Markov Games to study an interactive multi-agent decision-making and learning environment (Shapley, 1953). Markov games extend Markov decision processes (MDPs) to multiplayer scenarios, where each agent’s action affects not only the environment but also the subsequent state of the game and the actions of other agents. Formally, a standard two-player Markov Game (MG) is specified by a tuple $M=(\mathcal{S},\mathcal{A},\mathcal{B},H,P,r)$. Here, $\mathcal{S}$ denotes the state space with cardinality $|\mathcal{S}|=S$, $\mathcal{A}$ is the action space of the first player (called the learner) with cardinality $|\mathcal{A}|=A$, $\mathcal{B}$ is the action space of the second player (referred to as an opponent or an adversary) with cardinality $|\mathcal{B}|=B$, and $H\in\mathbb{N}$ is the time horizon for each game. $P=\{P_{1},\ldots,P_{H}\}$ are the transition kernels with each $P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{B}\rightarrow\Delta(\mathcal{S})$ specifying the probability of transitioning to the next state given the current state, the learner’s action, and the adversary’s action ($\Delta(\mathcal{S})$ denotes the set of all probability distributions over $\mathcal{S}$). Finally, $r=\{r_{1},\ldots,r_{H}\}$ are the (expected) reward functions with each $r_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{B}\rightarrow[0,1]$. For simplicity, we assume the learner knows the reward function.\footnote{Our results immediately generalize to unknown reward functions, as learning the transitions is more difficult than learning the reward functions in tabular MGs.}

Each episode begins in a fixed initial state $s_{1}$. At step $h\in[H]$, the learner observes the state $s_{h}$ and picks her action $a_{h}\in\mathcal{A}$ while the opponent/adversary picks an action $b_{h}\in\mathcal{B}$. As a result, the learner observes $b_{h}$, receives reward $r_{h}(s_{h},a_{h},b_{h})$, and the environment transitions to $s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h},b_{h})$. The episode terminates after $H$ steps.
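The interaction protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list layout `P[h][s][a][b]`, `r[h][s][a][b]`, the helper name `run_episode`, the 0-based step index, and the toy numbers are all hypothetical and not part of the paper; both players are restricted to deterministic Markov policies for brevity.

```python
import random

def run_episode(P, r, pi, mu, H, s1, rng):
    """Roll out one H-step episode of a two-player Markov game.

    P[h][s][a][b] is a list of next-state probabilities, r[h][s][a][b] a reward
    in [0, 1]; pi[h][s] and mu[h][s] are deterministic Markov policies.
    Steps are indexed 0..H-1 here (the text uses 1..H).
    """
    s, total, trajectory = s1, 0.0, []
    for h in range(H):
        a = pi[h][s]                      # learner's action a_h
        b = mu[h][s]                      # opponent's action b_h (observed by the learner)
        reward = r[h][s][a][b]            # r_h(s_h, a_h, b_h)
        trajectory.append((s, a, b, reward))
        total += reward
        probs = P[h][s][a][b]             # P_h(. | s_h, a_h, b_h)
        s = rng.choices(range(len(probs)), weights=probs)[0]
    return total, trajectory

# Toy instance with hypothetical numbers: S = 2, A = 2, B = 2, H = 3.
S, A, B, H = 2, 2, 2, 3
P = [[[[[0.5, 0.5] if a == b else [0.9, 0.1] for b in range(B)]
       for a in range(A)] for s in range(S)] for h in range(H)]
r = [[[[1.0 if a != b else 0.0 for b in range(B)]
       for a in range(A)] for s in range(S)] for h in range(H)]
pi = [[0, 1] for _ in range(H)]           # learner plays action s in state s
mu = [[1, 1] for _ in range(H)]           # opponent always plays action 1
total, traj = run_episode(P, r, pi, mu, H, s1=0, rng=random.Random(0))
```

The return of an episode is the sum of the per-step rewards, so it always lies in $[0,H]$.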

Policies and value functions.

A learner’s policy (also referred to as strategy) is any tuple $\pi=\{\pi_{h}\}_{h\in[H]}$ where $\pi_{h}:(\mathcal{S}\times\mathcal{A})^{h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{A})$. A policy $\pi=\{\pi_{h}\}_{h\in[H]}$ is said to be Markovian if for every $h\in[H]$, $\pi_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A})$. Similarly, an adversary’s policy is any tuple $\mu=\{\mu_{h}\}_{h\in[H]}$ where $\mu_{h}:(\mathcal{S}\times\mathcal{B})^{h-1}\times\mathcal{S}\rightarrow\Delta(\mathcal{B})$. $\mu$ is said to be Markovian if for every $h$, $\mu_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{B})$. For simplicity, we will focus only on Markov policies for both the learner and the adversary in this paper. Let $\Pi$ (respectively, $\Psi$) be the set of all feasible policies of the learner (respectively, the adversary). The value of a policy tuple $(\pi,\mu)\in\Pi\times\Psi$ at step $h$ in state $s$, denoted by $V_{h}^{\pi,\mu}(s)$, is the expected accumulated reward starting in state $s$ from step $h$, if the learner and the adversary follow $\pi$ and $\mu$ respectively, i.e., $V_{h}^{\pi,\mu}(s):=\mathbb{E}_{\pi,\mu}[\sum_{l=h}^{H}r_{l}(s_{l},a_{l},b_{l})|s_{h}=s]$, where the expectation is with respect to the trajectory $(s_{1},a_{1},b_{1},r_{1},\ldots,s_{H},a_{H},b_{H},r_{H})$ distributed according to $P$, $\pi$, and $\mu$. We also denote the action-value function $Q^{\pi,\mu}_{h}(s,a,b):=\mathbb{E}_{\pi,\mu}[\sum_{l=h}^{H}r_{l}(s_{l},a_{l},b_{l})|(s_{h},a_{h},b_{h})=(s,a,b)]$. Given a $V:\mathcal{S}\rightarrow\mathbb{R}$, we write $P_{h}V(s,a,b):=\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot|s,a,b)}[V(s^{\prime})]$. For any $u:\mathcal{S}\rightarrow\Delta(\mathcal{A})$, $v:\mathcal{S}\rightarrow\Delta(\mathcal{B})$, $Q:\mathcal{S}\times\mathcal{A}\times\mathcal{B}\rightarrow\mathbb{R}$, denote $Q(s,u,v):=\mathbb{E}_{a\sim u(\cdot|s),b\sim v(\cdot|s)}[Q(s,a,b)]$ for any $s\in\mathcal{S}$.
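For Markov policies, these value functions can be computed by backward induction. The sketch below assumes the standard Bellman recursion $Q_{h}^{\pi,\mu}=r_{h}+P_{h}V_{h+1}^{\pi,\mu}$ and $V_{h}^{\pi,\mu}(s)=Q_{h}^{\pi,\mu}(s,\pi_{h},\mu_{h})$ with $V_{H+1}^{\pi,\mu}=0$, which the text above does not state explicitly. The function name `evaluate`, the nested-list layout, the 0-based step index, and the toy numbers are hypothetical choices made for illustration.

```python
def evaluate(P, r, pi, mu, H, S, A, B):
    """Return V[h][s] and Q[h][s][a][b] for Markov policies.

    pi[h][s][a] and mu[h][s][b] are action probabilities; P[h][s][a][b][s2] is
    the probability of moving to s2. Steps are indexed 0..H-1.
    """
    V = [[0.0] * S for _ in range(H + 1)]   # V[H] = 0: no reward after the last step
    Q = [[[[0.0] * B for _ in range(A)] for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                for b in range(B):
                    # (P_h V_{h+1})(s, a, b): expected next-step value
                    PV = sum(P[h][s][a][b][s2] * V[h + 1][s2] for s2 in range(S))
                    Q[h][s][a][b] = r[h][s][a][b] + PV
            # V_h(s) = Q_h(s, pi_h, mu_h): average Q over both players' actions
            V[h][s] = sum(pi[h][s][a] * mu[h][s][b] * Q[h][s][a][b]
                          for a in range(A) for b in range(B))
    return V, Q

# Toy instance with hypothetical numbers: reward 1 when the actions match.
S, A, B, H = 2, 2, 2, 2
P = [[[[[0.25, 0.75] for b in range(B)] for a in range(A)]
      for s in range(S)] for h in range(H)]
r = [[[[1.0 if a == b else 0.0 for b in range(B)] for a in range(A)]
      for s in range(S)] for h in range(H)]
pi = [[[0.5, 0.5] for s in range(S)] for h in range(H)]   # uniform learner
mu = [[[1.0, 0.0] for s in range(S)] for h in range(H)]   # opponent always plays 0
V, Q = evaluate(P, r, pi, mu, H, S, A, B)
```

In this toy instance the learner matches the opponent with probability 1/2 at each of the two steps, so the value at the first step is 1 in every state.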

Adaptive adversaries.

We allow the adversary to be adaptive, i.e., the adversary can choose their policy in episode $t$ based on the learner’s policies in episodes $1,\ldots,t$. We assume that the adversary is deterministic and has unlimited computational power, i.e., the adversary can plan, in advance, using as much computation as needed, how they would react in each episode to any sequence of policies. Formally, the adversary defines in advance a sequence of deterministic functions $\{f_{t}\}_{t\in\mathbb{N}^{*}}$, where $f_{t}:\Pi^{t}\rightarrow\Psi$. The input to each response function $f_{t}$ is the entire history of the learner’s policies, including her policy in episode $t$. Therefore, if the learner follows policies $\pi^{1},\ldots,\pi^{t}$, the adversary responds with policy $f_{t}(\pi^{1},\ldots,\pi^{t})\in\Psi$ in episode $t$. Since the response function $f_{t}$ depends on the learner’s policy at round $t$, our setup is essentially a principal-follower model, akin to Stackelberg games (Letchford et al., 2009; Blum et al., 2014) and mechanism design for learning agents (Braverman et al., 2019). In this context, the principal agent (mechanism designer or learner) publicly declares a strategy before committing to it, allowing the followers to subsequently choose their strategies based on their understanding of the principal’s decisions.

We evaluate the learner’s performance using the notion of policy regret (Merhav et al., 2002; Arora et al., 2012), which compares the return on the first $T$ episodes to the return of the best fixed sequence of policies in hindsight. Formally, the learner’s policy regret after $T$ episodes is defined as

\[
\textrm{PR}(T)=\sup_{\pi\in\Pi}\sum_{t=1}^{T}V_{1}^{\pi,f_{t}([\pi]^{t})}(s_{1})-V_{1}^{\pi^{t},f_{t}(\pi^{1},\ldots,\pi^{t})}(s_{1}),\quad\text{where } f_{t}([\pi]^{t}):=f_{t}(\underbrace{\pi,\ldots,\pi}_{t\text{ times}}). \tag{1}
\]

Policy regret has been studied in online (bandit) learning (Merhav et al., 2002; Arora et al., 2012) and repeated games (Arora et al., 2018), yet, to the best of our knowledge, it has never been studied in Markov games. Policy regret differs from the more common notion of external regret, defined as $R(T)=\sup_{\pi\in\Pi}\sum_{t=1}^{T}V_{1}^{\pi,f_{t}(\pi^{1},\ldots,\pi^{t})}(s_{1})-V_{1}^{\pi^{t},f_{t}(\pi^{1},\ldots,\pi^{t})}(s_{1})$, which is used by Liu et al. (2022). However, external regret is inadequate for measuring the learner’s performance against an adaptive adversary. Indeed, when the adversary is adaptive, the quantity $V_{1}^{\pi,f_{t}(\pi^{1},\ldots,\pi^{t})}$ is hard to interpret, because it pairs the comparator $\pi$ with responses that the adversary chose for a different sequence of policies; see Arora et al. (2012) for a more detailed discussion.
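The gap between the two notions can be seen on a small example. The sketch below is a toy illustration under hypothetical assumptions: the learner has two policies, the values $V_{1}^{\pi,\mu}(s_{1})$ are given directly as a table `value[pi][mu]`, and the adversary responds only to the learner’s current policy. None of the names or numbers come from the paper.

```python
def policy_regret(value, f, played, policies):
    """PR(T): each fixed pi is scored against the counterfactual response f(pi, ..., pi)."""
    T = len(played)
    realized = sum(value[played[t]][f(played[:t + 1])] for t in range(T))
    best = max(sum(value[p][f([p] * (t + 1))] for t in range(T)) for p in policies)
    return best - realized

def external_regret(value, f, played, policies):
    """R(T): each fixed pi is scored against the responses to the policies actually played."""
    T = len(played)
    responses = [f(played[:t + 1]) for t in range(T)]
    realized = sum(value[played[t]][responses[t]] for t in range(T))
    best = max(sum(value[p][responses[t]] for t in range(T)) for p in policies)
    return best - realized

# value[pi][mu]: hypothetical first-step values of the policy pair (pi, mu).
value = [[0.5, 0.0],    # learner policy 0 against adversary policies 0 and 1
         [1.0, 0.25]]   # learner policy 1 against adversary policies 0 and 1

def f(history):
    """Adversary mirrors the learner's current policy: it punishes policy 1 with policy 1."""
    return history[-1]

played = [0] * 10                                  # the learner plays policy 0 in every episode
pr = policy_regret(value, f, played, [0, 1])
er = external_regret(value, f, played, [0, 1])
```

Here the external regret is 5 because policy 1 looks better against the adversary’s observed responses, yet switching to policy 1 would trigger the punishing response and lower the return; the policy regret is 0, which reflects that playing policy 0 throughout is already the best fixed choice.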

As a warm-up, we show in the following example that policy regret minimization generalizes the standard Nash equilibrium learning problem in zero-sum two-player Markov games.

Example 3.1 (Nash equilibrium).

Consider an adversary with the following behavior: for any Markov policy $\pi$ of the learner, the adversary ignores all of the learner’s past policies and responds only to the current policy $\pi$ with a Markov policy $f(\pi)$ such that for all $(s,h)$, $V_h^{\pi,f(\pi)}(s)=\min_{\mu}V_h^{\pi,\mu}(s)$, where the minimum is taken over all possible Markov policies for the adversary. By Filar and Vrieze (2012), such an $f(\pi)$ exists. In addition, there also exists a Markov policy $\pi^*$ such that for all $(s,h)$, $V_h^{\pi^*,f(\pi^*)}(s)=\sup_{\pi}V_h^{\pi,f(\pi)}(s)=\inf_{\mu}\sup_{\pi}V_h^{\pi,\mu}(s)$. The policy pair $(\pi^*,f(\pi^*))$ is a Nash equilibrium (Nash, 1950) of the Markov game. For such an adversary, the policy regret becomes ${\textrm{PR}}(T)=\sum_{t=1}^{T}V_1^{\pi^*,f(\pi^*)}(s_1)-\sum_{t=1}^{T}V_1^{\pi^t,f(\pi^t)}(s_1)$. This Nash equilibrium can be computed using, e.g., the Q-ol algorithm of Tian et al. (2021) with $\sqrt{T}$ (policy) regret.\footnote{The Q-ol algorithm solves a problem that is slightly more general than the policy regret minimization problem described here: as long as the benchmark is the Nash value $V_1^{\pi^*,f(\pi^*)}$, the stated rate for the policy regret is guaranteed regardless of the behavior of the adversary. The V-learning algorithm of Jin et al. (2021) solves a similar problem but in a self-play setting; it is not immediately clear whether their rate carries over to the online setting.}
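The policy regret against a best-responding adversary can be illustrated on a toy instance. The sketch below assumes a single-state, one-step zero-sum game (matching pennies), which is far simpler than the Markov games considered here; the function names and the payoff matrix are hypothetical and chosen only for illustration.

```python
def best_response_value(p, R):
    """Learner's value when the adversary best-responds (minimizes) against mixed policy p."""
    n_cols = len(R[0])
    return min(sum(p[a] * R[a][b] for a in range(len(p))) for b in range(n_cols))


def policy_regret(played, p_star, R):
    """Sum over episodes of (benchmark value - value of the policy actually played)."""
    benchmark = best_response_value(p_star, R)
    return sum(benchmark - best_response_value(p, R) for p in played)


# Matching pennies: rows are learner actions, columns are adversary actions.
R = [[1.0, -1.0], [-1.0, 1.0]]
p_star = [0.5, 0.5]          # maximin policy of matching pennies, with value 0
played = [[1.0, 0.0]] * 10   # a learner that always plays its first action
regret = policy_regret(played, p_star, R)
```

Here every pure policy is exploited by the best response, so the learner loses one unit per episode relative to the benchmark and the regret grows linearly until it moves toward the uniform policy.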

Additional notation.

We write $f\lesssim g$ to mean $f={\mathcal{O}}(g)$. We use $c$ to represent an absolute constant that can have different values in different appearances.

4 Fundamental barriers for learning against adaptive adversaries

In this section, we show that achieving low policy regret in Markov games against an adaptive adversary is statistically hard when (i) the adversary has unbounded memory (see Definition 1), or (ii) the adversary is non-stationary, or (iii) the learner’s policy set is exponentially large (even if the adversary is memory-bounded and stationary).

To begin with, we show that any learner must incur a linear policy regret in the general setting.

Theorem 1.

For any learner, there exists an adaptive adversary and a Markov game instance such that ${\textrm{PR}}(T)=\Omega(T)$.

The construction in the proof of Theorem 1, given in Section A.1, takes advantage of the unbounded memory of the adversary, which can remember the policy the learner plays in the first episode. This motivates us to consider memory-bounded adversaries, a situation that is quite similar to the online bandit learning setting of Arora et al. (2012).

Definition 1 ($m$-memory bounded adversaries).

An adversary $\{f_t\}_{t\in{\mathbb{N}}^*}$ is said to be $m$-memory bounded for some $m\geq 0$ if for every $t$ and policy sequence $\pi^1,\ldots,\pi^t$, we have $f_t(\pi^1,\ldots,\pi^t)=f_t(\pi^{\max\{1,t-m+1\}},\ldots,\pi^t)$.

Is it possible to learn efficiently against memory-bounded adversaries? Unlike in online bandit learning, we show that learning in Markov games is statistically hard even when the adversary is memory-bounded, and even in the weakest case where the memory is $m=0$ and the adversary’s policy set $\Psi$ is small.

Theorem 2.

For any learner and any $L\in{\mathbb{N}}$ and $S$, $A$, $H$, there exists an oblivious adversary (i.e., $m=0$) with the policy space $\Psi$ of cardinality at least $L$, and a Markov game (with $SA+S$ states, $A$ actions for the learner, and $B=2S$ actions for the adversary) such that ${\textrm{PR}}(T)=\Omega\left(\sqrt{T(SA/L)^{L}}\right)$.

Theorem 2 shows that competing even with an oblivious adversary that employs a small set of policies takes an exponential number of samples (e.g., set $S=L=H$). The construction of the lower bound follows the construction used to prove a lower bound for learning latent MDPs (Kwon et al., 2021) and a reduction of a given latent MDP to a Markov game (Liu et al., 2022); we give complete details in Section A.2. The proof of Theorem 2 exploits the fact that the sequence of response functions an adversary uses can be completely arbitrary. This implies that we need to constrain the adversary beyond being memory-bounded. A natural restriction, given the construction, is stationarity, i.e., we consider adversaries whose response functions do not change over time.
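To make the exponential dependence explicit, the following short calculation substitutes the suggested choice $S=L=H$ into the rate of Theorem 2. The accuracy level $\epsilon$ is introduced here only for illustration and does not appear in the theorem; constants hidden in $\Omega(\cdot)$ are ignored.

```latex
\[
\sqrt{T\,(SA/L)^{L}}\;\Big|_{S=L=H} \;=\; \sqrt{T\,A^{H}},
\qquad
\sqrt{T\,A^{H}} \le \epsilon T
\iff
T \ge \frac{A^{H}}{\epsilon^{2}} .
\]
```

Read this way, the lower bound does not drop below $\epsilon T$ until the number of episodes is exponential in the horizon $H$.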

Definition 2 (Stationary adversaries).

An $m$-memory bounded adversary is said to be stationary if there exists an $f:\Pi^m\rightarrow\Psi$ such that for all $t$ and $\pi^1,\ldots,\pi^t$, we have $f_t(\pi^1,\ldots,\pi^t)=f(\pi^{\max\{1,t-m+1\}},\ldots,\pi^t)$.
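Definitions 1 and 2 can be read operationally: the adversary sees only the $m$ most recent learner policies, and a stationary adversary applies one fixed function to that window. The sketch below is a minimal illustration under that reading; `memory_window` and `make_stationary_adversary` are hypothetical names, and policies are represented by arbitrary Python objects.

```python
def memory_window(history, m):
    """Return the m most recent learner policies (fewer if the history is shorter)."""
    if m == 0:
        return []  # oblivious adversary: the response ignores the learner entirely
    return list(history[-m:])


def make_stationary_adversary(f, m):
    """Wrap a fixed response function f into an m-memory bounded, stationary adversary."""
    def adversary(history):
        # The same f is used in every episode; only the window content changes.
        return f(tuple(memory_window(history, m)))
    return adversary


# Example: a 1-memory adversary that simply echoes a label of the latest policy.
echo = make_stationary_adversary(lambda window: "reply-to-" + window[-1], m=1)
response = echo(["pi1", "pi2", "pi3"])
```

A non-stationary adversary would instead be allowed a different function in each episode, which is the freedom that the lower bound of Theorem 2 exploits.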

Stationary behavior is sometimes also referred to as “g-restricted” in the online learning literature; see the related discussion in Malik et al. (2022). In the special case where the adversary is both stationary and oblivious (i.e., $m=0$), the Markov game reduces to a standard single-agent MDP (and the policy regret reduces to the standard regret of the MDP); this setting has been studied by Zhang et al. (2023). We therefore only need to consider $m\geq 1$.

Connections to Stackelberg equilibrium in general-sum Markov games.

While seemingly restrictive, policy regret minimization with $m$-memory bounded and stationary adversaries already subsumes the problem of learning a Stackelberg equilibrium (Von Stackelberg, 2010) in general-sum Markov games (Ramponi and Restelli, 2022).\footnote{Ramponi and Restelli (2022) consider a more restrictive setting of turn-based Markov games, wherein at each state only one player is allowed to take actions. In addition, they require the opponents to respond with only deterministic policies.} In general-sum Markov games, the adversary (“follower”) aims to maximize his own reward function given any policy of the learner (“leader”). That is, the adversary is $1$-memory bounded, and the response function $f:\Pi\rightarrow\Psi$ selects the best-response policy to any given policy of the learner. The benchmark $\max_{\pi\in\Pi}V_1^{\pi,f(\pi)}$ in policy regret then corresponds to the Stackelberg equilibrium.

Is sample-efficient learning possible against $m$-memory bounded and stationary adversaries? An immediate approach to learning against $1$-memory bounded and stationary adversaries is to view the problem as a $|\Pi|$-armed bandit problem and apply any state-of-the-art bandit algorithm (Audibert and Bubeck, 2009) to obtain ${\textrm{PR}}(T)={\mathcal{O}}(H\sqrt{T|\Pi|})$. However, scaling polynomially with the cardinality of the learner’s policy class is undesirable when the class is exponentially large (e.g., when the learner’s policy class is the set of all deterministic policies, $|\Pi|=\Theta(A^{HS})$). In fact, polynomial scaling with the cardinality of the learner’s policy class cannot be avoided in general.
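The bandit reduction can be sketched as follows. This is an illustration only: it uses the classical UCB1 index rather than the algorithm of Audibert and Bubeck (2009), so it would carry an extra logarithmic factor, and `episode_return` is a hypothetical callback that plays one episode with the given policy against the adversary and returns a value in $[0,H]$.

```python
import itertools
import math


def enumerate_deterministic_policies(H, S, A):
    """All deterministic Markov policies: one action index per (step, state) pair."""
    return list(itertools.product(range(A), repeat=H * S))


def ucb1_over_policies(policies, episode_return, T, H):
    """Treat each policy as a bandit arm; returns are assumed to lie in [0, H]."""
    counts = [0] * len(policies)
    sums = [0.0] * len(policies)
    for t in range(1, T + 1):
        if t <= len(policies):
            arm = t - 1  # play every policy once before using the index
        else:
            arm = max(
                range(len(policies)),
                key=lambda i: sums[i] / counts[i]
                + H * math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        counts[arm] += 1
        sums[arm] += episode_return(policies[arm])
    return counts
```

Even the initial round-robin phase needs one episode per policy, which is already $A^{HS}$ episodes when the class contains all deterministic policies.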

Theorem 3.

For any learner with policy class $\Pi$, there exists a $1$-memory bounded and stationary adversary and a Markov game with $B={\mathcal{O}}(1)$ such that ${\textrm{PR}}(T)=\Omega\left(\min\{T,|\Pi|\}\right)$.

Note that the lower bound applies to $m=1$ and, therefore, to any $m\geq 1$. The proof is given in Section A.3.

5 Efficient algorithms for learning against adaptive adversaries

Thus far, we have shown that learning against an adaptive adversary in Markov games is statistically hard, even when the adversary is $m$-memory bounded and stationary. Stationarity is not sufficient for efficient learning because the adversary’s response can be unstructured in the worst case, which is what the lower bound in Theorem 3 exploits to construct a hard instance. Even if the learner plays nearly identical sequences of policies, differing only on a small number of states and steps, the adversary can respond completely arbitrarily. In other words, observing the policies that the adversary plays in response to the learner’s policies (i.e., observing the values of the response function $f$ at specific inputs) reveals no information about $f$ on inputs that have not yet been seen. Thus, the learner is required to explore all the policies in $\Pi$ to identify an optimal policy. This motivates an additional structural assumption on how the adversary responds to the learner’s policies: we assume that the adversary is consistent in its responses to two similar sequences of learner policies. In essence, if the learner plays two sequences of policies that agree on certain states $s$ and steps $h$, then we assume that the opponent’s responses also agree on the same states and steps. We refer to this behavior as consistent; a formal definition follows.

Definition 3 (Consistent adversaries).

An $m$-memory bounded and stationary adversary $f$ is said to be consistent if, for any two sequences of learner’s policies $\pi^1,\ldots,\pi^m$ and $\nu^1,\ldots,\nu^m$, and any $(s,h)\in{\mathcal{S}}\times[H]$, if $\pi^i_h(\cdot|s)=\nu^i_h(\cdot|s),\forall i\in[m]$, then $f(\pi^1,\ldots,\pi^m)_h(\cdot|s)=f(\nu^1,\ldots,\nu^m)_h(\cdot|s)$. Otherwise, we say that the opponent’s response $f$ is arbitrary.
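For intuition, the consistency condition can be checked by brute force on a small finite policy list when $m=1$. The sketch below assumes a hypothetical representation in which a policy is a dictionary from $(h,s)$ pairs to a hashable description of the action distribution, and `f` maps a learner policy to an adversary policy of the same shape.

```python
from itertools import combinations


def is_consistent(policies, f, H, S):
    """Brute-force check of the consistency condition for m = 1 on a finite policy list."""
    for pi, nu in combinations(policies, 2):
        resp_pi, resp_nu = f(pi), f(nu)
        for h in range(H):
            for s in range(S):
                # Whenever the learner's policies agree at (h, s), so must the responses.
                if pi[(h, s)] == nu[(h, s)] and resp_pi[(h, s)] != resp_nu[(h, s)]:
                    return False
    return True


# Toy instance with H = 1, S = 2 and two learner actions per state.
policies = [{(0, 0): a0, (0, 1): a1} for a0 in (0, 1) for a1 in (0, 1)]
local_f = lambda pi: {key: 1 - action for key, action in pi.items()}
global_f = lambda pi: {(0, 0): pi[(0, 1)], (0, 1): pi[(0, 0)]}
```

In this toy instance `local_f` reacts at each state only to what the learner does at that state and passes the check, whereas `global_f` reacts at one state to the learner's choice at the other state and fails it.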

We argue that the definition above is natural if we consider opponents that are self-interested strategic agents rather than simply malicious adversaries. It would be in such an opponent’s interest to play in a somewhat consistent manner: playing optimally after figuring out the learner’s strategy would indeed require playing consistently. An opponent that plays completely arbitrarily, while challenging to learn anything from, also does not improve its own value function. Some remarks are in order.

Remark 1 ($\zeta$-approximately consistent adversaries).

Our algorithms and results for consistent adversaries easily extend to $\zeta$-approximately consistent adversaries for any fixed constant $\zeta\geq 0$. An adversary $f$ is said to be $\zeta$-approximately consistent if, for any $\pi^1,\ldots,\pi^m$ and $\nu^1,\ldots,\nu^m$, and any $(s,h)\in{\mathcal{S}}\times[H]$, if $\pi^i_h(\cdot|s)=\nu^i_h(\cdot|s),\forall i\in[m]$, then $\max_{a\in{\mathcal{A}}}\bigg|\log\frac{f(\pi^1,\ldots,\pi^m)_h(a|s)}{f(\nu^1,\ldots,\nu^m)_h(a|s)}\bigg|\leq\zeta$.
For simplicity, we stick with Definition 3 (i.e., $\zeta=0$) to best convey our algorithmic and theoretical ideas.
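The quantity bounded by $\zeta$ in Remark 1 is easy to compute for two given response distributions. The helper below is a small sketch with hypothetical names; it assumes both distributions are lists over the same action set with strictly positive entries, since the log-ratio is otherwise undefined.

```python
import math


def max_abs_log_ratio(p, q):
    """Largest absolute log-ratio between two full-support distributions."""
    return max(abs(math.log(pa / qa)) for pa, qa in zip(p, q))


def approximately_agree(p, q, zeta):
    """True if the two response distributions are within zeta in max log-ratio."""
    return max_abs_log_ratio(p, q) <= zeta
```

With $\zeta=0$ the check accepts only identical distributions, which matches the exact notion of consistency.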

Remark 2.

While our notion of consistent behavior is quite natural, there may well be a more general notion of complexity for the opponent’s response function classes that fully characterizes learnability in this setting. This would likely require defining appropriate norms on the input policy space $\Pi^m$ and the output policy space $\Psi$, and a certain notion of predictability for the opponent’s response function classes (e.g., in the spirit of the Eluder dimension (Russo and Van Roy, 2013)), so that the learner can accurately estimate the opponent’s response function without trying out all possible policies. This question goes beyond the scope of our current work and is left to future investigation.

Remark 3.

To permit learnability in terms of external regret in Markov games, Liu et al. (2022) consider a policy-revealed setting, wherein the opponent reveals his current strategy to the learner at the end of each episode. Achieving no external regret is possible in that setting because the benchmark in external regret evaluates the learner’s comparator policy against the same policies that the opponent reveals. For policy regret, however, knowing the opponent’s strategy at the end of the episode gives the learner no advantage in general, as the counterfactual benchmark requires evaluating the learner’s policies against the policy sequence that the opponent would have responded with. Indeed, our lower bound in Theorem 3 still applies to the policy-revealed setting.

For $m$-memory bounded, stationary, and consistent adversaries, we present two algorithms with sublinear policy regret, one for $m=1$ and the other for general $m\geq 1$. We give special consideration to the case $m=1$, as it allows a simple exposition of the key algorithmic design principles. For simplicity, we focus on $\Pi$ being the set of all deterministic policies (i.e., $|\Pi|=\Theta(A^{HS})$). Our algorithms and upper bounds easily extend to any general $\Pi$ with polynomial log-cardinality.

Assumption 5.1.

The learner’s policy class $\Pi$ is the set of all deterministic policies.

A key component of our algorithms is using maximum likelihood estimation (MLE) (Geer, 2000) to estimate action distributions with which the opponent can respond. As is the convention in MLE analysis, we make a realizability assumption and use bracketing numbers to control the model class.

Assumption 5.2.

For any policy $\mu\in\Psi$ that the adversary employs and for all $(h,s)\in[H]\times{\mathcal{S}}$, assume $\mu_h(\cdot|s)\in P_{\Theta}:=\{P_{\theta}\in\Delta({\mathcal{B}}):\theta\in\Theta\}$, where the set $P_{\Theta}$ has $\epsilon$-bracketing number ${\mathcal{N}}_{\Theta}(\epsilon)$ w.r.t. the $l_1$ norm, defined as the minimum number of $\epsilon$-brackets $[l,u]:=\{P_{\theta}\in P_{\Theta}:l\leq P_{\theta}\leq u\}$ with $\|l-u\|_1\leq\epsilon$ that are needed to cover $P_{\Theta}$.

Intuitively, restricting the adversary to be consistent allows the learner to predict the opponent’s response from its responses in previous episodes to similar policies. The learner can collect data on how the adversary responds and use it to learn his response function. Given the consistent behavior, for every $(h,s)\in[H]\times{\mathcal{S}}$, the number of action distributions $\mu_h(\cdot|s)$ that the adversary can respond with cannot exceed the number of possible action distributions $\pi_h(\cdot|s)$ that the learner can construct in state $s$ at step $h$. Given that $\Pi$ is the set of all deterministic policies, we only need to learn $HSA$ action distributions that the adversary can respond with at any state and step. We begin with the case $m=1$ and then address the general case $m\geq 1$.
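The count of $HSA$ can be spelled out for $m=1$ under Assumption 5.1. This is a rough sketch rather than a statement taken from the analysis: a deterministic policy places all its mass on one of $A$ actions at each $(h,s)$, so a consistent response at $(h,s)$ can take at most $A$ distinct values, while the number of deterministic Markov policies is the number of maps from $[H]\times{\mathcal{S}}$ to the action set.

```latex
\[
\sum_{h\in[H]}\sum_{s\in\mathcal{S}}
\bigl|\{\pi_h(\cdot|s):\pi\in\Pi\}\bigr| \;=\; HSA
\qquad\text{versus}\qquad
|\Pi| \;=\; A^{HS}.
\]
```

For instance, with $H=S=A=3$ (values chosen only for illustration), the learner faces $27$ response distributions to estimate, against $3^{9}=19683$ policies to try in the bandit reduction.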

5.1 Memory of length $m=1$

We first consider memory length $m=1$ for stationary and consistent adversaries.

Algorithm.

We propose OPO-OMLE (Algorithm 1), which stands for Optimistic Policy Optimization with Optimistic Maximum Likelihood Estimation. OPO-OMLE is a variant of the optimistic value iteration algorithm of Azar et al. (2017), wherein we build an upper confidence bound on the value function $V_1^{\pi,f(\pi)}$ for any policy $\pi$, using a bonus function and optimistic MLE (Liu et al., 2023). The upper confidence bound is based on two levels of optimism: a bonus term $\beta$ that is based on confidence intervals on the transition kernels $P$, and the parameter version spaces $\{\Theta_{hsa}\}$ of the adversary’s response at each level $(h,s,a)$. Each parameter version space consists of the parameters that are close to the MLE solution, up to an error $\alpha$, in terms of the log-likelihood of the observed actions taken by the adversary.

1:Input: Bonus function β::𝛽\beta:{\mathbb{N}}\rightarrow{\mathbb{R}}italic_β : blackboard_N → blackboard_R, and MLE confidence parameter α𝛼\alphaitalic_α
2:Initialize: ΘhsaΘ,Dhsa,Nh(s,a,b)0,Nh(s,a,b,s)0,(h,s,a,b,s)𝒮×𝒜××𝒮formulae-sequencesubscriptΘ𝑠𝑎Θformulae-sequencesubscript𝐷𝑠𝑎formulae-sequencesubscript𝑁𝑠𝑎𝑏0formulae-sequencesubscript𝑁𝑠𝑎𝑏superscript𝑠0for-all𝑠𝑎𝑏superscript𝑠𝒮𝒜𝒮\Theta_{hsa}\leftarrow\Theta,D_{hsa}\leftarrow\emptyset,N_{h}(s,a,b)\leftarrow 0% ,N_{h}(s,a,b,s^{\prime})\leftarrow 0,\forall(h,s,a,b,s^{\prime})\in{\mathcal{S% }}\times{\mathcal{A}}\times{\mathcal{B}}\times{\mathcal{S}}roman_Θ start_POSTSUBSCRIPT italic_h italic_s italic_a end_POSTSUBSCRIPT ← roman_Θ , italic_D start_POSTSUBSCRIPT italic_h italic_s italic_a end_POSTSUBSCRIPT ← ∅ , italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a , italic_b ) ← 0 , italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a , italic_b , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ← 0 , ∀ ( italic_h , italic_s , italic_a , italic_b , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ caligraphic_S × caligraphic_A × caligraphic_B × caligraphic_S
3:for episode t=1,,T𝑡1𝑇t=1,\ldots,Titalic_t = 1 , … , italic_T do
4:     πtargmaxπΠDOUBLY_OPTIMISTIC_VALUE_ESTIMATE(N,{Di},{Θi},π,β)superscript𝜋𝑡subscriptargmax𝜋ΠDOUBLY_OPTIMISTIC_VALUE_ESTIMATE𝑁subscript𝐷𝑖subscriptΘ𝑖𝜋𝛽\pi^{t}\in\displaystyle\operatorname*{arg\,max}_{\pi\in\Pi}{\textrm{DOUBLY\_% OPTIMISTIC\_VALUE\_ESTIMATE}}(N,\{D_{i}\},\{\Theta_{i}\},\pi,\beta)italic_π start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_π ∈ roman_Π end_POSTSUBSCRIPT DOUBLY_OPTIMISTIC_VALUE_ESTIMATE ( italic_N , { italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , { roman_Θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , italic_π , italic_β ) (Algorithm 2)
5:     Play πtsuperscript𝜋𝑡\pi^{t}italic_π start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (the opponent responds with f(πt)𝑓superscript𝜋𝑡f(\pi^{t})italic_f ( italic_π start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT )) to observe (s1t,a1t,b1t,r1t,,sHt,aHt,bHt,rHt)superscriptsubscript𝑠1𝑡superscriptsubscript𝑎1𝑡superscriptsubscript𝑏1𝑡superscriptsubscript𝑟1𝑡superscriptsubscript𝑠𝐻𝑡superscriptsubscript𝑎𝐻𝑡superscriptsubscript𝑏𝐻𝑡superscriptsubscript𝑟𝐻𝑡(s_{1}^{t},a_{1}^{t},b_{1}^{t},r_{1}^{t},\ldots,s_{H}^{t},a_{H}^{t},b_{H}^{t},% r_{H}^{t})( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT )
6:     hfor-all\forall h∀ italic_h: Nh(sht,aht,bht)Nh(sht,aht,bht)+1subscript𝑁superscriptsubscript𝑠𝑡superscriptsubscript𝑎𝑡superscriptsubscript𝑏𝑡subscript𝑁superscriptsubscript𝑠𝑡superscriptsubscript𝑎𝑡superscriptsubscript𝑏𝑡1N_{h}(s_{h}^{t},a_{h}^{t},b_{h}^{t})\leftarrow N_{h}(s_{h}^{t},a_{h}^{t},b_{h}% ^{t})+1italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ← italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) + 1, Nh(sht,aht,bht,sh+1t)Nh(sht,aht,bht,sh+1t)+1subscript𝑁superscriptsubscript𝑠𝑡superscriptsubscript𝑎𝑡superscriptsubscript𝑏𝑡superscriptsubscript𝑠1𝑡subscript𝑁superscriptsubscript𝑠𝑡superscriptsubscript𝑎𝑡superscriptsubscript𝑏𝑡superscriptsubscript𝑠1𝑡1N_{h}(s_{h}^{t},a_{h}^{t},b_{h}^{t},s_{h+1}^{t})\leftarrow N_{h}(s_{h}^{t},a_{% h}^{t},b_{h}^{t},s_{h+1}^{t})+1italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ← italic_N start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 
italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) + 1, DhshtahtDhshtaht{bht}subscript𝐷subscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡subscript𝐷subscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡superscriptsubscript𝑏𝑡D_{hs^{t}_{h}a^{t}_{h}}\leftarrow D_{hs^{t}_{h}a^{t}_{h}}\cup\{b_{h}^{t}\}italic_D start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ← italic_D start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∪ { italic_b start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT }, and Θhshtaht{θΘhshtaht:bDhshtahtlogPθ(b)maxθΘhshtahtbDhshtahtlogPθ(b)α}subscriptΘsubscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡conditional-set𝜃subscriptΘsubscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡subscript𝑏subscript𝐷subscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡subscript𝑃𝜃𝑏subscript𝜃subscriptΘsubscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡subscript𝑏subscript𝐷subscriptsuperscript𝑠𝑡subscriptsuperscript𝑎𝑡subscript𝑃𝜃𝑏𝛼\Theta_{hs^{t}_{h}a^{t}_{h}}\leftarrow\{\theta\in\Theta_{hs^{t}_{h}a^{t}_{h}}:% \sum_{b\in D_{hs^{t}_{h}a^{t}_{h}}}\log P_{\theta}(b)\geq\max_{\theta\in\Theta% _{hs^{t}_{h}a^{t}_{h}}}\sum_{b\in D_{hs^{t}_{h}a^{t}_{h}}}\log P_{\theta}(b)-\alpha\}roman_Θ 
start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ← { italic_θ ∈ roman_Θ start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT : ∑ start_POSTSUBSCRIPT italic_b ∈ italic_D start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_P start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_b ) ≥ roman_max start_POSTSUBSCRIPT italic_θ ∈ roman_Θ start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_b ∈ italic_D start_POSTSUBSCRIPT italic_h italic_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_P start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_b ) - italic_α }
7:end for
8:Output: $\{\pi^t\}_{t \in [T]}$
Algorithm 1 Optimistic Policy Optimization with Optimistic MLE (OPO-OMLE)
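The confidence-set update of Algorithm 1 can be illustrated with a short Python sketch for a single $(h,s,a)$ triple. This is only a minimal illustration under assumptions that are not part of the paper: $\Theta$ is taken to be a finite list, and the names `update_confidence_set`, `prob`, and `bernoulli` are hypothetical.

```python
import math


def update_confidence_set(thetas, prob, data, alpha):
    """Keep every theta whose log-likelihood on `data` is within `alpha` of the best.

    thetas: finite list of candidate parameters (assumption: Theta is finite here).
    prob:   callable (theta, b) -> P_theta(b), the opponent's action probability.
    data:   list of observed opponent actions b for one fixed (h, s, a).
    """
    def loglik(theta):
        total = 0.0
        for b in data:
            p = prob(theta, b)
            total += math.log(p) if p > 0.0 else float("-inf")
        return total

    scores = {theta: loglik(theta) for theta in thetas}
    best = max(scores.values())
    return [theta for theta in thetas if scores[theta] >= best - alpha]


# Hypothetical opponent model: plays b = 1 with probability theta, else b = 0.
def bernoulli(theta, b):
    return theta if b == 1 else 1.0 - theta


candidates = [0.1, 0.5, 0.9]
observed = [1, 1, 1, 0, 1, 1, 1, 1]
survivors = update_confidence_set(candidates, bernoulli, observed, alpha=2.0)
```

With a larger `alpha` the filter is more conservative and retains more candidates; in the paper the threshold $\alpha$ is set by the bracketing number of $\Theta$.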
1:Initialize: $\bar{V}_{H+1}^{\pi} = 0$
2:$\hat{P}_h(s'|s,a,b) = \frac{1}{S}$ if $N_h(s,a,b) = 0$; otherwise, $\hat{P}_h(s'|s,a,b) = N_h(s,a,b,s')/N_h(s,a,b)$
3:for $h = H, H-1, \ldots, 1$ do
4:     $\bar{Q}_h^{\pi}(s,a,b) = \min\left\{[\hat{P}_h \bar{V}_{h+1}^{\pi}](s,a,b) + r_h(s,a,b) + \beta(N_h(s,a,b)), H-h+1\right\}, \forall (s,a,b)$
5:     $\bar{V}_h^{\pi}(s) = \max_{\theta \in \Theta_{h s \pi_h(s)}} \bar{Q}^{\pi}_h(s, \pi_h, P_{\theta}), \forall s$ $\triangleright$ Optimistic MLE
6:end for
7:Output: $\bar{V}_1^{\pi}$
Algorithm 2 DOUBLY_OPTIMISTIC_VALUE_ESTIMATE($N, \{D_i\}, \{\Theta_i\}, \pi, \beta$)
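The backward recursion of Algorithm 2 can be sketched in plain Python for a small tabular game. The sketch makes several assumptions that are not in the paper: steps are 0-indexed (so the cap $H-h+1$ becomes `H - h`), the policy is deterministic, each confidence set `Theta[(h, s, a)]` is a finite list of already materialized distributions $P_\theta$ over the opponent's actions, and `beta` is any user-supplied bonus function. Only the $Q$-values for the action $\pi_h(s)$ are computed, since those are the only ones the value update uses.

```python
def doubly_optimistic_value(H, S, B, N, r, pi, Theta, beta):
    """Optimistic value of policy `pi` (hypothetical tabular layout).

    N[h][s][a][b] : list of next-state counts of length S.
    r[h][s][a][b] : reward.
    pi[h][s]      : learner's action at step h in state s.
    Theta[(h, s, a)] : list of distributions over the B opponent actions.
    beta          : callable, visit count -> exploration bonus.
    """
    v_next = [0.0] * S  # \bar V_{H+1} = 0
    for h in reversed(range(H)):
        v = [0.0] * S
        for s in range(S):
            a = pi[h][s]
            q = []
            for b in range(B):
                counts = N[h][s][a][b]
                n = sum(counts)
                # Uniform estimate when (s, a, b) was never visited.
                p_hat = [1.0 / S] * S if n == 0 else [c / n for c in counts]
                backup = sum(p * vn for p, vn in zip(p_hat, v_next))
                # First optimism: bonus on the transition estimate, with clipping.
                q.append(min(backup + r[h][s][a][b] + beta(n), H - h))
            # Second optimism: best opponent model in the confidence set.
            v[s] = max(sum(pb * qb for pb, qb in zip(dist, q))
                       for dist in Theta[(h, s, a)])
        v_next = v
    return v_next


# Tiny hypothetical instance: one step, one state, one learner action, two opponent actions.
H, S, A, B = 1, 1, 1, 2
N = [[[[[0] * S for _ in range(B)] for _ in range(A)] for _ in range(S)] for _ in range(H)]
r = [[[[0.2, 0.8]]]]
pi = [[0]]
Theta = {(0, 0, 0): [[1.0, 0.0], [0.5, 0.5]]}
value = doubly_optimistic_value(H, S, B, N, r, pi, Theta, beta=lambda n: 0.0)
```

In this instance the second candidate distribution yields the larger expected $Q$-value, so it determines the optimistic estimate.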
Theoretical guarantee.

We now present a theoretical guarantee for OPO-OMLE.

Theorem 4.

In Algorithm 1, choose $\beta(t) = cH\sqrt{\frac{\iota + \log|\Pi|}{t}}$, where $\iota := \log(SABHT/\delta)$, and $\alpha = c\log(\mathcal{N}_{\Theta}(1/T)HSAT/\delta)$. With probability at least $1-\delta$, we have

\[
\textrm{PR}(T) = \mathcal{O}\left(H^3 S^2 AB \iota \log T + H^2\sqrt{SABT(\iota + \log|\Pi|)} + H^2\sqrt{SAT\alpha}\right).
\]

Theorem 4 shows that OPO-OMLE achieves $\sqrt{T}$-policy regret bounds against $1$-memory bounded, stationary and consistent adversaries in Markov games. Notably, the policy regret depends only on the log-cardinality of the learner’s policy class $\Pi$ and the log-bracketing number of the set of action distributions with which the adversary responds to the learner. Since $|\Pi| = A^{HS}$, the bound translates into $\textrm{PR}(T) = \tilde{\mathcal{O}}(H^3 S^2 AB + \sqrt{H^5 S A^2 B T})$.
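To get a feel for how the three terms in the bound of Theorem 4 compare, the following Python sketch evaluates them numerically. It is only a rough illustration: the constants hidden by $\mathcal{O}(\cdot)$ are set to one, $\log|\Pi|$ and $\alpha$ are passed in as plain numbers, and the function name `theorem4_terms` is hypothetical.

```python
import math


def theorem4_terms(H, S, A, B, T, delta, log_policy_class, alpha):
    """Return the three terms of the Theorem 4 bound with all constants set to 1."""
    iota = math.log(S * A * B * H * T / delta)
    lower_order = H**3 * S**2 * A * B * iota * math.log(T)
    transition = H**2 * math.sqrt(S * A * B * T * (iota + log_policy_class))
    opponent = H**2 * math.sqrt(S * A * T * alpha)
    return lower_order, transition, opponent


# Hypothetical problem sizes, chosen only for illustration.
small_T = theorem4_terms(H=2, S=3, A=2, B=2, T=10**3, delta=0.1,
                         log_policy_class=6 * math.log(2), alpha=5.0)
large_T = theorem4_terms(H=2, S=3, A=2, B=2, T=10**9, delta=0.1,
                         log_policy_class=6 * math.log(2), alpha=5.0)
```

The first term grows only polylogarithmically in $T$, so the two $\sqrt{T}$ terms take over once $T$ is large relative to the problem size.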

Finally, compared with the lower bound of $\Omega(\min\{\sqrt{H^3 SAT}, HT\})$ for single-agent MDPs (Domingues et al., 2021), which applies to this setting, the dominating term in our upper bound (Theorem 4) is worse only by a factor of $H\sqrt{AB}$; this is due to the need to learn the opponent’s moves.\footnote{A $\sqrt{H}$ factor in $H\sqrt{AB}$ is perhaps unrelated to the need to learn the opponent’s moves. This factor can perhaps be removed with a more intricate algorithm that takes into account the variance of the transition kernels.}

5.2 Memory of any fixed length $m \geq 1$

We now consider the general case of stationary and consistent adversaries that have a memory of any fixed length $m \geq 1$. We assume that the learner knows (an upper bound on) $m$. Playing against a $1$-memory bounded adversary does not stop the learner from changing her policy often, as the adversary does not remember any policy that the learner has played previously. However, a learner that aims for sublinear policy regret against $m$-memory bounded adversaries should switch her policy as infrequently as possible, and at most a sublinear number of times. The reason is that every policy switch adds a constant cost to the policy regret, since the benchmark in the policy regret is the best fixed sequence of policies. This prevents the regret minimizer OPO-OMLE from generalizing from $m = 1$ to any fixed $m$. Instead, we propose a low-switching algorithm, in which the learner plays each exploratory policy repeatedly over consecutive episodes so that the switching cost is reduced. Here, as in Jin et al. (2020), exploratory policies are those with good coverage over the state space, from which uniform policy evaluation can be performed to identify near-optimal policies.

Algorithm.

We propose APE-OVE (Algorithm 3), which stands for Adaptive Policy Elimination by Optimistic Value Estimation. APE-OVE generalizes the adaptive policy elimination algorithm of Qiao et al. (2022) for MDPs to Markov games with unknown opponents. The high-level idea of our algorithm is as follows. The learner maintains a version space $\Pi^k$ of remaining high-quality policies after each epoch, where an epoch is a sequence of consecutive episodes of an appropriate length (epoch $k$ has a length of $HSAB(m-1+T_k)$ in APE-OVE).

  • Layerwise exploration (Line 5 of Algorithm 3): Within each epoch, the learner performs layerwise exploration (Algorithm 4), wherein we devise high-coverage sampling policies $\pi^{khsab}$ that aim at exploring $(s,a,b)$ in step $h$ and epoch $k$, starting from the lowest layer $h=1$ up to the highest layer $h=H$. However, some states might not be visited frequently by any policy and would thus require a large amount of exploration. Fortunately, such states do not significantly affect the value function of any policy, so they can be identified (by storing them in $\mathcal{U}^k$) and removed from exploration quickly (via the truncated transition kernel estimates $\hat{P}$ obtained in Algorithm 5). Layerwise exploration requires value estimation uniformly over all policies. However, the learner does not know the adversary’s response $f$. To address this, we use optimistic value estimation via the optimistic MLE on the collected data of the adversary’s moves (Algorithm 6).

  • Version space refinement (Line 6 of Algorithm 3): After the layerwise exploration, we refine the version space of policies that the learner can choose from in the next epoch, using the optimistic value estimation based on the empirical transition kernels $\hat{P}^k$, the parameter version space $\Theta^k$, and the set of infrequent transitions $\mathcal{U}^k$, given any reward function $r$. The version space is designed so that, with high probability, the expected value of any policy $\pi$ in the version space is within $\tilde{\mathcal{O}}(1/\sqrt{T_k})$ of the optimal value.
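The elimination rule behind the version space refinement can be sketched in a few lines of Python. The sketch assumes that the optimistic value estimates have already been computed and are stored in a dictionary; the names `elimination_slack` and `refine_version_space`, and the choice $c = 1$, are hypothetical.

```python
import math


def elimination_slack(H, S, A, B, alpha, d_star, T_k, c=1.0):
    """Threshold c * H^2 * S * A * B * sqrt(alpha / (d_star * T_k)); c = 1 is an assumption."""
    return c * H**2 * S * A * B * math.sqrt(alpha / (d_star * T_k))


def refine_version_space(values, slack):
    """Keep the policies whose optimistic value is within `slack` of the best one.

    values: dict mapping a policy identifier to its optimistic value estimate.
    """
    best = max(values.values())
    return {name for name, v in values.items() if v >= best - slack}


# Hypothetical value estimates for three policies.
estimates = {"pi_1": 0.90, "pi_2": 0.85, "pi_3": 0.40}
kept = refine_version_space(estimates, slack=0.1)
```

Since the slack shrinks as $T_k$ grows, later epochs retain fewer policies.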

Note that we do not directly use the reward function $r$ in the version space refinement. Instead, we use a truncated reward function $r_{\mathcal{U}^k}$ that is zero for any $(h,s,a,b,s')$ in the infrequent transition set $\mathcal{U}^k$. This truncated design is critical to our analysis and the subsequent guarantees; see, e.g., Lemma B.10. For the truncated reward functions, the backup step in Algorithm 6 should be understood as: $\bar{Q}^{\pi}_h(s,a,b) = \mathbb{E}_{s' \sim \hat{P}^k_h(\cdot|s,a,b)}\left[r_h(s,a,b) 1\{(h,s,a,b,s') \notin \mathcal{U}^k\} + \bar{V}^{\pi}_{h+1}(s')\right], \forall (s,a,b)$.
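The truncated backup above can be written out for a single $(h,s,a,b)$ entry as follows. This is a minimal Python sketch under a hypothetical data layout: `p_hat_row` lists the estimated next-state probabilities and `infrequent_next` collects the next states $s'$ with $(h,s,a,b,s') \in \mathcal{U}^k$.

```python
def truncated_backup(p_hat_row, reward, v_next, infrequent_next):
    """Expected truncated reward plus next-step value for one (h, s, a, b).

    p_hat_row[s2]  : estimated probability of moving to next state s2.
    reward         : r_h(s, a, b).
    v_next[s2]     : optimistic value of s2 at step h + 1.
    infrequent_next: set of next states s2 whose transition lies in U^k.
    """
    total = 0.0
    for s2, p in enumerate(p_hat_row):
        kept_reward = 0.0 if s2 in infrequent_next else reward
        total += p * (kept_reward + v_next[s2])
    return total


# Hypothetical numbers: the transition into state 1 is infrequent, so its reward is dropped.
q_value = truncated_backup([0.5, 0.5], reward=1.0, v_next=[2.0, 4.0], infrequent_next={1})
```

The reward is zeroed only on the infrequent transitions, while the next-step value is still propagated along every transition in the support of the estimated kernel.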

We now present a theoretical guarantee for APE-OVE. We bound policy regret in terms of an instance-dependent quantity, namely minimum positive visitation probability, defined as follows.

1:Input: number of episodes $T$, reward function $r$
2:Parameters: $\alpha := \log(\mathcal{N}_{\Theta}(1/T)HSAT/\delta)$, $\bar{T} := \min\{t \in \mathbb{N} : (m-1)\log\log t + t \geq \frac{T}{HSAB}\}$, $K = \mathcal{O}(\log\log\bar{T})$, and $T_k := \bar{T}^{1-\frac{1}{2^k}}, \forall k \in [K]$
3:Initialize: $\Pi^1 = \Pi, \Theta^1 = \Theta$
4:for epoch $k = 1, \ldots, K$ do
5:     $\hat{P}^k, \Theta^k, \mathcal{U}^k = \textrm{LAYERWISE\_EXPLORATION}(\Pi^k, T_k)$ (Algorithm 4)
6:     $\Pi^{k+1} := \left\{\pi \in \Pi : \bar{V}^{\pi}(r_{\mathcal{U}^k}, \hat{P}^k, \Theta^k) \geq \max_{\pi \in \Pi} \bar{V}^{\pi}(r_{\mathcal{U}^k}, \hat{P}^k, \Theta^k) - cH^2 SAB\sqrt{\alpha/(d^* T_k)}\right\}$, where $r_{\mathcal{U}^k}(s_1, a_1, b_1, \ldots, s_H, a_H, b_H) := \sum_{h \in [H]} 1\{(h, s_h, a_h, b_h, s_{h+1}) \notin \mathcal{U}^k\} r_h(s_h, a_h, b_h)$ and $\bar{V}^{\pi}(r, P, \Theta) := \textrm{OPTIMISTIC\_VALUE\_ESTIMATE}(\pi, r, P, \Theta)$ is given in Algorithm 6
7:end for
Algorithm 3 Adaptive Policy Elimination by Optimistic Value Estimation (APE-OVE)
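The epoch schedule in the parameters of Algorithm 3 can be computed with the following Python sketch. Several choices here are assumptions rather than part of the paper: $K$ is instantiated as $\lceil \log_2 \log_2 \bar{T} \rceil$ (the paper only states $K = \mathcal{O}(\log\log\bar{T})$), the $\log\log t$ term is treated as zero for $t < 3$ to avoid undefined or negative values, $T_k$ is left unrounded, and the function name `epoch_schedule` is hypothetical.

```python
import math


def epoch_schedule(T, H, S, A, B, m):
    """Return (T_bar, K, [T_k], [epoch lengths]) for the APE-OVE schedule."""
    target = T / (H * S * A * B)

    def lhs(t):
        # Assumption: drop the log log term for t < 3, where it is not positive.
        loglog = math.log(math.log(t)) if t >= 3 else 0.0
        return (m - 1) * loglog + t

    # Smallest natural number t with (m - 1) log log t + t >= T / (HSAB).
    t_bar = 1
    while lhs(t_bar) < target:
        t_bar += 1

    # Assumption: one concrete choice consistent with K = O(log log T_bar).
    K = max(1, math.ceil(math.log2(max(math.log2(t_bar), 1.0))))
    T_k = [t_bar ** (1 - 1 / 2**k) for k in range(1, K + 1)]
    # Each of the HSAB exploration policies is played for m - 1 + T_k episodes.
    lengths = [H * S * A * B * (m - 1 + tk) for tk in T_k]
    return t_bar, K, T_k, lengths


t_bar, K, T_k, lengths = epoch_schedule(T=1600, H=2, S=2, A=2, B=2, m=1)
```

Under this schedule each epoch plays at most $HSAB$ distinct exploration policies, each for a block of consecutive episodes, so the number of policy switches is at most $K \cdot HSAB$.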
1:Input: Policy version space $\Pi^k$, number of episodes $T_k$
2:Initialize: $\hat{P}^k = \{\hat{P}^k_h\}_{h \in [H]}$ arbitrary transition kernels, $\mathcal{U}^k = \emptyset$, $\Theta^k_{hsa} = \Theta, \forall (h,s,a)$, $\mathcal{D} = \emptyset$, $N_h^k(s,a,b,s') = 0, \forall (h,s,a,b,s')$, and for each $(h,s,a,b)$, $1_{hsab}$ is the reward function $r'$ such that $r'_{h'}(s',a',b') = 1\{(h',s',a',b') = (h,s,a,b)\}$
3:for $h = 1, \ldots, H$ do
4:     for $(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{B}$ do
5:         $\pi^{khsab} = \operatorname*{arg\,max}_{\pi \in \Pi^k} \textrm{OPTIMISTIC\_VALUE\_ESTIMATE}(\pi, 1_{hsab}, \hat{P}^k, \Theta^k)$ (Algorithm 6)
6:         Play $\pi^{khsab}$ for $m-1$ episodes (and collect nothing)
7:         Keep playing $\pi^{khsab}$ for $T_k$ episodes and add all the transitions only at step $h$ to $\mathcal{D}$
8:     end for
9:     $N^k_h(s,a,b,s') \leftarrow N^k_h(s,a,b,s') + 1, \forall (s,a,b,s') \text{ s.t. } (h,s,a,b,s') \in \mathcal{D}$
10:     $\Theta^k_{hsa} = \{\theta \in \Theta^k_{hsa} : \sum_{b : (h,s,a,b) \in \mathcal{D}} \log P_{\theta}(b) \geq \max_{\theta \in \Theta^k_{hsa}} \sum_{b : (h,s,a,b) \in \mathcal{D}} \log P_{\theta}(b) - \alpha\}, \forall (s,a) \in \mathcal{S} \times \mathcal{A}$
11:     $\mathcal{U}^k \leftarrow \mathcal{U}^k \cup \{(h,s,a,b,s') : N^k_h(s,a,b,s') \leq cH^2\log(SABHK/\delta)\}$
12:     $\hat{P}^k_h = \textrm{TRANSITION\_ESTIMATE}(h, N^k_h, \mathcal{U}^k, s^{\dagger})$ (Algorithm 5)
13:     Reset $\mathcal{D} = \emptyset$
14:end for
15:Output: P^k={P^hk}h[H]superscript^𝑃𝑘subscriptsubscriptsuperscript^𝑃𝑘delimited-[]𝐻\hat{P}^{k}=\{\hat{P}^{k}_{h}\}_{h\in[H]}over^ start_ARG italic_P end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = { over^ start_ARG italic_P end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_h ∈ [ italic_H ] end_POSTSUBSCRIPT, ΘksuperscriptΘ𝑘\Theta^{k}roman_Θ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT, 𝒰ksuperscript𝒰𝑘{\mathcal{U}}^{k}caligraphic_U start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT
Algorithm 4 LAYERWISE_EXPLORATION(Πk,Tk)superscriptΠ𝑘subscript𝑇𝑘(\Pi^{k},T_{k})( roman_Π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )
1: $\hat{P}_h(s'|s,a,b) = \begin{cases} \frac{N_h(s,a,b,s')}{N_h(s,a,b)}, & \forall (s,a,b,s') \text{ s.t. } (h,s,a,b,s') \notin \mathcal{U} \\ 0, & \forall (s,a,b,s') \text{ s.t. } (h,s,a,b,s') \in \mathcal{U} \end{cases}$
2: $\hat{P}_h(s^{\dagger}|s,a,b) = 1 - \sum_{s' \in \mathcal{S}: (h,s,a,b,s') \notin \mathcal{U}} \hat{P}_h(s'|s,a,b), \quad \forall (s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{B}$
3: $\hat{P}_h(s^{\dagger}|s^{\dagger},a,b) = 1, \quad \forall (a,b) \in \mathcal{A} \times \mathcal{B}$
4: Output: $\hat{P}_h$
Algorithm 5 TRANSITION_ESTIMATE$(h, N_h, \mathcal{U}, s^{\dagger})$
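The three steps of Algorithm 5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: states, actions, and adversary actions are indexed by integers, the absorbing state $s^{\dagger}$ is stored at index `n_s`, and $N_h(s,a,b)$ is taken to be the sum of $N_h(s,a,b,s')$ over $s'$. The function name, argument names, and the handling of unvisited triples (all mass sent to $s^{\dagger}$) are hypothetical choices, not part of the original algorithm.

```python
from typing import Dict, List, Set, Tuple

Counts = Dict[Tuple[int, int, int, int], int]      # (s, a, b, s') -> N_h(s, a, b, s')
Infrequent = Set[Tuple[int, int, int, int, int]]   # tuples (h, s, a, b, s') in U


def transition_estimate(h: int, counts: Counts, infrequent: Infrequent,
                        n_s: int, n_a: int, n_b: int) -> List[List[List[List[float]]]]:
    """Return p_hat[s][a][b][s'] over n_s + 1 states; index n_s is the absorbing state."""
    dagger = n_s
    p_hat = [[[[0.0] * (n_s + 1) for _ in range(n_b)]
              for _ in range(n_a)] for _ in range(n_s + 1)]
    for s in range(n_s):
        for a in range(n_a):
            for b in range(n_b):
                # Assumption: N_h(s, a, b) is the sum of N_h(s, a, b, s') over s'.
                total = sum(counts.get((s, a, b, s2), 0) for s2 in range(n_s))
                for s2 in range(n_s):
                    # Step 1: empirical frequency outside U, zero inside U.
                    if total > 0 and (h, s, a, b, s2) not in infrequent:
                        p_hat[s][a][b][s2] = counts.get((s, a, b, s2), 0) / total
                # Step 2: the remaining mass goes to the absorbing state.
                p_hat[s][a][b][dagger] = max(0.0, 1.0 - sum(p_hat[s][a][b][:n_s]))
    # Step 3: the absorbing state transitions to itself with probability one.
    for a in range(n_a):
        for b in range(n_b):
            p_hat[dagger][a][b][dagger] = 1.0
    return p_hat
```

By construction every row of the returned kernel sums to one, because whatever mass is removed from tuples in $\mathcal{U}$ is redirected to $s^{\dagger}$.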
1: Input: reward function $r$, policy $\pi$, transition kernel $P$, parameter version space $\Theta$
2: Initialize: $\bar{V}^{\pi}_{H+1}(\cdot) = 0$
3: for $h = H, H-1, \ldots, 1$ do
4:     $\bar{Q}^{\pi}_h(s,a,b) = r_h(s,a,b) + [P_h \bar{V}^{\pi}_{h+1}](s,a,b), \quad \forall (s,a,b)$
5:     $\bar{V}^{\pi}_h(s) = \max_{\theta \in \Theta_{hs\pi_h(s)}} \bar{Q}^{\pi}_h(s, \pi_h(s), P_{\theta}), \quad \forall s$ $\triangleright$ Optimistic MLE
6: end for
7: Output: $\bar{V}^{\pi}_1(s_1)$
Algorithm 6 OPTIMISTIC_VALUE_ESTIMATE$(\pi, r, P, \Theta)$
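The backward recursion of Algorithm 6 can be sketched as follows. The sketch makes several assumptions that are not stated in the algorithm box: the learner's policy is deterministic (`policy[h][s]` is an action index), each version space $\Theta_{hs\pi_h(s)}$ is a finite list of candidate adversary response distributions over $\mathcal{B}$, and $\bar{Q}^{\pi}_h(s,a,P_{\theta})$ denotes the average of $\bar{Q}^{\pi}_h(s,a,b)$ under $b \sim P_{\theta}$. All names and the nested-list layout are hypothetical.

```python
from typing import List, Sequence

Reward = Sequence[Sequence[Sequence[Sequence[float]]]]                # reward[h][s][a][b]
Kernel = Sequence[Sequence[Sequence[Sequence[Sequence[float]]]]]      # trans[h][s][a][b][s']
VersionSpace = Sequence[Sequence[Sequence[Sequence[Sequence[float]]]]]  # vs[h][s][a] -> candidates


def optimistic_value_estimate(policy: Sequence[Sequence[int]], reward: Reward,
                              trans: Kernel, version_space: VersionSpace,
                              s1: int = 0) -> float:
    """Backward induction with an optimistic choice of the adversary's response."""
    horizon = len(reward)
    n_s = len(reward[0])
    v_next: List[float] = [0.0] * n_s          # V_{H+1} = 0
    for h in reversed(range(horizon)):
        v_now = [0.0] * n_s
        for s in range(n_s):
            a = policy[h][s]                   # only the played action is needed
            n_b = len(reward[h][s][a])
            # Q_h(s, a, b) = r_h(s, a, b) + [P_h V_{h+1}](s, a, b)
            q = [reward[h][s][a][b]
                 + sum(trans[h][s][a][b][s2] * v_next[s2] for s2 in range(n_s))
                 for b in range(n_b)]
            # Optimism: take the most favorable candidate response in the version space.
            v_now[s] = max(sum(theta[b] * q[b] for b in range(n_b))
                           for theta in version_space[h][s][a])
        v_next = v_now
    return v_next[s1]
```

The maximization over the version space is what makes the estimate optimistic: among all adversary responses still consistent with the data, the one most favorable to the learner is used.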
Definition 4 (Minimum positive visitation probability).

The quantity $d^* := \inf_{h,s,a:\, d^*_h(s,a) > 0} d^*_h(s,a)$ is said to be the minimum positive visitation probability, where $d^*_h(s,a) := \inf_{\pi \in \Pi:\, d_h^{\pi, f([\pi]^m)}(s,a) > 0} d_h^{\pi, f([\pi]^m)}(s,a)$.

The minimum positive visitation probability, which has also been used recently to characterize instance-dependent bounds for PAC RL (Tirinzoni et al., 2023), is the minimal probability with which any state-action pair can be visited at a time step, given that it can be visited at all. This implies that if, during the exploration phase, we try a certain policy $\pi$ for $N$ episodes and encounter $(s,a)$ at step $h$ (in any episode), then $\pi$ visits $(h,s,a)$ at least $Nd^*$ times on average over the $N$ episodes. This, along with the assumption that the adversary is consistent, enables us to estimate the adversary’s response to any visited $(h,s,a)$ within an estimation error of order $1/\sqrt{Nd^*}$. Note that we do not need to take care of the adversary’s response to any $(h,s,a)$ that is not visited, as these tuples are deemed infrequent by any policy and thus have negligible impact on the value estimation.
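To make the scaling concrete, the following small sketch computes the order of the estimation error $1/\sqrt{Nd^*}$ and the number of exploration episodes needed to reach a target error. Constants and logarithmic factors are deliberately ignored, and both function names are hypothetical.

```python
import math


def error_order(n_episodes: int, d_star: float) -> float:
    """Order of the estimation error, 1 / sqrt(N * d_star), ignoring constants and logs."""
    return 1.0 / math.sqrt(n_episodes * d_star)


def episodes_needed(d_star: float, target_error: float) -> int:
    """Smallest N with 1 / sqrt(N * d_star) <= target_error, ignoring constants and logs."""
    return math.ceil(1.0 / (d_star * target_error ** 2))
```

For instance, with $d^* = 0.25$, reaching an error of order $0.5$ takes on the order of $16$ episodes, and halving $d^*$ doubles that number.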

Theorem 5.

Playing APE-OVE against any $m$-memory bounded, stationary, and consistent adversary in any Markov game for $T$ episodes, with $T = \tilde{\Omega}\left(\max\left\{\frac{H^5 AB (d^*)^2}{S^3}, (m-1)HSAB\right\}\right)$, and

\[
T \gtrsim \min\left\{ \frac{H^5 SAB (d^*)^2 \log^4(HSABK/\delta)}{\alpha^2}, \frac{H^9 (d^*)^2 \log^4(HSABK/\delta)}{(SAB)^3 \alpha^2}, \frac{H^{13} \log^2(HSABK/\delta)}{(AB)^3 S^5} \right\},
\]

guarantees that with probability at least $1-\delta$,

\[
\textrm{PR}(T) = \mathcal{O}\left( (m-1) H^2 SAB \log\log T + H^{3/2} \sqrt{SAB}\, (HSAB + H^2 + S^{3/2} AB) \sqrt{\frac{T\alpha}{d^*}} \log\log T \right),
\]

where $d^*$ is the minimum positive visitation probability and $\alpha$ is as defined in Algorithm 3.

Theorem 5 asserts a $\sqrt{T}$ policy regret bound against $m$-memory bounded, stationary, and consistent adversaries in Markov games. Notably, our bound grows linearly with the memory length $m$. Compared to the bound in Theorem 4, given that $T$ is sufficiently large, the bound in Theorem 5 handles general memory length $m$ at the cost of a worse dependence on all other factors $H, S, A, B, d^*$. Dealing with $\zeta$-approximately consistent adversaries (see Remark 1) adds a term of order $\mathcal{O}(T\zeta)$ to the policy regret.

6 Discussion

In this paper, we study learning in Markov games against adaptive adversaries and highlight the statistical hardness of learning in this setting. We identify a natural structural assumption on the response function of the adversary, under which we provide two distinct algorithms that attain $\sqrt{T}$ policy regret, one for unit memory and the other for general memory length.

There are several notable gaps in our current understanding of policy regret in Markov games. First, we do not know if the dependence on the minimum positive visitation probability $d^*$ when learning against $m$-memory bounded opponents is necessary. In other words, can we derive minimax bounds that hold for any problem instance, regardless of how small $d^*$ is, for the case of general $m$? It seems to us that such a dependence is necessary (as it seems difficult otherwise to learn the opponent’s response while also learning high-return policies), yet we are unable to prove or refute this conjecture. Second, as we state in Remark 2, we do not currently know the necessary conditions on the opponent’s response functions for learnability in this setting. Identifying them may well require an alternate condition that generalizes our notion of consistent behaviors and fully characterizes the predictability of the opponent (in a similar way as the VC dimension characterizes learnability in statistical learning theory). Third, our theory currently views information, and not computation, as the main bottleneck and aims for policy regret minimization without worrying about computational complexity. As a result, some of the steps in our algorithms are computationally inefficient. In particular, selecting a policy that maximizes the optimistic value function requires iterating over the learner’s policy set, which is exponentially large. Can we hope for computationally efficient no-policy-regret algorithms in Markov games? Fourth, our policy regret bounds scale with the cardinality of the state space and the action space, which could be large in many practical settings. Can we avoid such dependence by employing function approximation (e.g., neural networks)?

Acknowledgments and Disclosure of Funding

This research was supported, in part, by the DARPA GARD award HR00112020004, NSF CAREER award IIS-1943251, funding from the Institute for Assured Autonomy (IAA) at JHU, and the Spring’22 workshop on “Learning and Games” at the Simons Institute for the Theory of Computing.

References

  • Agarwal et al. [2020] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank MDPs. Advances in neural information processing systems, 33:20095–20107, 2020.
  • Arora et al. [2012] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400, 2012.
  • Arora et al. [2018] Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, and Mehryar Mohri. Policy regret in repeated games. Advances in Neural Information Processing Systems, 31, 2018.
  • Audibert and Bubeck [2009] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
  • Awasthi et al. [2022] Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, and Kostas Kollias. Congested bandits: Optimal routing via short-term resets. In International Conference on Machine Learning, pages 1078–1100. PMLR, 2022.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pages 263–272. PMLR, 2017.
  • Bai and Jin [2020] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International conference on machine learning, pages 551–560. PMLR, 2020.
  • Bai et al. [2020] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. Advances in neural information processing systems, 33:2159–2170, 2020.
  • Baker et al. [2019] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.
  • Balcan et al. [2005] M-F Balcan, Avrim Blum, Jason D Hartline, and Yishay Mansour. Mechanism design via machine learning. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 605–614. IEEE, 2005.
  • Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Bhatia and Sridharan [2020] Kush Bhatia and Karthik Sridharan. Online learning with dynamics: A minimax perspective. Advances in Neural Information Processing Systems, 33:15020–15030, 2020.
  • Blum et al. [2014] Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Learning optimal commitment to overcome insecurity. Advances in Neural Information Processing Systems, 27, 2014.
  • Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Braverman et al. [2019] Mark Braverman, Jieming Mao, Jon Schneider, and S Matthew Weinberg. Multi-armed bandit problems with strategic arms. In Conference on Learning Theory, pages 383–416. PMLR, 2019.
  • Brown and Sandholm [2018] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • Cole and Roughgarden [2014] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 243–252, 2014.
  • Conitzer and Sandholm [2002] Vincent Conitzer and Tuomas Sandholm. Complexity of mechanism design. arXiv preprint cs/0205075, 2002.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  • Dekel et al. [2014] Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459–467, 2014.
  • Dinh et al. [2023] Le Cong Dinh, David Henry Mguni, Long Tran-Thanh, Jun Wang, and Yaodong Yang. Online Markov decision processes with non-oblivious strategic adversary. Autonomous Agents and Multi-Agent Systems, 37(1):15, 2023.
  • Domingues et al. [2021] Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
  • Dütting et al. [2019] Paul Dütting, Zhe Feng, Harikrishna Narasimhan, David Parkes, and Sai Srivatsa Ravindranath. Optimal auctions through deep learning. In International Conference on Machine Learning, pages 1706–1715. PMLR, 2019.
  • Filar and Vrieze [2012] Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
  • Geer [2000] Sara A Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
  • Hansen et al. [2013] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
  • Heidari et al. [2016] Hoda Heidari, Michael J Kearns, and Aaron Roth. Tight policy regret bounds for improving and decaying bandits. In IJCAI, pages 1562–1570, 2016.
  • Hu and Wellman [2003] Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
  • Jaderberg et al. [2019] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • Jin et al. [2020] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
  • Jin et al. [2021] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent RL. arXiv preprint arXiv:2110.14555, 2021.
  • Jin et al. [2022] Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent RL in large state spaces. In International Conference on Machine Learning, pages 10251–10279. PMLR, 2022.
  • Koren et al. [2017a] Tomer Koren, Roi Livni, and Yishay Mansour. Bandits with movement costs and adaptive pricing. In Conference on Learning Theory, pages 1242–1268. PMLR, 2017a.
  • Koren et al. [2017b] Tomer Koren, Roi Livni, and Yishay Mansour. Multi-armed bandits with metric movement costs. Advances in Neural Information Processing Systems, 30, 2017b.
  • Kwon et al. [2021] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34:24523–24534, 2021.
  • Letchford et al. [2009] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pages 250–262. Springer, 2009.
  • Levine et al. [2017] Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. Advances in neural information processing systems, 30, 2017.
  • Lindner et al. [2021] David Lindner, Hoda Heidari, and Andreas Krause. Addressing the long-term impact of ml decisions via policy regret. arXiv preprint arXiv:2106.01325, 2021.
  • Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
  • Liu et al. [2021] Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pages 7001–7010. PMLR, 2021.
  • Liu et al. [2022] Qinghua Liu, Yuanhao Wang, and Chi Jin. Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. In International Conference on Machine Learning, pages 14036–14053. PMLR, 2022.
  • Liu et al. [2023] Qinghua Liu, Praneeth Netrapalli, Csaba Szepesvari, and Chi Jin. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 363–376, 2023.
  • Malik et al. [2022] Dhruv Malik, Yuanzhi Li, and Aarti Singh. Complete policy regret bounds for tallying bandits. In Conference on Learning Theory, pages 5146–5174. PMLR, 2022.
  • Malik et al. [2023] Dhruv Malik, Conor Igoe, Yuanzhi Li, and Aarti Singh. Weighted tallying bandits: overcoming intractability via repeated exposure optimality. In International Conference on Machine Learning, pages 23590–23609. PMLR, 2023.
  • Merhav et al. [2002] Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947–1958, 2002.
  • Moravčík et al. [2017] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • Nash [1950] John F. Nash. Equilibrium points in $n$-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950. doi: 10.1073/pnas.36.1.48.
  • Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog (t) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • Ramponi and Restelli [2022] Giorgia Ramponi and Marcello Restelli. Learning in Markov games: can we exploit a general-sum opponent? In Uncertainty in Artificial Intelligence, pages 1665–1675. PMLR, 2022.
  • Russo and Van Roy [2013] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
  • Seznec et al. [2019] Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, and Michal Valko. Rotting bandits are no harder than stochastic ones. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2564–2572. PMLR, 2019.
  • Shalev-Shwartz et al. [2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Shapley [1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. nature, 550(7676):354–359, 2017.
  • Silver et al. [2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • Tian et al. [2021] Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online learning in unknown Markov games. In International conference on machine learning, pages 10279–10288. PMLR, 2021.
  • Tirinzoni et al. [2023] Andrea Tirinzoni, Aymen Al-Marjani, and Emilie Kaufmann. Optimistic PAC reinforcement learning: the instance-dependent view. In International Conference on Algorithmic Learning Theory, pages 1460–1480. PMLR, 2023.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Von Stackelberg [2010] Heinrich Von Stackelberg. Market structure and equilibrium. Springer Science & Business Media, 2010.
  • Wei et al. [2017] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. Advances in Neural Information Processing Systems, 30, 2017.
  • Wei et al. [2020] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. arXiv preprint arXiv:2006.09517, 2020.
  • Xie et al. [2020] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In Conference on learning theory, pages 3674–3682. PMLR, 2020.
  • Yang and Wang [2020] Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
  • Zhang et al. [2021] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of reinforcement learning and control, pages 321–384, 2021.
  • Zhang [2006] Tong Zhang. From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation. 2006.
  • Zhang et al. [2023] Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. arXiv preprint arXiv:2307.13586, 2023.
  • Zheng et al. [2020] Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C Parkes, and Richard Socher. The AI economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332, 2020.

Appendix A Missing proofs for Section 4

A.1 Proof of Theorem 1

Proof of Theorem 1.

The construction of a hard problem essentially follows the proof idea of Arora et al. [2012]. Policy regret requires the learner to compete with the best fixed sequence of policy in hindsight as if she could have changed her past policies. The lower bound utilizes this fact to construct an instance such that once the learner picks a particular policy in the first episode, she will receive a low reward for the remaining episodes. The only way to achieve a higher reward is to go back in time and select a different policy.

More formally, consider any learner. Let $\pi^1$ be a policy that the learner commits to in the first episode with the highest positive probability $p > 0$. Note that $\pi^1$ and $p$ are inherent properties of the learner and do not depend on the adversary or the Markov game, since in the first episode the learner has no information about either. Now consider an adversary that depends only on the learner’s policy in the first episode and nothing else, i.e., for all $t$ and every policy sequence $\pi^1, \ldots, \pi^t$, $f_t(\pi^1, \ldots, \pi^t) = f(\pi^1)$ for some function $f: \Pi \rightarrow \Psi$.
In addition, let $f$ be such that $f(\pi) = \mu$ if $\pi = \pi^1$ and $f(\pi) = \nu$ otherwise, where $\mu$ and $\nu$ are such that for all $s$, $\sup_{\pi} V_1^{\pi,\nu}(s) - \sup_{\pi} V_1^{\pi,\mu}(s) = \Omega(1)$. There exists a Markov game that always guarantees the existence of such $\mu, \nu$ (the constructions are fairly straightforward). Thus, with probability $p$, we have $\textrm{PR}(T) = \Omega(T)$. Note that the external regret $R(T)$ for this construction is $0$.
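The construction can be illustrated with a toy instance. The sketch below is not the Markov game of the proof: it collapses each episode to a single value `VALUE[policy][adversary]`, uses two learner policies, and assumes that the policy-regret comparator is a single policy repeated in every episode. The value table, the constant `TRAP` (standing in for $\pi^1$), and all function names are hypothetical.

```python
from typing import Sequence

MU, NU = 0, 1                        # the two adversary policies
TRAP = 0                             # stands in for pi^1, the learner's likeliest first policy
VALUE = ((0.25, 1.0), (0.0, 0.75))   # VALUE[policy][adversary], a toy per-episode value


def response(first_policy: int) -> int:
    """The adversary's policy depends only on the learner's first-episode policy."""
    return MU if first_policy == TRAP else NU


def total_return(policies: Sequence[int]) -> float:
    """Return collected by a policy sequence, with the adversary reacting to its first entry."""
    adversary = response(policies[0])
    return sum(VALUE[pi][adversary] for pi in policies)


def policy_regret(played: Sequence[int]) -> float:
    """Compare against each fixed policy replayed from episode one (counterfactual adversary)."""
    best = max(total_return([pi] * len(played)) for pi in range(2))
    return best - total_return(played)


def external_regret(played: Sequence[int]) -> float:
    """Compare against each fixed policy while keeping the realized adversary sequence."""
    adversary = response(played[0])
    best = max(sum(VALUE[pi][adversary] for _ in played) for pi in range(2))
    return best - total_return(played)
```

In this toy instance, a learner that starts with `TRAP` and then best-responds to $\mu$ has zero external regret, yet its policy regret is $0.5\,T$, which grows linearly in the number of episodes.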

A.2 Proof of Theorem 2

Proof of Theorem 2.

The proof follows from two main arguments: (i) a reduction from any latent MDP [Kwon et al., 2021] to a Markov game with an adversary playing policies from a finite set of Markov policies, and (ii) a reduction from the notion of regret in latent MDPs to the policy regret w.r.t. an oblivious sequence of Markov policies.

Argument (i) is directly taken from [Liu et al., 2022, Proposition 5]. In particular, interacting with any latent MDP [Kwon et al., 2021] of $L$ latent variables, $S$ states, $A$ actions, $H$ time steps, and binary rewards is equivalent (from the perspective of the learner) to interacting with a (simulated) Markov game against an adversary whose policies are chosen from a set of $L$ Markov policies. The simulated Markov game has $SA+S$ states, $A$ actions for the learner, $2S$ actions for the adversary, and $2H$ time steps (see [Liu et al., 2022, Section A.4] for the detailed construction of the simulated Markov game from any latent MDP). Thus, any lower bound for latent MDPs carries over to Markov games (but not vice versa).

To continue from Argument (i) and begin with Argument (ii), we recall the definition of latent MDPs [Kwon et al., 2021]. At the beginning of each episode, nature secretly draws an MDP uniformly at random from a set of $L$ base MDPs, and the learner interacts with this drawn MDP for the episode. [Kwon et al., 2021, Theorem 3.1] show that for any learner, there exists a latent MDP with $L$ base MDPs such that the learner needs at least $\Omega((SA/L)^{L}/\epsilon^{2})$ episodes to identify an $\epsilon$-suboptimal policy, where optimality is defined with respect to the average value over the $L$ base MDPs. Note that in the construction of the hard latent MDP instance above, there is a unique optimal policy (call it $\pi^*$) with respect to this notion of optimality. Thus, the regret of this learner over $T$ episodes when competing against $\pi^*$ is at least $\Omega(\sqrt{T(SA/L)^{L}})$ (the learner suffers an instantaneous regret of $\epsilon$ every time she fails to identify $\pi^*$). Note again that the regret above is an expectation with respect to the uniform distribution over the $L$ base MDPs. Thus, there exists a particular realization of a sequence of $T$ base MDPs, in a certain order, such that the regret with respect to this sequence when competing with $\pi^*$ is at least the expected regret with respect to the uniform distribution over the $L$ base MDPs, which is $\Omega(\sqrt{T(SA/L)^{L}})$. Finally, note that $\pi^*$ is also an optimal policy with respect to the total value across the sequence of $T$ MDPs, since $\pi^*$ is an optimal policy for each individual base MDP, per the construction in Kwon et al. [2021]. We conclude that, for any learner, there exists a sequence of $T$ MDPs from a set of $L$ MDPs such that the regret of the learner with respect to this MDP sequence is $\Omega(\sqrt{T(SA/L)^{L}})$. ∎
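
Remark (heuristic conversion from sample complexity to regret; this is our own sketch and not part of the original proof). Write $C := (SA/L)^{L}$, a shorthand introduced only for this remark, and assume for illustration that $T \geq C$ and that absolute constants are ignored. The sample-complexity bound says that fewer than order $C/\epsilon^{2}$ episodes do not suffice to identify an $\epsilon$-suboptimal policy. Choosing the accuracy level so that this threshold matches the horizon,
\[
\frac{C}{\epsilon^{2}} \asymp T \quad\Longleftrightarrow\quad \epsilon \asymp \sqrt{C/T},
\]
and charging an instantaneous regret of order $\epsilon$ to each of the $T$ episodes gives
\[
T \cdot \epsilon \asymp T \sqrt{C/T} = \sqrt{TC} = \sqrt{T (SA/L)^{L}},
\]
which is consistent with the rate used in the argument above.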

A.3 Proof of Theorem 3

Proof of Theorem 3.

Consider any learner. Consider the adversary’s policy space $\Psi = \{\mu, \nu\}$ where, for all $h \in [H-1]$, $\mu_h$ and $\nu_h$ are arbitrary, but $\mu_H(b_1|s) = 1, \forall s$ and $\nu_H(b_2|s) = 1, \forall s$, for some distinct $b_1, b_2 \in \mathcal{B}$. Let the reactive function $f$ map every policy in $\Pi$ except some $\pi^*$ to $\mu$, whereas $f(\pi^*) = \nu$. Now consider a deterministic Markov game with the following properties. The transition kernel is deterministic and always traverses the same sequence of states, regardless of the actions the learner and the adversary take. The reward functions are deterministic and zero everywhere, except that $r_H(s,a,b_2) = 1, \forall s,a$. Consequently, $\pi^*$ yields a positive reward when the learner selects it, while every other policy in $\Pi$ gives zero reward. In addition, since the learner does not know $f$ and there is no relation whatsoever between $f(\pi)$ and $f(\pi')$ for any $\pi \neq \pi'$, the learner needs, in the worst case, to play all policies in $\Pi$ at least once to be able to identify $\pi^*$. ∎
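
Remark (a quantitative reading of the argument above; this is our own sketch and not part of the original proof). Suppose, as assumptions made only for this illustration, that $\Pi$ is finite, that $\pi^*$ is drawn uniformly at random from $\Pi$, that $b_1 \neq b_2$, and that $T \leq |\Pi|$. Until $\pi^*$ is played for the first time, the adversary always responds with $\mu$, so the learner's observations carry no information about which policy is $\pi^*$; hence the probability that $\pi^*$ has been played by episode $k$ is at most $k/|\Pi|$. Each episode with $\pi^t \neq \pi^*$ costs one unit of regret against $\pi^*$, so
\[
\mathbb{E}\left[\sum_{t=1}^{T} \left(1 - \mathbb{1}\{\pi^t = \pi^*\}\right)\right] \geq \sum_{k=1}^{T} \left(1 - \frac{k}{|\Pi|}\right) = T - \frac{T(T+1)}{2|\Pi|} \geq \frac{T-1}{2}.
\]
Under these assumptions, the expected regret grows linearly in $T$ until $T$ becomes comparable to $|\Pi|$.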

Appendix B Missing proofs for Section 5

B.1 Support lemmas

Maximum Likelihood Estimation.

Let $\{x_i\}_{i\in[T]} \sim P_{\theta^*}$ where $\theta^* \in \Theta$. Denote by $\mathcal{N}_{\Theta}(\epsilon)$ the $\epsilon$-bracketing number of the function class $\{P_{\theta} : \theta \in \Theta\}$. The following lemma says that, on the empirical data, the log-likelihood of any model in the model class exceeds that of the true model by at most an error that scales logarithmically with the model complexity, measured by the bracketing number.

Lemma B.1.

There exists an absolute constant $c$ such that for any $\delta \in (0,1)$, with probability at least $1-\delta$, for all $t \in [T]$ and $\theta \in \Theta$, we have

\[
\sum_{i=1}^{t} \log \frac{P_{\theta}(x_i)}{P_{\theta^*}(x_i)} \leq c \log(\mathcal{N}_{\Theta}(1/T)\, T/\delta).
\]

The following lemma says that any model whose log-likelihood on the historical data is close to that of the true model induces a data distribution close to that of the true model.

Lemma B.2.

There exists an absolute constant $c$ such that for any $\delta \in (0,1)$, with probability at least $1-\delta$, for all $t \in [T]$ and $\theta \in \Theta$,

\[
d^2_{TV}(P_{\theta}, P_{\theta^*}) \leq \frac{c}{t} \left( \sum_{i=1}^{t} \log \frac{P_{\theta^*}(x_i)}{P_{\theta}(x_i)} + \log(\mathcal{N}_{\Theta}(1/T)\, T/\delta) \right),
\]

where $d_{TV}$ denotes the total variation distance.

The two lemmas above follow directly from [Liu et al., 2023, Proposition B.1] and [Liu et al., 2023, Proposition B.2], respectively, whose analyses build on the classical analysis of MLE [Geer, 2000] and the “tangent” sequence analysis in [Zhang, 2006, Agarwal et al., 2020], respectively. The following lemma is a direct corollary of Lemma B.1 and Lemma B.2.

Lemma B.3.

Let $\hat{\theta}_t \in \operatorname*{arg\,sup}_{\theta \in \Theta} \sum_{i=1}^{t} \log P_{\theta}(x_i)$. Define the version space:

\[
\Theta_t := \left\{ \theta \in \Theta : \sum_{i=1}^{t} \log P_{\theta}(x_i) \geq \sum_{i=1}^{t} \log P_{\hat{\theta}_t}(x_i) - c \log(\mathcal{N}_{\Theta}(1/T)\, T/\delta) \right\}.
\]

Then, with probability at least $1-\delta$, for all $t \in [T]$, we have $\theta^* \in \Theta_t$ and

\[
\max_{\theta \in \Theta_t} d_{TV}(P_{\theta}, P_{\theta^*}) \leq c \sqrt{\frac{\log(\mathcal{N}_{\Theta}(1/T)\, T/\delta)}{t}}.
\]
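
Proof sketch of Lemma B.3 (our own expansion of the corollary claim; constants are not optimized, and the shorthand $\iota$ is introduced only here). Write $\iota := \log(\mathcal{N}_{\Theta}(1/T)\, T/\delta)$. Applying Lemma B.1 with $\theta = \hat{\theta}_t$ gives
\[
\sum_{i=1}^{t} \log P_{\theta^*}(x_i) \geq \sum_{i=1}^{t} \log P_{\hat{\theta}_t}(x_i) - c\,\iota,
\]
that is, $\theta^* \in \Theta_t$. Next, for any $\theta \in \Theta_t$, the optimality of $\hat{\theta}_t$ followed by the definition of $\Theta_t$ gives
\[
\sum_{i=1}^{t} \log \frac{P_{\theta^*}(x_i)}{P_{\theta}(x_i)} \leq \sum_{i=1}^{t} \log P_{\hat{\theta}_t}(x_i) - \sum_{i=1}^{t} \log P_{\theta}(x_i) \leq c\,\iota.
\]
Substituting this bound into Lemma B.2 yields
\[
d^2_{TV}(P_{\theta}, P_{\theta^*}) \leq \frac{c}{t}\left(c\,\iota + \iota\right) = \frac{c(c+1)\,\iota}{t},
\qquad\text{so}\qquad
d_{TV}(P_{\theta}, P_{\theta^*}) \leq \sqrt{c(c+1)}\,\sqrt{\frac{\iota}{t}}.
\]
Both lemmas hold simultaneously with probability at least $1-2\delta$ by the union bound; rescaling $\delta$ changes only the absolute constant, which is why the constant in Lemma B.3 need not equal the one in Lemmas B.1 and B.2.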

B.2 Proof of Theorem 4

We first introduce notation that we will use throughout our proofs. We denote by $N^t_h$ and $\Theta_i^t$ the counters $N_h$ and the parameter confidence sets $\Theta_i$ at the beginning of episode $t$.

Lemma B.4 (Optimism).

With probability at least $1-\delta$, for all $(h,s,\pi,t)$, we have

\[
\bar{V}_h^{\pi}(s) \geq V_h^{\pi,f(\pi)}(s).
\]
Proof of Lemma B.4.

We will prove a stronger statement: for any $(h,s,a,b,\pi)$, we have

\[
\bar{Q}_h^{\pi}(s,a,b) \geq Q_h^{\pi,f(\pi)}(s,a,b) \text{ and } \bar{V}_h^{\pi}(s) \geq V_h^{\pi,f(\pi)}(s).
\]

We proceed by backward induction on $h \in [H+1]$. For $h = H+1$, the claim in the lemma trivially holds. Assume by induction that the claim holds for some $h+1$. We will prove that it holds for $h$. Indeed, for any $(s,a,b)$ such that $\bar{Q}_h^{\pi}(s,a,b) = H-h+1$, we trivially have $\bar{Q}_h^{\pi}(s,a,b) \geq Q_h^{\pi,f(\pi)}(s,a,b)$. For any $(s,a,b)$ such that $\bar{Q}_h^{\pi}(s,a,b) < H-h+1$, we have

\begin{align*}
\bar{Q}_h^{\pi}(s,a,b) - Q_h^{\pi,f(\pi)}(s,a,b) &= [\hat{P}_h \bar{V}_{h+1}^{\pi}](s,a,b) + r_h(s,a,b) + \beta(N_h(s,a,b)) \\
&\quad - \left([P_h V_{h+1}^{\pi,f(\pi)}](s,a,b) + r_h(s,a,b)\right) \\
&\geq [(\hat{P}_h - P_h) V_{h+1}^{\pi,f(\pi)}](s,a,b) + \beta(N_h(s,a,b)) \\
&\geq 0,
\end{align*}

where the first inequality uses the induction assumption that $\bar{V}_{h+1}^{\pi} \geq V^{\pi,f(\pi)}_{h+1}$ and the last inequality uses Hoeffding’s inequality and the union bound. In addition, it follows from Lemma B.3 and the union bound that, with probability at least $1-\delta$, for any $(t,h,s,\pi)$, we have

\[
f(\pi)_h(\cdot|s) \in P_{\Theta^t_{h s \pi_h(s)}}.
\]

Under the same event wherein the above relation holds, we have

\begin{align*}
\bar{V}_h^{\pi}(s) &= \max_{\theta \in \Theta_{h s \pi_h(s)}} \bar{Q}^{\pi}_h(s, \pi_h(s), P_{\theta}) \\
&\geq \bar{Q}^{\pi}_h(s, \pi_h(s), f(\pi)_h) \\
&\geq Q_h^{\pi,f(\pi)}(s, \pi_h(s), f(\pi)_h) \\
&= V_h^{\pi,f(\pi)}(s).
\end{align*}

This completes the induction step for $h$ and thus completes the proof. ∎

Proof of Theorem 4.

By the optimism of $\bar{V}$ (Lemma B.4), with probability at least $1-\delta$, for all $(t,\pi)$, we have

\begin{align*}
V_1^{\pi,f(\pi)}(s_1^t) - V_1^{\pi^t,f(\pi^t)}(s_1^t) &\leq \bar{V}_1^{\pi}(s_1^t) - V_1^{\pi^t,f(\pi^t)}(s_1^t) \leq \bar{V}_1^{\pi^t}(s_1^t) - V_1^{\pi^t,f(\pi^t)}(s_1^t) = \Delta_1^t,
\end{align*}

where the second inequality follows from Line 4 of Algorithm 1, and the last equality holds by the following definition:

\[
\Delta_h^t := \bar{V}_h^{\pi^t}(s^t_h) - V_h^{\pi^t,f(\pi^t)}(s_h^t), \quad \forall (t,h).
\]

We now decompose $\Delta_h^t$ as follows:

\begin{align*}
\Delta_h^t &= \max_{\theta \in \Theta^t_{h s_h^t a^t_h}} \bar{Q}^{\pi^t}_h(s^t_h, a^t_h, P_{\theta}) - Q_h^{\pi^t,f(\pi^t)}(s^t_h, a^t_h, f(\pi^t)_h) \\
&= \underbrace{\bar{Q}^{\pi^t}_h(s^t_h, a^t_h, b^t_h) - Q^{\pi^t,f(\pi^t)}_h(s^t_h, a^t_h, b^t_h)}_{=:\xi^t_h} \\
&\quad + \underbrace{\bar{Q}^{\pi^t}_h(s^t_h, a^t_h, f(\pi^t)_h) - \bar{Q}^{\pi^t}_h(s^t_h, a^t_h, b^t_h) + Q^{\pi^t,f(\pi^t)}_h(s^t_h, a^t_h, b^t_h) - Q_h^{\pi^t,f(\pi^t)}(s^t_h, a^t_h, f(\pi^t)_h)}_{=:\zeta^t_h} \\
&\quad + \underbrace{\max_{\theta \in \Theta^t_{h s_h^t a^t_h}} \bar{Q}^{\pi^t}_h(s^t_h, a^t_h, P_{\theta}) - \bar{Q}^{\pi^t}_h(s^t_h, a^t_h, f(\pi^t)_h)}_{=:\gamma^t_h}.
\end{align*}

We will bound each of $\xi^{t}_{h}$, $\zeta^{t}_{h}$, and $\gamma^{t}_{h}$ separately as follows.

Bounding $\{\xi^{t}_{h}\}$.

For simplicity, we denote $x^{t}_{h}=(s^{t}_{h},a^{t}_{h},b^{t}_{h})$. We define

\begin{align*}
V^{*}_{h+1}(s)=\sup_{\pi\in\Pi}V_{h+1}^{\pi,f(\pi)}(s),\quad\forall s.
\end{align*}

Note that the definition above does not require the existence of an optimal policy $\pi^{*}$ such that $V^{*}_{h}(s)=V_{h}^{\pi^{*},f(\pi^{*})}(s)$ for all $(h,s)$. If $\bar{Q}^{\pi^{t}}_{h}(x^{t}_{h})=H-h+1$, it is trivial that $\xi^{t}_{h}\leq 0$. Thus, we only need to consider the case $\bar{Q}^{\pi^{t}}_{h}(x^{t}_{h})<H-h+1$, in which

\begin{align*}
\xi^{t}_{h}
&=[\hat{P}^{t}_{h}\bar{V}_{h+1}^{\pi^{t}}](x^{t}_{h})+\beta(N_{h}^{t}(x^{t}_{h}))-[P_{h}V_{h+1}^{\pi^{t},f(\pi^{t})}](x^{t}_{h})\\
&=[(\hat{P}^{t}_{h}-P_{h})V^{*}_{h+1}](x^{t}_{h})+[(\hat{P}^{t}_{h}-P_{h})(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})](x^{t}_{h})\\
&\quad+[P_{h}(\bar{V}_{h+1}^{\pi^{t}}-V_{h+1}^{\pi^{t},f(\pi^{t})})](x^{t}_{h})+\beta(N_{h}^{t}(x^{t}_{h}))\\
&\leq[(\hat{P}^{t}_{h}-P_{h})(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})](x^{t}_{h})+[P_{h}(\bar{V}_{h+1}^{\pi^{t}}-V_{h+1}^{\pi^{t},f(\pi^{t})})](x^{t}_{h})+2\beta(N_{h}^{t}(x^{t}_{h})).
\end{align*}

By Bernstein's inequality, with probability at least $1-\delta$, for all $(s,a,b,s^{\prime},h,t)$ and with $\iota:=\log(2S^{2}ABHT/\delta)$, we have

\begin{align*}
\hat{P}^{t}_{h}(s^{\prime}|s,a,b)-P_{h}(s^{\prime}|s,a,b)
&\leq\frac{\iota}{N_{h}^{t}(s,a,b)}+\sqrt{\frac{2P_{h}(s^{\prime}|s,a,b)\iota}{N_{h}^{t}(s,a,b)}}\\
&\leq\frac{1}{H}P_{h}(s^{\prime}|s,a,b)+\frac{H\iota}{2N^{t}_{h}(s,a,b)}+\frac{\iota}{N_{h}^{t}(s,a,b)}\\
&=\frac{1}{H}P_{h}(s^{\prime}|s,a,b)+\left(1+\frac{H}{2}\right)\frac{\iota}{N_{h}^{t}(s,a,b)},
\end{align*}

where the first inequality holds even when $N^{t}_{h}(s,a,b)=0$, and the second inequality follows from the AM-GM inequality. Thus, with probability at least $1-\delta$, for all $(t,h)$, we have

\begin{align*}
[(\hat{P}^{t}_{h}-P_{h})(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})](x^{t}_{h})\leq\frac{SH(1+H/2)\iota}{N_{h}^{t}(x^{t}_{h})}+\frac{1}{H}[P_{h}(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})](x^{t}_{h}).
\end{align*}
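To make the AM-GM step explicit, here is a short worked version; the symbols $u$, $v$, and $g$ are introduced only for this illustration and are not part of the original proof. For $u,v\geq 0$, AM-GM gives $\sqrt{uv}\leq(u+v)/2$. Taking $u=2P_{h}(s^{\prime}|s,a,b)/H$ and $v=H\iota/N^{t}_{h}(s,a,b)$ yields

\begin{align*}
\sqrt{\frac{2P_{h}(s^{\prime}|s,a,b)\iota}{N^{t}_{h}(s,a,b)}}=\sqrt{uv}\leq\frac{u+v}{2}=\frac{1}{H}P_{h}(s^{\prime}|s,a,b)+\frac{H\iota}{2N^{t}_{h}(s,a,b)}.
\end{align*}

The bound on $[(\hat{P}^{t}_{h}-P_{h})(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})](x^{t}_{h})$ can then be obtained by weighting the entrywise inequality by $g(s^{\prime}):=(\bar{V}_{h+1}^{\pi^{t}}-V^{*}_{h+1})(s^{\prime})$ and summing over the $S$ next states. This sketch assumes $0\leq g(s^{\prime})\leq H$ for all $s^{\prime}$, a condition that is not restated in this part of the proof:

\begin{align*}
\sum_{s^{\prime}}(\hat{P}^{t}_{h}-P_{h})(s^{\prime}|x^{t}_{h})\,g(s^{\prime})
&\leq\frac{1}{H}\sum_{s^{\prime}}P_{h}(s^{\prime}|x^{t}_{h})\,g(s^{\prime})+\left(1+\frac{H}{2}\right)\frac{\iota}{N^{t}_{h}(x^{t}_{h})}\sum_{s^{\prime}}g(s^{\prime})\\
&\leq\frac{1}{H}[P_{h}g](x^{t}_{h})+\frac{SH(1+H/2)\iota}{N^{t}_{h}(x^{t}_{h})}.
\end{align*}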

Plugging this inequality into the bound on $\xi^{t}_{h}$ above, with probability at least $1-\delta$, for all $(t,h)$,

\begin{align*}
\xi^{t}_{h}
&\leq\frac{SH(1+H/2)\iota}{N_{h}^{t}(x^{t}_{h})}+\left(1+\frac{1}{H}\right)\left((\bar{V}_{h+1}^{\pi^{t}}-V_{h+1}^{\pi^{t},f(\pi^{t})})(s^{t}_{h+1})+\epsilon^{t}_{h+1}\right)+2\beta(N^{t}_{h}(x^{t}_{h}))\\
&\leq\frac{3SH^{2}\iota}{2N^{t}_{h}(x^{t}_{h})}+\left(1+\frac{1}{H}\right)\left(\Delta^{t}_{h+1}+\epsilon^{t}_{h+1}\right)+2\beta(N^{t}_{h}(x^{t}_{h})),
\end{align*}

where we define

\begin{align*}
\epsilon^{t}_{h+1}:=[P_{h}(\bar{V}_{h+1}^{\pi^{t}}-V_{h+1}^{\pi^{t},f(\pi^{t})})](x^{t}_{h})-(\bar{V}_{h+1}^{\pi^{t}}-V_{h+1}^{\pi^{t},f(\pi^{t})})(s^{t}_{h+1}).
\end{align*}

Bounding $\sum_{t}\zeta^{t}_{h}$ and $\sum_{t}\epsilon^{t}_{h}$.

Note that for all $h$, $\{\zeta^{t}_{h}\}_{t\in[T]}$ and $\{\epsilon^{t}_{h}\}_{t\in[T]}$ are martingale difference sequences. Thus, by the Azuma-Hoeffding inequality and the union bound, with probability at least $1-\delta$, we have

\begin{align*}
\sum_{t,h}\zeta^{t}_{h}\lesssim H^{2}\sqrt{T\log(H/\delta)},\quad\text{and}\quad\sum_{t,h}\epsilon^{t}_{h}\lesssim H^{2}\sqrt{T\log(H/\delta)}.
\end{align*}

Bounding $\{\gamma^{t}_{h}\}$.

By Lemma B.3 and the union bound, with probability at least $1-\delta$, for all $(t,h)$, we have

\begin{align*}
\gamma^{t}_{h}
&=\max_{\theta\in\Theta^{t}_{hs_{h}^{t}a^{t}_{h}}}\bar{Q}^{\pi^{t}}_{h}(s^{t}_{h},a^{t}_{h},P_{\theta})-\bar{Q}^{\pi^{t}}_{h}(s^{t}_{h},a^{t}_{h},f(\pi^{t})_{h})\\
&\leq 2H\max_{\theta\in\Theta^{t}_{hs^{t}_{h}a^{t}_{h}}}d_{TV}(P_{\theta},f(\pi^{t})_{h}(\cdot|s^{t}_{h}))\\
&\lesssim H\sqrt{\frac{\alpha}{N^{t}_{h}(s^{t}_{h},a^{t}_{h})}}.
\end{align*}
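The step from a difference of $\bar{Q}$ values to a total variation distance is a standard change-of-measure argument. As a sketch, assume that $\bar{Q}^{\pi^{t}}_{h}(s,a,\nu)$ denotes the average $\sum_{b}\nu(b)\bar{Q}^{\pi^{t}}_{h}(s,a,b)$ over the opponent's action $b\sim\nu$, and that $0\leq\bar{Q}^{\pi^{t}}_{h}\leq H$; both are assumptions made here for illustration, and $\nu$, $\mu$ are generic distributions over the opponent's actions introduced only for this purpose. Then

\begin{align*}
\bar{Q}^{\pi^{t}}_{h}(s,a,\nu)-\bar{Q}^{\pi^{t}}_{h}(s,a,\mu)
&=\sum_{b}\left(\nu(b)-\mu(b)\right)\bar{Q}^{\pi^{t}}_{h}(s,a,b)\\
&\leq H\sum_{b}\left|\nu(b)-\mu(b)\right|
=2H\,d_{TV}(\nu,\mu),
\end{align*}

where $d_{TV}(\nu,\mu)=\frac{1}{2}\sum_{b}|\nu(b)-\mu(b)|$. Applying this with $\nu=P_{\theta}$ and $\mu=f(\pi^{t})_{h}(\cdot|s^{t}_{h})$, and then maximizing over $\theta$, gives a bound of the same form as the first inequality for $\gamma^{t}_{h}$.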

Plugging these bounds into the definition of $\Delta^{t}_{h}$, combining them via the union bound, and rescaling $\delta$, we obtain that, with probability at least $1-\delta$, for all $(t,h,\pi)$,

\begin{align*}
\Delta^{t}_{h}
&=\xi^{t}_{h}+\zeta^{t}_{h}+\gamma^{t}_{h}\\
&\lesssim\frac{3SH^{2}\iota}{2N^{t}_{h}(x^{t}_{h})}+\left(1+\frac{1}{H}\right)\left(\Delta^{t}_{h+1}+\epsilon^{t}_{h+1}\right)+2\beta(N^{t}_{h}(x^{t}_{h}))+\zeta^{t}_{h}+H\sqrt{\frac{\alpha}{N^{t}_{h}(s^{t}_{h},a^{t}_{h})}}.
\end{align*}

Thus, we have with probability at least 1δ1𝛿1-\delta1 - italic_δ, we have

\begin{align*}
\sum_{t=1}^{T}\Delta^{t}_{1}
&\lesssim\sum_{t=1}^{T}\left(1+\frac{1}{H}\right)^{H}\sum_{h=1}^{H}\left(\frac{3SH^{2}\iota}{2N^{t}_{h}(x^{t}_{h})}+\epsilon^{t}_{h+1}+\zeta^{t}_{h}+2\beta(N^{t}_{h}(x^{t}_{h}))+H\sqrt{\frac{\alpha}{N^{t}_{h}(s^{t}_{h},a^{t}_{h})}}\right)\\
&\lesssim SH^{2}\iota\sum_{t,h}\frac{1}{N^{t}_{h}(x^{t}_{h})}+H^{2}\sqrt{T\log(H/\delta)}+H\sqrt{\log(HSABT|\Pi|/\delta)}\sum_{t,h}\frac{1}{\sqrt{N^{t}_{h}(x^{t}_{h})}}\\
&\quad+H\sqrt{\alpha}\sum_{t,h}\frac{1}{\sqrt{N^{t}_{h}(s^{t}_{h},a^{t}_{h})}}.
\end{align*}

Finally, note that

\begin{align*}
\sum_{t,h}\frac{1}{N^{t}_{h}(x^{t}_{h})}
&=\sum_{h}\sum_{(s,a,b)}\sum_{i=1}^{N^{T}_{h}(s,a,b)}\frac{1}{i}
\leq\sum_{h}\sum_{(s,a,b):N^{T}_{h}(s,a,b)\geq 1}\log N^{T}_{h}(s,a,b)
\leq HSAB\log T.\\
\sum_{t,h}\frac{1}{\sqrt{N^{t}_{h}(x^{t}_{h})}}
&=\sum_{h}\sum_{(s,a,b)}\sum_{i=1}^{N^{T}_{h}(s,a,b)}\frac{1}{\sqrt{i}}
\leq\sum_{h}\sum_{(s,a,b)}\sqrt{N^{T}_{h}(s,a,b)}
\leq\sqrt{HSAB}\sqrt{\sum_{(h,s,a,b)}N^{T}_{h}(s,a,b)}
=H\sqrt{SABT}.\\
\sum_{t,h}\frac{1}{\sqrt{N^{t}_{h}(s^{t}_{h},a^{t}_{h})}}
&=\sum_{(h,s,a)}\sum_{i=1}^{N^{T}_{h}(s,a)}\frac{1}{\sqrt{i}}
\leq\sum_{h,s,a}\sqrt{N^{T}_{h}(s,a)}
\leq\sqrt{HSA}\sqrt{\sum_{h,s,a}N^{T}_{h}(s,a)}
=H\sqrt{SAT}.
\end{align*}

Plugging these three inequalities into the preceding bound for $\sum_{t=1}^{T}\Delta^{t}_{1}$ and rescaling $\delta$ completes the proof. ∎
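The three counting bounds used at the end of the proof above can be checked numerically. The following is a minimal sketch on a randomly generated visit sequence; the problem sizes and the helper name `pigeonhole_sums` are assumptions made for illustration only. With explicit constants, the elementary facts are $\sum_{i=1}^{N}1/i\leq 1+\log N$ and $\sum_{i=1}^{N}1/\sqrt{i}\leq 2\sqrt{N}$, so the check uses these slightly larger constants, which do not matter for a bound stated up to $\lesssim$.

```python
import math
import random


def pigeonhole_sums(visits):
    """Return (sum of 1/N, sum of 1/sqrt(N)) over a visit sequence,
    where N is the running count of the visited key, current visit included."""
    counts = {}
    inv_sum = 0.0
    inv_sqrt_sum = 0.0
    for key in visits:
        counts[key] = counts.get(key, 0) + 1
        n = counts[key]
        inv_sum += 1.0 / n
        inv_sqrt_sum += 1.0 / math.sqrt(n)
    return inv_sum, inv_sqrt_sum


# Hypothetical problem sizes, chosen only for illustration.
H, S, A, B, T = 3, 4, 2, 2, 500
rng = random.Random(0)
# One visited (h, s, a, b) tuple per layer h in each of the T episodes.
visits = [
    (h, rng.randrange(S), rng.randrange(A), rng.randrange(B))
    for _ in range(T)
    for h in range(H)
]
inv_sum, inv_sqrt_sum = pigeonhole_sums(visits)

# Bounds with explicit constants (1 + log T instead of log T, and a factor 2).
bound_inv = H * S * A * B * (1.0 + math.log(T))
bound_inv_sqrt = 2.0 * H * math.sqrt(S * A * B * T)
print(inv_sum <= bound_inv, inv_sqrt_sum <= bound_inv_sqrt)
```

The second bound is the Cauchy-Schwarz step: there are at most $HSAB$ distinct tuples and their counts sum to $HT$.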

B.3 Proof of Theorem 5

The layerwise exploration stage (Algorithm 4) explores each layer $h\in[H]$ in turn and collects the infrequent transitions it identifies into a set $\mathcal{U}$. Since infrequent transitions do not significantly affect policy evaluation (this is made precise later), we can exclude them and avoid exploring them extensively. However, excluding them changes the underlying distribution of the experiences that the learner receives when interacting with the environment. To handle this bias, it is convenient to consider an “absorbing” Markov game $M'$, a refinement of the original Markov game $M$ that excludes all infrequent transitions.

Definition 5 (Absorbing Markov games).

Given a Markov game $M=(\mathcal{S},\mathcal{A},\mathcal{B},r,P,H)$, a set of transitions $\mathcal{U}$, and a dummy state $s^{\dagger}$, an absorbing Markov game $M'=(\mathcal{S}\cup\{s^{\dagger}\},\mathcal{A},\mathcal{B},r,\tilde{P},H)$ w.r.t. $(M,\mathcal{U},s^{\dagger})$ is defined as follows: for any $(h,s,a,b,s')\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathcal{S}$,

\[
\tilde{P}_{h}(s'|s,a,b)=\begin{cases}P_{h}(s'|s,a,b)&\text{if }(h,s,a,b)\notin\mathcal{U},\\ 0&\text{if }(h,s,a,b)\in\mathcal{U},\end{cases}
\]

$\tilde{P}_{h}(s^{\dagger}|s,a,b)=1-\sum_{s'\in\mathcal{S}}\tilde{P}_{h}(s'|s,a,b)$ and $\tilde{P}_{h}(s^{\dagger}|s^{\dagger},a,b)=1$.
In addition,
\[
r_{h}(s,a,b)=\begin{cases}r_{h}(s,a,b)&\text{if }s\in\mathcal{S},\\ 0&\text{if }s=s^{\dagger},\end{cases}
\quad
\pi_{h}(\cdot|s)=\begin{cases}\pi_{h}(\cdot|s)&\text{if }s\in\mathcal{S},\\ \text{arbitrary}&\text{if }s=s^{\dagger},\end{cases}
\quad
\mu_{h}(\cdot|s)=\begin{cases}\mu_{h}(\cdot|s)&\text{if }s\in\mathcal{S},\\ \text{arbitrary}&\text{if }s=s^{\dagger}.\end{cases}
\]
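The construction in Definition 5 can be sketched in a few lines of code. This is a minimal illustration and not the paper's implementation: the dictionary layout, the function name `absorbing_kernel`, and the label `"s_dagger"` are assumptions. The sketch also assumes that $\mathcal{U}$ is indexed by full transitions $(h,s,a,b,s')$, as in the statement of Lemma B.5; if $\mathcal{U}$ is instead indexed by $(h,s,a,b)$ as the case split is written, all next states of an excluded tuple would be dropped.

```python
def absorbing_kernel(P, U, actions_a, actions_b, horizon, dagger="s_dagger"):
    """Build the absorbing kernel from P and a set U of excluded transitions.

    P maps (h, s, a, b) to a dict {next_state: probability}.
    U is a set of tuples (h, s, a, b, next_state) (assumed indexing).
    """
    P_tilde = {}
    for (h, s, a, b), dist in P.items():
        # Excluded transitions get probability 0 (they are simply omitted).
        kept = {s2: p for s2, p in dist.items() if (h, s, a, b, s2) not in U}
        # The removed mass is redirected to the dummy state.
        kept[dagger] = 1.0 - sum(kept.values())
        P_tilde[(h, s, a, b)] = kept
    # The dummy state is absorbing under every action pair.
    for h in range(1, horizon + 1):
        for a in actions_a:
            for b in actions_b:
                P_tilde[(h, dagger, a, b)] = {dagger: 1.0}
    return P_tilde


# Hypothetical one-step example: the rare transition to "s1" is excluded.
P = {(1, "s0", "a0", "b0"): {"s0": 0.9, "s1": 0.1}}
U = {(1, "s0", "a0", "b0", "s1")}
P_tilde = absorbing_kernel(P, U, ["a0"], ["b0"], horizon=1)
```

In this example the mass 0.1 that went to `"s1"` is moved to the dummy state, so each row of `P_tilde` still sums to one.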

Let $\tilde{P}^{k}$ be the absorbing transition kernels w.r.t. $(M,\mathcal{U}^{k},s^{\dagger})$ (Definition 5). Notice that the transition dynamics $\hat{P}^{k}$ constructed by Algorithm 4 are unbiased estimates of the absorbing transition dynamics $\tilde{P}$.

B.3.1 Sampling policies are sufficiently exploratory

We now show that the sampling policies in the reward-free exploration stage are sufficiently exploratory over the state-action space of the Markov game. We start by bounding the difference between $\tilde{P}$ and $\hat{P}^{k}$ (Line 5 of Algorithm 3) using the empirical Bernstein inequality.

Lemma B.5.

Define the event $E$: for all $(k,h,s,a,b,s')\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathcal{S}$ such that $(h,s,a,b,s')\notin\mathcal{U}$,

\[
|\hat{P}^{k}_{h}(s'|s,a,b)-\tilde{P}^{k}_{h}(s'|s,a,b)|\leq\sqrt{\frac{2\hat{P}^{k}_{h}(s'|s,a,b)\iota}{N^{k}_{h}(s,a,b)}}+\frac{7\iota}{3N^{k}_{h}(s,a,b)},
\]

where $\iota:=c\log(SABHK/\delta)$ and $N^{k}_{h}$ is the counter at layer $h$ in epoch $k$, obtained at Line 9 when running Algorithm 4 in epoch $k$. Then, we have $\Pr(E)\geq 1-\delta$. In addition, for all $(h,s,a,b,s')\in\mathcal{U}$, $\hat{P}^{k}_{h}(s'|s,a,b)=\tilde{P}_{h}(s'|s,a,b)=0$.

Proof of Lemma B.5.

Lemma B.5 is essentially the analogue of [Qiao et al., 2022, Lemma E.2], extended from MDPs to Markov games. The first part follows from the empirical Bernstein inequality and a union bound. The second part follows from the definition of the absorbing transition kernels $\tilde{P}$ and the construction of the empirical transition kernels $\hat{P}^{k}$. ∎
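To make the width in Lemma B.5 concrete, the right-hand side can be computed directly from an empirical probability and a visit count. The sketch below is illustrative only: the function names are assumptions, and the absolute constant $c$ in $\iota$ is not specified in the text, so the default `c=1.0` is a placeholder.

```python
import math


def log_factor(S, A, B, H, K, delta, c=1.0):
    """iota = c * log(S*A*B*H*K / delta); c is an unspecified absolute constant."""
    return c * math.log(S * A * B * H * K / delta)


def bernstein_width(p_hat, n, iota):
    """Right-hand side of the bound in Lemma B.5 for a single transition."""
    if n <= 0:
        raise ValueError("the visit count n must be positive")
    return math.sqrt(2.0 * p_hat * iota / n) + 7.0 * iota / (3.0 * n)


# Hypothetical sizes: the width shrinks as the visit count grows.
iota = log_factor(S=5, A=3, B=3, H=10, K=4, delta=0.05)
widths = [bernstein_width(0.2, n, iota) for n in (100, 1000, 10000)]
```

The first term scales as $\sqrt{\hat{P}\iota/N}$ and dominates for large counts, while the second term of order $\iota/N$ dominates when the empirical probability is close to zero.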

Lemma B.6.

Conditioned on the event $E$ in Lemma B.5, for all $(k,h,s,a,b,s')\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathcal{S}$ such that $(h,s,a,b,s')\notin\mathcal{U}$, we have

\[
\left(1-\frac{1}{H}\right)\hat{P}^{k}_{h}(s'|s,a,b)\leq\tilde{P}^{k}_{h}(s'|s,a,b)\leq\left(1+\frac{1}{H}\right)\hat{P}^{k}_{h}(s'|s,a,b).
\]
Proof of Lemma B.6.

Lemma B.6 is essentially the same as [Qiao et al., 2022, Lemma E.3]. ∎
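A short calculation indicates how a multiplicative bound of this form can follow from Lemma B.5. The sketch assumes that every transition outside $\mathcal{U}$ satisfies $N^{k}_{h}(s,a,b)\,\hat{P}^{k}_{h}(s'|s,a,b)\geq CH^{2}\iota$ for some constant $C\geq 6$; this threshold is an assumption made here for illustration and is not stated in this section. Writing $\hat{p}:=\hat{P}^{k}_{h}(s'|s,a,b)$, $\tilde{p}:=\tilde{P}^{k}_{h}(s'|s,a,b)$ and $N:=N^{k}_{h}(s,a,b)$, the event $E$ gives

\begin{align*}
|\hat{p}-\tilde{p}|
&\leq\sqrt{\frac{2\hat{p}\,\iota}{N}}+\frac{7\iota}{3N}
=\hat{p}\sqrt{\frac{2\iota}{N\hat{p}}}+\hat{p}\,\frac{7\iota}{3N\hat{p}}\\
&\leq\hat{p}\left(\sqrt{\frac{2}{CH^{2}}}+\frac{7}{3CH^{2}}\right)
\leq\frac{\hat{p}}{H}\left(\sqrt{\frac{1}{3}}+\frac{7}{18}\right)
\leq\frac{\hat{p}}{H},
\end{align*}

where the last two steps use $C\geq 6$, $H\geq 1$ and $\sqrt{1/3}+7/18<1$. Rearranging yields $(1-\frac{1}{H})\hat{p}\leq\tilde{p}\leq(1+\frac{1}{H})\hat{p}$.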

Lemma B.7.

Conditioned on the event $E$ in Lemma B.5, for all $(k,h,s,a,b,s')\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathcal{S}$ and any policy $\pi$, we have

\[
\frac{1}{4}V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k})\leq V^{\pi,f([\pi]^{m})}(1_{hsab},\tilde{P}^{k})\leq 3V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k}),
\]

where $V^{\pi,\mu}(r,P)$ denotes the expected total reward under policies $(\pi,\mu)$ in the Markov game specified by the reward function $r$ and transition kernels $P$.

Proof of Lemma B.7.

The proof essentially follows from the proof of [Qiao et al., 2022, Lemma E.5]. ∎
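One way to see where the constants $\frac{1}{4}$ and $3$ can come from is to compound the per-step factors $(1\pm\frac{1}{H})$ of Lemma B.6 over at most $H$ steps. This reading of the argument is an assumption about the cited proof; the check below only confirms the numerical facts $(1-\frac{1}{H})^{H}\geq\frac{1}{4}$ and $(1+\frac{1}{H})^{H}\leq 3$ for $H\geq 2$. The function name is hypothetical.

```python
def compounding_factors(H):
    """Return ((1 - 1/H)^H, (1 + 1/H)^H), the factors after H compounded steps."""
    lower = (1.0 - 1.0 / H) ** H
    upper = (1.0 + 1.0 / H) ** H
    return lower, upper


# The lower factor increases towards 1/e and the upper factor towards e,
# so the extreme case for the lower bound is H = 2.
all_ok = all(
    lower >= 0.25 and upper <= 3.0
    for lower, upper in (compounding_factors(H) for H in range(2, 2001))
)
print(all_ok)
```

For $H=1$ the lower factor is zero, so this compounding argument needs $H\geq 2$.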

Lemma B.5 to Lemma B.7 are similar in nature to the corresponding lemmas for single-agent MDPs in [Qiao et al., 2022]. We now prove a new lemma that is absent in the single-agent MDP setting yet crucial to our theorem. Recall the notation $\bar{V}^{\pi}(r,P,\Theta):=\textrm{OPTIMISTIC\_VALUE\_ESTIMATE}(\pi,r,P,\Theta)$, which is given in Algorithm 6.

Lemma B.8.

Fix any $k\in[K]$ and consider $\hat{P}^{k},\Theta^{k},\mathcal{U}^{k}=\textrm{LAYERWISE\_EXPLORATION}(\Pi^{k},T_{k})$ (Line 5 of Algorithm 3). Define the event $E_{k}$: for all $(h,s,a,b)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}$ and all $\pi\in\Pi$, we have

\[
0\leq\bar{V}^{\pi}(1_{hsab},\hat{P}^{k},\Theta^{k})-V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k})\leq\xi_{MLE}(T_{k}),
\]

where $\xi_{MLE}(T_{k}):=cH\sqrt{\frac{\alpha}{d^{*}T_{k}}}$. Assume that $T$ is sufficiently large such that $T_{k}\geq\frac{2\log(SHKA/\delta)}{{d^{*}}^{2}}$ for all $k\in[K]$. Then, $\Pr(E_{k})\geq 1-\delta$.
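The two quantities in Lemma B.8 are simple to evaluate, which helps when judging how long an epoch must be. The following sketch only transcribes the two formulas; the function names are assumptions, the constant $c$ is unspecified in the text (the default `c=1.0` is a placeholder), and $\alpha$ and $d^{*}$ are treated as given inputs.

```python
import math


def xi_mle(H, alpha, d_star, T_k, c=1.0):
    """xi_MLE(T_k) = c * H * sqrt(alpha / (d_star * T_k))."""
    return c * H * math.sqrt(alpha / (d_star * T_k))


def min_epoch_length(S, H, K, A, delta, d_star):
    """Smallest T_k allowed by the condition T_k >= 2 log(SHKA/delta) / d_star^2."""
    return 2.0 * math.log(S * H * K * A / delta) / d_star ** 2


# Hypothetical values: quadrupling the epoch length halves the error term.
err_short = xi_mle(H=2, alpha=4.0, d_star=1.0, T_k=4)
err_long = xi_mle(H=2, alpha=4.0, d_star=1.0, T_k=16)
```

The error decays at rate $1/\sqrt{T_{k}}$, while the minimum epoch length grows as $1/{d^{*}}^{2}$.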

Proof of Lemma B.8.

Let us fix any $(h,s,a,b)$ and $\pi$. Note that the value function for any policy under any dynamics w.r.t. the reward function $1_{hsab}$ is zero at any step $h'>h$. Also, notice that, prior to the exploration of layer $h$ in the reward-free exploration (Algorithm 4), $\hat{P}^{k}_{1},\ldots,\hat{P}^{k}_{h-1}$ are already constructed.
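The observation about the indicator reward $1_{hsab}$ can be illustrated with a small backward induction. This is a sketch under simplifying assumptions: the opponent's response $f([\pi]^{m})$ is replaced by a fixed Markov policy `mu`, and the data layout and function name are hypothetical. With this reward, the value at step $1$ is the probability of visiting $(s,a,b)$ at step $h$, and all values after step $h$ are zero.

```python
def indicator_values(P, pi, mu, states, actions_a, actions_b, H, target):
    """Backward induction for the reward 1_{hsab}, with target = (h, s, a, b).

    P maps (step, s, a, b) to {next_state: prob}; pi[step][s][a] and
    mu[step][s][b] are action probabilities of the two players.
    """
    V = {H + 1: {s: 0.0 for s in states}}
    for step in range(H, 0, -1):
        V[step] = {}
        for s in states:
            total = 0.0
            for a in actions_a:
                for b in actions_b:
                    weight = pi[step][s][a] * mu[step][s][b]
                    reward = 1.0 if (step, s, a, b) == target else 0.0
                    nxt = sum(p * V[step + 1][s2]
                              for s2, p in P[(step, s, a, b)].items())
                    total += weight * (reward + nxt)
            V[step][s] = total
    return V


# Hypothetical two-state game with a single action per player and H = 3.
states, actions_a, actions_b, H = ["x", "y"], ["a"], ["b"], 3
P = {}
for step in range(1, H + 1):
    P[(step, "x", "a", "b")] = {"x": 0.5, "y": 0.5}
    P[(step, "y", "a", "b")] = {"y": 1.0}
pi = {step: {s: {"a": 1.0} for s in states} for step in range(1, H + 1)}
mu = {step: {s: {"b": 1.0} for s in states} for step in range(1, H + 1)}
V = indicator_values(P, pi, mu, states, actions_a, actions_b, H,
                     target=(2, "y", "a", "b"))
```

Here `V[3]` is identically zero because step 3 comes after the target step 2, and `V[1]["x"]` equals the probability of being in state `"y"` at step 2 when starting from `"x"`.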

Additional notation.

In $\textrm{OPTIMISTIC\_VALUE\_ESTIMATE}(1_{hsab},\hat{P}^{k},\pi,\Theta)$ (Algorithm 6), we denote the intermediate value estimates $\bar{V}^{\pi}_{l}$ by $\bar{V}_{l}^{\pi}(\cdot;1_{hsab},\hat{P}^{k},\Theta^{k})$ to emphasize the dependence on the reward function and the transition dynamics being used. We denote by $N^{k}_{h}(s,a)$ the count of $(h,s,a)$ during the $h$-th layer exploration of Algorithm 4.
We write $V^{\pi,\mu}_{h}(s;r,P)$ in place of $V^{\pi,\mu}_{h}(s)$ to emphasize the dependence on the reward function $r$ and the transition dynamics $P$.

We will evaluate the quantity $\Delta_{l}(\bar{s}):=V_{l}^{\pi,f([\pi]^{m})}(\bar{s};1_{hsab},\hat{P}^{k})-\bar{V}_{l}^{\pi}(\bar{s};1_{hsab},\hat{P}^{k},\Theta^{k})$ for any $l\in[h-1]$ and $\bar{s}\in\mathcal{S}$.

The first part, $\bar{V}^{\pi}(1_{hsab},\hat{P}^{k},\Theta^{k})-V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k})\geq 0$, follows from the fact that, with probability at least $1-\delta$, $\theta^{*}_{hsa}\in\Theta^{k}_{hsa}$ for all $(h,s,a)$.
Thus, $\bar{V}^{\pi}(1_{hsab},\hat{P}^{k},\Theta^{k})$ is an optimistic estimate of $V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k})$. For the second part, we have

\begin{align*}
\Delta_{l}(\bar{s}) &=\sup_{\theta\in\Theta^{k}_{l,\bar{s},\pi_{l}(\bar{s})}}\sum_{s^{\prime}\in{\mathcal{S}}}P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),P_{\theta})\bar{V}^{\pi}_{l+1}(s^{\prime};1_{hsab},\hat{P}^{k},\Theta^{k}) \\
&\quad-\sum_{s^{\prime}\in{\mathcal{S}}}P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),f([\pi]^{m})_{l}(\cdot|\bar{s}))V^{\pi,f([\pi]^{m})}_{l+1}(s^{\prime};1_{hsab},\hat{P}^{k}) \\
&=\sum_{s^{\prime}\in{\mathcal{S}}}P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),f([\pi]^{m})_{l}(\cdot|\bar{s}))\Delta_{l+1}(s^{\prime}) \\
&\quad+\sum_{s^{\prime}\in{\mathcal{S}}}\left(P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),P_{\theta})-P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),f([\pi]^{m})_{l}(\cdot|\bar{s}))\right)\bar{V}^{\pi}_{l+1}(s^{\prime};1_{hsab},\hat{P}^{k},\Theta^{k}) \\
&\leq\max\{\Delta_{l+1}(s^{\prime}):s^{\prime}\in{\mathcal{S}}\text{ s.t. }\exists b^{\prime}\in{\mathcal{B}},(l,\bar{s},\pi_{l}(\bar{s}),b^{\prime},s^{\prime})\notin{\mathcal{U}}^{k}\} \\
&\quad+1\{N^{k}_{l}(\bar{s},\pi_{l}(\bar{s}))\geq 1\}\cdot 2\max_{\theta\in\Theta^{k}_{l\bar{s}\pi_{l}(\bar{s})}}d_{TV}(f([\pi]^{m})_{l}(\cdot|\bar{s}),P_{\theta}),
\end{align*}

where we use the convention that $\max\emptyset=0$, and the last inequality follows from three facts: that $P^{k}_{l}(s^{\prime}|\bar{s},\pi_{l}(\bar{s}),b^{\prime})=0$ if $(l,\bar{s},\pi_{l}(\bar{s}),b^{\prime},s^{\prime})\notin{\mathcal{U}}^{k}$, that $\bar{V}^{\pi}_{l+1}(s^{\prime};1_{hsab},\hat{P}^{k},\Theta^{k})\in[0,1]$, and that, for any two distributions $p,q\in[0,1]^{|{\mathcal{X}}|}$ over a finite support ${\mathcal{X}}$, we have $d_{TV}(p,q)=\frac{1}{2}\|p-q\|_{1}$.
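To make the total-variation step explicit, the following is a short worked bound, offered as a sketch rather than as part of the original argument. Here $p$, $q$ and $v$ are generic placeholders introduced only for this illustration: $p,q$ stand for two (possibly sub-stochastic) rows over ${\mathcal{S}}$ and $v:{\mathcal{S}}\to[0,1]$ stands for $\bar{V}^{\pi}_{l+1}$. The second line additionally assumes that a kernel evaluated at an opponent action distribution $\mu$ is the mixture $P^{k}_{l}(\cdot|\bar{s},a,\mu)=\sum_{b}\mu(b)P^{k}_{l}(\cdot|\bar{s},a,b)$.

\begin{align*}
\Big|\sum_{s^{\prime}\in{\mathcal{S}}}\big(p(s^{\prime})-q(s^{\prime})\big)v(s^{\prime})\Big| &\leq\sum_{s^{\prime}\in{\mathcal{S}}}\big|p(s^{\prime})-q(s^{\prime})\big|\cdot\max_{s^{\prime}}|v(s^{\prime})|\leq\|p-q\|_{1}, \\
\big\|P^{k}_{l}(\cdot|\bar{s},a,\mu)-P^{k}_{l}(\cdot|\bar{s},a,\mu^{\prime})\big\|_{1} &\leq\sum_{b\in{\mathcal{B}}}|\mu(b)-\mu^{\prime}(b)|\sum_{s^{\prime}\in{\mathcal{S}}}P^{k}_{l}(s^{\prime}|\bar{s},a,b)\leq\|\mu-\mu^{\prime}\|_{1}=2d_{TV}(\mu,\mu^{\prime}).
\end{align*}

Under that mixture assumption, taking $\mu=P_{\theta}$ and $\mu^{\prime}=f([\pi]^{m})_{l}(\cdot|\bar{s})$ gives the factor $2d_{TV}$ that appears in the recursion.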

If $N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))=0$, then $\Delta_{l}(\bar{s})=0$. Consider the case $N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\geq 1$. This means that the state-action pair $(\bar{s},{\pi}_{l}(\bar{s}))$ must be visited in step $l$ at least once by at least one policy $\pi^{kl\tilde{s}\tilde{a}\tilde{b}}$ for some $(\tilde{s},\tilde{a},\tilde{b})\in{\mathcal{S}}\times{\mathcal{A}}\times{\mathcal{B}}$.
Note that this policy $\pi^{kl\tilde{s}\tilde{a}\tilde{b}}$ is run for $m-1+T_{k}$ consecutive episodes. Thus, by the definition of the minimum positive visitation probability $d^{*}$, we must have

\begin{align*}
{\mathbb{E}}\left[N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\right]\geq d^{*}T_{k},
\end{align*}

where the expectation is w.r.t. the transition kernel $P$ of the original Markov game $M$ and policy $\pi^{kl\tilde{s}\tilde{a}\tilde{b}}$. By Hoeffding's inequality and the union bound, with probability at least $1-\delta$, for all $l,\bar{s},k,\pi$, we have

\begin{align*}
N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\geq{\mathbb{E}}\left[N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\right]-\sqrt{T_{k}\log(SHKA/\delta)}.
\end{align*}

In particular, for $(l,\bar{s},k,\pi)$ such that ${\mathbb{E}}\left[N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\right]\geq d^{*}T_{k}$ and for $T_{k}\geq\frac{2\log(SHKA/\delta)}{{d^{*}}^{2}}$, we have $N^{k}_{l}(\bar{s},{\pi}_{l}(\bar{s}))\geq\frac{1}{2}d^{*}T_{k}$ with probability at least $1-\delta$. Combined with Lemma B.3, with probability at least $1-\delta$, we have

\begin{equation}
\max_{\theta\in\Theta^{k}_{l\bar{s}\pi_{l}(\bar{s})}}d_{TV}(f([\pi]^{m})_{l}(\cdot|\bar{s}),P_{\theta})\leq c\sqrt{\frac{\alpha}{d^{*}T_{k}}}. \tag{2}
\end{equation}

Thus, under the same event that the above inequality holds, we have

\begin{align*}
\Delta_{1}(s_{1})\leq cH\sqrt{\frac{\alpha}{d^{*}T_{k}}}.
\end{align*}
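Two steps of the argument above can be spelled out. What follows is a sketch under stated assumptions, not a replacement for the proof. We write $L:=\log(SHKA/\delta)$ and $\varepsilon_{k}:=c\sqrt{\alpha/(d^{*}T_{k})}$, both shorthands introduced here. For the count lower bound we assume the standard one-sided Hoeffding deviation $\sqrt{T_{k}L/2}$ for a sum of $T_{k}$ independent $[0,1]$-valued indicators. For the unrolling we assume that the gap at every layer up to $h\leq H$ is controlled by the same total-variation term.

\begin{align*}
\sqrt{T_{k}L/2}\leq\tfrac{1}{2}d^{*}T_{k} &\iff T_{k}\geq\frac{2L}{{d^{*}}^{2}}, \quad\text{so that}\quad N^{k}_{l}(\bar{s},\pi_{l}(\bar{s}))\geq d^{*}T_{k}-\sqrt{T_{k}L/2}\geq\tfrac{1}{2}d^{*}T_{k}, \\
\max_{\bar{s}\in{\mathcal{S}}}\Delta_{l}(\bar{s}) &\leq\max_{s^{\prime}\in{\mathcal{S}}}\Delta_{l+1}(s^{\prime})+2\varepsilon_{k} \quad\Longrightarrow\quad \Delta_{1}(s_{1})\leq\sum_{l=1}^{h}2\varepsilon_{k}\leq 2H\varepsilon_{k}.
\end{align*}

With the looser deviation $\sqrt{T_{k}L}$ the same computation requires $T_{k}\geq 4L/{d^{*}}^{2}$, which changes only constants; likewise the factor $2$ in the second line can be absorbed into the absolute constant $c$.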

Next, we will show that the transition samples collected in ${\mathcal{U}}^{k}$ are indeed infrequent transitions under any policy. Let $\tau=(s_{1},a_{1},b_{1},\ldots,s_{H},a_{H},b_{H})$ be a random trajectory generated by the learner's policy $\pi$ and the opponent's policies $f([\pi]^{m})$ for some policy $\pi$.

Definition 6 (Bad events).

Under the original transition kernel $P$, we define ${\mathcal{F}}$ to be the event that there exists $h\in[H]$ such that $(h,s_{h},a_{h},b_{h},s_{h+1})\in{\mathcal{U}}^{k}$, and we define ${\mathcal{F}}_{h}$ to be the event that $h$ is the smallest step such that $(h,s_{h},a_{h},b_{h},s_{h+1})\in{\mathcal{U}}^{k}$. Under the absorbing transition kernel $\tilde{P}^{k}$, we define ${\mathcal{F}}$ to be the event that there exists $h\in[H]$ such that $s_{h+1}=s^{\dagger}$, and we define ${\mathcal{F}}_{h}$ to be the event that $h$ is the smallest step such that $s_{h+1}=s^{\dagger}$.
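Since ${\mathcal{F}}_{h}$ singles out the first step at which a bad transition occurs, the events ${\mathcal{F}}_{1},\ldots,{\mathcal{F}}_{H}$ are pairwise disjoint and their union is ${\mathcal{F}}$. A short worked consequence is sketched below; it holds under either kernel and for any policy $\pi$. The symbol $Q\in\{P,\tilde{P}^{k}\}$ is a placeholder introduced only for this illustration.

\begin{align*}
\Pr[{\mathcal{F}}|Q,\pi]=\sum_{h=1}^{H}\Pr[{\mathcal{F}}_{h}|Q,\pi]\leq H\max_{h\in[H]}\Pr[{\mathcal{F}}_{h}|Q,\pi].
\end{align*}

This is the usual route by which a per-step bound on $\Pr[{\mathcal{F}}_{h}]$ yields a bound on $\Pr[{\mathcal{F}}]$ at the cost of one factor of $H$.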

Lemma B.9.

Conditioned on the event $E$ in Lemma B.5 and the event $E_{k}$ in Lemma B.8, with probability at least $1-\delta$, for any $k\in[K]$, we have

\begin{align*}
\sup_{\pi\in\Pi}\Pr[{\mathcal{F}}|P,\pi]\lesssim\frac{H^{3}\log(HSABK/\delta)}{T_{k}}+H\xi_{MLE}(T_{k}),
\end{align*}

where $\xi_{MLE}(\cdot)$ is defined in Lemma B.8 and ${\mathcal{F}}$ is defined in Definition 6.

Proof of Lemma B.9.

Under the event $E$ in Lemma B.5 and the event $E_{k}$ in Lemma B.8, for any $(h,s,a,b)$, we have

\begin{align}
V^{\pi^{khsab},f([\pi^{khsab}]^{m})}(1_{hsab},\tilde{P}^{k}) &\geq\frac{1}{4}V^{\pi^{khsab},f([\pi^{khsab}]^{m})}(1_{hsab},\hat{P}^{k}) \nonumber\\
&\geq\frac{1}{4}\bar{V}^{\pi^{khsab}}(1_{hsab},\hat{P}^{k},\Theta^{k})-\xi_{MLE}(T_{k}) \nonumber\\
&=\frac{1}{4}\sup_{\pi\in\Pi^{k}}\bar{V}^{\pi}(1_{hsab},\hat{P}^{k},\Theta^{k})-\xi_{MLE}(T_{k}) \nonumber\\
&\geq\frac{1}{4}\sup_{\pi\in\Pi^{k}}V^{\pi,f([\pi]^{m})}(1_{hsab},\hat{P}^{k})-\xi_{MLE}(T_{k}) \nonumber\\
&\geq\frac{1}{12}\sup_{\pi\in\Pi^{k}}V^{\pi,f([\pi]^{m})}(1_{hsab},\tilde{P}^{k})-\xi_{MLE}(T_{k}), \tag{3}
\end{align}

where the first and last inequalities follow from Lemma B.7, the second and third inequalities follow from Lemma B.8, and the equality follows from the definition of $\pi^{khsab}$ in Algorithm 4. Let $\pi^{kh}$ be a policy that chooses each $\pi^{khsab}$ with probability $\frac{1}{SAB}$ for any $(s,a,b)\in{\mathcal{S}}\times{\mathcal{A}}\times{\mathcal{B}}$. Thus, we have

\begin{align*}
\Pr[{\mathcal{F}}_{h}|P,\pi^{kh}] &=\Pr[{\mathcal{F}}_{h}|\tilde{P}^{k},\pi^{kh}] \\
&=\frac{1}{SAB}\sum_{\bar{s},\bar{a},\bar{b}}\sum_{s,a,b}V^{\pi^{kh\bar{s}\bar{a}\bar{b}},f([\pi^{kh\bar{s}\bar{a}\bar{b}}]^{m})}(1_{hsab},\tilde{P}^{k})\tilde{P}_{h}(s^{\dagger}|s,a,b) \\
&\geq\frac{1}{SAB}\sum_{s,a,b}V^{\pi^{khsab},f([\pi^{khsab}]^{m})}(1_{hsab},\tilde{P}^{k})\tilde{P}_{h}(s^{\dagger}|s,a,b) \\
&\geq\frac{1}{12SAB}\sum_{s,a,b}\sup_{\pi\in\Pi^{k}}V^{\pi,f([\pi]^{m})}(1_{hsab},\tilde{P}^{k})-\frac{1}{SAB}\xi_{MLE}(T_{k})
\end{align*}
112SABsupπΠks,a,bVπ,f([π]m)(1hsab,P~k)1SABξMLE(Tk)absent112𝑆𝐴𝐵subscriptsupremum𝜋superscriptΠ𝑘subscript𝑠𝑎𝑏superscript𝑉𝜋𝑓superscriptdelimited-[]𝜋𝑚subscript1𝑠𝑎𝑏superscript~𝑃𝑘1𝑆𝐴𝐵subscript𝜉𝑀𝐿𝐸subscript𝑇𝑘\displaystyle\geq\frac{1}{12SAB}\sup_{\pi\in\Pi^{k}}\sum_{s,a,b}V^{\pi,f([\pi]% ^{m})}(1_{hsab},\tilde{P}^{k})-\frac{1}{SAB}\xi_{MLE}(T_{k})≥ divide start_ARG 1 end_ARG start_ARG 12 italic_S italic_A italic_B end_ARG roman_sup start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_s , italic_a , italic_b end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT italic_π , italic_f ( [ italic_π ] start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT ( 1 start_POSTSUBSCRIPT italic_h italic_s italic_a italic_b end_POSTSUBSCRIPT , over~ start_ARG italic_P end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_S italic_A italic_B end_ARG italic_ξ start_POSTSUBSCRIPT italic_M italic_L italic_E end_POSTSUBSCRIPT ( italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )
=112SABsupπΠkPr[h|P~k,π]1SABξMLE(Tk)absent112𝑆𝐴𝐵subscriptsupremum𝜋superscriptΠ𝑘Prconditionalsubscriptsuperscript~𝑃𝑘𝜋1𝑆𝐴𝐵subscript𝜉𝑀𝐿𝐸subscript𝑇𝑘\displaystyle=\frac{1}{12SAB}\sup_{\pi\in\Pi^{k}}\Pr[{\mathcal{F}}_{h}|\tilde{% P}^{k},\pi]-\frac{1}{SAB}\xi_{MLE}(T_{k})= divide start_ARG 1 end_ARG start_ARG 12 italic_S italic_A italic_B end_ARG roman_sup start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_Pr [ caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT | over~ start_ARG italic_P end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , italic_π ] - divide start_ARG 1 end_ARG start_ARG italic_S italic_A italic_B end_ARG italic_ξ start_POSTSUBSCRIPT italic_M italic_L italic_E end_POSTSUBSCRIPT ( italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )
=112SABsupπΠkPr[h|P,π]1SABξMLE(Tk).absent112𝑆𝐴𝐵subscriptsupremum𝜋superscriptΠ𝑘Prconditionalsubscript𝑃𝜋1𝑆𝐴𝐵subscript𝜉𝑀𝐿𝐸subscript𝑇𝑘\displaystyle=\frac{1}{12SAB}\sup_{\pi\in\Pi^{k}}\Pr[{\mathcal{F}}_{h}|P,\pi]-% \frac{1}{SAB}\xi_{MLE}(T_{k}).= divide start_ARG 1 end_ARG start_ARG 12 italic_S italic_A italic_B end_ARG roman_sup start_POSTSUBSCRIPT italic_π ∈ roman_Π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_Pr [ caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT | italic_P , italic_π ] - divide start_ARG 1 end_ARG start_ARG italic_S italic_A italic_B end_ARG italic_ξ start_POSTSUBSCRIPT italic_M italic_L italic_E end_POSTSUBSCRIPT ( italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (4)

By the construction of $\mathcal{U}^k$, we have that

\[
\Pr[\mathcal{F}_h | P, \pi^{kh}] \leq c\, \frac{H^2 \log(HSABK/\delta)}{SAB\, T_k}.
\]

Thus, combined with Equation 4, we have

\[
\sup_{\pi \in \Pi^k} \Pr[\mathcal{F}_h | P, \pi] \lesssim \frac{H^2 \log(HSABK/\delta)}{T_k} + \xi_{MLE}(T_k).
\]
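
The rearrangement behind this bound is elementary. Written out as a sketch, with $c$ the absolute constant from the preceding display, the lower bound of Equation 4 and the upper bound on $\Pr[\mathcal{F}_h | P, \pi^{kh}]$ give

\begin{align*}
\frac{1}{12SAB} \sup_{\pi \in \Pi^k} \Pr[\mathcal{F}_h | P, \pi]
&\leq \Pr[\mathcal{F}_h | P, \pi^{kh}] + \frac{1}{SAB} \xi_{MLE}(T_k) \\
&\leq c\, \frac{H^2 \log(HSABK/\delta)}{SAB\, T_k} + \frac{1}{SAB} \xi_{MLE}(T_k).
\end{align*}

Multiplying both sides by $12SAB$ yields $\sup_{\pi \in \Pi^k} \Pr[\mathcal{F}_h | P, \pi] \leq 12c\, H^2 \log(HSABK/\delta)/T_k + 12\, \xi_{MLE}(T_k)$, which is the stated bound up to absolute constants.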

Finally, note that

\[
\Pr[\mathcal{F} | P, \pi] = \sum_{h \in [H]} \Pr[\mathcal{F}_h | P, \pi],
\]

which concludes our proof.

B.3.2 Uniform policy evaluation

In this part, we show that the empirical transition kernel $P^k$, constructed from the exploratory data collected by our sampling policies, is a good surrogate for the true transition kernel $P$ when evaluating the value of all policies uniformly.

Lemma B.10.

Conditioned on the event $E$ in Lemma B.5, the event $E_k$ in Lemma B.8, and the high-probability event in Lemma B.9, with probability at least $1-\delta$, for any $k \in [K]$, any reward function $r$, and any policy $\pi \in \Pi$, we have

\[
0 \leq V^{\pi, f([\pi]^m)}(r, P) - V^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \tilde{P}^k) \lesssim \frac{H^4 \log(HSABK/\delta)}{T_k} + H^2 \xi_{MLE}(T_k),
\]

where for any trajectory $\tau = (s_1, a_1, b_1, \ldots, s_H, a_H, b_H)$, $r_{\mathcal{U}^k}(\tau) := \sum_{h=1}^{H} 1\{(h, s_h, a_h, b_h, s_{h+1}) \notin \mathcal{U}^k\}\, r_h(s_h, a_h, b_h)$.

Proof of Lemma B.10.

We have

\begin{align*}
V^{\pi, f([\pi]^m)}(r, P) &= \sum_{\tau} r(\tau) \Pr(\tau | P, \pi) \\
&= \sum_{\tau \notin \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) + \sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) \\
&= \sum_{\tau \notin \mathcal{F}} r(\tau) \Pr(\tau | \tilde{P}^k, \pi) + \sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) \\
&= \sum_{\tau \notin \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | \tilde{P}^k, \pi) + \sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) \\
&\leq V^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \tilde{P}^k) + \sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) \\
&\lesssim V^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \tilde{P}^k) + \frac{H^4 \log(HSABK/\delta)}{T_k} + H^2 \xi_{MLE}(T_k),
\end{align*}

where the last inequality is due to Lemma B.9. Similarly, we have

\begin{align*}
V^{\pi, f([\pi]^m)}(r, P) &= \sum_{\tau \notin \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | \tilde{P}^k, \pi) + \sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi) \\
&\geq \sum_{\tau \notin \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | \tilde{P}^k, \pi) + \sum_{\tau \in \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | P, \pi) \\
&\geq \sum_{\tau \notin \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | \tilde{P}^k, \pi) + \sum_{\tau \in \mathcal{F}} r_{\mathcal{U}^k}(\tau) \Pr(\tau | \tilde{P}^k, \pi) \\
&= V^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \tilde{P}^k).
\end{align*}
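
The step in the upper bound that invokes Lemma B.9 can be unpacked as follows. This is only a sketch: it assumes that the total reward of any trajectory satisfies $0 \leq r(\tau) \leq H$ (for instance, per-step rewards in $[0,1]$, a normalization not restated in this part), and it uses the per-step bound $\Pr[\mathcal{F}_h | P, \pi] \lesssim H^2 \log(HSABK/\delta)/T_k + \xi_{MLE}(T_k)$ together with the decomposition of $\Pr[\mathcal{F} | P, \pi]$ over $h \in [H]$ from the previous part:

\begin{align*}
\sum_{\tau \in \mathcal{F}} r(\tau) \Pr(\tau | P, \pi)
&\leq H \Pr[\mathcal{F} | P, \pi] = H \sum_{h \in [H]} \Pr[\mathcal{F}_h | P, \pi] \\
&\lesssim H \cdot H \left( \frac{H^2 \log(HSABK/\delta)}{T_k} + \xi_{MLE}(T_k) \right) \\
&= \frac{H^4 \log(HSABK/\delta)}{T_k} + H^2 \xi_{MLE}(T_k).
\end{align*}

One factor of $H$ comes from the reward bound and the other from summing over the $H$ steps, which accounts for the $H^4$ and $H^2$ dependence in the statement of Lemma B.10.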

Lemma B.11.

With probability at least $1-\delta$, for any $k \in [K]$, any reward function $r$, and any policy $\pi \in \Pi$, we have

\[
0 \leq \bar{V}^{\pi}(r_{\mathcal{U}^k}, \hat{P}^k, \Theta^k) - V^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \hat{P}^k) \lesssim H^2 \sqrt{\frac{\alpha}{d^* T_k}}.
\]
Proof of Lemma B.11.

The first inequality is trivial, following the first part of Lemma B.8. We will focus on the second inequality. Fix any deterministic policy $\pi$. For simplicity, we write $\bar{V}_h^{\pi}(s) := \bar{V}_h^{\pi}(r_{\mathcal{U}^k}, \hat{P}^k, \Theta^k)(s)$ and $V_h^{\pi}(s) := V_h^{\pi, f([\pi]^m)}(r_{\mathcal{U}^k}, \hat{P}^k)(s)$.
Let $\Delta_h^k(s) := \bar{V}_h^{\pi}(s) - V_h^{\pi}(s)$.

First, by construction of $\hat{P}^k$ and $r_{\mathcal{U}^k}$, we have $\Delta_h^k(s) = 0$ if $s = s^\dagger$ or if $N_h^k(s, \pi_h(s)) = 0$. This is precisely the reason we design the truncated reward function $r_{\mathcal{U}^k}$.

We now consider $s \in \mathcal{S}$ such that $N_h^k(s, \pi_h(s)) > 0$. This condition, along with the consistent behavior and the minimum visitation probability, allows us to estimate the response $f([\pi]^m)_h(\cdot | s)$ accurately. In particular, $f([\pi]^m)_h(\cdot | s)$ can be estimated using only the data obtained by visiting $(h, s, \pi_h(s))$, which is visited at least $d^* T_k$ times, and hence up to an error of order $1/\sqrt{d^* T_k}$. We have

\begin{align*}
\Delta_h^k(s) &= r_{\mathcal{U}^k, h}(s, \pi_h(s), P_\theta) + \hat{P}_h^k \bar{V}_{h+1}^{\pi}(s, \pi_h(s), P_\theta) \\
&\quad - r_{\mathcal{U}^k, h}(s, \pi_h(s), f([\pi]^m)_h(\cdot | s)) - \hat{P}_h^k V_{h+1}^{\pi}(s, \pi_h(s), f([\pi]^m)_h(\cdot | s)) \\
&= r_{\mathcal{U}^k, h}(s, \pi_h(s), P_\theta) - r_{\mathcal{U}^k, h}(s, \pi_h(s), f([\pi]^m)_h(\cdot | s)) \\
&\quad + \hat{P}_h^k (\bar{V}_{h+1}^{\pi} - V_{h+1}^{\pi})(s, \pi_h(s), f([\pi]^m)_h(\cdot | s)) \\
&\quad + \hat{P}_h^k \bar{V}_{h+1}^{\pi}(s, \pi_h(s), P_\theta) - \hat{P}_h^k \bar{V}_{h+1}^{\pi}(s, \pi_h(s), f([\pi]^m)_h(\cdot | s)) \\
&\leq \sup_{\theta \in \Theta^k_{hs\pi_h(s)}} d_{TV}(P_\theta, f([\pi]^m)_h(\cdot | s)) \\
&\quad + \max\{\Delta_{h+1}^k(s') : s' \in \mathcal{S} \text{ s.t. } \exists b \in \mathcal{B},\ (h, s, \pi_h(s), b, s') \notin \mathcal{U}^k\} \\
&\quad + H \sup_{\theta \in \Theta^k_{hs\pi_h(s)}} d_{TV}(P_\theta, f([\pi]^m)_h(\cdot | s)) \\
&= (H+1) \sup_{\theta \in \Theta^k_{hs\pi_h(s)}} d_{TV}(P_\theta, f([\pi]^m)_h(\cdot | s)) \\
&\quad + \max\{\Delta_{h+1}^k(s') : s' \in \mathcal{S} \text{ s.t. } \exists b \in \mathcal{B},\ (h, s, \pi_h(s), b, s') \notin \mathcal{U}^k\}.
\end{align*}

Note that, as in Equation 2, since $N^{k}_{h}(s,\pi_{h}(s))>0$, with probability at least $1-\delta$, we have

\[
\sup_{\theta\in\Theta^{k}_{hs\pi_{h}(s)}}d_{TV}(P_{\theta},f([\pi]^{m})_{h}(\cdot|s))\lesssim\sqrt{\frac{\alpha}{d^{*}T_{k}}}.
\]

Thus, we have

\[
\Delta_{1}^{k}(s_{1})\lesssim H^{2}\sqrt{\frac{\alpha}{d^{*}T_{k}}}.
\]
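
To see where the $H^{2}$ factor comes from, the recursion can be unrolled over the $H$ steps. The following is only a sketch under two assumptions not restated in this passage: that the preceding display bounds $\Delta_{h}^{k}(s)$ at every step $h$, and that $\Delta_{H+1}^{k}\equiv 0$. The shorthand $\varepsilon_{k}$ and the absolute constant $C$ (standing for the constant hidden in $\lesssim$) are introduced purely for this illustration, and the maximum ranges over the same set of successor states as in the display above.

\begin{align*}
\varepsilon_{k}&:=\sqrt{\frac{\alpha}{d^{*}T_{k}}},\\
\Delta_{h}^{k}(s)&\leq(H+1)\,C\varepsilon_{k}+\max_{s^{\prime}}\Delta_{h+1}^{k}(s^{\prime}),\\
\Delta_{1}^{k}(s_{1})&\leq\sum_{h=1}^{H}(H+1)\,C\varepsilon_{k}+0=H(H+1)\,C\varepsilon_{k}\leq 2CH^{2}\varepsilon_{k}.
\end{align*}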

Lemma B.12.

Conditioned on the event $E$ in Lemma B.5 and the event $E_{k}$ in Lemma B.8, with probability at least $1-\delta$, for any $k\in[K]$, any $\pi\in\Pi$, and any reward function $r^{\prime}$, we have

\[
|V^{\pi,f([\pi]^{m})}(r^{\prime},\hat{P}^{k})-V^{\pi,f([\pi]^{m})}(r^{\prime},\tilde{P}^{k})|\lesssim HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}+HSAB\cdot\xi_{MLE}(T_{k}).
\]
Proof of Lemma B.12.

By the simulation lemma [Dann et al., 2017], we have

\[
|V^{\pi,f([\pi]^{m})}(r^{\prime},\hat{P}^{k})-V^{\pi,f([\pi]^{m})}(r^{\prime},\tilde{P}^{k})|\leq{\mathbb{E}}_{\tilde{P}^{k},\pi}\sum_{h=1}^{H}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}|,
\]

where $\hat{V}_{h+1}^{\pi}:=V^{\pi,f([\pi]^{m})}(r^{\prime},\hat{P}^{k})$. Define the sampling distribution $\nu_{h}\in\Delta({\mathcal{S}}\times{\mathcal{A}}\times{\mathcal{B}})$, $h\in[H]$, by

\[
\nu_{h}(s,a,b):=\frac{1}{SAB}\sum_{\bar{s},\bar{a},\bar{b}}V^{\pi^{kh\bar{s}\bar{a}\bar{b}},f([\pi^{kh\bar{s}\bar{a}\bar{b}}]^{m})}(1_{hsab},\tilde{P}^{k}).
\]

Then, we have

\begin{align*}
&{\mathbb{E}}_{\tilde{P}^{k},\pi}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}|=\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|\cdot V^{\pi,f([\pi]^{m})}(1_{hsab},\tilde{P}^{k})\\
&=\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|\cdot V^{\pi,f([\pi]^{m})}(1_{hsab},\tilde{P}^{k})1\{\pi_{h}(s)=a\}\\
&\leq 12\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|1\{\pi_{h}(s)=a\}\cdot V^{\pi^{khsab},f([\pi^{khsab}]^{m})}(1_{hsab},\tilde{P}^{k})\\
&\quad+12HSAB\cdot\xi_{MLE}(T_{k})\\
&\leq 12\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|1\{\pi_{h}(s)=a\}\cdot\sum_{\bar{s},\bar{a},\bar{b}}V^{\pi^{kh\bar{s}\bar{a}\bar{b}},f([\pi^{kh\bar{s}\bar{a}\bar{b}}]^{m})}(1_{hsab},\tilde{P}^{k})\\
&\quad+12HSAB\cdot\xi_{MLE}(T_{k})\\
&=12SAB\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|1\{\pi_{h}(s)=a\}\nu_{h}(s,a,b)+12HSAB\cdot\xi_{MLE}(T_{k})\\
&\leq 12SAB\sqrt{\sum_{s,a,b}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|^{2}\nu_{h}(s,a,b)1\{a=\pi_{h}(s)\}}+12HSAB\cdot\xi_{MLE}(T_{k})\\
&\leq 12SAB\sqrt{\sup_{V:{\mathcal{S}}\cup s^{\dagger}\rightarrow[0,H]}\sup_{g:{\mathcal{S}}\cup s^{\dagger}\rightarrow{\mathcal{A}}}{\mathbb{E}}_{\nu_{h}}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})V(s,a,b)|^{2}1\{g(s)=a\}}\\
&\quad+12HSAB\cdot\xi_{MLE}(T_{k})\\
&\lesssim HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}+HSAB\cdot\xi_{MLE}(T_{k}),
\end{align*}

where the second equality holds because $\pi$ is deterministic, the first inequality follows from Equation 3, the second inequality adds nonnegative terms to the sum, the third equality uses the definition of $\nu_{h}$, the third inequality follows from Jensen's inequality, and the last inequality follows from Lemma B.13. ∎
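
The Jensen step and the final substitution in the proof above can be spelled out as follows. This is a sketch, assuming $\nu_{h}$ is a probability distribution over $(s,a,b)$ as in its definition; the shorthand $x_{sab}$ and the absolute constant $C$ (standing for the constant hidden in the bound of Lemma B.13) are introduced only for this illustration.

\begin{align*}
x_{sab}&:=|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})\hat{V}_{h+1}^{\pi}(s,a,b)|\,1\{\pi_{h}(s)=a\}\geq 0,\\
\sum_{s,a,b}\nu_{h}(s,a,b)\,x_{sab}&\leq\sqrt{\sum_{s,a,b}\nu_{h}(s,a,b)\,x_{sab}^{2}}\qquad\text{(concavity of }\sqrt{\cdot}\text{)},\\
12SAB\sqrt{\frac{CH^{2}S\log(HAT/\delta)}{T_{k}}}&=12\sqrt{C}\,HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}.
\end{align*}

The indicator survives the squaring because $1\{\cdot\}^{2}=1\{\cdot\}$.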

Lemma B.13 ([Jin et al., 2020, Lemma C.2]).

With probability at least $1-\delta$, for all $h\in[H]$ and $k\in[K]$, we have

\[
\sup_{V:{\mathcal{S}}\cup s^{\dagger}\rightarrow[0,H]}\sup_{g:{\mathcal{S}}\cup s^{\dagger}\rightarrow{\mathcal{A}}}{\mathbb{E}}_{\nu_{h}}|(\hat{P}^{k}_{h}-\tilde{P}^{k}_{h})V(s,a,b)|^{2}1\{g(s)=a\}\lesssim\frac{H^{2}S\log(HAT/\delta)}{T_{k}}.
\]
Proof of Lemma B.13.

Note that $\hat{P}^{k}$ is the empirical transition kernel constructed from $T_{k}$ samples drawn according to the data distribution $\nu$ under the transition kernel $\tilde{P}^{k}$. Thus, Lemma B.13 is a direct application of [Jin et al., 2020, Lemma C.2]. ∎

Finally, we show that every policy in $\Pi^{k+1}$ is near-optimal.

Lemma B.14.

Recall the version space $\Pi^{k+1}$ defined in line 6 of Algorithm 3. With probability at least $1-\delta$, for any $k\in[K]$, any $\pi\in\Pi^{k+1}$, and any reward function $r$, we have

\begin{align*}
\sup_{\pi\in\Pi}V^{\pi,f([\pi]^{m})}(r,P)-V^{\pi,f([\pi]^{m})}(r,P)&={\mathcal{O}}\bigg(H^{2}(SAB+H)\sqrt{\frac{\alpha}{d^{*}T_{k}}}+\frac{H^{4}\log(HSABK/\delta)}{T_{k}}\\
&\qquad+HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}\bigg).
\end{align*}
Proof of Lemma B.14.

Consider any $\pi\in\Pi^{k+1}$. We have

\begin{align*}
V^{\pi,f([\pi]^{m})}(r,P)&\geq V^{\pi,f([\pi]^{m})}(r_{{\mathcal{U}}^{k}},\tilde{P}^{k})\\
&\geq V^{\pi,f([\pi]^{m})}(r_{{\mathcal{U}}^{k}},\hat{P}^{k})-HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}-H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}\\
&\geq\bar{V}^{\pi}(r_{{\mathcal{U}}^{k}},\hat{P}^{k},\Theta^{k})-H^{2}\sqrt{\frac{\alpha}{d^{*}T_{k}}}-HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}-H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}\\
&\geq\sup_{\pi\in\Pi}\bar{V}^{\pi}(r_{{\mathcal{U}}^{k}},\hat{P}^{k},\Theta^{k})-{\mathcal{O}}\left(H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}+HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}\right)\\
&\geq\sup_{\pi\in\Pi}V^{\pi,f([\pi]^{m})}(r_{{\mathcal{U}}^{k}},\hat{P}^{k})-{\mathcal{O}}\left(H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}+HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}\right)\\
&\geq\sup_{\pi\in\Pi}V^{\pi,f([\pi]^{m})}(r_{{\mathcal{U}}^{k}},\tilde{P}^{k})-{\mathcal{O}}\left(H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}+HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}\right)\\
&\geq\sup_{\pi\in\Pi}V^{\pi,f([\pi]^{m})}(r,P)\\
&\quad-{\mathcal{O}}\left(H^{2}SAB\sqrt{\frac{\alpha}{d^{*}T_{k}}}+HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k}}}+\frac{H^{4}\log(HSABK/\delta)}{T_{k}}+H^{3}\sqrt{\frac{\alpha}{d^{*}T_{k}}}\right)
\end{align*}

where the first inequality follows from the first part of Lemma B.10, the second inequality follows from Lemma B.12, the third inequality follows from Lemma B.11, the fourth inequality follows from the definition of $\Pi^{k+1}$, the fifth inequality follows from Lemma B.11, the sixth inequality follows from Lemma B.12, and the last inequality follows from the second part of Lemma B.10. ∎

Proof of Theorem 5.

Note that $K=\min\{j:\sum_{k=1}^{j}T_{k}\geq\bar{T}\}=\mathcal{O}(\log\log\bar{T})$. Moreover, Algorithm 3 runs for $\sum_{k=1}^{K}HSAB(m-1+T_{k})=T$ episodes, by the choice of $T_{k}=\bar{T}^{1-\frac{1}{2^{k}}}$, where $\bar{T}:=\min\{t\in\mathbb{N}:(m-1)\log\log t+t\geq\frac{T}{HSAB}\}$. By Lemma B.14, with probability at least $1-\delta$, we have

\begin{align*}
\mathrm{PR}(T) &\lesssim (m-1+T_{1})H^{2}SAB \\
&\quad +\sum_{k=2}^{K}\bigg((m-1)H^{2}SAB+HSAB\cdot T_{k}\bigg(H^{2}(SAB+H)\sqrt{\frac{\alpha}{d^{*}T_{k-1}}}+\frac{H^{4}\log(HSABK/\delta)}{T_{k-1}} \\
&\quad +HS^{3/2}AB\sqrt{\frac{\log(HAT/\delta)}{T_{k-1}}}\bigg)\bigg) \\
&\lesssim (m-1)H^{2}SABK+H^{2}SAB\sqrt{\bar{T}}+KH^{3}SAB(SAB+H)\sqrt{\frac{\alpha\bar{T}}{d^{*}}} \\
&\quad +H^{5}SAB\log(HSABK/\delta)\sum_{k=2}^{K}\bar{T}^{\frac{1}{2^{k}}}+KH^{2}S^{5/2}A^{2}B^{2}\sqrt{\bar{T}\log(HAT/\delta)} \\
&\lesssim (m-1)H^{2}SABK+H^{3/2}\sqrt{SABT}+KH^{5/2}\sqrt{SAB}(SAB+H)\sqrt{\frac{\alpha T}{d^{*}}} \\
&\quad +KH^{19/4}(SAB)^{3/4}\log(HSABK/\delta)T^{1/4}+K(HAB)^{3/2}S^{2}\sqrt{T\log(HAT/\delta)} \\
&\qquad \text{(because $\bar{T}\leq\frac{T}{HSAB}$)}.
\end{align*}
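
The passage from the first bound to the second rests only on the exponents of the schedule $T_{k}=\bar{T}^{1-2^{-k}}$. As a worked check (a sketch that ignores the rounding of $T_{k}$ to integers, which is an assumption made here for readability), for every $k\geq 2$:

\begin{align*}
\frac{T_{k}}{\sqrt{T_{k-1}}} &= \bar{T}^{\,(1-2^{-k})-\frac{1}{2}\left(1-2^{-(k-1)}\right)} = \bar{T}^{1/2}, \\
\frac{T_{k}}{T_{k-1}} &= \bar{T}^{\,2^{-(k-1)}-2^{-k}} = \bar{T}^{\,2^{-k}} \leq \bar{T}^{1/4}.
\end{align*}

The first identity explains why each of the $K-1$ later epochs contributes a $\sqrt{\bar{T}}$ factor to the square-root terms, and the second explains the sum $\sum_{k=2}^{K}\bar{T}^{1/2^{k}}\leq K\bar{T}^{1/4}$ that produces the $T^{1/4}$ term. The same exponents are consistent with $K=\mathcal{O}(\log\log\bar{T})$: since $T_{k}=\bar{T}\cdot 2^{-2^{-k}\log_{2}\bar{T}}$, we have $T_{k}\geq\bar{T}/2$ as soon as $k\geq\log_{2}\log_{2}\bar{T}$, so two such epochs already sum to at least $\bar{T}$.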

Note that the third term always dominates the second term. We can further simplify the bound in the last inequality above by making either the third term or the last term dominate the fourth term, which is implied by

\begin{align*}
T\gtrsim\min\left\{\frac{H^{5}SAB(d^{*})^{2}\log^{4}(HSABK/\delta)}{\alpha^{2}},\ \frac{H^{9}(d^{*})^{2}\log^{4}(HSABK/\delta)}{(SAB)^{3}\alpha^{2}},\ \frac{H^{13}\log^{2}(HSABK/\delta)}{(AB)^{3}S^{5}}\right\}.
\end{align*}
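
One way to see where these thresholds come from is to compare the terms directly; the following is a sketch that suppresses absolute constants. The third term dominates the fourth when

\begin{align*}
KH^{5/2}\sqrt{SAB}\,(SAB+H)\sqrt{\frac{\alpha T}{d^{*}}} \geq KH^{19/4}(SAB)^{3/4}\log(HSABK/\delta)\,T^{1/4}
\quad\Longleftrightarrow\quad
T \geq \frac{H^{9}\,SAB\,(d^{*})^{2}\log^{4}(HSABK/\delta)}{\alpha^{2}(SAB+H)^{4}}.
\end{align*}

Since $SAB+H\geq\max\{H,SAB\}$, it suffices that $T$ exceeds the right-hand side with $(SAB+H)^{4}$ replaced by $H^{4}$ or by $(SAB)^{4}$, which gives the first and second entries of the minimum, respectively. Likewise, the last term dominates the fourth when

\begin{align*}
K(HAB)^{3/2}S^{2}\sqrt{T\log(HAT/\delta)} \geq KH^{19/4}(SAB)^{3/4}\log(HSABK/\delta)\,T^{1/4}
\quad\Longleftrightarrow\quad
T \geq \frac{H^{13}\log^{4}(HSABK/\delta)}{(AB)^{3}S^{5}\log^{2}(HAT/\delta)},
\end{align*}

which matches the third entry if the two logarithmic factors are treated as comparable (an assumption of this sketch).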

Also notice the condition $T_{k}\geq\frac{2\log(SHKA/\delta)}{(d^{*})^{2}},\ \forall k\in[K]$ in Lemma B.8 translates into:

\begin{align*}
T\gtrsim\frac{HSAB\log^{2}(SHKA/\delta)}{(d^{*})^{4}}.
\end{align*}
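
To make the translation explicit: $T_{k}=\bar{T}^{1-2^{-k}}$ is increasing in $k$, so the binding case is $k=1$, where $T_{1}=\bar{T}^{1/2}$. A short sketch, assuming that $\bar{T}$ is of order $\frac{T}{HSAB}$ (that is, the $(m-1)\log\log t$ term in the definition of $\bar{T}$ is of lower order):

\begin{align*}
\sqrt{\bar{T}}\geq\frac{2\log(SHKA/\delta)}{(d^{*})^{2}}
\quad\Longleftrightarrow\quad
\bar{T}\geq\frac{4\log^{2}(SHKA/\delta)}{(d^{*})^{4}},
\qquad\text{which holds once}\qquad
T\gtrsim\frac{HSAB\log^{2}(SHKA/\delta)}{(d^{*})^{4}}.
\end{align*}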

Under these conditions on $T$, the bound becomes:

\begin{align*}
(m-1)H^{2}SABK+KH^{3/2}\sqrt{SAB}\left(HSAB+H^{2}+S^{3/2}AB\right)\sqrt{\frac{T\alpha}{d^{*}}}.
\end{align*}