add mrp and mdp
jzsherlock4869 committed Oct 12, 2020
1 parent a75fd4d commit ddec9f5
Showing 1 changed file with 29 additions and 0 deletions: README.md
In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a limited set of resources must be allocated among competing choices so as to maximize the expected gain.
In the MAB problem, the agent uses the rewards received from earlier actions to estimate the value of each arm, and tries to maximize the expected gain of each action.
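
To make this concrete, here is a minimal epsilon-greedy sketch of that estimate-and-exploit loop (the arm count, reward noise, epsilon, and step count below are assumptions for illustration, not code from this repository):

```python
import numpy as np

K = 5                              # number of arms (assumed)
true_means = np.random.randn(K)    # hidden expected reward of each arm (assumed)
value_est = np.zeros(K)            # running estimate of each arm's value
pull_count = np.zeros(K)
epsilon = 0.1                      # exploration probability (assumed)

for t in range(1000):
    # explore with probability epsilon, otherwise exploit the current best estimate
    if np.random.rand() < epsilon:
        arm = np.random.randint(K)
    else:
        arm = int(np.argmax(value_est))
    reward = true_means[arm] + np.random.randn()          # noisy observed reward
    pull_count[arm] += 1
    # incremental mean update of the value estimate for the pulled arm
    value_est[arm] += (reward - value_est[arm]) / pull_count[arm]
```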


## Markov Reward Process (MRP) and Markov Decision Process (MDP)


Both MRP and MDP obey the Markov property, i.e. "the future is independent of the past given the present state".

### Markov Reward Process (MRP): the state transition is independent of our actions

A Markov reward model or Markov reward process is a stochastic process which extends either a Markov chain or a continuous-time Markov chain by adding a reward rate to each state (from [wiki](https://en.wikipedia.org/wiki/Markov_reward_model)).

In an MRP, each state yields a reward, and the transition to the next state depends only on the current state.

The transition probability is:

<img src="https://latex.codecogs.com/gif.latex?P(s_{t+1} = S_j | s_t = S_i) ">

The reward of each state is defined as a function of the state alone (one parameter):

<img src="https://latex.codecogs.com/gif.latex?R(s_t = S_i) = E[r_t | s_t = S_i]">
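
A tiny Python sketch of such an MRP (the transition matrix, reward vector, and trajectory length are made-up illustrative values, not taken from this repository):

```python
import numpy as np

# P[i, j] = P(s_{t+1} = S_j | s_t = S_i), R[i] = reward of state S_i
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.0, 0.9]])   # each row sums to 1
R = np.array([1.0, 0.5, 0.0])

def sample_trajectory(start_state, steps, rng=np.random.default_rng()):
    """Roll the MRP forward: the next state depends only on the current one."""
    s, rewards = start_state, []
    for _ in range(steps):
        rewards.append(R[s])
        s = rng.choice(len(R), p=P[s])   # Markovian transition, no action involved
    return rewards

print(sample_trajectory(0, 10))
```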

### Markov Decision Process (MDP): the state transition is controlled by the current state and action

A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker (from [wiki](https://en.wikipedia.org/wiki/Markov_decision_process)).

The transition probability from state S_i to S_j under action A_k is defined as follows:

<img src="https://latex.codecogs.com/gif.latex?P(s_{t+1} = S_j | s_t = S_i, a_t = A_k) ">

The reward function of an MDP takes two parameters, the current state and the action:

<img src="https://latex.codecogs.com/gif.latex?R(s_t = S_i, a_t = A_k) = E[r_{t+1} | s_t = S_i, a_t = A_k]">
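
A matching Python sketch of an MDP, where the transition probabilities and rewards are indexed by both state and action (all numbers are illustrative assumptions):

```python
import numpy as np

# P[i, k, j] = P(s_{t+1} = S_j | s_t = S_i, a_t = A_k)
# R[i, k]    = E[r_{t+1} | s_t = S_i, a_t = A_k]
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],    # transitions from S_0 under A_0 and A_1
    [[0.5, 0.5], [0.0, 1.0]],    # transitions from S_1 under A_0 and A_1
])                                # each P[i, k] row sums to 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])        # reward for taking A_k in S_i

def step(state, action, rng=np.random.default_rng()):
    """Unlike an MRP, the next state and reward also depend on the chosen action."""
    next_state = rng.choice(P.shape[2], p=P[state, action])
    return next_state, R[state, action]

s = 0
for a in [0, 1, 1]:               # an arbitrary action sequence (assumed)
    s, r = step(s, a)
    print(f"action={a}, reward={r}, next_state={s}")
```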
