# Markov Reward Process (MRP) and Markov Decision Process (MDP)

Both MRP and MDP obey the Markov property, i.e. "the future is independent of the past given the present state".
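Formally, for a state sequence $s_0, s_1, \ldots, s_t$, this property can be written as:

$$ P(s_{t+1} | s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} | s_t) $$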

## Markov Reward Process (MRP): state transitions are independent of our actions

A Markov reward model or Markov reward process is a stochastic process which extends either a Markov chain or a continuous-time Markov chain by adding a reward rate to each state (from [wiki](https://en.wikipedia.org/wiki/Markov_reward_model)).

In an MRP, each state returns a reward, and the state transition depends only on the current state.

The transition probability is:

$$ P(s_{t+1} = S_j | s_t = S_i) $$

The reward of each state is defined as a one-parameter function:

$$ R(s_t = S_i) = E[r_t | s_t = S_i] $$
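To make these two objects concrete, here is a minimal NumPy sketch of an MRP with three states. It is only an illustration: the state space, probabilities, and rewards below are assumptions for the example, not values from the text.

```python
# A minimal MRP sketch (illustrative; all numbers are made-up assumptions).
# P[i, j] = P(s_{t+1} = S_j | s_t = S_i),  R[i] = E[r_t | s_t = S_i]
import numpy as np

P = np.array([
    [0.7, 0.3, 0.0],   # transitions out of S_0
    [0.2, 0.5, 0.3],   # transitions out of S_1
    [0.0, 0.4, 0.6],   # transitions out of S_2
])
R = np.array([1.0, 0.0, -1.0])  # expected reward received in each state

rng = np.random.default_rng(seed=0)

def sample_trajectory(start, steps):
    """Roll the chain forward; the next state depends only on the current state."""
    s = start
    states, rewards = [s], [float(R[s])]
    for _ in range(steps):
        s = int(rng.choice(len(R), p=P[s]))   # Markov transition
        states.append(s)
        rewards.append(float(R[s]))
    return states, rewards

print(sample_trajectory(start=0, steps=5))
```

Note that no action appears anywhere in the sketch: the chain evolves on its own, which is exactly what distinguishes an MRP from an MDP.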

## Markov Decision Process (MDP): state transitions are controlled by the current state and action

A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker (from [wiki](https://en.wikipedia.org/wiki/Markov_decision_process)).

The transition probability from state $S_i$ to $S_j$ under action $A_k$ is defined as:

$$ P(s_{t+1} = S_j | s_t = S_i, a_t = A_k) $$

The reward function of an MDP has two parameters, the current state and the chosen action:

$$ R(s_t = S_i, a_t = A_k) = E[r_{t+1} | s_t = S_i, a_t = A_k] $$
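For contrast with the MRP sketch above, the following NumPy snippet indexes both the transition distribution and the reward by the current state and the chosen action. Again, the numbers of states and actions, the probabilities, and the rewards are illustrative assumptions only.

```python
# A minimal MDP sketch (illustrative; all numbers are made-up assumptions).
# P[k, i, j] = P(s_{t+1} = S_j | s_t = S_i, a_t = A_k)
# R[i, k]    = E[r_{t+1} | s_t = S_i, a_t = A_k]
import numpy as np

n_states, n_actions = 3, 2
P = np.array([
    # action A_0
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    # action A_1
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])
R = np.array([
    [0.0, 1.0],   # E[r | S_0, A_0], E[r | S_0, A_1]
    [0.0, 2.0],
    [1.0, 0.0],
])

rng = np.random.default_rng(seed=0)

def step(s, a):
    """Unlike an MRP, the next state and reward depend on both the state and the action."""
    s_next = int(rng.choice(n_states, p=P[a, s]))
    return s_next, float(R[s, a])

print(step(s=0, a=1))
```

Fixing a policy (a rule mapping each state to an action) collapses this MDP back into an MRP, since the action is then fully determined by the state.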