an Option (MDP) represents a high-level collection of actions. Big Picture: abstract away your big policy into n small policies, and value-iterate over the expected values of the big policies.

Markov Option

A Markov Option is given by a triple (I, \pi, \beta):

I \subseteq S, the states from which the option may be started
\pi : S \times A \to [0, 1], the policy followed while the option executes
\beta(s), the probability of the option terminating at state s

(a code sketch of this triple appears below, after the SMDP section)

one-step options

You can define one-step options, one per primitive action a, which take the underlying action with probability 1 and terminate immediately after a single step:

I = \{s : a \in A_{s}\}
\pi(s, a) = 1
\beta(s) = 1

option value function

\begin{equation}
Q^{\mu}(s, o) = \mathbb{E}\left[r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots\right]
\end{equation}

where \mu is the policy over options (the option-selection process).

semi-markov decision process

a semi-markov decision process is a system over a set of options, with time being a factor in option transitions, but the underlying intra-option policies still being MDP policies. After an option o started in s runs to termination, the SMDP Q-learning backup is

\begin{equation}
Q_{k+1}(s, o) = (1 - \alpha_{k}) Q_{k}(s, o) + \alpha_{k} \left(r + \gamma^{\tau} \max_{o'} Q_{k}(s', o')\right)
\end{equation}

where \tau is the time elapsed while the option ran, r is the discounted reward accumulated over those \tau steps, and s' is the state in which the option terminated.
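To make the triple concrete, here is a minimal Python sketch of a Markov option, plus a constructor for the one-step options above. The names (MarkovOption, primitive_option) are illustrative rather than from any library, and \pi is simplified to a deterministic map s \to a instead of a full distribution over S \times A.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

# eq=False keeps identity-based hashing, so options can key a Q-table dict.
@dataclass(eq=False)
class MarkovOption:
    initiation_set: Set[State]                  # I ⊆ S: states where the option may start
    policy: Callable[[State], Action]           # π, simplified to a deterministic map s -> a
    termination_prob: Callable[[State], float]  # β(s): probability of terminating in s

def primitive_option(a: Action, states_with_a: Set[State]) -> MarkovOption:
    """One-step option for primitive action a:
    I = {s : a ∈ A_s}, π always picks a, β(s) = 1 (terminate after one step)."""
    return MarkovOption(
        initiation_set=states_with_a,
        policy=lambda s: a,
        termination_prob=lambda s: 1.0,
    )
```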
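A sketch of the SMDP backup above, assuming a tabular Q stored as a defaultdict keyed by (state, option) pairs and, for simplicity, that every option can be started from every state; the function name and signature are mine.

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, tau, s_next, options, alpha, gamma):
    """One SMDP Q-learning backup: option o was started in s, ran tau steps,
    collected discounted reward r = r_1 + gamma*r_2 + ... + gamma**(tau-1)*r_tau,
    and terminated in s_next."""
    # Bootstrap with the best option from the termination state.
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] = (1 - alpha) * Q[(s, o)] + alpha * (r + gamma ** tau * best_next)

# usage: Q = defaultdict(float), then call smdp_q_update after each option ends
```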
Because option-level termination induces jumps between large-scale states, one backup can propagate value information to many states at once.

intra option q-learning

Intra-option methods go further and update from every primitive transition, for all options consistent with the action taken:

\begin{equation}
Q_{k+1}(s_{t}, o) = (1 - \alpha_{k}) Q_{k}(s_{t}, o) + \alpha_{k} \left(r_{t+1} + \gamma U_{k}(s_{t+1}, o)\right)
\end{equation}

where

\begin{equation}
U_{k}(s, o) = (1 - \beta(s)) Q_{k}(s, o) + \beta(s) \max_{o'} Q_{k}(s, o')
\end{equation}

is the value of arriving in state s while executing o: with probability 1 - \beta(s) the option continues, and with probability \beta(s) it terminates and the best available option is chosen.
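The corresponding update in code, reusing the MarkovOption sketch from earlier (tabular Q as a defaultdict again, all names mine). Note how a single primitive transition updates every option whose policy agrees with the action actually taken; that is why one experience can improve many options at once.

```python
def intra_option_q_update(Q, s, a, r, s_next, options, alpha, gamma):
    """One intra-option backup from a single primitive transition (s, a, r, s_next).
    Q maps (state, option) -> value; options are MarkovOption instances."""
    # Best option actually startable in s_next, for the termination branch of U.
    best_next = max(Q[(s_next, o2)] for o2 in options
                    if s_next in o2.initiation_set)
    for o in options:
        # Update only options that could be running here and would have chosen a.
        if s not in o.initiation_set or o.policy(s) != a:
            continue
        beta = o.termination_prob(s_next)
        # U(s', o) = (1 - beta(s')) * Q(s', o) + beta(s') * max_o' Q(s', o')
        u = (1 - beta) * Q[(s_next, o)] + beta * best_next
        Q[(s, o)] = (1 - alpha) * Q[(s, o)] + alpha * (r + gamma * u)
```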