Two Abstractions
- “temporal abstractions”: making decisions without explicit consideration of time, i.e. abstracting away how long a subtask takes (MDP)
- “state abstractions”: making decisions about groups of states at once

Graph
MaxQ formulates a policy as a graph, which represents a set of n sub-policies.

Max Node
This is a “policy node”, connected to a series of Q nodes, from which it takes the max and propagates the result down. If we are at a leaf max node, the actual primitive action is taken and control is passed back to the top of the graph.

Q Node
Each Q node computes Q(s,a), the value of taking its associated action a in state s.

Hierarchical Value Function
\begin{equation}
Q_{i}(s,a) = V_{a}(s) + C_{i}(s,a)
\end{equation}
where V_{a}(s) is the value of executing subtask a starting in state s, and C_{i}(s,a) is the completion function: the expected discounted reward for finishing parent task i after subtask a completes. The value function of the root node is therefore the value obtained by expanding this decomposition recursively over all nodes in the graph.
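As a rough sketch of how this graph could be represented in code (tabular values only; the class MaxNode and its methods V, Q, and primitive_value are illustrative assumptions, not a published MaxQ interface):

```python
# Minimal illustrative sketch of the MaxQ graph structure described above.
# Names (MaxNode, V, Q, primitive_value) are assumptions for illustration.

class MaxNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # child subtasks; empty => primitive action (leaf)
        self.C = {}                      # completion function C_i(s, a), keyed by (state, child name)

    def V(self, s):
        """Value of executing this subtask starting in state s."""
        if not self.children:
            # leaf max node: take the primitive action and report its value
            return self.primitive_value(s)
        # interior max node: take the max over its Q nodes
        return max(self.Q(s, a) for a in self.children)

    def Q(self, s, a):
        """Q node: value of invoking child subtask a from state s under this parent."""
        return a.V(s) + self.C.get((s, a.name), 0.0)

    def primitive_value(self, s):
        # placeholder for the expected one-step reward of the primitive action in s
        return 0.0


# hypothetical two-level graph: the root chooses between two primitive subtasks
root = MaxNode("root", children=[MaxNode("north"), MaxNode("south")])
print(root.V(s=(0, 0)))  # recursively expands Q_i(s,a) = V_a(s) + C_i(s,a) down the graph
```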
Learning MaxQ
Maintain two tables, C_{i}(s,a) and \tilde{C}_{i}(s,a); the latter is a special completion function corresponding to a pseudo-reward \tilde{R} that discourages the agent from taking egregious ending actions (i.e. terminating a subtask in an undesirable state). At each step:
- choose a according to the exploration strategy
- execute a, observe s’, and compute R(s’|s,a)
- then perform the updates (see the sketch below)
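A hedged sketch of the standard MAXQ-Q updates (following Dietterich), assuming subtask a ran for N primitive steps before ending in state s’, with learning rate \alpha and discount factor \gamma:
\begin{equation}
a^{*} = \arg\max_{a'} \left[ \tilde{C}_{i}(s',a') + V_{a'}(s') \right]
\end{equation}
\begin{equation}
\tilde{C}_{i}(s,a) \leftarrow (1-\alpha)\,\tilde{C}_{i}(s,a) + \alpha \left[ \tilde{R}(s') + \gamma^{N}\left( \tilde{C}_{i}(s',a^{*}) + V_{a^{*}}(s') \right) \right]
\end{equation}
\begin{equation}
C_{i}(s,a) \leftarrow (1-\alpha)\, C_{i}(s,a) + \alpha\, \gamma^{N}\left[ C_{i}(s',a^{*}) + V_{a^{*}}(s') \right]
\end{equation}
Here \tilde{C}_{i} (shaped by the pseudo-reward \tilde{R}) is only used to pick the greedy child a^{*} inside subtask i, while C_{i} is the value actually reported upward to the parent.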