Autonomous driving is really hard. How do we integrate planning and learning in a closed-loop style? We start from the current belief and construct a tree of all reachable belief states. Recall DESPOT.

The approach has three components:
- Learning (b): a neural network maps the driving history to a policy and a value.
- Planning (a): use the neural network's policy and value to run MCTS.
- Execution (e): execute the chosen actions in a simulator.

The data obtained from the simulator is then used to train the neural network.

Learning: the learning component is a supervised policy, where a CNN takes the situation and a map as input.

Planning: the planner is AR-DESPOT. We select actions by:
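One standard form for this, assuming an AlphaGo-style PUCT rule (the exploration constant $c$, the value estimate $Q(b,a)$, and the visit counts $N$ are assumptions here, not given above):

$$ a^{*} = \arg\max_{a}\Big[\, Q(b,a) + c\,\pi_{\theta}(a \mid b)\,\frac{\sqrt{N(b)}}{1 + N(b,a)} \,\Big] $$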
where $\pi_{\theta}$ is our policy network. Every time we encounter a new node, we use the learned value function as its lower bound. This needed less depth in the DESPOT tree than running DESPOT naively.
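Below is a minimal, self-contained sketch of this closed loop, not the authors' implementation: a learned policy/value network guides a tree-search step, the chosen actions are executed in a simulator, and the collected data is used to retrain the network. All names here (`PolicyValueNet`, `ToySimulator`, `expand`, `puct_select`, the action set, the constant `C_PUCT`) are hypothetical stand-ins.

```python
import math
import random
from dataclasses import dataclass, field

ACTIONS = [0, 1, 2]   # e.g. {accelerate, maintain, decelerate} (assumed)
C_PUCT = 1.0          # exploration constant (assumed)

class PolicyValueNet:
    """Stand-in for the CNN: maps a belief/history to (policy prior, value)."""
    def predict(self, belief):
        prior = {a: 1.0 / len(ACTIONS) for a in ACTIONS}  # uniform prior placeholder
        value = 0.0                                        # heuristic value placeholder
        return prior, value

    def train(self, data):
        pass  # supervised update on (belief, action, return) tuples would go here

@dataclass
class Node:
    belief: tuple
    prior: dict
    lower_bound: float  # initialised from the value network (the key idea above)
    visits: dict = field(default_factory=lambda: {a: 0 for a in ACTIONS})
    values: dict = field(default_factory=lambda: {a: 0.0 for a in ACTIONS})

def expand(belief, net):
    """New node: use the learned value as its lower bound instead of a default
    rollout bound, so the search needs less depth."""
    prior, value = net.predict(belief)
    return Node(belief=belief, prior=prior, lower_bound=value)

def puct_select(node):
    """AlphaGo-style action selection using the policy network as a prior."""
    total = sum(node.visits.values()) + 1
    def score(a):
        q = node.values[a] / node.visits[a] if node.visits[a] else node.lower_bound
        u = C_PUCT * node.prior[a] * math.sqrt(total) / (1 + node.visits[a])
        return q + u
    return max(ACTIONS, key=score)

class ToySimulator:
    """Stand-in for the driving simulator."""
    def step(self, belief, action):
        next_belief = belief + (action,)
        reward = random.uniform(-1, 1)
        return next_belief, reward

def closed_loop(iterations=3, steps_per_iteration=10):
    net, sim, data = PolicyValueNet(), ToySimulator(), []
    belief = ()
    for _ in range(iterations):
        for _ in range(steps_per_iteration):
            node = expand(belief, net)                   # planning: network-guided node
            action = puct_select(node)                   # single-step stand-in for DESPOT/MCTS
            belief, reward = sim.step(belief, action)    # execution in the simulator
            data.append((node.belief, action, reward))   # data for learning
        net.train(data)                                  # learning: retrain on collected data
    return data

if __name__ == "__main__":
    print(len(closed_loop()), "training examples collected")
```

The sketch keeps the three components from the loop above explicit (learning, planning, execution) and shows where the learned value enters as a new node's lower bound; a real system would run a full DESPOT/MCTS search per step rather than a single node expansion.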