BetaZero

Background
Recall AlphaZero, whose MCTS loop has four steps:
1. Selection (UCB1, or DPW, etc.)
2. Expansion (generate possible belief nodes)
3. Simulation (if it's a brand new node: rollout, etc.)
4. Backpropagation (propagate your values back up the tree)

Key Idea
Remove the need for heuristics in MCTS, which removes inductive bias.
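As a rough illustration of the selection step recalled above, here is a minimal UCB1 sketch. The data layout (a dict from action to a visit count and mean value) and the exploration constant `c` are assumptions made for this example, not something these notes specify.

```python
import math

def ucb1_select(children, c=1.4):
    """Return the action maximizing Q + c * sqrt(ln(N_parent) / N_child).

    `children` maps an action to a (visit_count, mean_value) pair
    (a hypothetical layout chosen for this sketch).
    """
    total_visits = sum(n for n, _ in children.values())
    best_action, best_score = None, -math.inf
    for action, (n, q) in children.items():
        if n == 0:
            # Try untried actions first so every count is positive below.
            return action
        score = q + c * math.sqrt(math.log(total_visits) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

With `c=0` this reduces to picking the highest mean value; larger `c` favors rarely visited actions.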
Approach
We keep the ol’ neural network, which maps a belief to a policy and a value:

\begin{equation} f_{\theta}(b_{t}) = (p_{t}, v_{t}) \end{equation}
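To make the shape of this mapping concrete, here is a toy two-headed model in plain Python. It is only a sketch: the linear heads, the belief being a fixed-length list of floats, and the names `make_params` and `f_theta` are all assumptions for illustration, not the architecture BetaZero actually uses.

```python
import math
import random

def make_params(belief_dim, num_actions, seed=0):
    """Randomly initialize a linear policy head and a linear value head."""
    rng = random.Random(seed)
    return {
        "policy_w": [[rng.uniform(-0.1, 0.1) for _ in range(belief_dim)]
                     for _ in range(num_actions)],
        "value_w": [rng.uniform(-0.1, 0.1) for _ in range(belief_dim)],
    }

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def f_theta(params, belief):
    """Map a belief vector b_t to (p_t, v_t): action probabilities and a scalar value."""
    logits = [sum(w * b for w, b in zip(row, belief)) for row in params["policy_w"]]
    p = softmax(logits)
    v = sum(w * b for w, b in zip(params["value_w"], belief))
    return p, v
```

For example, `f_theta(make_params(3, 2), [0.2, 0.3, 0.5])` returns a length-2 probability list and a float.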

Policy Evaluation
Do n episodes of MCTS, then use cross entropy to improve the network f.
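The cross-entropy step can be sketched as follows. The notes do not say how the target distribution is built from the search, so normalizing root visit counts (as AlphaZero does) is an assumption here, and both function names are hypothetical.

```python
import math

def visit_count_policy(visit_counts):
    """Turn per-action visit counts from a search into a target distribution.

    Assumption: AlphaZero-style normalized visit counts.
    """
    total = sum(visit_counts)
    return [n / total for n in visit_counts]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy H(target, predicted) = -sum_a target[a] * log(predicted[a])."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
```

The loss is smallest when the network's predicted policy matches the search-derived target, so minimizing it pulls the network toward what the search found.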
Ground truth policy
The policy produced by the MCTS episodes, used as the target for the cross-entropy update.

Action Selection
Uses Double Progressive Widening.
Importantly, no need to use a heuristic (or worse yet, random rollouts) for action selection.
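For reference, the progressive widening test can be sketched as below: a node may gain a new child only while its child count is at most k * N^alpha, where N is the node's visit count. In double progressive widening the same test is applied at both the action level and the next-belief level. The parameter values and the helper `grow` are assumptions for illustration.

```python
def should_widen(num_children, num_visits, k=1.0, alpha=0.5):
    """Progressive widening criterion: allow a new child if |children| <= k * N^alpha."""
    return num_children <= k * (num_visits ** alpha)

def grow(num_simulations, k=1.0, alpha=0.5):
    """Count how many children a single node accumulates over repeated visits."""
    children = 0
    for visits in range(num_simulations):
        if should_widen(children, visits, k, alpha):
            children += 1  # sample and attach a new action or belief child
        # otherwise: reuse an existing child (e.g. chosen by UCB1)
    return children
```

With `alpha < 1` the number of children grows sublinearly in the visit count, which keeps the tree from branching without bound in continuous action or observation spaces.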
Difference vs. LetsDrive
LetsDrive uses DESPOT; BetaZero uses MCTS with belief states.
