BetaZero

Background
Recall AlphaZero and the four steps of MCTS:
Selection (UCB 1, or DPW, etc.)
Expansion (generate possible belief nodes)
Simulation (if it's a brand new node: rollout, etc.)
Backpropagation (backpropagate your values up)

Key Idea
Remove the need for heuristics in MCTS, which removes inductive bias.
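As a refresher on the four steps recalled above, here is a minimal sketch of a generic MCTS loop on a toy problem. Everything in it is an assumption made for illustration: the `Node` class, the `mcts` signature, and the toy environment (pick 0 or 1 three times, return is the fraction of 1s) are not from AlphaZero or BetaZero. The `evaluate_leaf` callback is where a heuristic or random rollout would normally sit.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total_value = 0.0

def ucb1(parent, child, c=1.4):
    # Unvisited children are tried first.
    if child.visits == 0:
        return math.inf
    mean = child.total_value / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_state, actions, step, is_terminal, evaluate_leaf, n_iters=1000):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend through expanded nodes using UCB1.
        while node.children and not is_terminal(node.state):
            node = max(node.children.values(), key=lambda ch: ucb1(node, ch))
        # 2. Expansion: add children for every action, then pick one of them.
        if not is_terminal(node.state):
            for a in actions(node.state):
                node.children[a] = Node(step(node.state, a), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: estimate the value of the new node.
        value = evaluate_leaf(node.state)
        # 4. Backpropagation: push the value up to the root.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    # Recommend the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

# Hypothetical toy problem: state is (depth, total), horizon of 3 steps.
HORIZON = 3

def actions(state):
    return [0, 1]

def step(state, action):
    depth, total = state
    return (depth + 1, total + action)

def is_terminal(state):
    return state[0] >= HORIZON

def random_rollout(state):
    # The "heuristic" leaf evaluator: play randomly until the horizon.
    while not is_terminal(state):
        state = step(state, random.choice(actions(state)))
    return state[1] / HORIZON

best_action = mcts((0, 0), actions, step, is_terminal, random_rollout)
```

Swapping `random_rollout` for a learned value estimate is the kind of change the key idea points at: the search no longer depends on a hand-picked rollout policy.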
Approach
We keep the ol’ neural network, which maps a belief b_t to a policy p_t and a value v_t:

\begin{equation} f_{\theta}(b_{t}) = (p_{t}, v_{t}) \end{equation}
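To make the interface of the network concrete, here is a minimal sketch of a function with the same input and output shape. It assumes the belief has already been summarized as a fixed-length feature vector, and it uses a single linear layer per head; `init_theta`, `f_theta`, and the parameter layout are hypothetical names for illustration, not the architecture used by BetaZero.

```python
import math
import random

def init_theta(n_features, n_actions, seed=0):
    # Small random weights for a policy head and a value head (illustrative only).
    rng = random.Random(seed)
    return {
        "W_p": [[rng.gauss(0, 0.1) for _ in range(n_features)]
                for _ in range(n_actions)],
        "w_v": [rng.gauss(0, 0.1) for _ in range(n_features)],
    }

def f_theta(belief_features, theta):
    # Policy head: linear logits followed by a numerically stable softmax.
    logits = [sum(w * x for w, x in zip(row, belief_features))
              for row in theta["W_p"]]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    # Value head: a single linear readout.
    v = sum(w * x for w, x in zip(theta["w_v"], belief_features))
    return p, v

theta = init_theta(n_features=3, n_actions=4)
p_t, v_t = f_theta([0.5, -1.0, 2.0], theta)
```

Here `p_t` is a probability vector over four hypothetical actions and `v_t` is a scalar value estimate for the belief.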

Policy Evaluation
Do n episodes of MCTS, then use a cross-entropy loss to improve f.
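The cross-entropy step can be sketched as follows. This assumes, as in AlphaZero-style training, that the target policy is the normalized root visit counts from the search; the notes do not specify the exact target, so treat that choice and the helper names `visit_count_policy` and `cross_entropy` as assumptions.

```python
import math

def visit_count_policy(visit_counts):
    # Turn root visit counts {action: N} into a target distribution.
    total = sum(visit_counts.values())
    return {a: n / total for a, n in visit_counts.items()}

def cross_entropy(target, predicted, eps=1e-12):
    # H(target, predicted) = -sum_a target(a) * log predicted(a)
    return -sum(p * math.log(predicted[a] + eps) for a, p in target.items())

# Hypothetical search result and network output for two actions.
target = visit_count_policy({"left": 3, "right": 1})
network_policy = {"left": 0.5, "right": 0.5}
loss = cross_entropy(target, network_policy)
```

Minimizing this loss over the parameters of f pulls the network's policy output toward what the search preferred.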
Ground truth policy: the policy produced by MCTS.

Action Selection
Uses Double Progressive Widening.
Importantly, there is no need for a heuristic (or, worse yet, random rollouts) for action selection.
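A minimal sketch of the widening test behind Double Progressive Widening, using the common criterion "add a new child while the number of children is at most k * N^alpha". The function names, the callbacks `sample_action` and `sample_belief`, and the default values of k and alpha are hypothetical; in a BetaZero-like setup `sample_action` could plausibly draw from the network's policy output instead of a hand-written heuristic.

```python
import random

def should_widen(num_children, num_visits, k, alpha):
    # Progressive widening test: allow a new child while |children| <= k * N^alpha.
    return num_children <= k * num_visits ** alpha

def dpw_action(node_actions, node_visits, sample_action, k_a=1.0, alpha_a=0.5):
    # First widening: maybe add a newly sampled action to this belief node.
    if should_widen(len(node_actions), node_visits, k_a, alpha_a):
        a = sample_action()
        if a not in node_actions:
            node_actions.append(a)
    return node_actions

def dpw_next_belief(child_beliefs, action_visits, sample_belief,
                    k_b=1.0, alpha_b=0.5):
    # Second widening: maybe sample a new successor belief, else reuse an old one.
    if should_widen(len(child_beliefs), action_visits, k_b, alpha_b):
        b = sample_belief()
        child_beliefs.append(b)
        return b
    return random.choice(child_beliefs)
```

The "double" refers to applying the same test twice: once to limit how many actions a node considers, and once to limit how many successor beliefs each action branches into.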
Difference vs. LetsDrive
LetsDrive uses DESPOT; BetaZero uses MCTS with belief states.