ICLR2025 Snell: Optimality of Scaling LLM Test-Time Compute
Compute-Optimal Scaling Compute-Optimal Scaling is the notion of selecting the optimal configuration (beam width, search budget, etc.) dynamically / for binned question. Approaches to “Scaling Test-Time Compute” Three primary approaches: best-of-n: roll out a bunch, reject Beam Search: check against intermediate lookahead search: MCTSish (do lookahead rollouts) Key insight On easy qusetion, beam search shows over-optimization and best of n is good on medium/hard questions, beam search is better Lookahead seems bad? Method Learning a Value Function [Wang 2312.08935] Sequential vs. Parallel Sampling in Scaling Test Time Compute What’s the trade-off between sampling a bunch in parallel vs. thinking sequentially. Key insight on easy questions, sequential is best on harder questions, there’s a good ratio of trade offs Test Time vs. Pretraining compute Key insight on easy questions, test time scaling is good on medium/hard questions, pretraining scaling is better Questions why do you think lookahead search works worse/not better than others? is there some problem specific heuristics that trigger this?