A talk by Tao Yu.

Notation

New Concepts
- OSWorld
- AgentNet
- Aguvis-72B
- Computer Agent Arena

Important Results / Claims
- Challenges of Agent Data Collection
- Challenges of Agent Benchmarking
- Human-agent interaction collection procedure

Questions

Interesting Factoids
- Funny: based on Computer Agent Arena results, Claude Computer Use scores lower than normal Claude because it appears that Claude Computer Use over-fitted to Ubuntu.
