Agents that use language to act on behalf of another person or group.

## Challenges

See Challenges of Language Model Agents.

## Methods

### ReAct

See ReAct.

### Aguvis

Take the AgentNet dataset, then tune a vision LM (on top of a Qwen base model) to roll out the rest of a sequence of actions given screenshots as input. We can also add Chain of Thought on top to get more explicit reasoning.

## Formulations

### OSWorld

A unified task setup and evaluation. Motivation: given that language is a universal task specification, can we create a universal digital environment with unified observation and action spaces?

- Spec config:
  - Initial state: how to set up, what to open, what files, etc.
  - Evaluator: an evaluation script that checks whether the task was completed.
- Observation: screen, screenshots, etc.
  - Screenshot vs. API tradeoffs: most websites/applications don't expose an API, and API output is very hard to verify quickly, whereas an actual mouse action is easy to verify and stop.
- Action: mouse and keyboard controls (move, click, type), PyAutoGUI style.
- Dataset: 369 computer-use tasks for evaluation.
- Evals: Claude, for one, is still really, really bad at computer use; it gets a ~20% success rate versus humans' ~70%.

## Interactive Agents

Big question: how do we align agents in an interactive, dynamic way (i.e., without instruction fine-tuning, which is hard)?

- Language is information that helps agents predict the future; instructions are world modeling: instead of instructions => actions (executor), think of instructions => updated belief (world model).
- User intent => action shouldn't have an LLM language representation in the middle as a bottleneck. There is an underlying representation of the user's preferences, and you have to use language to coax it out of them.

### Dynalang

- Build a model that takes vision + language as a joint input.
- Pass it through an auto-encoded representation.
- Have the world model predict the next encoded representation.
- Main idea: model language/tokens/images as a joint latent representation over time.
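The joint-latent idea can be sketched with toy numpy projections. All names and dimensions here are hypothetical stand-ins for learned networks; this is a shape-level sketch of "encode jointly, then predict the next latent," not Dynalang's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the real model's sizes differ.
IMG_DIM, TXT_DIM, LATENT_DIM = 64, 32, 16

# Random projections standing in for learned networks.
W_enc = rng.normal(size=(IMG_DIM + TXT_DIM, LATENT_DIM)) * 0.1  # joint encoder
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1         # world model (latent dynamics)

def encode(image_feat, text_feat):
    """Encode vision + language jointly into one latent representation."""
    joint = np.concatenate([image_feat, text_feat])
    return np.tanh(joint @ W_enc)

def predict_next(latent):
    """World model: predict the next latent representation from the current one."""
    return np.tanh(latent @ W_dyn)

# One step: encode the current frame + tokens, then predict the next
# timestep's representation, to be trained against the actual encoding.
z_t = encode(rng.normal(size=IMG_DIM), rng.normal(size=TXT_DIM))
z_t1_pred = predict_next(z_t)
print(z_t.shape, z_t1_pred.shape)  # (16,) (16,)
```

The point of the structure is that language and vision share one latent stream, so prediction happens in latent space rather than routing everything through text.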
Training objective:

- Reconstruction loss against the future representation: use R_{i} to predict R_{i+1}.
- Predict the reward over time.
- Regularize (?).

Workflow:

- Take reward/preference/behavior data.
- Use structure learning to create the relationships between elements in the data structure.

## Evaluations

### Computer Agent Arena

https://arena.xlang.ai, an open-source platform for digital AI agents where users can preference-rank different agents' performances.

Workflow:

- Select an OS environment (Windows, Ubuntu, ...) to create identical instances.
- Configure the computers' initial state using preset scripts, or click around for a custom setup. (Why custom setups? To create a diversity of scenarios that helps generalization.)
- Interaction scenarios are automatically generated given a user task prompt.
- Finally, humans score the runs: correct or not? Which one is better? Safe or not?

Goals:

- Eval: evaluate and rank agents.
- Training: data collection, RL, etc.
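The notes don't say how pairwise "which one is better?" votes become a leaderboard; arena-style platforms commonly use an Elo-style rating, so here is a minimal sketch under that assumption (hypothetical code, not the arena's actual implementation).

```python
# Minimal Elo-style rating update from pairwise "which agent is better?" votes.
# Hypothetical sketch: Computer Agent Arena's real ranking method may differ.

def elo_update(ratings, winner, loser, k=32.0):
    """Update two agents' ratings in place after one pairwise comparison."""
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)

ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
# Suppose users preferred agent_a in three comparisons and agent_b in one.
for winner, loser in [("agent_a", "agent_b")] * 3 + [("agent_b", "agent_a")]:
    elo_update(ratings, winner, loser)
print(sorted(ratings, key=ratings.get, reverse=True))  # ['agent_a', 'agent_b']
```

The same vote stream doubles as training data: each comparison is a preference pair usable for RL or reward modeling, which is the "training: data collection, RL" goal above.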