Challenge of Making Agents

- Agents are not very new (Riedl and Amant 2002), but newer agents can be powered by LLMs/VLMs, meaning we use language for reasoning and communication.
- Sequentiality is hard:
  - What is the context/motivation?
  - How do you transfer across contexts?
  - How do you plan?
- Evaluation differs from previous NLP benchmarks:
  - We are not worried about language modeling.
  - There are no longer boundaries between the various fields.
- Common goals:
  - realistic agents: stop playing Atari games
  - reproducible systems
  - measurable goals
  - scalable models that are easy to use

Web as an Interactive Environment

- Agents on the web are both practical and scalable.
- WebShop (https://webshop-pnlp.github.io/): agents trained on it can transfer to Amazon with no extra work.
- Mind2Web
- InterCode (https://arxiv.org/abs/2306.14898): formulates agent decisions as a POMDP in order to fully benchmark Markovian decisions.

Agent Development

- Agent development has no core framework.
- Production systems: a set of rules, each specifying a precondition + action; when the preconditions are met, perform the action.
- Big kitchen sink proposal: https://arxiv.org/abs/2309.02427
- Trust and safety: agents are much more powerful and dynamic.

Challenges of Agent Data Collection

Agent data collection requires embodiment (the agent actually has to touch the world), which makes it hard:

- Infra is hard (initial environment setup is really hard).
- Observation-action interactions are complex and span diverse environments.
- We want to create or filter for goal-aligned data.

Some strategies:

- Humans do it.
- Synthetic data: NNetNav or AgentTrek (limitation: parallelization and search are hard).
- Internet-scale data: observing internet demonstrations (but it's hard to ground them to some goal).

Human-agent interaction collection procedure:

- Make users install the AgentNet tool and capture the screen.
- Make humans do things that are goal-aligned.
- Then we have unified agent data!
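
To make "unified agent data" concrete, here is a minimal sketch of what one shared record format and a goal-alignment filter might look like. Everything in it is an assumption for illustration: the `Step` and `Trajectory` classes, their field names, and the filter rule are hypothetical and are not the actual format used by AgentNet, NNetNav, or AgentTrek.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    observation: str  # e.g. a screenshot path or page snapshot (hypothetical)
    action: str       # e.g. "click(search_box)" (hypothetical action syntax)


@dataclass
class Trajectory:
    goal: str                 # natural-language task the demonstration serves
    source: str               # "human", "synthetic", or "internet"
    steps: List[Step] = field(default_factory=list)
    goal_reached: bool = False


def filter_goal_aligned(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories that state a goal, contain steps, and reached the goal."""
    return [t for t in trajectories if t.goal.strip() and t.steps and t.goal_reached]


demos = [
    Trajectory("buy a blue mug", "human", [Step("home page", "search('blue mug')")], True),
    # An internet demonstration with no known goal: hard to ground, so it is dropped.
    Trajectory("", "internet", [Step("video frame", "scroll()")], False),
]
kept = filter_goal_aligned(demos)
```

The point of the shared record is that human, synthetic, and internet-scale demonstrations can all flow through the same filter; the filter rule itself is deliberately naive.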

Challenges of Agent Benchmarking

- We can only write evaluations for very limited tasks, because writing them is time-consuming.
- We can't script evaluation metrics for open-answer tasks (the kind that come from real users).
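
Three ideas from these notes fit in one small sketch: the POMDP framing mentioned under InterCode, production rules (precondition + action) from the Agent Development notes, and a scripted success check of the kind discussed in this section. The environment, action names, and rules below are all hypothetical toys; this is not the WebShop or InterCode API, and the transitions are deterministic for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyShopEnv:
    """Toy environment with hidden state (hypothetical, for illustration only)."""
    goal_item: str
    page: str = "home"    # part of the hidden state
    purchased: str = ""   # hidden state the agent never observes directly

    def observe(self) -> str:
        # Partial observation: the agent sees only the page name.
        return self.page

    def step(self, action: str) -> str:
        if action == "search" and self.page == "home":
            self.page = "results"
        elif action == "click" and self.page == "results":
            self.page = "item"
        elif action == "buy" and self.page == "item":
            self.purchased = self.goal_item
            self.page = "done"
        return self.observe()

    def reward(self) -> float:
        # Scripted evaluation works here because success is a closed-form check.
        return 1.0 if self.purchased == self.goal_item else 0.0


# Production system: each rule pairs a precondition on the observation with an action.
Rule = Tuple[Callable[[str], bool], str]
RULES: List[Rule] = [
    (lambda obs: obs == "home", "search"),
    (lambda obs: obs == "results", "click"),
    (lambda obs: obs == "item", "buy"),
]


def act(obs: str) -> str:
    """Fire the first rule whose precondition is met; otherwise do nothing."""
    for precondition, action in RULES:
        if precondition(obs):
            return action
    return "noop"


def run_episode(env: ToyShopEnv, max_steps: int = 10) -> float:
    obs = env.observe()
    for _ in range(max_steps):
        if obs == "done":
            break
        obs = env.step(act(obs))
    return env.reward()
```

Calling `run_episode(ToyShopEnv(goal_item="mug"))` runs the rules to completion and scores the result. The `reward` method is easy to write only because the task has a single checkable end state; an open-answer request from a real user has no such check, which is the limitation noted above.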