We propose Ponder & Press, a divide-and-conquer GUI agent framework that relies solely on visual input to mimic human-like interaction with GUIs. We evaluate the Ponder & Press locator on the GUI grounding benchmark ScreenSpot, where it outperforms the previous state-of-the-art model by +22.5%. We further conduct extensive evaluations of our framework on four widely used GUI agent benchmarks, demonstrating its effectiveness in offline, online, desktop, webpage, and mobile settings.
Most existing GUI agents typically depend on non-vision inputs such as HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle to accurately localize GUI elements, a critical requirement for effective GUI automation, due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an `interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a `locator' that precisely locates GUI elements for action placement. By relying on purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments (including web pages, desktop software, and mobile UIs) demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents.
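The divide-and-conquer design above can be sketched as a two-stage pipeline: a general-purpose MLLM "ponders" (translates the instruction into a concrete action description) and a GUI-specific MLLM "presses" (grounds that description to screen coordinates). The sketch below is a minimal illustration of this control flow, not the paper's actual API; the function names and the stubbed model outputs are our own assumptions.

```python
def ponder(instruction: str, screenshot) -> str:
    # Stand-in for the general-purpose MLLM 'interpreter': translate a
    # high-level user instruction into a detailed next-action description.
    # A real implementation would query an MLLM with the screenshot here.
    return f"click the element described as: {instruction}"

def press(action_description: str, screenshot) -> tuple[int, int]:
    # Stand-in for the GUI-specific MLLM 'locator': return the pixel
    # coordinates of the target GUI element on the screenshot.
    return (640, 360)  # dummy coordinates for this sketch

def step(instruction: str, screenshot) -> dict:
    # One agent step: ponder (interpret) then press (locate and act).
    action = ponder(instruction, screenshot)
    x, y = press(action, screenshot)
    return {"action": action, "click_at": (x, y)}
```

Keeping interpretation and grounding in separate models lets each stage use the model best suited to it, which is the core of the framework's divide-and-conquer strategy.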
@article{ponderpress,
title={Ponder \& Press: Advancing Visual GUI Agent towards General Computer Control},
author={Wang, Yiqin and Zhang, Haoji and Tian, Jingqi and Tang, Yansong},
journal={arXiv preprint arXiv:2412.01268},
year={2024}
}